# Training a linear regressor

### Background

The dataset created for part 1 was constructed so that each borough can be identified, which means a linear regressor could be built for each individual borough.  As the code uses the same dataset structure throughout, what works for one borough would work for all.  However, to simplify this notebook, only one borough will be used.  That borough is **Manhattan**.

To begin, the data must be loaded.  Once loaded, the `query` method of the pandas dataframe is used to extract the rows for Manhattan.

Note that the linear regressor uses the data without one hot encoding.  The DNN will use the one hot encoded data instead.

In [1]:
# import pandas
import pandas as pd

# load the data as per normal
# df = pd.read_csv('https://raw.githubusercontent.com/adamsjoe/Data_Analytics_On_The_Web/main/Final_Data/Final_Data_Collated.csv', index_col=0)
df = pd.read_csv('https://raw.githubusercontent.com/adamsjoe/Data_Analytics_On_The_Web/main/Final_Data/Final_Data_Collated.csv')

# now setup to query the dataframe for only the borough noted
borough_df = df.query('BOROUGH=="MANHATTAN"')

# ignore this for now
# remove 2012
remove_2012 = borough_df.query('YEAR!=2012')

# remove 2020
remove_2020 = remove_2012.query('YEAR!=2020')

#remove 2021
remove_2021 = remove_2020.query('YEAR!=2021')

# make a new dataframe, which by now shall only contain data for Manhattan in the years 2013 to 2019
manhat_df = remove_2021

year_2013 = manhat_df.query("YEAR==2013")
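
The chained `query` calls above can also be collapsed into a single filter. The following is a minimal sketch on a small made-up frame; `filter_borough_years` and `sample_df` are hypothetical names used for illustration only and are not part of the original notebook.

```python
import pandas as pd

# Small synthetic frame standing in for the collated CSV (values are made up).
sample_df = pd.DataFrame({
    "BOROUGH": ["MANHATTAN", "MANHATTAN", "BROOKLYN", "MANHATTAN", "MANHATTAN"],
    "YEAR": [2012, 2013, 2015, 2019, 2021],
    "NUM_COLS": [109, 93, 80, 110, 60],
})


def filter_borough_years(df, borough, first_year, last_year):
    """Keep rows for one borough whose YEAR lies in [first_year, last_year]."""
    # '@name' refers to a local Python variable inside a query string.
    return df.query(
        "BOROUGH == @borough and YEAR >= @first_year and YEAR <= @last_year"
    )


manhattan_sample = filter_borough_years(sample_df, "MANHATTAN", 2013, 2019)
print(manhattan_sample)
```

Keeping the year range as two parameters makes it easier to reuse the same filter for another borough or period.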


#### Data Verification

Once the data has been loaded, it is a good idea to check it.  Printing the first 6 rows shows that all of them are for Manhattan.

In [2]:
# check the data had loaded by printing it
print(manhat_df[:6])


           DATE    BOROUGH  WEEKDAY  ...  PERS_KILL  PERS_INJD  NUM_COLS
924  04-02-2013  MANHATTAN        1  ...          0         14        93
929  19-01-2013  MANHATTAN        6  ...          0         19       108
931  24-01-2013  MANHATTAN        4  ...          0         31       125
936  24-05-2013  MANHATTAN        5  ...          0         41       153
941  24-04-2013  MANHATTAN        3  ...          1         21       117
946  07-03-2013  MANHATTAN        4  ...          0         18       110

[6 rows x 28 columns]


## Import numpy and shuffle

Create a new shuffled dataset by indexing the original Manhattan dataframe with a random permutation (numpy's `random.permutation`) of its row positions.  Shuffling is important for this dataset because the data is time series (or time series derived) and there could be patterns within it: an unshuffled 80/20 split could place particular periods entirely in the training set or entirely in the testing set.

In [3]:
import numpy as np
shuffle_manhatten = manhat_df.iloc[np.random.permutation(len(manhat_df))]
print(shuffle_manhatten[:5])

# setup constant for use later
SCALE_NUM_COLS = 1.0

             DATE    BOROUGH  WEEKDAY  ...  PERS_KILL  PERS_INJD  NUM_COLS
13528  18-12-2019  MANHATTAN        3  ...          0         23       110
7285   17-11-2016  MANHATTAN        4  ...          0         27       167
12635  03-11-2019  MANHATTAN        7  ...          0         16        85
9703   2017-10-23  MANHATTAN        1  ...          0         34       148
11259  30-10-2018  MANHATTAN        2  ...          0         26       115

[5 rows x 28 columns]
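
Because the permutation is random, every run of the notebook produces a different shuffle (and so a different split and RMSE). A minimal sketch of a repeatable shuffle is shown below, assuming a fixed seed is acceptable; `shuffle_rows`, `frame` and the seed value are hypothetical and do not appear in the original notebook.

```python
import numpy as np
import pandas as pd

# Made-up frame standing in for the Manhattan data.
frame = pd.DataFrame({"DAY": range(1, 11), "NUM_COLS": range(100, 110)})


def shuffle_rows(df, seed):
    """Return the rows of df in a random order that is repeatable for a given seed."""
    rng = np.random.default_rng(seed)
    return df.iloc[rng.permutation(len(df))]


first = shuffle_rows(frame, seed=42)
second = shuffle_rows(frame, seed=42)
print(first.index.tolist())
print(first.equals(second))  # same seed, same order
```

The same rows are kept; only their order changes, which is what the later 80/20 slicing relies on.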


## Predictors, Training set and Testing set

First the predictors are created.  These are taken from the shuffled data, using only the columns at positions 3, 4 and 5 (counting from zero): YEAR, MONTH and DAY.  If the data had not been filtered by borough at the start, the borough would also need to be included.

The Target is then defined - this is the last column in the shuffled dataset.

The training set will be 80% of the full dataset (0.8) and the testing set will be the remainder, in this case 100 - 80 = 20%.

Constants are set up for the number of predictors (3 in this case: year, month and day) and the number of outputs (or targets, which is 1 here).

In [4]:
# select the year, month and day columns (positions 3, 4 and 5).
predictors_manhatten = shuffle_manhatten.iloc[:,[3,4,5]]
print(predictors_manhatten[:6])

# We want the last column (the NUM_COLS)
targets_manhattan = shuffle_manhatten.iloc[:,-1]
print(targets_manhattan[:6])

# split data into training set
training_size_manhattan = int(len(shuffle_manhatten['NUM_COLS']) * 0.8)

# test size is the size of the data - the training size (in this case 20%)
testing_size_manhattan = len(shuffle_manhatten['NUM_COLS']) - training_size_manhattan

# define the number of input params, day, month and year = 3 (predictors)
NO_PREDICTORS_MANHATTAN = 3

# define the number of output params, collisions = 1 (targets)
NO_TARGETS = 1


       YEAR  MONTH  DAY
13528  2019     12   18
7285   2016     11   17
12635  2019     11    3
9703   2017     10   23
11259  2018     10   30
6591   2016     11   20
13528    110
7285     167
12635     85
9703     148
11259    115
6591     118
Name: NUM_COLS, dtype: int64
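
The 80/20 split described above can be sketched as a small helper. This is an illustration on made-up data, not the notebook's own code: `train_test_split_frame` and `toy` are hypothetical names, and the predictors are selected by column name here rather than by position.

```python
import pandas as pd


def train_test_split_frame(df, train_fraction=0.8):
    """Split an already shuffled frame into leading training rows and trailing test rows."""
    n_train = int(len(df) * train_fraction)
    return df.iloc[:n_train], df.iloc[n_train:]


# Made-up rows, for illustration only.
toy = pd.DataFrame({
    "YEAR": [2013] * 10,
    "MONTH": [1] * 10,
    "DAY": list(range(1, 11)),
    "NUM_COLS": list(range(90, 100)),
})

train_part, test_part = train_test_split_frame(toy)

# Selecting by column name avoids depending on column positions.
train_x, train_y = train_part[["YEAR", "MONTH", "DAY"]], train_part["NUM_COLS"]
test_x, test_y = test_part[["YEAR", "MONTH", "DAY"]], test_part["NUM_COLS"]
print(len(train_x), len(test_x))
```

The number of predictors can then be read from `train_x.shape[1]` instead of being hard coded.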


### Verification

Just a simple print to ensure the predictor and values are populated. 

In [5]:
print(predictors_manhatten.values)
print(predictors_manhatten)

[[2019   12   18]
 [2016   11   17]
 [2019   11    3]
 ...
 [2014    3   28]
 [2013    4    6]
 [2016    5   17]]
       YEAR  MONTH  DAY
13528  2019     12   18
7285   2016     11   17
12635  2019     11    3
9703   2017     10   23
11259  2018     10   30
...     ...    ...  ...
4048   2014      7    2
9423   2017      8   28
4508   2014      3   28
2547   2013      4    6
7905   2016      5   17

[2556 rows x 3 columns]


## The Tensorflow bit..

In [6]:
%tensorflow_version 1.x
import tensorflow as tf
import os

# check tensor version
print(tf.__version__)

import shutil

# logging for tensorflow
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.INFO)

# setup some variables to hold the file path
manhattan_dir = '/tmp/linear_regression_trained_model'

# remove the last training model - if it is present
if os.path.isdir(manhattan_dir):
  print("\n// Removing old model directory ....")
  shutil.rmtree(manhattan_dir, ignore_errors=True)
else:
  print("\n// No model directory to remove ....")

# estimators for each borough
estimator_manhattan = tf.contrib.learn.SKCompat(tf.contrib.learn.LinearRegressor(
    model_dir=manhattan_dir, 
    optimizer=tf.train.AdamOptimizer(learning_rate=0.1), 
    enable_centered_bias=False, 
    feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(predictors_manhatten.values)
    )
)

# Print a log message to show the model is starting to train
print("// Starting to train Manhattan model............\n")

# Train the model. Pass in predictor values and target values.
estimator_manhattan.fit(
    predictors_manhatten[:training_size_manhattan].values, 
    targets_manhattan[:training_size_manhattan].values.reshape(training_size_manhattan,NO_TARGETS)/SCALE_NUM_COLS, steps=10000
)

# Next, we can check our predictions based on our predictors.
preds = estimator_manhattan.predict(x=predictors_manhatten[training_size_manhattan:].values)

# Apply the Scale value (not really needed here) to the outputs.
predslistscale = preds['scores'] * SCALE_NUM_COLS

# pred = format(str(predslistscale)) # useful for checking outputs and printing.

# Calculate RMSE i.e. how good the model works using the predictions and targets.
# rmse = np.sqrt(np.mean((shuffle_manhatten['NUM_COLS'][training_size_manhattan:] - predslistscale)**2))
rmse = np.sqrt(np.mean((targets_manhattan[training_size_manhattan:].values - predslistscale) ** 2))
print('\n\n// Linear Regression has RMSE of {0}'.format(rmse))

# Calculate the mean of the NUM_COLS values in the training set.
avg = np.mean(shuffle_manhatten['NUM_COLS'][:training_size_manhattan])

# Calculate the baseline RMSE on the test set, using the training mean as the prediction for every row.
rmse = np.sqrt(np.mean((shuffle_manhatten['NUM_COLS'][training_size_manhattan:] - avg) ** 2))
print('\n\n// Just using average = {0} has RMSE of {1}'.format(avg, rmse))

TensorFlow 1.x selected.
1.15.2

// No model directory to remove ....
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

Instructions for updating:
Please specify feature columns explicitly.
Instructions for updating:
Please use tensorflow/transform or tf.data.
Instructions for updating:
Please access pandas data directly.
Instructions for updating:
Please use tensorflow/transform or tf.data.
Instructions for updating:
Please convert numpy dtypes explicitly.
Instructions for updating:
Please specify feature columns explicitly.
Instructions for updating:
Please switch to tf.contrib.estimator.*_head.
Instructions for updating:
Please replace uses of any Estimator from tf.contrib.l

## Initial Output

The RMSE is quite large, which would indicate an inaccurate model.  However, this model is only trying to predict from the date (year, month and day), without using any extra conditions.  Temperature could be added, and this may affect the output.
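
Whether an RMSE is "large" is easiest to judge against the baseline of always predicting the training mean. The sketch below shows that comparison on made-up numbers; the helper `rmse` and all values are hypothetical. It also flattens both arrays first, on the assumption that a predictor may return scores shaped `(n, 1)`, which would otherwise broadcast against an `(n,)` target array.

```python
import numpy as np


def rmse(actual, predicted):
    """Root mean squared error between two equally sized sequences."""
    # Flatten both inputs so an (n, 1) array is not broadcast against an (n,) array.
    actual = np.asarray(actual, dtype=float).ravel()
    predicted = np.asarray(predicted, dtype=float).ravel()
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))


# Made-up collision counts, for illustration only.
train_targets = np.array([100.0, 120.0, 110.0, 130.0])
test_targets = np.array([90.0, 140.0])
model_predictions = np.array([[95.0], [130.0]])  # shaped (n, 1) on purpose

# Baseline: predict the training mean for every test row.
baseline_prediction = train_targets.mean()
baseline_rmse = rmse(test_targets, np.full(len(test_targets), baseline_prediction))
model_rmse = rmse(test_targets, model_predictions)
print(model_rmse, baseline_rmse)
```

A model is only adding value if its RMSE is clearly below the baseline RMSE.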

The model needs to be tested first.  To do this, some values will be fed in as input.

## Initial Prediction

```python
input = pd.DataFrame.from_dict(data = 
				{'YEAR' : [2013,2016,2022],
                 'MONTH' : [1, 6, 12],
                 'DAY' : [1, 1, 1]
                 })
```

Three dates have been chosen as the values for prediction: 1/1/2013, 1/6/2016 and 1/12/2022 (day/month/year) will be fed into the model.

In [7]:
input = pd.DataFrame.from_dict(data = 
				{'YEAR' : [2013,2016,2022],
         'MONTH' : [1, 6, 12],
         'DAY' : [1, 1, 1]
        })

estimator_manhattan = tf.contrib.learn.SKCompat(tf.contrib.learn.LinearRegressor(model_dir=manhattan_dir, 
                                                                                 enable_centered_bias=False, 
                                                                                 feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(input.values)
                                                                                 )
)

preds = estimator_manhattan.predict(x=input.values)

predslistnorm = preds['scores']
predslistscale = preds['scores'] * 1.0
prednorm = format(str(predslistnorm))
pred = format(str(predslistscale))

print(prednorm)
print(pred)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f888b070b90>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/linear_regression_trained_model', '_session_creation_timeout_secs': 7200}
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/linear_regression_trained_model/model.ck

### Result of First Run

The output (at the time of writing) was 

[117.42061  123.957085 131.939   ]

The model is being asked to predict 3 dates.  The number of collisions is known for two of them, so these can be used as "control" dates and should be fairly accurate (or at least, that is the theory...).

The dates are:

Date|Prediction|Actual
:--:|:--------:|:----:
1/1/2013|102|78
1/6/2016|110|121
1/12/2022|118|N/A

The trend is, generally, going upwards, which does **sort of** look ok.  As noted above, it would be a good idea to try adding an additional predictor.  The TEMP value should suffice.

**Note** these values may change if the notebook is run at a later stage.

## Adding in Temp

The code cells are copied from above; this time the predictors are amended to include temperature.

**Note:** For brevity, the additional prints are not included and all the code is placed in one cell.

In [8]:
# select the year, month, day and temp columns (positions 3, 4, 5 and 7).
predictors_manhatten = shuffle_manhatten.iloc[:,[3,4,5,7]]
print(predictors_manhatten[:6])

# We want the last column (the NUM_COLS)
targets_manhattan = shuffle_manhatten.iloc[:,-1]
print(targets_manhattan[:6])

# split data into training set
training_size_manhattan = int(len(shuffle_manhatten['NUM_COLS']) * 0.8)

# test size is the size of the data - the training size (in this case 20%)
testing_size_manhattan = len(shuffle_manhatten['NUM_COLS']) - training_size_manhattan

# define the number of input params, day, month, year and now temp = 4 (predictors)
NO_PREDICTORS_MANHATTAN = 4

# define the number of output params, collisions = 1 (targets)
NO_TARGETS = 1

%tensorflow_version 1.x
import tensorflow as tf

# check tensor version
print(tf.__version__)

import shutil

# logging for tensorflow
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.INFO)

# setup some variables to hold the file path
manhattan_dir = '/tmp/linear_regression_trained_model_including_temp'

# remove the last training model
shutil.rmtree(manhattan_dir, ignore_errors=True)

# estimators for each borough
estimator_manhattan = tf.contrib.learn.SKCompat(tf.contrib.learn.LinearRegressor(
    model_dir=manhattan_dir, 
    optimizer=tf.train.AdamOptimizer(learning_rate=0.1), 
    enable_centered_bias=False, 
    feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(predictors_manhatten.values)
    )
)

# Print a log message to show the model is starting to train
print("// Starting to train Manhattan model............\n")

# Train the model. Pass in predictor values and target values.
estimator_manhattan.fit(
    predictors_manhatten[:training_size_manhattan].values, 
    targets_manhattan[:training_size_manhattan].values.reshape(training_size_manhattan,NO_TARGETS)/SCALE_NUM_COLS, steps=10000
)

# Next, we can check our predictions based on our predictors.
preds = estimator_manhattan.predict(x=predictors_manhatten[training_size_manhattan:].values)

# Apply the Scale value (not really needed here) to the outputs.
predslistscale = preds['scores'] * SCALE_NUM_COLS

# Calculate RMSE i.e. how good the model works using the predictions and targets.
# rmse = np.sqrt(np.mean((shuffle_manhatten['NUM_COLS'][training_size_manhattan:] - predslistscale)**2))
rmse = np.sqrt(np.mean((targets_manhattan[training_size_manhattan:].values - predslistscale) ** 2))
print('\n\n// Linear Regression has RMSE of {0}'.format(rmse))

# Calculate the mean of the NUM_COLS values in the training set.
avg = np.mean(shuffle_manhatten['NUM_COLS'][:training_size_manhattan])

# Calculate the baseline RMSE on the test set, using the training mean as the prediction for every row.
rmse = np.sqrt(np.mean((shuffle_manhatten['NUM_COLS'][training_size_manhattan:] - avg) ** 2))
print('\n\n// Just using average = {0} has RMSE of {1}'.format(avg, rmse))

       YEAR  MONTH  DAY  TEMP
13528  2019     12   18  39.3
7285   2016     11   17  48.6
12635  2019     11    3  51.0
9703   2017     10   23  59.5
11259  2018     10   30  47.9
6591   2016     11   20  47.0
13528    110
7285     167
12635     85
9703     148
11259    115
6591     118
Name: NUM_COLS, dtype: int64
1.15.2
INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f888d24d090>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': N

In [9]:
input = pd.DataFrame.from_dict(data = 
				{'YEAR' : [2013,2016,2022],
         'MONTH' : [1, 6, 12],
         'DAY' : [1, 1, 1],
         'TEMP': [35,60,30]
        })


estimator_manhattan = tf.contrib.learn.SKCompat(tf.contrib.learn.LinearRegressor(model_dir=manhattan_dir, 
                                                                                 enable_centered_bias=False, 
                                                                                 feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(input.values)
                                                                                 )
)

preds = estimator_manhattan.predict(x=input.values)
# print(preds)

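# For reference, the average number of collisions per day in Manhattan is 106.072969 (359057 (collisions) / 3385 (days)).
# It is not used as a scale here: the model was trained with targets divided by SCALE_NUM_COLS (1.0), so the same factor is applied to the predictions.
predslistnorm = preds['scores']
predslistscale = preds['scores'] * SCALE_NUM_COLS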
prednorm = format(str(predslistnorm))
pred = format(str(predslistscale))

print(prednorm)
print(pred)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f888ae5a6d0>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/linear_regression_trained_model_including_temp', '_session_creation_timeout_secs': 7200}
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/linear_regression_trained

### After Temp

After adding in the temp, the values have changed:


Date|Prediction (no temp)|Prediction (with temp)|Actual
:--:|:------------------:|:--------------------:|:----:
1/1/2013|102|102|78
1/6/2016|110|115|121
1/12/2022|118|110|N/A

So while the values are pulling closer (on this current run), they are not **exact**.

Looking at the data from part 1 of the assignment, year, month and day do not affect the outputs, at least not directly.  Temperature and Weekday do.  So, the model could be changed to use these as predictors. 
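
One quick way to check which columns move with the number of collisions is to compute their correlation with `NUM_COLS`. The sketch below uses made-up values, so it shows only the mechanics and says nothing about the real data; `toy` and `correlations` are hypothetical names. Pearson correlation is also a crude measure for a categorical value such as the weekday.

```python
import pandas as pd

# Made-up values, for illustration only.
toy = pd.DataFrame({
    "DAY": [3, 17, 9, 25, 12, 30],
    "TEMP": [30.0, 40.0, 50.0, 60.0, 70.0, 80.0],
    "NUM_COLS": [90, 100, 110, 120, 130, 140],
})

# Correlation of each candidate predictor with the target column.
correlations = toy.drop(columns="NUM_COLS").corrwith(toy["NUM_COLS"])
print(correlations.sort_values(ascending=False))
```

Columns with a correlation close to zero are weak candidates for a linear model, although they may still matter in combination with others.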


## Using Temperature and Weekday 

Taking the code from above, it will be modified to use these two parameters as predictors.

In [10]:
# select the weekday and temp columns (positions 2 and 7).
predictors_manhatten = shuffle_manhatten.iloc[:,[2,7]]
print("// Printing predictors ...")
print(predictors_manhatten[:6])

# We want the last column (the NUM_COLS)
targets_manhattan = shuffle_manhatten.iloc[:,-1]
print("// Printing targets .....")
print(targets_manhattan[:6])

# split data into training set
training_size_manhattan = int(len(shuffle_manhatten['NUM_COLS']) * 0.8)

# test size is the size of the data - the training size (in this case 20%)
testing_size_manhattan = len(shuffle_manhatten['NUM_COLS']) - training_size_manhattan

# define the number of input params, weekday and temp = 2 (predictors)
NO_PREDICTORS_MANHATTAN = 2

# define the number of output params, collisions = 1 (targets)
NO_TARGETS = 1

%tensorflow_version 1.x
import tensorflow as tf

# check tensor version
print(tf.__version__)

import shutil

# logging for tensorflow
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.INFO)

# setup some variables to hold the file path
manhattan_dir = '/tmp/linear_regression_trained_model_including_temp_and_weekday'

# remove the last training model
shutil.rmtree(manhattan_dir, ignore_errors=True)

# estimators for each borough
estimator_manhattan = tf.contrib.learn.SKCompat(tf.contrib.learn.LinearRegressor(
    model_dir=manhattan_dir, 
    optimizer=tf.train.AdamOptimizer(learning_rate=0.1), 
    enable_centered_bias=False, 
    feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(predictors_manhatten.values)
    )
)

# Print a log message to show the model is starting to train
print("// Starting to train Manhattan model............\n")

# Train the model. Pass in predictor values and target values.
estimator_manhattan.fit(
    predictors_manhatten[:training_size_manhattan].values, 
    targets_manhattan[:training_size_manhattan].values.reshape(training_size_manhattan,NO_TARGETS)/SCALE_NUM_COLS, steps=10000
)

# Next, we can check our predictions based on our predictors.
preds = estimator_manhattan.predict(x=predictors_manhatten[training_size_manhattan:].values)

# Apply the Scale value (not really needed here) to the outputs.
predslistscale = preds['scores'] * SCALE_NUM_COLS

# pred = format(str(predslistscale)) # useful for checking outputs and printing.

# Calculate RMSE i.e. how good the model works using the predictions and targets.
# rmse = np.sqrt(np.mean((shuffle_manhatten['NUM_COLS'][training_size_manhattan:] - predslistscale)**2))
rmse = np.sqrt(np.mean((targets_manhattan[training_size_manhattan:].values - predslistscale) ** 2))
print('\n\n// Linear Regression has RMSE of {0}'.format(rmse))

# Calculate the mean of the NUM_COLS values in the training set.
avg = np.mean(shuffle_manhatten['NUM_COLS'][:training_size_manhattan])

# Calculate the baseline RMSE on the test set, using the training mean as the prediction for every row.
rmse = np.sqrt(np.mean((shuffle_manhatten['NUM_COLS'][training_size_manhattan:] - avg) ** 2))
print('\n\n// Just using average = {0} has RMSE of {1}'.format(avg, rmse))


// Printing predictors ...
       WEEKDAY  TEMP
13528        3  39.3
7285         4  48.6
12635        7  51.0
9703         1  59.5
11259        2  47.9
6591         7  47.0
// Printing targets .....
13528    110
7285     167
12635     85
9703     148
11259    115
6591     118
Name: NUM_COLS, dtype: int64
1.15.2
INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f888b083bd0>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_sav

This still has quite a large RMSE.  However, a prediction can be run as before, using a day for which the number of collisions is known (and supplying that day's weekday and temp values).

For 24-10-2018 in the Manhattan borough, the *WEEKDAY* was **3** and the *TEMP* was **48.4**.
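
The weekday number for a date can be derived rather than looked up. The sketch below assumes the dataset's WEEKDAY column follows the ISO convention (Monday = 1 to Sunday = 7), which is consistent with the sample rows printed earlier but is not confirmed by the source; `weekday_number` is a hypothetical helper.

```python
from datetime import datetime


def weekday_number(date_text, date_format="%d-%m-%Y"):
    """Return the ISO weekday (Monday=1 ... Sunday=7) for a date string."""
    return datetime.strptime(date_text, date_format).isoweekday()


print(weekday_number("24-10-2018"))  # a Wednesday
```

For 24-10-2018 this gives 3, matching the WEEKDAY value quoted above.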

In [11]:
input = pd.DataFrame.from_dict(data = 
				{'WEEKDAY' : [3],
         'TEMP' : [48.4]
        })


estimator_manhattan = tf.contrib.learn.SKCompat(tf.contrib.learn.LinearRegressor(model_dir=manhattan_dir, 
                                                                                 enable_centered_bias=False, 
                                                                                 feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(input.values)
                                                                                 )
)

preds = estimator_manhattan.predict(x=input.values)
# print(preds)

# No scaling is needed here: the model was trained with SCALE_NUM_COLS = 1.0, so the scores are already collision counts.
predslistnorm = preds['scores']
predslistscale = preds['scores'] * 1
prednorm = format(str(predslistnorm))
pred = format(str(predslistscale))

print(prednorm)
print(pred)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f888af1c510>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/linear_regression_trained_model_including_temp_and_weekday', '_session_creation_timeout_secs': 7200}
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/linear_regres

This returns [120.75173]; the actual number of collisions was **120**.

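This value is pretty good for a linear model, although with the large RMSE noted above, one close prediction is not enough to judge the model's accuracy.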
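
For intuition, a linear regressor predicts a weighted sum of its inputs plus a bias. The sketch below fits such a model with ordinary least squares in NumPy instead of the TensorFlow estimator used in this notebook; it is a swapped-in technique for illustration only, and the data, weights and the `predict` helper are all made up.

```python
import numpy as np

# Made-up predictors: columns are WEEKDAY and TEMP.
features = np.array([
    [1, 30.0], [2, 45.0], [3, 50.0], [4, 62.0], [5, 41.0], [6, 70.0],
])
true_weights = np.array([2.0, 0.5])
true_bias = 80.0
targets = features @ true_weights + true_bias  # noise-free, so the fit is exact

# Append a column of ones so the bias is learned as a third coefficient.
design = np.column_stack([features, np.ones(len(features))])
solution, *_ = np.linalg.lstsq(design, targets, rcond=None)
weights, bias = solution[:2], solution[2]


def predict(weekday, temp):
    """Prediction of the fitted linear model for one (weekday, temp) pair."""
    return float(np.dot(weights, [weekday, temp]) + bias)


print(weights, bias, predict(3, 48.4))
```

With real, noisy data the recovered weights would only approximate the underlying relationship, which is why the RMSE is reported alongside individual predictions.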

The next step is to train and use a DNN.

# Training a DNN


## Background

For the DNN section, the one hot encoded version of the data will be used.
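
As a reminder of what one hot encoding does, the sketch below turns a column of day names into one 0/1 column per day with `pandas.get_dummies`. The frame and the `DAY_NAME` column are made up for illustration; how the `dnn_data.csv` file was actually produced is not shown in this notebook.

```python
import pandas as pd

# Made-up column of day names.
days = pd.DataFrame({"DAY_NAME": ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]})

# One 0/1 column per distinct value; dtype=int gives 0/1 rather than booleans.
encoded = pd.get_dummies(days["DAY_NAME"], dtype=int)
print(encoded)
```

Each row contains exactly one 1, in the column for its day, so no artificial ordering between days is implied.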


In [12]:
# Import pandas to use dataframes
import pandas as pd

# create data frame from csv file we hosted on our github
# df = pd.read_csv('https://raw.githubusercontent.com/adamsjoe/Data_Analytics_On_The_Web/main/Final_Data/Final_Data_Collated.csv')
df = pd.read_csv('https://raw.githubusercontent.com/adamsjoe/Data_Analytics_On_The_Web/main/Final_Data/dnn_data.csv')

# now set up a query on the dataframe for only the borough noted.
# Again worth noting that by using query, the values can be extracted as needed without having to extract and store multiple CSV files.
manhat_df = df.query('BOROUGH=="MANHATTAN"')

SCALE_NUM_COLS = 1.0

### Checking data loaded


In [35]:
# Ensure data has loaded
print(manhat_df[:6])

    Unnamed: 0        DATE    BOROUGH  ...  PERS_KILL  PERS_INJD  NUM_COLS
0            1  21-08-2012  MANHATTAN  ...          0         28       109
6            7  27-10-2012  MANHATTAN  ...          0         20       122
13          14  25-08-2012  MANHATTAN  ...          0         22        97
16          17  09-09-2012  MANHATTAN  ...          0         29       109
20          21  17-09-2012  MANHATTAN  ...          1         27       123
28          29  23-11-2012  MANHATTAN  ...          0          5        66

[6 rows x 46 columns]


## Importing Numpy and setting predictors

For the predictors, all the columns which have numeric values are used.  The borough has been excluded (as only Manhattan is being selected), as has the collision date; the date information is instead carried by the one hot encoded columns.

In [36]:
import numpy as np
shuffle_manhattan = manhat_df.iloc[np.random.permutation(len(manhat_df))]

# setup predictors
# Removing the date fields as the DNN should be able to extract this from the one hot encoded columns.  Also excluding the borough field as only dealing with Manhattan
predictors_manhattan = shuffle_manhattan.iloc[:,[3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,25,26,27,28,29,30,31,32,33,34,35,36]]

# Check the predictor selection has worked
print(predictors_manhattan[:6])

       Fri  Mon  Sat  Sun  Thu  Tue  ...   GUST   MAX   MIN  PRCP   SNDP  FOG
10152    0    0    0    1    0    0  ...   34.0  45.0  32.0  0.01  999.9    0
12860    0    1    0    0    0    0  ...  999.9  66.9  48.0  0.00  999.9    0
11382    0    1    0    0    0    0  ...   22.0  75.9  66.0  0.00  999.9    1
10515    0    0    1    0    0    0  ...  999.9  77.0  57.0  0.00  999.9    0
12722    0    0    0    0    0    1  ...   18.1  30.0  19.9  0.00  999.9    0
9502     0    0    0    0    0    1  ...  999.9  75.0  60.1  0.00  999.9    0

[6 rows x 33 columns]
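
Listing 33 column positions by hand is easy to get wrong. An alternative sketch is to drop the unwanted columns by name and take the predictor count from the result. The frame below is made up, and the `EXCLUDED` list is hypothetical: the full set of columns left out of the real predictor selection is not visible in the printed output.

```python
import pandas as pd

# Made-up frame with a few of the column names seen in the output above.
toy = pd.DataFrame({
    "Unnamed: 0": [1, 2],
    "DATE": ["26-03-2018", "01-10-2019"],
    "BOROUGH": ["MANHATTAN", "MANHATTAN"],
    "Mon": [1, 0],
    "Tue": [0, 1],
    "TEMP": [45.0, 66.9],
    "NUM_COLS": [117, 85],
})

# Hypothetical list of non-predictor columns (identifiers and the target).
EXCLUDED = ["Unnamed: 0", "DATE", "BOROUGH", "NUM_COLS"]

predictors = toy.drop(columns=EXCLUDED)
no_predictors = predictors.shape[1]  # derived, not hard coded
print(list(predictors.columns), no_predictors)
```

Deriving the count this way keeps it in step with the selection if columns are added or removed later.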


In [37]:
# print first 5 rows of the shuffle
print(shuffle_manhattan[:5])

       Unnamed: 0        DATE    BOROUGH  ...  PERS_KILL  PERS_INJD  NUM_COLS
10152       10153  26-03-2018  MANHATTAN  ...          0         30       117
12860       12861  01-10-2019  MANHATTAN  ...          0         23        85
11382       11383  17-07-2018  MANHATTAN  ...          0         17       145
10515       10516  26-08-2018  MANHATTAN  ...          0         16        77
12722       12723  20-02-2019  MANHATTAN  ...          0         24       112

[5 rows x 46 columns]


### Defining the target

For this test model, only the number of collisions (NUM_COLS) is of interest.  This needs to be defined.

In [38]:
# Define the target (the NUM_COLS)
target_manhatan = shuffle_manhattan.iloc[:,-1]

# print the targets
print(target_manhatan[:6])

10152    117
12860     85
11382    145
10515     77
12722    112
9502     163
Name: NUM_COLS, dtype: int64


### Training and Testing dataset

As with the linear regressor, a training and testing dataset of 80% and 20% of the results is defined.

Also worth noting that there are a total of 33 predictors here.

In [39]:
# Split our data into a training set i.e. 80% of the length of the shuffle array
training_size_manhattan = int(len(shuffle_manhattan['NUM_COLS'] ) *0.8)

# The test set size is 100% - 80% = 20% of the length of the shuffle array.
testing_size_manhattan = len(shuffle_manhattan['NUM_COLS']) - training_size_manhattan

# Define the number of input values (predictors): 33, which does not include the borough or date columns
no_predictors = 33

# Define the number of output values (targets)
no_outputs = 1

## The tensorflow bit...

Now the tensorflow model can be set up.  This is mostly the same as the linear regressor (except the DNN uses the DNNRegressor, of course).

In [40]:
# import tensorflow
%tensorflow_version 1.x
import tensorflow as tf
import os

# check the version
print(tf.__version__)

# needed for high-level file management
import shutil  

# logging for tensorflow
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.INFO)

# directory in which the trained model is saved
model_dir = '/tmp/DNN_NY_ALL_BOROUGHS_regression_trained_model'

# remove the last training model - if it is present
if os.path.isdir(model_dir):
  print("\n// Removing old model directory ....")
  shutil.rmtree(model_dir, ignore_errors=True)
else:
  print("\n// No model directory to remove ....")

estimator = tf.contrib.learn.SKCompat(tf.contrib.learn.DNNRegressor(
    model_dir=model_dir, 
    hidden_units=[20,18,14], 
    optimizer=tf.train.AdamOptimizer(learning_rate=0.01), 
    enable_centered_bias=False, 
    feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(predictors_manhattan.values)
    )
)

# Prints a log to show model is starting to train
print("starting to train");

# Train the model. Pass in predictor values and target values.
estimator.fit(predictors_manhattan[:training_size_manhattan].values, 
              target_manhatan[:training_size_manhattan].values.reshape(training_size_manhattan, no_outputs) / SCALE_NUM_COLS, steps=10000)

# Next, we can check our predictions based on our predictors.
preds = estimator.predict(x=predictors_manhattan[training_size_manhattan:].values)

# Apply the Scale value (not really needed here) to the outputs.
predslistscale = preds['scores'] * SCALE_NUM_COLS


# Calculate RMSE i.e. how good the model works using the predictions and targets.
rmse = np.sqrt(np.mean((target_manhatan[training_size_manhattan:].values - predslistscale) ** 2))
print('DNNRegression has RMSE of {0}'.format(rmse));


# Calculate the mean of the number of collisions (NUM_COLS) over the training rows.
avg = np.mean(shuffle_manhattan['NUM_COLS'][:training_size_manhattan])

# Calculate the RMSE on the test rows when that mean is used as the prediction.
# The fit of a proposed regression model should therefore be better than the fit of the mean model.
# In this run that does not seem to be the case, but it will vary on every run.
rmse = np.sqrt(np.mean((shuffle_manhattan['NUM_COLS'][training_size_manhattan:] - avg) ** 2))
print('Just using average = {0} has RMSE of {1}'.format(avg, rmse));

1.15.2

// Removing old model directory ....
INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f8885b0e0d0>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/DNN_NY_ALL_BOROUGHS_regression_trained_model', '_session_creation_timeout_secs': 7200}
starting to train
INFO:tensorflow:Create CheckpointSaverHook.
INFO
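
The comparison made at the end of the cell above (model RMSE against the RMSE of always predicting the training mean) can be sketched in isolation.  This is a minimal illustration only: the `rmse` helper and the toy numbers are assumptions and are not taken from the collisions data.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# Toy numbers for illustration only (not the collisions data).
train_targets = np.array([90.0, 110.0, 100.0, 120.0])
test_targets = np.array([95.0, 105.0, 115.0])
model_preds = np.array([97.0, 101.0, 118.0])

# Baseline: always predict the mean of the training targets.
baseline_preds = np.full_like(test_targets, train_targets.mean())

model_rmse = rmse(test_targets, model_preds)
baseline_rmse = rmse(test_targets, baseline_preds)
print('model RMSE = {0:.3f}, baseline RMSE = {1:.3f}'.format(model_rmse, baseline_rmse))
```

A model is only worth keeping if its RMSE is lower than the baseline RMSE; squaring the errors means large misses are weighted more heavily than small ones.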

### Testing the model

As a test for the model, a date is selected from the data and its values plugged in.  For this test, a date in May 2016 in Manhattan has been selected; the cell below encodes it as a Wednesday with `DAY` 4 and `YEAR` 2016.

In [41]:
# As a test, use something which is known - in this case the values for MANHATTAN on a Wednesday in May 2016 (DAY 4)
input = pd.DataFrame.from_dict(data = 
				{
        'Fri': [0],
        'Mon': [0],
        'Sat': [0],
        'Sun': [0],
        'Thu': [0],
        'Tue': [0],	
        'Wed': [1],	
        'YEAR': [2016],
        'Apr': [0],	
        'Aug': [0],	
        'Dec': [0],	
        'Feb': [0],	
        'Jan': [0],	
        'Jul': [0],	
        'Jun': [0],	
        'Mar': [0],	
        'May': [1],	
        'Nov': [0],	
        'Oct': [0],	
        'Sep': [0],	
        'DAY': [4],
        'TEMP': [45.5],
        'DEWP': [43.9],	
        'SLP': [1003.7],	
        'VISIB': [5.9],
        'WDSP': [21.5],	
        'MXPSD': [26],
        'GUST': [35],	
        'MAX': [50],
        'MIN': [45],	
        'PRCP': [0.52],
        'SNDP': [999.9],
        'FOG': [0]
        })

estimator = tf.contrib.learn.SKCompat(tf.contrib.learn.DNNRegressor(model_dir=model_dir, hidden_units=[20,18,14], enable_centered_bias=False, feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(input.values)))

preds = estimator.predict(x=input.values)

predslistnorm = preds['scores']
prednorm = format(str(predslistnorm))
print(prednorm)
print(predslistnorm)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f88e3079550>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/DNN_NY_ALL_BOROUGHS_regression_trained_model', '_session_creation_timeout_secs': 7200}
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/DNN_NY_ALL_BOROUGHS_regress

This returns **[128.98424]** (the value may change on subsequent runs), which is a _little_ higher than the recorded value of **128**.

That is quite a good result from a first run!

(Note: on a second run the result came out as **[127.09423]**.  The original result of about 128 is kept above to show the model accuracy.)
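
Part of the run-to-run variation is likely to come from the unseeded shuffle (`np.random.permutation`) and from TensorFlow's own random initialisation (the log shows `'_tf_random_seed': None`).  The following is a minimal sketch of a repeatable shuffle, assuming a NumPy version that provides `np.random.default_rng`; the `SEED` value is arbitrary and hypothetical.

```python
import numpy as np

SEED = 42  # hypothetical value; any fixed integer would do

# A seeded generator returns the same permutation on every run.
rng = np.random.default_rng(SEED)
order = rng.permutation(10)

# Re-creating the generator with the same seed reproduces the same order.
order_again = np.random.default_rng(SEED).permutation(10)

print(order)
print(np.array_equal(order, order_again))
```

In the notebook the same idea would apply to the row order, for example by indexing `manhat_df.iloc[...]` with a seeded permutation of `len(manhat_df)`.  This fixes only the shuffle, not the network's weight initialisation.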


### Testing an unknown date.

When the data was constructed, the year 2021 was removed (due to COVID); however, the weather data for 2021 is still available.  Selecting a more recent date (in this case 21-11-2021) and using the weather data from that day gives a new prediction.

date      |year|mo|da|temp|dewp|slp   |visib|wdsp|mxpsd|gust |max |min|prcp|sndp |fog
----------|----|--|--|----|----|------|-----|----|-----|-----|----|---|----|-----|---
2021-11-21|2021|11|21|52.3|42.2|1028.0|10.0 |11.3|14   |999.9|57.9|39.0|0.0|999.9|1

This will result in an ```input``` which looks like this:

```python
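# NOTE: the weekday, month and DAY values below do not match 21-11-2021,
# which was a Sunday in November ('Sun': [1], 'Nov': [1], 'DAY': [21]).
input = pd.DataFrame.from_dict(data = 
        {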
         'Fri': [0],
         'Mon': [0],
         'Sat': [1],
         'Sun': [0],
         'Thu': [0],
         'Tue': [0],	
         'Wed': [0],	
         'YEAR': [2021],
         'Apr': [0],	
         'Aug': [0],	
         'Dec': [0],	
         'Feb': [0],	
         'Jan': [0],	
         'Jul': [0],	
         'Jun': [0],	
         'Mar': [0],	
         'May': [1],	
         'Nov': [0],	
         'Oct': [0],	
         'Sep': [0],	
         'DAY': [4],       
         'TEMP' : [52.3],
         'DEWP' : [42.2],
         'SLP' : [1028.0],
         'VISIB' : [10.0],
         'WDSP' : [11.3],
         'MXPSD' : [14],
         'GUST' : [999.9],
         'MAX' : [57.9],
         'MIN' : [39],
         'PRCP' : [0.0],
         'SNDP' : [999.9],
         'FOG' : [1]
        })
```
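
The weekday, month and day fields can be derived from the date rather than typed by hand.  The following is a minimal sketch, assuming the column names used in the block above and that `DAY` holds the day of the month (as the `da` column in the table suggests); `encode_date` is a hypothetical helper and is not part of the notebook.

```python
from datetime import date

# Column names as used in the input block above.
WEEKDAYS = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']  # date.weekday(): Monday == 0
MONTHS = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

def encode_date(d):
    """Hypothetical helper: one-hot weekday and month flags plus YEAR and DAY."""
    fields = {name: [0] for name in WEEKDAYS + MONTHS}
    fields[WEEKDAYS[d.weekday()]] = [1]
    fields[MONTHS[d.month - 1]] = [1]
    fields['YEAR'] = [d.year]
    fields['DAY'] = [d.day]
    return fields

encoded = encode_date(date(2021, 11, 21))

# Show which flags are set, then the year and day of the month.
print([name for name in WEEKDAYS + MONTHS if encoded[name] == [1]])
print(encoded['YEAR'], encoded['DAY'])
```

For 21-11-2021 this sets `Sun` and `Nov` with `DAY` 21.  The weather fields would still need to be added, and the columns put in the same order as the training data, before passing the result to `pd.DataFrame.from_dict`.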

A new code block which incorporates all of the above code items (for brevity) is below:

In [42]:
import numpy as np
shuffle_manhattan = manhat_df.iloc[np.random.permutation(len(manhat_df))]

# setup predictors
# Removing the date fields as the DNN should be able to extract this.  Also excluding the borough field as only dealing with Manhattan
predictors_manhattan = shuffle_manhattan.iloc[:,[3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,25,26,27,28,29,30,31,32,33,34,35,36]]

# Check the predictor column selection has worked
print(predictors_manhattan[:6])

# print first 5 rows of the shuffle
print(shuffle_manhattan[:5])

# Define the target (the NUM_COLS)
target_manhatan = shuffle_manhattan.iloc[:,-1]

# print the targets
print(target_manhatan[:6])

# Split our data into a training set i.e. 80% of the length of the shuffle array
training_size_manhattan = int(len(shuffle_manhattan['NUM_COLS'] ) *0.8)

# The test set size is 100% - 80% = 20% of the length of the shuffle array.
testing_size_manhattan = len(shuffle_manhattan['NUM_COLS']) - training_size_manhattan

# Define the number of input values (predictors) - 33, as the borough and date columns are not included
no_predictors = 33

# Define the number of output values (targets)
no_outputs = 1

# import tensorflow
%tensorflow_version 1.x
import tensorflow as tf
import os

# check the version
print(tf.__version__)

# needed for high-level file management
import shutil  

# logging for tensorflow
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.INFO)

# removes a saved model from the last training attempt.
model_dir = '/tmp/DNN_NY_ALL_BOROUGHS_regression_trained_model'

# remove the last training model - if it is present
if os.path.isdir(model_dir):
  print("\n// Removing old model directory ....")
  shutil.rmtree(model_dir, ignore_errors=True)
else:
  print("\n// No model directory to remove ....")

estimator = tf.contrib.learn.SKCompat(tf.contrib.learn.DNNRegressor(
    model_dir=model_dir, 
    hidden_units=[20,18,14], 
    optimizer=tf.train.AdamOptimizer(learning_rate=0.01), 
    enable_centered_bias=False, 
    feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(predictors_manhattan.values)
    )
)

# Prints a log to show model is starting to train
print("starting to train");

# Train the model. Pass in predictor values and target values.
estimator.fit(predictors_manhattan[:training_size_manhattan].values, 
              target_manhatan[:training_size_manhattan].values.reshape(training_size_manhattan, no_outputs) / SCALE_NUM_COLS, steps=10000)

# Next, we can check our predictions based on our predictors.
preds = estimator.predict(x=predictors_manhattan[training_size_manhattan:].values)

# Apply the Scale value (not really needed here) to the outputs.
predslistscale = preds['scores'] * SCALE_NUM_COLS

# pred = format(str(predslistscale)) # useful for checking outputs and printing.

# Calculate RMSE i.e. how good the model works using the predictions and targets.
# i.e. take the difference between the actual and the forecast then square the difference, 
# find the average of all the squares and then find the square root. 
# The RMSE essentially punishes larger errors i.e. it puts a heavier weight on larger errors.
rmse = np.sqrt(np.mean((target_manhatan[training_size_manhattan:].values - predslistscale) ** 2))
print('DNNRegression has RMSE of {0}'.format(rmse));


# Calculate the mean of the number of collisions (NUM_COLS) over the training rows.
avg = np.mean(shuffle_manhattan['NUM_COLS'][:training_size_manhattan])

# Calculate the RMSE on the test rows when that mean is used as the prediction.
# The fit of a proposed regression model should therefore be better than the fit of the mean model.
# In this run that does not seem to be the case, but it will vary on every run.
rmse = np.sqrt(np.mean((shuffle_manhattan['NUM_COLS'][training_size_manhattan:] - avg) ** 2))
print('Just using average = {0} has RMSE of {1}'.format(avg, rmse));

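# Predict for the more recent date (21-11-2021) using the weather values from the table above.
# NOTE: the weekday, month and DAY values below do not match that date,
# which was a Sunday in November ('Sun': [1], 'Nov': [1], 'DAY': [21]).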
input = pd.DataFrame.from_dict(data = 
        {
         'Fri': [0],
         'Mon': [0],
         'Sat': [1],
         'Sun': [0],
         'Thu': [0],
         'Tue': [0],	
         'Wed': [0],	
         'YEAR': [2021],
         'Apr': [0],	
         'Aug': [0],	
         'Dec': [0],	
         'Feb': [0],	
         'Jan': [0],	
         'Jul': [0],	
         'Jun': [0],	
         'Mar': [0],	
         'May': [1],	
         'Nov': [0],	
         'Oct': [0],	
         'Sep': [0],	
         'DAY': [4],       
         'TEMP' : [52.3],
         'DEWP' : [42.2],
         'SLP' : [1028.0],
         'VISIB' : [10.0],
         'WDSP' : [11.3],
         'MXPSD' : [14],
         'GUST' : [999.9],
         'MAX' : [57.9],
         'MIN' : [39],
         'PRCP' : [0.0],
         'SNDP' : [999.9],
         'FOG' : [1]
        })

estimator = tf.contrib.learn.SKCompat(tf.contrib.learn.DNNRegressor(model_dir=model_dir, hidden_units=[20,18,14], enable_centered_bias=False, feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(input.values)))

preds = estimator.predict(x=input.values)

predslistnorm = preds['scores']
prednorm = format(str(predslistnorm))
print(prednorm)
print(predslistnorm)

       Fri  Mon  Sat  Sun  Thu  Tue  ...   GUST   MAX   MIN  PRCP   SNDP  FOG
13817    0    0    1    0    0    0  ...   38.1  48.9  36.0  0.38  999.9    1
3535     0    0    0    1    0    0  ...  999.9  46.9  28.9  0.01  999.9    0
15968    0    0    0    0    0    0  ...  999.9  37.0  23.0  0.00  999.9    1
9990     0    0    0    0    0    1  ...   31.1  51.1  37.0  0.00  999.9    0
14285    0    0    0    0    0    0  ...  999.9  71.1  53.1  0.00  999.9    1
10239    0    1    0    0    0    0  ...   39.0  61.0  37.0  0.13  999.9    0

[6 rows x 33 columns]
       Unnamed: 0        DATE    BOROUGH  ...  PERS_KILL  PERS_INJD  NUM_COLS
13817       13818  05-01-2020  MANHATTAN  ...          0         16        57
3535         3536  13-01-2014  MANHATTAN  ...          0         17        97
15968       15969  18-02-2021  MANHATTAN  ...          0         10        22
9990         9991  2017-12-20  MANHATTAN  ...          0         24       119
14285       14286  18-06-2020  MANHATTAN 

### Prediction

This returns [63.714706].  On its own there is no way to tell whether this is correct, particularly with COVID still a large part of daily life.  However, the NY collisions data has since been updated and shows that 27 accidents were recorded in the Manhattan area on the date selected.  This value was found using the following query:

```sql
SELECT *  FROM `bigquery-public-data.new_york_mv_collisions.nypd_mv_collisions`
where timestamp between '2021-11-20'  and '2021-11-22'
ORDER BY timestamp DESC
```

The result of this was saved (file name "pred_check_results.csv") and the location column was removed (as it held a comma-separated value, it caused issues), as was the header.  A new Python script was created (check_results.py) which used the getBoroghFromLatLong function used when creating the data.  The result was saved to a new file, which then had the header inserted and the data filtered.

While the recorded figure of 27 is well below the predicted value of roughly 64, COVID is still very much a factor in people's daily lives and the prediction could be affected by this.
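
The size of the gap between the prediction and the recorded count can be made explicit with a little arithmetic.  This is a minimal sketch using only the two figures quoted above; the variable names are illustrative.

```python
predicted = 63.714706  # model output reported above
recorded = 27          # collisions found by the query above

# Difference in collisions, and how many times larger the prediction is.
absolute_error = predicted - recorded
ratio = predicted / recorded

print('absolute error = {0:.1f} collisions'.format(absolute_error))
print('prediction is {0:.2f}x the recorded count'.format(ratio))
```

A single date is too small a sample to judge the model; the RMSE over the whole test set remains the better measure.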

## Additional

While the current implementation using TensorFlow 1.x does not support multi-dimensional target arrays, at least out of the box, getting predictions for both collisions and injuries is still possible (even if outside the scope of this assignment): it is simply a case of setting a different target for each run.  So there would be one model for collisions and a separate model for each of the injury and fatality counts.

As an example:


### Cyclists killed Model

In [44]:
import numpy as np
shuffle_manhattan = manhat_df.iloc[np.random.permutation(len(manhat_df))]

# setup predictors
# Removing the date fields as the DNN should be able to extract this.  Also excluding the borough field as only dealing with Manhattan
predictors_manhattan = shuffle_manhattan.iloc[:,[3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,25,26,27,28,29,30,31,32,33,34,35,36]]

# Check the predictor column selection has worked
print(predictors_manhattan[:6])

# print first 5 rows of the shuffle
print(shuffle_manhattan[:5])

# Define the target (CYC_KILL, taken here as column 37)
target_manhatan = shuffle_manhattan.iloc[:,37]

# print the targets
print(target_manhatan[:6])

# Split our data into a training set i.e. 80% of the length of the shuffle array
training_size_manhattan = int(len(shuffle_manhattan['CYC_KILL'] ) *0.8)

# The test set size is 100% - 80% = 20% of the length of the shuffle array.
testing_size_manhattan = len(shuffle_manhattan['CYC_KILL']) - training_size_manhattan

# Define the number of input values (predictors) - 33, as the borough and date columns are not included
no_predictors = 33

# Define the number of output values (targets)
no_outputs = 1

# import tensorflow
%tensorflow_version 1.x
import tensorflow as tf
import os

# check the version
print(tf.__version__)

# needed for high-level file management
import shutil  

# logging for tensorflow
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.INFO)

# removes a saved model from the last training attempt.
model_dir = '/tmp/DNN_NY_ALL_BOROUGHS_regression_trained_model_Cyclist_killed'

# remove the last training model - if it is present
if os.path.isdir(model_dir):
  print("\n// Removing old model directory ....")
  shutil.rmtree(model_dir, ignore_errors=True)
else:
  print("\n// No model directory to remove ....")

estimator = tf.contrib.learn.SKCompat(tf.contrib.learn.DNNRegressor(
    model_dir=model_dir, 
    hidden_units=[20,18,14], 
    optimizer=tf.train.AdamOptimizer(learning_rate=0.01), 
    enable_centered_bias=False, 
    feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(predictors_manhattan.values)
    )
)

# Prints a log to show model is starting to train
print("starting to train");

# Train the model. Pass in predictor values and target values.
estimator.fit(predictors_manhattan[:training_size_manhattan].values, 
              target_manhatan[:training_size_manhattan].values.reshape(training_size_manhattan, no_outputs) / SCALE_NUM_COLS, steps=10000)

# Next, we can check our predictions based on our predictors.
preds = estimator.predict(x=predictors_manhattan[training_size_manhattan:].values)

# Apply the Scale value (not really needed here) to the outputs.
predslistscale = preds['scores'] * SCALE_NUM_COLS

# pred = format(str(predslistscale)) # useful for checking outputs and printing.

# Calculate RMSE i.e. how good the model works using the predictions and targets.
# i.e. take the difference between the actual and the forecast then square the difference, 
# find the average of all the squares and then find the square root. 
# The RMSE essentially punishes larger errors i.e. it puts a heavier weight on larger errors.
rmse = np.sqrt(np.mean((target_manhatan[training_size_manhattan:].values - predslistscale) ** 2))
print('DNNRegression has RMSE of {0}'.format(rmse));


# Calculate the mean of the cyclists killed (CYC_KILL) values over the training rows.
avg = np.mean(shuffle_manhattan['CYC_KILL'][:training_size_manhattan])

# Calculate the RMSE on the test rows when that mean is used as the prediction.
# The fit of a proposed regression model should therefore be better than the fit of the mean model.
# In this run that does not seem to be the case, but it will vary on every run.
rmse = np.sqrt(np.mean((shuffle_manhattan['CYC_KILL'][training_size_manhattan:] - avg) ** 2))
print('Just using average = {0} has RMSE of {1}'.format(avg, rmse));

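# Predict for 21-11-2021 using the same input values as for the collisions model.
# NOTE: the weekday, month and DAY values below do not match that date,
# which was a Sunday in November ('Sun': [1], 'Nov': [1], 'DAY': [21]).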
input = pd.DataFrame.from_dict(data = 
        {
         'Fri': [0],
         'Mon': [0],
         'Sat': [1],
         'Sun': [0],
         'Thu': [0],
         'Tue': [0],	
         'Wed': [0],	
         'YEAR': [2021],
         'Apr': [0],	
         'Aug': [0],	
         'Dec': [0],	
         'Feb': [0],	
         'Jan': [0],	
         'Jul': [0],	
         'Jun': [0],	
         'Mar': [0],	
         'May': [1],	
         'Nov': [0],	
         'Oct': [0],	
         'Sep': [0],	
         'DAY': [4],       
         'TEMP' : [52.3],
         'DEWP' : [42.2],
         'SLP' : [1028.0],
         'VISIB' : [10.0],
         'WDSP' : [11.3],
         'MXPSD' : [14],
         'GUST' : [999.9],
         'MAX' : [57.9],
         'MIN' : [39],
         'PRCP' : [0.0],
         'SNDP' : [999.9],
         'FOG' : [1]
        })

estimator = tf.contrib.learn.SKCompat(tf.contrib.learn.DNNRegressor(model_dir=model_dir, hidden_units=[20,18,14], enable_centered_bias=False, feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(input.values)))

preds = estimator.predict(x=input.values)

predslistnorm = preds['scores']
prednorm = format(str(predslistnorm))
print(prednorm)
print(predslistnorm)

       Fri  Mon  Sat  Sun  Thu  Tue  ...   GUST   MAX   MIN  PRCP   SNDP  FOG
4041     0    0    0    0    0    1  ...   24.1  55.9  45.0  0.55  999.9    1
16886    1    0    0    0    0    0  ...   28.0  44.1  30.9  0.00  999.9    0
7337     0    0    0    1    0    0  ...  999.9  66.9  54.0  0.00  999.9    0
3910     0    0    0    0    0    0  ...   20.0  64.9  53.6  0.00  999.9    0
9190     0    0    0    0    0    0  ...  999.9  80.1  64.0  0.00  999.9    1
5265     0    0    1    0    0    0  ...  999.9  62.1  45.0  0.00  999.9    0

[6 rows x 33 columns]
       Unnamed: 0        DATE    BOROUGH  ...  PERS_KILL  PERS_INJD  NUM_COLS
4041         4042  24-12-2014  MANHATTAN  ...          0         18        87
16886       16887  06-02-2021  MANHATTAN  ...          1          9        37
7337         7338  20-06-2016  MANHATTAN  ...          0         27       138
3910         3911  15-05-2014  MANHATTAN  ...          0         32       148
9190         9191  2017-07-13  MANHATTAN 

This returns **[0.07594194]** (essentially zero); therefore the model predicts that no cyclists would be killed on 21/11/2021.

### Motorists Injured

Rather than creating a model for every remaining column, the next cell looks only at the number of motorists injured.

In [45]:
import numpy as np
shuffle_manhattan = manhat_df.iloc[np.random.permutation(len(manhat_df))]

# setup predictors
# Removing the date fields as the DNN should be able to extract this.  Also excluding the borough field as only dealing with Manhattan
predictors_manhattan = shuffle_manhattan.iloc[:,[3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,25,26,27,28,29,30,31,32,33,34,35,36]]

# Check the predictor column selection has worked
print(predictors_manhattan[:6])

# print first 5 rows of the shuffle
print(shuffle_manhattan[:5])

# Define the target (MOTO_INJD, taken here as column 40)
target_manhatan = shuffle_manhattan.iloc[:,40]

# print the targets
print(target_manhatan[:6])

# Split our data into a training set i.e. 80% of the length of the shuffle array
training_size_manhattan = int(len(shuffle_manhattan['MOTO_INJD'] ) *0.8)

# The test set size is 100% - 80% = 20% of the length of the shuffle array.
testing_size_manhattan = len(shuffle_manhattan['MOTO_INJD']) - training_size_manhattan

# Define the number of input values (predictors) - 33, as the borough and date columns are not included
no_predictors = 33

# Define the number of output values (targets)
no_outputs = 1

# import tensorflow
%tensorflow_version 1.x
import tensorflow as tf
import os

# check the version
print(tf.__version__)

# needed for high-level file management
import shutil  

# logging for tensorflow
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.INFO)

# removes a saved model from the last training attempt.
model_dir = '/tmp/DNN_NY_ALL_BOROUGHS_regression_trained_model_Motorists_injured'

# remove the last training model - if it is present
if os.path.isdir(model_dir):
  print("\n// Removing old model directory ....")
  shutil.rmtree(model_dir, ignore_errors=True)
else:
  print("\n// No model directory to remove ....")

estimator = tf.contrib.learn.SKCompat(tf.contrib.learn.DNNRegressor(
    model_dir=model_dir, 
    hidden_units=[20,18,14], 
    optimizer=tf.train.AdamOptimizer(learning_rate=0.01), 
    enable_centered_bias=False, 
    feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(predictors_manhattan.values)
    )
)

# Prints a log to show model is starting to train
print("starting to train");

# Train the model. Pass in predictor values and target values.
estimator.fit(predictors_manhattan[:training_size_manhattan].values, 
              target_manhatan[:training_size_manhattan].values.reshape(training_size_manhattan, no_outputs) / SCALE_NUM_COLS, steps=10000)

# Next, we can check our predictions based on our predictors.
preds = estimator.predict(x=predictors_manhattan[training_size_manhattan:].values)

# Apply the Scale value (not really needed here) to the outputs.
predslistscale = preds['scores'] * SCALE_NUM_COLS

# pred = format(str(predslistscale)) # useful for checking outputs and printing.

# Calculate RMSE i.e. how good the model works using the predictions and targets.
# i.e. take the difference between the actual and the forecast then square the difference, 
# find the average of all the squares and then find the square root. 
# The RMSE essentially punishes larger errors i.e. it puts a heavier weight on larger errors.
rmse = np.sqrt(np.mean((target_manhatan[training_size_manhattan:].values - predslistscale) ** 2))
print('DNNRegression has RMSE of {0}'.format(rmse));


# Calculate the mean of the motorists injured (MOTO_INJD) values over the training rows.
avg = np.mean(shuffle_manhattan['MOTO_INJD'][:training_size_manhattan])

# Calculate the RMSE on the test rows when that mean is used as the prediction.
# The fit of a proposed regression model should therefore be better than the fit of the mean model.
# In this run that does not seem to be the case, but it will vary on every run.
rmse = np.sqrt(np.mean((shuffle_manhattan['MOTO_INJD'][training_size_manhattan:] - avg) ** 2))
print('Just using average = {0} has RMSE of {1}'.format(avg, rmse));

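# Predict for 21-11-2021 using the same input values as for the collisions model.
# NOTE: the weekday, month and DAY values below do not match that date,
# which was a Sunday in November ('Sun': [1], 'Nov': [1], 'DAY': [21]).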
input = pd.DataFrame.from_dict(data = 
        {
         'Fri': [0],
         'Mon': [0],
         'Sat': [1],
         'Sun': [0],
         'Thu': [0],
         'Tue': [0],	
         'Wed': [0],	
         'YEAR': [2021],
         'Apr': [0],	
         'Aug': [0],	
         'Dec': [0],	
         'Feb': [0],	
         'Jan': [0],	
         'Jul': [0],	
         'Jun': [0],	
         'Mar': [0],	
         'May': [1],	
         'Nov': [0],	
         'Oct': [0],	
         'Sep': [0],	
         'DAY': [4],       
         'TEMP' : [52.3],
         'DEWP' : [42.2],
         'SLP' : [1028.0],
         'VISIB' : [10.0],
         'WDSP' : [11.3],
         'MXPSD' : [14],
         'GUST' : [999.9],
         'MAX' : [57.9],
         'MIN' : [39],
         'PRCP' : [0.0],
         'SNDP' : [999.9],
         'FOG' : [1]
        })

estimator = tf.contrib.learn.SKCompat(tf.contrib.learn.DNNRegressor(model_dir=model_dir, hidden_units=[20,18,14], enable_centered_bias=False, feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(input.values)))

preds = estimator.predict(x=input.values)

predslistnorm = preds['scores']
prednorm = format(str(predslistnorm))
print(prednorm)
print(predslistnorm)

       Fri  Mon  Sat  Sun  Thu  Tue  ...   GUST   MAX   MIN  PRCP   SNDP  FOG
8716     0    0    1    0    0    0  ...   21.0  55.0  39.0   0.0  999.9    0
14878    0    0    0    1    0    0  ...   22.9  75.0  60.1   0.0  999.9    1
3067     0    1    0    0    0    0  ...   28.0  46.9  42.1   0.0  999.9    0
3735     0    1    0    0    0    0  ...   20.0  37.0  18.0   0.0  999.9    0
11695    0    0    0    0    1    0  ...   33.0  30.9  19.0   0.0  999.9    0
3360     1    0    0    0    0    0  ...  999.9  78.1  60.1   0.0  999.9    0

[6 rows x 33 columns]
       Unnamed: 0        DATE    BOROUGH  ...  PERS_KILL  PERS_INJD  NUM_COLS
8716         8717  2017-04-09  MANHATTAN  ...          0         21       111
14878       14879  06-07-2020  MANHATTAN  ...          0         20        39
3067         3068  29-04-2014  MANHATTAN  ...          0         20       123
3735         3736  18-03-2014  MANHATTAN  ...          0         22       119
11695       11696  23-11-2018  MANHATTAN 

The model predicts that 10 motorists would be injured in accidents on that day.
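
The three cells above differ only in the target column and the model directory, so the "one model per target" idea could be expressed as a loop.  The sketch below shows the shape of such a loop only: it uses synthetic data, and an ordinary least-squares fit (`np.linalg.lstsq`) stands in for the `DNNRegressor` so that it runs without TensorFlow.  The helpers `fit_linear` and `predict_linear` are hypothetical.

```python
import numpy as np

def fit_linear(x, y):
    """Stand-in for training a DNNRegressor: least squares with a bias term."""
    x_bias = np.column_stack([x, np.ones(len(x))])
    weights, _, _, _ = np.linalg.lstsq(x_bias, y, rcond=None)
    return weights

def predict_linear(weights, x):
    """Apply the fitted weights (including the bias) to new rows."""
    x_bias = np.column_stack([x, np.ones(len(x))])
    return x_bias @ weights

# Synthetic data for illustration only (not the collisions dataset).
rng = np.random.default_rng(0)
predictors = rng.normal(size=(200, 3))
targets = {
    'NUM_COLS': predictors @ np.array([5.0, -2.0, 1.0]) + 100.0,
    'CYC_KILL': predictors @ np.array([0.01, 0.0, 0.02]) + 0.05,
    'MOTO_INJD': predictors @ np.array([1.0, 0.5, -0.5]) + 10.0,
}

training_size = int(len(predictors) * 0.8)  # same 80/20 split as in the cells above
models = {}
scores = {}
for name, y in targets.items():
    # One independent model per target column.
    models[name] = fit_linear(predictors[:training_size], y[:training_size])
    preds = predict_linear(models[name], predictors[training_size:])
    scores[name] = float(np.sqrt(np.mean((y[training_size:] - preds) ** 2)))
    print('{0}: RMSE = {1:.6f}'.format(name, scores[name]))
```

With the real estimator, each target would also need its own `model_dir`, as the cells above already do, so that the checkpoints of one model do not overwrite another.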

# Conclusion

This notebook has attempted to document the creation of both a linear and a DNN regressor using TensorFlow.