# Training a linear regressor

### Background

The dataset created for part 1 was constructed so that each borough can be identified, which means a linear regressor could be built for each individual borough.  Because every borough's data has the same structure, code that works for one borough works for all of them.  To keep this notebook simple, however, only one borough is used: **Manhattan**.

First, the data must be loaded.  Once loaded, the `query` method of the pandas dataframe extracts the rows for Manhattan.  The years 2012, 2020 and 2021 are then removed, leaving only 2013 to 2019.

In [None]:
# import pandas
import pandas as pd

# load the data as per normal
# df = pd.read_csv('https://raw.githubusercontent.com/adamsjoe/Data_Analytics_On_The_Web/main/Final_Data/Final_Data_Collated.csv', index_col=0)
df = pd.read_csv('https://raw.githubusercontent.com/adamsjoe/Data_Analytics_On_The_Web/main/Final_Data/Final_Data_Collated.csv')

# now setup to query the dataframe for only the borough noted
borough_df = df.query('BOROUGH=="MANHATTAN"')

# ignore this for now
# remove 2012
remove_2012 = borough_df.query('YEAR!=2012')

# remove 2020
remove_2020 = remove_2012.query('YEAR!=2020')

#remove 2021
remove_2021 = remove_2020.query('YEAR!=2021')

# make a new dataframe, which by now shall only contain data for Manhattan in the years 2013 to 2019
manhat_df = remove_2021

year_2013 = manhat_df.query("YEAR==2013")
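The chain of `query` calls above can also be written as a single boolean mask. The following is a minimal sketch on a made-up `toy_df` (not the real CSV); it assumes that keeping the years 2013 to 2019 is equivalent to dropping 2012, 2020 and 2021, which holds only if the data contains no other years.

```python
import pandas as pd

# Toy stand-in for the collated CSV; only the two columns needed here.
toy_df = pd.DataFrame({
    "BOROUGH": ["MANHATTAN", "MANHATTAN", "BRONX", "MANHATTAN", "MANHATTAN"],
    "YEAR": [2012, 2013, 2015, 2019, 2021],
})

# One boolean mask: Manhattan rows whose year lies in 2013..2019 inclusive.
mask = (toy_df["BOROUGH"] == "MANHATTAN") & toy_df["YEAR"].between(2013, 2019)
toy_manhat_df = toy_df[mask]

print(toy_manhat_df)
```

A single mask avoids the intermediate `remove_2012`, `remove_2020` and `remove_2021` dataframes, at the cost of hiding which years were dropped and why.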


#### Data Verification

Once the data has been loaded, it is a good idea to check it.  Printing the first 6 rows shows that all are Manhattan.

In [None]:
# check the data had loaded by printing it
print(manhat_df[:6])


           DATE    BOROUGH  WEEKDAY  ...  PERS_KILL  PERS_INJD  NUM_COLS
924  04-02-2013  MANHATTAN        1  ...          0         14        93
929  19-01-2013  MANHATTAN        6  ...          0         19       108
931  24-01-2013  MANHATTAN        4  ...          0         31       125
936  24-05-2013  MANHATTAN        5  ...          0         41       153
941  24-04-2013  MANHATTAN        3  ...          1         21       117
946  07-03-2013  MANHATTAN        4  ...          0         18       110

[6 rows x 28 columns]


## Import numpy and shuffle

Create a shuffled copy of the data by indexing the Manhattan dataframe with a random permutation (numpy's `random.permutation`) of its row positions.  This matters for this dataset because the data is time series (or derived from time series), so the row order could carry patterns that would otherwise leak into the train/test split.

In [None]:
import numpy as np
shuffle_manhatten = manhat_df.iloc[np.random.permutation(len(manhat_df))]
print(shuffle_manhatten[:5])

# setup constant for use later
SCALE_NUM_COLS = 1.0

             DATE    BOROUGH  WEEKDAY  ...  PERS_KILL  PERS_INJD  NUM_COLS
941    24-04-2013  MANHATTAN        3  ...          1         21       117
4916   10-02-2015  MANHATTAN        2  ...          0         15       116
11328  30-08-2018  MANHATTAN        4  ...          0         24       127
1364   06-05-2013  MANHATTAN        1  ...          0         25       122
12958  28-12-2019  MANHATTAN        6  ...          0         11        58

[5 rows x 28 columns]
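Because the permutation is random, every run of the notebook produces a different shuffle and therefore different training rows. If repeatable results are wanted, one option is a seeded generator. This is a sketch on a made-up `toy_df`; the seed value 42 is arbitrary and `np.random.default_rng` is a substitute for the `np.random.permutation` call used above, not what the notebook itself does.

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame({"DAY": [1, 2, 3, 4, 5], "NUM_COLS": [93, 108, 125, 153, 117]})

# A seeded generator gives the same permutation on every run.
rng = np.random.default_rng(seed=42)  # seed value is arbitrary
shuffled = toy_df.iloc[rng.permutation(len(toy_df))]

# Same seed, same order.
rng_again = np.random.default_rng(seed=42)
shuffled_again = toy_df.iloc[rng_again.permutation(len(toy_df))]

print(shuffled.index.tolist() == shuffled_again.index.tolist())
```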


## Predictors, Training set and Testing set

First the predictors are created.  They are taken from the shuffled data using the columns at zero-based positions 3, 4 and 5, which are YEAR, MONTH and DAY.  If the data had not been filtered by borough at the start, the borough would need to be included as well.

The target is then defined: it is the last column of the shuffled dataset, NUM_COLS (the number of collisions).

The training set will be 80% of the full data set (0.8) and the testing set will be the remainder, so in this case 100 - 80 = 20%.

Constants are set up for the number of predictors (3 in this case: year, month and day) and the number of outputs (or targets, which is 1 here).

In [None]:
# select the year, month and day columns.
predictors_manhatten = shuffle_manhatten.iloc[:,[3,4,5]]
print(predictors_manhatten[:6])

# We want the last column (the NUM_COLS)
targets_manhattan = shuffle_manhatten.iloc[:,-1]
print(targets_manhattan[:6])

# split data into training set
training_size_manhattan = int(len(shuffle_manhatten['NUM_COLS']) * 0.8)

# test size is the size of the data - the training size (in this case 20%)
testing_size_manhattan = len(shuffle_manhatten['NUM_COLS']) - training_size_manhattan

# define the number of input params, day, month and year = 3 (predictors)
NO_PREDICTORS_MANHATTAN = 3

# define the number of output params, collisions = 1 (targets)
NO_TARGETS = 1


       YEAR  MONTH  DAY
941    2013      4   24
4916   2015      2   10
11328  2018      8   30
1364   2013      5    6
12958  2019     12   28
10554  2018      8   25
941      117
4916     116
11328    127
1364     122
12958     58
10554     86
Name: NUM_COLS, dtype: int64
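The 80/20 split relies on the rows already being shuffled: the first 80% of rows become the training set and the rest the testing set. A minimal sketch of that slicing, using a hypothetical helper `split_train_test` and toy arrays rather than the notebook's dataframes:

```python
import numpy as np

def split_train_test(values, train_fraction=0.8):
    """Split an already-shuffled array into leading train rows and trailing test rows."""
    train_size = int(len(values) * train_fraction)
    return values[:train_size], values[train_size:]

predictors = np.arange(30).reshape(10, 3)  # 10 toy rows, 3 predictors
targets = np.arange(10)

train_x, test_x = split_train_test(predictors)
train_y, test_y = split_train_test(targets)
print(train_x.shape, test_x.shape, train_y.shape, test_y.shape)
```

For the 2556 Manhattan rows in this notebook, the same arithmetic gives `int(2556 * 0.8) = 2044` training rows and 512 testing rows.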


### Verification

Just a simple print to ensure the predictor and values are populated. 

In [None]:
print(predictors_manhatten.values)
print(predictors_manhatten)

[[2013    4   24]
 [2015    2   10]
 [2018    8   30]
 ...
 [2016    1    9]
 [2015    6   19]
 [2018   10   24]]
       YEAR  MONTH  DAY
941    2013      4   24
4916   2015      2   10
11328  2018      8   30
1364   2013      5    6
12958  2019     12   28
...     ...    ...  ...
9812   2017     11   14
12578  2019      8   21
7483   2016      1    9
5247   2015      6   19
11731  2018     10   24

[2556 rows x 3 columns]


## The Tensorflow bit..

In [None]:
%tensorflow_version 1.x
import tensorflow as tf
import os

# check tensor version
print(tf.__version__)

import shutil

# logging for tensorflow
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.INFO)

# setup some variables to hold the file path
manhattan_dir = '/tmp/linear_regression_trained_model'

# remove the last training model - if it is present
if os.path.isdir(manhattan_dir):
  print("\n// Removing old model directory ....")
  shutil.rmtree(manhattan_dir, ignore_errors=True)
else:
  print("\n// No model directory to remove ....")

# estimators for each borough
estimator_manhattan = tf.contrib.learn.SKCompat(tf.contrib.learn.LinearRegressor(
    model_dir=manhattan_dir, 
    optimizer=tf.train.AdamOptimizer(learning_rate=0.1), 
    enable_centered_bias=False, 
    feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(predictors_manhatten.values)
    )
)

# # Prints a log to show model is starting to train
print("// Starting to train Manhattan model............\n");

# Train the model. Pass in predictor values and target values.
estimator_manhattan.fit(
    predictors_manhatten[:training_size_manhattan].values, 
    targets_manhattan[:training_size_manhattan].values.reshape(training_size_manhattan,NO_TARGETS)/SCALE_NUM_COLS, steps=10000
)

# Next, we can check our predictions based on our predictors.
preds = estimator_manhattan.predict(x=predictors_manhatten[training_size_manhattan:].values)

# Apply the Scale value (not really needed here) to the outputs.
predslistscale = preds['scores'] * SCALE_NUM_COLS

# pred = format(str(predslistscale)) # useful for checking outputs and printing.

# Calculate RMSE i.e. how good the model works using the predictions and targets.
# rmse = np.sqrt(np.mean((shuffle_manhatten['NUM_COLS'][training_size_manhattan:] - predslistscale)**2))
rmse = np.sqrt(np.mean((targets_manhattan[training_size_manhattan:].values - predslistscale) ** 2))
print('\n\n// Linear Regression has RMSE of {0}'.format(rmse));

# Calculate the mean of the NUM_COLS values in the training set.
avg = np.mean(shuffle_manhatten['NUM_COLS'][:training_size_manhattan])

# Calculate the RMSE when that training mean is used as the prediction for every test row.
rmse = np.sqrt(np.mean((shuffle_manhatten['NUM_COLS'][training_size_manhattan:] - avg) ** 2))
print('\n\n// Just using average = {0} has RMSE of {1}'.format(avg, rmse));

TensorFlow 1.x selected.
1.15.2

// No model directory to remove ....
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

Instructions for updating:
Please specify feature columns explicitly.
Instructions for updating:
Please use tensorflow/transform or tf.data.
Instructions for updating:
Please access pandas data directly.
Instructions for updating:
Please use tensorflow/transform or tf.data.
Instructions for updating:
Please convert numpy dtypes explicitly.
Instructions for updating:
Please specify feature columns explicitly.
Instructions for updating:
Please switch to tf.contrib.estimator.*_head.
Instructions for updating:
Please replace uses of any Estimator from tf.contrib.l
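A linear regressor trained by gradient steps on squared error should, given enough steps, approach the ordinary least-squares solution. As a cross-check that does not depend on TensorFlow, the closed-form fit can be computed with numpy. This is a sketch on synthetic data (the weights and bias below are invented for the example), not the method used in the cell above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic predictors: 200 rows, 3 columns (stand-ins for YEAR, MONTH, DAY).
X = rng.uniform(0.0, 10.0, size=(200, 3))
true_weights = np.array([2.0, -1.0, 0.5])
true_bias = 100.0
y = X @ true_weights + true_bias + rng.normal(0.0, 0.1, size=200)

# Append a column of ones so the bias is fitted as one more weight.
X_with_bias = np.column_stack([X, np.ones(len(X))])

# Ordinary least squares: minimises the sum of squared errors directly.
coeffs, *_ = np.linalg.lstsq(X_with_bias, y, rcond=None)
weights, bias = coeffs[:3], coeffs[3]

print(weights, bias)
```

If the trained estimator's RMSE is much worse than the RMSE of such a closed-form fit on the same training rows, that would suggest the optimiser has not converged rather than that the predictors are uninformative.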

## Initial Output

The RMSE is quite large, which indicates an inaccurate model.  However, this model is only trying to predict from the date (year, month and day), without any extra conditions.  Temperature could be added, and this may affect the output.

The model needs to be tested first.  To do this, some known values will be fed in as predictors.

## Initial Prediction
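The two RMSE figures printed by the training cell compare the model against a baseline that always predicts the training mean. A small sketch of that comparison follows; the numbers are made up for illustration and `rmse` is a helper defined here, not part of the notebook.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error between two equal-length arrays."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return np.sqrt(np.mean((actual - predicted) ** 2))

# Made-up numbers, for illustration only.
train_targets = np.array([100, 120, 80, 110])
test_targets = np.array([90, 130, 105])
model_preds = np.array([95, 120, 100])

model_rmse = rmse(test_targets, model_preds)
# Baseline: predict the training mean for every test row.
baseline_preds = np.full(len(test_targets), train_targets.mean())
baseline_rmse = rmse(test_targets, baseline_preds)

print(model_rmse, baseline_rmse)
```

A model is only useful if its RMSE is clearly below the baseline RMSE; a "large" RMSE on its own says little without that reference point.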

```python
input = pd.DataFrame.from_dict(data = 
				{'YEAR' : [2013,2016,2022],
         'MONTH' : [1, 6, 12],
         'DAY' : [1, 1, 1]
        })

```

To set up the values for prediction, 3 dates have been chosen: 1/1/2013, 1/6/2016 and 1/12/2022 will be fed into the model.

In [None]:
input = pd.DataFrame.from_dict(data = 
				{'YEAR' : [2013,2016,2022],
         'MONTH' : [1, 6, 12],
         'DAY' : [1, 1, 1]
        })

estimator_manhattan = tf.contrib.learn.SKCompat(tf.contrib.learn.LinearRegressor(model_dir=manhattan_dir, 
                                                                                 enable_centered_bias=False, 
                                                                                 feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(input.values)
                                                                                 )
)

preds = estimator_manhattan.predict(x=input.values)

predslistnorm = preds['scores']
predslistscale = preds['scores'] * 1.0
prednorm = format(str(predslistnorm))
pred = format(str(predslistscale))

print(prednorm)
print(pred)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f789f62c410>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/linear_regression_trained_model', '_session_creation_timeout_secs': 7200}
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/linear_regression_trained_model/model.ck

### Result of First Run

The output (at the time of writing) was 

[117.42061  123.957085 131.939   ]

The model is being asked to predict 3 dates.  Two of these dates are in the data, so their collision counts are known and they can be used as "control" dates.  Predictions for these should be fairly accurate (or at least, that is the theory...).

The dates are:

Date|Prediction|Actual
:--:|:--------:|:----:
1/1/2013|117|78
1/6/2016|123|121
1/12/2022|131|N/A

The trend is generally upwards, which does **sort of** look OK.  However, the prediction for 1/1/2013 is well above the actual value, so, as noted above, it would be a good idea to add an additional predictor.  The TEMP value should suffice.

**Note** these values may change if the notebook is run at a later stage.
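The "Actual" column can be filled in by looking the control dates up in the dataframe. A minimal sketch, assuming the YEAR, MONTH, DAY and NUM_COLS columns seen earlier; `toy_df` and the helper `actual_collisions` are hypothetical, and the third toy row is invented.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "YEAR": [2013, 2016, 2016],
    "MONTH": [1, 6, 6],
    "DAY": [1, 1, 2],
    "NUM_COLS": [78, 121, 99],  # first two values are from the table above
})

def actual_collisions(df, year, month, day):
    """Return NUM_COLS for one date, or None when the date is not in the data."""
    match = df[(df["YEAR"] == year) & (df["MONTH"] == month) & (df["DAY"] == day)]
    return None if match.empty else int(match["NUM_COLS"].iloc[0])

print(actual_collisions(toy_df, 2013, 1, 1))
print(actual_collisions(toy_df, 2022, 12, 1))
```

A date outside the data, such as the 2022 one, has no actual value, which is why it is shown as N/A.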

## Adding in Temp

The code cells are copied from above, this time with the predictors amended to include temperature.

**Note:** For brevity, the additional prints are not included and all code is placed in one cell.

In [None]:
# select the year, month, day and temp columns.
predictors_manhatten = shuffle_manhatten.iloc[:,[3,4,5,7]]
print(predictors_manhatten[:6])

# We want the last column (the NUM_COLS)
targets_manhattan = shuffle_manhatten.iloc[:,-1]
print(targets_manhattan[:6])

# split data into training set
training_size_manhattan = int(len(shuffle_manhatten['NUM_COLS']) * 0.8)

# test size is the size of the data - the training size (in this case 20%)
testing_size_manhattan = len(shuffle_manhatten['NUM_COLS']) - training_size_manhattan

# define the number of input params, day, month, year and now temp = 4 (predictors)
NO_PREDICTORS_MANHATTAN = 4

# define the number of output params, collisions = 1 (targets)
NO_TARGETS = 1

%tensorflow_version 1.x
import tensorflow as tf

# check tensor version
print(tf.__version__)

import shutil

# logging for tensorflow
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.INFO)

# setup some variables to hold the file path
manhattan_dir = '/tmp/linear_regression_trained_model_including_temp'

# remove the last training model
shutil.rmtree(manhattan_dir, ignore_errors=True)

# estimators for each borough
estimator_manhattan = tf.contrib.learn.SKCompat(tf.contrib.learn.LinearRegressor(
    model_dir=manhattan_dir, 
    optimizer=tf.train.AdamOptimizer(learning_rate=0.1), 
    enable_centered_bias=False, 
    feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(predictors_manhatten.values)
    )
)

# # Prints a log to show model is starting to train
print("// Starting to train Manhattan model............\n");

# Train the model. Pass in predictor values and target values.
estimator_manhattan.fit(
    predictors_manhatten[:training_size_manhattan].values, 
    targets_manhattan[:training_size_manhattan].values.reshape(training_size_manhattan,NO_TARGETS)/SCALE_NUM_COLS, steps=10000
)

# Next, we can check our predictions based on our predictors.
preds = estimator_manhattan.predict(x=predictors_manhatten[training_size_manhattan:].values)

# Apply the Scale value (not really needed here) to the outputs.
predslistscale = preds['scores'] * SCALE_NUM_COLS

# Calculate RMSE i.e. how good the model works using the predictions and targets.
# rmse = np.sqrt(np.mean((shuffle_manhatten['NUM_COLS'][training_size_manhattan:] - predslistscale)**2))
rmse = np.sqrt(np.mean((targets_manhattan[training_size_manhattan:].values - predslistscale) ** 2))
print('\n\n// Linear Regression has RMSE of {0}'.format(rmse));

# Calculate the mean of the NUM_COLS values in the training set.
avg = np.mean(shuffle_manhatten['NUM_COLS'][:training_size_manhattan])

# Calculate the RMSE when that training mean is used as the prediction for every test row.
rmse = np.sqrt(np.mean((shuffle_manhatten['NUM_COLS'][training_size_manhattan:] - avg) ** 2))
print('\n\n// Just using average = {0} has RMSE of {1}'.format(avg, rmse));

       YEAR  MONTH  DAY  TEMP
941    2013      4   24  47.5
4916   2015      2   10  36.4
11328  2018      8   30  76.2
1364   2013      5    6  49.2
12958  2019     12   28  44.5
10554  2018      8   25  66.5
941      117
4916     116
11328    127
1364     122
12958     58
10554     86
Name: NUM_COLS, dtype: int64
1.15.2
INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f7841e57590>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': N

In [None]:
input = pd.DataFrame.from_dict(data = 
				{'YEAR' : [2013,2016,2022],
         'MONTH' : [1, 6, 12],
         'DAY' : [1, 1, 1],
         'TEMP': [35,60,30]
        })


estimator_manhattan = tf.contrib.learn.SKCompat(tf.contrib.learn.LinearRegressor(model_dir=manhattan_dir, 
                                                                                 enable_centered_bias=False, 
                                                                                 feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(input.values)
                                                                                 )
)

preds = estimator_manhattan.predict(x=input.values)
# print(preds)

# The model was trained with SCALE_NUM_COLS (1.0), so the same scale must be applied to the predictions.
# For reference, the average number of collisions per day in Manhattan is 106.072969 (359057 collisions / 3385 days).
predslistnorm = preds['scores']
predslistscale = preds['scores'] * SCALE_NUM_COLS
prednorm = format(str(predslistnorm))
pred = format(str(predslistscale))

print(prednorm)
print(pred)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f7841e51e50>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/linear_regression_trained_model_including_temp', '_session_creation_timeout_secs': 7200}
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/linear_regression_trained
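A scale factor only makes sense if it is applied symmetrically: targets are divided by it before training and predictions are multiplied by the same value afterwards. The sketch below illustrates the round trip with the average quoted in the cell above; the "predictions" are faked by copying the scaled targets, so no model is involved.

```python
import numpy as np

SCALE = 106.072969  # average daily collisions quoted in the cell above

targets = np.array([117.0, 116.0, 127.0, 122.0])

# Training side: the model would be fitted on targets divided by the scale.
scaled_targets = targets / SCALE

# Prediction side: pretend the model reproduces the scaled targets exactly,
# then multiply by the SAME scale to get back to collision counts.
fake_scaled_preds = scaled_targets.copy()
restored = fake_scaled_preds * SCALE

print(np.allclose(restored, targets))
```

If a model is trained with a scale of 1.0, multiplying its predictions by any other factor inflates them by that factor.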

### After Temp

After adding in the temp, the values have changed:


Date|Prediction (no temp)|Prediction (with temp)|Actual
:--:|:------------------:|:--------------------:|:----:
1/1/2013|122|112|78|
1/6/2016|128|125|121|
1/12/2022|135|119|N/A|

So while the values are pulling closer to the actual counts (on this run), they are not **exact**.

Looking at the data from part 1 of the assignment, year, month and day do not affect the outputs, at least not directly.  Temperature and Weekday do.  So, the model could be changed to use these as predictors. 
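One quick way to check which columns move with the target is a correlation table. The sketch below uses synthetic data in which NUM_COLS is deliberately built from TEMP alone, so the outcome is known in advance; it shows the mechanics only and says nothing about the real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 500

toy_df = pd.DataFrame({
    "DAY": rng.integers(1, 29, size=n),
    "TEMP": rng.uniform(20.0, 90.0, size=n),
})
# Synthetic target built to depend on TEMP only (an assumption of this toy data).
toy_df["NUM_COLS"] = 50.0 + 1.0 * toy_df["TEMP"] + rng.normal(0.0, 5.0, size=n)

# Pearson correlation of every column with the target.
correlations = toy_df.corr()["NUM_COLS"].drop("NUM_COLS")
print(correlations.abs().sort_values(ascending=False))
```

Pearson correlation only measures linear association, so a column such as WEEKDAY, whose effect may not rise or fall steadily from 1 to 7, can matter more than its correlation suggests.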


## Using Temperature and Weekday 

Taking the code from above, it will be modified to use these two parameters as predictors.

In [None]:
# select the weekday and temp columns.
predictors_manhatten = shuffle_manhatten.iloc[:,[2,7]]
print("// Printing predictors ...")
print(predictors_manhatten[:6])

# We want the last column (the NUM_COLS)
targets_manhattan = shuffle_manhatten.iloc[:,-1]
print("// Printing targets .....")
print(targets_manhattan[:6])

# split data into training set
training_size_manhattan = int(len(shuffle_manhatten['NUM_COLS']) * 0.8)

# test size is the size of the data - the training size (in this case 20%)
testing_size_manhattan = len(shuffle_manhatten['NUM_COLS']) - training_size_manhattan

# define the number of input params, weekday and temp = 2 (predictors)
NO_PREDICTORS_MANHATTAN = 2

# define the number of output params, collisions = 1 (targets)
NO_TARGETS = 1

%tensorflow_version 1.x
import tensorflow as tf

# check tensor version
print(tf.__version__)

import shutil

# logging for tensorflow
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.INFO)

# setup some variables to hold the file path
manhattan_dir = '/tmp/linear_regression_trained_model_including_temp_and_weekday'

# remove the last training model
shutil.rmtree(manhattan_dir, ignore_errors=True)

# estimators for each borough
estimator_manhattan = tf.contrib.learn.SKCompat(tf.contrib.learn.LinearRegressor(
    model_dir=manhattan_dir, 
    optimizer=tf.train.AdamOptimizer(learning_rate=0.1), 
    enable_centered_bias=False, 
    feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(predictors_manhatten.values)
    )
)

# # Prints a log to show model is starting to train
print("// Starting to train Manhattan model............\n");

# Train the model. Pass in predictor values and target values.
estimator_manhattan.fit(
    predictors_manhatten[:training_size_manhattan].values, 
    targets_manhattan[:training_size_manhattan].values.reshape(training_size_manhattan,NO_TARGETS)/SCALE_NUM_COLS, steps=10000
)

# Next, we can check our predictions based on our predictors.
preds = estimator_manhattan.predict(x=predictors_manhatten[training_size_manhattan:].values)

# Apply the Scale value (not really needed here) to the outputs.
predslistscale = preds['scores'] * SCALE_NUM_COLS

# pred = format(str(predslistscale)) # useful for checking outputs and printing.

# Calculate RMSE i.e. how good the model works using the predictions and targets.
# rmse = np.sqrt(np.mean((shuffle_manhatten['NUM_COLS'][training_size_manhattan:] - predslistscale)**2))
rmse = np.sqrt(np.mean((targets_manhattan[training_size_manhattan:].values - predslistscale) ** 2))
print('\n\n// Linear Regression has RMSE of {0}'.format(rmse));

# Calculate the mean of the NUM_COLS values in the training set.
avg = np.mean(shuffle_manhatten['NUM_COLS'][:training_size_manhattan])

# Calculate the RMSE when that training mean is used as the prediction for every test row.
rmse = np.sqrt(np.mean((shuffle_manhatten['NUM_COLS'][training_size_manhattan:] - avg) ** 2))
print('\n\n// Just using average = {0} has RMSE of {1}'.format(avg, rmse));


// Printing predictors ...
       WEEKDAY  TEMP
941          3  47.5
4916         2  36.4
11328        4  76.2
1364         1  49.2
12958        6  44.5
10554        6  66.5
// Printing targets .....
941      117
4916     116
11328    127
1364     122
12958     58
10554     86
Name: NUM_COLS, dtype: int64
1.15.2
INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f789f624690>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_sav

This still has quite a large RMSE.  However, a prediction can again be run for a day whose outcome is known, this time supplying the weekday and temp values.

Using 24-10-2018 in the Manhattan borough, the *WEEKDAY* was **3** and the *TEMP* was **48.4**.

In [None]:
input = pd.DataFrame.from_dict(data = 
				{'WEEKDAY' : [3],
         'TEMP' : [48.4]
        })


estimator_manhattan = tf.contrib.learn.SKCompat(tf.contrib.learn.LinearRegressor(model_dir=manhattan_dir, 
                                                                                 enable_centered_bias=False, 
                                                                                 feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(input.values)
                                                                                 )
)

preds = estimator_manhattan.predict(x=input.values)
# print(preds)

# No rescaling is needed here: the model was trained with a scale of 1.
predslistnorm = preds['scores']
predslistscale = preds['scores'] * 1
prednorm = format(str(predslistnorm))
pred = format(str(predslistscale))

print(prednorm)
print(pred)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f78475fdb10>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/linear_regression_trained_model_including_temp_and_weekday', '_session_creation_timeout_secs': 7200}
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/linear_regres

This returns [120.53093]; the actual number of collisions was **120**.

One thing to note is that, in the runs shown here, the model returns more collisions than actually happened.  In this case that is probably a wise thing: better to be prepared for too many accidents than too few...
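Whether the model really tends to over-predict can be checked on the whole testing set rather than on a handful of dates, by looking at the sign and mean of the residuals. A minimal sketch with made-up numbers (not results from this notebook):

```python
import numpy as np

# Made-up test-set values, for illustration only.
actual = np.array([78, 121, 120, 95, 130])
predicted = np.array([117.4, 123.9, 120.5, 101.0, 125.0])

residuals = predicted - actual          # positive means over-prediction
mean_bias = residuals.mean()
share_over = np.mean(residuals > 0)     # fraction of rows predicted too high

print(mean_bias, share_over)
```

A mean residual near zero with roughly half the rows over-predicted would suggest the apparent bias is just an artefact of the few dates tried.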

# Training a DNN


## Background

The same dataset which was constructed during the first part of this assignment will be used here.  As before, the data will be limited to the Manhattan borough.


In [None]:
# Import pandas to use dataframes
import pandas as pd

# create data frame from csv file we hosted on our github
df = pd.read_csv('https://raw.githubusercontent.com/adamsjoe/Data_Analytics_On_The_Web/main/Final_Data/Final_Data_Collated.csv')

# now setup to query the dataframe for only the boroughs noted
manhat_df = df.query('BOROUGH=="MANHATTAN"')

SCALE_NUM_COLS = 1.0

### Checking data loaded


In [None]:
# Ensure data has loaded
print(manhat_df[:6])

          DATE    BOROUGH  WEEKDAY  ...  PERS_KILL  PERS_INJD  NUM_COLS
0   21-08-2012  MANHATTAN        2  ...          0         28       109
6   27-10-2012  MANHATTAN        6  ...          0         20       122
13  25-08-2012  MANHATTAN        6  ...          0         22        97
16  09-09-2012  MANHATTAN        7  ...          0         29       109
20  17-09-2012  MANHATTAN        1  ...          1         27       123
28  23-11-2012  MANHATTAN        5  ...          0          5        66

[6 rows x 28 columns]


## Importing Numpy and setting predictors

For the predictors, 16 numeric columns are used: WEEKDAY, YEAR, MONTH, DAY and the weather readings from TEMP to FOG.  The borough is excluded (as only Manhattan is being selected) and the DATE column is excluded (its values are already present in the DAY, MONTH and YEAR columns).

In [None]:
import numpy as np
shuffle_manhattan = manhat_df.iloc[np.random.permutation(len(manhat_df))]

# setup predictors
# Removing the date field as the DNN should be able to work from the DAY, MONTH and YEAR columns.  Also excluding the borough field as only dealing with Manhattan
predictors_manhattan = shuffle_manhattan.iloc[:,[2,3,4,5,7,8,9,10,11,12,13,14,15,16,17,18]]

# Check the column selection has worked
print(predictors_manhattan[:6])

       WEEKDAY  YEAR  MONTH  DAY  TEMP  ...   MAX   MIN  PRCP   SNDP  FOG
6113         6  2015      6    6  58.0  ...  63.0  46.9  0.05  999.9    1
12922        7  2019     11   10  48.3  ...  54.0  32.0  0.00  999.9    0
4916         2  2015      2   10  36.4  ...  37.9  35.1  0.07  999.9    0
8454         3  2017      2   15  36.6  ...  44.1  21.9  0.00  999.9    0
8456         4  2017      2   16  36.1  ...  44.1  21.9  0.63  999.9    1
16223        3  2021      6    9  68.9  ...  81.0  62.1  0.00  999.9    1

[6 rows x 16 columns]
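Selecting predictors by position is easy to get wrong if the column order ever changes. An alternative is to select numeric columns by type and drop the ones that must not be predictors by name. This is a sketch on a made-up `toy_df`; treating PERS_INJD as a non-predictor is an assumption made here, on the grounds that outcome counts are only known after the fact.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "DATE": ["06-06-2015", "10-11-2019"],
    "BOROUGH": ["MANHATTAN", "MANHATTAN"],
    "WEEKDAY": [6, 7],
    "TEMP": [58.0, 48.3],
    "PERS_INJD": [18, 15],
    "NUM_COLS": [112, 84],
})

# Columns that must not be predictors: the target itself and (an assumption
# here) outcome counts such as PERS_INJD that are only known after the fact.
not_predictors = ["NUM_COLS", "PERS_INJD"]

predictors = toy_df.select_dtypes(include="number").drop(columns=not_predictors)
print(list(predictors.columns))
```

The text columns DATE and BOROUGH are dropped automatically because they are not numeric, and `predictors.shape[1]` then gives the predictor count directly.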


In [None]:
# print first 5 rows of the shuffle
print(shuffle_manhattan[:5])

             DATE    BOROUGH  WEEKDAY  ...  PERS_KILL  PERS_INJD  NUM_COLS
6113   06-06-2015  MANHATTAN        6  ...          0         18       112
12922  10-11-2019  MANHATTAN        7  ...          0         15        84
4916   10-02-2015  MANHATTAN        2  ...          0         15       116
8454   2017-02-15  MANHATTAN        3  ...          0         18       111
8456   2017-02-16  MANHATTAN        4  ...          0         23       112

[5 rows x 28 columns]


### Defining the target

For this test model, only the number of collisions (NUM_COLS) is of interest.  This needs to be defined.

In [None]:
# Define the target (the NUM_COLS)
target_manhatan = shuffle_manhattan.iloc[:,-1]

# print the targets
print(target_manhatan[:6])

6113     112
12922     84
4916     116
8454     111
8456     112
16223     51
Name: NUM_COLS, dtype: int64


### Training and Testing dataset

As with the linear regressor, the data is split into a training set (80%) and a testing set (20%).

In [None]:
# Split our data into a training set i.e. 80% of the length of the shuffle array
training_size_manhattan = int(len(shuffle_manhattan['NUM_COLS'] ) *0.8)

# The test set size is 100% - 80% = 20% of the length of the shuffle array.
testing_size_manhattan = len(shuffle_manhattan['NUM_COLS']) - training_size_manhattan

# Define the number of input values (predictors) - the 16 columns selected above; the borough and date are not included
no_predictors = 16

# Define the number of output values (targets)
no_outputs = 1

## The tensorflow bit...

Now the TensorFlow model can be set up.  This is mostly the same as for the linear regressor, except that a DNNRegressor is used.

In [None]:
# import tensorflow
%tensorflow_version 1.x
import tensorflow as tf
import os

# check the version
print(tf.__version__)

# needed for high-level file management
import shutil  

# logging for tensorflow
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.INFO)

# directory holding the saved model from the last training attempt.
model_dir = '/tmp/DNN_NY_ALL_BOROUGHS_regression_trained_model'

# remove the last training model - if it is present
if os.path.isdir(model_dir):
  print("\n// Removing old model directory ....")
  shutil.rmtree(model_dir, ignore_errors=True)
else:
  print("\n// No model directory to remove ....")

estimator = tf.contrib.learn.SKCompat(tf.contrib.learn.DNNRegressor(
    model_dir=model_dir, 
    hidden_units=[20,18,14], 
    optimizer=tf.train.AdamOptimizer(learning_rate=0.01), 
    enable_centered_bias=False, 
    feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(predictors_manhattan.values)
    )
)

# Prints a log to show model is starting to train
print("starting to train");

# Train the model. Pass in predictor values and target values.
estimator.fit(predictors_manhattan[:training_size_manhattan].values, 
              target_manhatan[:training_size_manhattan].values.reshape(training_size_manhattan, no_outputs) / SCALE_NUM_COLS, steps=10000)

# Next, we can check our predictions based on our predictors.
preds = estimator.predict(x=predictors_manhattan[training_size_manhattan:].values)

# Apply the Scale value (not really needed here) to the outputs.
predslistscale = preds['scores'] * SCALE_NUM_COLS

# pred = format(str(predslistscale)) # useful for checking outputs and printing.

# Calculate RMSE i.e. how good the model works using the predictions and targets.
# i.e. take the difference between the actual and the forecast then square the difference, 
# find the average of all the squares and then find the square root. 
# The RMSE essentially punishes larger errors i.e. it puts a heavier weight on larger errors.
rmse = np.sqrt(np.mean((target_manhatan[training_size_manhattan:].values - predslistscale) ** 2))
print('DNNRegression has RMSE of {0}'.format(rmse));


# Calculate the mean of the NUM_COLS values in the training set.
avg = np.mean(shuffle_manhattan['NUM_COLS'][:training_size_manhattan])

# Calculate the RMSE when that training mean is used as the prediction for every test row.
# The fit of a proposed regression model should therefore be better than the fit of the mean model.
# In this case it does not seem to be, but this will vary on every run.
rmse = np.sqrt(np.mean((shuffle_manhattan['NUM_COLS'][training_size_manhattan:] - avg) ** 2))
print('Just using average = {0} has RMSE of {1}'.format(avg, rmse));

TensorFlow 1.x selected.
1.15.2

// No model directory to remove ....
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

Instructions for updating:
Please specify feature columns explicitly.
Instructions for updating:
Please use tensorflow/transform or tf.data.
Instructions for updating:
Please access pandas data directly.
Instructions for updating:
Please use tensorflow/transform or tf.data.
Instructions for updating:
Please convert numpy dtypes explicitly.
Instructions for updating:
Please specify feature columns explicitly.
Instructions for updating:
Please switch to tf.contrib.estimator.*_head.
Instructions for updating:
Please replace uses of any Estimator from tf.contrib.l
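
The training cell compares the model's RMSE with the RMSE of always predicting the training mean. That comparison can be illustrated on its own. The following is a minimal sketch using made-up numbers: the arrays `train_targets`, `actual` and `predicted` and the helper `rmse` are hypothetical and are not taken from the collisions data.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error between two equal-length arrays."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# Hypothetical values for illustration only (not from the collisions data)
train_targets = np.array([100.0, 120.0, 140.0])  # targets seen during training
actual = np.array([110.0, 130.0, 150.0])         # held-out targets
predicted = np.array([112.0, 126.0, 149.0])      # model output for the held-out rows

# Baseline: always predict the mean of the training targets
baseline = np.full_like(actual, train_targets.mean())

model_rmse = rmse(actual, predicted)
baseline_rmse = rmse(actual, baseline)

print('model RMSE    = {0:.3f}'.format(model_rmse))
print('baseline RMSE = {0:.3f}'.format(baseline_rmse))
print('model beats baseline:', model_rmse < baseline_rmse)
```

A model is only worth keeping if its RMSE on the held-out rows is lower than that of the mean baseline; because the rows are shuffled randomly, the outcome can differ between runs.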

### Testing the model

As a test for the model, a date is selected from the data and its values are plugged in.  For this test, 05/05/2017 in MANHATTAN has been selected.

In [None]:
# As a test, use something which is known - in this case the values for MANHATTAN on 05/05/2017
input = pd.DataFrame.from_dict(data = 
        {
         'WEEKDAY' : [5],
         'YEAR' : [2017],
         'MONTH' : [5],
         'DAY' : [5],
         'TEMP' : [49.4],
         'DEWP' : [44],
         'SLP' : [1019.6],
         'VISIB' : [7.6],
         'WDSP' : [14.9],
         'MXPSD' : [25.1],
         'GUST' : [31.1],
         'MAX' : [57.9],
         'MIN' : [45],
         'PRCP' : [0],
         'SNDP' : [999.9],
         'FOG' : [0]
        })

estimator = tf.contrib.learn.SKCompat(tf.contrib.learn.DNNRegressor(model_dir=model_dir, hidden_units=[20,18,14], enable_centered_bias=False, feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(input.values)))

preds = estimator.predict(x=input.values)

predslistnorm = preds['scores']
prednorm = format(str(predslistnorm))
print(prednorm)
print(predslistnorm)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fa508643690>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/DNN_NY_ALL_BOROUGHS_regression_trained_model', '_session_creation_timeout_secs': 7200}
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/DNN_NY_ALL_BOROUGHS_regress

This returns **[110.7177]** (the value may change on subsequent runs), which is noticeably lower than the recorded value of **172**.



### Testing an unknown date.

When the data was constructed, the year 2021 was removed (due to COVID).  However, the weather data for 2021 is still available.  Selecting a more recent date (in this case 21-11-2021) and using the weather data from that day will give a different result.

date      |year|mo|da|temp|dewp|slp   |visib|wdsp|mxpsd|gust |max |min|prcp|sndp |fog
----------|----|--|--|----|----|------|-----|----|-----|-----|----|---|----|-----|---
2021-11-21|2021|11|21|52.3|42.2|1028.0|10.0 |11.3|14   |999.9|57.9|39.0|0.0|999.9|1

This will result in an ```input``` which looks like this:

```python
input = pd.DataFrame.from_dict(data = 
        {
         'WEEKDAY' : [7],
         'YEAR' : [2021],
         'MONTH' : [11],
         'DAY' : [21],
         'TEMP' : [52.3],
         'DEWP' : [42.2],
         'SLP' : [1028.0],
         'VISIB' : [10.0],
         'WDSP' : [11.3],
         'MXPSD' : [14],
         'GUST' : [999.9],
         'MAX' : [57.9],
         'MIN' : [39],
         'PRCP' : [0.0],
         'SNDP' : [999.9],
         'FOG' : [1]
        })
```
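
The `WEEKDAY` value is not part of the weather table, so it has to be worked out from the date.  A minimal sketch of doing so is shown below.  It assumes the dataset numbers weekdays from Monday = 1 to Sunday = 7; this agrees with the sample rows printed earlier (04-02-2013, a Monday, has `WEEKDAY` 1), but it is an assumption about how the column was built.  The helper name `weekday_for` is hypothetical.

```python
from datetime import date

def weekday_for(year, month, day):
    """Return the ISO weekday number: Monday = 1 ... Sunday = 7."""
    return date(year, month, day).isoweekday()

# 21-11-2021 fell on a Sunday
print(weekday_for(2021, 11, 21))

# 05-05-2017 fell on a Friday
print(weekday_for(2017, 5, 5))
```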

For brevity, a single code block which incorporates all of the above code is shown below:

In [None]:
import numpy as np
shuffle_manhattan = manhat_df.iloc[np.random.permutation(len(manhat_df))]

# setup predictors
# Removing the date field as the DNN should be able to extract this.  Also excluding the borough field as only dealing with Manhattan
predictors_manhattan = shuffle_manhattan.iloc[:,[2,3,4,5,7,8,9,10,11,12,13,14,15,16,17,18]]

# Check the column selection has worked by printing the first 6 rows
print(predictors_manhattan[:6])

# print first 5 rows of the shuffle
print(shuffle_manhattan[:5])

# Define the target (the NUM_COLS)
target_manhatan = shuffle_manhattan.iloc[:,-1]

# print the targets
print(target_manhatan[:6])

# Split our data into a training set i.e. 80% of the length of the shuffle array
training_size_manhattan = int(len(shuffle_manhattan['NUM_COLS'] ) *0.8)

# The test set size is 100% - 80% = 20% of the length of the shuffle array.
testing_size_manhattan = len(shuffle_manhattan['NUM_COLS']) - training_size_manhattan

# Define the number of input values (predictors).  The date and borough are not included, which leaves the 16 columns selected above.
no_predictors = 16

# Define the number of output values (targets)
no_outputs = 1

# import tensorflow
%tensorflow_version 1.x
import tensorflow as tf
import os

# check the version
print(tf.__version__)

# needed for high-level file management
import shutil  

# logging for tensorflow
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.INFO)

# directory in which the trained model is saved
model_dir = '/tmp/DNN_NY_ALL_BOROUGHS_regression_trained_model'

# remove the last training model - if it is present
if os.path.isdir(model_dir):
  print("\n// Removing old model directory ....")
  shutil.rmtree(model_dir, ignore_errors=True)
else:
  print("\n// No model directory to remove ....")

estimator = tf.contrib.learn.SKCompat(tf.contrib.learn.DNNRegressor(
    model_dir=model_dir, 
    hidden_units=[20,18,14], 
    optimizer=tf.train.AdamOptimizer(learning_rate=0.01), 
    enable_centered_bias=False, 
    feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(predictors_manhattan.values)
    )
)

# Print a log message to show the model is starting to train
print("starting to train")

# Train the model. Pass in predictor values and target values.
estimator.fit(predictors_manhattan[:training_size_manhattan].values, 
              target_manhatan[:training_size_manhattan].values.reshape(training_size_manhattan, no_outputs) / SCALE_NUM_COLS, steps=10000)

# Next, we can check our predictions based on our predictors.
preds = estimator.predict(x=predictors_manhattan[training_size_manhattan:].values)

# Apply the Scale value (not really needed here) to the outputs.
predslistscale = preds['scores'] * SCALE_NUM_COLS

# pred = format(str(predslistscale)) # useful for checking outputs and printing.

# Calculate RMSE i.e. how good the model works using the predictions and targets.
# i.e. take the difference between the actual and the forecast then square the difference, 
# find the average of all the squares and then find the square root. 
# The RMSE essentially punishes larger errors i.e. it puts a heavier weight on larger errors.
rmse = np.sqrt(np.mean((target_manhatan[training_size_manhattan:].values - predslistscale) ** 2))
print('DNNRegression has RMSE of {0}'.format(rmse))


# Calculate the mean of the NUM_COLS (number of collisions) values in the training set.
avg = np.mean(shuffle_manhattan['NUM_COLS'][:training_size_manhattan])

# Calculate the RMSE of a baseline which always predicts the training mean of NUM_COLS.
# The fit of a proposed regression model should be better than the fit of this mean model.
# In this run it does not appear to be, but the result will vary on every run.
rmse = np.sqrt(np.mean((shuffle_manhattan['NUM_COLS'][training_size_manhattan:] - avg) ** 2))
print('Just using average = {0} has RMSE of {1}'.format(avg, rmse))

# Now test an unknown date - in this case the values for MANHATTAN on 21/11/2021
input = pd.DataFrame.from_dict(data = 
        {
         'WEEKDAY' : [7],
         'YEAR' : [2021],
         'MONTH' : [11],
         'DAY' : [21],
         'TEMP' : [52.3],
         'DEWP' : [42.2],
         'SLP' : [1028.0],
         'VISIB' : [10.0],
         'WDSP' : [11.3],
         'MXPSD' : [14],
         'GUST' : [999.9],
         'MAX' : [57.9],
         'MIN' : [39],
         'PRCP' : [0.0],
         'SNDP' : [999.9],
         'FOG' : [1]
        })

estimator = tf.contrib.learn.SKCompat(tf.contrib.learn.DNNRegressor(model_dir=model_dir, hidden_units=[20,18,14], enable_centered_bias=False, feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(input.values)))

preds = estimator.predict(x=input.values)

predslistnorm = preds['scores']
prednorm = format(str(predslistnorm))
print(prednorm)
print(predslistnorm)

       WEEKDAY  YEAR  MONTH  DAY  TEMP  ...   MAX   MIN  PRCP   SNDP  FOG
460          5  2012      7    6  81.9  ...  91.0  66.9  0.00  999.9    0
10896        3  2018      5    9  51.2  ...  61.0  46.9  0.00  999.9    1
5846         1  2015      6   15  57.8  ...  59.0  55.0  0.00  999.9    0
15251        4  2020      4    9  43.0  ...  51.1  30.0  0.27  999.9    1
14308        4  2020      1   23  32.3  ...  45.0  17.1  0.00  999.9    0
10335        1  2018     12   24  35.2  ...  39.9  28.9  0.00  999.9    0

[6 rows x 16 columns]
             DATE    BOROUGH  WEEKDAY  ...  PERS_KILL  PERS_INJD  NUM_COLS
460    06-07-2012  MANHATTAN        5  ...          0         37       154
10896  09-05-2018  MANHATTAN        3  ...          0         22       132
5846   15-06-2015  MANHATTAN        1  ...          1         19       118
15251  09-04-2020  MANHATTAN        4  ...          0          6        12
14308  23-01-2020  MANHATTAN        4  ...          0         23        79

[5 rows 

### Prediction

This returns [87.57121].  When the prediction was made there was no way to tell whether this was correct.  Since then, the NY collisions data has been updated and shows that 27 accidents were recorded in the Manhattan area on the selected date.

To validate this, a simple query was run on the New York Collisions dataset:

```sql
SELECT * FROM `bigquery-public-data.new_york_mv_collisions.nypd_mv_collisions`
WHERE timestamp BETWEEN '2021-11-20' AND '2021-11-22'
ORDER BY timestamp DESC
```

The result of this was saved (file name "pred_check_results.csv"), and both the location column (which, being a comma-separated value, caused issues) and the header were removed.  A new Python script (check_results.py) was created, which used the getBoroghFromLatLong function used when creating the data.  The result was saved to a new file, which then had the header inserted and the data filtered.
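
The final filtering step can be sketched roughly as follows.  This is not the original check_results.py; it is a minimal illustration on a small in-memory table with made-up rows.  It assumes columns named `timestamp` and `borough`, and that the borough has already been assigned to each row (in the original this was done with getBoroghFromLatLong, which is not reproduced here).  Because the query range covers more than one calendar day, the rows still need narrowing to the selected date.

```python
import pandas as pd

# Made-up rows standing in for the exported query result (illustration only)
collisions = pd.DataFrame({
    'timestamp': ['2021-11-20 23:50:00', '2021-11-21 08:15:00',
                  '2021-11-21 17:40:00', '2021-11-21 09:05:00'],
    'borough': ['MANHATTAN', 'MANHATTAN', 'MANHATTAN', 'BROOKLYN'],
})

# Parse the timestamps so that rows can be compared by calendar date
collisions['timestamp'] = pd.to_datetime(collisions['timestamp'])

# Keep only the Manhattan rows which fall on the selected date
on_date = collisions['timestamp'].dt.normalize() == pd.Timestamp('2021-11-21')
in_manhattan = collisions['borough'] == 'MANHATTAN'
manhattan_count = int((on_date & in_manhattan).sum())

print('Collisions recorded in Manhattan on 2021-11-21:', manhattan_count)
```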

While the recorded count (27) is much lower than the predicted value (87.57), COVID is still very much a factor in people's daily lives and the recorded figure may be affected by this.
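
To put the two test results side by side, the gap between the predicted and recorded values can be computed directly.  The sketch below only uses the figures quoted in this notebook (the predictions will vary between runs); the helper name `compare` is hypothetical.

```python
def compare(predicted, recorded):
    """Return the signed error and the error relative to the recorded value."""
    error = predicted - recorded
    return error, error / recorded

# Figures quoted in the text: (predicted, recorded)
results = {
    '05-05-2017': (110.7177, 172),
    '21-11-2021': (87.57121, 27),
}

for day, (predicted, recorded) in results.items():
    error, relative = compare(predicted, recorded)
    print('{0}: predicted {1:.1f}, recorded {2}, error {3:+.1f} ({4:+.0%})'.format(
        day, predicted, recorded, error, relative))
```

A negative error means the model under-predicted the recorded count, and a positive error means it over-predicted.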