# Training a linear regressor

### Background

The dataset created for part 1 was constructed so that each borough can be identified, which means a linear regressor could be built for each individual borough.  Because every borough shares the same data structure, code that works for one will work for all.  To simplify this notebook, however, only one borough is used: Manhattan.

To begin, the data must be loaded.  Once loaded, the `query` method of the pandas DataFrame can be used to extract the rows for Manhattan.

In [1]:
# import pandas
import pandas as pd

# load the data as per normal
# df = pd.read_csv('https://raw.githubusercontent.com/adamsjoe/Data_Analytics_On_The_Web/main/Final_Data/Final_Data_Collated.csv', index_col=0)
df = pd.read_csv('https://raw.githubusercontent.com/adamsjoe/Data_Analytics_On_The_Web/main/Final_Data/Final_Data_Collated.csv')

# now setup to query the dataframe for only the borough noted
borough_df = df.query('BOROUGH=="MANHATTAN"')

# remove 2012
remove_2012 = borough_df.query('YEAR!=2012')

# remove 2020
remove_2020 = remove_2012.query('YEAR!=2020')

#remove 2021
remove_2021 = remove_2020.query('YEAR!=2021')

manhat_df = remove_2021

year_2013 = manhat_df.query("YEAR==2013")
# year_2014 = manhat_df.query("YEAR==2014")
# year_2015 = manhat_df.query("YEAR==2015")
# year_2016 = manhat_df.query("YEAR==2016")
# year_2017 = manhat_df.query("YEAR==2017")
# year_2018 = manhat_df.query("YEAR==2018")
# year_2019 = manhat_df.query("YEAR==2019")
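The borough filter and the three year exclusions can also be written as a single boolean mask rather than a chain of `query` calls. The sketch below uses a tiny made-up frame (the `toy` name and its values are illustrative assumptions, not rows from the real CSV) that has the same `BOROUGH` and `YEAR` columns:

```python
import pandas as pd

# Toy frame standing in for the real dataset (values invented for illustration)
toy = pd.DataFrame({
    "BOROUGH": ["MANHATTAN", "BRONX", "MANHATTAN", "MANHATTAN", "QUEENS", "MANHATTAN"],
    "YEAR": [2012, 2013, 2013, 2020, 2016, 2019],
    "NUM_COLS": [90, 40, 120, 30, 75, 110],
})

# The years excluded in the cell above
EXCLUDED_YEARS = [2012, 2020, 2021]

# Keep Manhattan rows whose year is not in the excluded list
mask = (toy["BOROUGH"] == "MANHATTAN") & ~toy["YEAR"].isin(EXCLUDED_YEARS)
toy_manhattan = toy[mask]

print(toy_manhattan)
```

Keeping the excluded years in one list makes it easier to see, and later change, which years are dropped.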


#### Data Verification

Once the data has been loaded, it is a good idea to check it.  Printing the first 6 rows gives a quick check that the borough filter has been applied.

In [2]:
# check the data had loaded by printing it
print(manhat_df[:6])


          DATE    BOROUGH  WEEKDAY  ...  PERS_KILL  PERS_INJD  NUM_COLS
2   01-01-2017  MANHATTAN        7  ...          0         24        84
8   01-02-2017  MANHATTAN        3  ...          0         23       135
13  01-03-2017  MANHATTAN        3  ...          0         20       120
17  01-04-2017  MANHATTAN        6  ...          0         14       109
22  01-05-2017  MANHATTAN        1  ...          0         28       119
26  01-06-2017  MANHATTAN        4  ...          0         31       141

[6 rows x 28 columns]
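Printing a few rows is only a spot check. A fuller check can assert the expected properties over every row. The following is a minimal sketch on an invented frame (`toy` and its values are assumptions); the same assertions could be pointed at `manhat_df`:

```python
import pandas as pd

# Toy stand-in for the filtered Manhattan frame (invented values)
toy = pd.DataFrame({
    "BOROUGH": ["MANHATTAN"] * 4,
    "YEAR": [2013, 2015, 2017, 2019],
    "NUM_COLS": [120, 98, 84, 110],
})

# Every row should belong to the one borough
assert toy["BOROUGH"].nunique() == 1
assert toy["BOROUGH"].iloc[0] == "MANHATTAN"

# None of the removed years should remain
assert not toy["YEAR"].isin([2012, 2020, 2021]).any()

# The target column should have no missing values
assert toy["NUM_COLS"].notna().all()

print("all checks passed for", len(toy), "rows")
```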


## Import numpy and shuffle

Create a shuffled copy of the data by indexing the Manhattan DataFrame with NumPy's random permutation function, using the length of the original DataFrame.  This is important for this dataset because the data is a time series (or derived from one), so there could be patterns in the row order that would otherwise carry over into the train/test split.

In [3]:
import numpy as np
shuffle_manhatten = manhat_df.iloc[np.random.permutation(len(manhat_df))]
print(shuffle_manhatten[:5])

# setup constant for use later
SCALE_NUM_COLS = 1.0

             DATE    BOROUGH  WEEKDAY  ...  PERS_KILL  PERS_INJD  NUM_COLS
8621   2016-07-24  MANHATTAN        7  ...          0         17        96
3625   2013-10-29  MANHATTAN        2  ...          0         22       128
16593  25-12-2017  MANHATTAN        1  ...          0         11        45
1018   17-12-2017  MANHATTAN        7  ...          0         14       100
2973   2013-06-20  MANHATTAN        4  ...          0         19       123

[5 rows x 28 columns]
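Because the permutation is random, each run of the notebook produces a different row order and therefore a different train/test split. If repeatable results are wanted, one option is a seeded generator. This is a sketch on an invented frame; the seed value 42 is arbitrary and the names are illustrative:

```python
import numpy as np
import pandas as pd

# Toy frame (invented values)
toy = pd.DataFrame({"DAY": [1, 2, 3, 4, 5], "NUM_COLS": [84, 135, 120, 109, 119]})

# A seeded generator yields the same permutation on every run
rng = np.random.default_rng(42)
shuffled_a = toy.iloc[rng.permutation(len(toy))]

# Re-creating the generator with the same seed reproduces the order
rng = np.random.default_rng(42)
shuffled_b = toy.iloc[rng.permutation(len(toy))]

print(shuffled_a.index.tolist() == shuffled_b.index.tolist())
```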


## Predictors, Training set and Testing set

First, the predictors are created.  They are taken from the shuffled data using the columns at positions 3, 4 and 5 (counting from zero), which are YEAR, MONTH and DAY.  TEMP (position 7) is added as a further predictor in a later section.  If the data had not been filtered by borough at the start, the borough would need to be included as well.

The target is then defined: it is the last column of the shuffled dataset, NUM_COLS (the number of collisions).

The training set will be 80% of the full dataset (0.8) and the testing set will be the remainder, in this case 100 - 80 = 20%.

Constants are set up for the number of predictors (3 in this case: year, month and day) and the number of outputs (or targets, of which there is 1 here).

In [4]:
# select the year, month and day columns (positions 3, 4 and 5)
predictors_manhatten = shuffle_manhatten.iloc[:,[3,4,5]]
print(predictors_manhatten[:6])

# We want the last column (the NUM_COLS)
targets_manhattan = shuffle_manhatten.iloc[:,-1]
print(targets_manhattan[:6])

# split data into training set
training_size_manhattan = int(len(shuffle_manhatten['NUM_COLS']) * 0.8)

# test size is the size of the data - the training size (in this case 20%)
testing_size_manhattan = len(shuffle_manhatten['NUM_COLS']) - training_size_manhattan

# define the number of input params, day, month and year = 3 (predictors)
NO_PREDICTORS_MANHATTAN = 3

# define the number of output params, collisions = 1 (targets)
NO_TARGETS = 1


       YEAR  MONTH  DAY
8621   2016      7   24
3625   2013     10   29
16593  2017     12   25
1018   2017     12   17
2973   2013      6   20
10176  2018      5   31
8621      96
3625     128
16593     45
1018     100
2973     123
10176    129
Name: NUM_COLS, dtype: int64
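Selecting columns by position works, but it silently picks the wrong columns if the CSV layout changes. A sketch of the same predictor/target selection and 80/20 split using column names is shown below. The frame is invented and the variable names are illustrative assumptions:

```python
import pandas as pd

# Toy frame with the same column names as the notebook (invented values)
toy = pd.DataFrame({
    "YEAR":     [2013, 2014, 2015, 2016, 2017, 2018, 2019, 2013, 2014, 2015],
    "MONTH":    [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "DAY":      [5, 10, 15, 20, 25, 1, 6, 11, 16, 21],
    "NUM_COLS": [84, 135, 120, 109, 119, 141, 96, 128, 45, 100],
})

PREDICTOR_COLS = ["YEAR", "MONTH", "DAY"]
TARGET_COL = "NUM_COLS"

predictors = toy[PREDICTOR_COLS]
targets = toy[TARGET_COL]

# 80% of the rows for training, the remainder for testing
train_size = int(len(toy) * 0.8)
test_size = len(toy) - train_size

x_train, x_test = predictors[:train_size], predictors[train_size:]
y_train, y_test = targets[:train_size], targets[train_size:]

# The predictor count can be read from the data instead of hard-coded
no_predictors = predictors.shape[1]

print(train_size, test_size, no_predictors)
```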


### Verification

Just a simple print to ensure the predictor and values are populated. 

In [None]:
print(predictors_manhatten.values)
print(predictors_manhatten)

## The Tensorflow bit..

In [6]:
%tensorflow_version 1.x
import tensorflow as tf

# check tensor version
print(tf.__version__)

import shutil

# logging for tensorflow
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.INFO)

# setup some variables to hold the file path
manhattan_dir = '/tmp/linear_regression_trained_model'

# remove the last training model
shutil.rmtree(manhattan_dir, ignore_errors=True)

# estimators for each borough
estimator_manhattan = tf.contrib.learn.SKCompat(tf.contrib.learn.LinearRegressor(
    model_dir=manhattan_dir, 
    optimizer=tf.train.AdamOptimizer(learning_rate=0.1), 
    enable_centered_bias=False, 
    feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(predictors_manhatten.values)
    )
)

# # Prints a log to show model is starting to train
print("// Starting to train Manhattan model............\n");

# Train the model. Pass in predictor values and target values.
estimator_manhattan.fit(
    predictors_manhatten[:training_size_manhattan].values, 
    targets_manhattan[:training_size_manhattan].values.reshape(training_size_manhattan,NO_TARGETS)/SCALE_NUM_COLS, steps=10000
)

# Next, we can check our predictions based on our predictors.
preds = estimator_manhattan.predict(x=predictors_manhatten[training_size_manhattan:].values)

# Apply the Scale value (not really needed here) to the outputs.
predslistscale = preds['scores'] * SCALE_NUM_COLS

# pred = format(str(predslistscale)) # useful for checking outputs and printing.

# Calculate RMSE i.e. how good the model works using the predictions and targets.
# rmse = np.sqrt(np.mean((shuffle_manhatten['NUM_COLS'][training_size_manhattan:] - predslistscale)**2))
rmse = np.sqrt(np.mean((targets_manhattan[training_size_manhattan:].values - predslistscale) ** 2))
print('\n\n// Linear Regression has RMSE of {0}'.format(rmse));

# Calculate the mean of the NUM_COLS values in the training set.
avg = np.mean(shuffle_manhatten['NUM_COLS'][:training_size_manhattan])

# Calculate the baseline RMSE: test-set NUM_COLS values against the training-set mean.
rmse = np.sqrt(np.mean((shuffle_manhatten['NUM_COLS'][training_size_manhattan:] - avg) ** 2))
print('\n\n// Just using average = {0} has RMSE of {1}'.format(avg, rmse));

1.15.2
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

Instructions for updating:
Please specify feature columns explicitly.
Instructions for updating:
Please use tensorflow/transform or tf.data.
Instructions for updating:
Please access pandas data directly.
Instructions for updating:
Please use tensorflow/transform or tf.data.
Instructions for updating:
Please convert numpy dtypes explicitly.
Instructions for updating:
Please specify feature columns explicitly.
Instructions for updating:
Please switch to tf.contrib.estimator.*_head.
Instructions for updating:
Please replace uses of any Estimator from tf.contrib.learn with an Estimator from tf.estimator.*
Instructions for upd
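As the warnings above note, `tf.contrib` is not part of TensorFlow 2.0. As a cross-check that does not depend on it, an ordinary least-squares fit can be computed directly with NumPy. This is a swapped-in technique, not the estimator used in this notebook: it solves for the weights in closed form rather than iterating with Adam. The data is synthetic and every name is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic predictors in the same layout: YEAR, MONTH, DAY (invented)
n = 200
X = np.column_stack([
    rng.integers(2013, 2020, size=n),
    rng.integers(1, 13, size=n),
    rng.integers(1, 29, size=n),
]).astype(float)

# Synthetic target: a known linear rule plus noise
true_w = np.array([10.0, 2.0, 0.5])
true_b = -20000.0
y = X @ true_w + true_b + rng.normal(0.0, 5.0, size=n)

# Append a column of ones so an intercept is learned too
X1 = np.column_stack([X, np.ones(n)])

# 80/20 split, as in the notebook
train = int(n * 0.8)
w, *_ = np.linalg.lstsq(X1[:train], y[:train], rcond=None)

# RMSE of the fitted model on the held-out rows
preds = X1[train:] @ w
rmse = np.sqrt(np.mean((y[train:] - preds) ** 2))

# RMSE of always predicting the training mean
baseline = np.sqrt(np.mean((y[train:] - y[:train].mean()) ** 2))

print("model RMSE:", rmse, "baseline RMSE:", baseline)
```

On this synthetic data the fit beats the mean baseline because the target really is linear in the predictors. The real collision counts need not behave this way.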

## Initial Output

The RMSE is quite large, which indicates that the model is not very accurate.  However, this model is only trying to predict from the date, without any extra conditions.  Temperature could be added, and this may affect the output.
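For reference, the RMSE is the square root of the mean of the squared differences between actual and predicted values, and it is only meaningful relative to a baseline such as always predicting the mean. A small worked sketch follows, with invented numbers chosen so the arithmetic is easy to follow (the `rmse` helper is illustrative, not part of the notebook):

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# Invented values
actual = [100, 120, 80, 100]

# A "model" that is off by 10 on every row
model_preds = [110, 110, 90, 90]

# A baseline that always predicts the mean of these four values (100)
mean_baseline = [float(np.mean(actual))] * len(actual)

print(rmse(actual, model_preds))    # 10.0
print(rmse(actual, mean_baseline))  # sqrt((0 + 400 + 400 + 0) / 4) = sqrt(200), about 14.14
```

A model is only useful if its RMSE is clearly below that of the baseline.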

The model needs to be tested first.  To do this, some values will be fed in as inputs for the model to predict.

## Initial Prediction

```python
input = pd.DataFrame.from_dict(data = 
				{'YEAR' : [2013,2016,2022],
         'MONTH' : [1, 6, 12],
         'DAY' : [1, 1, 1]
        })

```

Three values have been chosen for prediction: 1/1/2013, 1/6/2016 and 1/12/2022 will be fed into the model.

In [7]:
input = pd.DataFrame.from_dict(data = 
				{'YEAR' : [2013,2016,2022],
         'MONTH' : [1, 6, 12],
         'DAY' : [1, 1, 1]
        })

# create a copy of 2013
cpy = year_2013

# # now drop all the fields we don't need
trmmed = cpy.drop(columns=["DATE","BOROUGH","WEEKDAY","COLLISION_DATE","TEMP","DEWP","SLP","VISIB","WDSP","MXPSD","GUST","MAX","MIN","PRCP","SNDP","FOG","CYC_KILL","CYC_INJD","MOTO_KILL","MOTO_INJD","PEDS_KILL","PEDS_INJD","PERS_KILL","PERS_INJD","NUM_COLS"])

# print(trmmed)

estimator_manhattan = tf.contrib.learn.SKCompat(tf.contrib.learn.LinearRegressor(model_dir=manhattan_dir, 
                                                                                 enable_centered_bias=False, 
                                                                                 feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(input.values)
                                                                                 )
)

preds = estimator_manhattan.predict(x=input.values)
# print(preds)

# The Number of collisions scale will be 106.072969 - which is the average number of collisions in Manhattan (359057 (collisions)/ 3385 (days))
predslistnorm = preds['scores']
predslistscale = preds['scores'] * 106.072969
prednorm = format(str(predslistnorm))
pred = format(str(predslistscale))

print(prednorm)
print(pred)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fc5ce720550>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/linear_regression_trained_model', '_session_creation_timeout_secs': 7200}
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/linear_regression_trained_model/model.ck
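One thing worth checking in the cell above: training divides the targets by `SCALE_NUM_COLS` (set to 1.0 earlier), whereas the prediction step multiplies by 106.072969. A scale factor only round-trips if the same value is used on both sides. The sketch below illustrates this with a NumPy least-squares fit on invented data. It is not the notebook's estimator, and every name in it is hypothetical.

```python
import numpy as np

# Scale factor quoted in the comment above: collisions / days
scale = 359057 / 3385
print(round(scale, 6))  # 106.072969

# Invented data: one predictor and an exactly linear target
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 20.0 * x + 50.0
X1 = np.column_stack([x, np.ones_like(x)])

def fit_predict(targets):
    """Fit least squares on the given targets and predict the same rows."""
    w, *_ = np.linalg.lstsq(X1, targets, rcond=None)
    return X1 @ w

# Consistent: divide by the scale for training, multiply by it afterwards
consistent = fit_predict(y / scale) * scale

# Inconsistent: train on unscaled targets (scale 1.0), then multiply anyway
inconsistent = fit_predict(y / 1.0) * scale

print(np.allclose(consistent, y))            # predictions match the targets
print(np.allclose(inconsistent, y * scale))  # predictions are inflated by the scale
```

If the model was trained with a scale of 1.0, the unscaled `scores` are already in units of collisions.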

### Result of First Run

The model is being asked to predict 3 dates.  Two of these dates are in the dataset, so the number of collisions is known and they can be used as "control" dates.  Predictions for these should be fairly accurate (or at least, that is the theory...)

The dates are:

Date|Prediction|Actual
:--:|:--------:|:----:
1/1/2013|130|78
1/6/2016|136|121
1/12/2022|144|N/A

The trend is generally upwards, which does **sort of** look OK.  As noted above, it would be a good idea to add an additional predictor.  The TEMP value should suffice.

**Note** these values may change if the notebook is run at a later stage.

## Adding in Temp

The code cells are copied from above; this time the predictors are amended to include temperature.

**Note:** For brevity, the additional prints are not included and all the code is placed in one cell.

In [8]:
# select the year, month, day and temp columns (positions 3, 4, 5 and 7)
predictors_manhatten = shuffle_manhatten.iloc[:,[3,4,5,7]]
print(predictors_manhatten[:6])

# We want the last column (the NUM_COLS)
targets_manhattan = shuffle_manhatten.iloc[:,-1]
print(targets_manhattan[:6])

# split data into training set
training_size_manhattan = int(len(shuffle_manhatten['NUM_COLS']) * 0.8)

# test size is the size of the data - the training size (in this case 20%)
testing_size_manhattan = len(shuffle_manhatten['NUM_COLS']) - training_size_manhattan

# define the number of input params, day, month, year and now temp = 4 (predictors)
NO_PREDICTORS_MANHATTAN = 4

# define the number of output params, collisions = 1 (targets)
NO_TARGETS = 1

%tensorflow_version 1.x
import tensorflow as tf

# check tensor version
print(tf.__version__)

import shutil

# logging for tensorflow
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.INFO)

# setup some variables to hold the file path
manhattan_dir = '/tmp/linear_regression_trained_model_including_temp'

# remove the last training model
shutil.rmtree(manhattan_dir, ignore_errors=True)

# estimators for each borough
estimator_manhattan = tf.contrib.learn.SKCompat(tf.contrib.learn.LinearRegressor(
    model_dir=manhattan_dir, 
    optimizer=tf.train.AdamOptimizer(learning_rate=0.1), 
    enable_centered_bias=False, 
    feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(predictors_manhatten.values)
    )
)

# # Prints a log to show model is starting to train
print("// Starting to train Manhattan model............\n");

# Train the model. Pass in predictor values and target values.
estimator_manhattan.fit(
    predictors_manhatten[:training_size_manhattan].values, 
    targets_manhattan[:training_size_manhattan].values.reshape(training_size_manhattan,NO_TARGETS)/SCALE_NUM_COLS, steps=10000
)

# Next, we can check our predictions based on our predictors.
preds = estimator_manhattan.predict(x=predictors_manhatten[training_size_manhattan:].values)

# Apply the Scale value (not really needed here) to the outputs.
predslistscale = preds['scores'] * SCALE_NUM_COLS

# pred = format(str(predslistscale)) # useful for checking outputs and printing.

# Calculate RMSE i.e. how good the model works using the predictions and targets.
# rmse = np.sqrt(np.mean((shuffle_manhatten['NUM_COLS'][training_size_manhattan:] - predslistscale)**2))
rmse = np.sqrt(np.mean((targets_manhattan[training_size_manhattan:].values - predslistscale) ** 2))
print('\n\n// Linear Regression has RMSE of {0}'.format(rmse));

# Calculate the mean of the NUM_COLS values in the training set.
avg = np.mean(shuffle_manhatten['NUM_COLS'][:training_size_manhattan])

# Calculate the baseline RMSE: test-set NUM_COLS values against the training-set mean.
rmse = np.sqrt(np.mean((shuffle_manhatten['NUM_COLS'][training_size_manhattan:] - avg) ** 2))
print('\n\n// Just using average = {0} has RMSE of {1}'.format(avg, rmse));

       YEAR  MONTH  DAY  TEMP
8621   2016      7   24  73.2
3625   2013     10   29  47.5
16593  2017     12   25  41.1
1018   2017     12   17  27.6
2973   2013      6   20  60.6
10176  2018      5   31  55.1
8621      96
3625     128
16593     45
1018     100
2973     123
10176    129
Name: NUM_COLS, dtype: int64
1.15.2
INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fc5cb2b9950>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': N

In [9]:
input = pd.DataFrame.from_dict(data = 
				{'YEAR' : [2013,2016,2022],
         'MONTH' : [1, 6, 12],
         'DAY' : [1, 1, 1],
         'TEMP': [35,60,30]
        })

# create a copy of 2013
cpy = year_2013

# # now drop all the fields we don't need
trmmed = cpy.drop(columns=["DATE","BOROUGH","WEEKDAY","COLLISION_DATE","TEMP","DEWP","SLP","VISIB","WDSP","MXPSD","GUST","MAX","MIN","PRCP","SNDP","FOG","CYC_KILL","CYC_INJD","MOTO_KILL","MOTO_INJD","PEDS_KILL","PEDS_INJD","PERS_KILL","PERS_INJD","NUM_COLS"])

# print(trmmed)

estimator_manhattan = tf.contrib.learn.SKCompat(tf.contrib.learn.LinearRegressor(model_dir=manhattan_dir, 
                                                                                 enable_centered_bias=False, 
                                                                                 feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(input.values)
                                                                                 )
)

preds = estimator_manhattan.predict(x=input.values)
# print(preds)

# The Number of collisions scale will be 106.072969 - which is the average number of collisions in Manhattan (359057 (collisions)/ 3385 (days))
predslistnorm = preds['scores']
predslistscale = preds['scores'] * 106.072969
prednorm = format(str(predslistnorm))
pred = format(str(predslistscale))

print(prednorm)
print(pred)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fc5cb1f6b10>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/linear_regression_trained_model_including_temp', '_session_creation_timeout_secs': 7200}
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/linear_regression_trained

### After Temp

After adding in the temp, the values have changed:


Date|Prediction (no temp)|Prediction (with temp)|Actual
:--:|:------------------:|:--------------------:|:----:
1/1/2013|122|117|78
1/6/2016|128|129|121
1/12/2022|135|121|N/A
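To put numbers on the change, the absolute error can be worked out for the two dates with a known actual. The figures are copied from the table above. With only two points this is an illustration, not evidence that temperature helps.

```python
# Rows with a known actual value, copied from the table above
rows = [
    # (date, prediction without temp, prediction with temp, actual)
    ("1/1/2013", 122, 117, 78),
    ("1/6/2016", 128, 129, 121),
]

errors = []
for date, no_temp, with_temp, actual in rows:
    err_no_temp = abs(no_temp - actual)
    err_with_temp = abs(with_temp - actual)
    errors.append((date, err_no_temp, err_with_temp))
    print(f"{date}: |error| without temp = {err_no_temp}, with temp = {err_with_temp}")

# 1/1/2013: |error| without temp = 44, with temp = 39
# 1/6/2016: |error| without temp = 7, with temp = 8
```

The error drops by 5 collisions on the first date and rises by 1 on the second, so the effect of adding temperature is mixed on this sample.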


In [None]:

# # trying to graph a full year of "real" data against a full year of "predicted" data - and it shows it is a terrible model :(
# import pandas as pd
# import matplotlib.pyplot as plt
# import matplotlib.dates as mdates
# from matplotlib.dates import DateFormatter
# import seaborn as sns
# import pandas as pd

# # Handle date time conversions between pandas and matplotlib
# from pandas.plotting import register_matplotlib_converters
# register_matplotlib_converters()

# # Use white grid plot background from seaborn
# sns.set(font_scale=1.5, style="whitegrid")

# # load the data as per normal
# model_df = pd.read_csv('https://raw.githubusercontent.com/adamsjoe/Data_Analytics_On_The_Web/main/Final_Data/Final_Data_Collated.csv', parse_dates=['DATE'], index_col=['DATE'])

# # now setup to query the dataframe for only the borough noted
# model_borough_df = model_df.query('BOROUGH=="MANHATTAN"')

# # remove 2012
# model_remove_2012 = model_borough_df.query('YEAR!=2012')

# # remove 2020
# model_remove_2020 = model_remove_2012.query('YEAR!=2020')

# #remove 2021
# model_remove_2021 = model_remove_2020.query('YEAR!=2021')

# model_manhat_df = model_remove_2021

# model_year_2013 = model_manhat_df.query("YEAR==2013")

# model_trimmed_2013 = model_year_2013.drop(columns=["BOROUGH","YEAR","MONTH","DAY","WEEKDAY","COLLISION_DATE","TEMP","DEWP","SLP","VISIB","WDSP","MXPSD","GUST","MAX","MIN","PRCP","SNDP","FOG","CYC_KILL","CYC_INJD","MOTO_KILL","MOTO_INJD","PEDS_KILL","PEDS_INJD","PERS_KILL","PERS_INJD"])

# predictions_list = prednorm.split()
# # predictions_list = pred.split()

# model_pred_data = pd.DataFrame(columns=['DATE', 'NUM_COLS'])

# for x in range(len(trmmed)):
#   valYear = trmmed['YEAR'].values[x]
#   valMonth = trmmed['MONTH'].values[x]   
#   valDay = trmmed['DAY'].values[x]
#   if valMonth < 10:
#     valMonth = '0%s' % (valMonth)
#   if valDay < 10:
#     valDay = '0%s' % (valDay)
#   date_string = '%s-%s-%s' % (valYear, valMonth, valDay)  
#   # print(f"Row {x} - Year will be {valYear}, Month will be {valMonth}, Day will be {valDay}")

#   temp_collisions = predictions_list[x]
#   valCols = temp_collisions.replace('[', '')
#   temp_cols = float(valCols)
  
#   model_pred_data = model_pred_data.append({'DATE': date_string, 'NUM_COLS': int(temp_cols)},ignore_index=True)

# # adding an index
# date_time_index = pd.to_datetime(model_pred_data['DATE'])
# datetime_index = pd.DatetimeIndex(date_time_index.values)
# new_things = model_pred_data.set_index(datetime_index)

# # converting the num_cols to ints as this wasn't explicit enough above.
# new_things["NUM_COLS"] = new_things["NUM_COLS"].astype(str).astype(int)
# print("model post set")
# print(new_things)
# # print(new_things.info())
# # print(new_things.index.dtype)

# print("\n\nother one")
# print(model_trimmed_2013)
# # print(model_trimmed_2013.info())
# # print(model_trimmed_2013.index.dtype)

# # Create figure and plot space
# fig, ax = plt.subplots(figsize=(24, 24))
# ax = plt.axes()

# ax.plot(model_trimmed_2013['NUM_COLS'], label="Accidents from the data")
# ax.plot(new_things['NUM_COLS'], label="Accidents from model")
# ax.legend()
# ax.set(xlabel="Date",
#        ylabel="Number Collisions",
#        title="2013 Accidents")


now maybe add in a second model - this one with 4 predictors (the temp being the other one)