# Training a linear regressor

### Background

The dataset created for part 1 was constructed so that each borough can be identified.  This means that a linear regressor could be built for each individual borough.  Because the code uses a dataset with the same structure for every borough, what works for one should work for all.  However, to simplify this notebook, only one borough is used: Manhattan.

To begin with, the data must be loaded.  Once it is loaded, the query method of the pandas DataFrame extracts the rows that are in Manhattan.

In [46]:
# import pandas
import pandas as pd

# load the data as per normal
# df = pd.read_csv('https://raw.githubusercontent.com/adamsjoe/Data_Analytics_On_The_Web/main/Final_Data/Final_Data_Collated.csv', index_col=0)
df = pd.read_csv('https://raw.githubusercontent.com/adamsjoe/Data_Analytics_On_The_Web/main/Final_Data/Final_Data_Collated.csv')

# now query the dataframe for only the Manhattan rows
# note: .head() keeps just the first five matching rows; remove it to use every Manhattan row
manhat_df = df.query('BOROUGH=="MANHATTAN"').head()
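
The filtering step can be tried in isolation.  The following is a minimal sketch on a made-up frame (the name `toy_df` and its values are illustrative and not taken from the collated data).  It shows that the query method and an ordinary boolean mask select the same rows.

```python
import pandas as pd

# Toy frame standing in for the collated data; the values are made up.
toy_df = pd.DataFrame({
    "BOROUGH": ["MANHATTAN", "BRONX", "MANHATTAN", "QUEENS"],
    "NUM_COLS": [109, 61, 122, 88],
})

# query() takes the filter as a string expression ...
by_query = toy_df.query('BOROUGH == "MANHATTAN"')

# ... and a boolean mask selects the same rows.
by_mask = toy_df[toy_df["BOROUGH"] == "MANHATTAN"]

print(by_query)
```

Both forms keep the original row labels, which is why the index of the filtered frame has gaps.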



#### Data Verification

Once the data has been loaded, it is a good idea to check it.  Printing the first few rows shows that every row is for Manhattan.

In [52]:
# check the data has loaded by printing it
print(manhat_df[:6])


          DATE    BOROUGH  WEEKDAY  ...  PERS_KILL  PERS_INJD  NUM_COLS
0   2012-08-21  MANHATTAN        2  ...          0         28       109
6   2012-10-27  MANHATTAN        6  ...          0         20       122
13  2012-08-25  MANHATTAN        6  ...          0         22        97
16  2012-09-09  MANHATTAN        7  ...          0         29       109
20  2012-09-17  MANHATTAN        1  ...          1         27       123

[5 rows x 28 columns]


## Import numpy and shuffle

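Create a new shuffled data frame by indexing the original Manhattan data frame with a random permutation of its row positions, generated by the random permutation function of numpy from the length of the frame.  For this data set the shuffle is important because the data is time series (or derived from time series), so neighbouring rows could share patterns that would otherwise carry over into the training and testing split.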

In [51]:
import numpy as np

shuffle_manhatten = manhat_df.iloc[np.random.permutation(len(manhat_df))]
shuffle_manhatten[:5]

# setup constant for use later
SCALE_NUM_COLS = 1
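
The shuffle above gives a different order on every run.  As a sketch only, the same idea can be made repeatable with a seeded generator; the seed value and the `toy_df` frame below are assumptions for illustration and are not part of the notebook.

```python
import numpy as np
import pandas as pd

# Made-up frame: NUM_COLS is always DAY * 10 so row integrity is easy to check.
toy_df = pd.DataFrame({"DAY": [1, 2, 3, 4, 5], "NUM_COLS": [10, 20, 30, 40, 50]})

# A seeded generator makes the shuffle repeatable between runs.
rng = np.random.default_rng(seed=42)  # the seed value is arbitrary
order = rng.permutation(len(toy_df))

# iloc reorders whole rows by position, so each DAY stays with its NUM_COLS.
shuffled = toy_df.iloc[order]

# Re-running with the same seed gives the same order.
again = toy_df.iloc[np.random.default_rng(seed=42).permutation(len(toy_df))]

print(shuffled)
```

Only the row order changes: every original row appears exactly once, and the values within a row stay together.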

## Predictors, Training set and Testing set

First the predictors are created from the shuffled data.  Only the columns at zero-based positions 3, 4, 5, 7 and -1 (the last column) are used.  These are, in order, year, month, day, temp and num_cols (number of collisions).  Note that num_cols is also the target, so as written the predictors contain the value the model is asked to predict.  If the data were not filtered by borough at the start, the borough would need to be brought in as well.

The target is then defined: it is the last column in the shuffled dataset.

The training set is 80% of the full data set (0.8) and the testing set is the remainder, so in this case 100 - 80 = 20%.

Constants are set up for the number of predictors (set to 4 in the code) and the number of outputs (or targets, which is 1 here).

In [49]:
# select the year, month, day, temp and number of collisions columns.
predictors_manhatten = shuffle_manhatten.iloc[:,[3,4,5,7,-1]]

targets_manhattan = shuffle_manhatten.iloc[:,-1]

# split data into training set
training_size_manhattan = int(len(shuffle_manhatten['NUM_COLS']) * 0.8)

# test size is the size of the data - the training size (in this case 20%)
testing_size_manhattan = len(shuffle_manhatten['NUM_COLS']) - training_size_manhattan

# define the number of input params (predictors)
NO_PREDICTORS_MANHATTAN = 4

# define the number of output params, collisions = 1 (targets)
NO_TARGETS = 1
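
The 80/20 arithmetic can be checked on a small made-up frame.  This is a sketch under stated assumptions: the `toy_df` values are invented, and, unlike the cell above, the target column is left out of the predictors so that the target is not also an input.  That choice is an assumption about intent, not the notebook's own code.

```python
import pandas as pd

# Ten made-up rows with the same column names as the notebook's predictors.
toy_df = pd.DataFrame({
    "YEAR": [2012] * 10,
    "MONTH": [8, 8, 9, 9, 9, 10, 10, 10, 11, 11],
    "DAY": [21, 25, 9, 17, 30, 1, 14, 27, 3, 19],
    "TEMP": [72.8, 75.7, 68.4, 66.5, 64.0, 63.2, 60.1, 61.9, 50.3, 48.8],
    "NUM_COLS": [109, 97, 109, 123, 115, 101, 118, 122, 96, 104],
})

# Assumption: every column except the last is a predictor.
predictors = toy_df.iloc[:, :-1]
targets = toy_df.iloc[:, -1]

# 80% for training; the remainder for testing.
training_size = int(len(toy_df) * 0.8)
testing_size = len(toy_df) - training_size

train_x, test_x = predictors[:training_size], predictors[training_size:]
train_y, test_y = targets[:training_size], targets[training_size:]

print(training_size, testing_size)
```

Because `int()` truncates, the training size rounds down when the row count is not a multiple of five, and the testing set takes whatever is left.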


### Verification

Just a simple print to ensure the predictor and values are populated. 

In [50]:
print(predictors_manhatten.values)
print(predictors_manhatten)

[[2012.     9.    17.    66.5  123. ]
 [2012.     8.    25.    75.7   97. ]
 [2012.     8.    21.    72.8  109. ]
 [2012.    10.    27.    61.9  122. ]
 [2012.     9.     9.    68.4  109. ]]
    YEAR  MONTH  DAY  TEMP  NUM_COLS
20  2012      9   17  66.5       123
13  2012      8   25  75.7        97
0   2012      8   21  72.8       109
6   2012     10   27  61.9       122
16  2012      9    9  68.4       109


## The TensorFlow bit

In [None]:
%tensorflow_version 1.x
import tensorflow as tf

# check tensor version
print(tf.__version__)

import shutil

# logging for tensorflow
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.INFO)

# setup some variables to hold the file path
manhattan_dir = '/tmp/linear_regression_trained_model'

# remove the last training model
shutil.rmtree(manhattan_dir, ignore_errors=True)

# estimators for each borough
estimator_manhattan = tf.contrib.learn.SKCompat(tf.contrib.learn.DNNRegressor(
    model_dir=manhattan_dir, 
    hidden_units=[20,18,14], 
    optimizer=tf.train.AdamOptimizer(learning_rate=0.01), 
    enable_centered_bias=False, 
    feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(predictors_manhatten.values)
    )
)

# Prints a log to show model is starting to train
print("// Starting to train Manhattan model............")

# Train the model. Pass in predictor values and target values.
estimator_manhattan.fit(
    predictors_manhatten[:training_size_manhattan].values,
    targets_manhattan[:training_size_manhattan].values.reshape(training_size_manhattan, NO_TARGETS) / SCALE_NUM_COLS,
    steps=10000
)

# Next, we can check our predictions based on our predictors.
preds = estimator_manhattan.predict(x=predictors_manhatten[training_size_manhattan:].values)

# Apply the Scale value (not really needed here) to the outputs.
predslistscale = preds['scores'] * SCALE_NUM_COLS

# pred = format(str(predslistscale)) # useful for checking outputs and printing.

# Calculate RMSE i.e. how good the model works using the predictions and targets.
# i.e. take the difference between the actual and the forecast then square the difference, 
# find the average of all the squares and then find the square root. 
# The RMSE essentially punishes larger errors i.e. it puts a heavier weight on larger errors.
rmse = np.sqrt(np.mean((targets_manhattan[training_size_manhattan:].values - predslistscale)**2))
print('// DNNRegression has RMSE of {0}'.format(rmse));


# Calculate the mean of the NUM_COLS values in the training portion.
avg = np.mean(shuffle_manhatten['NUM_COLS'][:training_size_manhattan])

# Calculate the RMSE of predicting that training mean for every row in the testing portion.
# The fit of a proposed regression model should therefore be better than the fit of the mean model.
# In this case, it doesn't seem to be the case but it will vary on every run.
rmse = np.sqrt(np.mean((shuffle_manhatten['NUM_COLS'][training_size_manhattan:] - avg)**2))
print('// Just using average = {0} has RMSE of {1}'.format(avg, rmse))

1.15.2
INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f92a92d1e10>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/linear_regression_trained_model', '_session_creation_timeout_secs': 7200}
// Starting to train Manhattan model............
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph wa
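
The RMSE comparison described in the comments above can be sketched without TensorFlow.  The helper name `rmse` and all numbers below are made up for illustration; they are not results from the model.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    # Flatten both inputs so shapes (n,) and (n, 1) cannot broadcast to (n, n).
    actual = np.asarray(actual, dtype=float).ravel()
    predicted = np.asarray(predicted, dtype=float).ravel()
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# Made-up test targets and predictions.
actual = [100, 110, 120]
model_preds = [98, 112, 121]

# Made-up training mean, used as a constant "mean model" baseline.
train_mean = 105.0

model_rmse = rmse(actual, model_preds)
baseline_rmse = rmse(actual, np.full(len(actual), train_mean))

print(model_rmse, baseline_rmse)
```

A model is only worth keeping if its RMSE is lower than the baseline's.  Squaring the differences means one large error costs more than several small ones.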

In [None]:
input = pd.DataFrame.from_dict(data={
    'DAY': [1, 1, 1],
    'MONTH': [1, 6, 12],
    'YEAR': [2013, 2016, 2022]
})

estimator_manhattan = tf.contrib.learn.SKCompat(tf.contrib.learn.LinearRegressor(model_dir='/tmp/linear_regression_trained_model', enable_centered_bias=False, feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(input.values)))

preds = estimator_manhattan.predict(x=input.values)
# Assume number of trips scale value is 600000 when at a maximum, based on the analysis from Tutorial 2
predslistnorm = preds['scores']
predslistscale = preds['scores']*600000
prednorm = format(str(predslistnorm))
pred = format(str(predslistscale))
print(prednorm)
print(pred)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f92ab922f10>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/linear_regression_trained_model', '_session_creation_timeout_secs': 7200}
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/linear_regression_trained_model/model.ck

NotFoundError: ignored