<a href="https://colab.research.google.com/github/frizuma3/21013159_DataAnalytics/blob/main/LinearRegressionTest.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this tutorial we will train and test a linear regressor. You have already done this in part 3 of the course document, so I will skip the extra notes and get straight to business.

In [22]:
# needed to create the data frame
import pandas as pd

# needed to help with speedy maths based calculations
import numpy as np

# create data frame from csv file we hosted on our github
#df = pd.read_csv('https://raw.githubusercontent.com/1122131uhi/dataAnalytics/master/tutorial2lineardata.csv', index_col=0, )
df = pd.read_csv('https://raw.githubusercontent.com/frizuma3/21013159_DataAnalytics/refs/heads/main/linearregressiondata.csv', index_col=0, )

In [23]:
# make sure we have our data by printing it out
print(df[:6])
# print(df) #all

   day  temp  dewp  NUM_COLLISIONS
1    1  83.6  63.0        0.520270
2    3  80.3  54.1        0.578829
3    4  79.8  56.7        0.804054
4    5  81.8  65.6        0.281532
5    6  86.7  64.3        0.639640
6    7  81.9  62.3        0.745495


In [24]:
# A scale is not required here, but the constant will be useful in the assignment.
SCALE_NUM_TRIPS = 1.0

In [25]:
import tensorflow as tf

from tensorflow import keras
from tensorflow.keras import layers

print(tf.__version__)

2.17.0


# Day and NUM_COLLISIONS

In [26]:
# create a dataframe with the inputs and the output at the end using the imported dataframe. This can be replicated for any configuration; in this case, I have gone for day only.
df_input_data_day = [df["day"], df["NUM_COLLISIONS"]]
# create headers for our new dataframe. These should match the order of the columns above.
df_input_headers_day = ["day", "NUM_COLLISIONS"]
# create a final dataframe using our new dataframe and headers.
df_input_day = pd.concat(df_input_data_day, axis=1, keys=df_input_headers_day)

In [27]:
# construct a training set for running through the model and a test set. We do this by using sample with frac=0.8 for an 80% training set, and the remaining 20% becomes the test set.
training_set_day = df_input_day.sample(frac=0.8, random_state=0)
test_set_day = df_input_day.drop(training_set_day.index)
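
The split above can be checked on a small stand-in frame. This is only a sketch with made-up values (the `toy` frame and its numbers are hypothetical, not the course data): `sample(frac=0.8, random_state=0)` picks 80% of the rows reproducibly, and dropping those index labels leaves the other 20% as the test set.

```python
import pandas as pd

# Hypothetical stand-in for df_input_day: 10 rows with made-up values.
toy = pd.DataFrame({
    "day": [1, 2, 3, 4, 5, 6, 7, 1, 2, 3],
    "NUM_COLLISIONS": [0.52, 0.58, 0.80, 0.28, 0.64, 0.75, 0.61, 0.55, 0.49, 0.70],
})

# 80% of the rows, chosen at random but reproducibly because of random_state.
toy_train = toy.sample(frac=0.8, random_state=0)
# Everything that was not sampled becomes the test set.
toy_test = toy.drop(toy_train.index)

# The two sets should never share a row.
overlap = set(toy_train.index) & set(toy_test.index)
print(len(toy_train), len(toy_test), len(overlap))
```

With 10 rows this gives 8 training rows, 2 test rows and no overlap.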

In [28]:
# copy the datasets and remove the final column, i.e. the output column. We do this using pop.
training_features_day = training_set_day.copy()
test_features_day = test_set_day.copy()

training_labels_day = training_features_day.pop('NUM_COLLISIONS')
test_labels_day = test_features_day.pop('NUM_COLLISIONS')
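
As a small illustration of why the copy comes before `pop`: `pop` removes the column from the frame it is called on and returns it, so working on a copy keeps the original set intact. The `toy` frame below is made up for the example.

```python
import pandas as pd

# Made-up rows standing in for a training set.
toy = pd.DataFrame({"day": [1, 3, 4], "NUM_COLLISIONS": [0.52, 0.58, 0.80]})

features = toy.copy()                      # copy so the original frame is untouched
labels = features.pop("NUM_COLLISIONS")    # removes the column and returns it

print(list(features.columns), list(labels), list(toy.columns))
```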

In [29]:
# Here I divide the labels by a scale factor. This dataset is already normalised, so the factor is 1. However, 600000 is what would make sense for unscaled data, and we can use the same constant later when testing our model.
training_labels_day = training_labels_day/SCALE_NUM_TRIPS
test_labels_day = test_labels_day/SCALE_NUM_TRIPS
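
A rough sketch of how the scale constant would be used if the labels were raw counts: divide before training, then multiply the predictions back. The value 600000 is the one mentioned in the comment above; the raw counts and the name `SCALE` are made up for illustration.

```python
import numpy as np

SCALE = 600000.0  # hypothetical scale for raw counts, taken from the comment above

raw_labels = np.array([312000.0, 347000.0, 482000.0])  # made-up raw counts
scaled = raw_labels / SCALE        # what the model would be trained on
restored = scaled * SCALE          # what we would report after predicting

print(scaled)
print(np.allclose(restored, raw_labels))
```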

In [30]:
print(training_features_day)

      day
2478    2
928     5
868     2
913     4
1744    1
...   ...
1818    6
1239    2
1902    6
500     4
1920    3

[1884 rows x 1 columns]


In [31]:
# Boilerplate for this model. We adapt the normalisation layer to the training features so that it learns their mean and variance.
normaliser_day = tf.keras.layers.Normalization(input_shape=[1,], axis=None) # tf.keras.layers.Normalization(axis=-1)
normaliser_day.adapt(np.array(training_features_day))

  super().__init__(**kwargs)


In [32]:
# I have decided to call the model model_0. We add our normaliser and we are expecting a single output.
model_0 = tf.keras.Sequential([
    normaliser_day,
    layers.Dense(units=1)
])

In [33]:
model_0.compile(
    optimizer=tf.optimizers.Adam(learning_rate=0.1),
    loss='mean_absolute_error')
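
For reference, the mean absolute error is the average of the absolute differences between the labels and the predictions. A minimal NumPy sketch with made-up numbers:

```python
import numpy as np

y_true = np.array([0.52, 0.58, 0.80, 0.28])   # made-up labels
y_pred = np.array([0.50, 0.60, 0.70, 0.40])   # made-up predictions

# Mean absolute error: average size of the miss, ignoring its direction.
mae = np.mean(np.abs(y_true - y_pred))
print(round(float(mae), 3))
```

Here the misses are 0.02, 0.02, 0.10 and 0.12, so the MAE is 0.065.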

In [34]:
%%time
# Now we fit the model, which requires the training features and labels. We run it for 100 epochs and hold out a further 20% of the training data as a validation split.
history = model_0.fit(
    training_features_day,
    training_labels_day,
    epochs=100,
    verbose=0,
    validation_split = 0.2)

CPU times: user 13.6 s, sys: 659 ms, total: 14.2 s
Wall time: 16.3 s
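
One detail worth knowing about `validation_split`: Keras holds out the last fraction of the rows passed to `fit`, taken before any shuffling, and does not train on them. A minimal sketch of that slicing with a made-up array (the variable names are hypothetical):

```python
import numpy as np

features = np.arange(10)          # stand-in for 10 training rows, in the order passed to fit
validation_split = 0.2

# The first 80% is used for training and the last 20% for validation.
split_at = int(len(features) * (1 - validation_split))
fit_rows = features[:split_at]
validation_rows = features[split_at:]

print(fit_rows, validation_rows)
```

Because the training set here was drawn with `sample`, its row order is already random, so the held-out rows are effectively a random subset.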


In [35]:
# Now, we will evaluate our model using the test features and labels.
mean_absolute_error_model_0 = model_0.evaluate(
    test_features_day,
    test_labels_day, verbose=0)

In [36]:
# The mean absolute error of the model can be printed out. Remember, we want to minimise this. It will also vary on each training run due to randomisation.
print(mean_absolute_error_model_0)

0.16121359169483185


In [37]:
# we create a custom dataframe with one row for each day of the week.
input_model_0 = pd.DataFrame.from_dict(data={
    'day': [1, 2, 3, 4, 5, 6, 7],
})

In [38]:
# SCALE_NUM_TRIPS is 1 here; a value such as 600000 would give back realistic numbers for the TAXI_TRIPS data.
linear_predictions_model_0 = model_0.predict(input_model_0)*SCALE_NUM_TRIPS
print(linear_predictions_model_0)

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 73ms/step
[[0.5675329 ]
 [0.5978414 ]
 [0.6281498 ]
 [0.65845823]
 [0.68876666]
 [0.7190751 ]
 [0.74938357]]
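
Because `model_0` is a single `Dense` unit on one normalised input, its predictions lie on a straight line in `day`. A quick check on the values printed above: the step between consecutive days should be the same everywhere.

```python
import numpy as np

# Predictions for day 1..7, copied from the output above.
predictions = np.array([0.5675329, 0.5978414, 0.6281498, 0.65845823,
                        0.68876666, 0.7190751, 0.74938357])

# A linear model adds the same amount for every extra day.
steps = np.diff(predictions)
print(steps.round(4))
```

Every step comes out at about 0.0303, which is the slope this particular training run learned per day.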


# Day, Temp, Dew Point, NUM_COLLISIONS


In [39]:
# create a dataframe with the inputs and the output at the end using the imported dataframe. This can be replicated for any configuration; in this case, I have gone for day, temp, dewp.
df_input_data = [df["day"], df["temp"], df["dewp"], df["NUM_COLLISIONS"]]
# create headers for our new dataframe. These should match the order of the columns above.
df_input_headers = ["day", "temp", "dewp", "NUM_COLLISIONS"]
# create a final dataframe using our new dataframe and headers.
df_input = pd.concat(df_input_data, axis=1, keys=df_input_headers)

In [40]:
# construct a training set for running through the model and a test set. We do this by using sample with frac=0.8 for an 80% training set, and the remaining 20% becomes the test set.
training_set = df_input.sample(frac=0.8, random_state=0)
test_set = df_input.drop(training_set.index)

In [41]:
# copy the datasets and remove the final column, i.e. the output column. We do this using pop.
training_features = training_set.copy()
test_features = test_set.copy()

training_labels = training_features.pop('NUM_COLLISIONS')
test_labels = test_features.pop('NUM_COLLISIONS')

In [42]:
# Here I divide the labels by a scale factor. This dataset is already normalised, so the factor is 1. However, 600000 is what would make sense for unscaled data, and we can use the same constant later when testing our model.
training_labels = training_labels/SCALE_NUM_TRIPS
test_labels = test_labels/SCALE_NUM_TRIPS

In [43]:
# Boilerplate for this model. We adapt the normalisation layer to the training features so that it learns the mean and variance of each feature.
normaliser = tf.keras.layers.Normalization(axis=-1)
normaliser.adapt(np.array(training_features))

In [44]:
# I have decided to call the model, model_1. We add our normaliser and we are expecting a single output.
model_1 = tf.keras.Sequential([
    normaliser,
    layers.Dense(units=1)
])

In [45]:
# More boilerplate for creating a sequential model: we need an optimiser and a loss parameter. Here we use the mean absolute error (MAE).
model_1.compile(
    optimizer=tf.optimizers.Adam(learning_rate=0.1),
    loss='mean_absolute_error')

In [46]:
%%time
# Now we fit the model, which requires the training features and labels. We run it for 100 epochs and hold out a further 20% of the training data as a validation split.
history = model_1.fit(
    training_features,
    training_labels,
    epochs=100,
    verbose=0,
    validation_split = 0.2)

CPU times: user 13.1 s, sys: 625 ms, total: 13.7 s
Wall time: 15.2 s


In [47]:
# Now, we will evaluate our model using the test features and labels.
mean_absolute_error_model_1 = model_1.evaluate(
    test_features,
    test_labels, verbose=0)

In [48]:
# The mean absolute error of the model can be printed out. Remember, we want to minimise this. Perhaps a model with just day and NUM_COLLISIONS would be better. It will also vary on each training run due to randomisation.
print(mean_absolute_error_model_1)

0.16298288106918335


In [52]:
# we create a custom dataframe with 3 values per feature.
input_1 = pd.DataFrame.from_dict(data={
    'day': [1, 1, 1],
    'temp': [83.6, 80.3, 79.8],
    'dewp': [63.0, 54.1, 56.7],
})

In [53]:
input_1.head()

   day  temp  dewp
0    1  83.6  63.0
1    1  80.3  54.1
2    1  79.8  56.7


In [54]:
# Next we can check this out. SCALE_NUM_TRIPS is 1 here; with the TAXI_TRIPS data you would multiply by 600000 to get more realistic NUM_TRIPS values.
linear_day_predictions_1 = model_1.predict(input_1[:3])*SCALE_NUM_TRIPS
print(linear_day_predictions_1)

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 23ms/step
[[0.26524293]
 [0.30014187]
 [0.28959346]]


Of the test inputs we have used, the first element in the array is similar to this row of actual data:

"165",1,62.4,5.6,0.668458757342322

We can see that the temperature is slightly higher in the test data (which means fewer trips), but the slightly higher wind means more trips. So, the difference between the actual value of 0.668 and the predicted value of 0.576 (both rounded to 3 significant figures) seems reasonable.

Similarly with the second:

"389",1,26.6,3.1,0.763954173062719, which has a higher number of trips due to a lower temperature, along with a slightly higher wind speed.

And with the third:

"571",1,77.2,8.4,0.724652060408235

The last prediction with the higher temperature seems to punish the values more.

In [58]:
# same as above, but for day 6
input_2 = pd.DataFrame.from_dict(data={
    'day': [6, 6, 6],
    'temp': [83.6, 80.3, 79.8],
    'dewp': [63.0, 54.1, 56.7],
})

In [59]:
# SCALE_NUM_TRIPS is 1 here; a value such as 600000 would give back realistic numbers for the TAXI_TRIPS data.
linear_day_predictions_2 = model_1.predict(input_2[:3])*SCALE_NUM_TRIPS
print(linear_day_predictions_2)

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 33ms/step
[[0.49341384]
 [0.5283128 ]
 [0.5177644 ]]


This test uses day 6 (Friday) instead of day 1 (Sunday), and the predicted values are higher. The other inputs were left the same.
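
Since `model_1` is linear, changing only `day` should shift every prediction by the same amount, whatever the temperature and dew point are. A quick check on the numbers printed above:

```python
import numpy as np

# Predictions copied from the two outputs above (same temp and dewp, different day).
day_1 = np.array([0.26524293, 0.30014187, 0.28959346])
day_6 = np.array([0.49341384, 0.5283128, 0.5177644])

shift = day_6 - day_1          # effect of moving from day 1 to day 6
per_day = shift / (6 - 1)      # a linear model spreads it evenly over the 5 days

print(shift.round(4), per_day.round(4))
```

All three rows move by about 0.2282, or roughly 0.0456 per day, for this training run.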

Things to think about for the assignment: make a validation set, e.g. 5% of the data (or maybe more), and use it for this type of testing. My values are simply made up.
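
One possible way to set such data aside, sketched with a made-up frame: take the validation rows out first, then split the remainder 80/20 as in the cells above. The frame, the 5% figure and the variable names are only illustrative.

```python
import pandas as pd

# Hypothetical stand-in for the real data: 100 rows of made-up values.
toy = pd.DataFrame({
    "day": [d % 7 + 1 for d in range(100)],
    "NUM_COLLISIONS": [0.5 + (d % 10) / 100 for d in range(100)],
})

# 1. Hold out 5% that the model never sees during training or testing.
validation_set = toy.sample(frac=0.05, random_state=0)
remaining = toy.drop(validation_set.index)

# 2. Split what is left into 80% training and 20% test.
training_set = remaining.sample(frac=0.8, random_state=0)
test_set = remaining.drop(training_set.index)

print(len(validation_set), len(training_set), len(test_set))
```

Rows in `validation_set` can then be fed to `predict` and compared with their actual labels instead of made-up inputs.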

You should also remember to try different models with different data. In this case, I would maybe take each input variable separately and make a regression model for each, then try different combinations, e.g. any two.
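
The feature sets suggested here can be listed mechanically. A small sketch using the three input columns from this notebook (the names `features` and `feature_sets` are just for the example):

```python
from itertools import combinations

features = ["day", "temp", "dewp"]

# Every non-empty subset of the inputs: singles, pairs, then all three.
feature_sets = [list(combo)
                for size in range(1, len(features) + 1)
                for combo in combinations(features, size)]

for feature_set in feature_sets:
    print(feature_set + ["NUM_COLLISIONS"])   # columns to select for that model
```

That gives seven candidate models, each of which can be built and evaluated with the same cells as above.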

Remember, you need to write up your results in the assignment.