# Machine Learning

This notebook predicts the number of available bikes at a given station from the date and time entered by the user. The code was adapted from the COMP47350 labs.

Because the weather data showed little correlation with the bike data, all weather data was omitted from this section.

In [125]:
import pandas as pd
import os
import sys
import mysql.connector

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.model_selection import cross_validate
from sklearn.model_selection import cross_val_score 
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestRegressor

import joblib
from joblib import load, dump
import pickle
import json

In [13]:
# Connect to the RDS database and load the whole Bikes table into a DataFrame.
mydb = mysql.connector.connect(
    host="dbbikes.cpwzqhmscagf.eu-west-1.rds.amazonaws.com",
    user="group20",
    password="30830Group20",
    database="dBikes"
)
query = "SELECT * FROM Bikes;"
bikes_df = pd.read_sql(query, mydb)
mydb.close()

In [155]:
bikes_df.head()

Unnamed: 0,Address,Available_Bikes,Available_Stands,Updated,Status,hour_of_day,day_of_week
0,Smithfield North,18,12,2022-03-03 09:20:28,OPEN,9,3
1,Parnell Square North,3,17,2022-03-03 09:27:04,OPEN,9,3
2,Clonmel Street,4,29,2022-03-03 09:28:23,OPEN,9,3
3,Avondale Road,0,35,2022-03-03 09:26:09,OPEN,9,3
4,Mount Street Lower,6,34,2022-03-03 09:27:52,OPEN,9,3


In [156]:
bikes_df.dtypes

Address                   category
Available_Bikes              int64
Available_Stands             int64
Updated             datetime64[ns]
Status                    category
hour_of_day                  int64
day_of_week                  int64
dtype: object

In [157]:
bikes_df['Address'] = bikes_df['Address'].astype('category')
bikes_df['Status'] = bikes_df['Status'].astype('category')
categorical_columns = bikes_df[['Address','Status']].columns
bikes_df[categorical_columns].describe().T

Unnamed: 0,count,unique,top,freq
Address,1191176,111,Avondale Road,10817
Status,1191176,2,OPEN,1183923


In [158]:
bikes_df.dtypes

Address                   category
Available_Bikes              int64
Available_Stands             int64
Updated             datetime64[ns]
Status                    category
hour_of_day                  int64
day_of_week                  int64
dtype: object

In [159]:
numeric_columns = bikes_df[['Available_Bikes','Available_Stands']].columns
bikes_df[numeric_columns].describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Available_Bikes,1191176.0,12.634395,9.124364,0.0,5.0,12.0,19.0,40.0
Available_Stands,1191176.0,19.094727,10.801886,0.0,11.0,19.0,27.0,40.0


In [160]:
#changing from unix timestamp

bikes_df['Updated'] = pd.to_datetime(bikes_df['Updated'],unit='s')

In [161]:
DateTime_columns = bikes_df.select_dtypes(['datetime64[ns]']).columns
bikes_df[DateTime_columns].describe().T


Unnamed: 0,count,unique,top,freq,first,last
Updated,1191176,542842,2022-04-05 14:15:05,1255,2022-03-03 09:19:27,2022-04-09 22:49:17


In [162]:
bikes_df.head()

Unnamed: 0,Address,Available_Bikes,Available_Stands,Updated,Status,hour_of_day,day_of_week
0,Smithfield North,18,12,2022-03-03 09:20:28,OPEN,9,3
1,Parnell Square North,3,17,2022-03-03 09:27:04,OPEN,9,3
2,Clonmel Street,4,29,2022-03-03 09:28:23,OPEN,9,3
3,Avondale Road,0,35,2022-03-03 09:26:09,OPEN,9,3
4,Mount Street Lower,6,34,2022-03-03 09:27:52,OPEN,9,3


The user will be able to see the expected number of bikes at their chosen station by entering a future date and time. The day of the week and the hour are therefore extracted from the 'Updated' column.
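
As a minimal sketch of how a user-supplied date and time could be turned into the same two features, the hypothetical helper `to_features` below (not part of the original notebook) parses a timestamp string with pandas and returns `[hour_of_day, day_of_week]`, with Monday as 0 as in `.dt.dayofweek`.

```python
import pandas as pd

def to_features(when):
    """Return [hour_of_day, day_of_week] for a date/time string.

    Hypothetical helper for illustration; day_of_week follows the pandas
    convention (Monday = 0, Sunday = 6).
    """
    ts = pd.Timestamp(when)
    return [ts.hour, ts.dayofweek]

# Timestamp of the first row in the table above (a Thursday).
features = to_features("2022-03-03 09:20:28")
print(features)
```

The returned list has the same column order as the model inputs used later, so it could be wrapped in an outer list and passed to `model.predict`.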

In [163]:
#column for hour of the day

bikes_df['hour_of_day'] = bikes_df['Updated'].dt.hour
bikes_df.head()

Unnamed: 0,Address,Available_Bikes,Available_Stands,Updated,Status,hour_of_day,day_of_week
0,Smithfield North,18,12,2022-03-03 09:20:28,OPEN,9,3
1,Parnell Square North,3,17,2022-03-03 09:27:04,OPEN,9,3
2,Clonmel Street,4,29,2022-03-03 09:28:23,OPEN,9,3
3,Avondale Road,0,35,2022-03-03 09:26:09,OPEN,9,3
4,Mount Street Lower,6,34,2022-03-03 09:27:52,OPEN,9,3


In [164]:
#getting column for the day of week

bikes_df['day_of_week'] = bikes_df['Updated'].dt.dayofweek


bikes_df.head()

Unnamed: 0,Address,Available_Bikes,Available_Stands,Updated,Status,hour_of_day,day_of_week
0,Smithfield North,18,12,2022-03-03 09:20:28,OPEN,9,3
1,Parnell Square North,3,17,2022-03-03 09:27:04,OPEN,9,3
2,Clonmel Street,4,29,2022-03-03 09:28:23,OPEN,9,3
3,Avondale Road,0,35,2022-03-03 09:26:09,OPEN,9,3
4,Mount Street Lower,6,34,2022-03-03 09:27:52,OPEN,9,3


In [165]:
categorical_columns = bikes_df[['Address']].columns

continuous_columns = bikes_df[['Available_Bikes', 'Available_Stands', 'hour_of_day', 'day_of_week']].columns

bikes_df[continuous_columns].describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Available_Bikes,1191176.0,12.634395,9.124364,0.0,5.0,12.0,19.0,40.0
Available_Stands,1191176.0,19.094727,10.801886,0.0,11.0,19.0,27.0,40.0
hour_of_day,1191176.0,11.574272,6.889459,0.0,6.0,12.0,18.0,23.0
day_of_week,1191176.0,3.074466,1.962829,0.0,1.0,3.0,5.0,6.0


In [166]:
bikes_df[categorical_columns].describe().T

Unnamed: 0,count,unique,top,freq
Address,1191176,111,Avondale Road,10817


In [167]:
bikes_df.head()

Unnamed: 0,Address,Available_Bikes,Available_Stands,Updated,Status,hour_of_day,day_of_week
0,Smithfield North,18,12,2022-03-03 09:20:28,OPEN,9,3
1,Parnell Square North,3,17,2022-03-03 09:27:04,OPEN,9,3
2,Clonmel Street,4,29,2022-03-03 09:28:23,OPEN,9,3
3,Avondale Road,0,35,2022-03-03 09:26:09,OPEN,9,3
4,Mount Street Lower,6,34,2022-03-03 09:27:52,OPEN,9,3


## Training the models

- Different modelling algorithms are tested to see which one works best with the data.

- The input values are 'hour_of_day' and 'day_of_week'.

- The modelling algorithms tested are Random Forest Regression and Linear Regression.

In [168]:
X = bikes_df[['hour_of_day', 'day_of_week']]
y = bikes_df.Available_Bikes

print("\nDescriptive features in X:\n", X)
print("\nTarget feature in y:\n", y)


Descriptive features in X:
          hour_of_day  day_of_week
0                  9            3
1                  9            3
2                  9            3
3                  9            3
4                  9            3
...              ...          ...
1191171           22            5
1191172           22            5
1191173           22            5
1191174           22            5
1191175           22            5

[1191176 rows x 2 columns]

Target feature in y:
 0          18
1           3
2           4
3           0
4           6
           ..
1191171     1
1191172    19
1191173    35
1191174    20
1191175     3
Name: Available_Bikes, Length: 1191176, dtype: int64


## Linear Regression Model

In [169]:
cont_features = ['hour_of_day', 'day_of_week']

linreg = LinearRegression().fit(X[cont_features], y)

# Print the weights learned for each feature.
print("Features: \n", cont_features)
print("Coefficients: \n", linreg.coef_)
print("\nIntercept: \n", linreg.intercept_)

feature_importance = pd.DataFrame({'feature': cont_features, 'importance':linreg.coef_})
feature_importance.sort_values('importance', ascending=False)

Features: 
 ['hour_of_day', 'day_of_week']
Coefficients: 
 [-0.01961888  0.03819846]

Intercept: 
 12.744029242847708


Unnamed: 0,feature,importance
1,day_of_week,0.038198
0,hour_of_day,-0.019619


In [170]:
linreg_predictions = linreg.predict(X[cont_features])

print("\nPredictions with linear regression: \n")
actual_vs_predicted_linreg = pd.concat([y, pd.DataFrame(linreg_predictions, columns=['Predicted'], index=y.index)], axis=1)
print(actual_vs_predicted_linreg)


Predictions with linear regression: 

         Available_Bikes  Predicted
0                     18  12.682055
1                      3  12.682055
2                      4  12.682055
3                      0  12.682055
4                      6  12.682055
...                  ...        ...
1191171                1  12.503406
1191172               19  12.503406
1191173               35  12.503406
1191174               20  12.503406
1191175                3  12.503406

[1191176 rows x 2 columns]


In [171]:
prediction_errors = y - linreg_predictions
print("Actual - Predicted:\n", prediction_errors)
print("\n(Actual - Predicted) squared:\n", prediction_errors**2)
print("\n Sum of (Actual - Predicted) squared:\n", (prediction_errors**2).sum())

Actual - Predicted:
 0           5.317945
1          -9.682055
2          -8.682055
3         -12.682055
4          -6.682055
             ...    
1191171   -11.503406
1191172     6.496594
1191173    22.496594
1191174     7.496594
1191175    -9.503406
Name: Available_Bikes, Length: 1191176, dtype: float64

(Actual - Predicted) squared:
 0           28.280542
1           93.742184
2           75.378075
3          160.834513
4           44.649856
              ...    
1191171    132.328357
1191172     42.205730
1191173    506.096728
1191174     56.198917
1191175     90.314731
Name: Available_Bikes, Length: 1191176, dtype: float64

 Sum of (Actual - Predicted) squared:
 99141532.43188888


In [172]:
mse = (prediction_errors** 2).mean()
rmse = ((prediction_errors** 2).mean())**0.5

print("\nMean Squared Error:\n", mse)
print("\nRoot Mean Squared Error:\n", rmse)


Mean Squared Error:
 83.22996134231705

Root Mean Squared Error:
 9.123045617682566


In [173]:
print("|Actual - Predicted|:\n", abs(prediction_errors))

|Actual - Predicted|:
 0           5.317945
1           9.682055
2           8.682055
3          12.682055
4           6.682055
             ...    
1191171    11.503406
1191172     6.496594
1191173    22.496594
1191174     7.496594
1191175     9.503406
Name: Available_Bikes, Length: 1191176, dtype: float64


In [174]:
def printMetrics(testActualVal, predictions):
    # Regression evaluation measures: mean absolute error and root mean squared error
    print('\n==============================================================================')
    print("MAE: ", metrics.mean_absolute_error(testActualVal, predictions))
    #print("MSE: ", metrics.mean_squared_error(testActualVal, predictions))
    print("RMSE: ", metrics.mean_squared_error(testActualVal, predictions)**0.5)
    

In [175]:
printMetrics(y, linreg_predictions)


MAE:  7.491373935409785
RMSE:  9.123045617682244
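
To make the two metrics concrete, here is a small sketch on made-up numbers (the arrays are illustrative, not taken from the bike data) that computes MAE and RMSE by hand with NumPy and checks them against `sklearn.metrics`, the module used in `printMetrics`.

```python
import numpy as np
from sklearn import metrics

# Made-up values for illustration only.
actual = np.array([3.0, 5.0, 2.0])
predicted = np.array([2.0, 5.0, 4.0])

errors = actual - predicted            # [1, 0, -2]
mae = np.abs(errors).mean()            # mean of the absolute errors
rmse = np.sqrt((errors ** 2).mean())   # square root of the mean squared error

# The same quantities computed with scikit-learn.
sk_mae = metrics.mean_absolute_error(actual, predicted)
sk_rmse = metrics.mean_squared_error(actual, predicted) ** 0.5

print("MAE: ", mae, sk_mae)
print("RMSE:", rmse, sk_rmse)
```

Because RMSE squares each error before averaging, it weights large misses more heavily than MAE, which is why it is the larger of the two figures here and in the results above.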


## Random Forest Regression

In [176]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

In [177]:
model = RandomForestRegressor()

model.fit(X_train,y_train)
predictions = model.predict(X_test)

In [194]:
rforest = RandomForestRegressor().fit(X[cont_features], y)

In [195]:
forest_predictions = rforest.predict(X[cont_features])

print("\nPredictions with random forest regression: \n")
actual_vs_predicted_forest = pd.concat([y, pd.DataFrame(forest_predictions, columns=['Predicted'], index=y.index)], axis=1)
print(actual_vs_predicted_forest)


Predictions with random forest regression: 

         Available_Bikes  Predicted
1046292                0   0.162059
1046403                0   0.494210
1046514                0   0.494210
1046625                0   0.494210
1046736                0   0.494210
...                  ...        ...
1190703                1   1.000000
1190814                1   1.000000
1190925                1   1.000000
1191036                1   1.000000
1191147                1   1.000000

[1306 rows x 2 columns]


In [178]:
from sklearn.metrics import mean_absolute_error, r2_score
print('\n==============================================================================')
print("MAE: ", mean_absolute_error(y_test,predictions))
print("RMSE: ", metrics.mean_squared_error(y_test, predictions)**0.5)



MAE:  7.488002473730894
RMSE:  9.113555111344342


### As the Random Forest Regression model performs slightly better than the Linear Regression model on both MAE (7.488 vs 7.491) and RMSE (9.114 vs 9.123), it is the model that will be used for our predictions.
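
In the cells above, the linear regression is scored on the same rows it was fitted on, while the random forest is scored on a 30% test split. A like-for-like comparison would score both models on the same held-out rows. The sketch below does this on synthetic data (randomly generated stand-ins for `hour_of_day`, `day_of_week` and `Available_Bikes`, not the real table), so the printed numbers say nothing about the actual stations; `n_estimators=20` is an arbitrary choice to keep it quick.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (illustrative only).
rng = np.random.default_rng(0)
n = 2000
X = pd.DataFrame({
    "hour_of_day": rng.integers(0, 24, size=n),
    "day_of_week": rng.integers(0, 7, size=n),
})
y = rng.integers(0, 41, size=n)

# One split shared by both models.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

scores = {}
for name, model in [
    ("linear regression", LinearRegression()),
    ("random forest", RandomForestRegressor(n_estimators=20, random_state=0)),
]:
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    scores[name] = {
        "MAE": mean_absolute_error(y_test, predictions),
        "RMSE": mean_squared_error(y_test, predictions) ** 0.5,
    }

print(pd.DataFrame(scores).T)
```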

## Making a pickle file for each station

In [187]:
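Address_list = list(bikes_df['Address'].unique())

for address in Address_list:
    # Train one model per station, using that station's rows only.
    df_new = bikes_df[bikes_df['Address'] == address]

    X = df_new[['hour_of_day', 'day_of_week']]
    y = df_new['Available_Bikes']

    # Hold out 30% of the rows; only the training split is used to fit the model.
    x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

    model = RandomForestRegressor()
    model.fit(x_train, y_train)

    # '/' is a path separator, so this station is renamed before saving.
    if address == "Princes Street / O'Connell Street":
        address = "Princes Street O'Connell Street"

    file_name = address + "_model.pkl"

    # The third positional argument of joblib.dump is `compress`, so the
    # pickle protocol has to be passed by keyword.
    joblib.dump(model, file_name, protocol=pickle.HIGHEST_PROTOCOL)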
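
The special case above exists because `/` is a path separator and cannot be part of a file name. A more general approach, sketched here with a hypothetical helper `station_file_name` that is not part of the original notebook, is to strip the slash from any station name. For the Princes Street station it produces the same file name as the hard-coded rename.

```python
def station_file_name(address):
    """Build the model file name for a station, dropping any '/' characters.

    Hypothetical helper for illustration only.
    """
    # Replace slashes with spaces, then collapse repeated whitespace.
    cleaned = " ".join(address.replace("/", " ").split())
    return cleaned + "_model.pkl"

print(station_file_name("Princes Street / O'Connell Street"))
print(station_file_name("Avondale Road"))
```

Whatever code later loads the models would need to apply the same transformation to the station name.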

## Testing that the pickle files work as they should

In [196]:
model = load('Avondale Road_model.pkl')
bike_avail_predict = model.predict([[19,4]])

predict_list = bike_avail_predict.tolist()
predict_dict = {"bikes": predict_list[0]}
result = json.dumps(predict_dict)

print(result)

{"bikes": 9.994834630183377}




In [197]:
model = load('Exchequer Street_model.pkl')
bike_avail_predict = model.predict([[20,2]])

predict_list = bike_avail_predict.tolist()
predict_dict = {"bikes": predict_list[0]}
result = json.dumps(predict_dict)

print(result)

{"bikes": 14.664878751505002}




In [150]:
model = load('Avondale Road_model.pkl')
bike_avail_predict = model.predict([[5,9]])

predict_list = bike_avail_predict.tolist()
predict_dict = {"bikes": predict_list[0]}
result = json.dumps(predict_dict)

print(result)

{"bikes": 12.989721031500054}




In [42]:
model = load('Avondale Road_model.pkl')
bike_avail_predict = model.predict([[5,9]])

predict_list = bike_avail_predict.tolist()
predict_dict = {"bikes": predict_list[0]}
result = json.dumps(predict_dict)

print(result)

{"bikes": 7.9059849219011324}
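
Two of the test cells above pass `day_of_week = 9`, which lies outside the 0 to 6 range produced by `.dt.dayofweek`; a random forest still returns a number for such input rather than raising an error. One way to guard against this is sketched below with a hypothetical `predict_bikes` function (an assumption, not the notebook's code). To stay self-contained it fits a small forest on synthetic data instead of loading one of the `.pkl` files.

```python
import json

import numpy as np
from sklearn.ensemble import RandomForestRegressor

def predict_bikes(model, hour_of_day, day_of_week):
    """Return a JSON string of the form {"bikes": ...} after validating the inputs."""
    if not 0 <= hour_of_day <= 23:
        raise ValueError("hour_of_day must be between 0 and 23")
    if not 0 <= day_of_week <= 6:
        raise ValueError("day_of_week must be between 0 (Monday) and 6 (Sunday)")
    prediction = model.predict([[hour_of_day, day_of_week]])
    return json.dumps({"bikes": float(prediction[0])})

# Synthetic stand-in for one station's data (illustrative only).
rng = np.random.default_rng(0)
X = np.column_stack([rng.integers(0, 24, size=500), rng.integers(0, 7, size=500)])
y = rng.integers(0, 41, size=500)
demo_model = RandomForestRegressor(n_estimators=10, random_state=0).fit(X, y)

result = predict_bikes(demo_model, 19, 4)
print(result)

# An out-of-range day is rejected instead of silently producing a prediction.
try:
    predict_bikes(demo_model, 5, 9)
except ValueError as err:
    print("Rejected:", err)
```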




In [61]:
print(Address_list)

['Smithfield North', 'Parnell Square North', 'Clonmel Street', 'Avondale Road', 'Mount Street Lower', 'Christchurch Place', 'Grantham Street', 'Pearse Street', 'York Street East', 'Excise Walk', 'Fitzwilliam Square West', 'Portobello Road', 'Parnell Street', 'Frederick Street South', 'Custom House', 'Rathdown Road', "North Circular Road (O'Connell's)", 'Hanover Quay', 'Oliver Bond Street', 'Collins Barracks Museum', 'Brookfield Road', 'Benson Street', 'Earlsfort Terrace', 'Golden Lane', 'Deverell Place', 'Wilton Terrace (Park)', 'John Street West', 'Fenian Street', 'Merrion Square South', 'South Dock Road', 'City Quay', 'Exchequer Street', 'The Point', 'Broadstone', 'Hatch Street', 'Lime Street', 'Charlemont Street', 'Kilmainham Gaol', 'Hardwicke Place', 'Wolfe Tone Street', 'Francis Street', 'Greek Street', 'Guild Street', 'Herbert Place', 'High Street', 'North Circular Road', 'Western Way', 'Talbot Street', 'Newman House', "Sir Patrick's Dun", 'New Central Bank', 'Grangegorman Lower 