## Task: Predict the number of bikers on a given day using linear regression

You are provided with a dataset about Seattle's Fremont Bridge in the form of a CSV file.
Each row describes one day: the day of the week, whether it was a holiday, hours of daylight, rainfall, temperature and whether the day was dry (see the dataframe preview below for details). Each row also records how many bikers were observed crossing the bridge that day.

The code to download and load the CSV file is provided.

Your task is to train a linear regression model that takes the features of a day as input (you may drop any columns you think you don't need) and predicts the number of bikers for that day.

In [None]:
from IPython.display import clear_output

In [None]:
# Don't modify this code


%pip install gdown==4.5


clear_output()

In [None]:
# Download the CSV file.
!gdown 1_eJU8Y-31_l0oq1sSJT6pROJyo-ufuvD

Downloading...
From: https://drive.google.com/uc?id=1_eJU8Y-31_l0oq1sSJT6pROJyo-ufuvD
To: /content/bikers_data.csv
100% 213k/213k [00:00<00:00, 72.1MB/s]


In [None]:
import pandas as pd
import numpy as np

In [None]:
data_df = pd.read_csv('bikers_data.csv')

In [None]:
data_df.head()

Unnamed: 0,Date,Number of bikers,Mon,Tue,Wed,Thu,Fri,Sat,Sun,holiday,daylight_hrs,Rainfall (in),Temp (F),dry day
0,2012-10-03,14084.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,11.277359,0.0,56.0,1
1,2012-10-04,13900.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,11.219142,0.0,56.5,1
2,2012-10-05,12592.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,11.161038,0.0,59.5,1
3,2012-10-06,8024.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,11.103056,0.0,60.5,1
4,2012-10-07,8568.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,11.045208,0.0,60.5,1


In [None]:
data_y = data_df['Number of bikers'] # target
data_x = data_df.drop(['Number of bikers'], axis=1) # input features

In [None]:
data_x.head()

Unnamed: 0,Date,Mon,Tue,Wed,Thu,Fri,Sat,Sun,holiday,daylight_hrs,Rainfall (in),Temp (F),dry day
0,2012-10-03,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,11.277359,0.0,56.0,1
1,2012-10-04,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,11.219142,0.0,56.5,1
2,2012-10-05,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,11.161038,0.0,59.5,1
3,2012-10-06,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,11.103056,0.0,60.5,1
4,2012-10-07,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,11.045208,0.0,60.5,1


In [None]:
data_y

0       14084.0
1       13900.0
2       12592.0
3        8024.0
4        8568.0
         ...   
2641     4552.0
2642     3352.0
2643     3692.0
2644     7212.0
2645     4568.0
Name: Number of bikers, Length: 2646, dtype: float64

In [None]:
data_x['Month'] = data_x['Date'].apply(lambda x: int(x[5:7]))
data_x.head()

Unnamed: 0,Date,Mon,Tue,Wed,Thu,Fri,Sat,Sun,holiday,daylight_hrs,Rainfall (in),Temp (F),dry day,Month
0,2012-10-03,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,11.277359,0.0,56.0,1,10
1,2012-10-04,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,11.219142,0.0,56.5,1,10
2,2012-10-05,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,11.161038,0.0,59.5,1,10
3,2012-10-06,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,11.103056,0.0,60.5,1,10
4,2012-10-07,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,11.045208,0.0,60.5,1,10
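
The month is extracted above by slicing characters 5 to 7 of the `YYYY-MM-DD` string. A possible alternative, sketched below on a small made-up frame (`demo` is a hypothetical name, not part of the notebook), is to parse the column as dates first, which raises an error on malformed entries instead of silently slicing them.

```python
import pandas as pd

# Toy frame mirroring the 'Date' column format shown above (YYYY-MM-DD strings).
demo = pd.DataFrame({'Date': ['2012-10-03', '2013-01-15', '2014-07-04']})

# String slicing, as in the notebook cell above.
month_by_slice = demo['Date'].apply(lambda x: int(x[5:7]))

# Datetime parsing: raises on malformed dates instead of slicing them blindly.
month_by_parse = pd.to_datetime(demo['Date'], format='%Y-%m-%d').dt.month

print(month_by_slice.tolist(), month_by_parse.tolist())
```

Both approaches give the same months for well-formed dates, so the choice is mainly about how strictly the input should be validated.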


In [None]:
data_x.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2646 entries, 0 to 2645
Data columns (total 14 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Date           2646 non-null   object 
 1   Mon            2646 non-null   float64
 2   Tue            2646 non-null   float64
 3   Wed            2646 non-null   float64
 4   Thu            2646 non-null   float64
 5   Fri            2646 non-null   float64
 6   Sat            2646 non-null   float64
 7   Sun            2646 non-null   float64
 8   holiday        2646 non-null   float64
 9   daylight_hrs   2646 non-null   float64
 10  Rainfall (in)  2646 non-null   float64
 11  Temp (F)       2646 non-null   float64
 12  dry day        2646 non-null   int64  
 13  Month          2646 non-null   int64  
dtypes: float64(11), int64(2), object(1)
memory usage: 289.5+ KB


In [None]:
data_x.drop('Date', inplace=True, axis=1)
data_x.head()

Unnamed: 0,Mon,Tue,Wed,Thu,Fri,Sat,Sun,holiday,daylight_hrs,Rainfall (in),Temp (F),dry day,Month
0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,11.277359,0.0,56.0,1,10
1,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,11.219142,0.0,56.5,1,10
2,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,11.161038,0.0,59.5,1,10
3,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,11.103056,0.0,60.5,1,10
4,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,11.045208,0.0,60.5,1,10


In [None]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler() # Feature Scaling

columns = data_x.columns
data_x = data_x.to_numpy()
data_x = scaler.fit_transform(data_x)
data_x = pd.DataFrame(data_x, columns=columns)
data_x.head()

Unnamed: 0,Mon,Tue,Wed,Thu,Fri,Sat,Sun,holiday,daylight_hrs,Rainfall (in),Temp (F),dry day,Month
0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.404441,0.0,0.54386,1.0,0.818182
1,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.396743,0.0,0.552632,1.0,0.818182
2,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.389059,0.0,0.605263,1.0,0.818182
3,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.381392,0.0,0.622807,1.0,0.818182
4,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.373742,0.0,0.622807,1.0,0.818182
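
`MinMaxScaler` maps each column to the range [0, 1] via `(x - min) / (max - min)`. As a rough check of the `Month` column above, assuming the data covers all twelve months, October becomes (10 - 1) / (12 - 1) ≈ 0.818182, which matches the preview. The sketch below uses a standalone array of months rather than the notebook's dataframe.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Months 1..12 as a single feature column (assumes all months occur in the data).
months = np.arange(1, 13, dtype=float).reshape(-1, 1)

# Scaling done by scikit-learn.
scaled = MinMaxScaler().fit_transform(months)

# The same transformation written out by hand.
manual = (months - months.min()) / (months.max() - months.min())

print(np.allclose(scaled, manual))    # True
print(round(float(scaled[9, 0]), 6))  # October -> 0.818182
```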


In [None]:
y_scaler = MinMaxScaler() # separate scaler, so the fitted feature scaler is not overwritten
data_y = y_scaler.fit_transform(np.array(data_y).reshape(-1, 1)) # MinMax Scaling

In [None]:
X = data_x.values
Y = data_y.squeeze()

X = np.hstack([X, np.ones((X.shape[0], 1))])
display(pd.DataFrame(X))

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13
0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.404441,0.000000,0.543860,1.0,0.818182,1.0
1,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.396743,0.000000,0.552632,1.0,0.818182,1.0
2,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.389059,0.000000,0.605263,1.0,0.818182,1.0
3,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.381392,0.000000,0.622807,1.0,0.818182,1.0
4,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.373742,0.000000,0.622807,1.0,0.818182,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2641,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.002488,0.003077,0.280702,0.0,1.000000,1.0
2642,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.003522,0.000000,0.333333,1.0,1.000000,1.0
2643,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.004735,0.003077,0.359649,0.0,1.000000,1.0
2644,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.006124,0.012308,0.359649,0.0,1.000000,1.0
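
The column of ones appended above plays the role of the intercept, which is why the models further down are fitted with `fit_intercept=False`. The following sketch, on synthetic data with made-up names (`X_demo`, `y_demo`), illustrates that for ordinary least squares the two formulations give the same fit.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: 3 features, known weights, intercept 4.0, small noise.
rng = np.random.default_rng(0)
X_demo = rng.random((50, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + 4.0 + 0.01 * rng.standard_normal(50)

# Variant 1: let scikit-learn fit the intercept.
with_intercept = LinearRegression(fit_intercept=True).fit(X_demo, y_demo)

# Variant 2: append a constant column and disable the built-in intercept.
X_aug = np.hstack([X_demo, np.ones((X_demo.shape[0], 1))])
no_intercept = LinearRegression(fit_intercept=False).fit(X_aug, y_demo)

# The last weight of variant 2 corresponds to the intercept of variant 1.
print(np.allclose(with_intercept.coef_, no_intercept.coef_[:-1]))
print(np.isclose(with_intercept.intercept_, no_intercept.coef_[-1]))
```

For the regularized models the two variants are not identical, because the weight on the constant column is penalized along with the other weights.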


In [None]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=0)

display(pd.DataFrame(x_train))

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13
0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.020013,0.000000,0.184211,1.0,1.000000,1.0
1,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.129912,0.184615,0.280702,0.0,0.909091,1.0
2,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.976932,0.000000,0.780702,1.0,0.545455,1.0
3,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.816373,0.135385,0.692982,0.0,0.363636,1.0
4,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.860440,0.144615,0.491228,0.0,0.363636,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2111,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.993691,0.000000,0.815789,1.0,0.454545,1.0
2112,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.169853,0.058462,0.526316,0.0,0.909091,1.0
2113,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.056130,0.000000,0.394737,1.0,0.000000,1.0
2114,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.667019,0.024615,0.412281,0.0,0.272727,1.0


In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

lr = LinearRegression(fit_intercept=False).fit(x_train, y_train)
ols_mse = mean_squared_error(y_test, lr.predict(x_test))

print(f'OLS MSE: {ols_mse}')

OLS MSE: 0.00843592126200285
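
Because the target was min-max scaled, this MSE is expressed in scaled units rather than in bikers. One way to read it in original units is to multiply the root of the scaled MSE by the target's range (max minus min). The sketch below uses the first five counts from the preview and made-up predictions; `y_scaler` and `y_pred` are illustrative names, not values from the notebook.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import MinMaxScaler

# First five bike counts from the preview, plus made-up predictions.
y_true = np.array([14084.0, 13900.0, 12592.0, 8024.0, 8568.0])
y_pred = y_true + np.array([500.0, -300.0, 200.0, -100.0, 400.0])

# Scale both with a scaler fitted on the true target only.
y_scaler = MinMaxScaler().fit(y_true.reshape(-1, 1))
mse_scaled = mean_squared_error(
    y_scaler.transform(y_true.reshape(-1, 1)),
    y_scaler.transform(y_pred.reshape(-1, 1)),
)

# Undo the scaling of the error: RMSE in bikers = sqrt(scaled MSE) * (max - min).
rmse_bikers = np.sqrt(mse_scaled) * y_scaler.data_range_[0]

print(f'Scaled MSE: {mse_scaled:.6f}, RMSE in bikers: {rmse_bikers:.1f}')
```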


In [None]:
from sklearn.linear_model import RidgeCV

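# Note: newer scikit-learn releases rename store_cv_values / cv_values_ to store_cv_results / cv_results_
ridge = RidgeCV(alphas = [1e-2, 1e-1, 1, 1e1, 1e2], fit_intercept=False, store_cv_values=True)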
ridge.fit(x_train, y_train)
y_ridge_pred = ridge.predict(x_test)
ridge_mse = mean_squared_error(y_test, y_ridge_pred)
print(f'Cross Validation MSEs: {np.mean(ridge.cv_values_, axis=0)}')
print(f'Ridge MSE: {ridge_mse}, Alpha: {ridge.alpha_}')

Cross Validation MSEs: [0.00740508 0.00740487 0.00741031 0.00768968 0.01025148]
Ridge MSE: 0.008435564496930932, Alpha: 0.1
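
The five cross-validation MSEs printed above line up with the five candidate alphas, and the selected alpha is the one with the smallest mean error. As a small check using only the printed numbers (not a re-run of the model):

```python
import numpy as np

# Candidate alphas and the mean cross-validation MSEs printed above.
alphas = np.array([1e-2, 1e-1, 1, 1e1, 1e2])
cv_mse = np.array([0.00740508, 0.00740487, 0.00741031, 0.00768968, 0.01025148])

# The alpha with the lowest mean CV error.
best_alpha = alphas[np.argmin(cv_mse)]
print(best_alpha)  # 0.1
```

The errors for alpha = 0.01, 0.1 and 1 differ only in the sixth decimal place, so the choice among them has little practical effect here.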


In [None]:
from sklearn.linear_model import LassoCV

lasso = LassoCV(alphas = [1e-2, 1e-1, 1, 1e1, 1e2], fit_intercept=False, max_iter=100000)
lasso.fit(x_train, y_train)
y_lasso_pred = lasso.predict(x_test)
lasso_mse = mean_squared_error(y_test, y_lasso_pred)
print(f'LASSO MSE: {lasso_mse}, Alpha: {lasso.alpha_}')

LASSO MSE: 0.013943809919790856, Alpha: 0.01


In [None]:
print(f'OLS:\tMSE = {ols_mse}')
print(f'Ridge:\tMSE = {ridge_mse}')
print(f'LASSO:\tMSE = {lasso_mse}')

OLS:	MSE = 0.00843592126200285
Ridge:	MSE = 0.008435564496930932
LASSO:	MSE = 0.013943809919790856


The best model is Ridge with alpha = 0.1. Its test MSE is only marginally lower than that of OLS, while LASSO is clearly worse.

In [None]:
print(f'Ridge MSE: {ridge_mse}, Alpha: {ridge.alpha_}')

Ridge MSE: 0.008435564496930932, Alpha: 0.1
