<a href="https://colab.research.google.com/github/enes-karatas/AI_ML_test/blob/main/Machine_Learning_Project_Bike_Rental%26Demand_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# BikeEase Bike Rental Prediction Project

###Project Insight :
BikeEase is a New York-based urban mobility company providing bike rental services across the city. The company offers flexible bike rental options to both residents and tourists aiming to encourage eco-friendly transportation.

BikeEase plans to leverage AI/ML capabilities to optimize operations, predict demand, and improve user experience. The goal is to build an intelligent analytics platform that helps understand rental patterns, seasonal trends, and operational efficiency.

###Objective :

Develop an end-to-end machine learning pipeline to forecast hourly bike rentals using various regression techniques and performance evaluation methods.

###Summary :
- Built machine learning models to predict future bike rental demand using historical rental data.
- Conducted data cleaning, analysis and feature engineering; trained and tuned linear regression, Ridge, Lasso, and Elastic Net models using GridSearch to identify the best-performing model.

###Dataset :

Input dataset : FloridaBikeRentals.csv  
Dataset link : https://drive.google.com/file/d/1BAJ8iDpCJdfZSg1QS62RlMiJSs0O8MrG/view

Dataset Columns :

Date: The date when the data was recorded

Rented Bike Count: The number of bikes rented during the given hour

Hour: The hour of the day (0-23)

Temperature(°C): The recorded temperature in Celsius

Humidity(%): The relative humidity percentage

Wind speed (m/s): Wind speed measured in meters per second

Visibility (10m): Visibility recorded in units of 10 meters

Dew point temperature(°C): The dew point temperature in Celsius

Solar Radiation (MJ/m2): The amount of solar radiation received

Rainfall(mm): The recorded rainfall in millimeters

Snowfall (cm): The recorded snowfall in centimeters

Seasons: The season when the data was collected (e.g., Winter, Spring, Summer, Fall)

Holiday: Whether the day was a holiday or not

Functioning Day: Indicates whether the bike rental service was operational on that day



#Contents

###1. Feature Engineering

###2. Model Building
- 2.1. Linear Regression
- 2.2. Ridge Regression (L2 Regularization)
- 2.3. Lasso Regression (L1 Regularization)
- 2.4. Elastic Net Regression

###3. Model building with polynomial features, model evaluation and validation

###4. Final Report

In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler


# 1. Feature Engineering

In [None]:
df_raw = pd.read_csv('FloridaBikeRentals.csv', encoding='latin1')

df= df_raw.copy()
# Making a copy of raw data to work on

df.head(3)

Unnamed: 0,Date,Rented Bike Count,Hour,Temperature(°C),Humidity(%),Wind speed (m/s),Visibility (10m),Dew point temperature(°C),Solar Radiation (MJ/m2),Rainfall(mm),Snowfall (cm),Seasons,Holiday,Functioning Day
0,01-12-2017,254,0,-5.2,37,2.2,2000,-17.6,0.0,0.0,0.0,Winter,No Holiday,Yes
1,01-12-2017,204,1,-5.5,38,0.8,2000,-17.6,0.0,0.0,0.0,Winter,No Holiday,Yes
2,01-12-2017,173,2,-6.0,39,1.0,2000,-17.7,0.0,0.0,0.0,Winter,No Holiday,Yes


In [None]:
df.shape

(8760, 14)

In [None]:
df.info()
# Datatype, null value and memory usage check

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8760 entries, 0 to 8759
Data columns (total 14 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Date                       8760 non-null   object 
 1   Rented Bike Count          8760 non-null   int64  
 2   Hour                       8760 non-null   int64  
 3   Temperature(°C)            8760 non-null   float64
 4   Humidity(%)                8760 non-null   int64  
 5   Wind speed (m/s)           8760 non-null   float64
 6   Visibility (10m)           8760 non-null   int64  
 7   Dew point temperature(°C)  8760 non-null   float64
 8   Solar Radiation (MJ/m2)    8760 non-null   float64
 9   Rainfall(mm)               8760 non-null   float64
 10  Snowfall (cm)              8760 non-null   float64
 11  Seasons                    8760 non-null   object 
 12  Holiday                    8760 non-null   object 
 13  Functioning Day            8760 non-null   object

In [None]:
df.isna().sum()

# No null values found

Unnamed: 0,0
Date,0
Rented Bike Count,0
Hour,0
Temperature(°C),0
Humidity(%),0
Wind speed (m/s),0
Visibility (10m),0
Dew point temperature(°C),0
Solar Radiation (MJ/m2),0
Rainfall(mm),0


In [None]:
df.dtypes

Unnamed: 0,0
Date,object
Rented Bike Count,int64
Hour,int64
Temperature(°C),float64
Humidity(%),int64
Wind speed (m/s),float64
Visibility (10m),int64
Dew point temperature(°C),float64
Solar Radiation (MJ/m2),float64
Rainfall(mm),float64


In [None]:
df['Functioning Day'] = df['Functioning Day'].map({'Yes':1 , 'No':0 })
df['Holiday'] = df['Holiday'].map({'Holiday':1 , 'No Holiday':0 })

# Mapping 'Functioning Day' and 'Holiday' columns to 1, 0 to make it machine learning model ready

In [None]:
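df = df.drop_duplicates()
# drop_duplicates() returns a new DataFrame, so the result is assigned back
df.shape

# Shape is unchanged, so no duplicate rows were found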

(8760, 14)

In [None]:
df['Date'] = pd.to_datetime(df['Date'] , format='%d-%m-%Y')

# Converting the Date column from object to datetime

In [None]:
df.rename(columns={'Functioning Day' : 'Functioning Day_Yes'}, inplace=True)
df.rename(columns={'Holiday' : 'Holiday_Yes'}, inplace=True)

In [None]:
df['Hour_Temp_Interaction'] = df['Hour'] * df['Temperature(°C)']
# Adding an interaction feature to the dataframe, since Hour and Temperature are positively correlated

In [None]:
df_encoded = pd.get_dummies(df)
# One-hot encoding the Seasons column (the only remaining object column) into boolean columns

df_encoded.head(2)

Unnamed: 0,Date,Rented Bike Count,Hour,Temperature(°C),Humidity(%),Wind speed (m/s),Visibility (10m),Dew point temperature(°C),Solar Radiation (MJ/m2),Rainfall(mm),Snowfall (cm),Holiday_Yes,Functioning Day_Yes,Hour_Temp_Interaction,Seasons_Autumn,Seasons_Spring,Seasons_Summer,Seasons_Winter
0,2017-12-01,254,0,-5.2,37,2.2,2000,-17.6,0.0,0.0,0.0,0,1,-0.0,False,False,False,True
1,2017-12-01,204,1,-5.5,38,0.8,2000,-17.6,0.0,0.0,0.0,0,1,-5.5,False,False,False,True


In [None]:
boolean_columns = df_encoded.select_dtypes(include='bool').columns
df_encoded[boolean_columns] = df_encoded[boolean_columns].astype(int)
# Casting booleans to int values

In [None]:
df_encoded.head()

Unnamed: 0,Date,Rented Bike Count,Hour,Temperature(°C),Humidity(%),Wind speed (m/s),Visibility (10m),Dew point temperature(°C),Solar Radiation (MJ/m2),Rainfall(mm),Snowfall (cm),Holiday_Yes,Functioning Day_Yes,Hour_Temp_Interaction,Seasons_Autumn,Seasons_Spring,Seasons_Summer,Seasons_Winter
0,2017-12-01,254,0,-5.2,37,2.2,2000,-17.6,0.0,0.0,0.0,0,1,-0.0,0,0,0,1
1,2017-12-01,204,1,-5.5,38,0.8,2000,-17.6,0.0,0.0,0.0,0,1,-5.5,0,0,0,1
2,2017-12-01,173,2,-6.0,39,1.0,2000,-17.7,0.0,0.0,0.0,0,1,-12.0,0,0,0,1
3,2017-12-01,107,3,-6.2,40,0.9,2000,-17.6,0.0,0.0,0.0,0,1,-18.6,0,0,0,1
4,2017-12-01,78,4,-6.0,36,2.3,2000,-18.6,0.0,0.0,0.0,0,1,-24.0,0,0,0,1


In [None]:
df_encoded.dtypes

Unnamed: 0,0
Date,datetime64[ns]
Rented Bike Count,int64
Hour,int64
Temperature(°C),float64
Humidity(%),int64
Wind speed (m/s),float64
Visibility (10m),int64
Dew point temperature(°C),float64
Solar Radiation (MJ/m2),float64
Rainfall(mm),float64


In [None]:
df_cleaned = df_encoded.copy()
df_cleaned.head()

# Cleaned and processed data saved in df_cleaned dataFrame

Unnamed: 0,Date,Rented Bike Count,Hour,Temperature(°C),Humidity(%),Wind speed (m/s),Visibility (10m),Dew point temperature(°C),Solar Radiation (MJ/m2),Rainfall(mm),Snowfall (cm),Holiday_Yes,Functioning Day_Yes,Hour_Temp_Interaction,Seasons_Autumn,Seasons_Spring,Seasons_Summer,Seasons_Winter
0,2017-12-01,254,0,-5.2,37,2.2,2000,-17.6,0.0,0.0,0.0,0,1,-0.0,0,0,0,1
1,2017-12-01,204,1,-5.5,38,0.8,2000,-17.6,0.0,0.0,0.0,0,1,-5.5,0,0,0,1
2,2017-12-01,173,2,-6.0,39,1.0,2000,-17.7,0.0,0.0,0.0,0,1,-12.0,0,0,0,1
3,2017-12-01,107,3,-6.2,40,0.9,2000,-17.6,0.0,0.0,0.0,0,1,-18.6,0,0,0,1
4,2017-12-01,78,4,-6.0,36,2.3,2000,-18.6,0.0,0.0,0.0,0,1,-24.0,0,0,0,1


In [None]:
scale_needed_columns = ['Hour', 'Temperature(°C)' , 'Humidity(%)' , 'Wind speed (m/s)' , 'Visibility (10m)' ,
                        'Dew point temperature(°C)' , 'Solar Radiation (MJ/m2)' , 'Rainfall(mm)' , 'Snowfall (cm)', 'Hour_Temp_Interaction']
no_scale_needed_columns = df_cleaned.columns.drop(scale_needed_columns)

df_unscaled = df_cleaned[no_scale_needed_columns]

std_scaler = StandardScaler()

# fit_transform returns a NumPy array, so it is wrapped back into a DataFrame with the original column names
# Note: the scaler is fitted on the full dataset here, before any train/test split
df_scaled = pd.DataFrame(std_scaler.fit_transform(df_cleaned[scale_needed_columns]), columns=scale_needed_columns)

# Joining the scaled and unscaled columns side by side (axis=1)
df_cleaned_scaled = pd.concat([df_scaled, df_unscaled], axis=1)

df_cleaned_scaled = df_cleaned_scaled.drop(columns=['Date'])
df_cleaned_scaled.head()

df_cleaned_scaled.to_csv("bike_rental_features.csv", index=False)

# The continuous numeric columns were scaled with StandardScaler and saved in the new DataFrame df_cleaned_scaled
# We will work on df_cleaned_scaled and variations of it to build our models
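
The scaler above is fitted on every row before the data is split, so statistics from the future test rows influence the training features. A minimal sketch of an alternative, assuming a small synthetic frame (the column names `temp`, `hour`, `count` and the data itself are made up for illustration), splits first and fits `StandardScaler` on the training rows only:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data standing in for the bike rental features
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "temp": rng.normal(15, 10, size=200),
    "hour": rng.integers(0, 24, size=200),
    "count": rng.integers(0, 2000, size=200),
})

X_toy = toy[["temp", "hour"]]
y_toy = toy["count"]
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

# Fit on the training rows only, then reuse the same statistics for the test rows
scaler = StandardScaler()
X_tr_scaled = pd.DataFrame(scaler.fit_transform(X_tr), columns=X_tr.columns, index=X_tr.index)
X_te_scaled = pd.DataFrame(scaler.transform(X_te), columns=X_te.columns, index=X_te.index)

print(X_tr_scaled.mean().round(6))  # training columns are centred at 0
print(X_te_scaled.mean().round(3))  # test columns are near 0, but not exactly
```

On a dataset of this size the effect on the scores is probably small, but the pattern keeps the test set fully unseen.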


# 2. Model Building

###Linear Regression , Ridge Regression(L2) , Lasso Regression(L1) , Elastic Net Regression

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import numpy as np

### 2.1. Linear Regression

In [None]:
# Linear Regression with only Hour used as feature

X = df_cleaned_scaled['Hour']
y = df_cleaned_scaled['Rented Bike Count']
# Defining our Features(X) and Target(y), only 'Hour' selected as feature

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Splitting dataset as Train 80% and Test 20%

model = LinearRegression()
# Defining the model, I'm gonna use Linear Regression

X_train = X_train.values.reshape(-1, 1)
X_test = X_test.values.reshape(-1, 1)
# Reshape X_train to be a 2D array

model.fit(X_train, y_train)
# Training here with Train data

y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)
# Getting predictions

train_rmse = np.sqrt(mean_squared_error(y_train, y_train_pred))
test_rmse = np.sqrt(mean_squared_error(y_test, y_test_pred))

train_r2 = r2_score(y_train , y_train_pred)
test_r2 = r2_score(y_test , y_test_pred)

print(f'Train RMSE : {train_rmse:.3f} , Test RMSE : {test_rmse:.3f}')
print(f'Train R2 score : {train_r2} , Test R2 score : {test_r2}')

Train RMSE : 585.707 , Test RMSE : 598.134
Train R2 score : 0.1749696559584285 , Test R2 score : 0.14132262215695923


In [None]:
# Train and Test RMSE are very close to each other, so there is no sign of overfitting
# Train and Test R2 values are about 0.17 and 0.14, which is very low: Hour alone explains little of the variance, so the model is underfitting and needs more features

In [None]:
# Linear Regression with all scaled numeric columns used as features

features = ['Hour', 'Temperature(°C)' , 'Humidity(%)' , 'Wind speed (m/s)' , 'Visibility (10m)' ,
            'Dew point temperature(°C)' , 'Solar Radiation (MJ/m2)' , 'Rainfall(mm)' , 'Snowfall (cm)', 'Hour_Temp_Interaction']

X_f = df_cleaned_scaled[features]
y_f = df_cleaned_scaled['Rented Bike Count']
# Defining our Features(X) and Target(y)

X_train_f, X_test_f, y_train_f, y_test_f = train_test_split(X_f, y_f, test_size=0.2, random_state=42)
# Splitting dataset as Train 80% and Test 20%

model = LinearRegression()
# Defining the model, I'm gonna use Linear Regression

model.fit(X_train_f, y_train_f)
# Training here with Train data

y_train_f_pred = model.predict(X_train_f)
y_test_f_pred = model.predict(X_test_f)
# Getting predictions

train_rmse = np.sqrt(mean_squared_error(y_train_f, y_train_f_pred))
test_rmse = np.sqrt(mean_squared_error(y_test_f, y_test_f_pred))

train_r2 = r2_score(y_train_f , y_train_f_pred)
test_r2 = r2_score(y_test_f , y_test_f_pred)

print(f'Train RMSE : {train_rmse:.3f} , Test RMSE : {test_rmse:.3f}')
print(f'Train R2 score : {train_r2} , Test R2 score : {test_r2}')

Train RMSE : 448.222 , Test RMSE : 452.936
Train R2 score : 0.516834581024129 , Test R2 score : 0.5076128007572661


In [None]:
# Train and Test RMSE are close to each other but not identical, so there is no sign of overfitting so far
# Train and Test R2 values are around 0.52 and 0.51, which is not so bad, but the model still leaves about half of the variance unexplained

### 2.2. Ridge Regression (L2 Regularization)

In [None]:
# Ridge Regression (L2 Regularization)

from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV

features = ['Hour', 'Temperature(°C)' , 'Humidity(%)' , 'Wind speed (m/s)' , 'Visibility (10m)' ,
            'Dew point temperature(°C)' , 'Solar Radiation (MJ/m2)' , 'Rainfall(mm)' , 'Snowfall (cm)', 'Hour_Temp_Interaction']

params = {'alpha': [0.1, 1, 10, 100, 1000]}
ridge_model = GridSearchCV(Ridge(), params, cv=5, scoring='neg_mean_squared_error')

X_f = df_cleaned_scaled[features]
y_f = df_cleaned_scaled['Rented Bike Count']
# Defining our Features(X) and Target(y)

X_train_f, X_test_f, y_train_f, y_test_f = train_test_split(X_f, y_f, test_size=0.2, random_state=42)
# Splitting dataset as Train 80% and Test 20%

# ridge_model = Ridge(alpha=100)
# # Alpha -> penalty strength, e.g. 0.1 is a small penalty (predictions close to Linear Regression), 100 is a large penalty
# # Defining the model, I'm gonna use Ridge Regression

ridge_model.fit(X_train_f, y_train_f)
# Training here with Train data

y_train_f_pred = ridge_model.predict(X_train_f)
y_test_f_pred = ridge_model.predict(X_test_f)
# Getting predictions

train_rmse = np.sqrt(mean_squared_error(y_train_f, y_train_f_pred))
test_rmse = np.sqrt(mean_squared_error(y_test_f, y_test_f_pred))

train_r2 = r2_score(y_train_f , y_train_f_pred)
test_r2 = r2_score(y_test_f , y_test_f_pred)

train_mae = mean_absolute_error(y_train_f, y_train_f_pred)
test_mae = mean_absolute_error(y_test_f, y_test_f_pred)

train_mse = mean_squared_error(y_train_f, y_train_f_pred)
test_mse = mean_squared_error(y_test_f, y_test_f_pred)

print(f"Ridge Train MAE: {train_mae:.2f}, Ridge Test MAE: {test_mae:.2f}")
print(f"Ridge Train MSE: {train_mse:.2f}, Ridge Test MSE: {test_mse:.2f}")
print()
print(f'Ridge Train RMSE : {train_rmse:.3f} , Ridge Test RMSE : {test_rmse:.3f}')
print(f'Ridge Train R2 score : {train_r2:.5f} , Ridge Test R2 score : {test_r2:.5f}')

Ridge Train MAE: 320.07, Ridge Test MAE: 318.89
Ridge Train MSE: 200904.82, Ridge Test MSE: 205144.51

Ridge Train RMSE : 448.224 , Ridge Test RMSE : 452.929
Ridge Train R2 score : 0.51683 , Ridge Test R2 score : 0.50763


In [None]:
# I tried Ridge with GridSearchCV ('alpha': [0.1, 1, 10, 100, 1000]).
# With 'cv=5', 5-fold cross-validation is used here.
# Prediction performance did not change significantly compared to Linear Regression; it starts to drop at alpha=1000 when I checked manually
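
Rather than checking each alpha by hand, the fitted search object already stores the mean cross-validated score for every candidate in `cv_results_`. A minimal sketch on synthetic data (the `make_regression` data and the names `X_demo`, `y_demo` are assumptions for illustration, not the bike dataset):

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression problem used only for illustration
X_demo, y_demo = make_regression(n_samples=300, n_features=10, noise=20.0, random_state=0)

alphas = [0.1, 1, 10, 100, 1000]
search = GridSearchCV(Ridge(), {"alpha": alphas}, cv=5, scoring="neg_mean_squared_error")
search.fit(X_demo, y_demo)

# One row per alpha: mean and std of the validation score across the 5 folds
results = pd.DataFrame(search.cv_results_)[
    ["param_alpha", "mean_test_score", "std_test_score", "rank_test_score"]
]
print(results.sort_values("rank_test_score"))
print("Best alpha:", search.best_params_["alpha"])
```

With `scoring='neg_mean_squared_error'` the scores are negative, so values closer to zero are better.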

### 2.3. Lasso Regression (L1 Regularization)

In [None]:
# Lasso Regression (L1 Regularization)

from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV

features = ['Hour', 'Temperature(°C)' , 'Humidity(%)' , 'Wind speed (m/s)' , 'Visibility (10m)' ,
            'Dew point temperature(°C)' , 'Solar Radiation (MJ/m2)' , 'Rainfall(mm)' , 'Snowfall (cm)', 'Hour_Temp_Interaction']

param_lasso = {'alpha': [0.001, 0.01, 0.1, 1.0, 10]}
lasso_model = GridSearchCV(Lasso(max_iter=10000), param_lasso, cv=5, scoring='r2')

X_f_L1 = df_cleaned_scaled[features]
y_f_L1 = df_cleaned_scaled['Rented Bike Count']
# Defining our Features(X) and Target(y)

X_train_L1, X_test_L1, y_train_L1, y_test_L1 = train_test_split(X_f_L1, y_f_L1, test_size=0.2, random_state=42)
# Splitting dataset as Train 80% and Test 20%

lasso_model.fit(X_train_L1, y_train_L1)
# Training here with Train data

y_train_L1_pred = lasso_model.predict(X_train_L1)
y_test_L1_pred = lasso_model.predict(X_test_L1)
# Getting predictions

train_rmse = np.sqrt(mean_squared_error(y_train_L1, y_train_L1_pred))
test_rmse = np.sqrt(mean_squared_error(y_test_L1, y_test_L1_pred))

train_r2 = r2_score(y_train_L1 , y_train_L1_pred)
test_r2 = r2_score(y_test_L1 , y_test_L1_pred)

train_mae = mean_absolute_error(y_train_L1, y_train_L1_pred)
test_mae = mean_absolute_error(y_test_L1, y_test_L1_pred)

train_mse = mean_squared_error(y_train_L1, y_train_L1_pred)
test_mse = mean_squared_error(y_test_L1, y_test_L1_pred)

print(f"Lasso Train MAE: {train_mae:.2f}, Lasso Test MAE: {test_mae:.2f}")
print(f"Lasso Train MSE: {train_mse:.2f}, Lasso Test MSE: {test_mse:.2f}")
print()

print(f'Lasso Train RMSE : {train_rmse:.3f} , Lasso Test RMSE : {test_rmse:.3f}')
print(f'Lasso Train R2 score : {train_r2:.5f} , Lasso Test R2 score : {test_r2:.5f}')
print()
print('Coefficients : ')
print()
coefficients = pd.Series(lasso_model.best_estimator_.coef_, index=X_f_L1.columns)
print(coefficients.sort_values())

Lasso Train MAE: 320.14, Lasso Test MAE: 319.05
Lasso Train MSE: 200943.58, Lasso Test MSE: 205175.65

Lasso Train RMSE : 448.267 , Lasso Test RMSE : 452.963
Lasso Train R2 score : 0.51674 , Lasso Test R2 score : 0.50755

Coefficients : 

Humidity(%)                 -144.246523
Rainfall(mm)                 -69.946288
Solar Radiation (MJ/m2)      -67.572832
Wind speed (m/s)               3.036346
Snowfall (cm)                  3.206281
Visibility (10m)              12.254935
Dew point temperature(°C)     17.171651
Hour                          52.943364
Temperature(°C)              112.736063
Hour_Temp_Interaction        327.744165
dtype: float64


In [None]:
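# I tried Lasso with GridSearchCV ('alpha': [0.001, 0.01, 0.1, 1.0, 10]).
# With 'cv=5', 5-fold cross-validation is used here
# Prediction performance did not change significantly
# Hour_Temp_Interaction has the largest absolute coefficient on the standardized features, so it is the most influential feature in this model
# (it is built from Hour and Temperature, so these three coefficients should be read together)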

### 2.4. Elastic Net Regression

In [None]:
# Elastic Net Regression

from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV

features = ['Hour', 'Temperature(°C)' , 'Humidity(%)' , 'Wind speed (m/s)' , 'Visibility (10m)' ,
            'Dew point temperature(°C)' , 'Solar Radiation (MJ/m2)' , 'Rainfall(mm)' , 'Snowfall (cm)', 'Hour_Temp_Interaction']

param_grid_ENR = {
    'alpha': [0.01, 0.1, 1.0, 10],
    'l1_ratio': [0.1, 0.5, 0.9]  # 0 = Ridge, 1 = Lasso
}

model_ENR = GridSearchCV(ElasticNet(max_iter=10000), param_grid_ENR, cv=5, scoring='r2')

X_f = df_cleaned_scaled[features]
y_f = df_cleaned_scaled['Rented Bike Count']
# Defining our Features(X) and Target(y)

X_train_f, X_test_f, y_train_f, y_test_f = train_test_split(X_f, y_f, test_size=0.2, random_state=42)
# Splitting dataset as Train 80% and Test 20%


model_ENR.fit(X_train_f, y_train_f)
# Training here with Train data

y_train_f_pred = model_ENR.predict(X_train_f)
y_test_f_pred = model_ENR.predict(X_test_f)
# Getting predictions

train_rmse = np.sqrt(mean_squared_error(y_train_f, y_train_f_pred))
test_rmse = np.sqrt(mean_squared_error(y_test_f, y_test_f_pred))

train_r2 = r2_score(y_train_f , y_train_f_pred)
test_r2 = r2_score(y_test_f , y_test_f_pred)

train_mae = mean_absolute_error(y_train_f, y_train_f_pred)
test_mae = mean_absolute_error(y_test_f, y_test_f_pred)

train_mse = mean_squared_error(y_train_f, y_train_f_pred)
test_mse = mean_squared_error(y_test_f, y_test_f_pred)

print(f"Elastic Net Train MAE: {train_mae:.2f}, Elastic Net Test MAE: {test_mae:.2f}")
print(f"Elastic Net Train MSE: {train_mse:.2f}, Elastic Net Test MSE: {test_mse:.2f}")
print()
print(f'Elastic Net Train RMSE : {train_rmse:.3f} , Elastic Net Test RMSE : {test_rmse:.3f}')
print(f'Elastic Net Train R2 score : {train_r2:.5f} , Elastic Net Test R2 score : {test_r2:.5f}')

Elastic Net Train MAE: 320.06, Elastic Net Test MAE: 318.84
Elastic Net Train MSE: 200917.26, Elastic Net Test MSE: 205138.64

Elastic Net Train RMSE : 448.238 , Elastic Net Test RMSE : 452.922
Elastic Net Train R2 score : 0.51680 , Elastic Net Test R2 score : 0.50764


In [None]:
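# Elastic Net combines the L1 (Lasso) and L2 (Ridge) penalties. Results are nearly identical across all models so far (test R2 about 0.508); Elastic Net is marginally ahead of Ridge on the test set (0.50764 vs 0.50763), a difference too small to be meaningful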

# 3. Model building with polynomial features, model evaluation and validation

In [None]:
# Working with polynomial features

from sklearn.preprocessing import PolynomialFeatures

# Creating Poly Features with using Temperature(°C) and Hour columns
poly = PolynomialFeatures(degree=2, include_bias=False)

X_poly = poly.fit_transform(df_cleaned_scaled[['Temperature(°C)', 'Hour']])


df_Temp_Hour_Poly = pd.DataFrame(
    X_poly,
    columns=poly.get_feature_names_out(['Temperature(°C)', 'Hour'])
)

df_Temp_Hour_Poly.head(2)


Unnamed: 0,Temperature(°C),Hour,Temperature(°C)^2,Temperature(°C) Hour,Hour^2
0,-1.513957,-1.661325,2.292067,2.515175,2.76
1,-1.539074,-1.516862,2.368749,2.334563,2.30087


In [None]:
# Creating new dataframe which only includes Temp-Hour based poly features and 'Rented Bike Count'
df_cleaned_scaled_poly_Temp_Hour = pd.concat([df_cleaned_scaled['Rented Bike Count'], df_Temp_Hour_Poly], axis=1)

df_cleaned_scaled_poly_Temp_Hour.head()

Unnamed: 0,Rented Bike Count,Temperature(°C),Hour,Temperature(°C)^2,Temperature(°C) Hour,Hour^2
0,254,-1.513957,-1.661325,2.292067,2.515175,2.76
1,204,-1.539074,-1.516862,2.368749,2.334563,2.30087
2,173,-1.580936,-1.372399,2.499358,2.169674,1.883478
3,107,-1.59768,-1.227936,2.552582,1.961849,1.507826
4,78,-1.580936,-1.083473,2.499358,1.712901,1.173913


In [None]:
# Splitting features(X) and target(y)

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import numpy as np

X_poly_Temp_Hour = df_cleaned_scaled_poly_Temp_Hour.drop(columns=['Rented Bike Count'])
y_poly_Temp_Hour = df_cleaned_scaled_poly_Temp_Hour['Rented Bike Count']

In [None]:
X_train_poly, X_test_poly, y_train_poly, y_test_poly = train_test_split(X_poly_Temp_Hour, y_poly_Temp_Hour, test_size=0.2, random_state=42)

model_LR = LinearRegression()

model_LR.fit(X_train_poly , y_train_poly)


y_train_poly_pred = model_LR.predict(X_train_poly)
y_test_poly_pred = model_LR.predict(X_test_poly)
# Getting predictions

train_rmse_poly = np.sqrt(mean_squared_error(y_train_poly, y_train_poly_pred))
test_rmse_poly = np.sqrt(mean_squared_error(y_test_poly, y_test_poly_pred))

train_r2_poly = r2_score(y_train_poly , y_train_poly_pred)
test_r2_poly = r2_score(y_test_poly , y_test_poly_pred)

print(f'Train RMSE : {train_rmse_poly:.3f} , Test RMSE : {test_rmse_poly:.3f}')
print(f'Train R2 score : {train_r2_poly} , Test R2 score : {test_r2_poly}')

Train RMSE : 471.285 , Test RMSE : 480.738
Train R2 score : 0.4658345145808894 , Test R2 score : 0.44531033211833404


In [None]:
# Using polynomial features of Temperature and Hour gave better results than the model trained only on 'Hour', but worse results than the model trained on all numeric features
# So the idea here is: why not add these polynomial features to our main dataframe and combine them with the other features to train a better model

In [None]:
# Creating new dataframe by adding poly features to main scaled dataframe
df_cleaned_scaled_with_poly_features = pd.concat([df_cleaned_scaled, df_Temp_Hour_Poly], axis=1)

# Removing duplicate feature columns (the Temperature and Hour columns were duplicated by the concat)
df_cleaned_scaled_with_poly_features = df_cleaned_scaled_with_poly_features.T.drop_duplicates().T
df_cleaned_scaled_with_poly_features.head(2)


Unnamed: 0,Hour,Temperature(°C),Humidity(%),Wind speed (m/s),Visibility (10m),Dew point temperature(°C),Solar Radiation (MJ/m2),Rainfall(mm),Snowfall (cm),Hour_Temp_Interaction,Rented Bike Count,Holiday_Yes,Functioning Day_Yes,Seasons_Autumn,Seasons_Spring,Seasons_Summer,Seasons_Winter,Temperature(°C)^2,Temperature(°C) Hour,Hour^2
0,-1.661325,-1.513957,-1.042483,0.458476,0.925871,-1.659605,-0.655132,-0.1318,-0.171891,-0.841689,254.0,0.0,1.0,0.0,0.0,0.0,1.0,2.292067,2.515175,2.76
1,-1.516862,-1.539074,-0.99337,-0.892561,0.925871,-1.659605,-0.655132,-0.1318,-0.171891,-0.870912,204.0,0.0,1.0,0.0,0.0,0.0,1.0,2.368749,2.334563,2.30087
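
Transposing to drop duplicates compares column values rather than column names, and it converts the frame to a common dtype along the way, which is likely why `Rented Bike Count` shows as `254.0` above. A possible alternative, sketched on a toy frame with made-up values, removes repeated column names directly and keeps the original dtypes:

```python
import pandas as pd

# Toy frames with an overlapping column name, similar to concatenating
# the scaled frame with the polynomial feature frame
left = pd.DataFrame({"Hour": [0.5, 1.5], "count": [254, 204]})
right = pd.DataFrame({"Hour": [0.5, 1.5], "Hour^2": [0.25, 2.25]})

combined = pd.concat([left, right], axis=1)

# Keep only the first occurrence of each column name
deduped = combined.loc[:, ~combined.columns.duplicated()]

print(list(deduped.columns))  # ['Hour', 'count', 'Hour^2']
print(deduped.dtypes)         # 'count' stays int64
```

This assumes that columns sharing a name also hold the same values, which is the case here because both copies come from the same scaled frame.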


In [None]:
# Defining new Train and Test dataset with newly created dataframe
X_updated = df_cleaned_scaled_with_poly_features.drop(columns=['Rented Bike Count'])
y_updated = df_cleaned_scaled_with_poly_features['Rented Bike Count']

X_train, X_test, y_train, y_test = train_test_split(X_updated, y_updated, test_size=0.2, random_state=42)

param_grid_ENR = {
    'alpha': [0.01, 0.1, 1.0, 10],
    'l1_ratio': [0.1, 0.5, 0.9]  # 0 = Ridge, 1 = Lasso
}

model_ENR = GridSearchCV(ElasticNet(max_iter=10000), param_grid_ENR, cv=5, scoring='r2')

model_ENR.fit(X_train, y_train)

y_train_pred = model_ENR.predict(X_train)
y_test_pred = model_ENR.predict(X_test)


# Getting predictions

train_rmse = np.sqrt(mean_squared_error(y_train, y_train_pred))
test_rmse = np.sqrt(mean_squared_error(y_test, y_test_pred))

train_r2 = r2_score(y_train , y_train_pred)
test_r2 = r2_score(y_test , y_test_pred)

train_mae = mean_absolute_error(y_train, y_train_pred)
test_mae = mean_absolute_error(y_test, y_test_pred)

train_mse = mean_squared_error(y_train, y_train_pred)
test_mse = mean_squared_error(y_test, y_test_pred)

print(f"Elastic Net Train MAE: {train_mae:.2f}, Elastic Net Test MAE: {test_mae:.2f}")
print(f"Elastic Net Train MSE: {train_mse:.2f}, Elastic Net Test MSE: {test_mse:.2f}")
print()
print(f'Elastic Net Train RMSE : {train_rmse:.3f} , Elastic Net Test RMSE : {test_rmse:.3f}')
print(f'Elastic Net Train R2 score : {train_r2:.5f} , Elastic Net Test R2 score : {test_r2:.5f}')


Elastic Net Train MAE: 295.96, Elastic Net Test MAE: 301.51
Elastic Net Train MSE: 164461.24, Elastic Net Test MSE: 173748.36

Elastic Net Train RMSE : 405.538 , Elastic Net Test RMSE : 416.831
Elastic Net Train R2 score : 0.60448 , Elastic Net Test R2 score : 0.58298


Polynomial features were added to the main scaled dataframe as features, and Elastic Net Regression was used as the model. With the polynomial features, the model's scores improved. Note that this run also includes the Holiday, Functioning Day and Seasons columns, which the earlier Elastic Net run did not use, so part of the gain may come from them.

Elastic Net Regression results before poly features added:  
Elastic Net Train RMSE : 448.238 , Elastic Net Test RMSE : 452.922  
Elastic Net Train R2 score : 0.51680 , Elastic Net Test R2 score : 0.50764

Elastic Net Regression results after poly features added:  
Elastic Net Train RMSE : 405.538 , Elastic Net Test RMSE : 416.831  
Elastic Net Train R2 score : 0.60448 , Elastic Net Test R2 score : 0.58298

Test R2 score improved from around 0.51 to 0.58

###Best performing model is Elastic Net Regression with Poly features added so far

In [None]:
# Cross-validation (via GridSearchCV) was already used for the Ridge, Lasso and Elastic Net models above.
# Below, the models with and without polynomial features are compared on the full dataset, without the 80% - 20% train-test split

In [None]:
# Model with polynomial features used

from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np

# Features and target
X_updated = df_cleaned_scaled_with_poly_features.drop(columns=['Rented Bike Count'])
y_updated = df_cleaned_scaled_with_poly_features['Rented Bike Count']

# Defining parameter grid
param_grid_ENR = {
    'alpha': [0.01, 0.1, 1.0, 10],
    'l1_ratio': [0.1, 0.5, 0.9]  # 0 = Ridge, 1 = Lasso
}

# GridSearchCV will perform 5-fold cross-validation
grid_cv = GridSearchCV(
    estimator=ElasticNet(max_iter=10000),
    param_grid=param_grid_ENR,
    cv=5,
    scoring='r2',
    return_train_score=True
)

# Fit model using CV on entire data
grid_cv.fit(X_updated, y_updated)

# GridSearchCV refits the best model on the full data (refit=True by default)

# Predict on the same full data the model was fitted on, so the metrics below are in-sample
y_pred = grid_cv.predict(X_updated)

# Metrics
mae = mean_absolute_error(y_updated, y_pred)
mse = mean_squared_error(y_updated, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_updated, y_pred)

print("Best Parameters from GridSearch:", grid_cv.best_params_)
print(f"Elastic Net CV MAE: {mae:.2f}")
print(f"Elastic Net CV MSE: {mse:.2f}")
print(f"Elastic Net CV RMSE: {rmse:.2f}")
print(f"Elastic Net CV R² Score: {r2:.4f}")


Best Parameters from GridSearch: {'alpha': 1.0, 'l1_ratio': 0.5}
Elastic Net CV MAE: 324.08
Elastic Net CV MSE: 203728.99
Elastic Net CV RMSE: 451.36
Elastic Net CV R² Score: 0.5102


In [None]:
# Model without polynomial features used

from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np

# Features and target
X = df_cleaned_scaled.drop(columns=['Rented Bike Count'])
y = df_cleaned_scaled['Rented Bike Count']

# Defining parameter grid
param_grid_ENR = {
    'alpha': [0.01, 0.1, 1.0, 10],
    'l1_ratio': [0.1, 0.5, 0.9]  # 0 = Ridge, 1 = Lasso
}

# GridSearchCV will perform 5-fold cross-validation
grid_cv = GridSearchCV(
    estimator=ElasticNet(max_iter=10000),
    param_grid=param_grid_ENR,
    cv=5,
    scoring='r2',
    return_train_score=True
)

# Fit model using CV on entire data
grid_cv.fit(X, y)

# Predict on the same full data the model was fitted on, so the metrics below are in-sample
y_pred = grid_cv.predict(X)

# Metrics
mae = mean_absolute_error(y, y_pred)
mse = mean_squared_error(y, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y, y_pred)

print("Best Parameters from GridSearch:", grid_cv.best_params_)
print(f"Elastic Net CV MAE: {mae:.2f}")
print(f"Elastic Net CV MSE: {mse:.2f}")
print(f"Elastic Net CV RMSE: {rmse:.2f}")
print(f"Elastic Net CV R² Score: {r2:.4f}")


Best Parameters from GridSearch: {'alpha': 0.1, 'l1_ratio': 0.5}
Elastic Net CV MAE: 301.24
Elastic Net CV MSE: 180564.70
Elastic Net CV RMSE: 424.93
Elastic Net CV R² Score: 0.5659


Findings:

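With polynomial features, the model fitted on the full dataset (no train-test split) gives an R2 score of 0.51, while the model fitted with the 80/20 train-test split gives a test R2 score of 0.58. These two numbers are not directly comparable: the full-data metrics are computed on the same rows the model was fitted on, and GridSearchCV with `cv=5` uses unshuffled folds, whereas `train_test_split` shuffles the rows, so the two searches can select different hyperparameters. The 7-point gap should therefore not be read as an improvement caused by the split itself.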
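
The metrics printed in the two cells above come from predicting on the rows the search was fitted on. The cross-validated score itself is available as `best_score_`. A minimal sketch on synthetic data (the `make_regression` data and the shuffled `KFold` are assumptions for illustration, not what the notebook ran):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic data for illustration only
X_demo, y_demo = make_regression(n_samples=400, n_features=8, noise=30.0, random_state=1)

grid = {"alpha": [0.01, 0.1, 1.0, 10], "l1_ratio": [0.1, 0.5, 0.9]}
# Shuffled folds; an integer cv would use contiguous, unshuffled folds instead
folds = KFold(n_splits=5, shuffle=True, random_state=42)

search = GridSearchCV(ElasticNet(max_iter=10000), grid, cv=folds, scoring="r2")
search.fit(X_demo, y_demo)

cv_r2 = search.best_score_                               # mean R2 on held-out folds
in_sample_r2 = r2_score(y_demo, search.predict(X_demo))  # R2 on the rows used for fitting

print(f"Cross-validated R2: {cv_r2:.4f}")
print(f"In-sample R2:       {in_sample_r2:.4f}")
```

For hourly data ordered by date, shuffled folds mix neighbouring hours across train and validation sets, so whether to shuffle is a design choice that depends on how the model will be used.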

# 4. Final Report :

###Findings and key takeaways from the analysis :
- Several regression models and a few variations were applied to find the best prediction model
- Linear Regression was used alongside its regularized variants: Ridge, Lasso and Elastic Net
- K-fold cross-validation was used inside the hyperparameter search
- GridSearchCV was used to find the best hyperparameters
- Polynomial features and an interaction feature were used to improve the model
- Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE) and R-squared (R²) values were analyzed to assess model performance
- Overall best performance was obtained with Elastic Net Regression trained on all features, including the interaction feature and the polynomial features. The data was split 80-20 into train and test sets, and GridSearchCV with 5-fold cross-validation on the training set was used for tuning. The results are listed below.
- Results:
- Train MAE: 295.96 , Test MAE: 301.51
  - Train and test values are close but not identical: predictions are off by about 296 to 302 bikes on average, and the small gap shows no sign of overfitting
- Train MSE: 164461.24 , Test MSE: 173748.36
  - MSE penalizes large errors (outliers) heavily
- Train RMSE : 405.538 , Test RMSE : 416.831
  - RMSE is in the same unit as the target; at around 405 to 417 bikes it is higher than MAE because large errors weigh more
- Train R2 score : 0.60448 , Test R2 score : 0.58298
  - The model explains about 60% of the variance on train data and 58% on test (unseen) data
- Overall, the model does a decent job but still needs improvement, ideally to bring R2 above 0.80 or 0.90.
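
As a reminder of how the four metrics relate, the sketch below computes them by hand on four made-up values (not taken from the dataset) and compares them with the `sklearn.metrics` functions:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up hourly counts and predictions, chosen so the arithmetic is easy to follow
y_true = np.array([100, 200, 300, 400])
y_hat = np.array([110, 190, 320, 380])

errors = y_true - y_hat                          # [-10, 10, -20, 20]
mae = np.mean(np.abs(errors))                    # 15.0
mse = np.mean(errors ** 2)                       # 250.0
rmse = np.sqrt(mse)                              # about 15.81
ss_res = np.sum(errors ** 2)                     # 1000
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # 50000.0
r2 = 1 - ss_res / ss_tot                         # 0.98

print(mae, mse, round(rmse, 2), round(r2, 2))
print(mean_absolute_error(y_true, y_hat), mean_squared_error(y_true, y_hat), r2_score(y_true, y_hat))
```

RMSE (about 15.81) is larger than MAE (15.0) here because the two errors of 20 count more once squared, which is the same pattern as in the results above.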


###Feature importance and business implications :

Features are the backbone of a good supervised model. In this project it was clear that model accuracy can be improved by providing interaction and polynomial features: test R2 rose from about 0.51 to 0.58 once they were added alongside the categorical columns. When a model is trained with more relevant, well-labeled data, its performance generally improves.

###Recommendations for further improvements :

The dataset and its features play a key role in a model's success. Before building the model, it is worth spending a good amount of time on feature engineering so the model is fed with relevant, high-quality and necessary data. After doing everything possible with the dataset, it is better to use a pipeline that holds all candidate models together with GridSearchCV tuning; this shows the best model with its best parameters, and K-fold cross-validation is included as well. It was a good project, and in further studies I will pay more attention to the data provided to the model and create a pipeline with GridSearchCV included.
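
One possible shape for such a pipeline is sketched below. It is only an illustration on synthetic data: the `make_regression` data, the step names `scale` and `model`, and the alpha grids are assumptions, not the setup used in this notebook.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration only
X_demo, y_demo = make_regression(n_samples=400, n_features=8, noise=30.0, random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# The scaler lives inside the pipeline, so it is refitted on each training fold
pipe = Pipeline([("scale", StandardScaler()), ("model", LinearRegression())])

# One dict per candidate model; the "model" step itself is a searchable parameter
param_grid = [
    {"model": [LinearRegression()]},
    {"model": [Ridge()], "model__alpha": [0.1, 1, 10, 100]},
    {"model": [Lasso(max_iter=10000)], "model__alpha": [0.01, 0.1, 1.0, 10]},
    {"model": [ElasticNet(max_iter=10000)],
     "model__alpha": [0.01, 0.1, 1.0],
     "model__l1_ratio": [0.1, 0.5, 0.9]},
]

search = GridSearchCV(pipe, param_grid, cv=5, scoring="r2")
search.fit(X_tr, y_tr)

best_name = type(search.best_estimator_.named_steps["model"]).__name__
print("Best model:", best_name)
print(f"CV R2: {search.best_score_:.4f}, Test R2: {search.score(X_te, y_te):.4f}")
```

A `PolynomialFeatures` step could be added between the scaler and the model in the same way, with its `degree` included in the grid.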