<a href="https://colab.research.google.com/github/arka420/Arkadyuti-bike-sharing-demand-prediction-project/blob/main/Arkadyuti_Bike_Sharing_Demand_Praditction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<b> <h1 style="color:SeaGreen;font-family:verdana;">Seoul Bike Sharing Demand Prediction (Supervised ML - Regression) </h1></b>

<html>
    <img src="https://thumbs.dreamstime.com/b/rental-bikes-ny-new-york-usa-december-citi-bike-city-us-largest-share-program-more-than-stations-across-manhattan-153214290.jpg" alt="Rental bikes docked at a bike-sharing station" width="1000" height="400">
</html>

##### **Project Type**    - Regression
##### **Contribution**    - Individual - Arkadyuti Dhara


## **GitHub Link -**

[Github-link](https://github.com/arka420/Arkadyuti-bike-sharing-demand-prediction-project.git)

In contemporary urban areas, where environmental concerns and traffic congestion are growing challenges, bike sharing systems have emerged as a sustainable and convenient mode of transportation. These systems allow individuals to rent bikes for short periods. However, the success and efficiency of bike sharing systems rely heavily on accurate demand prediction. This is where machine learning comes into play, offering a promising way to forecast bike demand.

## <b> Data Description </b>

### <b> The dataset contains weather information (Temperature, Humidity, Windspeed, Visibility, Dewpoint, Solar radiation, Snowfall, Rainfall), the number of bikes rented per hour and date information.</b>


### <b>Attribute Information: </b>

* ### Date : year-month-day
* ### Rented Bike count - Count of bikes rented at each hour
* ### Hour - Hour of the day
* ### Temperature - Temperature in Celsius
* ### Humidity - %
* ### Windspeed - m/s
* ### Visibility - 10m
* ### Dew point temperature - Celsius
* ### Solar radiation - MJ/m2
* ### Rainfall - mm
* ### Snowfall - cm
* ### Seasons - Winter, Spring, Summer, Autumn
* ### Holiday - Holiday/No holiday
* ### Functional Day - NoFunc(Non Functional Hours), Fun(Functional hours)

# ***Let's Begin !***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go

from datetime import datetime
import datetime as dt

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor

from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score

from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
from sklearn.metrics import accuracy_score
from sklearn.metrics import mean_absolute_error

import warnings
warnings.filterwarnings('ignore')

In [None]:
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 2000)
pd.set_option('display.expand_frame_repr', False)

In [None]:
from google.colab import drive
drive.mount('/content/drive')

### Dataset Loading

In [None]:
# Importing the dataset
bike_df = pd.read_csv('/content/drive/MyDrive/AlmaBetter/Bike sharing demand prediction project/dataset/SeoulBikeData.csv',encoding='latin-1')

### Dataset First View

In [None]:
# Dataset First Look
bike_df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print(f' We have total {bike_df.shape[0]} rows and {bike_df.shape[1]} columns.')

### Dataset Information

In [None]:
# Dataset Info
bike_df.info()

 The bike sharing demand dataset contains the following information:
 - Data Types: Date is in object format, Rented Bike Count is int, Hour is int, Temperature is float, Humidity is int,
   Windspeed is float, Visibility is int, Dew point temperature is float,
   Solar radiation is float, Rainfall is float, Snowfall is float.
 - Non-null Counts: All columns have non-null values, indicating no missing data.
 - Memory Usage: The bike sharing demand dataset consumes approximately 958.2 KB of memory.



#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
# checking there is any duplicate values or not
len(bike_df[bike_df.duplicated()])

This dataset has no duplicate rows.

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
bike_df.isnull().sum()

There are no missing values either.

## ***Understanding Variables***

In [None]:
# Dataset Describe
bike_df.describe()

 These statistics provide an initial understanding of the data's central tendencies and variability, which can be useful for data exploration and analysis.


### Variables Description

A bike sharing demand prediction dataset typically contains several variables or features that describe various aspects of bike rental behavior. These datasets are often used for machine learning and predictive modeling tasks to forecast the demand for bike rentals based on data. Here are common variable descriptions in this dataset:
   ### <b>Date and time column: </b>
  - **Date**: The calendar date of each observation. The hour of the day is stored separately in the Hour column.

### <b>Categorical columns: </b>
 - **Season**: Represents the season of the year, often categorized as Spring, Summer, Autumn and Winter.

 - **Holiday**: A binary variable indicating whether the day is a holiday or not.

 - **Functioning Day**: Another binary variable representing whether the day is a Functioning Day or a non-Functioning Day.

### <b>Numerical columns: </b>
**Weather Conditions**:
  
   - **Temperature**: This column represents the temperature in degrees Celsius. It gives the information about the weather conditions in terms of how hot or cold it was.
   - **Humidity**: This column represents the relative humidity level as a percentage (%). It indicates the amount of moisture present in the air.
   - **Windspeed**: This column represents the wind speed in meters per second (m/s). Wind speed is an important weather parameter, as strong winds can affect outdoor activities.
   - **Rainfall**: This column represents the amount of rainfall in millimeters (mm).
   - **Solar Radiation**: This column represents solar radiation in megajoules per square meter (MJ/m2). Solar radiation measures the amount of energy received from the sun at the Earth's surface.
   - **Snowfall**: This column represents the amount of snowfall in centimeters (cm).
   - **Dew Point Temperature**: This column represents the dew point temperature in degrees Celsius.

### <b>Target column: </b>
- **Count of Bikes Rented**: The number of bikes rented by users.
   

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
print(bike_df.apply(lambda col:col.unique()))

## ***Data Wrangling***

In [None]:
# Define a dictionary to map old column names to new column names
column_name_change = {'Rented Bike Count':'target_count',
                      'Temperature(°C)':'Temperature','Humidity(%)':'Humidity',
       'Wind speed (m/s)':'Wind_speed','Visibility (10m)':'Visibility',
                       'Dew point temperature(°C)':'Dew_point_temperature',
       'Solar Radiation (MJ/m2)':'Solar_radiation','Rainfall(mm)':'Rainfall',
                       'Snowfall (cm)':'Snowfall','Functioning Day':'Functioning_day'}

# Rename columns using the dictionary
bike_df.rename(columns=column_name_change, inplace=True)

In [None]:
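# Convert the Date column to datetime so that 'month', 'day', and 'year' can be extracted.
# Assumption: the CSV stores dates day-first (DD/MM/YYYY); dayfirst=True keeps 01/12/2017 as 1 December.
bike_df['Date'] = pd.to_datetime(bike_df['Date'], dayfirst=True)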

bike_df['year'] = bike_df['Date'].dt.year
bike_df['month'] = bike_df['Date'].dt.month_name()

bike_df['day'] = bike_df['Date'].dt.day_name()

In [None]:
#creating a new column of "weekdays_weekend"
bike_df['week']=bike_df['day'].apply(lambda x : "weekend" if x=='Saturday' or x=='Sunday' else "weekday" )

In [None]:
# Convert these columns to the category dtype
cate_cols=['Hour','day','year','month','week','Functioning_day','Seasons','Holiday']
for col in cate_cols:
  bike_df[col]=bike_df[col].astype('category')

In [None]:
bike_df.info()

 - I renamed the columns to shorter, more readable names.

 - The Date column was stored as an object, so I converted it to a datetime type, which gives access to time-related properties.
 - From it I created three new columns ('year', 'month', and 'day') plus a 'week' column with two categories (weekend and weekday).
 - Lastly, I converted the 'Hour', 'day', 'year', 'month', 'week', 'Functioning_day', 'Seasons', and 'Holiday' columns to categorical columns.
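
The date handling above can be checked on a tiny example. This is only a sketch on a hypothetical three-row sample (`sample` is not part of the project data), and it assumes the date strings are day-first (DD/MM/YYYY). It shows how the `year`, `month`, `day`, and `week` columns can be derived once the dates are parsed with an explicit format.

```python
import pandas as pd

# Hypothetical three-row sample; the date strings are assumed to be day-first.
sample = pd.DataFrame({"Date": ["01/12/2017", "02/12/2017", "25/12/2017"]})

# An explicit format makes sure 01/12/2017 is read as 1 December, not 12 January.
sample["Date"] = pd.to_datetime(sample["Date"], format="%d/%m/%Y")

# Derive the calendar features from the datetime column.
sample["year"] = sample["Date"].dt.year
sample["month"] = sample["Date"].dt.month_name()
sample["day"] = sample["Date"].dt.day_name()

# dayofweek is 0 for Monday up to 6 for Sunday, so 5 and 6 mark the weekend.
sample["week"] = sample["Date"].dt.dayofweek.map(
    lambda d: "weekend" if d >= 5 else "weekday"
)

print(sample)
```

Using `dayofweek` instead of comparing day names is just one option; it avoids depending on the spelling of the day names.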

## ***Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:

# Define the columns for histplot and boxplot
columns_to_plot = ['target_count','Temperature', 'Humidity', 'Wind_speed',
                   'Visibility', 'Dew_point_temperature', 'Solar_radiation', 'Rainfall',
                   'Snowfall']

# Create subplots with two columns
fig, axes = plt.subplots(nrows=len(columns_to_plot), ncols=2, figsize=(12, 20))

# Loop through the columns and create histograms and boxplots
for i, col in enumerate(columns_to_plot):
    sns.histplot(data=bike_df, x=col, ax=axes[i, 0], kde=True, color='skyblue')
    sns.boxplot(data=bike_df, x=col, ax=axes[i, 1], color='lightcoral')
    axes[i, 0].set_xlabel('')  # Remove x-label for histograms
    axes[i, 1].set_xlabel('')  # Remove x-label for boxplots
    axes[i, 0].set_ylabel(col, fontsize=12)  # Set y-label for histograms

# Adjust the layout
plt.tight_layout()

# Show the plots
plt.show()

This figure shows the distribution of each numerical column. The target count is positively skewed, temperature is slightly left-skewed, humidity is close to a normal distribution, and wind speed is positively skewed. Visibility, solar radiation, rainfall, and snowfall are not normally distributed, while dew point temperature is approximately normally distributed.
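
Skewness can also be read as a number instead of being judged from the histogram. The sketch below uses synthetic data (the `demo` frame and its column names are hypothetical) to show how `DataFrame.skew()` separates a right-skewed column from a symmetric one. Values well above 0 indicate a long right tail, and values near 0 indicate symmetry.

```python
import numpy as np
import pandas as pd

# Synthetic data, only for illustration: one right-skewed and one symmetric column.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "right_skewed": rng.exponential(scale=1.0, size=5000),
    "symmetric": rng.normal(loc=0.0, scale=1.0, size=5000),
})

# Sample skewness per column: positive means a long right tail.
skewness = demo.skew()
print(skewness.round(2))
```

The same call on the numerical columns of `bike_df` would give one skewness value per feature.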

#### Chart - 2

In [None]:
# Group the data by the categorical column and calculate the total bike count for each category
grouped_data = bike_df.groupby('Functioning_day')['target_count'].sum().reset_index()

# Create a pie plot for 'Functioning_day'
plt.figure(figsize=(10, 6))  # Adjust the size here
plt.pie(grouped_data['target_count'], labels=grouped_data['Functioning_day'], autopct='%1.1f%%')
plt.title('Functioning Day Distribution')
plt.show()

# Group the data by the 'Seasons' column and calculate the total bike count for each season
grouped_data = bike_df.groupby('Seasons')['target_count'].sum().reset_index()

# Create a pie plot for 'Seasons'
plt.figure(figsize=(10, 6))  # Adjust the size here
plt.pie(grouped_data['target_count'], labels=grouped_data['Seasons'], autopct='%1.1f%%')
plt.title('Season Distribution')
plt.show()

# Group the data by the 'Holiday' column and calculate the total bike count for each category
grouped_data = bike_df.groupby('Holiday')['target_count'].sum().reset_index()

# Create a pie plot for 'Holiday'
plt.figure(figsize=(10, 6))  # Adjust the size here
plt.pie(grouped_data['target_count'], labels=grouped_data['Holiday'], autopct='%1.1f%%')
plt.title('Holiday Distribution')
plt.show()



1) **Bike Rental on Functioning day**:
   - On non-functioning days no bikes were rented. People do not use rented bikes on non-functioning days.



2) **Seasonal Bike Rental Patterns**:
   - The highest bike rental counts are observed during the summer season. This indicates a strong preference for renting bikes during the warmer months. Conversely, the demand for bike rentals is considerably lower in the winter season, as expected due to colder weather conditions.

3) **Bike Rentals on Non-Holidays**:
   - Most bikes were rented on non-holidays. Because the dataset contains far more non-holiday days than holidays, the totals alone do not show that hourly demand is higher; comparing the average count per hour would be a fairer test of whether regular working days are busier than holidays.




#### Chart - 3

In [None]:
numerical_columns = bike_df.select_dtypes(include=['int64', 'float64']).columns.tolist()
#printing the regression plot for all the numerical features
for col in numerical_columns:
  fig,ax=plt.subplots(figsize=(8,6))
  sns.regplot(x=bike_df[col],y=bike_df['target_count'],scatter_kws={"color": '#FFC0CB'}, line_kws={"color": "#033E3E"})

**The relationships between these weather-related numerical features and bike sharing demand are as follows: Temperature, wind speed, visibility, dew point temperature, and solar radiation have a positive influence on demand, while humidity, rainfall, and snowfall have a negative impact. These insights can help in optimizing bike sharing services based on weather conditions and forecasting demand.**

#### Chart - 4

In [None]:
# Create a function to categorize hours into custom time periods
def categorize_time_period(Hour):
    if 0 <= Hour < 6:
        return 'Midnight'
    elif 6 <= Hour < 9:
        return 'Early Morning'
    elif 9 <= Hour < 12:
        return 'Morning'
    elif 12 <= Hour < 18:
        return 'Afternoon'
    elif 18 <= Hour < 21:
        return 'Evening'
    else:
        return 'Night'

# Apply the categorization function to create a new column in the original DataFrame
bike_df['Time_Period'] = bike_df['Hour'].apply(categorize_time_period)

# Group the data by the 'Time_Period' column and calculate the total rented bike count(target count)
grouped_data = bike_df.groupby('Time_Period')['target_count'].sum().reset_index()

# Create a bar chart to visualize the rented bike demand over custom time periods
plt.figure(figsize=(10, 6))  # Adjust the figure size as needed
plt.bar(grouped_data['Time_Period'], grouped_data['target_count'])
plt.title('Rented Bike Demand by Time Period')
plt.xlabel('Time Period')
plt.ylabel('Rented Bike Count')
plt.xticks(rotation=45, ha='right')  # Rotate x-axis labels for better readability
plt.tight_layout()  # Adjust the layout to prevent label clipping
plt.show()

**High demand occurs during the afternoon and evening, while demand is low during the early morning, midnight, and morning hours.**
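
One caveat when comparing totals: the time periods do not all cover the same number of hours (Afternoon spans six hours, Evening only three). The sketch below uses a hypothetical frame in which every hour has exactly the same demand, and shows that the sum still differs between periods while the mean does not. `pd.cut` is used here as an alternative to the `categorize_time_period` function, with the same boundaries.

```python
import pandas as pd

# Hypothetical hourly data: every hour has the same demand of 10 bikes.
demo = pd.DataFrame({"Hour": range(24), "target_count": [10] * 24})

# Same boundaries as categorize_time_period; right=False makes bins [start, end).
bins = [0, 6, 9, 12, 18, 21, 24]
labels = ["Midnight", "Early Morning", "Morning", "Afternoon", "Evening", "Night"]
demo["Time_Period"] = pd.cut(demo["Hour"], bins=bins, labels=labels, right=False)

# The sum grows with the number of hours in a period; the mean does not.
summary = demo.groupby("Time_Period", observed=True)["target_count"].agg(["sum", "mean"])
print(summary)
```

Reporting the mean per period next to the sum makes it easier to tell genuine peaks from periods that are simply longer.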

#### Chart - 5

In [None]:

# Set the seaborn style
sns.set_theme(style="darkgrid")

# Create the bar plot
plt.figure(figsize=(20, 6))  # Set the figure size
ax = sns.barplot(data=bike_df, x='month', y='target_count', hue='year', palette="Set1", linewidth=2.5)

# Customize the plot
plt.title("Bike Sharing Demand across month", fontsize=16)  # Set the title and font size
plt.xlabel("Month", fontsize=12)  # Set the x-axis label and font size
plt.ylabel("Bike Count", fontsize=12)  # Set the y-axis label and font size
plt.legend(title="Year", title_fontsize=12, loc="upper right")  # Customize the legend

# Show the plot
plt.show()


**This bar plot indicates that the demand for rented bikes is increasing from year to year. Demand was highest in June 2018 and lowest in January and February. The demand for rented bikes in 2018 was significantly higher than in the previous year.**

#### Chart - 6

In [None]:

plt.figure(figsize=(20, 6))
sns.pointplot(x=bike_df['Hour'], y=bike_df['target_count'], hue=bike_df['Holiday'], palette={'No Holiday': 'b', 'Holiday': 'r'})
plt.title("Bike Rental Trend according to Hour on Holiday / No Holiday")
# Show the plot
plt.show()

**On non-holidays, bike rentals peak sharply around 8 AM and 6 PM, which coincides with typical office and school/college commuting hours. This observation highlights the importance of maintaining and optimizing the bike fleet and station availability during these peak hours to ensure a positive user experience and maximize revenue potential.**

#### Chart - 7

In [None]:
fig,ax=plt.subplots(figsize=(20,6))
sns.pointplot(data=bike_df,x='Hour',y='target_count',hue='Seasons',ax=ax)
ax.set(title='Count of rented bikes according to seasons')
plt.show()

**Demand is highest in Summer and lowest in Winter. In Summer, when demand is high, the company could charge slightly more for bike rentals. During the Winter, it could offer discounts to maintain a customer base.**

#### Chart - 8

In [None]:
plt.figure(figsize=(15, 5))
sns.lineplot(x='day', y='target_count', data=bike_df)
plt.title('Rented Bike Count Over Days')
plt.xlabel('Day')
plt.ylabel('Rented Bike Count')
plt.grid(True)
plt.show()


**Demand is higher on weekdays than on weekends, which suggests that many users rent bikes for commuting.**

#### Chart - 9

In [None]:
# Group the data by 'Time_Period' and calculate the average wind speed
average_wind_speed = bike_df.groupby('Time_Period')['Wind_speed'].mean()

# Create a bar chart
plt.figure(figsize=(8, 6))
plt.bar(average_wind_speed.index, average_wind_speed)
plt.xlabel('Time_Period')
plt.ylabel('Average Wind Speed')
plt.title('Average Wind Speed in Different Time Periods')
plt.show()


**Wind speed tends to be higher during the afternoon and evening, the same periods in which demand for rented bikes is high. This is consistent with the positive correlation between wind speed and rented bike count.**

#### Chart - 10

In [None]:

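# Sum the rented bike count per combination so the heatmap shows demand rather than the number of rows
pivot_table = bike_df.pivot_table(index=['Seasons', 'Holiday', 'Functioning_day'], values='target_count', aggfunc='sum')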

plt.figure(figsize=(10, 8))
sns.heatmap(pivot_table.unstack(), cmap='YlGnBu', annot=True, fmt='d')
plt.title('Bike Sharing Demand by Seasons, Holiday, and Functioning_day')
plt.xlabel('Functioning_day')  # unstack() moves the last index level to the columns
plt.ylabel('Seasons - Holiday')
plt.show()


**This chart combines several pieces of information at once. Demand for rented bikes is highest in Summer on non-holiday, functioning days. This suggests a strong preference for the bike-sharing service when the weather is favorable and people are not on holiday.**

#### Chart - 11

In [None]:
# Create a scatter plot with 'Temperature' on the x-axis, 'target_count' on the y-axis, and 'Hour' as color.
plt.figure(figsize=(12, 6))
sns.scatterplot(x='Temperature', y='target_count', hue='Hour', data=bike_df, palette='viridis')
plt.title("Scatter Plot of Rented_Bike_Count vs. Temperature (Colored by Hour)")
plt.show()


**This scatter plot illustrates the relationship between bike demand and temperature. Low temperatures correspond to lower demand, while temperatures between 25 and 30 degrees Celsius correspond to higher demand. In extremely cold weather, promoting alternative transportation services such as ridesharing can give users practical options.**

#### Chart - 12 - Correlation Heatmap

In [None]:
# Set the figure size
plt.figure(figsize=(15, 8))

# Create a correlation heatmap
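# numeric_only=True skips the datetime and category columns, which cannot be correlated
corr_matrix = bike_df.corr(numeric_only=True)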
heatmap = sns.heatmap(corr_matrix, cmap='YlGnBu', annot=True, fmt=".2f", linewidths=0.5)

# Add a title
plt.title("Correlation Heatmap", fontsize=15)

# Rotate the tick labels for better readability
plt.xticks(rotation=45, ha="right")
plt.yticks(rotation=0)

# Show the plot
plt.show()



 We can see from the heatmap that "Temperature" and "Dew Point Temperature" are highly correlated, so we can drop one of them. Because the correlation between temperature and our dependent variable "Bike Rented Count" is high, we will keep the Temperature column and drop the "Dew Point Temperature" column.


## ***Hypothesis Testing***

**Hypothesis testing is a statistical method that tests the validity of new ideas or theories against data.**

**Hypothesis 1:**

*Null Hypothesis (H0):* There is no significant difference in bike rental demand between different seasons (e.g., spring, summer, autumn, winter).

*Alternative Hypothesis (H1):* There is a significant difference in bike rental demand between at least two seasons.

In [None]:
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Fit a one-way ANOVA model with Seasons as the grouping factor
model = ols('target_count ~ Seasons', data=bike_df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

alpha = 0.05  # Significance level
p_value = anova_table['PR(>F)'].iloc[0]  # p-value of the Seasons row

if p_value < alpha:
    print("Reject the null hypothesis: There is a significant difference between seasons.")
else:
    print("Fail to reject the null hypothesis: There is no significant difference between seasons.")


print(p_value)
print(anova_table)

**Here I use ANOVA test because ANOVA is appropriate when we have more than two groups (in this case, four seasons) and we want to determine if there are any statistically significant differences in the means of these groups (seasons).**
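
To make the F statistic less of a black box, here is a minimal NumPy sketch of the one-way ANOVA computation. The helper `one_way_anova_f` and the three small groups are hypothetical and only illustrate the formula: the variance between group means divided by the variance within groups.

```python
import numpy as np

def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of 1-D samples."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    all_values = np.concatenate(groups)
    grand_mean = all_values.mean()
    k = len(groups)        # number of groups
    n = all_values.size    # total number of observations

    # Between-group sum of squares: how far each group mean is from the grand mean.
    ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: spread of the values around their own group mean.
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical demand samples for three groups.
winter = [1, 2, 3]
spring = [2, 3, 4]
summer = [5, 6, 7]
f_stat = one_way_anova_f([winter, spring, summer])
print(f_stat)
```

A large F value means the group means differ by much more than the spread inside each group would explain, which is what leads to a small p-value.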

**Hypothesis 2:**

***Null Hypothesis (H0)*: There is no linear relationship between temperature and bike rental demand.**

***Alternative Hypothesis (H1)*: There is a linear relationship between temperature and bike rental demand.**

In [None]:
import scipy.stats as stats

# Data for temperature and bike rental demand
temperature_data = bike_df['Temperature']
bike_rental_demand = bike_df['target_count']

# Perform the Pearson correlation test
correlation_coefficient, p_value = stats.pearsonr(temperature_data, bike_rental_demand)

# Print the results
print("Pearson correlation coefficient:", correlation_coefficient)
print("p-value:", p_value)

# Significance level (e.g., 0.05)
alpha = 0.05

# Compare the p-value to the significance level
if p_value < alpha:
    print("Reject the null hypothesis: There is a linear relationship between temperature and bike rental demand.")
else:
    print("Fail to reject the null hypothesis: There is no linear relationship between temperature and bike rental demand.")


**To test whether there is a linear relationship between two numerical variables, we can use a test designed for that purpose. A common choice is the Pearson correlation test.
Here the correlation coefficient is 0.53 and the p-value is less than alpha, so we reject the null hypothesis: there is a linear relationship.**

**Hypothesis 3:**

***Null Hypothesis (H0)*:  Holiday status and functioning-day status are independent of each other.**

***Alternative Hypothesis (H1)*:  Holiday status and functioning-day status are associated.**

In [None]:
import scipy.stats as stats
import numpy as np

# Create a contingency table of observed frequencies for the two categorical columns
observed_data = pd.crosstab(bike_df['Holiday'], bike_df['Functioning_day'])

# Perform the Chi-Square test
chi2, p_value, _, _ = stats.chi2_contingency(observed_data)

# Print the results
print("Chi-Square Statistic:", chi2)
print("p-value:", p_value)

# Set your significance level (e.g., 0.05)
alpha = 0.05

# Compare the p-value to the significance level
if p_value < alpha:
    print("Reject the null hypothesis: There is an association between holiday and functioning day.")
else:
    print("Fail to reject the null hypothesis: There is no evidence of an association between holiday and functioning day.")


observed_data


**The chi-square test of independence examines the association between two categorical variables by comparing the observed and expected frequencies in a contingency table. Here the null hypothesis is rejected, which means that Holiday and Functioning_day are not independent of each other.
Note that this test uses only the two categorical columns, so on its own it does not compare the distribution of bike sharing demand between the groups.**
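
The expected frequencies behind the chi-square statistic are easy to compute by hand. The sketch below uses a hypothetical 2x2 table (the numbers are made up, not taken from `observed_data`) and applies the plain Pearson formula. As a side note, `scipy.stats.chi2_contingency` applies Yates' continuity correction to 2x2 tables by default, so its statistic for the same table would be slightly smaller.

```python
import numpy as np

# Hypothetical 2x2 contingency table of observed counts.
observed = np.array([[10, 20],
                     [30, 40]])

# Under independence, expected count = row total * column total / grand total.
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / observed.sum()

# Pearson chi-square statistic without continuity correction.
chi2 = ((observed - expected) ** 2 / expected).sum()

print(expected)
print(round(chi2, 4))
```

The further the observed counts are from the expected ones, the larger the statistic and the smaller the p-value.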

##  ***Feature Selection***

In [None]:
# Import statsmodels and assign the feature matrix X and the target Y
import statsmodels.api as sm
X = bike_df[[ 'Temperature','Humidity',
       'Wind_speed', 'Visibility','Dew_point_temperature',
       'Solar_radiation', 'Rainfall', 'Snowfall']]

Y = (bike_df['target_count'])
X = sm.add_constant(X)
X

In [None]:
# fit a OLS model

model= sm.OLS(Y,X).fit()
model.summary()

**Using ols we can select the important features and also drop some unimportant features for prediction.
Visibility: The coefficient is -0.0097, but it is not statistically significant (P>|t| > 0.05), indicating that visibility may not have a strong impact on dependent variable.
Dew_point_temperature: This variable's coefficient is -0.7829, but it is not statistically significant (P>|t| > 0.05), suggesting it may not be a good predictor.
The condition number is used to check for multicollinearity. A high condition number may indicate multicollinearity among the independent variables.**

**Heatmap shows the correlation between Dew_point_temperature and temperature is very high(0.91).**  
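
Besides the condition number and the heatmap, multicollinearity can be quantified per feature with the variance inflation factor (VIF). The sketch below is not part of the original analysis: it uses synthetic data with hypothetical values, and computes the VIFs as the diagonal of the inverse correlation matrix, which equals 1 / (1 - R²) from regressing each feature on the others.

```python
import numpy as np
import pandas as pd

def vif_table(features: pd.DataFrame) -> pd.Series:
    """VIF of each column, taken from the diagonal of the inverse correlation matrix."""
    corr = features.corr().to_numpy()
    vif = np.diag(np.linalg.inv(corr))
    return pd.Series(vif, index=features.columns, name="VIF")

# Synthetic features: the second column is built to be strongly tied to the first.
rng = np.random.default_rng(0)
temperature = rng.normal(size=500)
demo = pd.DataFrame({
    "Temperature": temperature,
    "Dew_point_temperature": temperature + rng.normal(scale=0.3, size=500),
    "Humidity": rng.normal(size=500),
})

vif = vif_table(demo)
print(vif.round(2))
```

A common rule of thumb treats VIF values above roughly 5 to 10 as a sign of problematic multicollinearity; the threshold is a convention, not a strict rule.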

In [None]:
len(bike_df.columns)

In [None]:
# Dropping all the columns that do not have significant effect on target column
columns_to_drop=['Date','Visibility', 'Dew_point_temperature','day','Time_Period']
bike_df.drop(columns=columns_to_drop, inplace=True)

## ***Data Splitting***

The specific proportions allocated for training and testing may differ based on individual preferences, but a common choice is to use an 80:20 split, with 80% of the data dedicated to training and 20% set aside for testing.

In [None]:
#Assign the value in x and y
x= bike_df.drop(columns=['target_count'])
small_constant = 1
y =np.log10(bike_df['target_count']+small_constant)
x.sample(5)

**The target column is positively skewed, which is why a log transformation is applied. The constant 1 is added first so that hours with zero rentals do not produce log(0).**
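
Because the model is trained on `log10(count + 1)`, its predictions are on the log scale as well. The sketch below uses a few hypothetical counts to show the transformation and its inverse, `10 ** y - 1`, which is what would turn predictions back into numbers of bikes.

```python
import numpy as np

# Hypothetical rented bike counts, including an hour with zero rentals.
counts = np.array([0, 9, 99, 999])

# Forward transform used for the target: log10(count + 1).
y_log = np.log10(counts + 1)

# Inverse transform: undo the log and subtract the constant again.
recovered = 10 ** y_log - 1

print(y_log)
print(recovered)
```

Applying the inverse to model predictions before computing MAE or RMSE would express the errors in bikes rather than in log units.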

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=1)

## ***ML Model Implementation***

### Linear regression

In [None]:
# ML Model - 1 Implementation
step1 = ColumnTransformer(transformers=[
    ('col_tnf',OneHotEncoder(sparse_output=False,drop='first'),[0,7,8,9,10,11,12])# One-hot encode the categorical columns (sparse_output needs scikit-learn >= 1.2)
],remainder='passthrough')


step2 = LinearRegression()

lr_pipe = Pipeline([
    ('step1',step1),
    ('step2',step2)
])

# Fit the Algorithm
lr=lr_pipe.fit(x_train,y_train)

lr_y_pred = lr.predict(x_test)

In [None]:

lr_mae = mean_absolute_error(y_test, lr_y_pred)
lr_mse = mean_squared_error(y_test, lr_y_pred)
lr_rmse = np.sqrt(lr_mse)
lr_r2 = r2_score(y_test, lr_y_pred)
# Calculate the adjusted R-squared score
n = x_test.shape[0]  # Number of samples
p = x_test.shape[1]  # Number of features
lr_adjusted_r2 = 1 - (1 -lr_r2) * ((n - 1) / (n - p - 1))

In [None]:
print(f"Mean Absolute Error: {round(lr_mae, 2)}")
print(f"Mean Squared Error: {round(lr_mse, 2)}")
print(f"Root Mean Squared Error: {round(lr_rmse, 2)}")
print(f"R-squared: {round(lr_r2, 2)}")
print("Adjusted R2:", round(lr_adjusted_r2, 2))


**MAE, MSE, and RMSE measure the size of the prediction errors, while R-squared indicates how much of the variation in demand the model explains. The adjusted R-squared additionally accounts for model complexity. Here MAE, MSE, and RMSE are low, and the R-squared and adjusted R-squared values are good. Note that the errors are measured on the log10-transformed target, so they are not in units of bikes.**
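
The adjusted R-squared formula can be wrapped in a small helper. This is a sketch with hypothetical numbers; `adjusted_r2` is not a scikit-learn function. One design point worth noting: after one-hot encoding, the model sees more features than the raw DataFrame has columns, so the number of encoded features is arguably the better choice for `n_features`.

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R-squared: penalizes R-squared for the number of features used."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

# Hypothetical values: R-squared of 0.80 on 101 samples.
few_features = adjusted_r2(0.80, n_samples=101, n_features=10)
many_features = adjusted_r2(0.80, n_samples=101, n_features=50)

print(round(few_features, 4))
print(round(many_features, 4))
```

With the same R-squared, the score drops as the feature count grows, which is exactly the penalty the adjustment is meant to apply.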

####  ***Cross-Validation & Hyperparameter Tuning***

In [None]:
lr_scores = cross_val_score(lr_pipe, x_train, y_train, cv=5)  # 5-fold cross-validation
lr_scores

In [None]:
lr_average_score = round(np.mean(lr_scores),3)
lr_average_score

**The cross-validation score is slightly lower than the original R-squared value, so the model's performance is consistent across different subsets of the data, which is a positive sign. It suggests that the model's performance remains stable when tested on multiple data splits.**
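
Looking at the spread of the fold scores, not only their mean, gives a better sense of stability. The sketch below runs on synthetic data (`X_demo` and `y_demo` are hypothetical) and uses a shuffled `KFold`, which is an assumption on my part: passing `cv=5` for a regressor uses unshuffled folds, and shuffling can matter when the rows are ordered by date.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data, only for illustration.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Shuffled 5-fold split with a fixed seed so the folds are reproducible.
cv = KFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=cv, scoring="r2")

print("fold scores:", scores.round(3))
print("mean:", scores.mean().round(3), "std:", scores.std().round(3))
```

A small standard deviation across folds supports the claim that performance is stable across subsets of the data.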

In [None]:
from sklearn.model_selection import GridSearchCV

# Define the hyperparameter grid
param_grid = {
    'step1__col_tnf__drop': ['first', 'if_binary'],  # Example hyperparameter for OneHotEncoder
    'step2__fit_intercept': [True, False],  # Example hyperparameter for LinearRegression
}

# Create GridSearchCV
grid_search = GridSearchCV(lr_pipe, param_grid, cv=5, scoring='neg_mean_squared_error', n_jobs=-1)

# Fit the model
grid_search.fit(x_train, y_train)

# Get the best model
best_lr_model = grid_search.best_estimator_

# Make predictions
lr_y_pred = best_lr_model.predict(x_test)


In [None]:
# Calculate evaluation metrics
mse = mean_squared_error(y_test, lr_y_pred)
rmse = np.sqrt(mse)  # RMSE is the square root of MSE
mae = mean_absolute_error(y_test, lr_y_pred)
r_squared = r2_score(y_test, lr_y_pred)

# Print the evaluation metrics
print(f"Mean Absolute Error (MAE): {mae}")
print(f"Mean Squared Error (MSE): {mse}")
print(f"Root Mean Squared Error (RMSE): {rmse}")
print(f"R-squared (R²): {r_squared}")


**GridSearchCV is a popular choice for hyperparameter tuning. It helps find the set of hyperparameters that optimizes the model's performance. Here the result is the same as before; there is no improvement in R-squared.**
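
After a grid search it helps to look at which combination won and how the candidates compare. The sketch below is an illustration on synthetic data and swaps in `Ridge`, since it has an obvious hyperparameter (`alpha`) to tune; the data and the grid values are hypothetical.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression data, only for illustration.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Search over a small, hypothetical grid of regularization strengths.
search = GridSearchCV(
    Ridge(),
    {"alpha": [0.01, 1.0, 100.0]},
    cv=5,
    scoring="neg_mean_squared_error",
)
search.fit(X_demo, y_demo)

# best_score_ is the negated MSE, so flip the sign to read it as an error.
print("best params:", search.best_params_)
print("best CV MSE:", -search.best_score_)
print("mean CV MSE per candidate:", -search.cv_results_["mean_test_score"])
```

If all candidates score almost the same, as reported for the linear regression above, the tuned hyperparameters simply have little influence on this model.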

In [None]:
# lr_y_pred: A list containing your predicted values
# y_test: containing your actual values

# Create a figure with a specified size
plt.figure(figsize=(15, 10))

# Plot the predicted values as one line
plt.plot(lr_y_pred)

# Plot the actual values as another line
plt.plot(np.array(y_test))

# Add a legend to distinguish between the predicted and actual lines
plt.legend(["Predicted", "Actual"])

# Label the x-axis
plt.xlabel('No of Test Data')

# Display the plot
plt.show()


In [None]:
# Define a function to draw a scatter plot of predicted vs. actual values.
def plot_scatter(y_pred,y_test):
  '''Plot a scatter plot of y_test values against
  y_pred values to check how close they are to the regression line.'''
  plt.figure(figsize=(16,5))
  sns.regplot(x=y_test,y=y_pred,scatter_kws={'color':'#6082B6'},line_kws={'color':'#34495E'})
  plt.xlabel('Actual')
  plt.ylabel("Predicted")
  plt.title("Actual v/s Predicted")


In [None]:
# Check how close the predicted and actual values are to the regression line
plot_scatter(lr_y_pred,y_test)

### Ridge Regression

In [None]:
step1 = ColumnTransformer(transformers=[
    ('col_tnf', OneHotEncoder(sparse_output=False, drop='first'), [0,7,8,9,10,11,12])# One-hot encode the categorical columns (sparse_output needs scikit-learn >= 1.2)
], remainder='passthrough')

step2 = Ridge(alpha=10)

rid = Pipeline([
    ('step1', step1),
    ('step2', step2)
])

rid.fit(x_train, y_train)

rid_y_pred = rid.predict(x_test)

In [None]:

rid_mae = mean_absolute_error(y_test, rid_y_pred)
rid_mse = mean_squared_error(y_test, rid_y_pred)
rid_rmse = np.sqrt(rid_mse)
rid_r2 = r2_score(y_test, rid_y_pred)
# Calculate the adjusted R-squared score
n = x_test.shape[0]  # Number of samples
p = x_test.shape[1]  # Number of features
rid_adjusted_r2 = 1 - (1 -rid_r2) * ((n - 1) / (n - p - 1))

In [None]:
print(f"Mean Absolute Error: {round(rid_mae, 2)}")
print(f"Mean Squared Error: {round(rid_mse, 2)}")
print(f"Root Mean Squared Error: {round(rid_rmse, 2)}")
print(f"R-squared: {round(rid_r2, 3)}")
print("Adjusted R2:", round(rid_adjusted_r2, 3))


####  ***Cross-Validation & Hyperparameter Tuning***

In [None]:
rid_scores = cross_val_score(rid, x_train, y_train, cv=5)  # 5-fold cross-validation
rid_scores

In [None]:
rid_average_score = round(np.mean(rid_scores),3)
rid_average_score

**The cross-validation score is again slightly lower than the original R-squared value, so the model's performance is consistent across different subsets of the data, which is a positive sign.**

In [None]:
# Create a parameter grid for Ridge alpha values
param_grid = {'step2__alpha': [1e-15,1e-10,1e-8,1e-5,1e-4,1e-3,1e-2,1,5,10,20,30,35,40,45,50,55,60,100,0.5,1.5,1.6,1.7,1.8,1.9]}

# Use the same pipeline as defined earlier
ridge_pipeline = Pipeline([
    ('step1', step1),
    ('step2', Ridge())  # Use the default alpha here
])

# Create a GridSearchCV object
ridge_grid_search = GridSearchCV(ridge_pipeline, param_grid, cv=5)
ridge_grid_search.fit(x_train, y_train)

# Get the best hyperparameters
best_alpha = ridge_grid_search.best_params_['step2__alpha']
print(f' best alpha value is {best_alpha}')
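
The alpha grid above is written out by hand. As a hedged alternative, the sketch below builds a log-spaced grid with `np.logspace` and inspects `cv_results_` to see how flat the score is around the best value. The synthetic data and the `StandardScaler` step are assumptions made so the example runs on its own; only the `step2__alpha` naming mirrors the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the notebook's dataset)
X, y = make_regression(n_samples=300, n_features=8, noise=15.0, random_state=1)

demo_pipeline = Pipeline([
    ("scale", StandardScaler()),  # placeholder preprocessing step
    ("step2", Ridge()),
])

# 13 alphas spaced evenly on a log scale between 1e-4 and 1e2
alphas = np.logspace(-4, 2, 13)
search = GridSearchCV(demo_pipeline, {"step2__alpha": alphas}, cv=5, scoring="r2")
search.fit(X, y)

# Rank every candidate by its mean cross-validated score
results = (
    pd.DataFrame(search.cv_results_)[
        ["param_step2__alpha", "mean_test_score", "std_test_score"]
    ]
    .sort_values("mean_test_score", ascending=False)
)
print(results.head())
print("Best alpha:", search.best_params_["step2__alpha"])
```

If the top rows have nearly identical mean scores, the exact choice of alpha matters little and any improvement from tuning should be read with caution.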

In [None]:
step1 = ColumnTransformer(transformers=[
    ('col_tnf', OneHotEncoder(sparse=False, drop='first'), [0,7,8,9,10,11,12])# Encoding for categorical columns
], remainder='passthrough')

step2 = Ridge(alpha=0.5)

rid = Pipeline([
    ('step1', step1),
    ('step2', step2)
])

rid.fit(x_train, y_train)

rid_y_pred = rid.predict(x_test)

In [None]:

rid_mae = mean_absolute_error(y_test, rid_y_pred)
rid_mse = mean_squared_error(y_test, rid_y_pred)
rid_rmse = np.sqrt(rid_mse)
rid_r2 = r2_score(y_test, rid_y_pred)
# Calculate the adjusted R-squared score
n = x_test.shape[0]  # Number of samples
p = x_test.shape[1]  # Number of features
rid_adjusted_r2 = 1 - (1 -rid_r2) * ((n - 1) / (n - p - 1))

In [None]:
print(f"Mean Absolute Error: {round(rid_mae, 2)}")
print(f"Mean Squared Error: {round(rid_mse, 2)}")
print(f"Root Mean Squared Error: {round(rid_rmse, 2)}")
print(f"R-squared: {round(rid_r2, 3)}")
print("Adjusted R2:", round(rid_adjusted_r2, 3))

**Slight improvement in the evaluation metrics.**

In [None]:
#Plot the figure
plt.figure(figsize=(15,10))
plt.plot(rid_y_pred)
plt.plot(np.array(y_test))
plt.legend(["Predicted","Actual"])
plt.xlabel('No of Test Data')
plt.show()

In [None]:
# Check how close the predicted and actual values lie to the regression line
plot_scatter(rid_y_pred,y_test)

### Lasso Regression

In [None]:
step1 = ColumnTransformer(transformers=[
    ('col_tnf',OneHotEncoder(sparse=False,drop='first'),[0,7,8,9,10,11,12])# Encoding for categorical columns
],remainder='passthrough')

step2 = Lasso(alpha=0.001)

las= Pipeline([
    ('step1',step1),
    ('step2',step2)
])

las.fit(x_train,y_train)

las_y_pred = las.predict(x_test)

In [None]:
las_mae = mean_absolute_error(y_test, las_y_pred)
las_mse = mean_squared_error(y_test, las_y_pred)
las_rmse = np.sqrt(las_mse)
las_r2 = r2_score(y_test, las_y_pred)
# Calculate the adjusted R-squared score
n = x_test.shape[0]  # Number of samples
p = x_test.shape[1]  # Number of features
las_adjusted_r2 = 1 - (1 -las_r2) * ((n - 1) / (n - p - 1))

In [None]:
print(f"Mean Absolute Error: {las_mae}")
print(f"Mean Squared Error: {las_mse}")
print(f"Root Mean Squared Error: {las_rmse}")
print(f"R-squared: {las_r2}")
print("Adjusted R2:", las_adjusted_r2)

#### **Hyperparameter Tuning**

In [None]:
# Define the parameter grid for Lasso
lasso_param_grid = {
    'step2__alpha': [1e-15,1e-13,1e-10,1e-8,1e-5,1e-4,1e-3,1e-2,1e-1,1,5,10,20,30,40,45,50,55,60,100,0.0014]  # Adjust the values as needed
}

# Create the pipeline
step1 = ColumnTransformer(transformers=[
    ('col_tnf', OneHotEncoder(sparse=False, drop='first'), [0, 7, 8, 9, 10, 11, 12])  # Encoding for categorical columns
], remainder='passthrough')

step2 = Lasso(alpha=0.001)

las = Pipeline([
    ('step1', step1),
    ('step2', step2)
])

# Create GridSearchCV object for hyperparameter tuning
lasso_grid_search = GridSearchCV(las, lasso_param_grid, cv=5)

# Fit the model with grid search
lasso_grid_search.fit(x_train, y_train)

# Get the best hyperparameters
best_alpha = lasso_grid_search.best_params_['step2__alpha']

# Fit the Lasso model with the best hyperparameters
best_lasso_model = Pipeline([
    ('step1', step1),
    ('step2', Lasso(alpha=best_alpha))
])
best_lasso_model.fit(x_train, y_train)

# Make predictions
las_y_pred = best_lasso_model.predict(x_test)

# Print the best hyperparameters
print("Best Alpha for Lasso:", best_alpha)

In [None]:
# Calculate predictions with the tuned Lasso model
las_y_pred = best_lasso_model.predict(x_test)

# Calculate Mean Absolute Error (MAE)
las_mae = mean_absolute_error(y_test, las_y_pred)

# Calculate Mean Squared Error (MSE)
las_mse = mean_squared_error(y_test, las_y_pred)

# Calculate Root Mean Squared Error (RMSE)
las_rmse = np.sqrt(las_mse)

# Calculate R-squared (R²) - Coefficient of Determination
las_r2 = r2_score(y_test, las_y_pred)

# Calculate the adjusted R-squared score
n = x_test.shape[0]  # Number of samples
p = x_test.shape[1]  # Number of features
las_adjusted_r2 = 1 - (1 - las_r2) * ((n - 1) / (n - p - 1))

# Print the metrics (the las_* values computed above, not undefined or stale names)
print("Mean Absolute Error (MAE):", las_mae)
print("Mean Squared Error (MSE):", las_mse)
print("Root Mean Squared Error (RMSE):", las_rmse)
print("R-squared (R²):", las_r2)
print("Adjusted R2:", las_adjusted_r2)


In [None]:
# las_y_pred: A list containing predicted values
# y_test: A list or numpy array containing actual values

# Create a figure with a specified size
plt.figure(figsize=(15, 10))

# Plot the predicted values as one line
plt.plot(las_y_pred)

# Plot the actual values as another line
plt.plot(np.array(y_test))

# Add a legend to distinguish between the predicted and actual lines
plt.legend(["Predicted", "Actual"])

# Label the x-axis
plt.xlabel('No of Test Data')

# Display the plot
plt.show()


In [None]:
# Check how close the predicted and actual values lie to the regression line
plot_scatter(las_y_pred,y_test)

### Random Forest

In [None]:
step1 = ColumnTransformer(transformers=[
    ('col_tnf',OneHotEncoder(sparse=False,drop='first'),[0,7,8,9,10,11,12])# Encoding for categorical columns
],remainder='passthrough')

step2 = RandomForestRegressor(n_estimators=500,
                              random_state=5,
                              max_samples=0.5,
                              max_features=0.75,
                              max_depth=15)

rf = Pipeline([
    ('step1',step1),
    ('step2',step2)
])

rf.fit(x_train,y_train)

rf_y_pred = rf.predict(x_test)

In [None]:

rf_mae = mean_absolute_error(y_test, rf_y_pred)
rf_mse = mean_squared_error(y_test, rf_y_pred)
rf_rmse = np.sqrt(rf_mse)
rf_r2 = r2_score(y_test, rf_y_pred)
# Calculate the adjusted R-squared score
n = x_test.shape[0]  # Number of samples
p = x_test.shape[1]  # Number of features
rf_adjusted_r2 = 1 - (1 -rf_r2) * ((n - 1) / (n - p - 1))

In [None]:
print(f"Mean Absolute Error: {round(rf_mae, 2)}")
print(f"Mean Squared Error: {round(rf_mse, 2)}")
print(f"Root Mean Squared Error: {round(rf_rmse, 2)}")
print(f"R-squared: {round(rf_r2, 4)}")
print("Adjusted R2:", round(rf_adjusted_r2, 4))


In [None]:
rf_scores = cross_val_score(rf, x_train, y_train, cv=5)  # 5-fold cross-validation
rf_scores

In [None]:
rf_average_score = round(np.mean(rf_scores),3)
rf_average_score

**The cross-validation scores are very close to each other and to the score reported above, which suggests that the model is not overfitting.**

In [None]:
# Define the parameter grid
n_estimators = [60, 80, 100]
max_depth = [15, 20]
max_leaf_nodes = [40, 60, 80]

param_grid = {
    'step2__n_estimators': n_estimators,
    'step2__max_depth': max_depth,
    'step2__max_leaf_nodes': max_leaf_nodes
}

# Create the pipeline
step1 = ColumnTransformer(transformers=[
    ('col_tnf', OneHotEncoder(sparse=False, drop='first'), [0, 7, 8, 9, 10, 11, 12])
], remainder='passthrough')

step2 = RandomForestRegressor(n_estimators=500, random_state=5, max_samples=0.5, max_features=0.75, max_depth=15)

rf1 = Pipeline([
    ('step1', step1),
    ('step2', step2)
])

# Create GridSearchCV object for hyperparameter tuning
grid_search = GridSearchCV(rf1, param_grid, cv=5)

# Fit the model with grid search
grid_search.fit(x_train, y_train)

# Get the best hyperparameters and the best estimator
best_params = grid_search.best_params_
best_estimator = grid_search.best_estimator_

# Make predictions using the best estimator
rf_y_pred = best_estimator.predict(x_test)

# Calculate evaluation metrics
mse = mean_squared_error(y_test, rf_y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_test, rf_y_pred)
r2 = r2_score(y_test, rf_y_pred)

# Print the metrics and best hyperparameters
print("Best Hyperparameters:", best_params)
print("Mean Squared Error (MSE):", mse)
print("Root Mean Squared Error (RMSE):", rmse)
print("Mean Absolute Error (MAE):", mae)
print("R-squared (R²):", r2)


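**The original model fits the data slightly better than the GridSearchCV-tuned model. This is plausible given the search space: the grid only tries smaller forests (60 to 100 trees instead of 500) and caps each tree at 40 to 80 leaf nodes, so every candidate is more constrained than the original model. A slightly lower R-squared after tuning therefore does not by itself indicate a problem such as overfitting.**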
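
To illustrate how strongly `max_leaf_nodes` constrains a forest, the sketch below fits two small forests on synthetic data (an assumption for illustration, not the bike-sharing data) and compares the average number of leaves per tree with and without the cap. The helper name `mean_leaves` is hypothetical.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (assumption: not the notebook's dataset)
X, y = make_regression(n_samples=1000, n_features=10, noise=5.0, random_state=0)

# Same depth limit as the notebook, with and without a leaf-node cap
unconstrained = RandomForestRegressor(
    n_estimators=20, max_depth=15, random_state=5
).fit(X, y)
capped = RandomForestRegressor(
    n_estimators=20, max_depth=15, max_leaf_nodes=80, random_state=5
).fit(X, y)


def mean_leaves(forest):
    """Average number of leaves across the fitted trees of a forest."""
    return sum(tree.get_n_leaves() for tree in forest.estimators_) / len(
        forest.estimators_
    )


print("Mean leaves per tree (no cap):", mean_leaves(unconstrained))
print("Mean leaves per tree (cap=80):", mean_leaves(capped))
print("Training R2 (no cap):", round(unconstrained.score(X, y), 3))
print("Training R2 (cap=80):", round(capped.score(X, y), 3))
```

Adding `None` to the `max_leaf_nodes` grid would let the search also consider uncapped trees, which is one way to make the tuned model comparable with the original one.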

In [None]:

# rf_y_pred: A list containing predicted values from a random forest regression model
# y_test: containing actual values

# Create a figure with a specified size
plt.figure(figsize=(15, 10))

# Plot the predicted values as one line
plt.plot(rf_y_pred)

# Plot the actual values as another line
plt.plot(np.array(y_test))

# Add a legend to distinguish between the predicted and actual lines
plt.legend(["Predicted", "Actual"])

# Label the x-axis
plt.xlabel('No of Test Data')

# Display the plot
plt.show()


In [None]:
# Check how close the predicted and actual values lie to the regression line
plot_scatter(rf_y_pred,y_test)

**Evaluation metrics comparison**

In [None]:
# Define the metrics and models
metrics = ["MAE", "MSE", "RMSE", "R-squared", "Adjusted R2"]
linear_reg_metrics = [lr_mae, lr_mse, lr_rmse, lr_r2, lr_adjusted_r2]
ridge_reg_metrics = [rid_mae, rid_mse, rid_rmse, rid_r2, rid_adjusted_r2]
lasso_reg_metrics = [las_mae, las_mse, las_rmse, las_r2, las_adjusted_r2]
random_forest_metrics = [rf_mae, rf_mse, rf_rmse, rf_r2, rf_adjusted_r2]

# Create a bar chart
fig, ax = plt.subplots(figsize=(10, 6))
width = 0.2

x = range(len(metrics))
plt.bar(x, linear_reg_metrics, width, label="Linear Regression", align="center")
plt.bar([i + width for i in x], ridge_reg_metrics, width, label="Ridge Regression", align="center")
plt.bar([i + 2 * width for i in x], lasso_reg_metrics, width, label="Lasso Regression", align="center")
plt.bar([i + 3 * width for i in x], random_forest_metrics, width, label="Random Forest", align="center")

ax.set_xticks([i + 1.5 * width for i in x])
ax.set_xticklabels(metrics)
plt.xlabel("Evaluation Metrics")
plt.title("Comparison of Evaluation Metrics for Different Models")

plt.legend(loc="best")
plt.tight_layout()
plt.show()


The choice of the final prediction model depends on the specific goals and requirements of the project, as well as considerations for model interpretability and computational complexity. However, based on the evaluation metrics for the Random Forest Regressor, it seems like a strong candidate for the final prediction model.

Here's a brief analysis of the evaluation metrics:

1. **Mean Absolute Error (MAE)**: A low MAE of 0.121 suggests that, on average, the model's predictions are very close to the actual values. This is a good indication of the model's accuracy.

2. **Mean Squared Error (MSE)**: A low MSE of 0.0342 also suggests that the model's predictions are accurate and relatively consistent, with few large errors.

3. **Root Mean Squared Error (RMSE)**: An RMSE of 0.1849 indicates that the model's typical prediction error is small, which is a positive sign.

4. **R-squared (R²)**: An R-squared value of 0.9291 is quite high, meaning that the model explains about 93% of the variance in the target variable. This suggests that the model is a good fit for the data.

5. **Adjusted R-squared**: The adjusted R-squared value is also high, indicating that the fit remains strong after penalising for the number of features.

The Random Forest Regressor has provided excellent results and is a strong choice for the final prediction model because of its predictive performance, its robustness to overfitting, and its ability to capture complex relationships in the data.
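
The five metrics discussed above are computed separately for each model in this notebook. As a sketch of how they could be gathered in one place, the hypothetical helper `regression_report` below returns all five for a set of predictions, and a small table compares two models. The target values and the names "Model A" and "Model B" are toy assumptions, not notebook results.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def regression_report(y_true, y_pred, n_predictors):
    """Return MAE, MSE, RMSE, R-squared and adjusted R-squared as a dict."""
    n = len(y_true)
    mse = mean_squared_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    return {
        "MAE": mean_absolute_error(y_true, y_pred),
        "MSE": mse,
        "RMSE": np.sqrt(mse),
        "R-squared": r2,
        "Adjusted R2": 1 - (1 - r2) * (n - 1) / (n - n_predictors - 1),
    }


# Toy values for illustration only
y_true = np.array([3.0, 5.0, 7.0, 9.0, 11.0, 13.0])
predictions = {
    "Model A": np.array([2.5, 5.5, 7.0, 9.5, 10.5, 13.0]),
    "Model B": np.array([3.0, 5.0, 7.5, 9.0, 11.0, 12.5]),
}

# One row per model, one column per metric
table = pd.DataFrame(
    {name: regression_report(y_true, pred, n_predictors=1)
     for name, pred in predictions.items()}
).T
print(table.round(3))
```

A table like this can be easier to read than a single bar chart, because error metrics and R-squared values live on different scales.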



In [None]:
# Get feature importances
feature_importances = rf.named_steps['step2'].feature_importances_

# Assuming you have the column names (feature names) from step1
feature_names = rf.named_steps['step1'].get_feature_names_out(input_features=x_train.columns)

# Create a DataFrame to display the feature importances with their corresponding feature names
importances_df = pd.DataFrame({'Feature': feature_names, 'Importance': feature_importances})

# Sort the DataFrame by importance in descending order
importances_df = importances_df.sort_values(by='Importance', ascending=False)

# Display the top N important features (here, the top 20)
top_n = 20
print(importances_df.head(top_n))


In [None]:
# Sort the DataFrame by importance in ascending order
importances_df = importances_df.sort_values(by='Importance', ascending=True)

# Display the top N important features (here, the top 20)
top_n = 20
top_features = importances_df.tail(top_n)  # Use .tail() to get the top features in ascending order

# Create a horizontal bar chart using Plotly
fig = px.bar(top_features, x='Importance', y='Feature', orientation='h', title=f'Top {top_n} Important Features')

# Customize the layout (optional)
fig.update_layout(
    xaxis_title='Importance',
    yaxis_title='',
    xaxis=dict(tickformat='.0%'),  # show importances as whole percentages
)

# Show the plot
fig.show()


The feature importance scores represent the contribution of each feature to the predictions made by the Random Forest Regressor model for bike sharing demand. These scores indicate how much each feature influences the model's output. Here is an explanation of the feature importance results:

1. **col_tnf__Functioning_day_Yes (0.482)**: This feature has the highest importance. It suggests that whether or not it is a functioning day has the most significant impact on bike sharing demand according to the model.

2. **remainder__Temperature (0.166)**: Temperature is the second most important feature. It indicates that temperature plays a substantial role in predicting bike demand.

3. **remainder__Humidity (0.087)**: Humidity is also an important factor in predicting bike sharing demand, although its importance is about half that of temperature.

4. **remainder__Rainfall (0.069)**: Rainfall is another significant feature. It suggests that rainy conditions affect bike demand.

5. **col_tnf__Seasons_Winter (0.030)**: The winter season has moderate importance in predicting bike demand. Seasonal variations are often crucial in this type of prediction.

6. **remainder__Solar_radiation (0.026)**: Solar radiation is moderately important. It may indicate that sunny or cloudy conditions affect bike sharing demand.

7. **col_tnf__Hour_4 (0.024)**: Hour 4 is more important than the other hours. This might suggest that the early morning has a specific influence on demand.

8. **col_tnf__Hour_5 (0.022)**: Hour 5 also holds some importance, possibly reflecting the period just before the workday starts.

9. **col_tnf__Hour_3 (0.014)**: Hour 3 has a smaller impact. It also corresponds to the early morning hours.

10. **remainder__Wind_speed (0.010)**: Wind speed plays a minor role but still contributes to the predictions.

These feature importance scores help us understand which features have the most significant impact on the bike sharing demand prediction model. We can use this information to identify the key factors that influence bike demand and potentially focus on them.
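
The scores above are impurity-based importances, which are computed on the training data. As a cross-check, the sketch below uses a different technique, permutation importance on held-out data via `sklearn.inspection.permutation_importance`, and prints both side by side. The synthetic data and the generic `feature_<index>` labels are assumptions for illustration; this is not the notebook's pipeline.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 6 features, only 3 of them informative
X, y = make_regression(
    n_samples=600, n_features=6, n_informative=3, noise=5.0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

forest = RandomForestRegressor(n_estimators=100, random_state=5).fit(X_train, y_train)

# Shuffle each feature on the test split and measure the drop in R-squared
result = permutation_importance(
    forest, X_test, y_test, n_repeats=10, random_state=0, scoring="r2"
)

# Print features from most to least important by permutation importance
for idx in result.importances_mean.argsort()[::-1]:
    print(
        f"feature_{idx}: "
        f"impurity={forest.feature_importances_[idx]:.3f}, "
        f"permutation={result.importances_mean[idx]:.3f} "
        f"+/- {result.importances_std[idx]:.3f}"
    )
```

When both rankings agree on the top features, the interpretation given above rests on firmer ground. Large disagreements would be a reason to look more closely before drawing conclusions.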

# **Conclusion**

We trained a model to predict the number of bikes rented under given weather conditions. First, we performed Exploratory Data Analysis on the dataset. We checked for null values and found none. We also performed correlation and OLS analysis to identify the important and relevant features. The target variable had some outliers, so a log transformation was applied to reduce their influence.

Next, we implemented several machine learning algorithms: Linear Regression, Lasso and Ridge regression, and a Random Forest regressor. We calculated evaluation metrics to check model accuracy. The final model appears to perform well, with low errors (MAE, MSE, RMSE) and a high percentage of variance explained (R2 and Adjusted R2).

* Season: The highest number of bike rentals occurs in summer and the lowest in spring.

* Weather: As one would expect, rentals are highest on clear days and lowest on snowy or rainy days.

* Humidity: The rental count decreases as humidity increases.

* Time of day: Demand peaks at 8 AM, when people travel to work, and at 6 PM, when they return home.

* Demand is high during the afternoon and evening.

* Model: The Random Forest regressor works best, with an R-squared of about 93%.




However, this is not the end of the work. Because this data is time dependent, the values of variables such as temperature, wind speed and solar radiation will not always stay consistent, so there will be scenarios where the model might not perform well. As machine learning is a rapidly evolving field, we have to be prepared for such contingencies and keep checking our model from time to time.
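
One way to check a model on time-dependent data is to evaluate it only on observations that come after the ones it was trained on. The sketch below does this with scikit-learn's `TimeSeriesSplit`. The data is synthetic and the `Ridge` model is a placeholder (both assumptions), and the rows are assumed to be in chronological order.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

# Synthetic stand-in data, assumed to be sorted by time
rng = np.random.default_rng(0)
n_rows = 400
X = rng.normal(size=(n_rows, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.3, size=n_rows)

# Each fold trains on an initial segment and tests on the block that follows it
tscv = TimeSeriesSplit(n_splits=5)
scores = cross_val_score(Ridge(alpha=1.0), X, y, cv=tscv, scoring="r2")

for fold, ((train_idx, test_idx), score) in enumerate(zip(tscv.split(X), scores), start=1):
    print(
        f"Fold {fold}: train rows 0-{train_idx.max()}, "
        f"test rows {test_idx.min()}-{test_idx.max()}, R2={score:.3f}"
    )
```

Repeating such a check as new data arrives is one possible way to notice when the model's performance starts to degrade.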