<a href="https://colab.research.google.com/github/ashish-bansod/Bike_Sharing_Demand_Prediction/blob/main/Bike_Sharing_Demand_Prediction_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Bike Sharing Demand Prediction



### **Project Type**    - *Regression*
### **Contribution**    - Individual - *Ashish Bansod*


# **Project Summary -**

Write the summary here within 500-600 words.

# **GitHub Link -**

https://github.com/ashish-bansod/Bike_Sharing_Demand_Prediction

# **Problem Statement**


**Rental bikes have been introduced in many cities to improve mobility and comfort. It is important to make rental bikes available and accessible to the public at the right time, as this reduces waiting time. Providing the city with a stable supply of rental bikes therefore becomes a major concern. The crucial part is predicting the number of bikes required at each hour so that the supply stays stable.**

# **Data Description**

* **Date** - day/month/year
* **Rented Bike Count** - Count of bikes rented per hour
* **Hour** - Hour of the day
* **Temperature** - Temperature in Celsius
* **Humidity** - Humidity in the air in %
* **Wind speed** - Speed of the wind in m/s
* **Visibility** - Visibility, in units of 10 m
* **Dew point temperature** - Dew point temperature in Celsius (the temperature at which the air becomes saturated with water vapour)
* **Solar radiation** - Sun contribution (MJ/m2)
* **Rainfall** - Amount of rain in mm
* **Snowfall** - Amount of snow in cm
* **Seasons** - Winter, Spring, Summer, Autumn
* **Holiday** - Holiday/No holiday
* **Functioning Day** - Whether the day is a functioning day or not

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. For each chart, make sure the following format is answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ML algorithms for model creation. For each algorithm, make sure the following points are answered.

*   Explain the ML model used and its performance using an Evaluation Metric Score Chart.
*   Cross-Validation & Hyperparameter Tuning
*   Have you seen any improvement? Note down the improvement with an updated Evaluation Metric Score Chart.
*   Explain what each evaluation metric indicates for the business, and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

from datetime import datetime
import datetime as dt

from sklearn.model_selection import train_test_split,cross_validate,GridSearchCV,RandomizedSearchCV

from sklearn.preprocessing import MinMaxScaler,OneHotEncoder,OrdinalEncoder,LabelEncoder

from sklearn.linear_model import LinearRegression,Lasso,Ridge,ElasticNet

from sklearn.metrics import accuracy_score,mean_absolute_error,mean_squared_error,r2_score,log_loss

import warnings
warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
# Load Dataset
bike_df=pd.read_csv('/content/SeoulBikeData.csv',encoding='latin')

### Dataset First View

In [None]:
# Dataset First Look
bike_df.sample(5)

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print('Total no. of rows: ' ,len(bike_df.index))
print('\n')
print('Total no. of columns: ',len(bike_df.columns))

### Dataset Information

In [None]:
# Dataset Info
bike_df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
bike_df.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
bike_df.isnull().sum()

In [None]:
# Visualizing the missing values
missing_values_per = pd.DataFrame((bike_df.isnull().sum() / len(bike_df)) * 100).reset_index()
missing_values_per.columns = ['Column', 'Missing Percentage']

plt.figure(figsize=(15, 5))
plt.stem(missing_values_per['Column'], missing_values_per['Missing Percentage'])
plt.xlabel('Columns',color='blue')
plt.ylabel('Missing Percentage',color='blue')
plt.title('Missing Values Percentage per Column',bbox={'facecolor':'blue', 'alpha':0.5})
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()

### What did you know about your dataset?

**There are no missing or duplicate values in our dataset. The data contains 8760 rows and 14 columns.**
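
The checks above (shape, duplicates, missing values) can be gathered into one small helper. The following is a minimal sketch on an invented toy frame, not the Seoul bike data; the name `summarize_quality` is hypothetical.

```python
import numpy as np
import pandas as pd


def summarize_quality(df: pd.DataFrame) -> dict:
    """Return row/column counts plus duplicate-row and missing-cell totals."""
    return {
        "rows": len(df),
        "columns": df.shape[1],
        "duplicate_rows": int(df.duplicated().sum()),
        "missing_cells": int(df.isnull().sum().sum()),
    }


# Toy data (assumption): one repeated row and one missing temperature
toy = pd.DataFrame({
    "Hour": [0, 1, 1, 2],
    "Temperature": [-5.2, -5.5, -5.5, np.nan],
})

report = summarize_quality(toy)
print(report)
```

Calling `summarize_quality(bike_df)` right after loading would give the same numbers as the separate cells in a single dictionary.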

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
bike_df.columns

In [None]:
# Renaming Columns
bike_df=bike_df.rename(columns={'Rented Bike Count':'Rented_Bike_Count',
                                'Temperature(°C)':'Temperature',
                                'Humidity(%)':'Humidity',
                                'Wind speed (m/s)':'Wind_speed',
                                'Visibility (10m)':'Visibility',
                                'Dew point temperature(°C)':'Dew_point_temperature',
                                'Solar Radiation (MJ/m2)':'Solar_Radiation',
                                'Rainfall(mm)':'Rainfall',
                                'Snowfall (cm)':'Snowfall',
                                'Functioning Day':'Functioning_Day'})

In [None]:
# Dataset Describe
bike_df.describe(include='all')

### Variables Description

* **Date** - day/month/year
* **Rented Bike Count** - Count of bikes rented per hour
* **Hour** - Hour of the day
* **Temperature** - Temperature in Celsius
* **Humidity** - Humidity in the air in %
* **Wind speed** - Speed of the wind in m/s
* **Visibility** - Visibility, in units of 10 m
* **Dew point temperature** - Dew point temperature in Celsius (the temperature at which the air becomes saturated with water vapour)
* **Solar radiation** - Sun contribution (MJ/m2)
* **Rainfall** - Amount of rain in mm
* **Snowfall** - Amount of snow in cm
* **Seasons** - Winter, Spring, Summer, Autumn
* **Holiday** - Holiday/No holiday
* **Functioning Day** - Whether the day is a functioning day or not

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
unique_values_per_variable = bike_df.apply(lambda column: column.unique())
print(unique_values_per_variable)

In [None]:
# Number of Unique values in each columns
bike_df.nunique().sort_values(ascending=False)

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.

In [None]:
bike_df.info()

In [None]:
# Check the current data type of the 'Date' column before converting it
bike_df['Date'].dtype

In [None]:
# Convert 'Date' from string to datetime (source format: day/month/year)
bike_df['Date']=pd.to_datetime(bike_df['Date'],format='%d/%m/%Y')

In [None]:
bike_df['Date'].dtype

In [None]:
# Extracting: 'day' , 'month' and  'year'  from 'Date' column:
bike_df['day']=bike_df['Date'].dt.day_name()
bike_df['month']=bike_df['Date'].dt.month
bike_df['year']=bike_df['Date'].dt.year

In [None]:
bike_df.sample(5)

In [None]:
bike_df['day'].unique()

In [None]:
# Creating new column named: 'weekend'
bike_df['weekend']=bike_df['day'].apply(lambda x: 1 if x== 'Saturday' or x== 'Sunday' else 0)

In [None]:
bike_df['weekend'].value_counts()

In [None]:
# Drop columns: 'Date', 'day' and 'year'
bike_df.drop([ 'Date','day','year'],axis=1,inplace=True)

In [None]:
bike_df.sample(3)

### What all manipulations have you done and insights you found?

**The "Date" column is initially read as a string, so it is first converted to datetime format. From the converted column we extract the day name, month, and year. The day name is used to derive a binary `weekend` flag (1 for Saturday or Sunday, 0 otherwise). Only `month` and `weekend` are kept as features; the `Date`, `day`, and `year` columns are then dropped. These temporal features make it easier to analyze how rental demand varies over time.**
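
As a compact illustration of the date handling described above, the sketch below parses day/month/year strings and derives the `month` and `weekend` features on three invented dates. It uses `dt.dayofweek` (Monday = 0, Sunday = 6) instead of day names, which is an alternative to the notebook's approach rather than a copy of it.

```python
import pandas as pd

# Toy dates (assumption): 1-3 December 2017 in day/month/year format
toy = pd.DataFrame({"Date": ["01/12/2017", "02/12/2017", "03/12/2017"]})

# Parse the strings with an explicit format so day and month are not swapped
toy["Date"] = pd.to_datetime(toy["Date"], format="%d/%m/%Y")

# Month number (1-12) and a weekend flag (Saturday=5, Sunday=6)
toy["month"] = toy["Date"].dt.month
toy["weekend"] = (toy["Date"].dt.dayofweek >= 5).astype(int)

print(toy)
```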

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

## ***UNIVARIATE ANALYSIS***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
bike_df['Rented_Bike_Count'].value_counts().sort_values(ascending=False)

In [None]:
# Chart - 1 : Distribution of Dependent variable

fig, (ax1,ax2) = plt.subplots(1,2,figsize=(18,6))
sns.kdeplot(bike_df,x='Rented_Bike_Count',fill=True,color='b',ax=ax1)
ax1.axvline(bike_df['Rented_Bike_Count'].mean(), color='salmon', linestyle='dashed', linewidth=2)
ax1.axvline(bike_df['Rented_Bike_Count'].median(), color='royalblue', linestyle=':', linewidth=2)
sns.boxplot(bike_df,x='Rented_Bike_Count',ax=ax2,palette="viridis")
ax1.set_xlabel('Rented Bike Count', color='blue')
ax1.set_ylabel('Density', color='blue')

ax2.set_xlabel('Rented Bike Count', color='blue')
plt.show()

#####  What is/are the insight(s) found from the chart?

**The dependent variable is positively skewed and has many outliers on the high side.**

#####  Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**The gained insights from analyzing data with a positively skewed dependent variable (Bike rented count) and a high number of outliers can potentially create a positive business impact. However, the presence of outliers suggests instances where there are extremely high bike rental counts, which may indicate exceptional demand spikes or anomalies. While this may not directly lead to negative growth, it can pose challenges in capacity planning, resource allocation, and service delivery, requiring businesses to carefully manage and optimize operations to meet customer demand and prevent potential negative impacts on customer satisfaction and business growth.**
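
Skewness and outliers can also be quantified rather than read off the plots. The sketch below, on invented numbers, computes the sample skewness and counts points outside the 1.5 × IQR whiskers that a box plot draws; `iqr_outlier_count` is a hypothetical helper name.

```python
import pandas as pd


def iqr_outlier_count(s: pd.Series) -> int:
    """Count values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return int(((s < lower) | (s > upper)).sum())


# Toy counts (assumption): mostly small values with one very large one
toy = pd.Series([10, 12, 15, 18, 20, 22, 25, 300])

skew = toy.skew()            # > 0 means a longer right tail
n_out = iqr_outlier_count(toy)
print(f"skewness={skew:.2f}, outliers={n_out}")
```

Applied to `bike_df['Rented_Bike_Count']`, the same two numbers would back up the visual reading of Chart 1.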

#### Chart - 2

In [None]:
# Chart - 2 visualization code
num_features=bike_df.drop('Rented_Bike_Count',axis=1).describe().columns
num_features

In [None]:
# Chart - 2 : Distribution of Numerical Features
for var in num_features:
  fig,(ax1,ax2)=plt.subplots(1,2,figsize=(18,5))
  sns.kdeplot(bike_df,x=var,fill=True,ax=ax1,color='b')
  ax1.axvline(bike_df[var].mean(),color='salmon', linestyle='dashed', linewidth=2)
  ax1.axvline(bike_df[var].median(),color='royalblue', linestyle=':', linewidth=2)
  sns.boxplot(bike_df,x=var,ax=ax2,palette="viridis")

  ax1.set_xlabel(f'{var}', color='blue')
  ax1.set_ylabel('Density', color='blue')

  ax2.set_xlabel(var, color='green')

  plt.show()
  print('\n\n')


#####  What is/are the insight(s) found from the chart?

**Our numerical features are skewed, and some of them contain outliers.**

#####  Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Yes, the gained insights from careful analysis and appropriate handling of skewness and outliers can help create a positive business impact. By accurately understanding and addressing these data characteristics, businesses can make informed decisions, develop effective strategies, and optimize their operations. This can lead to improved resource allocation, targeted marketing, enhanced customer satisfaction, and overall positive growth and performance in the business.**

#### Chart - 3

In [None]:
# Chart - 3 : Plotting graph for categorical features
fig,(ax1,ax2,ax3)=plt.subplots(1,3,figsize=(18,5))
sns.countplot(bike_df,x='Seasons',ax=ax1,palette='pastel')
sns.countplot(bike_df,x='Holiday',ax=ax2,palette='pastel')
sns.countplot(bike_df,x='Functioning_Day',ax=ax3,palette='flare')

ax1.set_xlabel('Seasons', color='blue')
ax1.set_ylabel('Count', color='blue')

ax2.set_xlabel('Holiday', color='blue')
ax2.set_ylabel('Count', color='blue')

ax3.set_xlabel('Functioning_Day', color='blue')
ax3.set_ylabel('Count', color='blue')

plt.show()


##### 1. Why did you pick the specific chart?

**A count plot shows how many records fall into each category of the categorical features.**

##### 2. What is/are the insight(s) found from the chart?

**The number of records is nearly the same across seasons, but it is heavily imbalanced across the categories of the 'Holiday' and 'Functioning_Day' columns.**

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Analyzing the rental bike patterns across seasons, holidays, and functional days provides valuable insights for understanding the fluctuation in demand and optimizing resource allocation accordingly.**

## ***BIVARIATE ANALYSIS***

#### Chart - 4

In [None]:
# Chart - 4 visualization code
num_features

In [None]:
for var in num_features:
  plt.figure(figsize=(12,6))
  sns.scatterplot(bike_df,x=var,y='Rented_Bike_Count',color='salmon')
  correlation=bike_df[var].corr(bike_df['Rented_Bike_Count'])
  plt.title('Rented_Bike_Count vs ' + var + ': Correlation = '+str(correlation),bbox={'facecolor':'salmon', 'alpha':0.3} )
  z = np.polyfit(bike_df[var], bike_df['Rented_Bike_Count'], 1)
  y_hat = np.poly1d(z)(bike_df[var])
  plt.plot(bike_df[var], y_hat,'r--', lw=1)

  plt.xlabel(var , color='salmon')
  plt.ylabel('Rented_Bike_Count Label', color='salmon')

  plt.show()
  print('\n\n\n')

##### 1. Why did you pick the specific chart?

**A scatter plot with a fitted trend line shows the relationship between each numerical feature and the dependent variable.**

##### 2. What is/are the insight(s) found from the chart?

**Hour, Temperature, Wind_speed, Visibility, Dew_point_temperature, Solar_Radiation, and month are positively correlated with the dependent variable (Rented_Bike_Count), while the remaining numerical features are negatively correlated with it.**

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Yes, the gained insights indicating positive correlations between Hour, Temperature, wind speed, visibility, dew point temperature, solar radiation, and month with the Rented Bike Count can inform business decisions such as optimizing operational hours, adjusting pricing, and targeting marketing efforts to maximize bike rentals and drive positive business impact. Additionally, understanding the negative correlations with other numerical features can help identify areas for improvement and implement strategies to mitigate potential negative impacts on bike rentals.**
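
The signs of the correlations can be listed in one table instead of being read from each plot title. This is a minimal sketch on invented values; only the column names are taken from the notebook.

```python
import pandas as pd

# Toy data (assumption): demand rises with temperature and falls with humidity
toy = pd.DataFrame({
    "Temperature": [1, 5, 10, 15, 20, 25],
    "Humidity": [90, 80, 70, 65, 50, 40],
    "Rented_Bike_Count": [50, 120, 300, 420, 600, 800],
})

# Pearson correlation of every feature with the target, most negative first
corr = (
    toy.corr()["Rented_Bike_Count"]
    .drop("Rented_Bike_Count")
    .sort_values()
)
print(corr)
```

Pearson correlation only measures linear association, so a feature with a strong but non-linear pattern can still show a modest coefficient.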

#### Chart - 5

In [None]:
# Chart - 5 visualization code
fig,(ax1,ax2,ax3)=plt.subplots(1,3,figsize=(18,5))
sns.barplot(bike_df,x='Seasons',y='Rented_Bike_Count',ax=ax1,palette='viridis',capsize=0.1)
sns.barplot(bike_df,x='Holiday',y='Rented_Bike_Count',ax=ax2,palette='pastel',capsize=0.1)
sns.barplot(bike_df,x='Functioning_Day',y='Rented_Bike_Count',ax=ax3,palette='inferno',capsize=0.1)
plt.show()

##### 1. Why did you pick the specific chart?

**To plot the variation in Rented Bike count due to Seasons,Holiday and Functioning day**

##### 2. What is/are the insight(s) found from the chart?

* **Count is maximum during summer but minimum during winter.**
* **During holidays counts drop down**.
* **The contribution of non-functioning days to the count is insignificant**.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Yes, the gained insights about the count being maximum during summer, dropping during winter, and decreasing during holidays can help businesses plan their resources, adjust marketing strategies, and optimize operations to meet customer demand, resulting in a positive business impact. Additionally, the understanding that non-functioning days have an insignificant contribution can guide businesses in allocating resources more efficiently.**

#### Chart - 6

In [None]:
# Chart - 6 visualization code
plt.figure(figsize=(12,6))
# Select the target column before averaging so non-numeric columns are not aggregated
bike_df.groupby('Rainfall')['Rented_Bike_Count'].mean().plot(c='royalblue')
plt.xlabel('Rainfall in mm',color='blue')
plt.ylabel('Average rented bike count',color='blue')
plt.xticks(range(0,37,2))
plt.show()

##### 1. Why did you pick the specific chart?

**To analyze the relationship between "Rented_Bike_Count" and "Rainfall"**

##### 2. What is/are the insight(s) found from the chart?

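**The above plot suggests that average demand for rented bikes does not fall steadily as rainfall increases. For instance, even at a rainfall of 22-24 mm there is a pronounced peak in the average number of rented bikes. Because each point is an average over the hours recorded at that rainfall value, peaks at rare, heavy-rainfall values may rest on only a few observations.**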

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Yes, the gained insight that heavy rainfall does not decrease the demand for rented bikes can have a positive business impact. Businesses can leverage this information to optimize their operations during rainy periods and ensure a continuous supply of bikes, meeting customer demand and potentially increasing revenue.**
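
One way to judge how much weight a peak in the averaged line deserves is to report the number of observations behind each average. The sketch below uses invented values; the single row at 24 mm is an assumption made to show the effect.

```python
import pandas as pd

# Toy data (assumption): many dry hours, a few light-rain hours, one heavy-rain hour
toy = pd.DataFrame({
    "Rainfall": [0.0, 0.0, 0.0, 0.0, 0.5, 0.5, 24.0],
    "Rented_Bike_Count": [400, 500, 450, 650, 100, 60, 700],
})

# Mean demand and sample size for every distinct rainfall value
summary = toy.groupby("Rainfall")["Rented_Bike_Count"].agg(["mean", "count"])
print(summary)
```

In this toy table the highest mean comes from a single hour, which is the kind of case where the averaged line can mislead.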

#### Chart - 7

In [None]:
# Chart - 7 visualization code
# Chart - 7 : plot to analyze the relationship between "Rented_Bike_Count" and "Wind_speed"
plt.figure(figsize=(12,6))
# Select the target column before averaging so non-numeric columns are not aggregated
bike_df.groupby('Wind_speed')['Rented_Bike_Count'].mean().plot(c='royalblue')
plt.xlabel('Wind_Speed in m/s',color='blue')
plt.ylabel('Average rented bike count',color='blue')
plt.show()

##### 1. Why did you pick the specific chart?

**To analyze the relationship between "Rented_Bike_Count" and "Wind_speed"**

##### 2. What is/are the insight(s) found from the chart?

**From the plot above, we can observe that average demand for rented bikes stays fairly even across wind speeds. There is a spike in bike rentals at a wind speed of about 7 m/s, which may suggest that riders are not deterred by breezy conditions, although a spike at a rarely observed wind speed can rest on only a few hours of data.**

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Yes, the gained insight about the even distribution of bike rentals regardless of wind speed, with a spike at 7 m/s, can have a positive business impact. Businesses can promote biking as an enjoyable activity during breezy conditions, potentially increasing bike rentals and attracting more customers.**

#### Chart - 8

In [None]:
# Chart - 8 visualization code
fig, ax = plt.subplots(figsize=(15, 8))
sns.boxplot(data=bike_df, x='Hour', y='Rented_Bike_Count', ax=ax,palette='viridis')
plt.title('Count of Rented bikes according to Hour',bbox={'facecolor':'blue', 'alpha':0.5})
plt.show()

##### 1. Why did you pick the specific chart?

**To see the count of rented bike hourly**

##### 2. What is/are the insight(s) found from the chart?

**The plot above shows the usage of rented bikes by hour of day across the whole year. People tend to use rented bikes most during commuting hours, specifically from 7 AM to 9 AM and from 5 PM to 7 PM.**

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Yes, the insight that people tend to use rented bikes during commuting hours can have a positive business impact. Businesses can focus their operations and marketing efforts on these peak hours to meet customer demand, attract more riders, and potentially increase revenue.**

#### Chart - 9

In [None]:
# Chart - 9 visualization code
plt.figure(figsize=(14,6))
sns.barplot(x='month',y='Rented_Bike_Count',data=bike_df,palette='pastel')
plt.title('Average count of Bikes Rented per Month',bbox={'facecolor':'blue','alpha':0.5})
plt.show()
# bbox={'facecolor':'blue', 'alpha':0.5}


##### 1. Why did you pick the specific chart?

**To Check Average count of Bikes Rented per Month**

##### 2. What is/are the insight(s) found from the chart?

**Demand for rented bikes is highest during the summer months and lowest during winter.**

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Yes, the gained insight that demand for rented bikes is high during summer and low during winter can help businesses align their resources and marketing strategies accordingly, maximizing revenue and creating a positive business impact.**

## ***TriVariate Analysis***

#### Chart - 10

In [None]:
# Chart - 10 visualization code
plt.figure(figsize=(16,6))
sns.lineplot(x='Hour',y= "Rented_Bike_Count",data=bike_df,hue='Seasons',palette='deep',alpha=1)
plt.xticks(range(0,24))
plt.title('Analysing trend line of "Rented Bike Count" w.r.t "Hour" for different Seasons',bbox={'facecolor':'blue','alpha':0.5})
plt.show()

##### 1. Why did you pick the specific chart?

**Analysing trend line of "Rented Bike Count" w.r.t "Hour" for different Seasons**

##### 2. What is/are the insight(s) found from the chart?

 **The analysis reveals that the use of rented bikes is significantly high during the summer season with peak demand during 7am-9am and 5pm-7pm. However, during the winter season, the use of rented bikes is quite low due to snowfall.**

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Yes, the gained insights that highlight high demand for rented bikes during the summer season and specific peak hours, as well as low demand during the winter season due to snowfall, can help businesses optimize operations, target marketing efforts, and adjust resources accordingly, leading to a positive business impact.**

#### Chart - 11

In [None]:
# Chart - 11 visualization code
plt.figure(figsize=(16,6))
sns.pointplot(x='Hour',y= "Rented_Bike_Count",data=bike_df,hue='Holiday',palette='rocket')
plt.title('Analysing trend line of "Rented Bike Count" w.r.t "Hour" separately for "Holiday" and "No Holiday" ',
                                      bbox={'facecolor':'blue', 'alpha':0.5})
plt.show()

##### 1. Why did you pick the specific chart?

**To analyse the trend of "Rented Bike Count" w.r.t. "Hour" separately for "Holiday" and "No Holiday"**

##### 2. What is/are the insight(s) found from the chart?

**During Holidays People prefer to use rented bikes after 12 pm.**

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Yes, the gained insight that people prefer to use rented bikes after 12 pm during holidays can help businesses adjust their operational hours and allocate resources effectively, catering to the increased demand and potentially creating a positive business impact.**

#### Chart - 12

In [None]:
# Chart - 12 visualization code
plt.figure(figsize=(16,6))
sns.pointplot(x='Hour',y= "Rented_Bike_Count",data=bike_df,hue='weekend',palette='rocket')
plt.title('Analysing trend line of "Rented Bike Count" w.r.t "Hour" separately for "weekdays" and "weekend" ',
                        bbox={'facecolor':'blue','alpha':0.5})
plt.show()

##### 1. Why did you pick the specific chart?

**To analyse the trend of "Rented Bike Count" w.r.t. "Hour" separately for "weekdays" and "weekend"**

##### 2. What is/are the insight(s) found from the chart?

* **We can observe that the demand for rented bikes is higher on weekdays, specifically between 7am-9am and 5pm-7pm**.
* **On weekends, the demand for rented bikes is generally lower, especially during the morning hours, but it rises thereafter.**

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Yes, the gained insights about higher demand for rented bikes on weekdays, specifically during peak commuting hours, and lower demand on weekends, especially in the morning, can help businesses optimize their operations, staffing, and marketing strategies to cater to these patterns, potentially leading to a positive business impact**

### **Multivariate Analysis**

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
plt.figure(figsize=(16,8))
plt.title('Correlation Chart')
sns.heatmap(bike_df[bike_df.describe().columns].corr(),annot=True,annot_kws={'size': 10},linewidths=2,square=True,fmt='.2f',cmap='YlOrBr')
plt.show()

### **Presence of Multicollinearity**
* **We observe that columns 'Temperature' and 'Dew point temperature' are highly positively correlated, with a correlation coefficient of 0.91.**
* **'Visibility' and 'Humidity' have high negative correlation as compared to others, with a correlation coefficient of -0.54.**

### **Using VIF to remove Multicollinearity**
*  <i>VIF score should be less than 5 for no multicollinearity.</i>
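
For reference, the VIF of feature $j$ is $1 / (1 - R_j^2)$, where $R_j^2$ comes from regressing feature $j$ on all the other features. The sketch below implements that definition with NumPy on synthetic data; the function name `vif` and the generated columns are assumptions for illustration. It fits each auxiliary regression with an intercept, whereas statsmodels' `variance_inflation_factor` works on the design matrix exactly as passed and does not add an intercept itself, so its values can differ from this sketch.

```python
import numpy as np


def vif(X: np.ndarray) -> np.ndarray:
    """VIF of each column of X, regressing it on the other columns plus an intercept."""
    n, p = X.shape
    scores = np.empty(p)
    for j in range(p):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        r_squared = 1.0 - resid.var() / target.var()
        scores[j] = 1.0 / (1.0 - r_squared)
    return scores


# Synthetic data (assumption): dew point is temperature plus a little noise
rng = np.random.default_rng(0)
temperature = rng.normal(size=500)
dew_point = temperature + rng.normal(scale=0.3, size=500)
wind = rng.normal(size=500)

scores = vif(np.column_stack([temperature, dew_point, wind]))
print(dict(zip(["temperature", "dew_point", "wind"], scores.round(2))))
```

The two nearly collinear columns get large scores while the independent column stays close to 1, which mirrors why dropping one of a correlated pair lowers the remaining VIFs.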

In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

def calc_vif(X):
  # Calculate the VIF of every column in X and return the scores as a DataFrame
  vif = pd.DataFrame()
  vif['Features'] = X.columns
  vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
  return vif

In [None]:
bike_df_copy=bike_df.copy()

In [None]:
calc_vif(bike_df_copy[[i for i in bike_df_copy.describe().columns if i not in ["Rented_Bike_Count"]]])

In [None]:
bike_df_copy.drop(columns = ['Dew_point_temperature'],axis = 1, inplace = True)

In [None]:
calc_vif(bike_df_copy[[i for i in bike_df_copy.describe().columns if i not in ["Rented_Bike_Count"]]])

In [None]:
bike_df_copy.drop(columns = ['Humidity'],axis = 1, inplace = True)

In [None]:
calc_vif(bike_df_copy[[i for i in bike_df_copy.describe().columns if i not in ["Rented_Bike_Count"]]])

## ***5. Hypothesis Testing***

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

**Null Hypothesis (H0)**: <i>There is no significant relationship between the independent variables and the 'Rented Bike Count' (dependent variable).</i>

**Alternate Hypothesis (Ha)**: <i>There is a significant relationship between the independent variables and the 'Rented Bike Count' (dependent variable).</i>

#### 2. Perform an appropriate statistical test.

**To test this hypothesis, we can perform statistical tests such as:**

* *Feature Significance Test: Evaluate the p-values of the coefficients of the independent variables in the regression model. If the p-values are below a predetermined significance level (e.g., 0.05), reject the null hypothesis and conclude that there is a significant relationship between the independent variables and the 'Rented Bike Count.'*

* *Overall Model Significance Test: Assess the overall significance of the regression model by conducting an F-test or chi-square test. If the p-value is below the chosen significance level, reject the null hypothesis and conclude that the model as a whole is significant in predicting the 'Rented Bike Count.'*
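
To show what the feature significance test computes, the sketch below derives coefficient p-values by hand for an ordinary least squares fit on synthetic data: each coefficient is divided by its standard error and compared against a t distribution. The data and variable names are assumptions; in practice `statsmodels` reports the same quantities.

```python
import numpy as np
from scipy import stats

# Synthetic data (assumption): y depends on temperature but not on noise_feature
rng = np.random.default_rng(1)
n = 400
temperature = rng.normal(size=n)
noise_feature = rng.normal(size=n)
y = 3.0 * temperature + rng.normal(size=n)

# Design matrix with an intercept column
X = np.column_stack([np.ones(n), temperature, noise_feature])

# OLS coefficients and residual variance
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
dof = n - X.shape[1]
sigma2 = resid @ resid / dof

# Standard errors, t statistics, and two-sided p-values
se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
t_stats = beta / se
p_values = 2 * stats.t.sf(np.abs(t_stats), dof)

for name, p in zip(["const", "temperature", "noise_feature"], p_values):
    print(f"{name}: p = {p:.5f}")
```

A p-value below 0.05 for a coefficient leads to rejecting the hypothesis that the coefficient is zero, given the other features in the model.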

#### **Ordinary Least Squares Model**

In [None]:
# Perform Statistical Test to obtain P-Value
import statsmodels.api as sm


# Add a constant column to the DataFrame for the intercept term
bike_df_copy = sm.add_constant(bike_df_copy)

independent_vars=bike_df_copy[bike_df_copy.describe().columns].drop('Rented_Bike_Count',axis=1)
dependent_var=bike_df_copy['Rented_Bike_Count']

# Perform the regression analysis
model = sm.OLS(dependent_var,independent_vars)
results = model.fit()

# Obtain the p-values
p_values = results.pvalues

print(round(p_values,5))

In [None]:
results.summary()

**Conclusion**
* For the 'Solar_Radiation' variable, the p-value is 0.53238, which is greater than 0.05. Therefore, there is not enough evidence to conclude a significant relationship between 'Solar_Radiation' and the 'Rented Bike Count'.
* Similarly, for the 'Snowfall' variable, the p-value is 0.10884, which is also greater than 0.05. Hence, there is not enough evidence to establish a significant relationship between 'Snowfall' and the 'Rented Bike Count'.

In summary, based on the given p-values, we can reject the null hypothesis for the independent variables 'Hour', 'Temperature', 'Wind_speed', 'Visibility', 'Rainfall', 'month', and 'weekend'. This implies that there is a significant relationship between these independent variables and the 'Rented Bike Count'. However, there is insufficient evidence to reject the null hypothesis for the 'Solar_Radiation' and 'Snowfall' variables, indicating that these variables may not have a significant relationship with the 'Rented Bike Count'.

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

**Null hypothesis (H0)**: The dependent variable is normally distributed in the population.

**Alternative hypothesis (Ha)**: The dependent variable is not normally distributed in the population.

#### 2. Perform an appropriate statistical test.

#### **Shapiro-Wilk test**

In [None]:
# Perform Statistical Test to obtain P-Value

from scipy import stats

# Perform the Shapiro-Wilk test
statistic, p_value = stats.shapiro(bike_df_copy['Rented_Bike_Count'])

print("Shapiro-Wilk Test")
print("Test statistic:", statistic)
print("p-value:", p_value)

**Conclusion**
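Based on the Shapiro-Wilk test results, with a test statistic of 0.8822 and a p-value reported as 0.0, the p-value is below the chosen significance level of 0.05. We therefore reject the null hypothesis (H0) that the dependent variable is normally distributed. Note that SciPy's documentation cautions that the Shapiro-Wilk p-value may be inaccurate for samples larger than 5000, and this dataset has 8760 rows; the strong positive skew seen in Chart 1 points to the same conclusion.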

In [None]:
bike_df_copy.drop('const',axis=1,inplace=True)

## ***6. Feature Engineering & Data Pre-processing***

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments
# Treatment of Outliers in our dependent Variable(applying square root transformation)
fig,(ax1,ax2)=plt.subplots(1,2,figsize=(16,6))
sns.kdeplot(np.sqrt(bike_df_copy['Rented_Bike_Count']),color='y',fill=True,ax=ax1)
ax1.axvline(np.sqrt(bike_df_copy['Rented_Bike_Count']).mean(), color='green', linestyle='dashed', linewidth=2)
ax1.axvline(np.sqrt(bike_df_copy['Rented_Bike_Count']).median(), color='blue', linestyle='dashed', linewidth=2)
sns.boxplot(x= np.sqrt(bike_df_copy['Rented_Bike_Count']),color='y',ax=ax2)
plt.show()

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns
#ONE HOT ENCODING
cat_features=['Hour', 'Seasons', 'Holiday', 'Functioning_Day', 'month',
       'weekend']
def one_hot_encoding(data, column):
    data = pd.concat([data, pd.get_dummies(data[column], prefix=column, drop_first=True)], axis=1)
    data = data.drop([column], axis=1)
    return data

#### What all categorical encoding techniques have you used & why did you use those techniques?

**One-hot encoding gives a more descriptive representation of categorical data. Since many machine learning algorithms do not accept categorical data as input, the categories need to be converted into numerical values.**
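
A minimal illustration of one-hot encoding with `drop_first=True`, on an invented `Seasons` column: the first category in sorted order is dropped and becomes the baseline, which every remaining indicator column is measured against. The `dtype=int` argument is an addition here so the indicators are 0/1 integers rather than booleans.

```python
import pandas as pd

# Toy column (assumption): five rows covering all four seasons
toy = pd.DataFrame({"Seasons": ["Winter", "Spring", "Summer", "Autumn", "Winter"]})

# 'Autumn' sorts first, so it is dropped and acts as the baseline category
encoded = pd.get_dummies(toy["Seasons"], prefix="Seasons", drop_first=True, dtype=int)
print(encoded)
```

Dropping one level avoids a set of indicator columns that always sum to 1, which would be perfectly collinear with the intercept of a linear model.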

In [None]:
for col in cat_features:
    bike_df_copy = one_hot_encoding(bike_df_copy, col)
bike_df_copy.head()

In [None]:
bike_df_copy.columns

### 6. Data Scaling

In [None]:
# Scaling your data: z-score standardization of every feature column
from scipy.stats import zscore

features = list(set(bike_df_copy.columns) - {'Rented_Bike_Count'})
bike_df_copy[features] = bike_df_copy[features].apply(zscore)

##### Which method have you used to scale your data and why?
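
The scaling cell in this section applies z-score standardization with `scipy.stats.zscore`, computed over the full dataset before the train/test split. A common alternative is to learn the mean and standard deviation from the training rows only, so that no test-set statistics reach the model. The sketch below shows that variant with scikit-learn's `StandardScaler`; the `X_demo` frame and its two columns are synthetic stand-ins, not the notebook's data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features standing in for the real columns (assumption).
rng = np.random.default_rng(19)
X_demo = pd.DataFrame({
    "Temperature": rng.normal(13, 12, 400),
    "Humidity": rng.uniform(0, 100, 400),
})
y_demo = rng.normal(size=400)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=19
)

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # learn mean/std on training rows only
X_te_scaled = scaler.transform(X_te)      # reuse those statistics on the test rows
```

Both `zscore` (with its default `ddof=0`) and `StandardScaler` divide by the population standard deviation, so the two give the same values when fitted on the same rows.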

## ***7. ML Model Implementation***

In [None]:
X=bike_df_copy.drop('Rented_Bike_Count',axis=1)
y=np.sqrt(bike_df_copy['Rented_Bike_Count'])
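
Because the target is the square root of the bike count, the model's predictions and error metrics are on that square-root scale. To read an error in bikes, the predictions can be squared before scoring. This is a minimal sketch with made-up numbers; `counts_true` and `sqrt_pred` are hypothetical and do not come from the notebook.

```python
import numpy as np

# Hypothetical hourly counts and model outputs on the square-root scale.
counts_true = np.array([100.0, 400.0, 900.0, 1600.0])
sqrt_pred = np.array([11.0, 19.0, 31.0, 38.0])

# RMSE as the model sees it (square-root scale).
rmse_sqrt_scale = np.sqrt(np.mean((np.sqrt(counts_true) - sqrt_pred) ** 2))

# Undo the transform, then score in units of bikes.
counts_pred = np.square(sqrt_pred)
rmse_count_scale = np.sqrt(np.mean((counts_true - counts_pred) ** 2))

print("RMSE (sqrt scale):", rmse_sqrt_scale)
print("RMSE (bike counts):", rmse_count_scale)
```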

In [None]:
# TRAIN TEST SPLIT
X_train,X_test,y_train,y_test= train_test_split(X,y,test_size=0.25,random_state=19)
print(X_train.shape)
print(X_test.shape)

In [None]:
bike_df_copy.info()

### **Linear Regression**

In [None]:
lr=LinearRegression()
lr.fit(X_train,y_train)

In [None]:
#check the score
lr.score(X_train,y_train)

In [None]:
#check the coefficient
lr.coef_

In [None]:
# Prediction
y_pred_train= lr.predict(X_train)
y_pred_test= lr.predict(X_test)

In [None]:
# Calculate: Training Set
# 1. mean_squared_error
mse_lr= mean_squared_error(y_train,y_pred_train)
print('MSE :' , mse_lr)
#2. Root_mean_squared_error
rmse_lr=np.sqrt(mse_lr)
print('RMSE :' , rmse_lr)
#3. mean_absolute_error
mae_lr=mean_absolute_error(y_train,y_pred_train)
print('MAE :' ,mae_lr)
#4. coefficient of determination(r2_score)
r2_lr=r2_score(y_train,y_pred_train)
print('R2 :' ,r2_lr)
#5. adjusted  coefficient of determination
adjusted_r2_lr=(1-(1-r2_score(y_train,y_pred_train))*((X_train.shape[0]-1)/(X_train.shape[0]-X_train.shape[1]-1)))
print('Adjusted R2 :' ,adjusted_r2_lr)

In [None]:
#Storing
lr_dict={'Model':'Linear Regression','MAE':round(mae_lr,2),'MSE':round(mse_lr,2),'RMSE':round(rmse_lr,2),'R2_score':round(r2_lr,2),'Adjusted R2_score':round(adjusted_r2_lr,2)}
training_df=pd.DataFrame(lr_dict,index=[1])
training_df

In [None]:
# Calculate: Test Set
# 1. mean_squared_error
mse_lr= mean_squared_error(y_test,y_pred_test)
print('MSE :' , mse_lr)
#2. Root_mean_squared_error
rmse_lr=np.sqrt(mse_lr)
print('RMSE :' , rmse_lr)
#3. mean_absolute_error
mae_lr=mean_absolute_error(y_test,y_pred_test)
print('MAE :' ,mae_lr)
#4. coefficient of determination(r2_score)
r2_lr=r2_score(y_test,y_pred_test)
print('R2 :' ,r2_lr)
#5. adjusted  coefficient of determination
# Use the test-set sample size here, since R2 is computed on the test set
adjusted_r2_lr=(1-(1-r2_score(y_test,y_pred_test))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1)))
print('Adjusted R2 :' ,adjusted_r2_lr)

In [None]:
#Storing
lr_dict2={'Model':'Linear Regression','MAE':round(mae_lr,2),'MSE':round(mse_lr,2),'RMSE':round(rmse_lr,2),'R2_score':round(r2_lr,2),'Adjusted R2_score':round(adjusted_r2_lr,2)}
test_df=pd.DataFrame(lr_dict2,index=[1])
test_df

### **Concluding Remark:**

* The linear regression model shows moderate performance on both the training and test sets, and the scores on the two sets are close, which suggests little overfitting.
* The model achieves an R-squared (R2) value of approximately 0.76, indicating that around 76% of the variance in the target variable is explained by the independent variables.
* The mean squared error (MSE) values are 36.49 (training set) and 37.49 (test set). All error metrics here are computed on the square-root-transformed target, so they are not in units of bikes.
* The root mean squared error (RMSE) values are around 6.04 (training set) and 6.12 (test set), indicating the typical magnitude of the errors.
* The mean absolute error (MAE) values are approximately 4.55 (training set) and 4.56 (test set), representing the average absolute deviation of the predictions.
* The adjusted R-squared values account for the number of predictors in the model and show a similar pattern.
* Overall, further analysis and model refinement may be beneficial to improve the performance.
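
The metrics discussed above can be collected in one helper so that the training and test evaluations share the same code. Adjusted R2 is computed as 1 - (1 - R2)(n - 1)/(n - p - 1), where n is the number of rows in the split being scored and p the number of features. The function name `regression_report` and the sample values are hypothetical, introduced only for this sketch.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def regression_report(y_true, y_pred, n_features):
    """Return MAE, MSE, RMSE, R2 and adjusted R2 for one data split."""
    n = len(y_true)  # sample size of the split being scored
    mse = mean_squared_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)
    return {
        "MAE": mean_absolute_error(y_true, y_pred),
        "MSE": mse,
        "RMSE": np.sqrt(mse),
        "R2": r2,
        "Adjusted R2": adj_r2,
    }

# Made-up values to exercise the helper.
y_true_demo = np.array([3.0, 5.0, 7.0, 9.0, 11.0, 13.0])
y_pred_demo = np.array([2.5, 5.5, 7.0, 9.5, 10.5, 13.0])
report = regression_report(y_true_demo, y_pred_demo, n_features=2)
print(report)
```

In the notebook this would be called once with `(y_train, y_pred_train, X_train.shape[1])` and once with `(y_test, y_pred_test, X_test.shape[1])`.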

In [None]:
# Checking Heteroscedasticity
residuals = y_test - y_pred_test
sns.scatterplot(x=y_pred_test, y=residuals,color='salmon')

# Add a horizontal line at y=0
plt.axhline(y=0, color='red', linestyle='--')


plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.title('Residual Plot')

plt.show()

* **Since the points in the scatter plot are more or less evenly distributed on both sides of the line y=0, the residuals appear to have relatively consistent variability across the range of predicted values. This indicates homoscedasticity rather than heteroscedasticity.**
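
A visual check can be backed by a formal test. The sketch below hand-rolls the studentized (Koenker) form of the Breusch-Pagan test: squared residuals are regressed on the predictors, and n times the R2 of that auxiliary regression is compared with a chi-squared distribution. This is not part of the original notebook; the helper names and the synthetic data are assumptions made for illustration.

```python
import numpy as np
from scipy.stats import chi2

def breusch_pagan_lm(X, residuals):
    """Koenker-style LM test: regress squared residuals on X (plus intercept)."""
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        X = X.reshape(-1, 1)
    n, k = X.shape
    design = np.column_stack([np.ones(n), X])
    e2 = np.asarray(residuals, dtype=float) ** 2
    beta, *_ = np.linalg.lstsq(design, e2, rcond=None)
    fitted = design @ beta
    r2 = 1 - np.sum((e2 - fitted) ** 2) / np.sum((e2 - e2.mean()) ** 2)
    lm = n * r2
    return lm, chi2.sf(lm, df=k)  # small p-value: evidence of heteroscedasticity

def ols_residuals(x, y):
    """Residuals of a simple least-squares fit of y on x with intercept."""
    design = np.column_stack([np.ones(len(x)), x])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return y - design @ beta

rng = np.random.default_rng(0)
x = rng.uniform(1, 10, 2000)
noise = rng.normal(size=2000)

res_homo = ols_residuals(x, 2 * x + noise)      # constant error variance
res_het = ols_residuals(x, 2 * x + noise * x)   # error spread grows with x

lm_homo, p_homo = breusch_pagan_lm(x, res_homo)
lm_het, p_het = breusch_pagan_lm(x, res_het)
print("constant variance: LM, p =", lm_homo, p_homo)
print("growing variance:  LM, p =", lm_het, p_het)
```

Applied to the notebook, `X` would be `X_test` and `residuals` the `y_test - y_pred_test` series from the residual plot cell.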

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model
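
Plain `LinearRegression` has no regularization strength to tune, so for this model the cross-validation half of the step is the part that applies directly. The following is a minimal sketch of 5-fold cross-validation with `cross_val_score`; the synthetic `X_demo` and `y_demo` are assumptions, and in the notebook `X_train` and `y_train` would take their place.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data standing in for X_train / y_train (assumption).
rng = np.random.default_rng(19)
X_demo = rng.normal(size=(300, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=300)

# Shuffled 5-fold split; each fold is held out once for scoring.
cv = KFold(n_splits=5, shuffle=True, random_state=19)
scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=cv, scoring="r2")

print("R2 per fold:", np.round(scores, 3))
print("mean R2:", scores.mean())
```

A small spread between the fold scores suggests the single train/test split used earlier is representative.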

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

### ML Model - 2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Answer Here.

### ML Model - 3

In [None]:
# ML Model - 3 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

Answer Here.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Answer Here.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

Answer Here.

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ML model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.
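
The save and reload steps can be sketched with the standard-library `pickle` module. The example below trains a small `LinearRegression` on synthetic data, writes it to a temporary file, loads it back, and confirms that both copies give the same predictions. The file name `bike_model.pkl` and the demo data are hypothetical; in the notebook the fitted model (for example `lr`) would be saved instead.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression

# Small synthetic model standing in for the notebook's fitted estimator (assumption).
rng = np.random.default_rng(19)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([1.0, 2.0, 3.0])
model = LinearRegression().fit(X_demo, y_demo)

# Save the model to a file (hypothetical name in a temporary directory).
model_path = os.path.join(tempfile.mkdtemp(), "bike_model.pkl")
with open(model_path, "wb") as f:
    pickle.dump(model, f)

# Load it back and predict on rows the model has not seen.
with open(model_path, "rb") as f:
    restored = pickle.load(f)

X_unseen = rng.normal(size=(5, 3))
same = np.allclose(model.predict(X_unseen), restored.predict(X_unseen))
print("predictions match after reload:", same)
```

Pickle files should only be loaded from trusted sources, and the scikit-learn version used to load the file should match the one used to save it.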

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***