
# **Project Name**    - Seoul Bike Sharing Demand Prediction



##### **Project Type**    - EDA/Regression
##### **Contribution**    - Individual
##### **Team Member 1 -**
##### **Team Member 2 -**
##### **Team Member 3 -**
##### **Team Member 4 -**

# **Project Summary -**

Bike sharing systems provide a convenient method for renting bicycles through a network of kiosks located throughout a city, where individuals can obtain membership, rent a bike, and automatically return it to any other kiosk location as per their requirement.


With over 500 bike-sharing programs operating globally, the trend of bike and scooter ride-sharing has gained significant momentum, particularly in metropolitan areas such as San Francisco, New York, Chicago, and Los Angeles. However, predicting bike demand on a specific day remains a crucial challenge for these businesses.

The bike sharing system market is gaining momentum globally. In 2019, the market was valued at $3.39 billion, and industry experts predict that it will reach $6.98 billion by 2027, with a projected compound annual growth rate of around 14% from 2020 to 2027.


Our objective is to develop a predictive model that can estimate the approximate number of bikes rented based on the available dataset, considering that bike sharing systems typically rent bikes on an hourly, daily, or monthly basis, with static pricing inclusive of these time periods. The system's affordability and user-friendly renting process have made it a popular choice for commuters of all kinds.

Our project aims to use historical bike usage patterns and weather data to forecast the demand for bike rentals. The dataset comprises 8,760 hourly records (24 hours for each of 365 days) from 2017 and 2018, which we later split into training and testing sets.

Our initial step is to conduct an Exploratory Data Analysis on the dataset. We examine the presence of any missing data values, though none were found, and identify and address any outliers in the dataset. Moreover, we carry out correlation analysis to identify the relevant and significant feature set, which we later modify through feature engineering. This involves adjusting a few existing columns and dropping any irrelevant ones from the dataset.



Through feature engineering and data preprocessing, we then aimed to identify and isolate the most impactful features. One of the first steps was to address multicollinearity among the independent variables, which we measured with the variance inflation factor (VIF). We then used the interquartile range (IQR) technique to detect outliers in the continuous features and capped them at bounds derived from the 25th and 75th percentiles.
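The two preprocessing steps described above can be sketched as follows. This is a minimal illustration on synthetic data, not the notebook's actual code: the `compute_vif` and `cap_outliers_iqr` helpers, the generated values, and the 1.5 × IQR fence are all assumptions made for the example.

```python
import numpy as np
import pandas as pd


def compute_vif(features: pd.DataFrame) -> pd.Series:
    """Return the variance inflation factor of each column.

    VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing column j
    on all other columns (with an intercept).
    """
    vifs = {}
    for col in features.columns:
        y = features[col].to_numpy(dtype=float)
        others = features.drop(columns=col).to_numpy(dtype=float)
        X = np.column_stack([np.ones(len(y)), others])
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        residuals = y - X @ coef
        r_squared = 1.0 - residuals.var() / y.var()
        vifs[col] = 1.0 / (1.0 - r_squared)
    return pd.Series(vifs, name="VIF")


def cap_outliers_iqr(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values to [Q1 - k*IQR, Q3 + k*IQR] (k = 1.5 is an assumed choice)."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return values.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)


# Synthetic stand-in for the weather features (hypothetical values).
rng = np.random.default_rng(0)
temperature = rng.normal(13, 12, 500)
demo = pd.DataFrame({
    "temperature": temperature,
    # Dew point tracks temperature closely, so both get a high VIF.
    "dew_point_temp": temperature - 9 + rng.normal(0, 1, 500),
    "wind_speed": rng.gamma(2.0, 0.9, 500),
})

vif = compute_vif(demo)
capped_wind = cap_outliers_iqr(demo["wind_speed"])
print(vif.round(2))
```

A feature with a very high VIF is largely explained by the other features, which is why one of a strongly collinear pair is usually dropped before fitting a linear model.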

We also noted that certain features were categorical in nature, and as such, were unsuitable for input into a machine learning model in their current form. To address this, we encoded these features into numerical values using the One-Hot Encoding technique, as they were unordered in nature.
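As a small illustration of the encoding step, the sketch below applies `pd.get_dummies` to a hypothetical four-row sample. The sample values and the choice of `drop_first=True` are assumptions for the example, not taken from the notebook.

```python
import pandas as pd

# Small hypothetical sample using categorical columns named in this notebook.
sample = pd.DataFrame({
    "seasons": ["Winter", "Spring", "Summer", "Autumn"],
    "holiday": ["No Holiday", "Holiday", "No Holiday", "No Holiday"],
    "rented_bike_count": [254, 930, 1204, 811],
})

# drop_first=True removes one level per feature to avoid perfectly
# collinear dummy columns in linear models (an assumed choice here).
encoded = pd.get_dummies(
    sample, columns=["seasons", "holiday"], drop_first=True, dtype=int
)
print(encoded.columns.tolist())
```

Each remaining dummy column is 1 when the row belongs to that category and 0 otherwise, so no artificial ordering is imposed on unordered categories.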

Following that, we divided our dataset into training and testing sets. We then trained each selected machine learning algorithm on the training set and, finally, assessed its performance on the testing set to determine how effectively it predicts the rented bike count.
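The split, train, and evaluate workflow can be sketched with NumPy alone. The data below is synthetic, and the 80/20 ratio and single feature are assumptions. In the notebook itself, scikit-learn's `train_test_split`, `LinearRegression`, and `r2_score` (imported earlier) cover the same steps.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data: one feature with a noisy linear effect (hypothetical).
n_rows = 1000
X = rng.uniform(-10, 35, size=(n_rows, 1))  # a temperature-like feature
y = 20.0 * X[:, 0] + 300.0 + rng.normal(0, 50, n_rows)

# Shuffle row indices and hold out 20% for testing (assumed ratio).
indices = rng.permutation(n_rows)
n_test = int(0.2 * n_rows)
test_idx, train_idx = indices[:n_test], indices[n_test:]

# Fit ordinary least squares on the training rows only.
A_train = np.column_stack([np.ones(len(train_idx)), X[train_idx]])
coef, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)

# Score on the held-out rows with R^2 = 1 - SS_res / SS_tot.
A_test = np.column_stack([np.ones(n_test), X[test_idx]])
pred = A_test @ coef
ss_res = np.sum((y[test_idx] - pred) ** 2)
ss_tot = np.sum((y[test_idx] - y[test_idx].mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot
print(f"test R^2: {r2:.3f}")
```

Scoring only on rows the model never saw during fitting is what makes the R² value an estimate of how well the model generalizes.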

We explored a range of popular machine learning models for the Bike Sharing Demand dataset. We started with the tried-and-true Linear Regression, followed by the regularized models Ridge, Lasso, and ElasticNet. We then moved on to tree-based models: Decision Tree and the more sophisticated ensembles Random Forest and Gradient Boosting.

Constructing a machine learning model for the Bike Sharing Demand Prediction dataset required a combination of data preprocessing, a variety of machine learning techniques, and careful model evaluation. Despite the challenges, we were able to develop a high-performing model that can effectively predict the demand for rented bikes. Among the models we trained, XGBoost outperformed the others and gave the highest R² score.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


**Currently, rental bikes are introduced in many urban cities to enhance mobility comfort. It is important to make the rental bikes available and accessible to the public at the right time, as this lessens the waiting time. Eventually, providing the city with a stable supply of rental bikes becomes a major concern. The crucial part is the prediction of the bike count required at each hour for the stable supply of rental bikes.**


# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required. 
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits. 
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that each chart answers the following format.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule. 

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error,r2_score,mean_absolute_error
from sklearn.metrics import mean_absolute_percentage_error  
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor


### Dataset Loading

In [None]:
# Load Dataset
URL='https://drive.google.com/file/d/1dZ7p614gC_iwxHwcj-1N0Lc155AGMTJS/view?usp=share_link'
df= pd.read_csv('https://drive.google.com/uc?id='+URL.split('/')[-2] , encoding= 'unicode_escape')

### Dataset First View

In [None]:
# Dataset First Look
df.head()

In [None]:
df.tail()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape

In [None]:
df.columns

### Dataset Information

In [None]:
# Dataset Info
df.info()

In [None]:
print(df.dtypes.astype(str).value_counts())

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
duplicate = df.duplicated().sum()
print('Total number of duplicate values are : ', duplicate)



* The dataset does not contain any duplicate rows or missing values.

*  Several feature names are quite long; perhaps they should be renamed
  




#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isnull().sum()

In [None]:
plt.figure(figsize=(14, 5))
sns.heatmap(df.isnull(), cbar=True, yticklabels=False)
plt.xlabel("column_name", size=14, weight="bold")
plt.title("missing values in column",fontweight="bold",size=17)
plt.show()
     

### What did you know about your dataset?

The 'SeoulBikeData' dataset contains 8760 rows and 14 columns, and is free of any null or duplicate values. Additionally, there are four categorical features within this dataset, namely Date, Seasons, Holiday, and Functioning Day.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe().T

### Variables Description 

**The dataset includes date information, the number of bikes rented every hour, and weather data (temperature, humidity, windspeed, visibility, dewpoint, solar radiation, snowfall, and rainfall).**



* **Date** - Date of the record (day/month/year)
* **Rented Bike Count** - Count of bikes rented at each hour
* **Hour** - Hour of the day
* **Temperature** - Temperature in Celsius
* **Humidity** - Humidity in the air in %, type: int
* **Wind speed** - Speed of the wind in m/s
* **Visibility** - Visibility in units of 10 m
* **Dew point temperature** - The temperature, in °C, at which water starts to condense out of the air
* **Solar Radiation** - Electromagnetic radiation emitted by the Sun, in MJ/m2
* **Rainfall** - Amount of rainfall in mm
* **Snowfall** - Amount of snowfall in cm
* **Seasons** - Winter, Spring, Summer, Autumn
* **Holiday** - Whether the day is a holiday or not
* **Functioning Day** - NoFunc (non-functional hours), Fun (functional hours)














### Renaming the features

In [None]:
df =df.rename(columns= {'Temperature(°C)':'temperature','Rented Bike Count': 'rented_bike_count', 
                        'Hour':'hour', 'Humidity(%)':'humidity',  'Dew point temperature(°C)':'dew_point_temp',
                        'Wind speed (m/s)': 'wind_speed','Visibility (10m)': 'visibility', 'Solar Radiation (MJ/m2)': 'solar_radiation',
                        'Seasons':'seasons', 'Functioning Day':'functioning_day', 'Holiday':'holiday', 'Snowfall (cm)':'snowfall','Rainfall(mm)': 'rainfall'})

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
df.nunique()  

In [None]:
# Check Unique Values for each variable.
print("the unique values of seasons are :", df['seasons'].unique())
print("the unique values of holidays are :", df['holiday'].unique())
print("the unique values of functioning_day  are :", df['functioning_day'].unique())
print("the unique values of date are :", df['Date'].unique())

print("the unique values of hour are :", df['hour'].unique())
print("the unique values of humidity are :", df['humidity'].unique())
print("the unique values of temperature are :", df['temperature'].unique())


## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Parse the 'Date' strings (day/month/year) into pandas datetimes
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')
     

In [None]:
# extracting the year, month and weekday (Monday=0 ... Sunday=6) from the date feature
df['year'] = df['Date'].dt.year
df['month'] =df['Date'].dt.month
df['weekday'] = df['Date'].dt.weekday


In [None]:
# convert into category 
cat =['month','weekday', 'hour', 'year']
for i in cat:
  df[i]=df[i].astype('category')

In [None]:
# dropping the Date column, which is no longer needed after extracting year, month and weekday
df = df.drop(columns=['Date'])

In [None]:
df.info()

In [None]:
# count the number of rows recorded in each year
df['year'].value_counts()

In [None]:
# Total rented bikes per year and month (observed=True skips empty category combinations)
df.groupby(['year','month'], observed=True).agg({'rented_bike_count':['sum']}).reset_index()

### What all manipulations have you done and insights you found?

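We parsed the `Date` column into a datetime, extracted `year`, `month`, and `weekday` from it, converted these together with `hour` to categorical columns, and then dropped the original `Date` column. The yearly counts show that most records come from 2018 and only a small portion from 2017.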

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

## **EDA (Exploratory Data Analysis)**





## **Univariate Analysis**

#### Chart - 1

In [None]:
# Chart - 1 visualization code
plt.figure(figsize=(10,6))
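# sns.distplot is deprecated; histplot with kde=True draws the same histogram and density curve
ax = sns.histplot(df['rented_bike_count'], kde=True, stat='density', color='blue')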
plt.xlabel('Rented_Bike_count')
plt.ylabel('Density')

##### 1. Why did you pick the specific chart?

We have picked this chart to check the distribution of the dependent variable 

##### 2. What is/are the insight(s) found from the chart?

Linear regression tends to perform better when the dependent variable, and more precisely the model residuals, are approximately normally distributed. The distribution plot above shows that the dependent variable is moderately right-skewed, so we should apply a transformation to bring its distribution closer to normal.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 2

In [None]:
# Chart - 2 visualization code
sns.boxplot(x=df['rented_bike_count'])



* Outliers can be found in the rented bike count data, as seen by the boxplot above.

In [None]:
# Normalizing the skewed data by using the square root transformation.
sns.boxplot(x = np.sqrt(df['rented_bike_count']))

##### 1. Why did you pick the specific chart?

We picked this boxplot to check for outliers in the dependent variable.

##### 2. What is/are the insight(s) found from the chart?

After applying the square root transformation, the boxplot no longer shows the outliers that were visible in the raw rented bike count.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 3

In [None]:
# Chart - 3 visualization code
# visualizing the distribution of dependent variable after sqrt transformation
plt.figure(figsize=(16,10))
ax = sns.histplot(np.sqrt(df['rented_bike_count']), kde=True, stat='density', color="blue")
ax.axvline(np.sqrt(df['rented_bike_count']).mean(), linestyle='dashed', color='red', linewidth=2)
plt.xlabel('Rented_Bike Count')
plt.ylabel('Density')

##### 1. Why did you pick the specific chart?

We chose this graph to determine whether or not we now have a normal distribution.

##### 2. What is/are the insight(s) found from the chart?

We obtain an almost normal distribution after applying the square root to the skewed Rented Bike Count. As a result, we may perform the square root transformation during modelling.
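The effect of the square root transformation on skewness can also be checked numerically. The sketch below uses synthetic right-skewed values as a stand-in for the rented bike count. The gamma distribution and its parameters are assumptions chosen only to produce a right-skewed sample.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed counts standing in for rented_bike_count (hypothetical).
rng = np.random.default_rng(1)
counts = pd.Series(rng.gamma(shape=1.5, scale=450.0, size=5000))

# Sample skewness: values near 0 indicate a roughly symmetric distribution.
skew_raw = counts.skew()
skew_sqrt = np.sqrt(counts).skew()
print(f"skewness before: {skew_raw:.2f}, after sqrt: {skew_sqrt:.2f}")
```

On the real data, the same comparison would be `df['rented_bike_count'].skew()` against `np.sqrt(df['rented_bike_count']).skew()`.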

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 4 - Exploring Numerical features

### Numeric features

In [None]:
# features which are continuous variable
numerical_features = [feature for feature in df.columns if df[feature].dtypes != 'O' and feature not in [ 'month', 'year','date' , 'weekday'] ]
numerical_features 

In [None]:
# Chart - 4 visualization code
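# Examine the distribution of each numerical feature with a histogram and KDE curve
# (sns.distplot is deprecated; histplot with kde=True is its replacement)
for num in numerical_features :
  plt.figure(figsize=(12,5))
  sns.histplot(x=df[num].astype(float), kde=True, stat='density', color="purple")
  plt.xlabel(num)
  plt.show() 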



##### 1. Why did you pick the specific chart?

A distribution plot shows the shape and density of each numerical feature. In a skewed distribution, the mean is pulled towards the long tail and away from the median, so these plots help flag features that may need a transformation before modelling.

##### 2. What is/are the insight(s) found from the chart?

From the distribution plots above: temperature and humidity are approximately symmetric, and hour is evenly spread, since every hour of the day is recorded once per day. Positively skewed attributes: wind_speed, rented_bike_count, solar_radiation, snowfall, rainfall. Negatively skewed attribute: visibility.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Examining how each feature is distributed helps the business identify patterns, decide where the data needs treatment before modelling, and prioritize resources accordingly. Distribution plots of this kind therefore support data-driven decisions and more reliable demand forecasts.





## **Bivariate Analysis**

Let's determine how the numerical variables relate to our dependent variable.

#### Chart - 5 - Regression plot , Numeric feature Vs rented bike count

In [None]:
continuous_variable = [ 'temperature', 'wind_speed','dew_point_temp',  'humidity', 'visibility',  'rainfall', 'snowfall','solar_radiation' ]

In [None]:
rented_bike_count = ['rented_bike_count']

In [None]:
for i in continuous_variable:
  plt.figure(figsize=(13,6))
  sns.regplot(x=df[i], y=df['rented_bike_count'], scatter_kws={"color": "cyan"}, line_kws={"color": "red"})
  plt.title(f'rented_bike_count vs {i}')
  plt.tight_layout()


##### 1. Why did you pick the specific chart?

We chose a regression plot because it is an effective way to study the relationship between two variables: it shows how each independent variable relates to the dependent variable and overlays a best-fit line.

##### 2. What is/are the insight(s) found from the chart?

**Temperature**: Temperature is positively correlated with the rented bike count. Rentals peak between 20 and 30 °C, so temperature clearly affects demand.

**Visibility**: The effect of visibility is less clear, but it is positively related to the rented bike count.

**Dew point**: The dew point is the temperature to which air must be cooled, at constant pressure, to reach 100% relative humidity. It is positively correlated with the rented bike count.

**Wind speed**: Wind speed has little impact on the rented bike count.

**Humidity**: Humidity is the amount of moisture in the air. People tend to rent more bikes when the humidity is lower.

**Snowfall and Rainfall**: People avoid renting bikes when it is snowing or raining.

**Solar radiation**: Solar radiation is positively related to the dependent variable.



* Hour, Temperature, Wind Speed (weakly), Visibility, and Solar Radiation are positively correlated with the dependent variable, so the number of rented bikes rises as these features increase. Rainfall, Snowfall, and Humidity are negatively related to the dependent variable, so the number of rented bikes falls as they rise.
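The direction and strength of such relationships can be quantified with Pearson correlation coefficients. The sketch below is illustrative only: the data is synthetic, and the assumed effect sizes (bikes rising with temperature and falling with humidity) are hypothetical, not estimates from the Seoul dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
n = 2000
temperature = rng.normal(13, 12, n)
humidity = rng.uniform(20, 98, n)
# Hypothetical target: rises with temperature, falls with humidity.
rented = 30 * temperature - 6 * humidity + rng.normal(0, 200, n)

demo = pd.DataFrame({
    "temperature": temperature,
    "humidity": humidity,
    "rented_bike_count": rented,
})

# Pearson correlation of every feature with the target, strongest first.
corr_with_target = (
    demo.corr()["rented_bike_count"]
    .drop("rented_bike_count")
    .sort_values(key=np.abs, ascending=False)
)
print(corr_with_target.round(2))
```

A positive coefficient matches an upward-sloping regression line in the plots, and a negative one matches a downward slope. Values near zero indicate a weak linear relationship.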

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

## **Relationship between the dependent variable and the categorical features**

#### Chart - 6  - Year Vs Rented bike count

In [None]:
# Chart - 6 visualization code
plt.figure(figsize=(12,8))
sns.barplot(x=df.year,y=df['rented_bike_count'], palette=("pastel"))
plt.xlabel('Year')
plt.ylabel("Rented_Bike_Count")
plt.show()


##### 1. Why did you pick the specific chart?

We chose this bar plot to compare the average rented bike count across the years covered by the data.

##### 2. What is/are the insight(s) found from the chart?

The dataset consists mostly of records from 2018, with only a small portion from 2017.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 7 - Season Vs Rented bike count





In [None]:
# Chart - 7 visualization code
plt.figure(figsize=(15,9))
sns.barplot(x=df.seasons,y=df['rented_bike_count'],palette=("RdYlGn_r"))
plt.xlabel('Seasons')
plt.ylabel("Rented_Bike_Count")
plt.show()

##### 1. Why did you pick the specific chart?

In order to analyse the data that is divided into four seasons, we have chosen a bar chart to show the number of rented bikes that have been recorded.

##### 2. What is/are the insight(s) found from the chart?

The bar graph above shows that the average rented bike count is highest during the summer and lowest during the winter. People evidently prefer riding bicycles in the summer and the autumn.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

The graph gives insights that can steer the business towards a positive impact. Demand for bikes follows a seasonal pattern: it peaks in Summer, followed by Autumn and Spring, and is lowest in Winter. Based on this knowledge, we can plan to maximize profits during the summer, autumn, and spring seasons while devising ways to tackle the off-season.

#### Chart - 8 - Month Vs Rented bike count

In [None]:
# Chart - 8 visualization code
plt.figure(figsize=(15,9))
sns.barplot(x=df.month,y=df['rented_bike_count'],palette=("Oranges"))
plt.xlabel('Months')
plt.ylabel("Rented_Bike_Count")
plt.show()

##### 1. Why did you pick the specific chart?

We created a bar plot of the rented bike count by month, which lets us see which months have a higher demand for rented bikes.

##### 2. What is/are the insight(s) found from the chart?

The winter months of December, January, and February have the lowest demand for rented bikes, while the warm months of May, June, and July have the highest demand.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

The graph shows that demand for bikes fluctuates from month to month, which makes seasonality important for business planning. Demand peaks in May, June, and July, stays moderate in March, April, August, September, and November, and is lowest in the colder months of January, February, and December. With this knowledge, we can tailor our strategies to optimize profits during peak months while developing approaches to tackle the low-demand months.

#### Chart - 9 - Week_Day Vs rented bike count

In [None]:
df.columns

In [None]:
# Chart - 9 visualization code
plt.figure(figsize=(10,7))
sns.barplot(x=df.weekday,y=df['rented_bike_count'],palette=("brg_r") )
plt.xlabel('days')
plt.ylabel("Rented_Bike_Count")
plt.show()

##### 1. Why did you pick the specific chart?

A bar chart is used when you want to show a distribution of data points or perform a comparison of metric values across different subgroups of your data

Here, we used a bar plot to determine which day of the week has the highest demand for rental bikes.

##### 2. What is/are the insight(s) found from the chart?

The graph above shows that the average rented bike count is roughly consistent across all days of the week.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 10 - Holiday Vs No holiday

In [None]:
# View the frequency count of 'holiday' column
holiday_counts = df['holiday'].value_counts()
holiday_counts

In [None]:
# Chart - 10 visualization code
# Generate a pie chart to visualize the distribution
plt.figure(figsize=(10,8))
ax = plt.subplot(111)
plt.pie(holiday_counts, 
        autopct="%1.1f%%",
        startangle=90,
        shadow=True,
        labels=holiday_counts.index,  # take labels from the data so they always match the slices
        colors=['blue','red'],
        explode=[0,0])
plt.title('Distribution of Holiday ')
ax.legend(bbox_to_anchor=(1, 0, 0.5, 1))
plt.show()


##### 1. Why did you pick the specific chart?

A pie chart helps organize and show data as a percentage of a whole. When comparing different percentages, pie charts are widely utilised. I thus utilised a pie chart, which enabled me to compare the variable's percentages.

##### 2. What is/are the insight(s) found from the chart?

The pie chart shows how the 'holiday' column is distributed in the dataset. The large majority of hourly records, 95.1% or 8,328 records, fall on non-holiday days, while holidays account for only 4.9% or 432 records. This imbalance should be kept in mind when analysing the data, since external factors such as holidays can affect the trends and patterns observed.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Since more than 95% of the records fall on non-holiday days, most rentals occur on non-holiday days, and this is where the bulk of the business lies. Holidays make up a small share of the records, so they contribute comparatively little to total rentals. Note that the pie chart shows the share of records rather than the demand per hour, so it cannot by itself confirm lower demand on holidays.
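A pie chart of row counts shows how many records fall in each group, not how demand differs between the groups. One way to look at both at once is to compute each group's share of rows alongside its average hourly rentals. The sketch below uses a tiny hypothetical sample. In the notebook, the same `groupby` could be run on `df` directly.

```python
import pandas as pd

# Tiny hypothetical sample; the values are made up for illustration.
sample = pd.DataFrame({
    "holiday": ["No Holiday"] * 6 + ["Holiday"] * 2,
    "rented_bike_count": [700, 820, 640, 910, 760, 730, 480, 520],
})

# Share of rows per group next to the mean rentals per hourly record.
summary = sample.groupby("holiday")["rented_bike_count"].agg(
    share_of_rows=lambda s: len(s) / len(sample),
    mean_per_hour="mean",
)
print(summary)
```

Comparing `mean_per_hour` between the two groups answers the demand question, while `share_of_rows` reproduces what the pie chart displays.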

#### Chart - 11 - Average rented bikes per hour

In [None]:
# Chart - 11 visualization code
plt.figure(figsize=(10,7))
sns.barplot(x=df.hour,y=df['rented_bike_count'])
plt.xlabel('hour')
plt.ylabel("Rented_Bike_Count")
plt.show()

##### 1. Why did you pick the specific chart?

We picked this bar chart to analyze the demand for rented bikes per hour.

##### 2. What is/are the insight(s) found from the chart?

People favour rented bikes during rush hours: demand is high from 8:00 a.m. to 9:00 p.m. and rises most sharply at 8:00 a.m. and 6:00 p.m., which correspond to business opening and closing times.
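Peak hours can also be picked out programmatically from the hourly averages. The sketch below uses hypothetical hourly means with a morning and an evening peak. The numbers are invented for illustration, and the "higher than both neighbours" rule is one simple assumed definition of a peak.

```python
import pandas as pd

# Hypothetical hourly averages with a morning and an evening commute peak.
hourly_mean = pd.Series(
    [540, 430, 300, 200, 130, 140, 290, 610, 1020, 650, 530, 600,
     700, 730, 760, 830, 930, 1140, 1500, 1200, 1070, 1030, 920, 670],
    index=pd.RangeIndex(24, name="hour"),
    name="rented_bike_count",
)

# A local peak is an hour whose average exceeds both neighbouring hours.
is_peak = (hourly_mean > hourly_mean.shift(1)) & (hourly_mean > hourly_mean.shift(-1))
peak_hours = hourly_mean[is_peak].nlargest(2).index.tolist()
print(peak_hours)
```

In the notebook, `hourly_mean` could be replaced by `df.groupby('hour')['rented_bike_count'].mean()` to apply the same check to the real data.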

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

 These findings provide valuable insights into the usage patterns of rented bikes, suggesting that businesses should focus their marketing efforts on promoting the convenience and affordability of rented bikes during peak hours to capitalize on this trend. By doing so, businesses can enhance their profitability and better cater to the needs of their customers.

#### Chart - 12 - Functioning Day and Non Functional day

In [None]:
df.columns

In [None]:
# Chart - 12 visualization code
# Generate a pie chart to visualize the distribution
plt.figure(figsize=(10,8))
ax = plt.subplot(111)
plt.pie(df['functioning_day'].value_counts(), 
        autopct="%1.1f%%",
        startangle=90,
        shadow=True,
        labels=['Functional day(%)','Non functional day(%)'],
        colors=['blue','red'],
        explode=[0,0])
plt.title('(%) of functioning day and Non functional day ')
ax.legend(bbox_to_anchor=(1, 0, 0.5, 1))
plt.show()


##### 1. Why did you pick the specific chart?

A pie chart helps organize and show data as a percentage of a whole. When comparing different percentages, pie charts are widely utilised. I thus utilised a pie chart, which enabled me to compare the variable's percentages.

##### 2. What is/are the insight(s) found from the chart?

Based on the chart, 96.6% of the records belong to functioning days, while the remaining 3.4% belong to non-functioning days, i.e. periods when the rental service was not operating. Bike demand is therefore concentrated on functioning days.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Functioning days account for essentially all bike rentals, while non-functioning days, when the service is not operating, contribute little or nothing. From a business perspective, this presents an opportunity to capitalize on functioning days and to keep the number of non-functioning days low, since each one represents lost rental revenue.

#### Chart - 13 - Snowfall Vs rented bike count




In [None]:
plt.figure(figsize=(10,7))
sns.barplot(x=df.snowfall,y=df['rented_bike_count'])
plt.xlabel('snowfall')
plt.ylabel("Rented_Bike_Count")
plt.show()

##### 1. Why did you pick the specific chart?

In order to examine the provided data, we have chosen a bar chart to show the relationship between the number of rented bikes and the amount of snowfall.

##### 2. What is/are the insight(s) found from the chart?

We can observe a decline in demand for rented bikes when snow falls.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

The analysis of the graph suggests that snowfall has a significant impact on the demand for rented bikes, resulting in a considerable decrease. This decrease in demand, if not accounted for, can have a negative impact on the business. Therefore, it is crucial for businesses to be aware of the weather conditions and adjust their strategies accordingly to optimize growth potential and ensure the smooth running of operations.

## **Point plots of the rented bike count by hour across several categorical features**

#### Chart - 14 - Average bikes rented per hour

In [None]:
# Chart - 14 visualization code
fig,ax=plt.subplots(figsize=(18,8))
Average_Rent_hours =df.groupby('hour')['rented_bike_count'].mean()
ax=Average_Rent_hours.plot(legend=True, marker='o', title="Average_Bikes_Rented_Per_Hour")
ax.set_xticks(range(len(Average_Rent_hours)))
ax.set_xticklabels(Average_Rent_hours.index.tolist(), rotation=90)



##### 1. Why did you pick the specific chart?

We picked this line chart to analyze the average demand for rented bikes per hour.

##### 2. What is/are the insight(s) found from the chart?

The graph shows that rented bike use rises sharply around 8:00 a.m. and again around 6:00 p.m., within a generally busy period from 8:00 a.m. to 9:00 p.m. This suggests that people prefer rented bikes during peak hours, perhaps for their commute to work. In addition, the demand for rental bikes is greater on non-holiday days than on holidays.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

 This valuable insight is critical from a business perspective, as it highlights the growth potential during non-holiday days, while also underlining the challenge of low demand during holidays.

The peak hours during non-holiday days present a significant opportunity for business growth, whereas holidays can result in a sharp decline in profitability.

###Chart - 15 Average rented bike per day

In [None]:
fig, ax = plt.subplots(figsize=(15,9))
sns.pointplot(data=df,x='hour',y='rented_bike_count', palette=("Reds_r"),hue='weekday' , dodge=True, ci= None)
ax.set_xlabel('Hour',fontsize=15)
ax.set_ylabel('Rented_Bikes_Count',fontsize=15)
ax.set_title('Average Rented bikes per day' , fontsize=15)

##### 1. Why did you pick the specific chart?

We chose this plot to analyze how the demand for rented bikes varies across the days of the week.

##### 2. What is/are the insight(s) found from the chart?

By analyzing the graph, it can be concluded that the average number of rented bikes remains relatively stable from Monday to Saturday. However, there is a noticeable dip in bike rentals on Sundays, and on average, the number of rented bikes is significantly lower on weekends than on weekdays.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

This valuable insight presents a unique opportunity for businesses to capitalize on the stable demand for rented bikes during weekdays, while also strategizing ways to overcome the challenge of low demand on weekends.

####Chart - 16 Average Rented Bike Monthly

In [None]:
fig, ax = plt.subplots(figsize=(15,9))
sns.pointplot(data=df,x='hour',y='rented_bike_count',hue='month',ci= None, )
ax.set_xlabel('Hour',fontsize=15)
ax.set_ylabel('Rented_Bikes_Count',fontsize=15)
ax.set_title('Average Rented bikes monthly' , fontsize=15)

##### 1. Why did you pick the specific chart?

In this analysis, we utilized a line plot to visualize the trends in rented bike count over the hours of the day, specifically focusing on the monthly patterns. Line plots are a popular choice for time-series data visualization, as they represent data points connected by lines, allowing us to observe the changes in data values over time.

##### 2. What is/are the insight(s) found from the chart?

The analysis of the graph reveals that the demand for rented bikes is lower during the winter months, specifically December, January, and February, in contrast to the summer months of May, June, and July, which exhibit the highest demand. Additionally, the graph indicates a significant surge in rented bike usage between 8:00 a.m. and 9:00 p.m., highlighting the preference of individuals to rent bikes during peak hours, likely for their daily commute to work. 

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

This insight can help businesses tailor their strategies to maximize the growth potential during the summer months and peak hours, while also overcoming the challenge of lower demand during the winter season.

####Chart - 17 - Hourly demand of bike based on seasons


In [None]:
fig, ax = plt.subplots(figsize=(15,9))
sns.pointplot(data=df,x='hour',y='rented_bike_count',hue='seasons', palette=('Oranges'), ci=None )
ax.set_xlabel('Hour',fontsize=15)
ax.set_ylabel('Rented_Bikes_Count',fontsize=15)
ax.set_title('Hourly demand of Bike based on season' , fontsize=15)


##### 1. Why did you pick the specific chart?

Line plots are a popular choice for time-series data visualization, as they represent data points connected by lines, allowing us to observe the changes in data values over time.
We plotted this line plot to analyze the hourly demand for rented bikes across the seasons.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that consumers rent bikes more frequently in certain seasons. Summer has the highest number of rented bikes and winter the lowest.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

It is evident from the graph that there is a high demand for rented bikes during the summer months, followed by autumn and spring, and a comparatively lower demand during winter. We can conclude that the business has the potential to yield higher profits during the summer, autumn, and spring months, and comparatively lower profits during winter. By utilizing this information, businesses can develop effective strategies to maximize their profits.

##Multivariate Analysis

###Let's examine the correlation heat map of the numerical features to learn more about multicollinearity.

#### Chart - 18 - Correlation Heatmap

In [None]:
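# numeric_only=True skips the string-valued columns (seasons, holiday, functioning_day), which recent pandas versions refuse to correlate
df.corr(numeric_only=True)['rented_bike_count']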

In [None]:
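# Correlation Heatmap visualization code
plt.figure(figsize=(18,12))
# Restrict the correlation matrix to numeric columns so string-valued features do not raise an error
corr = sns.heatmap(df.corr(numeric_only=True),cmap='PiYG', square=True,annot=True)
corr.set_xticklabels(corr.get_xticklabels(),horizontalalignment='right',  rotation=50 )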
   

##### 1. Why did you pick the specific chart?

Multivariate analysis examines several variables together. A correlation heatmap suits this purpose because it shows the pairwise correlation between all numerical features in a single view.

##### 2. What is/are the insight(s) found from the chart?

The temperature(°C) and dew point temperature(°C) columns of this graph demonstrate multicollinearity, as can be seen.

#### Chart - 19 - Pair Plot 

In [None]:
df_ = df.columns

In [None]:
# Pair Plot visualization code
sns.pairplot(df,palette="bright")
plt.show()

##### 1. Why did you pick the specific chart?

I utilised pair plot to analyse data patterns and relationships between features. Pair plot is used to identify the optimal set of features to describe a relationship between two variables or to generate the most isolated clusters.

##### 2. What is/are the insight(s) found from the chart?

The pair plot shows few strongly linear relationships between variables, and the data points are not linearly separable. We may conclude that both positive and negative trends are present among the feature pairs in the graph.

In [None]:
df.columns

In [None]:
# Calculating VIF
from statsmodels.stats.outliers_influence import variance_inflation_factor
def calc_vif(X):
  vif = pd.DataFrame()
  vif["variables"] = X.columns
  vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

  return(vif)

In [None]:
calc_vif(df[[i for i in df.describe().columns if i not in ['rented_bike_count','dew_point_temp']]])

In [None]:
"""As a result, we may remove the DPT column from the dataset because
   having two variables that are this highly
   correlated won't improve prediction accuracy 
   and will instead make the model more complex."""

df.drop(columns=['dew_point_temp'],inplace=True)
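For reference, the variance inflation factor of feature j is 1 / (1 - R_j^2), where R_j^2 comes from regressing feature j on the remaining features. The sketch below is a minimal NumPy illustration on synthetic data; the helper `vif_by_hand` and the arrays `temp`, `dew`, and `wind` are made up for this example and are not taken from the dataset. It adds an intercept to each auxiliary regression, whereas (to our understanding) statsmodels' `variance_inflation_factor` uses the columns exactly as passed, so the numbers can differ from `calc_vif` unless a constant column is included.

```python
import numpy as np

def vif_by_hand(X):
    """Return VIF_j = 1 / (1 - R_j^2) for every column j of X (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    vifs = []
    for j in range(n_cols):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Auxiliary regression of column j on the other columns plus an intercept
        design = np.column_stack([np.ones(n_rows), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        ss_res = np.sum(residuals ** 2)
        ss_tot = np.sum((target - target.mean()) ** 2)
        r_squared = 1.0 - ss_res / ss_tot
        vifs.append(1.0 / (1.0 - r_squared))
    return np.array(vifs)

# Synthetic features: dew is almost a linear function of temp, wind is unrelated
rng = np.random.default_rng(0)
temp = rng.normal(15, 10, size=500)
dew = temp - 5 + rng.normal(0, 1, size=500)
wind = rng.normal(2, 1, size=500)

vifs = vif_by_hand(np.column_stack([temp, dew, wind]))
print(dict(zip(["temp", "dew", "wind"], vifs.round(2))))
```

In this toy setup the two nearly collinear columns receive a large VIF while the independent column stays close to 1, which is the pattern that motivates dropping one of a highly correlated pair.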

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

1) **Null Hypothesis** : There is no relation between Wind Speed and Rented Bike Count.

**Alternate Hypothesis** : There is a relationship between Wind Speed and Rented Bike Count.

2) **Null Hypothesis** : There is no relation between Temperature and Rented Bike Count.

**Alternate Hypothesis** : There is a relationship between Temperature and Rented Bike Count.

3) **Null Hypothesis** : There is no relation between Holiday and Rented Bike Count.

**Alternate Hypothesis** : There is a relationship between Holiday and Rented Bike Count.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

 **Null Hypothesis** - There is no relation between Wind Speed and Rented Bike Count.

**Alternate Hypothesis** - There is a relationship between Wind Speed and Rented Bike Count

#### 2. Perform an appropriate statistical test.

In [None]:
from scipy.stats import pearsonr

def hypothesis_test(x, y, alpha=0.05):
    stat, p = pearsonr(x, y)
    print(f"Correlation coefficient: {stat:.3f}, p-value: {p:.3f}")
    if p > alpha:
        print("Fail to reject the null hypothesis")
    else:
        print("Reject the null hypothesis")

first_sample = df["wind_speed"].head(100)
second_sample = df["rented_bike_count"].head(100)

hypothesis_test(first_sample, second_sample)


##### Which statistical test have you done to obtain P-Value?

**We calculated the P-Value and Pearson Correlation coefficient values using the Pearson Correlation test.**

##### Why did you choose the specific statistical test?

**The Pearson correlation test measures the strength of the linear relationship between two numeric series, which makes it suitable here. In this case, the test comparing wind speed and the number of rented bikes fails to reject the null hypothesis, so we find no evidence that the number of rented bikes depends on wind speed.**
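`pearsonr` derives its p-value analytically. As a cross-check, the same null hypothesis (no linear association) can also be examined with a permutation test, which is a different technique from the one used in this notebook. The sketch below runs on synthetic data; the helpers `pearson_r` and `permutation_pvalue` and the arrays `wind` and `rentals` are hypothetical names introduced only for illustration.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient computed from centred values."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float(np.sum(xc * yc) / np.sqrt(np.sum(xc ** 2) * np.sum(yc ** 2)))

def permutation_pvalue(x, y, n_permutations=2000, seed=0):
    """Two-sided p-value: share of shuffles with |r| at least as large as observed."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    observed = abs(pearson_r(x, y))
    hits = 0
    for _ in range(n_permutations):
        # Shuffling y breaks any real association with x
        if abs(pearson_r(x, rng.permutation(y))) >= observed:
            hits += 1
    # The +1 terms keep the estimate strictly above zero
    return (hits + 1) / (n_permutations + 1)

# Synthetic example with a strong negative linear relationship
rng = np.random.default_rng(42)
wind = rng.normal(2.0, 1.0, size=100)
rentals = 300 - 80 * wind + rng.normal(0, 20, size=100)

r = pearson_r(wind, rentals)
p = permutation_pvalue(wind, rentals)
print(f"r = {r:.3f}, permutation p-value = {p:.4f}")
```

A permutation test makes no distributional assumption about the two series, at the cost of extra computation.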

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

**Null Hypothesis** : There is no relation between Temperature and Rented Bike Count.

**Alternate Hypothesis** : There is a relationship between Temperature and Rented Bike Count

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

def hypothesis_test(x, y, alpha=0.05):
    stat, p = pearsonr(x, y)
    print(f"Correlation coefficient: {stat:.3f}, p-value: {p:.3f}")
    if p > alpha:
        print("Fail to reject the null hypothesis")
    else:
        print("Reject the null hypothesis")

first_sample = df["temperature"].head(100)
second_sample = df["rented_bike_count"].head(100)

hypothesis_test(first_sample, second_sample)



##### Which statistical test have you done to obtain P-Value?

**We calculated the P-Value and Pearson Correlation coefficient values using the Pearson Correlation test.**

##### Why did you choose the specific statistical test?

**The Pearson correlation test measures the strength of the linear relationship between two numeric series. Here, the test comparing temperature and the number of rented bikes rejects the null hypothesis, indicating that the number of rented bikes depends on the temperature and that the two variables are correlated.**

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

**Null Hypothesis** : There is no relation between Holiday and Rented Bike Count.

**Alternate Hypothesis** : There is a relationship between Holiday and Rented Bike Count

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

def hypothesis_test(x, y, alpha=0.05):
    stat, p = pearsonr(x, y)
    print(f"Correlation coefficient: {stat:.3f}, p-value: {p:.3f}")
    if p > alpha:
        print("Fail to reject the null hypothesis")
    else:
        print("Reject the null hypothesis")

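# 'holiday' still holds string labels at this point, so convert it to integer codes before computing the correlation
first_sample = pd.Series(pd.factorize(df["holiday"])[0]).head(100)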
second_sample = df["rented_bike_count"].head(100)

hypothesis_test(first_sample, second_sample)



##### Which statistical test have you done to obtain P-Value?

**We calculated the P-Value and Pearson Correlation coefficient values using the Pearson Correlation test.**

##### Why did you choose the specific statistical test?

**The Pearson correlation test measures the strength of the linear relationship between two numeric series. Here, the test comparing holiday and the number of rented bikes rejects the null hypothesis, indicating that the number of rented bikes depends on the holiday and that the two variables are correlated.**

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
df.columns

In [None]:
# Handling Missing Values & Missing Value Imputation
df.isnull().sum()


#### What all missing value imputation techniques have you used and why did you use those techniques?

Null values were already handled earlier, so our dataset is complete and free of duplicate, missing, or null values.

### 2. Handling Outliers

###Let's use the IQR method to create a function for the outlier treatment, capping the outliers in the 25–75 percentile.

In [None]:
df.columns

In [None]:
print("Numeric_features: ",continuous_variable)
print("Categorical_features: ",categorical_features)

In [None]:
# Handling Outliers & Outlier treatment
fig, axes = plt.subplots(nrows=2, ncols=4, figsize=(20,10))
for i, ax in zip(continuous_variable, axes.flatten()):
    sns.boxplot(df[i], ax=ax)
    ax.set_title(i)
plt.tight_layout()


**The box plots above show that the variables "wind speed," "rainfall," "snowfall," and "solar radiation" have a number of outliers, whereas the other numeric features show few or none.**

**Let's build a function to determine the percentage and quantity of outliers present in each feature so that we can manage them appropriately.**

In [None]:
##using IQR to define the code for outlier detection and percentage.
def detect_outliers(bike_df):
    data = sorted(bike_df)
    q1, q2, q3 = np.percentile(data, [25, 50, 75])
    print(f"q1:{q1}, q2:{q2}, q3:{q3}")
    IQR = q3 - q1
    lower_bound, upper_bound = q1 - 1.5*IQR, q3 + 1.5*IQR
    print(f"Lower bound: {lower_bound}, Upper bound: {upper_bound}, IQR: {IQR}")
    

    outliers = [i for i in data if i < lower_bound or i > upper_bound]
    num_outliers = len(outliers)
    perc_outliers = round(num_outliers * 100 / len(data), 2)
    print(f"Total number of outliers are: {num_outliers}")
    # perc_outliers is already a percentage, so print it directly
    print(f"Total percentage of outliers is: {perc_outliers} %")
    

    results = (
        q1, q2, q3,
        IQR,
        lower_bound, upper_bound,
        outliers,
        num_outliers,
        perc_outliers
    )
    return results




In [None]:
#Finding the IQR, lower and upper bounds, and counting the number of outliers present in each continuous numerical feature
for feature in continuous_variable:
  print(feature,":")
  detect_outliers(df[feature])
  print("\n")

In [None]:
## Using the IQR technique to define the function that treat outliers
def treat_outliers_iqr(data):
    # Sort the data in ascending order
    sorted_data = sorted(data)
    
    # Calculate the quartile indices
    q1_index = int(len(sorted_data) * 0.25)
    q2_index = int(len(sorted_data) * 0.5)
    q3_index = int(len(sorted_data) * 0.75)
    
    
    # Calculate the quartile values
    q1 = sorted_data[q1_index]
    q2 = sorted_data[q2_index]
    q3 = sorted_data[q3_index]
    
    # Calculate the interquartile range (IQR)
    iqr = q3 - q1
    
    # Identify the outliers
    lower_bound = q1 - (1.5 * iqr)
    upper_bound = q3 + (1.5 * iqr)
    outliers = [x for x in data if x < lower_bound or x > upper_bound]
    
    # Treat the outliers by replacing them with the nearest quartile value (q1 or q3)
    treated_data = [q1 if x < lower_bound else q3 if x > upper_bound else x for x in data]
    # Cast every value to int (note that this truncates any fractional part)
    treated_data_int = [int(value) for value in treated_data]
    
    return treated_data_int


In [None]:
##Using the function we defined previously to treat outliers, passing each feature one by one from the continuous value feature list.
for treat in continuous_variable:
  df[treat]= treat_outliers_iqr(df[treat])
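The treatment above replaces outliers with Q1 or Q3 and casts every value to `int`. A common alternative, shown here only as a sketch and not as the method used in this notebook, is to cap values at the IQR fences with `np.clip`, which leaves non-outlying values (including fractional ones) unchanged. The helper name `cap_at_iqr_fences` and the `sample` list are hypothetical.

```python
import numpy as np

def cap_at_iqr_fences(values, k=1.5):
    """Cap values at [Q1 - k*IQR, Q3 + k*IQR] and return a float array."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    # Values inside the fences are returned unchanged
    return np.clip(values, lower_fence, upper_fence)

# Toy data with one extreme value
sample = [1, 2, 3, 4, 5, 100]
capped = cap_at_iqr_fences(sample)
print(capped)
```

Capping at the fences changes only the points flagged as outliers and moves them a shorter distance than replacing them with a quartile does.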

In [None]:
fig, axes = plt.subplots(2, 4, figsize=(32, 10))
for col, ax in zip(continuous_variable, axes.flatten()):
    sns.boxplot(df[col], ax=ax)
    ax.set_title(col.title(), weight='bold')
fig.tight_layout()

###Bivariate outlier analysis 

In [None]:
categorical_features = ['hour','seasons', 'holiday', 'functioning_day', 'month', 'weekday', 'year']

In [None]:
# Checking the outliers present in each categorical_features.
for cat_f in categorical_features:
  plt.figure(figsize=(12,6))
  # Pass the target column by name
  sns.boxplot(x=cat_f, y='rented_bike_count', data=df)
  plt.title(cat_f)
  plt.show()

In [None]:
#Finding the IQR, lower and upper bounds, and the number of outliers present in each category of object dtype characteristics
for outlier in categorical_features:
    print(f"Feature: {outlier}")
    cats = df[outlier].unique().tolist()
    for i, (cat, data) in enumerate(df.groupby(outlier)["rented_bike_count"]):
        print(f"{i+1}: Category: {cat}")
        detect_outliers(data)
        print()

####Although a few outliers remain within individual categories, we won't treat them, because the ML models we'll be using can handle them without compromising accuracy.


##### What all outlier treatment techniques have you used and why did you use those techniques?

We defined two functions, one for outlier detection and one for outlier treatment using the IQR method, and ran all observations of the continuous features through them. Values below the lower bound (Q1 - 1.5 × IQR) were replaced with the 25th percentile value, and values above the upper bound (Q3 + 1.5 × IQR) were replaced with the 75th percentile value.

### 3. Categorical Encoding

We have a variety of encoding methods, but the main ones are:



1)- When features are ordinal in nature and have a rank between them, ordinal encoding is used.

2)- When the features are nominal in nature and have equal weight, nominal encoding is used.



We will utilise One-Hot Encoding (Type of Nominal encoding) in our scenario because all of our category columns are nominal in nature:

In [None]:
df.columns

In [None]:
df['is_functioning_day'] = (df['functioning_day'] == 'Yes').astype(int)
df['Not functioning_day'] = (df['functioning_day'] == 'No').astype(int)

# Drop the original 'Functioning Day' column
df.drop(columns=['functioning_day'], inplace=True)


In [None]:
# Create a list of the season names
# Encode your categorical columns of Season 
season_names = ['Winter', 'Spring', 'Summer', 'Autumn']

# Loop through the season names and create a new column for each season
for season in season_names:
    df[season] = (df['seasons'] == season).astype(int)

# Drop the original 'Seasons' column
df.drop(columns=['seasons'], inplace=True)


In [None]:
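# Encode your categorical columns of Holiday
# Normalise case and surrounding whitespace so the comparison matches the stored labels
holiday_label = df['holiday'].str.strip().str.lower()
df['is_holiday'] = np.where(holiday_label == 'holiday', 1, 0)
df['No holiday'] = np.where(holiday_label == 'no holiday', 1, 0)

# Drop the 'holiday' column from the dataframe
df.drop(columns=['holiday'], inplace=True)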

In [None]:
df.head().T

In [None]:
## Encode your categorical columns of Hour, Month, Weekday and year
category=['hour','month','weekday', 'year']
for col in category:
  df[col]=df[col].astype('category')

In [None]:
#creating dummy variable for ease of operations on categorical features.
df = pd.get_dummies(df,drop_first=True,sparse=True)
df.head().T
     

In [None]:
df.columns

In [None]:
df.head()



#### What all categorical encoding techniques have you used & why did you use those techniques?

- To make our categorical (object type) features suitable as input to the ML algorithms used later, we applied one-hot encoding, converting each category into an integer dummy column.

- The categorical features have a small number of distinct, unordered categories, so nominal (one-hot) encoding is preferred over ordinal encoding.
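The effect of one-hot encoding, and of the `drop_first=True` option passed to `pd.get_dummies` above, can be seen on a toy frame. This is a minimal sketch; the `toy` DataFrame is made up and only mirrors the season labels used in this notebook.

```python
import pandas as pd

# Toy frame with the four season labels
toy = pd.DataFrame({"seasons": ["Winter", "Spring", "Summer", "Autumn", "Winter"]})

# One column per category
full = pd.get_dummies(toy, columns=["seasons"], dtype=int)

# drop_first=True removes the first category (alphabetically), which is then
# implied when all remaining dummy columns are 0
reduced = pd.get_dummies(toy, columns=["seasons"], drop_first=True, dtype=int)

print(full.columns.tolist())
print(reduced.columns.tolist())
```

Dropping one level avoids perfectly collinear dummy columns, which matters for the linear models fitted later.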


### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting

##### What all feature selection methods have you used  and why?

Answer Here.

##### Which all features you found important and why?

Answer Here.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# checking which of the variables are continuous in nature
for i in df.columns:
  print(f"The total number of unique counts in {i} is: {df[i].nunique()}")

In [None]:
df.columns

In [None]:
# Transform Your data
df['rented_bike_count']=np.log1p(df['rented_bike_count'])
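Because the target is now stored as `log1p(count)`, model predictions and error metrics computed on it are on the log scale. `np.expm1` is the exact inverse and can be used to map values back to bike counts. A minimal sketch with made-up counts (the `counts` array is illustrative, not taken from the dataset):

```python
import numpy as np

# Made-up hourly counts, including zero
counts = np.array([0, 5, 120, 980, 3556], dtype=float)

# Forward transform, as applied to the target column above
log_counts = np.log1p(counts)

# Inverse transform to recover values on the original scale
recovered = np.expm1(log_counts)

print(log_counts.round(3))
print(recovered.round(3))
```

`log1p` is defined at zero, which is why it is preferred over a plain logarithm for count data that can be 0.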

### 6. Data Scaling

In [None]:
# Importing StandardScaler and normalize library
from sklearn.preprocessing import StandardScaler
##scaler = StandardScaler()
from sklearn.preprocessing import normalize

In [None]:
# Scaling your data
X = df.drop(columns = ['rented_bike_count'] , axis = 1)
y = df['rented_bike_count']

##### Which method have you used to scale you data and why?

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Answer Here.

In [None]:
# Dimensionality Reduction (If needed)

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Answer Here.

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.
X_train , X_test, y_train, y_test =train_test_split(X, y, test_size= 0.2 , random_state =0)

In [None]:
# getting shape
X_train.shape, X_test.shape, y_train.shape, y_test.shape

In [None]:
std_scalar = StandardScaler()
X_train = std_scalar.fit_transform(X_train)
X_test = std_scalar.transform(X_test)
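The cell above fits the scaler on the training split only and reuses its statistics for the test split, so no information from the test set leaks into preprocessing. The arithmetic behind that pattern is sketched below with plain NumPy on synthetic arrays; the names `train`, `test`, `train_mean`, and `train_std` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
train = rng.normal(50, 10, size=(200, 3))
test = rng.normal(50, 10, size=(50, 3))

# Statistics are estimated from the training split only
train_mean = train.mean(axis=0)
train_std = train.std(axis=0)  # population standard deviation (ddof=0)

# The same statistics are applied to both splits
train_scaled = (train - train_mean) / train_std
test_scaled = (test - train_mean) / train_std

print(train_scaled.mean(axis=0).round(6), train_scaled.std(axis=0).round(6))
print(test_scaled.mean(axis=0).round(3))
```

The scaled training data has zero mean and unit variance by construction, while the test data is only approximately centred because it is scaled with the training statistics.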


##### What data splitting ratio have you used and why? 

We used an 80/20 train/test split (`test_size=0.2`), with `random_state=0` so that the split is reproducible.

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Answer Here.

In [None]:
# Handling Imbalanced Dataset (If needed)

##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

Answer Here.

## ***7. ML Model Implementation***

In [None]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def regression_metrics(y_train_actual, y_train_pred, y_test_actual, y_test_pred):
    """Print MAE, MSE, RMSE, R2 and adjusted R2 for the train and test sets."""

    # Calculate mean absolute error for train and test sets
    train_mae = mean_absolute_error(y_train_actual, y_train_pred)
    test_mae = mean_absolute_error(y_test_actual, y_test_pred)

    # Print MAE values
    print(f"MAE on train set: {train_mae:.3f}")
    print(f"MAE on test set: {test_mae:.3f}")

    # Calculate mean squared error for train and test sets
    train_mse = mean_squared_error(y_train_actual, y_train_pred)
    test_mse = mean_squared_error(y_test_actual, y_test_pred)

    # Print MSE values
    print(f"MSE on train set: {train_mse:.3f}")
    print(f"MSE on test set: {test_mse:.3f}")

    # Calculate root mean squared error for train and test sets
    train_rmse = np.sqrt(train_mse)
    test_rmse = np.sqrt(test_mse)

    # Print RMSE values
    print(f"RMSE on train set: {train_rmse:.3f}")
    print(f"RMSE on test set: {test_rmse:.3f}")

    # Calculate r2_score for train and test sets
    R2_train = r2_score(y_train_actual, y_train_pred)
    R2_test = r2_score(y_test_actual, y_test_pred)
    print("R2 on train set:", R2_train)
    print("R2 on test set:", R2_test)

    # Adjusted R2: each split uses its own number of rows n; p is the number of features.
    # X_train and X_test are the global feature matrices from the data-splitting step.
    n_train, p = X_train.shape
    n_test = X_test.shape[0]
    train_Adj_R2 = 1 - (1 - R2_train) * (n_train - 1) / (n_train - p - 1)
    print('Adjusted R2 on train is :', train_Adj_R2)
    test_Adj_R2 = 1 - (1 - R2_test) * (n_test - 1) / (n_test - p - 1)
    print('Adjusted R2 on test is :', test_Adj_R2)
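The adjusted R2 used in this function follows the standard formula 1 - (1 - R2)(n - 1)/(n - p - 1), where n is the number of rows in the split being scored and p is the number of features. A small standalone sketch (the helper name `adjusted_r2` and the example numbers are illustrative, not results from this notebook):

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1)."""
    denominator = n_samples - n_features - 1
    if denominator <= 0:
        raise ValueError("n_samples must exceed n_features + 1")
    return 1 - (1 - r2) * (n_samples - 1) / denominator

# Illustrative values only
example = adjusted_r2(0.5, n_samples=11, n_features=5)
penalised = adjusted_r2(0.84, n_samples=1000, n_features=50)
print(example, penalised)
```

Because the penalty grows with p relative to n, the adjusted value is never higher than the plain R2 and drops noticeably when many features are used on few rows.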

### ML Model - 1  Linear regression

In [None]:
# ML Model - 1 Implementation
regressor= LinearRegression()

# Fit the Algorithm
regressor.fit(X_train,y_train)


# Predict on the model
y_pred_train_Lr=regressor.predict(X_train)
y_pred_test_Lr=regressor.predict(X_test)
     

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
regressor.coef_

In [None]:
regressor.intercept_

In [None]:
regressor.score(X_train, y_train)

In [None]:
print(y_pred_test_Lr)

In [None]:
# Calculating the regression metrics
regression_metrics(y_train,y_pred_train_Lr,y_test,y_pred_test_Lr)

In [None]:
#plotting evaluation graph
fig, ax = plt.subplots(figsize=(18,10))
ax.plot(y_pred_test_Lr[:100], label='Predicted')
ax.plot(np.array(y_test[:100]), label='Actual')
ax.legend(fontsize=12)
plt.show()

#####We started with Linear Regression, the simplest and most fundamental machine learning model, and evaluated the main regression metrics on both the train and test sets. The train and test r2 scores are close to each other, which indicates that the model generalises to the test set without overfitting.

#####The best r2 score obtained with the Linear Regression model was 0.84.

#### 2. Cross- Validation & Hyperparameter Tuning

###**Ridge (L2) Regression**

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Create a Ridge instance
ridge = Ridge()

# Define a dictionary of parameter values to be searched
param_grid_ridge = {"alpha": [1e-15,1e-10,1e-8,1e-5,1e-4,1e-3,1e-2,1,5,10,20,30,40,45,50,55,60,100], "max_iter":[1,2,3]}

# Create a GridSearchCV object with the Ridge instance and the parameter grid
grid_search = GridSearchCV(ridge, param_grid_ridge, scoring='neg_mean_squared_error', cv=3)

# Fit the GridSearchCV object to the training data
grid_search.fit(X_train, y_train)

# Get the best estimator (Ridge model) from the GridSearchCV object
best_ridge = grid_search.best_estimator_

# Use the best estimator to predict the output for the training and test data
y_train_ridge_predict = best_ridge.predict(X_train)
y_test_ridge_predict = best_ridge.predict(X_test)

# Print the best alpha value found by GridSearchCV and the corresponding negative mean squared error
print("Best parameter values : ", grid_search.best_params_)
print("Negative mean squared error: ", grid_search.best_score_)


In [None]:
# Ridge's regression metrics are calculated.
regression_metrics(y_train,y_train_ridge_predict,y_test,y_test_ridge_predict)

In [None]:
#plotting evaluation graph
fig, ax = plt.subplots(figsize=(18,10))
ax.plot(y_test_ridge_predict[:100], label='Predicted' ,color='blue')
ax.plot(np.array(y_test[:100]), label='Actual', color='red')
ax.legend(fontsize=12)
plt.show()

###**Lasso Regression** (L1)

In [None]:
# Create a lasso instance
lasso= Lasso()

# Define a dictionary of parameter values to be searched
param_grid_lasso = {"alpha": [1e-15,1e-13,1e-10,1e-8,1e-5,1e-4,1e-3,1e-2,1e-1,1,5,10,20,30,40,45,50,55,60,100,0.0014] , "max_iter":[7,8,9,10]} 

# Create a GridSearchCV object with the lasso instance and the parameter grid
grid_search = GridSearchCV(lasso, param_grid_lasso, scoring='neg_mean_squared_error', cv=5)

# Fit the GridSearchCV object to the training data
grid_search.fit(X_train, y_train)

# Get the best estimator (lasso model) from the GridSearchCV object
best_lasso = grid_search.best_estimator_

# Use the best estimator to predict the output for the training and test data
y_train_lasso_predict = best_lasso.predict(X_train)
y_test_lasso_predict = best_lasso.predict(X_test)

# Print the best alpha value found by GridSearchCV and the corresponding negative mean squared error
print("Best parameter values : ", grid_search.best_params_)
print("Negative mean squared error: ", grid_search.best_score_)


In [None]:
# lasso's regression metrics are calculated.
regression_metrics(y_train,y_train_lasso_predict,y_test,y_test_lasso_predict)
     

In [None]:
#plotting evaluation graph
fig, ax = plt.subplots(figsize=(18,10))
ax.plot(y_test_lasso_predict[:100], label='Predicted' ,color='skyblue')
ax.plot(np.array(y_test[:100]), label='Actual', color='orange')
ax.legend(fontsize=12)
plt.show()

####**Elastic Net Regression**

In [None]:
from sklearn.linear_model import ElasticNet

# Create an ElasticNet instance
elastic_net = ElasticNet()

# Define a dictionary of parameter values to be searched
param_grid_E_net = {"alpha": [1e-5,1e-4,1e-3,1e-2,1,5]} 

# Create a GridSearchCV object with the ElasticNet instance and the parameter grid
grid_search = GridSearchCV(elastic_net, param_grid_E_net, scoring='neg_mean_squared_error', cv=5)

# Fit the GridSearchCV object to the training data
grid_search.fit(X_train, y_train)

# Get the best estimator (ElasticNet model) from the GridSearchCV object
best_elastic_net = grid_search.best_estimator_

# Use the best estimator to predict the output for the training and test data
y_train_elestic_net_predict = best_elastic_net.predict(X_train)
y_test_elestic_net_predict = best_elastic_net.predict(X_test)

# Print the best alpha value found by GridSearchCV and the corresponding negative mean squared error
print("Best parameter values : ", grid_search.best_params_)
print("Negative mean squared error: ", grid_search.best_score_)


In [None]:
# ElasticNet regression metrics are calculated.
regression_metrics(y_train,y_train_elestic_net_predict,y_test,y_test_elestic_net_predict)

In [None]:
#plotting evaluation graph
fig, ax = plt.subplots(figsize=(18,10))
ax.plot(y_test_elestic_net_predict[:100], label='Predicted' ,color='skyblue')
ax.plot(np.array(y_test[:100]), label='Actual', color='orange')
ax.legend(fontsize=12)
plt.show()

##### Which hyperparameter optimization technique have you used and why?

GridSearchCV was chosen as the hyperparameter optimization technique because it exhaustively evaluates every combination of the hyperparameter values listed in the grid. Each combination is scored with cross-validation, and the combination with the best score is selected. The search is thorough within the specified grid, although it cannot find values that the grid does not contain.
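To make the cost of an exhaustive search concrete, the sketch below enumerates the Ridge grid defined earlier using only the standard library. It does not fit any model; the names `combinations` and `n_fits` are introduced here for illustration.

```python
from itertools import product

# Same grid as param_grid_ridge above
param_grid = {
    "alpha": [1e-15, 1e-10, 1e-8, 1e-5, 1e-4, 1e-3, 1e-2,
              1, 5, 10, 20, 30, 40, 45, 50, 55, 60, 100],
    "max_iter": [1, 2, 3],
}
cv_folds = 3  # matches cv=3 in the Ridge search

# Every combination of one value per hyperparameter
names = list(param_grid)
combinations = [dict(zip(names, values)) for values in product(*param_grid.values())]

# Each combination is fitted once per cross-validation fold
n_fits = len(combinations) * cv_folds
print(len(combinations), "combinations,", n_fits, "cross-validation fits")
```

With its default `refit=True`, GridSearchCV additionally refits the best combination once on the full training data. The number of fits grows multiplicatively with every hyperparameter added to the grid.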

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Lasso, Ridge, and Elastic Net models were all used, but we still weren't able to detect any appreciable change in the r2 score, MSE, or MAE.

### ML Model - 2   Decision Trees

In [None]:
# Import and configure the decision tree model for training
from sklearn.tree import DecisionTreeRegressor
DT_Regressor = DecisionTreeRegressor(max_depth=10,  max_leaf_nodes=100)

 # Fitting Decision Tree to the Training set
DT_Regressor.fit(X_train, y_train)

#y pred for test and train data
y_train_predict_DT = DT_Regressor.predict(X_train)
y_test_predict_DT = DT_Regressor.predict(X_test)

In [None]:
# Visualizing evaluation Metric Score chart
regression_metrics(y_train,y_train_predict_DT,y_test,y_test_predict_DT)

In [None]:
#plotting evaluation graph
fig, ax = plt.subplots(figsize=(18,10))
ax.plot(y_test_predict_DT[:100], label='Predicted' ,color='skyblue')
ax.plot(np.array(y_test[:100]), label='Actual', color='orange')
ax.legend(fontsize=12)
plt.show()

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

After the linear regression models, we tried a decision tree and observed a small increase in the test r2 score, from 0.84 to 0.85, meaning that about 85% of the variance in the test dataset is explained by the model. The RMSE also decreased, to 0.631. Because the trained model captured the variation in the test data reasonably well, we chose to fine-tune its hyperparameters and evaluate the outcome.

#### 2. Cross- Validation & Hyperparameter Tuning

Decision Tree

In [None]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import GridSearchCV

# Define the range of hyperparameters
param_grid = {
    'max_depth': [8, 9, 10],
    'min_samples_leaf': [6, 7, 8],
    'min_samples_split': [2, 3, 4]  # must be at least 2; scikit-learn rejects a value of 1
}

# Create an instance of the model
Decision_Tree = DecisionTreeRegressor()

# Define the grid search object
grid_search_Tree = GridSearchCV(
    estimator=Decision_Tree,
    param_grid=param_grid,
    scoring='neg_mean_squared_error',
    cv=3,
    n_jobs=-1
)

# Fit the grid search object to the data
grid_search_Tree.fit(X_train, y_train)

# Print the best parameters and negative mean squared error
print(f"The best hyperparameters are: {grid_search_Tree.best_params_}")
print(f"Negative mean squared error is: {grid_search_Tree.best_score_}")

# Make predictions on the training and testing data
y_train_DecisionTree_pred = grid_search_Tree.predict(X_train)
y_test_DecisionTree_pred = grid_search_Tree.predict(X_test)


In [None]:
# Visualizing evaluation Metric Score chart
regression_metrics(y_train,y_train_DecisionTree_pred,y_test,y_test_DecisionTree_pred)

In [None]:
#plotting evaluation graph
fig, ax = plt.subplots(figsize=(18,10))
ax.plot(y_test_DecisionTree_pred[:100], label='Predicted' ,color='skyblue')
ax.plot(np.array(y_test[:100]), label='Actual', color='orange')
ax.legend(fontsize=12)
plt.show()

##### Which hyperparameter optimization technique have you used and why?

We chose GridSearchCV as our hyperparameter optimization strategy because it evaluates every combination of the candidate hyperparameter values with cross-validation and then selects the best-scoring combination. For a small grid such as this one, the exhaustive search is affordable and leaves no candidate untested.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

To find the combination with the lowest MSE, we tried several parameter values: max_depth in [8, 9, 10], min_samples_leaf in [6, 7, 8], and the min_samples_split values listed in the grid above. Hyperparameter tuning of the decision tree reduced the MSE on the test dataset from about 0.41 to 0.36.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

We measure the performance of our ML model with several metrics, all of which quantify the gap between actual and predicted values. In this instance there is little difference between the train and test values of each metric, which suggests that the model generalizes well to unseen data. As a result, the dependent variable that matters to the business, the number of rented bikes, is predicted to an extent of 86.9%, with an average error of about 3% of the mean of the actual values.

### ML Model - 3 Random Forest

In [None]:
# import the regressor
from sklearn.ensemble import RandomForestRegressor 

# Instantiate the random forest regressor
RF_Regressor = RandomForestRegressor(n_estimators=100, max_depth=15)

# Fit the random forest to the training set
RF_Regressor.fit(X_train, y_train)

#y pred for test and train data
y_RandomForest_train_pred = RF_Regressor.predict(X_train)
y_RandomForest_test_pred = RF_Regressor.predict(X_test)

In [None]:
# Calculating Regression Metrics using RandomForestRegressor
regression_metrics(y_train,y_RandomForest_train_pred,y_test,y_RandomForest_test_pred)

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
#plotting evaluation graph
fig, ax = plt.subplots(figsize=(18,10))
ax.plot(y_RandomForest_test_pred[:100], label='Predicted' ,color='skyblue')
ax.plot(np.array(y_test[:100]), label='Actual', color='orange')
ax.legend(fontsize=12)
plt.show()

Our third model, the Random Forest, improves the predictions noticeably. The training dataset yields an r2 score of 0.95 and the test dataset follows closely with 0.91, while the MSE drops from 0.361 to 0.232.

#### 2. Cross- Validation & Hyperparameter Tuning

#### **RandomizedSearchCV**

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)
# Import necessary libraries
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Instantiate the Random Forest model
RF_tree = RandomForestRegressor()

# Define the parameters to be tuned
param_dist = {'n_estimators': [100], 'max_depth': [15,17,19], 'min_samples_leaf': [1, 2,3]} 

# Perform hyperparameter tuning using RandomizedSearchCV
RF_treeR = RandomizedSearchCV(estimator=RF_tree, param_distributions=param_dist, n_iter=5, n_jobs=-1, scoring='neg_mean_squared_error', cv=3, verbose=3)
RF_treeR.fit(X_train, y_train)

# Make predictions on training and test datasets
y_train_RFtree_pred = RF_treeR.predict(X_train)
y_test_RFtree_pred = RF_treeR.predict(X_test)

# Retrieve the best parameters and negative mean square error score
best_params = RF_treeR.best_params_
best_score = RF_treeR.best_score_

# Print the best parameters and negative mean square error score
print(f"The optimal hyperparameters found were: {best_params}")
print(f"The negative mean square error score is: {best_score}")


In [None]:
# Calculating Regression Metrics using RandomizedSearchCV in RandomForestRegressor
regression_metrics(y_train,y_train_RFtree_pred,y_test,y_test_RFtree_pred)

In [None]:
# Visualizing evaluation Metric Score chart
#plotting evaluation graph
fig, ax = plt.subplots(figsize=(18,10))
ax.plot(y_test_RFtree_pred[:100], label='Predicted' ,color='skyblue')  # predictions from the tuned model
ax.plot(np.array(y_test[:100]), label='Actual', color='orange')
ax.legend(fontsize=12)
plt.show()

##### Which hyperparameter optimization technique have you used and why?

RandomizedSearchCV was used for Random Forest because the dataset is large and each fit of a complex model is expensive. Instead of trying every combination, it evaluates a fixed number of randomly selected combinations from the specified parameter values, which shortens processing and training time, usually with little loss in the quality of the chosen model.
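The sampling idea can be sketched with the standard library alone. The grid and `n_iter` below are the ones from the Random Forest search in this notebook; the seed and the use of `random.sample` are assumptions made to keep the sketch self-contained, and it is a conceptual illustration rather than scikit-learn's implementation.

```python
import random
from itertools import product

# Grid taken from the random forest search in this notebook.
param_dist = {
    "n_estimators": [100],
    "max_depth": [15, 17, 19],
    "min_samples_leaf": [1, 2, 3],
}
n_iter = 5  # number of candidates to try, as in the notebook's search

names = list(param_dist)
all_candidates = [dict(zip(names, values)) for values in product(*param_dist.values())]

# Draw n_iter distinct candidates; the fixed seed is an assumption for reproducibility.
rng = random.Random(42)
sampled = rng.sample(all_candidates, k=n_iter)

print(f"grid size: {len(all_candidates)}, candidates evaluated: {len(sampled)}")
for candidate in sampled:
    print(candidate)
```

Only five of the nine possible combinations are fitted here, so the best combination in the full grid may be missed; that is the trade-off accepted in exchange for the shorter run time.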

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

After exploring multiple hyperparameters with RandomizedSearchCV, we found only a modest improvement: the MSE on the test dataset decreased from 0.232 to 0.212, and the r2 score rose by about one percentage point, from 91% to 92%.

### ML Model - 4 GradientBoostingRegressor

In [None]:
# Import the regressor
from sklearn.ensemble import GradientBoostingRegressor
  
# Create a regressor object
GBR = GradientBoostingRegressor(max_depth=4, 
                                n_estimators=500, 
                                learning_rate=0.1,
) 
  
# Fit the regressor with X and Y data
GBR.fit(X_train, y_train)

# Predict with the model
y_train_GBR_pred = GBR.predict(X_train)
y_test_GBR_pred = GBR.predict(X_test)


In [None]:
# Calculating Regression Metrics 
regression_metrics(y_train,y_train_GBR_pred,y_test,y_test_GBR_pred)

**1. Explain the ML Model used and its performance using Evaluation metric Score Chart.**

In [None]:
# Visualizing evaluation Metric Score chart
#plotting evaluation graph
fig, ax = plt.subplots(figsize=(18,10))
ax.plot(y_test_GBR_pred[:100], label='Predicted' ,color='skyblue')
ax.plot(np.array(y_test[:100]), label='Actual', color='orange')
ax.legend(fontsize=12)
plt.show()

### Cross-Validation & Hyperparameter Tuning

GradientBoostingRegressor with  RandomizedSearchCV

In [None]:
# Creating a GradientBoostingRegressor instance
Gbm = GradientBoostingRegressor()

# Defining parameters
parameters_GBM={"max_depth":[10],"learning_rate":[0.01,0.1], "n_estimators":[30,40,50]}

# Train the model
Gbm_R= RandomizedSearchCV(Gbm,parameters_GBM,scoring='neg_mean_squared_error',n_jobs=-1,cv=3,verbose=3)
Gbm_R.fit(X_train,y_train)

# Predict the output
y_train_GBR_rand_pred = Gbm_R.predict(X_train)
y_test_GBR_rand_pred = Gbm_R.predict(X_test)  

# Printing the best parameters obtained by RandomizedSearchCV
print(f"The best hyperparameters found were: {Gbm_R.best_params_}")
print(f"Negative mean square error is: {Gbm_R.best_score_}")

In [None]:
# Calculating Regression Metrics using RandomizedSearchCV in GradientBoostingRegressor
regression_metrics(y_train,y_train_GBR_rand_pred,y_test,y_test_GBR_rand_pred)

In [None]:
# Visualizing evaluation Metric Score chart
#plotting evaluation graph
fig, ax = plt.subplots(figsize=(18,10))
ax.plot(y_test_GBR_rand_pred[:100], label='Predicted' ,color='skyblue')
ax.plot(np.array(y_test[:100]), label='Actual', color='orange')
ax.legend(fontsize=12)
plt.show()

**Which hyperparameter optimization technique have you used and why?**



RandomizedSearchCV was again the preferred choice because it needs relatively little processing time while giving comparable accuracy, so we used it as the hyperparameter optimization method for this model as well.

**Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.**

Our GBM model was improved by experimenting with a number of parameters. The resulting r2 scores of 0.98 on the training dataset and 0.92 on the testing set suggest a well-fitted model with only a moderate gap between training and test performance. The parameter ranges considered were n_estimators in [500, 600], max_depth in [3, 4, 5], and learning_rate in [0.01, 0.1].

### XGBoost (Extreme Gradient Boosting)

In [None]:
import xgboost as xgb

# define hyperparameters
params = {
    'learning_rate': 0.1,
    'max_depth': 8,
    'n_estimators': 100
}

# create XGBoost regressor object
xgb_model = xgb.XGBRegressor(**params)

# train model on training data
xgb_model.fit(X_train, y_train)

# predict target variable for training and testing data
y_train_xgb_pred = xgb_model.predict(X_train)
y_test_xgb_pred = xgb_model.predict(X_test)


In [None]:
regression_metrics(y_train,y_train_xgb_pred,y_test,y_test_xgb_pred)

**1. Explain the ML Model used and its performance using Evaluation metric Score Chart.**

In [None]:
# Visualizing evaluation Metric Score chart
#plotting evaluation graph
fig, ax = plt.subplots(figsize=(18,10))
ax.plot(y_test_xgb_pred[:100], label='Predicted' ,color='skyblue')
ax.plot(np.array(y_test[:100]), label='Actual', color='orange')
ax.legend(fontsize=12)
plt.show()

XGBoost (eXtreme Gradient Boosting) is a gradient boosting method well known for its high accuracy, so we used it as our next model.

The model achieved improved R2 scores of 0.97 on the training dataset and 0.93 on the testing dataset, together with a lower MSE, so its predictions are more accurate than those of the earlier models. For a business, more accurate demand forecasts can support better decision-making, improved customer satisfaction, and cost savings. To improve the model further, we decided to tune its hyperparameters with GridSearchCV.

**2. Cross- Validation & Hyperparameter Tuning**

XGBOOST

In [None]:
from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor

# Create an instance of XGBRegressor
xgb_model_CV = XGBRegressor()

# Define the hyperparameters to tune
parameters = {
    "learning_rate": [0.01, 0.1],
    "max_depth": [5,8]
}

# Perform grid search with cross-validation
grid_search_CV = GridSearchCV(
    xgb_model_CV, 
    parameters, 
    scoring='neg_mean_squared_error', 
    n_jobs=-1, 
    cv=5, 
    verbose=3
)
grid_search_CV.fit(X_train, y_train)

# Make predictions on the training and testing data
y_train_XGBCV_pred = grid_search_CV.predict(X_train)
y_test_XGBCV_pred = grid_search_CV.predict(X_test)

# Print the best hyperparameters and negative mean squared error
print(f"Best hyperparameters found: {grid_search_CV.best_params_}")
print(f"Negative mean squared error: {grid_search_CV.best_score_}")


In [None]:
regression_metrics(y_train,y_train_XGBCV_pred,y_test,y_test_XGBCV_pred)

In [None]:
# Visualizing evaluation Metric Score chart
#plotting evaluation graph
fig, ax = plt.subplots(figsize=(18,10))
ax.plot(y_test_XGBCV_pred[:100], label='Predicted' ,color='blue')
ax.plot(np.array(y_test[:100]), label='Actual', color='orange')
ax.legend(fontsize=12)
plt.show()

**Which hyperparameter optimization technique have you used and why?**

GridSearchCV was used to tune XGBoost. Because XGBoost is a complex model and each fit is expensive, an exhaustive search takes a lot of processing time, so we kept the grid deliberately small. For a larger search space, RandomizedSearchCV would be a more practical hyperparameter optimization strategy.

**Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.**

Our XGBoost model has been tuned over a small set of parameters, resulting in an R2 score of 0.97 on the training dataset and 0.93 on the testing set. These scores suggest a reasonable balance between bias and variance, with only a modest gap between training and test performance. The best parameters found by the search were {'learning_rate': 0.1, 'max_depth': 8}, selected from 'max_depth': [5, 8] and 'learning_rate': [0.01, 0.1].


We also observed a reduction in the mean squared error (MSE), which reached a minimum of 0.177 on the test set, the lowest error among all models. The mean absolute error (MAE) decreased as well.

Moreover, we have found that increasing the max_depth of our decision trees causes our model to overfit the data, indicating that the best combination of hyperparameters has been achieved with the current set of values.
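The effect of depth on overfitting can be illustrated with a single decision tree standing in for the boosted trees. The data below is synthetic and the depths are arbitrary; both are assumptions for illustration and do not reproduce the notebook's results.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic noisy data (an assumption for illustration; not the bike dataset).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(600, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.4, size=600)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

gaps = {}
for depth in [2, 4, 8, 16]:
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    train_r2 = tree.score(X_tr, y_tr)
    test_r2 = tree.score(X_te, y_te)
    # A growing train/test gap is the usual symptom of overfitting.
    gaps[depth] = train_r2 - test_r2
    print(f"max_depth={depth:2d}  train R2={train_r2:.3f}  test R2={test_r2:.3f}  gap={gaps[depth]:.3f}")
```

Comparing the train and test scores side by side, as the `regression_metrics` outputs in this notebook do, is the simplest way to spot the depth at which the gap starts to widen.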

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

1)- MAE (Mean Absolute Error) is the average magnitude of the prediction errors, ignoring their direction. The lower the MAE, the closer the predictions are to the actual values on average, so reducing it has a direct and easily interpreted business benefit.

2)- MSE (Mean Squared Error) is the average of the squared differences between the predicted values and the actual values. Because the differences are squared, it penalizes larger errors more heavily than smaller ones. A lower MSE indicates that the model is better at predicting the outcomes.

3)- RMSE (Root Mean Squared Error) is the square root of the MSE. It is widely used in regression analysis because it expresses the error in the same units as the target variable, which makes the magnitude of the error easy to interpret and to compare with the scale of the data.

4)- The R2 score, also known as the coefficient of determination, measures the proportion of the variance in the dependent variable that is explained by the independent variables. It gives a quick indication of a model's goodness of fit and makes it easy to compare several models.

5)- Adjusted R2 is a modified version of R2 that takes into account the number of independent variables in the regression model. Because it penalizes added variables that do not improve the fit, it is especially useful when comparing models with different numbers of independent variables.
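The five metrics above can be computed directly from their definitions. The helper below is a minimal sketch: the function name `regression_scores` and the sample values are hypothetical, and it is not the `regression_metrics` helper used elsewhere in this notebook.

```python
import numpy as np

def regression_scores(y_true, y_pred, n_features):
    """Return MAE, MSE, RMSE, R2 and adjusted R2 for one set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    n = y_true.size

    mae = np.mean(np.abs(residuals))
    mse = np.mean(residuals ** 2)
    rmse = np.sqrt(mse)
    ss_res = np.sum(residuals ** 2)                   # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total variation
    r2 = 1.0 - ss_res / ss_tot
    # Adjusted R2 penalizes the number of features used by the model.
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - n_features - 1)
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2, "Adjusted R2": adj_r2}

# Hypothetical values for illustration only.
y_true = [3.0, 5.0, 7.0, 9.0, 11.0, 13.0]
y_pred = [2.5, 5.5, 7.0, 8.0, 11.5, 13.5]
scores = regression_scores(y_true, y_pred, n_features=2)
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

Note that adjusted R2 is always at most R2, and the difference shrinks as the number of samples grows relative to the number of features.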





### 2. Which ML model did you choose from the above created models as your final prediction model and why?



Comparing results of all the models

### **ML model summary for train dataset**

In [None]:
from prettytable import PrettyTable
train = PrettyTable(['SL NO',"MODEL_NAME", "Train MAE", "Train MSE","Train RMSE",'Train R^2','Train Adjusted R^2'])
train.add_row(['1','Linear Regression','0.451','0.431','0.656','0.8238815033172513','0.8180628391200631'])
train.add_row(['2','Ridge Regression','0.451','0.430','0.656','0.8241217917636909','0.8183110662998365'])
train.add_row(['3','Lasso Regression','0.452','0.430','0.656','0.8240981731870523','0.8182866674044416'])
train.add_row(['4','Elastic_Net','0.451','0.430','0.656','0.8241226210806791','0.818311923016088'])
train.add_row(['5','Decision Tree','0.375','0.280','0.529','0.8855926397615665','0.8818128095707982'])
train.add_row(['6','Random forest','0.161','0.059','0.243','0.9758434506469656','0.9750453581609657'])
train.add_row(['7','GradientBoostingRegressor','0.147','0.045','0.211','0.9817521187026687','0.9811381108904208'])
train.add_row(['8','XGboost','0.149','0.052','0.228','0.9788153777902806','0.9781154728677176'])
print(train)

### **ML model summary for test dataset**

In [None]:
from prettytable import PrettyTable
test = PrettyTable(['SL NO',"MODEL_NAME", "test MAE", "test MSE","test RMSE",'test R^2','test Adjusted R^2'])
test.add_row(['1','Linear Regression','0.439','0.411','0.641','0.8465076552197833','0.8414365217049207'])
test.add_row(['2','Ridge Regression','0.439','0.411','0.641','0.8466522668216554','0.8415859110352322'])
test.add_row(['3','Lasso Regression','0.440','0.411','0.641','0.8467750592305618','0.841712760302486'])
test.add_row(['4','Elastic_Net','0.439','0.411','0.641','0.8466691741131821','0.8416033769157415'])
test.add_row(['5','Decision Tree','0.401','0.361','0.601','0.8652143693343365','0.860761274751872'])
test.add_row(['6','Random forest','0.276','0.213','0.462','0.9203796422993189','0.917749117207143'])
test.add_row(['7','GradientBoostingRegressor','0.260','0.195','0.441','0.9273355088408269','0.9248904816884816'])
test.add_row(['8','XGboost','0.247','0.177','0.420','0.9340355547816962','0.9318561984794985'])
print(test)

Based on these metrics, the XGBoost model has the best performance among the models we tested, with the lowest test RMSE and the highest test R^2 and test adjusted R^2.


We have chosen XGBoost as our final prediction model, with hyperparameters {'learning_rate': 0.1, 'max_depth': 8}, because the tables above show that it gives the lowest test RMSE and the highest test R^2 (0.93) and test adjusted R^2 among all models.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

XGBoost, which stands for eXtreme Gradient Boosting, is an efficient implementation of the gradient boosting framework. It supports both linear and tree-based base learners and scales well to large datasets. The core idea is to train a sequence of simple models, typically shallow decision trees, and combine their predictions into a more robust and accurate overall model: each new tree is trained to correct the errors of the trees that came before it.

Gradient boosting fits each new tree to the gradient of the loss function with respect to the current predictions, so that adding the tree reduces the overall error of the model. XGBoost adds several other features, including regularization, which helps to prevent overfitting by constraining the complexity of the model, and parallel processing, which shortens training by splitting the computations across multiple processors.
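The boosting loop described above can be sketched in a few lines. This is plain gradient boosting with squared-error loss built on scikit-learn trees, not XGBoost's actual algorithm (which also uses second-order information and regularization); the data, learning rate, tree depth, and number of rounds are assumptions for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration; not the bike dataset).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = X[:, 0] ** 2 + rng.normal(scale=0.3, size=300)

learning_rate = 0.1
n_rounds = 50

# Start from a constant prediction: the mean of the target.
prediction = np.full_like(y, y.mean())
trees = []
mse_history = [np.mean((y - prediction) ** 2)]

for _ in range(n_rounds):
    # For squared-error loss the negative gradient is, up to a constant, the residual.
    residuals = y - prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residuals)
    # Add a shrunken copy of the new tree's output to the ensemble.
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)
    mse_history.append(np.mean((y - prediction) ** 2))

print(f"training MSE: {mse_history[0]:.3f} -> {mse_history[-1]:.3f}")
```

A smaller learning rate makes each tree's contribution smaller, so more rounds are needed, which is the trade-off behind the `learning_rate` and `n_estimators` values tuned in this notebook.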

### **Model Explainability**

In [None]:
# Get feature importances from XGBoost model
importances = xgb_model.feature_importances_

# Create a dictionary mapping feature names to importances
feature_dict = dict(zip(df.columns, importances))

# Sort the features by importance in ascending order, so the most important feature is drawn at the top of the horizontal bar chart
sorted_features = sorted(feature_dict.items(), key=lambda x: x[1])

# Extract the sorted feature names and importances as separate lists
sorted_feature_names = [feat[0] for feat in sorted_features]
sorted_importances = [feat[1] for feat in sorted_features]

# Plot the feature importances as a horizontal bar chart
plt.figure(figsize=(10,20))

plt.title('Feature Importance', fontsize=25)
plt.barh(sorted_feature_names, sorted_importances, color='red', align='center')
plt.xlabel('Relative Importance')
plt.ylabel('Features', fontsize=20) 

The above plot shows the feature importances reported by the XGBoost model itself (`feature_importances_`); these are not Shapley values.

While XGBoost has been chosen as the optimal model due to its impressive accuracy, it is often referred to as a black box model, as it lacks transparency in revealing the inner workings of the algorithm. In order to increase stakeholder confidence and trust, it is crucial to provide meaningful explanations for the model's predictions, underlining the conditions that lead to the final outcomes. To enhance model explainability, we have generated a descending bar plot of feature importance, effectively shedding light on the most impactful features and their contribution towards the predictions.

While linear regression gave us an r2 score of about 0.85, it became necessary to employ more advanced models such as decision tree, random forest, and XGBoost to improve accuracy further. However, the ensemble models in particular are considered black box models because they are difficult to explain. To overcome this limitation and improve model explainability, we used the SHAP (SHapley Additive exPlanations) model explainability tool.

### **SHAP** **(SHapley Additive exPlanations)**

In [None]:
!pip install shap

In [None]:
from sklearn.tree import export_graphviz
import shap 
sns.set_style('darkgrid')

In [None]:
feature = df.columns[:-1]

In [None]:
feature

In [None]:
X_test[0:1]

In [None]:
# Initialize the JavaScript visualizations in the notebook environment
shap.initjs()

# Create a TreeExplainer object for the fitted XGBoost model (xgb_model, the model trained before the grid search)
explainer = shap.TreeExplainer(xgb_model)

# Calculate SHAP values for the first row of the test data
shap_values = explainer.shap_values(X_test[0:1])

# Plot the SHAP force plot for the first row's explanation
shap.force_plot(explainer.expected_value,features = feature, shap_values=shap_values[0])


The SHAP values show how much each feature contributes to increasing or decreasing the final prediction of the dependent variable, relative to the model's expected value.

Note that these Shapley values are valid for this observation only. For other data points the SHAP values will change.
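The additive property behind a force plot can be checked by hand on a linear model, where SHAP values have a simple closed form when features are treated as independent. The model weights and data below are assumptions for illustration; the sketch does not use the `shap` library or the XGBoost model from this notebook.

```python
import numpy as np

# Synthetic data and a hand-set linear model (assumptions for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
weights = np.array([2.0, -1.0, 0.5])
bias = 4.0

def predict(data):
    return data @ weights + bias

# Base value: the average prediction over the background data.
base_value = predict(X).mean()

# For a linear model with features treated as independent, the SHAP value of
# feature i at point x is weight_i * (x_i - mean of feature i).
x = X[0]
shap_values = weights * (x - X.mean(axis=0))

# Additivity: base value plus all contributions recovers the prediction.
reconstructed = base_value + shap_values.sum()
print("SHAP values:", np.round(shap_values, 3))
print("base + sum :", round(float(reconstructed), 6))
print("prediction :", round(float(predict(x)), 6))
```

Changing `x` to another row gives different SHAP values but the same base value, which is why each force plot in this section explains one observation only.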

In [None]:
y_test[0:1]

In [None]:
# second sample test
shap_values = explainer.shap_values(X_test[1:2])   

In [None]:
shap_values

In [None]:
# begin the JavaScript visualisation in the notebook environment.
shap.initjs() 
shap.force_plot(explainer.expected_value, shap_values=shap_values[0], features = feature)

In [None]:
# third sample test
shap_values = explainer.shap_values(X_test[2:3])

In [None]:
#begin the JavaScript visualisation in the notebook environment.
shap.initjs()
shap.force_plot(explainer.expected_value, shap_values=shap_values[0], features = feature)
     

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


### 2. Again Load the saved model file and try to predict unseen data for a sanity check.
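A minimal sketch of the save, reload, and sanity-check cycle with joblib is shown below. The stand-in model, the synthetic data, and the file name `bike_demand_model.joblib` are all assumptions for illustration; in this notebook the object to save would be the fitted XGBoost model.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Stand-in model and data (assumptions); in the notebook this would be the
# fitted XGBoost model and the real train/test split.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 4))
y_demo = X_demo[:, 0] * 2.0 + rng.normal(scale=0.1, size=100)
model = RandomForestRegressor(n_estimators=20, random_state=0).fit(X_demo, y_demo)

# Hypothetical file name, written to a temporary directory.
model_path = os.path.join(tempfile.mkdtemp(), "bike_demand_model.joblib")
joblib.dump(model, model_path)

# Reload and check that the restored model reproduces the original predictions.
restored = joblib.load(model_path)
X_unseen = rng.normal(size=(5, 4))
same = np.allclose(model.predict(X_unseen), restored.predict(X_unseen))
print("predictions match after reload:", same)
```

Any preprocessing applied before training (scaling, encoding, the target transformation) has to be saved and reapplied in the same way for predictions on genuinely unseen data to be meaningful.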


In [None]:
# Load the File and predict unseen data.

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

### **Conclusions drawn from EDA**

Exploratory Data Analysis (EDA) is an initial and essential step in data analysis, in which we explore the dataset to gain insights, find patterns, and detect anomalies. It helps us to identify relationships and correlations between the variables and to spot potential problems such as missing values or outliers. A solid understanding of the data lays the foundation for sound modelling decisions.

In this project, we began by loading the dataset and then conducted an Exploratory Data Analysis (EDA) to examine all of its features. We then focused on our dependent variable, "Rented Bike Count," which we transformed and treated for null values. After that, we performed feature selection and categorical column encoding. We also analyzed the numeric variables and identified highly correlated variables, which we subsequently dropped. Additionally, we conducted one-hot encoding and built the model. Finally, we extracted statistical information that proved to be useful for business purposes.

 

*  Linear regression generally performs better when the dependent variable is approximately normally distributed, and the distribution plot shows that the dependent variable is moderately right skewed. We therefore transformed the dependent variable to bring its distribution closer to normal.
* The boxplot shows outliers in the rented bike count data. After applying the square root transformation and removing the outliers, the distribution was approximately normal.



*  We obtain an almost normal distribution after applying the square root to the skewed Rented Bike Count, so we use the square root transformation during modelling.

* The distribution plots show approximately normally distributed attributes (temperature, hour, humidity), positively skewed attributes (wind, rented bike count, solar_radiation, snowfall, rainfall), and one negatively skewed attribute (visibility).



*  Hour, Temperature, Wind Speed, Visibility, and Solar Radiation are all positively correlated with the dependent variable, which implies that the number of rented bikes rises as these features do. "Rainfall," "Snowfall," and "Humidity" have a negative relationship with the dependent variable, suggesting that the number of rented bikes falls as these features rise.
*  Our dataset primarily includes data from the year 2018 and only a small amount from the year 2017.



 

*   We found that the average rented bike count is highest during the summer and lowest during the winter.
*  The winter months of December, January, and February have the lowest demand for rented bikes, while the summer months of May, June, and July have the highest.


*  We observed that the rented bike count is broadly consistent across all days.
*  The chart of the 'holiday' column shows that the large majority of records, 95.1% or 8,328 records, fall on non-holiday days, while holidays account for only 4.9% or 432 records of the rented bike count data. This highlights the importance of considering external factors such as holidays when analyzing the data, as they can affect the trends and patterns observed.


*   People favour rented bikes during rush hour, as shown by the surge in rented bikes from 8:00 a.m. to 9:00 p.m. Demand increases most at 8:00 a.m. and 6:00 p.m., which correspond to business opening and closing times.

*  We observed that 96.6% of the dataset consists of functioning days, while the remaining 3.4% are non-functioning days. Demand for bikes is therefore concentrated on functioning days.

*  We can observe a decline in demand for rented bikes when snow falls.
*   There is a sharp increase in the use of rented bikes between 8:00 a.m. and 9:00 p.m., suggesting that individuals prefer to use rented bikes during peak hours, perhaps for their commute to work. In addition, the demand for rental bikes is greater on non-holiday days than on holidays.




* By analyzing the Graph, it can be concluded that the average number of rented bikes remains relatively stable from Monday to Saturday. However, there is a noticeable dip in bike rentals on Sundays, and on average, the number of rented bikes is significantly lower on weekends than on weekdays.

* The analysis of the graph reveals that the demand for rented bikes is lower during the winter months, specifically December, January, and February, in contrast to the summer months of May, June, and July, which exhibit the highest demand. Additionally, the graph indicates a significant surge in rented bike usage between 8:00 a.m. and 9:00 p.m., highlighting the preference of individuals to rent bikes during peak hours, likely for their daily commute to work.





*   The temperature(°C) and dew point temperature(°C) columns are strongly correlated with each other, which indicates multicollinearity.
*   The graph shows that many pairs of variables have weak linear correlations and that the data points are not linearly separable. We can conclude that both positive and negative trends are present in the relationships between the columns.
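The effect of the square root transformation mentioned in the list above can be sketched on synthetic data. The gamma-distributed "counts" and the hand-written `skewness` helper are assumptions for illustration and do not reproduce the actual Rented Bike Count distribution.

```python
import numpy as np

def skewness(values):
    """Sample skewness: the mean cubed z-score."""
    values = np.asarray(values, dtype=float)
    centered = values - values.mean()
    return np.mean(centered ** 3) / np.std(values) ** 3

# Synthetic right-skewed "counts" (an assumption; not the real bike data).
rng = np.random.default_rng(0)
counts = rng.gamma(shape=1.5, scale=400.0, size=5000)

sqrt_counts = np.sqrt(counts)
skew_raw = skewness(counts)
skew_sqrt = skewness(sqrt_counts)
print(f"skewness before: {skew_raw:.2f}, after sqrt: {skew_sqrt:.2f}")

# A model trained on the square-root scale predicts on that scale, so
# predictions are squared to get back to bike counts.
restored = sqrt_counts ** 2
```

If the target is transformed before training, error metrics computed on the transformed scale are not directly comparable with errors measured in bike counts.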


### **Conclusions drawn from ML Model Implementation**

The success of any business heavily relies on the accuracy of its machine learning models. Therefore, it's crucial to thoroughly evaluate the model's performance and predictions before deploying it in the real world. This evaluation process helps to identify the model's strengths and weaknesses, ensuring that it's fully prepared and ready for deployment.

We'll discuss some essential factors that apply to all ML models and then turn to the project-specific conclusions we've drawn. By doing so, we'll gain a better understanding of the model's overall performance and its impact on the business's growth.

The findings and insights gained from evaluating our ML models, described below, will help us make informed decisions on the next steps in the deployment process.

### **Model conclusion**

We have experimented with several regression models, beginning with Linear Regression and moving on to other non-linear models. With each model, we've carefully tuned the hyperparameters to minimize errors and ensure optimal performance.

By systematically testing and evaluating these models, we've gained a deep understanding of their capabilities and limitations, providing us with valuable insights into how we can optimize our ML strategy to drive business growth.



*  Even after applying regularisation techniques, the linear regression models reach an r2 of only about 84%, indicating that the relationship between the features and the target variable is not entirely linear.



*  With the Decision Tree model we achieved an R-squared value of about 86% and a test Mean Squared Error (MSE) of 0.361 with a maximum depth of 10. When we increased the depth beyond this point, the model began to overfit the data.



*  We discovered that prioritizing individual variables when building our model did not always lead to the best accuracy. Instead, using ensemble techniques like Random Forest and optimizing hyperparameters allowed us to create a more robust and accurate model. By doing so, we achieved an impressive R-squared value of 92% with 100 trees in the forest.
*   The GradientBoostingRegressor achieved a test R-squared value of about 93%, slightly above Random Forest. Running the randomized search with n_jobs=-1 spread the cross-validation fits across all available cores, which reduced the tuning time and makes this setup useful for projects that need faster results.



* After extensive experimentation, we settled on XGBoost as our final model. This decision was driven by its results on the test set, an R-squared value of about 93% (97% on the training set) and a mean squared error (MSE) of 0.177, the best among all models. These results suggest that the model is accurate and reliable enough to be a valuable tool for predicting bike demand.



### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***