<a href="https://colab.research.google.com/github/deepakbharti21/Hotel-booking/blob/main/final_Bike_Sharing_Demand_Prediction_Capstone.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Bike_Sharing_Demand_Prediction_Capstone**



##### **Project Type**    - EDA/Regression/Classification/Unsupervised
##### **Contribution**    - Individual/Team

# **Project Summary -**

*  This project aims to enhance the mobility and convenience of the public through bike-sharing programs in metropolitan areas. 

*  One of the main challenges is maintaining a consistent supply of bikes for rental. 

*  Bike-sharing systems are automated and enable people to rent and return bikes at various locations. The project focuses on utilizing historical data on factors such as temperature and time to predict the demand for the bike-sharing program in Seoul.

*  The dataset contains 8760 records and 14 attributes. We started by importing the necessary libraries and the dataset, then conducted exploratory data analysis (EDA).

*  Outliers and null values in the raw data were treated, and the data were transformed to be compatible with machine learning models.

*  We reduced the skew of the target variable using a square-root transformation.

*  Finally, the cleaned and scaled data was passed to 11 different models. We computed evaluation metrics for each model and tuned the hyperparameters to find suitable settings.

*  When developing a machine learning model, it is generally recommended to track multiple metrics because each one highlights distinct aspects of model performance. We focus mainly on the R2 score and RMSE.

*  The R2 score is scale-independent, which means that it can be used to compare models that are fit to different target variables or to target variables that have different units of measurement. 

*  This is particularly useful when comparing models for different problems, as it allows for a direct comparison of the performance of the models, regardless of the scale of the target variable.
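
The scale argument above can be illustrated with a minimal sketch. The helper names `rmse` and `r2` and the numbers are hypothetical and not taken from the notebook's models; the point is only that rescaling the target changes RMSE while leaving R2 unchanged.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error, expressed in the units of the target."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Hypothetical hourly counts and predictions (illustrative values only)
y_true = np.array([100.0, 250.0, 400.0, 800.0])
y_pred = np.array([120.0, 230.0, 420.0, 760.0])

# Express the same data in different units (e.g. hundreds of bikes)
scale = 0.01
print("RMSE:", rmse(y_true, y_pred), "->", rmse(y_true * scale, y_pred * scale))
print("R2:  ", r2(y_true, y_pred), "->", r2(y_true * scale, y_pred * scale))
```

RMSE shrinks by the same factor as the target, so it is only comparable between models fit to the same target, whereas R2 can be compared across scales.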

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


Many metropolitan areas now offer bike rentals to improve mobility and convenience. Ensuring timely access to rental bikes is critical to reducing wait times for the public, making a consistent supply of rental bikes a major concern. The expected hourly bicycle count is particularly crucial in this regard.

Bike sharing systems automate membership, rentals, and bike returns through a network of locations. Individuals can rent bikes from one location and return them to another or the same location, as needed. Membership or request facilitates bike rentals, and the process is overseen by a citywide network of automated stores.

This dataset aims to predict the demand for Seoul's Bike Sharing Program based on historical usage patterns, including temperature, time, and other data.

#### **Define Your Business Objective?**


Estimating the number of bikes required at any given time and day is a critical business concern. Having more bikes than needed results in resource wastage (in terms of bike maintenance and the land/bike stands required for parking and security), while having fewer bikes than needed can lead to revenue loss, ranging from immediate loss due to unserved customers to potential long-term loss of future customers. Therefore, it is essential for bike rental businesses to have an estimate of demand to function effectively.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required. 
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits. 
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many number of charts you want. Make Sure for each and every chart the following format should be answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule. 

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import libraries and modules

# libraries that are used for analysis and visualization
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

# to import datetime library
from datetime import datetime
import datetime as dt

# libraries used to pre-process 
from sklearn import preprocessing, linear_model
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split


# libraries used to implement models
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor, GradientBoostingRegressor, AdaBoostRegressor
from sklearn.svm import SVR
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# libraries to evaluate performance
from sklearn import metrics
from sklearn.metrics import mean_squared_error, r2_score, accuracy_score, mean_absolute_error

# Library of warnings would assist in ignoring warnings issued
import warnings
warnings.filterwarnings('ignore')

# to set max column display
pd.set_option('display.max_columns',None)


### Dataset Loading

In [None]:
# Load Dataset
# let's mount the google drive first
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Load Dataset
bike_df = pd.read_csv('/content/drive/MyDrive/AlmaBetter/Module 3/SeoulBikeData.csv',encoding= 'unicode_escape') # data file from Drive


### Dataset First View

In [None]:
# Dataset First Look
bike_df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
bike_df.shape

### Dataset Information

In [None]:
# Dataset Info
print(f'number of rows : {bike_df.shape[0]}  \nnumber of columns : {bike_df.shape[1]}')

In [None]:
bike_df.describe()


In [None]:
bike_df.columns


In [None]:
df1 = bike_df.copy()

In [None]:
df1['Date'].unique()

In [None]:
df1['Rented Bike Count'].unique()


In [None]:
df1['Hour'].unique()


In [None]:
df1['Solar Radiation (MJ/m2)'].unique()


In [None]:
df1['Dew point temperature(°C)'].unique()


In [None]:
df1['Visibility (10m)'].unique()


In [None]:
df1['Seasons'].unique()


In [None]:
df1['Wind speed (m/s)'].unique()    # This column includes 0 values


#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
df1[df1.duplicated()].shape                # Shape of the duplicate rows; the first element is the number of duplicates

In [None]:
bike_df.drop_duplicates(inplace = True)

unique_rows = bike_df.shape[0]

unique_rows 

In [None]:
df1.drop_duplicates(inplace = True)        # Dropping duplicate values

In [None]:
df1.shape

The number of duplicate values in the data set is =  0

We found that there is no duplicate entry in the above data.

#### Missing Values/Null Values

Real-world data often contains numerous missing values, which can be due to data corruption or other factors. As many machine-learning algorithms do not support missing values, it is necessary to handle them during the dataset pre-processing stage. Thus, the first step in dealing with missing data is to identify the missing values.
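
As a minimal sketch of the identification step described above, the toy frame below (hypothetical values and column names, not the bike dataset) shows how `isnull()` can be used to count missing entries per column and to pull out the affected rows.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values; not taken from the bike dataset
toy = pd.DataFrame({
    "temperature": [1.5, np.nan, -3.0, 4.2],
    "humidity": [40, 55, np.nan, np.nan],
})

missing_per_column = toy.isnull().sum()             # number of NaN entries per column
rows_with_missing = toy[toy.isnull().any(axis=1)]   # rows with at least one NaN

print(missing_per_column)
print(rows_with_missing)
```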

In [None]:
# Missing Values/Null Values Count
bike_df.isnull().sum()


The above result shows that there are no null values in the data.

In [None]:
# Visualizing the missing values
null_value = bike_df.isnull()             # boolean mask of missing entries
bike_df.fillna(np.nan, inplace = True)    # no-op on clean data; np.nan is used because the np.NaN alias was removed in NumPy 2.0

bike_df                                   # any missing entries are shown as NaN

In [None]:
bike_df.isnull().sum().sort_values(ascending = False)[:6]

In [None]:
df1.isnull().sum().sort_values(ascending = False)[:6] # Columns having missing values.


In [None]:
miss_values =bike_df.isnull().sum().sort_values(ascending=False)
miss_values # We have to check the count of null value in individual columns

In [None]:
type(bike_df['Date'][0])    ## data_type


In [None]:
bike_df['Date'] = pd.to_datetime(bike_df['Date'], dayfirst=True)  # assumes the raw file stores dates as DD/MM/YYYY

In [None]:
# all the seasons present in data

bike_df['Seasons'].unique()


In [None]:
# creating a column containing the year from a particular date

bike_df['year'] = bike_df['Date'].dt.year

In [None]:
# creating a column containing the month number from a particular date

bike_df['month'] = bike_df['Date'].dt.month

In [None]:
bike_df1 = bike_df.groupby('Seasons').sum(numeric_only=True)  # per-season totals of the numeric columns

In [None]:
bike_df1

In [None]:
import missingno as msno
msno.bar(bike_df, color='green',sort='ascending', figsize=(8,3), fontsize=15)


In [None]:
# Visualizing the missing values using Heatmap
plt.figure(figsize=(12,4))
sns.heatmap(bike_df.isna(), cmap = 'coolwarm')

From the above command, we noticed that every column has 0 null values. This seems to be clean data and there is no missing data in any of the rows and columns.

### What did you know about your dataset?

The dataset provided contains 14 columns and 8760 rows and does not have any missing or duplicate values.

The goal is to predict the demand for bike-sharing using this dataset, which is sourced from the bike-sharing services market. Demand prediction involves analytical studies on the probability of a customer using bike-sharing services, with the aim of understanding and managing demand and supply equilibrium throughout the day.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
bike_df.columns

In [None]:
# Dataset Describe
bike_df.describe(include='all').T


### Variables Description 

The dataset contains weather information (Temperature, Humidity, Windspeed, Visibility, Dewpoint, Solar radiation, Snowfall, Rainfall), the number of bikes rented per hour and date information.

**Attribute Information:**

**Date** :year-month-day

**Rented Bike count** :Count of bikes rented at each hour

**Hour** :Hour of the day

**Temperature** :Temperature in Celsius

**Humidity** :%

**Windspeed** :m/s

**Visibility** :10m

**Dew point temperature** :Celsius

**Solar radiation** :MJ/m2

**Rainfall** :mm

**Snowfall** :cm

**Seasons** :Winter, Spring, Summer, Autumn

**Holiday** :Holiday/No holiday

**Functional Day** :NoFunc(Non Functional Hours), Fun(Functional hours)

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
bike_df.nunique()


In [None]:
for i in bike_df.columns.tolist():
  print("No. of unique values in ",i,"is",bike_df[i].nunique())                   # number of unique values in each individual column
  


In [None]:
bike_df.describe(include='all').loc['unique', :]  # a second way to show unique-value counts; populated only for non-numeric columns

**Observations:**

*  We are focusing on several key columns of our dataset, including 'Hour', 'Holiday', 'Functioning Day', 'Rented Bike Count', 'Temperature(°C)', and 'Seasons', as they contain a wealth of information.

*  By utilizing these features, we plan to create a regression model and implement various regression algorithms.

*  The 'Hour' column could be treated as either a categorical or a numerical feature. We will try both and compare the results.
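
The two treatments of the hour mentioned above can be sketched as follows. This is an illustration on a toy frame, assuming one-hot encoding via `pd.get_dummies` for the categorical view; the notebook's actual encoding may differ.

```python
import pandas as pd

# Toy frame with a few hypothetical hours of the day
toy = pd.DataFrame({"hour": [0, 8, 18, 23]})

# Option 1: keep hour as a single numeric column
numeric_view = toy[["hour"]]

# Option 2: treat hour as categorical, one indicator column per observed hour
categorical_view = pd.get_dummies(toy["hour"].astype("category"), prefix="hour")

print(numeric_view.shape)       # one column
print(categorical_view.shape)   # one column per distinct hour
```

The numeric view keeps the ordering of hours but imposes a linear relationship; the categorical view lets a model learn a separate effect for each hour at the cost of more columns.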

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
miss_values[:4]

In [None]:
percentage_company_null_values = miss_values.iloc[0] / unique_rows*100   # share of nulls in the column with the most missing values
percentage_company_null_values

**Creating Some New Features**

In [None]:
# Renaming complex columns name for the sake of simplicity    **(Not a necessary step to do)**
bike_df=bike_df.rename(columns={'Rented Bike Count':'rented_bike_count',
                                'Date':'date',
                                'Hour':'hour',
                                'Seasons':'seasons',
                                'Holiday':'holiday',
                                'Temperature(°C)':'temperature',
                                'Humidity(%)':'humidity',
                                'Wind speed (m/s)':'wind_speed',
                                'Visibility (10m)':'visibility',
                                'Dew point temperature(°C)':'dew_point_temperature',
                                'Solar Radiation (MJ/m2)':'solar_radiation',
                                'Rainfall(mm)':'rainfall',
                                'Snowfall (cm)':'snowfall',
                                'Functioning Day':'functioning_day'})

In [None]:
# Splitting Date into year, month, day & day_name
bike_df.date = pd.to_datetime(bike_df.date)

bike_df['day'] = bike_df['date'].dt.day
bike_df['month'] = bike_df['date'].dt.month
bike_df['year'] = bike_df['date'].dt.year
bike_df['weekday'] = bike_df['date'].dt.day_name()

# dropping Date column
bike_df.drop('date', axis=1, inplace=True)

In [None]:
bike_df.hour.unique()


The hours of the day follow a clear sequence: 9 am is as close to 10 am as it is to 8 am, and farther from 6 pm. This feature can therefore be classified as a discrete ordinal variable. We will treat the hour both as a categorical value and as a numerical value to see whether the results differ.

In [None]:
def session(x):
   
    ''' 
    For exploratory data analysis (EDA) purposes, the "Hour" column can be converted into categorical variables
    such as "Morning", "Noon", and "Night", without altering the existing label encoding format of the "Hour" column. 
    This conversion is not necessary for model training.
    '''
    
    if x>4 and x<=8:
        return 'Early Morning'
    elif x>8 and x<=12:
        return 'Morning'
    elif x>12 and x<=16:
        return 'Afternoon'
    elif x>16 and x<=20:
        return 'Evening'
    elif x>20 and x<=24:
        return 'Night'
    elif x<=4:
        return 'Late Night'

# apply function to make new category column
bike_df['session'] = bike_df['hour'].apply(session)

## ***4. Data Vizualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
def get_count_from_column_bar(bike_df, column_label):
  df_grpd = bike_df[column_label].value_counts()
  df_grpd = pd.DataFrame({'index':df_grpd.index, 'count':df_grpd.values})
  return df_grpd


def plot_bar_chart_from_column(bike_df, column_label, t1):
  df_grpd = get_count_from_column_bar(bike_df, column_label)
  fig, ax = plt.subplots(figsize=(14, 6))
  c= ['g','r','b','c','y']
  ax.bar(df_grpd['index'], df_grpd['count'], width = 0.4, align = 'edge', edgecolor = 'black', linewidth = 4, color = c, linestyle = ':', alpha = 0.5)
  plt.title(t1, bbox={'facecolor':'0.8', 'pad':3})
  plt.legend()
  plt.ylabel('Count')
  plt.xticks(rotation = 15) # use to format the lable of x-axis
  plt.xlabel(column_label)
  plt.show()

In [None]:
def get_count_from_column(bike_df, column_label):
  df_grpd = bike_df[column_label].value_counts()
  df_grpd = pd.DataFrame({'index':df_grpd.index, 'count':df_grpd.values})
  return df_grpd

# plot a pie chart from grouped data
def plot_pie_chart_from_column(bike_df, column_label, t1, exp):
  df_grpd = get_count_from_column(bike_df, column_label)
  fig, ax = plt.subplots(figsize=(15,10))
  ax.pie(df_grpd.loc[:, 'count'], labels=df_grpd.loc[:, 'index'], autopct='%1.2f%%',startangle=90,shadow=True, labeldistance = 1, explode = exp)
  plt.title(t1, bbox={'facecolor':'0.8', 'pad':3})
  ax.axis('equal')
  plt.legend()
  plt.show()  

In [None]:
bike_df1['Rented Bike Count'].plot(kind='pie', subplots=True, figsize=(8, 8))


In [None]:
fig,ax = plt.subplots(1,3, figsize=(15,4))

# Univariate analysis
count = sns.countplot(data=bike_df, x='seasons', ax=ax[0])
count.set_title('Count Plot of Seasons')

# adding value count on the top of bar
for p in count.patches:
    count.annotate(format(p.get_height(), '.0f'), (p.get_x(), p.get_height()))

# Bi-variate analysis 
# Rented Bike Count Vs Seasons
bar = sns.barplot(data=bike_df, x='seasons', y='rented_bike_count', ax = ax[1])
bar.set(xlabel='Seasons', ylabel='Rented Bike Count', title='Rented Bike Count Vs Seasons')

# Multi-variate analysis
cat = sns.barplot(data=bike_df, x='seasons', y='rented_bike_count', hue='holiday', ax= ax[2])
cat.set(xlabel='Seasons', ylabel = ' Rented Bike Count', title='Average Bike Rented Vs Seasons with holiday status')

plt.tight_layout()
plt.show()
     

##### 1. Why did you pick the specific chart?

Multiple charts are used here: a pie chart and bar charts, each representing a different view of the data.

Bar charts make it easy to compare counts and averages across categories such as seasons.

A pie chart shows each category's share of a whole at a set point in time.

##### 2. What is/are the insight(s) found from the chart?

The dataset has 4 seasons, and every season has more than 2000 records. Most bikes were rented in summer and the fewest in winter. Autumn and spring have almost equal rental counts.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Summer is the most preferred season for renting bikes and winter the least, which suggests that people prefer to rent bikes in warm weather.

In every season, rented_bike_count is higher on non-holidays than on holidays.

#### Chart - 2 humidity

In [None]:
# Chart - 2 visualization code
fig,ax = plt.subplots(1,3, figsize=(15,4))

# Univariate analysis
dist = sns.histplot(bike_df.humidity, kde=True, stat='density', ax = ax[0])  # histplot replaces the deprecated distplot
dist.set_title('Distribution Plot of Humidity')

# Bi-variate analysis 
# Rented Bike Count Vs Humidity
scatter = sns.scatterplot(data=bike_df, x='humidity', y='rented_bike_count', ax = ax[1])
scatter.set(xlabel='humidity', ylabel='rented_bike_count', title='Rented Bike Count Vs Humidity')

# Line Plot
group_humidity = bike_df.groupby(['humidity'])['rented_bike_count'].mean().reset_index()
line = sns.lineplot(data=group_humidity, x='humidity', y = 'rented_bike_count', ax= ax[2])
line.set(xlabel='Humidity', ylabel = ' Rented Bike Count', title='Average Bike Rented Vs Humidity')

plt.tight_layout()
plt.show()

In [None]:
bike_df.columns

##### 1. Why did you pick the specific chart?

These charts compare the count of rented bikes according to humidity.

##### 2. What is/are the insight(s) found from the chart?

We can see from the plots above that the average number of bikes rented fluctuates sharply, peaking at a humidity of around 50%. Most rental demand falls in the 20-90% humidity range.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Humidity: With increasing humidity, we see a decrease in the bike rental count.

#### Chart - 3 Outliers 

In [None]:
# Chart - 3 visualization code


plt.figure(figsize=(20,25))
for index,item in enumerate([i for i in bike_df.describe().columns.to_list() if i not in ['rainfall','snowfall']]):  ## box plots of the numeric columns to look for outliers (renamed rainfall/snowfall excluded)

  plt.subplot(9,8,index+1)
  sns.boxplot(x=bike_df[item])

In [None]:
# listing features to remove outliers

features = list(bike_df.columns)
features = features[2:]      # skip the target and 'hour'
# categorical and date-derived columns, using the renamed (lower-case) labels
list_0 = ['hour','seasons','holiday','functioning_day','day','month','year','weekday','session']
new_features = [x for x in features if x not in list_0]

# inter-quartile range of the features to be treated
Q1 = bike_df[new_features].quantile(0.25)
Q3 = bike_df[new_features].quantile(0.75)
IQR = Q3 - Q1
print(IQR)

In [None]:
new_features

In [None]:
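# removing outliers: a row with any value outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
# gets NaN in all of new_features (the row itself is kept)
lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR
is_outlier_row = ((bike_df[new_features] < lower) | (bike_df[new_features] > upper)).any(axis=1)
bike_df[new_features] = bike_df[new_features][~is_outlier_row]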

In [None]:
bike_df.info()

In [None]:
numerical_features = []
categorical_features = []

# splitting features into numeric and categoric.
'''
If feature has more than 35 categories we will consider it
as numerical_features, remaining features will be added to categorical_features.
'''
for col in bike_df.columns:  
  if bike_df[col].nunique() > 35:
    numerical_features.append(col) 
  else:
    categorical_features.append(col) 

print(f'Numerical Features : {numerical_features}')
print(f'Categorical Features : {categorical_features}')

In [None]:
plt.figure(figsize=(15,5))
# title
plt.suptitle('Outlier Analysis of Numerical Features', fontsize=20, fontweight='bold', y=1.02)

for i,col in enumerate(numerical_features):
  plt.subplot(3, 4, i+1)            # grid of 3 rows and 4 columns

  # boxplot
  sns.boxplot(x=bike_df[col])
  # x-axis label
  plt.xlabel(col)
  plt.tight_layout()
     

In [None]:
# checking for distribution after treating outliers.

# figsize
plt.figure(figsize=(15,6))
# title
plt.suptitle('Data Distibution of Numerical Features', fontsize=20, fontweight='bold', y=1.02)

for i,col in enumerate(numerical_features):
  plt.subplot(4, 3, i+1)                       # grid of 4 rows and 3 columns

  # distribution plot (histplot replaces the deprecated distplot)
  sns.histplot(bike_df[col], kde=True, stat='density')
  # x-axis label
  plt.xlabel(col)
  plt.tight_layout()

##### 1. Why did you pick the specific chart?

Box plots are used to look for outliers. A box plot shows the distribution of numeric values and is especially useful for comparing distributions between multiple groups.
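
The rule behind both the box-plot whiskers and the outlier filter used in this notebook is the 1.5 × IQR fence. The sketch below applies it to a small hypothetical array (not the bike data) to show which points would be flagged.

```python
import numpy as np

# Hypothetical measurements with one extreme value
values = np.array([1.0, 2.0, 2.0, 3.0, 3.0, 3.0, 4.0, 4.0, 5.0, 30.0])

q1, q3 = np.percentile(values, [25, 75])   # first and third quartiles
iqr = q3 - q1                              # inter-quartile range

# Points beyond these fences are treated as outliers
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]

print(f"fences: [{lower}, {upper}]")
print("outliers:", outliers)
```

Note that for a column where most values are identical (for example, many zeros), the IQR is 0 and every non-zero value falls outside the fences, so the rule should be applied with care to such columns.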

##### 2. What is/are the insight(s) found from the chart?

Here we can see that the columns that contain outliers are Rainfall, Snowfall, Windspeed and Solar Radiation.

Outlier removal has created some null values in the treated columns. We can now either delete the observations with null values or impute them with meaningful values. In this case, I will impute them with the median value of each column.

Note: The mean is usually chosen to impute null values, but I am choosing the median because the mean is strongly affected by outliers, whereas the median is not.
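
A small sketch of why the median is the safer fill value: with hypothetical readings that include one extreme value, the mean is pulled toward it while the median stays with the bulk of the data. The numbers are illustrative only.

```python
from statistics import mean, median

# Hypothetical readings with one extreme value and one gap (None)
readings = [1.2, 1.5, 1.7, 2.0, 14.0, None]

observed = [v for v in readings if v is not None]
fill_mean = mean(observed)       # pulled upward by the extreme value 14.0
fill_median = median(observed)   # stays near the typical readings

# Impute the gap with the median
imputed = [fill_median if v is None else v for v in readings]

print("mean:", fill_mean, "median:", fill_median)
print(imputed)
```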

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

There is no negative outcome: the outliers have been treated, and the DataFrame now contains five object-type columns (`object(5)` in the `bike_df.info()` output).

We can also observe some shifts in the distribution of the data after treating outliers. Some of the data were skewed before handling outliers, but after doing so, the features almost follow the normal distribution. Therefore, we are not utilizing the numerical feature transformation technique.

#### Chart - 4 Hour

In [None]:
# Chart - 4 visualization code

fig,ax = plt.subplots(1,3, figsize=(15,4))

# Univariate analysis
count = sns.countplot(data=bike_df, x='hour', ax=ax[0])
count.set_title('Count Plot of Hour')

# adding value count on the top of bar
for p in count.patches:
    count.annotate(format(p.get_height(), '.0f'), (p.get_x(), p.get_height()))

# Bi-variate analysis 
# Rented Bike Count Vs Hour
bar = sns.barplot(data=bike_df, x='hour', y='rented_bike_count', ax = ax[1])
bar.set(xlabel='Hour', ylabel='Rented Bike Count', title='Rented Bike Count Vs Hour')

# Multi-variate analysis
point = sns.pointplot(data=bike_df, x='hour', y='rented_bike_count', hue='holiday', ax= ax[2])
point.set(xlabel='Hour', ylabel = ' Rented Bike Count', title='Average Bike Rented Vs Hour with holiday status')

plt.tight_layout()
plt.show()

In [None]:
# plotting line graph
# group by hour and get the average bikes rented
avg_rent_hrs = bike_df.groupby('hour')['rented_bike_count'].mean()

# plot average rent over time(hrs)
plt.figure(figsize=(20,4))
a=avg_rent_hrs.plot(legend=True,marker='o',title="Average Bikes Rented Per Hr")
a.set_xticks(range(len(avg_rent_hrs)));
a.set_xticklabels(avg_rent_hrs.index.tolist(), rotation=85);

##### 1. Why did you pick the specific chart?

These charts compare rentals by hour; line graphs are well suited to showing how a quantity changes over time.

##### 2. What is/are the insight(s) found from the chart?

*  Every hour has an equal number of records in the dataset.

*  Demand for rented bikes is highest at 8 AM and 6 PM, indicating high demand during commuting hours.

*  Rented Bike Count follows 2 patterns, one for non-holidays and another for holidays.

*  No Holiday (working day): The first pattern has a peak in rentals at around 8 am and another at around 6 pm. These correspond to local riders who typically commute to work on a working day, Monday to Friday.

*  Holiday (non-working day): The second pattern has more or less uniform rentals across the day, with a peak at around noon. These correspond to probable tourists and casual users who rent and drop off bikes uniformly during the day and tour the city on non-working days, which typically are Saturday and Sunday.
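
The two hourly patterns described above can be extracted numerically as well as visually. The sketch below uses a toy frame with hypothetical counts; the column names follow the renamed columns in this notebook, and the `'Holiday'` / `'No Holiday'` labels are assumed to match the dataset's values.

```python
import pandas as pd

# Toy data with hypothetical counts; not taken from the bike dataset
toy = pd.DataFrame({
    "hour": [8, 8, 12, 12, 18, 18],
    "holiday": ["No Holiday", "Holiday"] * 3,
    "rented_bike_count": [1000, 300, 600, 500, 1400, 450],
})

# Average rentals per hour, with one column per holiday status
hourly_profile = toy.pivot_table(
    index="hour", columns="holiday", values="rented_bike_count", aggfunc="mean"
)

# Hour with the highest average demand for each holiday status
peak_hours = hourly_profile.idxmax()

print(hourly_profile)
print(peak_hours)
```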

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

There is no negative growth: demand for rented bikes is highest at 8 AM and 6 PM, indicating high demand during commuting hours.

#### Chart - 5 seasons

In [None]:
bike_df.columns

In [None]:
# Chart - 5 visualization code

#seasons

fig,ax=plt.subplots(figsize=(10,8))
sns.boxplot(data=bike_df,x='seasons',y='rented_bike_count',ax=ax)
ax.set(title='Count of Rented bikes according to Seasons ')

In [None]:
fig,ax = plt.subplots(1,3, figsize=(15,4))

# Univariate analysis
count = sns.countplot(data=bike_df, x='seasons', ax=ax[0])
count.set_title('Count Plot of seasons')

# adding value count on the top of bar
for p in count.patches:
    count.annotate(format(p.get_height(), '.0f'), (p.get_x(), p.get_height()))

# Bi-variate analysis 
# Rented Bike Count Vs Seasons
bar = sns.barplot(data=bike_df, x='seasons', y='rented_bike_count', ax = ax[1])
bar.set(xlabel='Seasons', ylabel='Rented Bike Count', title='Rented Bike Count Vs Seasons')

# Multi-variate analysis
cat = sns.barplot(data=bike_df, x='seasons', y='rented_bike_count', hue='holiday', ax= ax[2])
cat.set(xlabel='Seasons', ylabel = ' Rented Bike Count', title='Average Bike Rented Vs Seasons with holiday status')

plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

The countplot is used to represent the occurrence(counts) of the observation present in the categorical variable.

##### 2. What is/are the insight(s) found from the chart?

The dataset has 4 seasons, and every season has more than 2000 records.

Summer is the most preferred season for renting bikes and winter the least, which suggests that people prefer to rent bikes in warm weather.

In every season, rented_bike_count is higher on non-holidays than on holidays.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Bike rentals are lower in spring than in summer and autumn.

There are many outlier points for particular seasons or weather conditions. This is most likely due to the variation in demand across the day.

#### Chart - 6 Holiday

In [None]:
# Chart - 6 visualization code
##Holiday

bike_df.groupby('holiday')['rented_bike_count'].sum().plot.pie(radius=2)

In [None]:
fig,ax=plt.subplots(figsize=(7,7))
sns.pointplot(data=bike_df,x='hour',y='rented_bike_count',hue='holiday',ax=ax)
ax.set(title='Count of Rented bikes according to Holiday ')

In [None]:
fig,ax = plt.subplots(1,3, figsize=(15,4))

# Univariate analysis
count = sns.countplot(data=bike_df, x='holiday', ax=ax[0])
count.set_title('Count Plot of Holiday')

# adding value count on the top of bar
for p in count.patches:
    count.annotate(format(p.get_height(), '.0f'), (p.get_x(), p.get_height()))

# Bi-variate analysis 
# Rented Bike Count Vs Holiday
bar = sns.barplot(data=bike_df, x='holiday', y='rented_bike_count', ax = ax[1])
bar.set(xlabel='Holiday', ylabel='Rented Bike Count', title='Rented Bike Count Vs Holiday')

# Multi-variate analysis
cat = sns.barplot(data=bike_df, x='holiday', y='rented_bike_count', hue='year', ax= ax[2])
cat.set(xlabel='Holiday', ylabel = ' Rented Bike Count', title='Average Bike Rented Vs Holiday with year')

plt.tight_layout()
plt.show()




In [None]:
bike_df.columns

##### 1. Why did you pick the specific chart?

Multiple charts are used to compare rentals on non-holidays with rentals on holidays.

##### 2. What is/are the insight(s) found from the chart?

*  The dataset has more non-holiday records than holiday records, which is expected because most days are working days.

*  Demand for bike sharing is higher on non-holidays than on holidays, indicating that work-related bike rentals dominate.

*  The dataset has more records from 2018 than from 2017.

*  Holiday and 'day' can be combined: a working day is a weekday that is not a holiday.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Since we noticed two kinds of bike rental behaviours, on working days and on non-working days, we will retain only the workingday column and drop the 'day' and 'holiday' columns.
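
One possible way to derive such a flag is sketched below on a toy frame. The column name `working_day` and the `'Holiday'` / `'No Holiday'` labels are assumptions made for illustration; the rule applied is "a weekday that is not a holiday".

```python
import pandas as pd

# Toy frame; values are hypothetical and labels are assumed to match the dataset
toy = pd.DataFrame({
    "weekday": ["Monday", "Saturday", "Wednesday", "Sunday"],
    "holiday": ["No Holiday", "No Holiday", "Holiday", "No Holiday"],
})

is_weekend = toy["weekday"].isin(["Saturday", "Sunday"])
is_holiday = toy["holiday"].eq("Holiday")

# 1 for a weekday that is not a holiday, 0 otherwise
toy["working_day"] = (~is_weekend & ~is_holiday).astype(int)

print(toy)
```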

#### Chart - 7 Temperature

In [None]:
# Chart - 7 visualization code

fig,ax = plt.subplots(1,3, figsize=(18,4))

# Univariate analysis
# Dependent Column Value Vs temperature
# group temperature column

temp_wrt_bike_rent_count = bike_df.groupby(['temperature'])['rented_bike_count'].mean().reset_index()

line = sns.lineplot(x = 'temperature', y ='rented_bike_count', data = temp_wrt_bike_rent_count, ax = ax[0])
line.set_title('Average rented bike count wrt temperature')

# Multi-variate analysis
# Dependent Column Value Vs hour with temperature

bike_df_nw = bike_df[(bike_df.weekday != 'Saturday') & (bike_df.weekday != 'Sunday')]
bike_df_w = bike_df[(bike_df.weekday == 'Saturday') | (bike_df.weekday == 'Sunday')]

# Weekend

scatter2 = sns.scatterplot(x=bike_df_w.hour, y=bike_df_w['rented_bike_count'], hue=bike_df_w.temperature, palette="RdBu", ax =ax[1])  # hue/palette is seaborn's interface for a colour gradient
scatter2.set(xticks = range(24), xlabel='Hours in day', ylabel='rented_bike_count')
scatter2.set_title('Weekend: Rented Bike Count vs. Hour with Temperature Gradient')

# Not a Weekend


scatter = sns.scatterplot(x=bike_df_nw.hour, y=bike_df_nw['rented_bike_count'], hue=bike_df_nw.temperature, palette="RdBu", ax = ax[2])  # hue/palette is seaborn's interface for a colour gradient
scatter.set(xticks = range(24), xlabel='Hours in day', ylabel='rented_bike_count')
scatter.set_title('Not a Weekend: Rented Bike Count vs. Hour with Temperature Gradient')


plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

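Subplots place the line plot of average rentals against temperature next to the weekend and weekday scatter plots, so the overall temperature trend and the hourly patterns can be compared side by side.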

##### 2. What is/are the insight(s) found from the chart?

*  The line plot shows that the average number of rented bikes rises steadily with temperature, with a slight decrease at the highest temperatures.

*  People prefer renting bikes in warm conditions, so demand is high when the temperature is sufficiently warm, but extremely hot temperatures reduce demand.

*  The scatter plots show the same slight decrease in count when the temperature is too high (the darkest of the blue dots).


##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Temperature: People generally prefer to bike at moderate to high temperatures. We see the highest rental counts between 32 and 36 degrees Celsius.

#### Chart - 8

In [None]:

# Chart - 8 visualization code

fig,ax = plt.subplots(1,3, figsize=(15,4))

# Univariate analysis (histplot with a KDE curve replaces the deprecated distplot)
dist = sns.histplot(bike_df.snowfall, kde=True, ax = ax[0])
dist.set_title('Distribution Plot of Snowfall')

# Bi-variate analysis 
# Rented Bike Count Vs Snowfall
scatter = sns.scatterplot(data=bike_df, x='snowfall', y='rented_bike_count', ax = ax[1])
scatter.set(xlabel='Snowfall', ylabel='Rented Bike Count', title='Scatter Plot of Rented Bike Count Vs Snowfall')

# Line Plot
group_snowfall = bike_df.groupby(['snowfall'])['rented_bike_count'].mean().reset_index()
line = sns.lineplot(data = group_snowfall, x ='snowfall', y = 'rented_bike_count', ax= ax[2])
line.set(xlabel='Snowfall', ylabel = ' Rented Bike Count', title='Average Bike Rented Vs Snowfall')

plt.tight_layout()
plt.show()

**Also creating a list of columns that can possibly contain outliers**

In [None]:
# Exclude the target and the hour column, using the renamed (snake_case) column names
possible_outlier_cols = list(set(bike_df.describe().columns)-{'rented_bike_count','hour'})

possible_outlier_cols

In [None]:

#Creating a boxplot to detect columns with outliers 

plt.figure(figsize=(25,25))
for index,item in enumerate(possible_outlier_cols):
  plt.subplot(5,6,index+1)
  sns.boxplot(x=bike_df[item])


In [None]:
#Creating a list of columns that contains outliers
outlier_cols = ['Rainfall(mm)','Wind speed (m/s)','Snowfall (cm)','Solar Radiation (MJ/m2)']
outlier_cols


In [None]:
# Interquartile range of every numeric column
Q1 = bike_df.quantile(0.25, numeric_only=True)
Q3 = bike_df.quantile(0.75, numeric_only=True)
IQR = Q3 - Q1
print(IQR)




In [None]:
# listing features to remove outliers

features = list(bike_df.columns)
features = features[2:]
list_0 = ['hour','seasons','holiday','functioning_day','month','year','week']
new_features = [x for x in features if x not in list_0]

In [None]:
#Calculating the upper and lower fence for outlier removal
u_fence = Q3 + (1.5*IQR)
l_fence = Q1 - (1.5*IQR)


In [None]:
#Checking the number of outliers deleted
bike_df.info()

Due to outlier deletion, some null values have been created in these 5 columns. We can either delete the observations with null values or impute them with meaningful values. In this case I will impute them with the median value of each column.

Note: The mean is often chosen to impute null values, but I will use the median because the mean is strongly affected by outliers whereas the median is not.
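The fence-based removal and median imputation described above can be sketched as follows. This is an illustrative version on a toy frame, not the notebook's exact cell: the columns and values are assumptions. Values outside the 1.5 × IQR fences are turned into nulls and then filled with the median of the remaining values.

```python
import pandas as pd

# Toy data with one obvious outlier per column (values are made up).
toy = pd.DataFrame({
    "wind_speed": [1.0, 1.2, 1.1, 0.9, 1.3, 9.0],
    "snowfall": [0.0, 0.0, 0.1, 0.0, 0.0, 5.0],
})

q1 = toy.quantile(0.25)
q3 = toy.quantile(0.75)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Keep values inside the fences; everything else becomes NaN.
masked = toy.where((toy >= lower_fence) & (toy <= upper_fence))
n_masked = masked.isna().sum()

# Impute the created nulls with each column's median.
imputed = masked.fillna(masked.median())
print(n_masked)
print(imputed)
```

Because the median is computed after masking, the removed outliers do not influence the imputed value.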

##### 2. What is/are the insight(s) found from the chart?

*  The distribution of snowfall is highly right-skewed (positively skewed).

*  Most rentals occur when there is little or no snowfall.

*  The snowfall column contains outliers.

#### Chart - 9

In [None]:
# Chart - 9 visualization code
numeric_features= bike_df.select_dtypes(exclude='object')



In [None]:
numeric_features.describe().transpose()


In [None]:
for col in numeric_features[:]:                                                     #plotting histogram
  sns.histplot(bike_df[col])
  plt.axvline(bike_df[col].mean(), color='magenta', linestyle='dashed', linewidth=2)
  plt.axvline(bike_df[col].median(), color='cyan', linestyle='dashed', linewidth=2)   
  plt.show()

##### 1. Why did you pick the specific chart?

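Histograms show the shape of each numeric feature's distribution, and the dashed mean (magenta) and median (cyan) lines make any skew easy to spot.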

##### 2. What is/are the insight(s) found from the chart?

*  Skewness measures the asymmetry of a distribution.

*  Kurtosis quantifies how heavy the tails of a distribution are.

*  For numerical features, we can see that the majority of distributions are right-skewed.

*  The distributions of rainfall, snowfall, and solar radiation are highly skewed to the right, which suggests that these columns have many outliers. Some columns are negatively skewed.

*  Some of the variables may approach a normal distribution once outliers are removed.

*  As a result, it appears that outliers should be removed before any transformation.

*  First, we will remove the outliers, and then we will check whether a transformation is still needed.
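Skewness and kurtosis can also be read off numerically rather than from the histograms alone. A small sketch on made-up columns (the frame below is an assumption for illustration); note that pandas' `kurt()` reports excess kurtosis, so a normal distribution scores about 0.

```python
import pandas as pd

# Made-up columns: one symmetric, one with a long right tail.
toy = pd.DataFrame({
    "symmetric": [1, 2, 3, 4, 5],
    "right_skewed": [0, 0, 0, 0, 10],
})

summary = pd.DataFrame({
    "skew": toy.skew(),      # > 0 means right-skewed, < 0 means left-skewed
    "kurtosis": toy.kurt(),  # excess kurtosis
})
print(summary)
```

Applied to the numeric columns of the real data, the same two calls give a quick ranking of which features are the most skewed.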

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 10

In [None]:
# Chart - 10 visualization code

# plotting a regression plot of each numeric column v/s the rented bike count column

for col in numeric_features:
  if col == 'rented_bike_count':
    continue  # skip the target itself
  sns.regplot(x=bike_df[col],y=bike_df["rented_bike_count"],line_kws={"color": "red"})
  plt.show()


In [None]:
# Regression Plots with respect to Temperature, Humidity and Windspeed
fig = plt.figure(figsize=(18, 8))
axes = fig.add_subplot(1, 3, 1)
sns.regplot(data=bike_df, x='temperature', y='month',ax=axes)
axes.set(title='Reg Plot for Temperature vs. month')
axes = fig.add_subplot(1, 3, 2)
sns.regplot(data=bike_df, x='humidity', y='month',ax=axes, color='r')
axes.set(title='Reg Plot for Humidity vs. month')
axes = fig.add_subplot(1, 3, 3)
sns.regplot(data=bike_df, x='wind_speed', y='month',ax=axes, color='g')
axes.set(title='Reg Plot for Windspeed vs. month')
plt.show()

In [None]:
bike_df.info()

##### 1. Why did you pick the specific chart?

A regression model is able to show whether changes observed in the dependent variable are associated with changes in one or more of the explanatory variables.

##### 2. What is/are the insight(s) found from the chart?

*  We use regression plots to inspect this correlation. They also show whether each independent variable has a linear relationship with the dependent variable, which is an assumption of models like linear regression.

*  The remaining columns appear to have a roughly linear relationship with the dependent variable, so we treat the assumption as satisfied and proceed.


##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.


The goal of feature engineering is to extract relevant information from the raw data and represent it in a way that can be easily understood by the machine learning model. The success of a machine learning model depends heavily on the quality of the features used as inputs, so feature engineering plays an important role in model performance.

#### Chart - 11

In [None]:
# Chart - 11 visualization code
#Extracting categorical features

categorical_features= bike_df.select_dtypes(include='object')


In [None]:
#Creating function to return all the unique values each categorical column can have
def cat_unique_vals(cat_cols,bike_df):
  for col in cat_cols:
    print("The values that the categorical column",col,"can take are:",bike_df[col].unique())

In [None]:
#Checking the possible values important and meaningful categorical columns can have.
categorical_columns=['seasons','holiday']
cat_unique_vals(categorical_columns,bike_df)

If a feature has more than 35 categories, we will consider it a numerical feature; the remaining features will be added to categorical_features.
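One way to apply this rule is to count the distinct values per column. The sketch below uses a made-up frame; the names `CATEGORY_THRESHOLD`, `numerical_cols` and `categorical_cols` are illustrative and do not come from the notebook.

```python
import pandas as pd

CATEGORY_THRESHOLD = 35  # more distinct values than this -> treat as numerical

# Made-up frame: 40 distinct temperatures, 4 seasons, 20 distinct hours.
toy = pd.DataFrame({
    "temperature": [t / 2 for t in range(40)],
    "seasons": ["Winter", "Spring", "Summer", "Autumn"] * 10,
    "hour": list(range(20)) * 2,
})

numerical_cols = [c for c in toy.columns if toy[c].nunique() > CATEGORY_THRESHOLD]
categorical_cols = [c for c in toy.columns if c not in numerical_cols]

print("numerical:", numerical_cols)
print("categorical:", categorical_cols)
```

A low-cardinality integer column such as `hour` ends up on the categorical side under this rule, which is the intended effect.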

In [None]:
#Creating a function that performs a groupby operation and returns a dataframe for analysis
def create_df_analysis(col):
  return bike_df.groupby(col)['rented_bike_count'].sum().reset_index()

In [None]:
seasons_col = create_df_analysis('seasons')
seasons_col

In [None]:
#Creating a visualisation for the seasons column
plt.figure(figsize=(10,7))
splot = sns.barplot(data=seasons_col,x='seasons',y='rented_bike_count')
for p in splot.patches:
    splot.annotate(format(p.get_height(), '.1f'), 
                   (p.get_x() + p.get_width() / 2., p.get_height()), 
                   ha = 'center', va = 'center', 
                   xytext = (0, 9), 
                   textcoords = 'offset points')
plt.xlabel("seasons", size=14)
plt.ylabel("rented_bike_count", size=14)
plt.show()

In [None]:
bike_df

In [None]:
categorical_features

In [None]:
#plotting box plots of rented bike count for each categorical column
for col in categorical_features:
  plt.figure(figsize=(8,8))
  sns.boxplot(x=bike_df[col],y=bike_df["rented_bike_count"])
  plt.show()

There are several encoding techniques, including:

*  One-hot encoding: creates a binary column for each unique category, with a value of 1 indicating the presence of the category and 0 indicating the absence.

*  Label encoding: assigns a unique integer value to each category.

*  Ordinal encoding: assigns an ordered integer value to each category based on the natural ordering of the categories.

*  Count encoding: replaces a categorical value with the number of times it appears in the dataset.
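The four techniques can be compared on a single toy column. This is a sketch using plain pandas; the season values and the ordering in `season_order` are assumptions chosen for illustration, and the notebook itself may use a different encoder.

```python
import pandas as pd

toy = pd.DataFrame({"seasons": ["Winter", "Spring", "Summer", "Spring"]})

# One-hot encoding: one 0/1 column per category.
one_hot = pd.get_dummies(toy["seasons"], prefix="seasons", dtype=int)

# Label encoding: an arbitrary integer per category (order of first appearance).
label_codes, label_categories = pd.factorize(toy["seasons"])

# Ordinal encoding: integers that follow an explicit, meaningful order.
season_order = ["Winter", "Spring", "Summer", "Autumn"]  # hypothetical ordering
ordinal_codes = pd.Categorical(
    toy["seasons"], categories=season_order, ordered=True
).codes

# Count encoding: replace each value with its frequency in the column.
count_encoded = toy["seasons"].map(toy["seasons"].value_counts())

print(one_hot)
print(label_codes, ordinal_codes, count_encoded.tolist())
```

One-hot encoding avoids implying an order between categories, which is usually the safer default for nominal columns such as seasons or holiday.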



##### 1. Why did you pick the specific chart?

Box plots are useful as they provide a visual summary of the data, enabling researchers to quickly identify median values, the dispersion of the data set, and signs of skewness.

##### 2. What is/are the insight(s) found from the chart?

Most machine learning models work only with numerical values, and therefore important categorical columns have to be converted/encoded into numerical variables. This process is known as feature encoding.

#### Chart - 12 functioning day

In [None]:
# Chart - 12 visualization code

fig,ax = plt.subplots(1,3, figsize=(15,4))

# Univariate analysis
count = sns.countplot(data=bike_df, x='functioning_day', ax=ax[0])
count.set_title('Count Plot of Functioning Day')

# adding value count on the top of bar
for p in count.patches:
    count.annotate(format(p.get_height(), '.0f'), (p.get_x(), p.get_height()))

# Bi-variate analysis 
# Rented Bike Count Vs Functioning Day
bar = sns.barplot(data=bike_df, x='functioning_day', y='rented_bike_count', ax = ax[1])
bar.set(xlabel='Functioning Day', ylabel='Rented Bike Count', title='Rented Bike Count Vs Functioning Day')

# Multi-variate analysis
cat = sns.barplot(data=bike_df, x='functioning_day', y='rented_bike_count', hue='session', ax= ax[2])
cat.set(xlabel='Functioning Day', ylabel = ' Rented Bike Count', title='Average Bike Rented Vs Functioning Day with session')

plt.tight_layout()
plt.show()

On a non-functioning day, the total rented bike count is zero. We therefore take two approaches to see which gives the better result: first, keeping all values; second, removing the non-functioning-day rows and then dropping the entire column. Before taking either step, it is better to check the correlation of the column with our target column (rented_bike_count).
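Both steps can be sketched on a toy frame: first the correlation of the functioning-day flag with the target, then the second approach of dropping non-functioning rows and the then-constant column. The `'Yes'` / `'No'` values and the counts are assumptions for illustration.

```python
import pandas as pd

# Made-up rows: no rentals on non-functioning days.
toy = pd.DataFrame({
    "functioning_day": ["Yes", "Yes", "No", "Yes", "No"],
    "rented_bike_count": [120, 340, 0, 210, 0],
})

# Correlation of the binary flag with the target.
flag = (toy["functioning_day"] == "Yes").astype(int)
corr_with_target = flag.corr(toy["rented_bike_count"])

# Second approach: keep functioning days only, then drop the constant column.
functioning_only = (
    toy[toy["functioning_day"] == "Yes"]
    .drop(columns="functioning_day")
    .reset_index(drop=True)
)

print(round(corr_with_target, 3))
print(functioning_only)
```

After filtering, the column carries a single value for every row, which is why it can be dropped without losing information.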

##### 1. Why did you pick the specific chart?

Subplots place the count plot, the bar plot of rented bike count, and the bar plot split by session side by side, so functioning and non-functioning days can be compared across all three views at once.

##### 2. What is/are the insight(s) found from the chart?

*  The dataset has more records for functioning days than for non-functioning days.

*  Although the dataset contains some non-functioning days, no bikes are rented on them.

*  On a functioning day, the evening session has the most rented bike count.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

*  Lower demand in the winter season

*  Slightly higher demand on non-holidays

*  Almost no demand on non-functioning days



#### Chart - 13 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code

# Heatmap relative to all numeric columns
corr_matrix = bike_df.corr(numeric_only=True)
# Boolean mask that hides the upper triangle (the diagonal stays visible)
mask = np.triu(np.ones_like(corr_matrix, dtype=bool), k=1)

fig = plt.figure(figsize=(10, 10))
sns.heatmap(corr_matrix, mask=mask, annot=True, cbar=True, vmax=0.8, vmin=-0.8, cmap='RdYlGn')
plt.show()


In [None]:
# Fill missing values in each numeric weather column with that column's mean
cols_to_fill = ['Temperature(°C)', 'Humidity(%)', 'Wind speed (m/s)', 'Visibility (10m)',
                'Dew point temperature(°C)', 'Solar Radiation (MJ/m2)', 'Rainfall(mm)', 'Snowfall (cm)']
for col in cols_to_fill:
  df1[col] = df1[col].fillna(df1[col].mean())



In [None]:
# extracting correlation heatmap

plt.figure(figsize=(10,8))
sns.heatmap(df1.corr(method='pearson', numeric_only=True),vmin=-1, vmax=1,cmap='coolwarm',annot=True, square=True)

##### 1. Why did you pick the specific chart?

A correlation heatmap shows the pairwise correlation between all numeric columns in a single colour-coded grid, which makes strongly related features and each feature's relationship with the target easy to spot.

##### 2. What is/are the insight(s) found from the chart?

temp (true temperature) and atemp (feels like temperature) are highly correlated as expected

count is highly correlated with casual and registered as expected since count = casual + registered

We see a positive correlation between count and temperature (as was seen in the regplot). This is probably only true for the range of temperatures provided

We see a negative correlation between count and humidity. The more the humidity, the less people prefer to bike

Not a great amount of correlation between humidity and temperature, though

Count has a weak dependence on windspeed

The correlation coefficient is a numerical measure of the strength and direction of a linear relationship between two variables. In other words, it measures the extent to which changes in one variable are associated with changes in the other variable. The correlation coefficient ranges from -1 to 1, with -1 indicating a perfect negative correlation, 1 indicating a perfect positive correlation, and 0 indicating no correlation.

The correlation coefficient is an important tool in data analysis and machine learning, as it can help to identify relationships between variables and can be used in feature selection techniques to remove highly correlated features, which can reduce overfitting and improve the performance of the model.
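As a sketch of correlation-based feature selection, the snippet below computes the absolute Pearson correlation matrix on synthetic data and drops one column from every pair above a threshold. The column names, the synthetic values and the 0.9 threshold are assumptions for illustration, not results from this dataset.

```python
import numpy as np
import pandas as pd

# Synthetic data: dew point is built to track temperature closely.
rng = np.random.default_rng(0)
temperature = rng.normal(15, 10, 200)
toy = pd.DataFrame({
    "temperature": temperature,
    "dew_point_temperature": temperature - 5 + rng.normal(0, 1, 200),
    "humidity": rng.uniform(20, 90, 200),
})

CORR_THRESHOLD = 0.9  # hypothetical cut-off

corr = toy.corr(method="pearson").abs()

# Look only above the diagonal so each pair is considered once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > CORR_THRESHOLD).any()]

reduced = toy.drop(columns=to_drop)
print("dropped:", to_drop)
```

Which member of a correlated pair to keep is a judgment call; this version simply keeps the column that appears first.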

#### Chart - 14 - Pair Plot 

In [None]:
# Pair Plot visualization code
bike_df

In [None]:
bike_df.hist(bins=50, figsize=(20,15))
plt.tight_layout()
plt.show()

In [None]:
sns.set_style('darkgrid')


In [None]:
sns.pairplot(bike_df)

##### 1. Why did you pick the specific chart?

Pair plots are a great method to identify trends for follow-up analysis and, fortunately, are easily implemented in Python! 

##### 2. What is/are the insight(s) found from the chart?

Seaborn's pairplot function creates a grid of axes in which each numerical variable in the data is shared across the x-axes of one column and the y-axes of one row. The off-diagonal cells are scatter plots showing the pairwise relationships, and the diagonal cells show the distribution of each variable.

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ? 
Explain Briefly.

Limitations and Scope for Future Work

Below are a few limitations of this analysis and ideas to improve model prediction accuracy:

*  Since casual + registered = total count, we predicted the total bike rental count directly and ignored the casual and registered user information. Another (possibly better) method would be to train separate models for casual and registered users and add the two predictions.

*  One limitation of the provided training data set is that it lacked data for extreme weather conditions (weather = 4). Hence we had to modify it to weather = 3.

*  Windspeed wasn't used due to its very low correlation. This might be due to several instances where windspeed = 0. One possible method would be to first estimate those windspeed values and then use windspeed as a feature to estimate count.

# **Conclusion**

Here are some solutions to manage Bike Sharing Demand.

*  The majority of rentals are for daily commutes to workplaces and colleges. Therefore open additional stations near these landmarks to reach their primary customers.

*  While planning for extra bikes to stations the peak rental hours must be considered, i.e. 7–9 am and 5–6 pm.

*  Maintenance activities for bikes should be done at night due to the low usage of bikes during the night time. Removing some bikes from the streets at night time will not cause trouble for the customers.

*  We see 2 rental patterns across the day in bike rental count - first for a Working Day where the rental count is high at peak office hours (8 am and 5 pm) and the second for a Non-working day where the rental count is more or less uniform across the day with a peak at around noon.

*  Hour of the day: Bike rental count is mostly correlated with the time of the day. As indicated above, the count reaches a high point during peak hours on a working day and is mostly uniform during the day on a non-working day.

*  Temperature: People generally prefer to bike at moderate to high temperatures. We see the highest rental counts between 32 and 36 degrees Celsius.

*  Season: We see the highest number of bike rentals in the Spring (July to September) and Summer (April to June) seasons and the lowest in the Winter (January to March) season.

*  Weather: As one would expect, we see the highest number of bike rentals on a clear day and the lowest on a snowy or rainy day

*  Humidity: With increasing humidity, we see a decrease in the bike rental count.

*  I chose the LightGBM model because prediction quality for rented_bike_count matters most here and training time is not a constraint. Various linear models, decision trees, random forests, and gradient boosting techniques were tried to improve accuracy, and I compared R2 scores to choose a model.

*  Because the dataset is small, the training R2 score is around 99% while the test R2 score is 92.5%. Once we get more data we can retrain the model for better performance.
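The gap between training and test scores can be checked with a few lines of NumPy. The helper functions `r2` and `rmse` and all numbers below are illustrative assumptions, not the notebook's actual predictions.

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def rmse(y_true, y_pred):
    """Root mean squared error, in the units of the target."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up predictions: very close on the training set, looser on the test set.
train_r2 = r2([100, 200, 300, 400], [102, 198, 301, 399])
test_r2 = r2([150, 250, 350], [170, 230, 360])
test_rmse = rmse([150, 250, 350], [170, 230, 360])

print(f"train R2={train_r2:.4f}  test R2={test_r2:.4f}  test RMSE={test_rmse:.2f}")
```

A training score noticeably above the test score, as in this toy case, is the usual sign of overfitting that more data or stronger regularization may reduce.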

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***