
# **Project Name**    -  **Bike Sharing Demand Prediction**



##### **Project Type**    - Regression
##### **Contribution**    - Individual


# **Problem Statement**


**BUSINESS PROBLEM OVERVIEW**



Rental bikes have been introduced in many cities to make urban travel more convenient. To keep waiting times short, bikes must be available and accessible to the public at the right time, so providing the city with a stable supply of rental bikes is a major concern. The crucial step is predicting the number of bikes required in each hour.

# ***Let's Begin !***

## Know Your Data

### Import Libraries

In [None]:
# Import the necessary libraries

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

### Dataset Loading

In [None]:
#mount google colab
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Import data from csv to df

data_df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Alma Better/Module 4 --- Machine Learning/Linear Regression capstone project - 2/SeoulBikeData.csv',encoding='unicode_escape')

### Dataset First View

In [None]:
# Dataset First
data_df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns
data_df.shape

### Dataset Information

In [None]:
# Dataset Info
data_df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
data_df.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
print(data_df.isnull().sum())

### What did you know about your dataset?

* The dataset has 8760 rows and 14 columns.
* There are no missing values or duplicate rows in the dataset.
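
The checks above can be bundled into one small helper. This is only a sketch: `quality_report` and the toy frame are hypothetical names and made-up values, not part of the notebook, and the same function could be called on `data_df`.

```python
import pandas as pd


def quality_report(frame):
    """Return row/column counts plus duplicate-row and missing-cell totals."""
    return {
        "rows": len(frame),
        "columns": frame.shape[1],
        "duplicate_rows": int(frame.duplicated().sum()),
        "missing_cells": int(frame.isnull().sum().sum()),
    }


# Made-up toy data: one repeated row and one missing value
toy = pd.DataFrame({
    "Hour": [0, 1, 1, 2],
    "Rented Bike Count": [254, 204, 204, None],
})
report = quality_report(toy)
print(report)
```

For a clean dataset, both `duplicate_rows` and `missing_cells` would be 0.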




## Understanding Your Variables

In [None]:
# Dataset Columns
data_df.columns

In [None]:
# Dataset Describe
data_df.describe(include='all')

### Variables Description


* Date - Date of the observation
* Rented Bike Count - Number of bikes rented in each hour (the dependent variable)
* Hour - Hour of the day (0-23)
* Temperature - Temperature in °C
* Humidity - Humidity in %
* Windspeed - Wind speed in m/s
* Visibility - Visibility in units of 10 m
* Dew Point Temperature - Dew point temperature in °C
* Solar Radiation - Solar radiation in MJ/m2
* Rainfall - Rainfall in mm
* Snowfall - Snowfall in cm
* Seasons - Winter, Spring, Summer, or Autumn
* Holiday - Holiday or No Holiday
* Functioning Day - Whether it is a functioning day or not (Yes or No)



### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for i in data_df.columns.tolist():
  print("No. of unique values in ",i,"is",data_df[i].nunique(),".")

## Data Wrangling

In [None]:
# Make a copy of data_df so that we can experiment with the data

df = data_df.copy()

df.shape

In [None]:
df.columns

In [None]:
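# Dependent variable 'Rented Bike Count'
plt.rcParams["figure.figsize"] = (7,7)
# histplot draws the histogram and, with kde=True, a density curve (distplot is deprecated in recent seaborn)
sns.histplot(df['Rented Bike Count'], kde=True, color="Blue")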

#### This visualization shows that the distribution of the dependent variable is right-skewed. Applying a log transformation should make it closer to a normal (Gaussian) distribution.
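
The effect of a log transformation on skewness can be checked numerically. The sketch below uses synthetic log-normal data as a stand-in for the rental counts (the numbers are made up and `sample_skewness` is a hypothetical helper); on the real data, `df['Rented Bike Count'].skew()` would give a comparable measure.

```python
import numpy as np


def sample_skewness(values):
    """Third standardized moment: about 0 for symmetric data, positive for a long right tail."""
    v = np.asarray(values, dtype=float)
    centered = v - v.mean()
    return float((centered ** 3).mean() / (centered ** 2).mean() ** 1.5)


# Synthetic right-skewed data, NOT the bike dataset
rng = np.random.default_rng(0)
counts = rng.lognormal(mean=6.0, sigma=0.8, size=5000)

raw_skew = sample_skewness(counts)
log_skew = sample_skewness(np.log10(counts))
print("skewness before log:", raw_skew)
print("skewness after log :", log_skew)
```

A skewness close to 0 after the transformation suggests the distribution has become roughly symmetric.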

# Data Preparation

* Some columns in the dataset are not in the desired format

* Date column is not usable unless we extract the month, weekday, and year
* Seasons column is in text format - needs one-hot encoding
* Holiday column is in text format - needs one-hot encoding
* The same applies to the Functioning Day column
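
The cells that follow build the indicator columns by hand with `np.where`. As a hedged alternative sketch, `pd.get_dummies` can produce them in one call; with `drop_first=True` it keeps one column fewer than the number of categories, which avoids indicator columns that always sum to 1 (such as a holiday flag next to a not-holiday flag). The toy frame is made up for illustration.

```python
import pandas as pd

# Made-up rows that mimic the text columns of the dataset
toy = pd.DataFrame({
    "Seasons": ["Winter", "Spring", "Summer", "Autumn", "Winter"],
    "Holiday": ["No Holiday", "Holiday", "No Holiday", "No Holiday", "No Holiday"],
})

# drop_first=True drops the first category (alphabetically) of each column
encoded = pd.get_dummies(toy, columns=["Seasons", "Holiday"], drop_first=True, dtype=int)
print(encoded.columns.tolist())
print(encoded.head())
```

Whether to drop a category is a modelling choice; regularized models tolerate the redundant columns better than plain least squares.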

In [None]:
# Separating the Date column into month, weekday, and year

# dayfirst=True assumes the dates are written dd/mm/yyyy; check data_df['Date'].head() to confirm
df['parsed_date'] = pd.to_datetime(data_df['Date'], dayfirst=True)

# Getting the month, weekday, and year from the date

df['month'] = df['parsed_date'].dt.month
df['weekday'] = df['parsed_date'].dt.weekday
df['year'] = df['parsed_date'].dt.year

# drop the date columns
df.drop(columns=['parsed_date','Date'], inplace=True)
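
The extracted month and weekday depend on how the date strings are interpreted. The sketch below assumes the `Date` column is written day-first (dd/mm/yyyy), which should be confirmed by inspecting `data_df['Date'].head()`; the two strings are made-up examples.

```python
import pandas as pd

raw = pd.Series(["01/12/2017", "02/12/2017"])  # hypothetical date strings

# Default parsing reads these as month/day/year
month_first = pd.to_datetime(raw)
# dayfirst=True reads them as day/month/year
day_first = pd.to_datetime(raw, dayfirst=True)

print(month_first.dt.month.tolist())
print(day_first.dt.month.tolist())
```

If the two results differ, the `month` and `weekday` features are only meaningful with the interpretation that matches the file.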


In [None]:
df.columns

In [None]:
df['Seasons'].unique()

In [None]:
df['winter'] = np.where(df['Seasons'] == 'Winter',1,0)
df['Spring'] = np.where(df['Seasons'] == 'Spring',1,0)
df['Summer'] = np.where(df['Seasons'] == 'Summer',1,0)
df['Autumn'] = np.where(df['Seasons'] == 'Autumn',1,0)

In [None]:
df['Holiday'].unique()

In [None]:
df['is_holiday'] = np.where(df['Holiday'] == 'Holiday',1,0)
df['is_not_holiday'] = np.where(df['Holiday'] == 'No Holiday',1,0)


In [None]:
df['Functioning Day'].unique()

In [None]:
df['is_func_day'] = np.where(df['Functioning Day'] == 'Yes',1,0)
df['is_not_func_day'] = np.where(df['Functioning Day'] == 'No',1,0)

In [None]:
df.columns

# df.drop(columns=['is_func_day','is_not_func_day'],axis = 1,inplace = True)

In [None]:
# Removing the original columns

df.drop(columns  = ['Seasons','Holiday','Functioning Day'],axis = 1,inplace = True)

In [None]:
df.columns

In [None]:
df.head()

# Data Visualizations

#### In this section we will check following three major points

* distribution of variables
* multicollinearity of feature variables
* relation of each variable with the label

In [None]:
df.columns

In [None]:
# Let's check the distribution of each numeric feature

plt.rcParams['figure.figsize'] = (22, 15)

numeric_cols = ['Hour', 'Temperature(°C)', 'Humidity(%)', 'Wind speed (m/s)',
                'Visibility (10m)', 'Dew point temperature(°C)',
                'Solar Radiation (MJ/m2)', 'Rainfall(mm)', 'Snowfall (cm)']

for pos, col in enumerate(numeric_cols, start=1):
    plt.subplot(3, 3, pos)
    # histplot draws the histogram and, with kde=True, a density curve
    sns.histplot(df[col], kde=True, color="Blue")
    plt.title(col)



* The plots above show that several variables are not normally distributed; some are strongly right-skewed.
* A log transformation can bring such right-skewed distributions closer to normal (Gaussian).

In [None]:
df.columns

### Checking for multicollinearity

In [None]:
df_corr = df.corr()
sns.heatmap(abs(df_corr), annot=True, cmap='coolwarm')


* Calculate the VIF (variance inflation factor) to check collinearity

In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor
def calc_vif(X):

    # Calculating VIF
    vif = pd.DataFrame()
    vif["variables"] = X.columns
    vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

    return(vif)

In [None]:
calc_vif(df[[i for i in df.describe().columns if i not in ['Rented Bike Count']]])
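
For intuition, the VIF of a feature is 1 / (1 - R2), where R2 comes from regressing that feature on all the other features. The NumPy sketch below spells this out on synthetic data; `vif_from_r2` is a hypothetical helper, not the statsmodels implementation. It adds an intercept to each auxiliary regression, so its numbers can differ from `variance_inflation_factor` applied to a matrix without a constant column.

```python
import numpy as np


def vif_from_r2(X):
    """VIF_j = 1 / (1 - R2_j), with R2_j from regressing column j on the other columns."""
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    vifs = []
    for j in range(n_cols):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n_rows), others])  # intercept + other features
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residual = target - design @ coef
        ss_res = float(residual @ residual)
        ss_tot = float(((target - target.mean()) ** 2).sum())
        r2 = 1.0 - ss_res / ss_tot
        vifs.append(float("inf") if r2 >= 1.0 else 1.0 / (1.0 - r2))
    return vifs


# Synthetic example: x2 is nearly a copy of x1, x3 is independent
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = x1 + 0.1 * rng.normal(size=500)
x3 = rng.normal(size=500)
vifs = vif_from_r2(np.column_stack([x1, x2, x3]))
print(vifs)
```

A common rule of thumb treats VIF values above roughly 5 to 10 as a sign of problematic collinearity.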

## Observations

* Temperature and dew point temperature are strongly correlated.
* We are going to ignore the correlation between the derived variables and the original variables. If the model overfits, we can handle it with regularization.


In [None]:
df.head()

In [None]:
# Let's check the relation of each variable with the dependent variable using graphs

plt.rcParams['figure.figsize'] = (20, 10)

plt.subplot(3,3,1)
plt.scatter(df['Hour'],df['Rented Bike Count'])
plt.title('Hour')

plt.subplot(3,3,2)
plt.scatter(df['Temperature(°C)'],df['Rented Bike Count'])
plt.title('Temperature(°C)')

plt.subplot(3,3,3)
plt.scatter(df['Humidity(%)'],df['Rented Bike Count'])
plt.title('Humidity(%)')

plt.subplot(3,3,4)
plt.scatter(df['Wind speed (m/s)'],df['Rented Bike Count'])
plt.title('Wind speed (m/s)')

plt.subplot(3,3,5)
plt.scatter(df['Visibility (10m)'],df['Rented Bike Count'])
plt.title('Visibility (10m)')


plt.subplot(3,3,6)
plt.scatter(df['Dew point temperature(°C)'],df['Rented Bike Count'])
plt.title('Dew point temperature(°C)')

plt.subplot(3,3,7)
plt.scatter(df['Solar Radiation (MJ/m2)'],df['Rented Bike Count'])
plt.title('Solar Radiation (MJ/m2)')

plt.subplot(3,3,8)
plt.scatter(df['Rainfall(mm)'],df['Rented Bike Count'])
plt.title('Rainfall(mm)')

plt.subplot(3,3,9)
plt.scatter(df['Snowfall (cm)'],df['Rented Bike Count'])
plt.title('Snowfall (cm)')


In [None]:
# Let's check the relation of each variable with the dependent variable using graphs

plt.rcParams['figure.figsize'] = (20, 10)

plt.subplot(3,3,1)
plt.scatter(df['month'],df['Rented Bike Count'])
plt.title('month')

plt.subplot(3,3,2)
plt.scatter(df['weekday'],df['Rented Bike Count'])
plt.title('weekday')

plt.subplot(3,3,3)
plt.scatter(df['year'],df['Rented Bike Count'])
plt.title('year')

plt.subplot(3,3,4)
plt.scatter(data_df['Seasons'],df['Rented Bike Count'])
plt.title('Seasons')

plt.subplot(3,3,5)
plt.scatter(data_df['Holiday'],df['Rented Bike Count'])
plt.title('Holiday')


plt.subplot(3,3,6)
plt.scatter(data_df['Functioning Day'],df['Rented Bike Count'])
plt.title('Functioning Day')




### With the scatterplots above, we compared each feature variable with the dependent variable.

* Temperature, dew point temperature, and visibility show a positive relationship with the label.

* Wind speed, snowfall, rainfall, and solar radiation show a somewhat inverse relationship with the label.
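
Visual impressions like these can be backed up with correlation coefficients. The sketch below uses a made-up toy frame rather than the real data; on the notebook's data the same pattern would be `df.corr()['Rented Bike Count']`. Pearson correlation only captures linear association, so it complements the scatterplots rather than replacing them.

```python
import pandas as pd

# Made-up values for illustration only
toy = pd.DataFrame({
    "Temperature(°C)": [-5.0, 0.0, 10.0, 20.0, 25.0],
    "Rainfall(mm)": [2.0, 1.0, 0.5, 0.0, 0.0],
    "Rented Bike Count": [100, 200, 600, 1100, 1300],
})

# Correlation of every feature with the label, sorted from most negative to most positive
target_corr = (
    toy.corr()["Rented Bike Count"]
    .drop("Rented Bike Count")
    .sort_values()
)
print(target_corr)
```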

# Model Implementation

In [None]:
df.columns

In [None]:
df['Rented Bike Count'].value_counts()

In [None]:
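# considering only the observations where Rented Bike Count is non-zero
# .copy() gives an independent DataFrame, so adding columns later does not trigger SettingWithCopyWarning
temp_df = df[df['Rented Bike Count'] != 0].copy()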

In [None]:
# The variables below are skewed, so we model their log10 transforms.
# log10(0) is -inf, so zero readings (for example hours with no rain) become -inf here
# and are replaced with 0 further down.

log_columns = {
    'log_transformed_output': 'Rented Bike Count',
    'log_transformed_Visibility': 'Visibility (10m)',
    'log_transformed_Wind speed': 'Wind speed (m/s)',
    'log_transformed_solar_radiation': 'Solar Radiation (MJ/m2)',
    'log_transformed_snowfall': 'Snowfall (cm)',
    'log_transformed_rainfall': 'Rainfall(mm)',
}
for new_col, source_col in log_columns.items():
    temp_df[new_col] = np.log10(temp_df[source_col])

#replacing the inf values with 0
temp_df.replace([np.inf, -np.inf], 0, inplace=True)




X = temp_df[[ 'Hour','Temperature(°C)',
       'log_transformed_Wind speed', 'log_transformed_Visibility',
       'log_transformed_solar_radiation', 'log_transformed_rainfall', 'log_transformed_snowfall', 'month',
       'weekday', 'year', 'winter', 'Spring', 'Summer', 'Autumn', 'is_holiday',
       'is_not_holiday', 'is_func_day', 'is_not_func_day']]

Y = temp_df[['log_transformed_output']]
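
One side effect of replacing `-inf` with 0 is that a reading of 0 ends up with the same transformed value as a reading of 1, while readings between 0 and 1 become negative. A commonly used alternative, shown here only as a sketch and not as what this notebook does, is to shift the feature by 1 before taking the log. The rainfall values are made up.

```python
import numpy as np

rainfall_mm = np.array([0.0, 0.5, 1.0, 10.0])  # made-up readings

# Approach used in the cell above: log10, then replace -inf with 0
with np.errstate(divide="ignore"):
    plain = np.log10(rainfall_mm)
plain[np.isinf(plain)] = 0.0

# Alternative: log10(1 + x) is defined at 0 and keeps the ordering of the readings
shifted = np.log10(1.0 + rainfall_mm)

print(plain)
print(shifted)
```

With the shifted version, larger readings always map to larger transformed values, which is usually easier for a linear model to use.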

In [None]:
# Train-Test Split and scaling the variables

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

X_train, X_test, y_train, y_test = train_test_split( X,Y , test_size = 0.1, random_state = 100)

std = MinMaxScaler()
X_train_std = std.fit_transform(X_train)
X_test_std = std.transform(X_test)
print(X_train_std.shape)
print(X_test_std.shape)
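
The scaler is fitted on the training split only and then reused for the test split, so no information from the test set leaks into preprocessing. The NumPy sketch below mimics the min-max idea under that assumption; `fit_min_max` and `apply_min_max` are hypothetical helpers and the numbers are made up.

```python
import numpy as np


def fit_min_max(train):
    """Learn per-column minimum and range from the training data only."""
    train = np.asarray(train, dtype=float)
    low = train.min(axis=0)
    span = train.max(axis=0) - low
    span[span == 0] = 1.0  # avoid division by zero for constant columns
    return low, span


def apply_min_max(values, low, span):
    """Scale with statistics learned earlier; unseen data may fall outside [0, 1]."""
    return (np.asarray(values, dtype=float) - low) / span


train = np.array([[0.0, -10.0], [12.0, 5.0], [23.0, 30.0]])  # e.g. hour, temperature
test = np.array([[6.0, 35.0]])

low, span = fit_min_max(train)
train_scaled = apply_min_max(train, low, span)
test_scaled = apply_min_max(test, low, span)
print(train_scaled)
print(test_scaled)
```

A test value above the training maximum scales to more than 1, which is expected and not an error.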

In [None]:
# Implementing Normal Linear regression

from sklearn.linear_model import LinearRegression

reg = LinearRegression().fit(X_train_std, y_train)
normal_score = reg.score(X_train_std, y_train)
normal_intercept = reg.intercept_
coeff = reg.coef_

print("r2 score for the model on the training set is ",normal_score)
print("intercept for the model is ",normal_intercept)
print("coeff for the model: ",coeff)

In [None]:
# Implementing Lasso regression

from sklearn.linear_model import Lasso

reg = Lasso(alpha=0.01).fit(X_train_std, y_train)
normal_score = reg.score(X_train_std, y_train)
normal_intercept = reg.intercept_
coeff = reg.coef_

print("r2 score for the model on the training set is ",normal_score)
print("intercept for the model is ",normal_intercept)
print("coeff for the model: ",coeff)

# In the output we can see some of the parameters went to zero
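
The zeroed coefficients come from the L1 penalty. As a rough illustration, in the idealized case of orthonormal feature columns (which does not hold for this dataset) the lasso solution is the least-squares coefficient vector passed through a soft-thresholding function. The sketch below shows that function; the threshold and coefficient values are hypothetical and do not correspond directly to sklearn's `alpha` scaling.

```python
import numpy as np


def soft_threshold(coefficients, threshold):
    """Shrink each coefficient toward zero by `threshold`; small ones become exactly 0."""
    coefficients = np.asarray(coefficients, dtype=float)
    return np.sign(coefficients) * np.maximum(np.abs(coefficients) - threshold, 0.0)


least_squares_coef = np.array([0.5, -0.02, 0.005])  # made-up values
shrunk = soft_threshold(least_squares_coef, threshold=0.01)
print(shrunk)
```

Coefficients whose magnitude is below the threshold are set exactly to zero, which is why lasso can act as a feature selector.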

In [None]:
# Implementing Ridge Regression

from sklearn.linear_model import Ridge

reg = Ridge(alpha=0.01).fit(X_train_std, y_train)

normal_score = reg.score(X_train_std, y_train)
normal_intercept = reg.intercept_
coeff = reg.coef_
print("r2 score for the model on the training set is ",normal_score)
print("intercept for the model is ",normal_intercept)
print("coeff for the model: ",coeff)

In [None]:
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

In [None]:
# Predict the values for the training set to check the training set accuracy

y_train_ = reg.predict(X_train_std)

MSE  = mean_squared_error(10**(y_train), 10**(y_train_))
print("MSE :" , MSE)

RMSE = np.sqrt(MSE)
print("RMSE :" ,RMSE)

r2 = r2_score(10**(y_train), 10**(y_train_))
print("R2 :" ,r2)
print("Adjusted R2 : ",1-(1-r2_score(10**(y_train), 10**(y_train_)))*((X_train.shape[0]-1)/(X_train.shape[0]-X_train.shape[1]-1)))



In [None]:
# Predict the values for the test set to check the test set accuracy

y_pred = reg.predict(X_test_std)

MSE  = mean_squared_error(10**(y_test), 10**(y_pred))
# MSE  = mean_squared_error((y_test), (y_pred))
print("MSE :" , MSE)

RMSE = np.sqrt(MSE)
print("RMSE :" ,RMSE)

In [None]:

r2 = r2_score(10**(y_test), 10**(y_pred))
print("R2 :" ,r2)
print("Adjusted R2 : ",1-(1-r2_score(10**(y_test), 10**(y_pred)))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1)))
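
The adjusted R2 expression is repeated in several cells; it is 1 - (1 - R2)(n - 1)/(n - p - 1), where n is the number of rows and p the number of features. A small helper, sketched here under the hypothetical name `adjusted_r2`, keeps the formula in one place.

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1)."""
    if n_samples - n_features - 1 <= 0:
        raise ValueError("need more samples than features + 1")
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


# Illustrative numbers only: R2 = 0.90 with 101 rows and 10 features
example = adjusted_r2(0.90, n_samples=101, n_features=10)
print(round(example, 4))  # 0.8889
```

In the notebook it could be called as `adjusted_r2(r2, X_test.shape[0], X_test.shape[1])`.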


### Observations

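* The training-set R2 score of plain linear regression and ridge regression is the same.
* The most likely reason is that alpha=0.01 is a very weak penalty, so the ridge coefficients stay close to the ordinary least-squares ones.
* Searching over alpha with cross-validation may help find a more useful penalty strength.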
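
How little a small penalty changes the fit can be seen from the closed-form ridge solution, (X'X + alpha*I)^-1 X'y. The sketch below uses synthetic data and omits the intercept for brevity, so it is an illustration rather than a reproduction of sklearn's `Ridge`; `ridge_coefficients` is a hypothetical helper.

```python
import numpy as np


def ridge_coefficients(X, y, alpha):
    """Closed-form ridge solution without an intercept; alpha=0 gives ordinary least squares."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    penalty = alpha * np.eye(X.shape[1])
    return np.linalg.solve(X.T @ X + penalty, X.T @ y)


# Synthetic data with features in [0, 1], similar in scale to min-max scaled inputs
rng = np.random.default_rng(42)
X = rng.uniform(0, 1, size=(1000, 5))
true_coef = np.array([1.5, -2.0, 0.5, 0.0, 3.0])
y = X @ true_coef + rng.normal(0, 0.1, size=1000)

ols = ridge_coefficients(X, y, alpha=0.0)
weak = ridge_coefficients(X, y, alpha=0.01)
strong = ridge_coefficients(X, y, alpha=1000.0)

print("max change with alpha=0.01:", np.abs(weak - ols).max())
print("coefficient norm, OLS vs alpha=1000:", np.linalg.norm(ols), np.linalg.norm(strong))
```

Only when alpha is large relative to the scale of X'X do the coefficients shrink noticeably.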

In [None]:
from sklearn.model_selection import GridSearchCV

ridge = Ridge()
parameters = {'alpha': [1e-15,1e-13,1e-10,1e-8,1e-5,1e-4,1e-3,1e-2,1e-1,1,5,10,20,30,40,45,50,55,60,100,0.0014]}
ridge_regressor = GridSearchCV(ridge, parameters, scoring='neg_mean_squared_error', cv=5)
ridge_regressor.fit(X_train_std, y_train)
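
After fitting, a `GridSearchCV` object exposes the selected setting and its cross-validated score through `best_params_` and `best_score_`. The sketch below runs on synthetic data with a shortened alpha grid, so the values it prints say nothing about the bike dataset. Because the scoring is `neg_mean_squared_error`, the score is negated to read it as an MSE.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression problem, for illustration only
rng = np.random.default_rng(0)
X_demo = rng.uniform(size=(200, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 0.1, size=200)

alpha_grid = [1e-3, 1e-1, 1, 10]
search = GridSearchCV(Ridge(), {"alpha": alpha_grid},
                      scoring="neg_mean_squared_error", cv=5)
search.fit(X_demo, y_demo)

print("best alpha:", search.best_params_["alpha"])
print("cross-validated MSE:", -search.best_score_)
```

In the notebook, the same attributes would be read from `ridge_regressor`.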

In [None]:
y_pred_cv = ridge_regressor.predict(X_test_std)

In [None]:

MSE  = mean_squared_error(10**(y_test), 10**(y_pred_cv))
print("MSE :" , MSE)

RMSE = np.sqrt(MSE)
print("RMSE :" ,RMSE)

In [None]:
r2 = r2_score(10**(y_test), 10**(y_pred_cv))
print("R2 :" ,r2)

print("Adjusted R2 : ",1-(1-r2_score(10**(y_test), 10**(y_pred_cv)))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1)))

# **Conclusion**

#### Based on the model coefficients, we came to the following conclusions.

* Rented bike count shows a positive relationship with temperature, so demand tends to be higher when the temperature is between 20 and 30 degrees Celsius.

* Rented bike count shows a positive relationship with visibility.

* Rented bike count varies inversely with rainfall, snowfall, and solar radiation.


