# Linear Regression - BoomBikes

### Avinash Kumar C33 - 12/12/2021
### Kaggle Version 1.2 Final


## Problem Statement

A bike-sharing system is a service in which bikes are made available for shared use to individuals on a short-term basis, either for a fee or for free. Many bike-share systems allow people to borrow a bike from a "dock", which is usually computer-controlled: the user enters the payment information and the system unlocks the bike. The bike can then be returned to another dock belonging to the same system.


A US bike-sharing provider, BoomBikes, has recently suffered considerable dips in its revenues due to the ongoing Corona pandemic. The company is finding it very difficult to sustain itself in the current market. It has therefore decided to come up with a mindful business plan to accelerate its revenue as soon as the ongoing lockdown ends and the economy returns to a healthy state.


In such an attempt, BoomBikes aspires to understand the demand for shared bikes once the ongoing nationwide Covid-19 quarantine ends. It plans to be prepared to cater to people's needs once the situation improves, to stand out from other service providers, and to make substantial profits.


They have contracted a consulting company to understand the factors on which the demand for these shared bikes depends. Specifically, they want to understand the factors affecting the demand for these shared bikes in the American market. The company wants to know:

> Which variables are significant in predicting the demand for shared bikes.

> How well those variables describe the bike demand.

Based on various meteorological surveys and people's styles, the service provider firm has gathered a large dataset on daily bike demands across the American market based on some factors. 


**Business Goal:**

You are required to model the demand for shared bikes with the available independent variables. Management will use the model to understand exactly how demand varies with different features, and can adjust the business strategy accordingly to meet demand levels and customers' expectations. Further, the model will be a good way for management to understand the demand dynamics of a new market.

## Importing Libraries

Suppress warnings for a clean notebook.

In [412]:
import warnings
warnings.filterwarnings('ignore')  # silence all warnings for a cleaner notebook

In [413]:
import numpy as np # Numpy Imported
import pandas as pd # pandas Imported
import matplotlib.pyplot as plt # Pyplot from matplotlib imported
import seaborn as sns # Seaborn imported
%matplotlib inline

sns.set()

## 1. Reading and Understanding the Data

In [414]:
bike= pd.read_csv("../input/bikeshare/day.csv")

Let's look at the first few rows of the data.

In [415]:
bike.head()

Let's check the shape of the data.

In [416]:
bike.shape

The dataset has 730 rows and 16 columns. Let's gather more information.

In [417]:
bike.info()

In [418]:
bike.describe()

From the above information, there appear to be no missing values: every column contains exactly 730 non-null entries.
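
The same check can be made explicitly instead of reading it off `info()`. The following is a minimal sketch on a small stand-in frame; the `toy` data is made up for illustration and only the column names mirror the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data, with one value deliberately missing.
toy = pd.DataFrame({
    "temp": [14.1, np.nan, 8.2],
    "hum": [80.6, 69.6, 43.7],
    "cnt": [985, 801, 1349],
})

# Count missing values per column; all zeros would mean nothing is missing.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Single yes/no answer for the whole frame.
has_missing = bool(toy.isnull().values.any())
print(has_missing)
```

Applied to the real frame, `bike.isnull().sum()` should show zero for every column if the reading of `info()` above is right.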

Let's go ahead and look at the date column.

In [419]:
bike['dteday'].dtype #check Datatype first

The column dteday needs to be converted to the datetime datatype.

In [420]:
bike['dteday'] =  pd.to_datetime(bike['dteday'],format='%d-%m-%Y')
bike['dteday'].dtype

The year and month need to be extracted from the column dteday.

In [421]:
bike['year'] = pd.DatetimeIndex(bike['dteday']).year
bike['month'] = pd.DatetimeIndex(bike['dteday']).month
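
As a small illustration of the conversion and extraction steps, the sketch below parses day-first date strings and pulls out the year and month with the `.dt` accessor, which is an alternative to wrapping the column in `pd.DatetimeIndex`. The `toy` frame and its dates are hypothetical.

```python
import pandas as pd

# Hypothetical day-first date strings in the same layout as dteday.
toy = pd.DataFrame({"dteday": ["01-01-2018", "15-06-2018", "31-12-2019"]})

# An explicit format avoids day/month ambiguity ("01-02-2018" is 1 February here).
toy["dteday"] = pd.to_datetime(toy["dteday"], format="%d-%m-%Y")

# The .dt accessor exposes date components of a datetime column.
toy["year"] = toy["dteday"].dt.year
toy["month"] = toy["dteday"].dt.month
print(toy)
```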

In [422]:
bike.head()

Since we now have the year and month in the correct format, the columns yr and mnth can be dropped.

In [423]:
bike.drop(['yr','mnth'],axis=1,inplace=True)

In [424]:
bike.head()

The column holiday does not seem to contain any useful information and hence is dropped.

In [425]:
bike.drop('holiday',axis=1,inplace=True)

The columns dteday, instant, casual and registered are not useful for the rest of the analysis and can be dropped.

In [426]:
bike.drop(['dteday','instant','casual','registered'],axis=1,inplace=True)

In [427]:
bike.head()

## 2: Encoding the Labels & Visualization

Let's now encode the columns according to the data dictionary.

### **i. season**

* 1:spring
* 2:summer
* 3:fall
* 4:winter

In [428]:
codes = {1:'spring',2:'summer',3:'fall',4:'winter'}
bike['season'] = bike['season'].map(codes)

In [429]:
sns.barplot(x='season',y='cnt',data=bike)  # x and y must be passed as keywords in recent seaborn

Rentals appear to be highest during the fall season, followed by summer and winter.
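
By default `sns.barplot` draws the mean of `cnt` for each category, so the ranking of seasons can also be read from a plain group-by. The sketch below uses made-up counts in a hypothetical `toy` frame; it shows the mapping step and the per-season mean, not the notebook's actual numbers.

```python
import pandas as pd

# Hypothetical daily counts; the real values come from day.csv.
toy = pd.DataFrame({
    "season": [1, 1, 2, 3, 3, 4],
    "cnt": [1000, 1400, 4000, 5500, 6500, 4200],
})

# Map integer codes to labels; codes missing from the dict would become NaN.
codes = {1: "spring", 2: "summer", 3: "fall", 4: "winter"}
toy["season"] = toy["season"].map(codes)

# Mean demand per season, highest first.
season_means = toy.groupby("season")["cnt"].mean().sort_values(ascending=False)
print(season_means)
```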

### **ii. weathersit**

* 1: Clear, Few clouds, Partly cloudy
* 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
* 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
* 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog

In [430]:
codes = {1:'Clear',2:'Mist',3:'Light Snow',4:'Heavy Rain'}
bike['weathersit'] = bike['weathersit'].map(codes)

In [431]:
sns.barplot(x='weathersit',y='cnt',data=bike)

Bikes are rented most in clear weather, followed by misty weather.

### **iii. workingday**

* 1 if the day is neither a weekend nor a holiday
* 0 otherwise

In [432]:
codes = {1:'working_day',0:'Holiday'}
bike['workingday'] = bike['workingday'].map(codes)

In [433]:
sns.barplot(x='workingday',y='cnt',data=bike,palette='cool')

Bikes appear to be rented slightly more on working days, with only a minor boost over non-working days.

### **iv. year**

* 2018:0
* 2019:1

In [434]:
codes = {2019:1,2018:0}
bike['year'] = bike['year'].map(codes)

In [435]:
sns.barplot(x='year',y='cnt',data=bike,palette='dark')

More bikes were rented in 2019 than in 2018, suggesting growing adoption by customers.

### **v. Month**

* 1:Jan
* 2:Feb
* 3:Mar
* 4:Apr
* 5:May
* 6:June
* 7:July
* 8:Aug
* 9:Sep
* 10:Oct
* 11:Nov
* 12:Dec

In [436]:
codes = {1:'Jan',2:'Feb',3:'Mar',4:'Apr',5:'May',6:'June',7:'July',8:'Aug',9:'Sep',10:'Oct',11:'Nov',12:'Dec'}
bike['month'] = bike['month'].map(codes)

In [437]:
plt.figure(figsize=(10,5))
sns.barplot(x='month',y='cnt',hue='year',data=bike,palette='Paired')

### **vi. weekday**

* 0:Mon
* 1:Tue
* 2:Wed
* 3:Thu
* 4:Fri
* 5:Sat
* 6:Sun

In [438]:
codes = {0:'Mon',1:'Tue',2:'Wed',3:'Thu',4:'Fri',5:'Sat',6:'Sun'}
bike['weekday'] = bike['weekday'].map(codes)

In [439]:
bike.groupby('weekday')['cnt'].max().plot(kind='bar')

The highest single-day rental counts occur on Saturday and Sunday.
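
Note that the plot above uses `max()`, which reports the single busiest day per weekday rather than typical demand. The sketch below, on made-up numbers in a hypothetical `toy` frame, shows how the two aggregates can rank days differently, so it may be worth looking at both.

```python
import pandas as pd

# Hypothetical counts: Saturday has one exceptional day, Wednesday is steady.
toy = pd.DataFrame({
    "weekday": ["Sat", "Sat", "Sat", "Wed", "Wed", "Wed"],
    "cnt": [500, 600, 8000, 4000, 4200, 4400],
})

# Compare the typical day (mean) with the best day (max) for each weekday.
summary = toy.groupby("weekday")["cnt"].agg(["mean", "max"])
print(summary)
```

Here Saturday has the larger maximum while Wednesday has the larger mean.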

### **vii. temp**

In [440]:
plt.scatter('temp','cnt',data=bike)

Bikes appear to be rented more when the temperature is above 20. Customers therefore seem to prefer renting when the temperature is higher.

### **viii. atemp**

In [441]:
plt.scatter('atemp','cnt',data=bike)

atemp shows the same pattern as temp.

### **ix. Humidity**

In [442]:
plt.scatter('hum','cnt',data=bike)

Most rentals appear to occur on days when the humidity is relatively high.

### **x. Windspeed**

In [443]:
plt.scatter('windspeed','cnt',data=bike)

This plot relates wind speed to rentals only. How wind speed correlates with temperature is examined with the correlation heatmaps in Section 3.

In [444]:
sns.histplot(bike['cnt'], kde=True)  # histplot replaces the deprecated distplot

## 3: Visualizing the relationship between variables

In [445]:
sns.pairplot(bike)

In [446]:
plt.figure(figsize = (12,6))
sns.heatmap(bike.corr(numeric_only=True),annot=True)  # skip the label-encoded text columns

In [447]:
data= bike[['temp','atemp','hum','windspeed']]
sns.heatmap(data.corr(),annot=True)

atemp and temp are highly correlated, so one of them should be dropped to avoid multicollinearity. Here atemp is dropped.

In [448]:
bike.drop('atemp',axis=1,inplace=True)
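
Reading correlated pairs off a heatmap works for four columns, but the same check can be automated. The sketch below builds synthetic data in which `atemp` is nearly a rescaled copy of `temp`, then lists every pair whose absolute correlation exceeds a cut-off. The data, the `threshold` of 0.9, and the variable names are assumptions for illustration, not values from the notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
temp = rng.normal(20, 7, size=300)
toy = pd.DataFrame({
    "temp": temp,
    "atemp": temp * 1.1 + rng.normal(0, 0.5, size=300),  # nearly a copy of temp
    "hum": rng.normal(60, 14, size=300),                 # unrelated to temp
})

corr = toy.corr().abs()

# Keep only the upper triangle so each pair appears once and the diagonal is ignored.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

threshold = 0.9  # assumed cut-off, not taken from the notebook
pairs = [(row, col) for row in upper.index for col in upper.columns
         if upper.loc[row, col] > threshold]
print(pairs)
```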

In [449]:
bike.head()

## 4: Categorical Variables

In [450]:
# dtype=int keeps the dummies as 0/1 integers (newer pandas versions default to booleans)
seasons = pd.get_dummies(bike['season'],drop_first=True,dtype=int)

working_day = pd.get_dummies(bike['workingday'],drop_first=True,dtype=int)

weather= pd.get_dummies(bike['weathersit'],drop_first=True,dtype=int)

month= pd.get_dummies(bike['month'],drop_first=True,dtype=int)

week_day= pd.get_dummies(bike['weekday'],drop_first=True,dtype=int)

In [451]:
bike= pd.concat([bike,seasons,working_day,weather,month,week_day],axis=1)

In [452]:
bike.head()

These categorical variables can be dropped as they are already dummy-encoded.

In [453]:
bike.drop(['season','workingday','weathersit','weekday','month'],axis=1,inplace=True)
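
To make the effect of `drop_first=True` concrete, the sketch below encodes a small hypothetical `season` column. A variable with k levels yields k-1 dummy columns; the dropped level becomes the baseline that the remaining coefficients are measured against.

```python
import pandas as pd

# Hypothetical column with the four season labels.
toy = pd.DataFrame({"season": ["spring", "summer", "fall", "winter", "fall"]})

# drop_first=True drops the first level in sorted order ("fall"), which becomes
# the baseline; dtype=int gives 0/1 columns instead of booleans.
dummies = pd.get_dummies(toy["season"], drop_first=True, dtype=int)
print(dummies)
```

A row of all zeros (the "fall" rows here) therefore represents the baseline level.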

In [454]:
bike.head()

## 5: Splitting the Data into Training and Testing Sets

Using a 70-30 split.

In [455]:
from sklearn.model_selection import train_test_split

np.random.seed(0)
df_train, df_test = train_test_split(bike, train_size = 0.7, test_size = 0.3, random_state = 100)
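
The role of `random_state` can be illustrated on a tiny frame: the same seed reproduces the same split, and no row lands in both sets. The `toy` frame below is hypothetical; the call itself mirrors the one above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical ten-row frame standing in for the bike data.
toy = pd.DataFrame({"temp": range(10), "cnt": range(100, 110)})

# Two splits with the same seed.
train_a, test_a = train_test_split(toy, train_size=0.7, random_state=100)
train_b, test_b = train_test_split(toy, train_size=0.7, random_state=100)

print(len(train_a), len(test_a))
print(train_a.index.equals(train_b.index))  # same rows in the same order

# No row should appear in both the training and the test set.
overlap = set(train_a.index) & set(test_a.index)
print(overlap)
```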

### Rescaling

In [456]:
from sklearn.preprocessing import StandardScaler
scaler= StandardScaler()

In [457]:
num_vars=['temp','hum','windspeed','cnt']

df_train[num_vars]= scaler.fit_transform(df_train[num_vars])
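
The following sketch spells out in plain NumPy what standardisation does: subtract the mean and divide by the standard deviation, with both statistics learned from the training data only. `StandardScaler` uses the population standard deviation (`ddof=0`), which is also NumPy's default. The arrays are made-up numbers.

```python
import numpy as np

# Hypothetical training and test values for one numeric column.
train = np.array([10.0, 20.0, 30.0, 40.0])
test = np.array([25.0, 50.0])

# Statistics are learned from the training data only.
mean = train.mean()
std = train.std()  # population standard deviation (ddof=0)

train_scaled = (train - mean) / std

# The test data reuses the training statistics; nothing is refit on it.
test_scaled = (test - mean) / std

print(train_scaled)
print(test_scaled)
```

This is why the notebook later calls `scaler.transform` rather than `fit_transform` on the test set: refitting there would leak test-set statistics into the evaluation.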

In [458]:
plt.scatter('temp','cnt',data=df_train)

### Dividing into X and Y sets for the model building

In [459]:
y_train = df_train.pop('cnt')
X_train = df_train

## 6: Linear Model

Using recursive feature elimination (RFE).

In [460]:
# Importing RFE and LinearRegression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

In [461]:
lm = LinearRegression()
lm.fit(X_train, y_train)

rfe = RFE(lm, n_features_to_select=10)  # the feature count must be passed by keyword in recent scikit-learn
rfe = rfe.fit(X_train, y_train)

In [462]:
list(zip(X_train.columns,rfe.support_,rfe.ranking_))
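
To show how `support_` and `ranking_` should be read, here is a minimal sketch on synthetic data where only two of five features influence the target. RFE repeatedly fits the estimator and discards the feature with the smallest coefficient magnitude until the requested number remains. The data and the name `selector` are illustrative assumptions.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))

# Only columns 0 and 2 drive the target; the rest are pure noise.
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.1, size=200)

selector = RFE(LinearRegression(), n_features_to_select=2)
selector.fit(X, y)

print(selector.support_)  # True marks a selected feature
print(selector.ranking_)  # 1 marks selected features; larger ranks were eliminated earlier
```

Because RFE with a linear model ranks by coefficient size, it is sensitive to feature scale, which is one reason to rescale before this step.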

In [463]:
col = X_train.columns[rfe.support_]
col

Create the X_train dataframe with the RFE-selected variables.

In [464]:
X_train_rfe = X_train[col]

In [465]:
# Add a constant variable 
import statsmodels.api as sm  
X_train_rfe = sm.add_constant(X_train_rfe)

Run OLS to fit model

In [466]:
lm = sm.OLS(y_train,X_train_rfe).fit()

In [467]:
lm.summary()
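
statsmodels' `OLS` does not add an intercept on its own, which is why `sm.add_constant` is called first. The sketch below reproduces the idea in plain NumPy: a column of ones is prepended and the least-squares coefficients are solved for, the first one being the intercept. The data and true coefficients are made up.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=(100, 2))

# Hypothetical true model: intercept 1.5, slopes 2.0 and -0.5, plus small noise.
y = 1.5 + 2.0 * x[:, 0] - 0.5 * x[:, 1] + rng.normal(scale=0.05, size=100)

# Prepend a column of ones so the first coefficient is the intercept.
X = np.column_stack([np.ones(len(x)), x])

# Least-squares solution of X @ coef ~ y.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef)
```

Without the column of ones, the fitted line would be forced through the origin.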

In [468]:
X_train1= X_train_rfe.drop('Mon',axis=1)  # drop 'Mon' and refit; axis must be passed by keyword in recent pandas

In [469]:
X_train2= sm.add_constant(X_train1)
lm1 = sm.OLS(y_train,X_train2).fit() 

In [470]:
lm1.summary()

The model now contains only the required variables; their p-values are listed in the summary above.

In [471]:
X_train_new= X_train2.drop('const',axis=1)

### Variance Inflation Factor(VIF)

In [472]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

vif = pd.DataFrame()
X = X_train_new
vif['Features'] = X.columns
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

The VIF is less than 5 for all variables.
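
For reference, the VIF of a feature is 1 / (1 - R²), where R² comes from regressing that feature on all the others. The sketch below is a NumPy illustration of that definition, not the statsmodels implementation; the helper `vif` and the synthetic columns are assumptions. Unlike `variance_inflation_factor`, which does not add a constant itself, this helper includes an intercept in each auxiliary regression, so the two can give different numbers when no constant column is present.

```python
import numpy as np

def vif(X):
    """VIF of each column: 1 / (1 - R^2) from regressing it on the others."""
    X = np.asarray(X, dtype=float)
    out = []
    for j in range(X.shape[1]):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(len(X)), others])  # include an intercept
        fitted = A @ np.linalg.lstsq(A, target, rcond=None)[0]
        ss_res = np.sum((target - fitted) ** 2)
        ss_tot = np.sum((target - target.mean()) ** 2)
        out.append(ss_tot / ss_res)  # equals 1 / (1 - R^2)
    return np.array(out)

rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = x1 + 0.1 * rng.normal(size=500)  # almost collinear with x1
x3 = rng.normal(size=500)             # independent of the others

values = vif(np.column_stack([x1, x2, x3]))
print(values.round(2))
```

The two nearly collinear columns get a large VIF while the independent one stays close to 1.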

## 7: Residual Analysis

In [473]:
y_train_pred = lm1.predict(X_train2)

In [474]:
fig = plt.figure()
sns.histplot(y_train - y_train_pred, bins = 20, kde = True)  # histplot replaces the deprecated distplot
fig.suptitle('Error Terms', fontsize = 20)
plt.xlabel('Errors', fontsize = 18)     
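
Alongside the histogram, a couple of numeric checks on the residuals can be useful. The sketch below fits a one-feature model on synthetic data and verifies two properties that hold for any OLS fit with an intercept (residuals average to zero and are uncorrelated with the predictor), then computes the sample skewness as a rough symmetry check. All data and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=300)
y = 0.8 * x + rng.normal(scale=0.5, size=300)

# Fit y ~ intercept + slope * x by least squares.
X = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ coef

# With an intercept, OLS residuals average to zero by construction.
print(residuals.mean())

# They are also orthogonal to every predictor column.
print(residuals @ x)

# Sample skewness; values near 0 suggest a roughly symmetric error distribution.
centered = residuals - residuals.mean()
skewness = np.mean(centered ** 3) / np.mean(centered ** 2) ** 1.5
print(skewness)
```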

## 8: Predictions

In [475]:
num_vars=['temp','hum','windspeed','cnt']

df_test[num_vars]= scaler.transform(df_test[num_vars])

Divide into X_test and y_test.

In [476]:
y_test = df_test.pop('cnt')
X_test = df_test

Using our model to make predictions.

In [477]:

X_test_new = X_test[X_train_new.columns]

# Adding a constant variable 
X_test_new = sm.add_constant(X_test_new)

In [478]:
# Making predictions
y_test_pred = lm1.predict(X_test_new)
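
Because `cnt` was standardised together with the predictors, these predictions are in standardised units rather than bike counts. The sketch below shows the arithmetic for converting back, using made-up training counts. With the fitted `scaler`, the corresponding per-column statistics are available as `scaler.mean_` and `scaler.scale_`.

```python
import numpy as np

# Hypothetical training counts used to fit the scaling.
cnt_train = np.array([1000.0, 3000.0, 5000.0, 7000.0])
mu, sigma = cnt_train.mean(), cnt_train.std()

# Hypothetical model outputs in standardised units.
pred_scaled = np.array([-1.0, 0.0, 0.5])

# Undo the standardisation: multiply by the spread, then add the mean back.
pred_counts = pred_scaled * sigma + mu
print(pred_counts)
```

A standardised prediction of 0 corresponds to the average training-day count.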

## 9: Model Evaluation

Plot y_test against y_test_pred to understand the spread.

In [479]:

fig = plt.figure()
plt.scatter(y_test,y_test_pred)
fig.suptitle('Actual vs Predictions', fontsize=20)             
plt.xlabel('Actual', fontsize=18)                          
plt.ylabel('Predictions', fontsize=16)                          

In [480]:
from sklearn.metrics import r2_score
r2_score(y_test, y_test_pred)
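
The score reported by `r2_score` is 1 minus the ratio of the residual sum of squares to the total sum of squares. A minimal NumPy sketch of that formula follows; the helper name `r_squared` and the four sample values are illustrative.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
    return 1.0 - ss_res / ss_tot

score = r_squared([3, -0.5, 2, 7], [2.5, 0.0, 2, 8])
print(round(score, 4))  # 0.9486
```

A value of 1 means perfect predictions, and a value of 0 means the model does no better than always predicting the mean of `y_true`.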

## Conclusion

* The spring season seems to affect the business the most.
* Misty days have an effect on rentals.
* Temperature seems to work in favour of the business.
* Sunday sees a major spike in rentals.
* Renting is as usual on weekdays.