# Bike sharing case study

## Objective

Identify which variables significantly drive the demand for shared bikes in the American market, and how strongly each of them affects it.

In [1]:
#import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sn
import warnings

%matplotlib inline
warnings.filterwarnings('ignore')
plt.style.use('ggplot')

In [2]:
#load the dataset
inp0 = pd.read_csv('day.csv',header=0)
inp0.head()

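| | instant | dteday | season | yr | mnth | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | cnt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 01-01-2018 | 1 | 0 | 1 | 0 | 6 | 0 | 2 | 14.110847 | 18.18125 | 80.5833 | 10.749882 | 331 | 654 | 985 |
| 1 | 2 | 02-01-2018 | 1 | 0 | 1 | 0 | 0 | 0 | 2 | 14.902598 | 17.68695 | 69.6087 | 16.652113 | 131 | 670 | 801 |
| 2 | 3 | 03-01-2018 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 8.050924 | 9.47025 | 43.7273 | 16.636703 | 120 | 1229 | 1349 |
| 3 | 4 | 04-01-2018 | 1 | 0 | 1 | 0 | 2 | 1 | 1 | 8.2 | 10.6061 | 59.0435 | 10.739832 | 108 | 1454 | 1562 |
| 4 | 5 | 05-01-2018 | 1 | 0 | 1 | 0 | 3 | 1 | 1 | 9.305237 | 11.4635 | 43.6957 | 12.5223 | 82 | 1518 | 1600 |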


## I - Data understanding

**1. Number of rows & columns**

In [3]:
# number of rows & columns
inp0.shape

(730, 16)

In [4]:
#list all the columns
print(list(inp0.columns))

['instant', 'dteday', 'season', 'yr', 'mnth', 'holiday', 'weekday', 'workingday', 'weathersit', 'temp', 'atemp', 'hum', 'windspeed', 'casual', 'registered', 'cnt']


**2. Presence of Null or NaN values**

In [5]:
# null values in each column
print('Null values : ', [inp0[i].isnull().sum() for i in inp0.columns])

Null values :  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]


In [6]:
print('NaN values : ', [inp0[i].isna().sum() for i in inp0.columns])

NaN values :  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]


**Observation** : The dataset contains no null or NaN values. In pandas, `isnull()` and `isna()` are aliases, so the two checks above are equivalent.
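
A bare list of sixteen zeros is hard to match back to column names. As a minimal sketch, the same check can be kept as a labelled Series. The small `demo` frame below is a made-up stand-in for `inp0`, with one gap injected only to show what a non-zero count looks like.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for inp0 (illustrative values, one NaN injected on purpose)
demo = pd.DataFrame({
    "temp": [14.1, np.nan, 8.05],
    "hum": [80.6, 69.6, 43.7],
    "cnt": [985, 801, 1349],
})

# isna() flags missing cells; sum() counts them per column and keeps the labels
missing_per_column = demo.isna().sum()
print(missing_per_column)

# Single yes/no answer for the whole frame
has_missing = bool(demo.isna().any().any())
print("Any missing values:", has_missing)
```

On the real data, `inp0.isna().sum()` would be expected to show a zero next to every column name.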

**3. Identification of data quality issues within the dataset**

In [7]:
inp0.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 730 entries, 0 to 729
Data columns (total 16 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   instant     730 non-null    int64  
 1   dteday      730 non-null    object 
 2   season      730 non-null    int64  
 3   yr          730 non-null    int64  
 4   mnth        730 non-null    int64  
 5   holiday     730 non-null    int64  
 6   weekday     730 non-null    int64  
 7   workingday  730 non-null    int64  
 8   weathersit  730 non-null    int64  
 9   temp        730 non-null    float64
 10  atemp       730 non-null    float64
 11  hum         730 non-null    float64
 12  windspeed   730 non-null    float64
 13  casual      730 non-null    int64  
 14  registered  730 non-null    int64  
 15  cnt         730 non-null    int64  
dtypes: float64(4), int64(11), object(1)
memory usage: 91.4+ KB


**Observations**<br>
1. instant : drop the column; it only holds row index numbers, which are not relevant for prediction.<br>
2. dteday : derive a new variable as the number of days elapsed since 01-01-2018 (assumed to be the company's inception date), then drop the column, since the date is already broken out into the columns yr, mnth and weekday.<br>
3. season, weathersit : categorical values stored as integers; convert them into string labels, followed by encoding.<br>
4. yr, mnth, cnt : rename the columns.<br>
5. temp, atemp, hum, windspeed : round to 2 decimal places; rescale using MinMaxScaler.<br>
6. casual, registered : derive a new variable as the ratio of the two columns, then drop them; their sum is already contained in cnt, so all three would otherwise act as target variables.<br>
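
Observation 5 is not carried out in the cells that follow. A minimal sketch of what it could look like is given here. It applies the min-max formula directly in pandas, which is the same mapping onto the [0, 1] range that scikit-learn's `MinMaxScaler` produces with its default settings. The five rows are copied from the `inp0.head()` output above, and the names `sample`, `rounded` and `scaled` are assumptions for illustration.

```python
import pandas as pd

# First five rows of the numeric weather columns, copied from inp0.head()
sample = pd.DataFrame({
    "temp": [14.110847, 14.902598, 8.050924, 8.2, 9.305237],
    "atemp": [18.18125, 17.68695, 9.47025, 10.6061, 11.4635],
    "hum": [80.5833, 69.6087, 43.7273, 59.0435, 43.6957],
    "windspeed": [10.749882, 16.652113, 16.636703, 10.739832, 12.5223],
})

# Round to 2 decimal places, as listed in observation 5
rounded = sample.round(2)

# Min-max scaling: (x - min) / (max - min), applied column by column
scaled = (rounded - rounded.min()) / (rounded.max() - rounded.min())
print(scaled)
```

In a modelling workflow the minimum and maximum would normally be taken from the training split only and then reused on the test split, so that no information leaks from the test data.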

## II - Data Manipulation

**1. yr, mnth, cnt : rename the columns**

In [8]:
#inp0 - initial dataframe read from file
#inp1 - copy of inp0; data manipulation

inp1 = inp0.copy()
inp1.rename(columns={'yr':'year','mnth':'month','cnt':'count'},inplace=True)

**2. Derived metric : casual users as a percentage of registered users**

In [9]:
inp1['castoreg'] = inp1.apply(lambda x : round( ( (x['casual']*100) / x['registered'] ),2 ),axis=1  )
inp1.head()

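| | instant | dteday | season | year | month | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | count | castoreg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 01-01-2018 | 1 | 0 | 1 | 0 | 6 | 0 | 2 | 14.110847 | 18.18125 | 80.5833 | 10.749882 | 331 | 654 | 985 | 50.61 |
| 1 | 2 | 02-01-2018 | 1 | 0 | 1 | 0 | 0 | 0 | 2 | 14.902598 | 17.68695 | 69.6087 | 16.652113 | 131 | 670 | 801 | 19.55 |
| 2 | 3 | 03-01-2018 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 8.050924 | 9.47025 | 43.7273 | 16.636703 | 120 | 1229 | 1349 | 9.76 |
| 3 | 4 | 04-01-2018 | 1 | 0 | 1 | 0 | 2 | 1 | 1 | 8.2 | 10.6061 | 59.0435 | 10.739832 | 108 | 1454 | 1562 | 7.43 |
| 4 | 5 | 05-01-2018 | 1 | 0 | 1 | 0 | 3 | 1 | 1 | 9.305237 | 11.4635 | 43.6957 | 12.5223 | 82 | 1518 | 1600 | 5.4 |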
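
The row-wise `apply` in cell 9 can also be written as plain column arithmetic, which pandas evaluates without a Python-level loop over rows. The sketch below uses only the `casual` and `registered` values shown in the output above and reproduces the same `castoreg` figures; the frame name `sample` is an assumption for illustration.

```python
import pandas as pd

# casual and registered counts copied from the first five rows above
sample = pd.DataFrame({
    "casual": [331, 131, 120, 108, 82],
    "registered": [654, 670, 1229, 1454, 1518],
})

# Casual riders as a percentage of registered riders, rounded to 2 decimals
sample["castoreg"] = (sample["casual"] * 100 / sample["registered"]).round(2)
print(sample)
```

If a day ever had zero registered riders, this division would yield `inf`, so a guard would be needed in that case; the five rows shown here have no such day.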


**3. Derived metric : days elapsed between 01-01-2018 and dteday, assuming 01-01-2018 is when the company started operations**

In [10]:
inp1['days_old']=(pd.to_datetime(inp1['dteday'],format= '%d-%m-%Y')-pd.to_datetime('01-01-2018',format= '%d-%m-%Y')).dt.days
inp1.head()

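| | instant | dteday | season | year | month | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | count | castoreg | days_old |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 01-01-2018 | 1 | 0 | 1 | 0 | 6 | 0 | 2 | 14.110847 | 18.18125 | 80.5833 | 10.749882 | 331 | 654 | 985 | 50.61 | 0 |
| 1 | 2 | 02-01-2018 | 1 | 0 | 1 | 0 | 0 | 0 | 2 | 14.902598 | 17.68695 | 69.6087 | 16.652113 | 131 | 670 | 801 | 19.55 | 1 |
| 2 | 3 | 03-01-2018 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 8.050924 | 9.47025 | 43.7273 | 16.636703 | 120 | 1229 | 1349 | 9.76 | 2 |
| 3 | 4 | 04-01-2018 | 1 | 0 | 1 | 0 | 2 | 1 | 1 | 8.2 | 10.6061 | 59.0435 | 10.739832 | 108 | 1454 | 1562 | 7.43 | 3 |
| 4 | 5 | 05-01-2018 | 1 | 0 | 1 | 0 | 3 | 1 | 1 | 9.305237 | 11.4635 | 43.6957 | 12.5223 | 82 | 1518 | 1600 | 5.4 | 4 |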
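
The explicit `format='%d-%m-%Y'` matters because the dates are day-first: without it, a string such as `02-01-2018` could be read as 1 February instead of 2 January. A small sketch using the `dteday` values from the output above (the names `parsed` and `start` are assumptions for illustration):

```python
import pandas as pd

# dteday strings copied from the first five rows above (day-month-year)
dteday = pd.Series(["01-01-2018", "02-01-2018", "03-01-2018",
                    "04-01-2018", "05-01-2018"])

# Parse with an explicit day-first format so 02-01-2018 is 2 January
parsed = pd.to_datetime(dteday, format="%d-%m-%Y")

# Subtracting a Timestamp gives a timedelta Series; .dt.days extracts whole days
start = pd.Timestamp(year=2018, month=1, day=1)
days_old = (parsed - start).dt.days
print(days_old.tolist())  # [0, 1, 2, 3, 4]
```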


**4. season, weathersit : categorical values, so convert them into string labels followed by encoding**

In [11]:
inp1.season.value_counts()

3    188
2    184
1    180
4    178
Name: season, dtype: int64

In [12]:
inp1.season = inp1.season.map({1:'spring', 2:'summer', 3:'fall', 4:'winter'})
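
The heading also names `weathersit`, which is not mapped in the cell above. A sketch of the same pattern for that column follows. The short labels are assumptions, loosely paraphrased from the data dictionary commonly distributed with this dataset, and should be checked against the one supplied with `day.csv`.

```python
import pandas as pd

# weathersit codes copied from the first five rows of the dataset
weathersit = pd.Series([2, 2, 1, 1, 1], name="weathersit")

# Hypothetical labels (assumed; verify against the data dictionary)
weather_labels = {1: "clear", 2: "mist", 3: "light_precip", 4: "heavy_precip"}

weathersit_named = weathersit.map(weather_labels)
print(weathersit_named.tolist())

# map() returns NaN for any code absent from the dictionary, so count them
unmapped = int(weathersit_named.isna().sum())
print("Unmapped codes:", unmapped)
```

Checking for NaN after `map` is a cheap way to catch a code that the dictionary does not cover.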

**5. instant, dteday, casual, registered : drop the columns**

In [13]:
inp1.drop(['instant','dteday','casual','registered'],axis=1,inplace=True)

In [14]:
inp1.shape

(730, 14)

## III - Exploratory Data Analysis

## IV - Linear Regression model building

In [15]:
# dummy variables for the season feature (weathersit is not encoded in this cell)
status = pd.get_dummies(inp1.season)
status.head()

| | fall | spring | summer | winter |
|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 0 |
| 1 | 0 | 1 | 0 | 0 |
| 2 | 0 | 1 | 0 | 0 |
| 3 | 0 | 1 | 0 | 0 |
| 4 | 0 | 1 | 0 | 0 |
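
In a linear model with an intercept, the four season dummies always sum to 1, so one of them is redundant. A common remedy, sketched here as an assumption about the next step rather than as the author's code, is to pass `drop_first=True` and join the result back onto the frame. The `demo` frame and its `count` values are a small made-up stand-in for `inp1`.

```python
import pandas as pd

# Hypothetical stand-in for inp1 (illustrative values only)
demo = pd.DataFrame({
    "season": ["spring", "summer", "fall", "winter", "spring"],
    "count": [985, 4000, 5200, 3100, 801],
})

# drop_first=True drops the alphabetically first level ("fall"),
# which then acts as the baseline category; dtype=int gives 0/1 columns
season_dummies = pd.get_dummies(demo["season"], drop_first=True, dtype=int)

# Attach the dummies and remove the original string column
demo_encoded = pd.concat([demo.drop(columns="season"), season_dummies], axis=1)
print(demo_encoded)
```

With this encoding, a row whose three dummy columns are all 0 represents the baseline season, and each remaining coefficient is read relative to it.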


In [16]:
#from sklearn.metrics import r2_score
#r2_score(y_test, y_pred)
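
The commented-out lines point at `r2_score` as the intended evaluation metric. As a sketch of what that score measures, the coefficient of determination, 1 - SS_res / SS_tot, can be computed by hand with NumPy. The function name `r2_manual` is hypothetical, and the input array stands in for `y_test` and `y_pred`, which are not defined anywhere above.

```python
import numpy as np

def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation
    return 1.0 - ss_res / ss_tot

# First five cnt values from the dataset, used only as toy input (not model output)
y_true = [985, 801, 1349, 1562, 1600]

print(r2_manual(y_true, y_true))                  # perfect prediction
print(r2_manual(y_true, [np.mean(y_true)] * 5))   # always predicting the mean
```

A perfect prediction scores 1, always predicting the mean scores 0, and a model that does worse than the mean scores below 0.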