# Module 1 Peer Reviewed Assignment

***

## Project Description

In this assignment you will be required to create two regression models using bike sharing data, interpret the diagnostics associated with those models, and create some predictions. You will first need to download the data, familiarize yourself with the data, and perform some data preparation tasks.

## Data Dictionary

| Field      | Description                                                                                   |
|------------|-----------------------------------------------------------------------------------------------|
| instant    | Record index                                                                                  |
| dteday     | Date                                                                                          |
| season     | Season (1: spring, 2: summer, 3: fall, 4: winter)                                             |
| yr         | Year (0: 2011, 1: 2012)                                                                       |
| mnth       | Month (1 to 12)                                                                               |
| hr         | Hour (0 to 23); not present in day.csv                                                        |
| holiday    | Whether the day is a holiday or not (extracted from http://dchr.dc.gov/page/holiday-schedule) |
| weekday    | Day of the week                                                                               |
| workingday | 1 if the day is neither a weekend nor a holiday, otherwise 0                                  |
| weathersit | 1: Clear, Few clouds, Partly cloudy<br>2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist<br>3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds<br>4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog |
| temp       | Normalized temperature in Celsius. The values are divided by 41 (max)                         |
| atemp      | Normalized feeling temperature in Celsius. The values are divided by 50 (max)                 |
| hum        | Normalized humidity. The values are divided by 100 (max)                                      |
| windspeed  | Normalized wind speed. The values are divided by 67 (max)                                     |
| casual     | Count of casual users                                                                         |
| registered | Count of registered users                                                                     |
| cnt        | Count of total rental bikes, including both casual and registered                             |
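
As a small illustration of the normalization described in the table, a scaled value can be converted back to its original scale by multiplying by the stated maximum. This is only a sketch: the helper name `denormalize` is hypothetical, and the sample inputs 0.34 and 0.81 are the rounded temp and hum values of the first row of day.csv as displayed later in this notebook. The divisors come from the data dictionary.

```python
# Divisors taken from the data dictionary (value = original / divisor).
SCALE = {"temp": 41, "atemp": 50, "hum": 100, "windspeed": 67}

def denormalize(column, value):
    """Convert a normalized value back to its original scale (hypothetical helper)."""
    return value * SCALE[column]

# Rounded values from the first displayed row of day.csv.
print(denormalize("temp", 0.34))  # temperature in degrees Celsius
print(denormalize("hum", 0.81))   # humidity on a 0 to 100 scale
```

Keeping the divisors in one dictionary makes it easy to report model results in physical units later, for example when interpreting a coefficient on temp.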

## Business Task

The folder contains two data files, hour.csv and day.csv. You will use day.csv for this assignment. The readme file in the folder describes the data; you are encouraged to read it so that you can interpret the analytic results correctly.

There are 16 columns in the dataset. You will need to use columns: dteday, temp, and cnt. “cnt” is the outcome variable, or dependent variable.

## Import Libraries

In [1]:
import numpy as np
from numpy import count_nonzero
from numpy import median
from numpy import mean
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import random

import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.formula.api import ols

import scipy.stats
from collections import Counter

import sklearn
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder, OneHotEncoder
from sklearn.linear_model import LinearRegression, LogisticRegression, ElasticNet, Lasso, Ridge
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import accuracy_score, auc, classification_report, confusion_matrix, f1_score
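# plot_confusion_matrix and plot_roc_curve were removed in scikit-learn 1.2;
# the Display classes below are their replacements.
from sklearn.metrics import ConfusionMatrixDisplay, RocCurveDisplay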

%matplotlib inline
#sets the default autosave frequency in seconds
%autosave 60 
sns.set_style('dark')
sns.set(font_scale=1.2)

plt.rc('axes', titlesize=9)
plt.rc('axes', labelsize=14)
plt.rc('xtick', labelsize=12)
plt.rc('ytick', labelsize=12)

import warnings
warnings.filterwarnings('ignore')

pd.set_option('display.max_columns',None)
#pd.set_option('display.max_rows',None)
pd.set_option('display.width', 1000)
pd.set_option('display.float_format','{:.2f}'.format)

random.seed(0)
np.random.seed(0)
np.set_printoptions(suppress=True)

Autosaving every 60 seconds


## Exploratory Data Analysis

In [2]:
df = pd.read_csv("day.csv", parse_dates=["dteday"])

In [3]:
df

Unnamed: 0,instant,dteday,season,yr,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
0,1,2011-01-01,1,0,1,0,6,0,2,0.34,0.36,0.81,0.16,331,654,985
1,2,2011-01-02,1,0,1,0,0,0,2,0.36,0.35,0.70,0.25,131,670,801
2,3,2011-01-03,1,0,1,0,1,1,1,0.20,0.19,0.44,0.25,120,1229,1349
3,4,2011-01-04,1,0,1,0,2,1,1,0.20,0.21,0.59,0.16,108,1454,1562
4,5,2011-01-05,1,0,1,0,3,1,1,0.23,0.23,0.44,0.19,82,1518,1600
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
726,727,2012-12-27,1,1,12,0,4,1,2,0.25,0.23,0.65,0.35,247,1867,2114
727,728,2012-12-28,1,1,12,0,5,1,2,0.25,0.26,0.59,0.16,644,2451,3095
728,729,2012-12-29,1,1,12,0,6,0,2,0.25,0.24,0.75,0.12,159,1182,1341
729,730,2012-12-30,1,1,12,0,0,0,1,0.26,0.23,0.48,0.35,364,1432,1796


Presenting data – Create an .rmd file in RStudio. Use a code chunk to report a summary of the data.

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 731 entries, 0 to 730
Data columns (total 16 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   instant     731 non-null    int64         
 1   dteday      731 non-null    datetime64[ns]
 2   season      731 non-null    int64         
 3   yr          731 non-null    int64         
 4   mnth        731 non-null    int64         
 5   holiday     731 non-null    int64         
 6   weekday     731 non-null    int64         
 7   workingday  731 non-null    int64         
 8   weathersit  731 non-null    int64         
 9   temp        731 non-null    float64       
 10  atemp       731 non-null    float64       
 11  hum         731 non-null    float64       
 12  windspeed   731 non-null    float64       
 13  casual      731 non-null    int64         
 14  registered  731 non-null    int64         
 15  cnt         731 non-null    int64         
dtypes: datetime64[ns](1), floa

In [5]:
df.describe()

Unnamed: 0,instant,season,yr,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
count,731.0,731.0,731.0,731.0,731.0,731.0,731.0,731.0,731.0,731.0,731.0,731.0,731.0,731.0,731.0
mean,366.0,2.5,0.5,6.52,0.03,3.0,0.68,1.4,0.5,0.47,0.63,0.19,848.18,3656.17,4504.35
std,211.17,1.11,0.5,3.45,0.17,2.0,0.47,0.54,0.18,0.16,0.14,0.08,686.62,1560.26,1937.21
min,1.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.06,0.08,0.0,0.02,2.0,20.0,22.0
25%,183.5,2.0,0.0,4.0,0.0,1.0,0.0,1.0,0.34,0.34,0.52,0.13,315.5,2497.0,3152.0
50%,366.0,3.0,1.0,7.0,0.0,3.0,1.0,1.0,0.5,0.49,0.63,0.18,713.0,3662.0,4548.0
75%,548.5,3.0,1.0,10.0,0.0,5.0,1.0,2.0,0.66,0.61,0.73,0.23,1096.0,4776.5,5956.0
max,731.0,4.0,1.0,12.0,1.0,6.0,1.0,3.0,0.86,0.84,0.97,0.51,3410.0,6946.0,8714.0


In [6]:
df.columns

Index(['instant', 'dteday', 'season', 'yr', 'mnth', 'holiday', 'weekday', 'workingday', 'weathersit', 'temp', 'atemp', 'hum', 'windspeed', 'casual', 'registered', 'cnt'], dtype='object')

Preparing data - Extract the month names from the dteday column using lubridate package and save them in a new column month_name, which has a chr data type.

In [7]:
df["month_name"] = df["dteday"].dt.month

In [8]:
df.head()

Unnamed: 0,instant,dteday,season,yr,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt,month_name
0,1,2011-01-01,1,0,1,0,6,0,2,0.34,0.36,0.81,0.16,331,654,985,1
1,2,2011-01-02,1,0,1,0,0,0,2,0.36,0.35,0.7,0.25,131,670,801,1
2,3,2011-01-03,1,0,1,0,1,1,1,0.2,0.19,0.44,0.25,120,1229,1349,1
3,4,2011-01-04,1,0,1,0,2,1,1,0.2,0.21,0.59,0.16,108,1454,1562,1
4,5,2011-01-05,1,0,1,0,3,1,1,0.23,0.23,0.44,0.19,82,1518,1600,1
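
Note that the cell above stores month numbers (1 to 12) in month_name, and the rest of this notebook works with those numbers. If actual month names are wanted, as the task wording suggests, pandas provides `Series.dt.month_name()`. The following is a minimal sketch on a small hypothetical frame (the three dates are illustrative and are not read from day.csv). Listing the categories in calendar order is an assumption made here so that January stays the first level rather than the alphabetically first month.

```python
import pandas as pd

# Hypothetical three-row frame standing in for day.csv.
demo = pd.DataFrame(
    {"dteday": pd.to_datetime(["2011-01-01", "2011-06-15", "2012-12-30"])}
)

# Month names in calendar order: January, February, ..., December.
month_order = list(pd.date_range("2011-01-01", periods=12, freq="MS").month_name())

# Extract the names, then store them as a categorical with calendar ordering.
demo["month_name"] = pd.Categorical(
    demo["dteday"].dt.month_name(), categories=month_order
)
print(demo)
print(demo["month_name"].cat.categories[0])
```

With plain strings, a formula interface would typically sort the levels alphabetically, which would make April the reference month instead of January.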


In [9]:
df["month_name"] = df["month_name"].astype("category")

In [10]:
df2 = df[['dteday','month_name', 'temp','cnt']]

In [11]:
df2.head()

Unnamed: 0,dteday,month_name,temp,cnt
0,2011-01-01,1,0.34,985
1,2011-01-02,1,0.36,801
2,2011-01-03,1,0.2,1349
3,2011-01-04,1,0.2,1562
4,2011-01-05,1,0.23,1600


Running regression models – You will run one simple linear regression model, Model1, and one multiple regression model, Model2, as described below. 

Model1: 40 points

a)	Use a code chunk to run a simple linear regression model where the dependent variable is cnt and the independent variable is month_name and save the model as Model1. 10 points

b)	Use a code chunk to report the summary for Model1. Below the code chunk, use regular text to comment on the R-squared. 10 points

c)	From the summary of Model1, identify which month is set as a reference. Use regular text (outside of a code chunk) to report the reference month’s predicted cnt. 10 points (2 points for identifying the reference month and 8 points for reporting the correct prediction)

d)	With either a code chunk or regular text, use the coefficient estimates from Model1 to report the predicted cnt for the months of January and June. 10 points (5 points for each correct prediction)

### Linear Regression (statsmodels)

In [12]:
df2.columns

Index(['dteday', 'month_name', 'temp', 'cnt'], dtype='object')

In [13]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 731 entries, 0 to 730
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   dteday      731 non-null    datetime64[ns]
 1   month_name  731 non-null    category      
 2   temp        731 non-null    float64       
 3   cnt         731 non-null    int64         
dtypes: category(1), datetime64[ns](1), float64(1), int64(1)
memory usage: 18.4 KB


In [14]:
X = df2[["month_name"]]

In [15]:
X

Unnamed: 0,month_name
0,1
1,1
2,1
3,1
4,1
...,...
726,12
727,12
728,12
729,12


In [16]:
model1 = smf.ols(formula='cnt ~ C(month_name)', data=df2).fit()

In [17]:
model1.summary()

0,1,2,3
Dep. Variable:,cnt,R-squared:,0.391
Model:,OLS,Adj. R-squared:,0.381
Method:,Least Squares,F-statistic:,41.9
Date:,"Sun, 26 Dec 2021",Prob (F-statistic):,4.25e-70
Time:,08:50:15,Log-Likelihood:,-6388.6
No. Observations:,731,AIC:,12800.0
Df Residuals:,719,BIC:,12860.0
Df Model:,11,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,2176.3387,193.514,11.246,0.000,1796.419,2556.259
C(month_name)[T.2],478.9595,279.607,1.713,0.087,-69.985,1027.904
C(month_name)[T.3],1515.9194,273.670,5.539,0.000,978.631,2053.208
C(month_name)[T.4],2308.5613,275.941,8.366,0.000,1766.814,2850.308
C(month_name)[T.5],3173.4355,273.670,11.596,0.000,2636.147,3710.724
C(month_name)[T.6],3596.0280,275.941,13.032,0.000,3054.281,4137.775
C(month_name)[T.7],3387.3387,273.670,12.377,0.000,2850.051,3924.627
C(month_name)[T.8],3488.0806,273.670,12.746,0.000,2950.792,4025.369
C(month_name)[T.9],3590.1780,275.941,13.011,0.000,3048.431,4131.925

0,1,2,3
Omnibus:,11.507,Durbin-Watson:,0.499
Prob(Omnibus):,0.003,Jarque-Bera (JB):,6.947
Skew:,-0.002,Prob(JB):,0.031
Kurtosis:,2.522,Cond. No.,12.8


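The R-squared is 0.391 (adjusted 0.381): month alone explains about 39% of the variance in cnt and leaves roughly 61% unexplained. The fit is modest, although the F-statistic (41.9, p = 4.25e-70) shows that the month effects are jointly significant.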
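
To see what an R-squared of this kind measures: a regression on a single categorical predictor fits one mean per group, and R-squared is the share of the total variance accounted for by those group means. The sketch below uses synthetic data (three made-up groups, unrelated to day.csv) and NumPy least squares instead of statsmodels, so the numbers are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Three synthetic groups of 50 observations with different true means.
group = np.repeat([0, 1, 2], 50)
y = np.array([2000.0, 4000.0, 5500.0])[group] + rng.normal(0, 1500, size=group.size)

# Treatment coding: intercept plus dummies for groups 1 and 2 (group 0 is the reference).
X = np.column_stack([np.ones(group.size), group == 1, group == 2]).astype(float)
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta

# Each fitted value is simply the mean of the observation's group.
group_means = np.array([y[group == g].mean() for g in range(3)])

# R-squared = 1 - residual sum of squares / total sum of squares.
r_squared = 1 - np.sum((y - fitted) ** 2) / np.sum((y - y.mean()) ** 2)

print("intercept equals reference-group mean:", np.isclose(beta[0], group_means[0]))
print("R-squared:", round(r_squared, 3))
```

The same logic applies to Model1: the intercept is the January mean, and each month coefficient is that month's mean minus the January mean.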

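Month 1 (January) is the reference level: it has no coefficient of its own in the summary, and every other month's coefficient is a difference from January.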

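The reference month's predicted cnt is the intercept, 2176.34. The cells below confirm this for January and report the predicted cnt for June, 5772.37, which is the intercept plus the June coefficient.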

In [18]:
test_jan = X[X["month_name"] == 1]

In [19]:
test_jan.head()

Unnamed: 0,month_name
0,1
1,1
2,1
3,1
4,1


In [20]:
model1.predict(exog=test_jan).head()

0   2176.34
1   2176.34
2   2176.34
3   2176.34
4   2176.34
dtype: float64

In [21]:
test_june = X[X["month_name"] == 6]

In [22]:
test_june.head()

Unnamed: 0,month_name
151,6
152,6
153,6
154,6
155,6


In [23]:
model1.predict(exog=test_june).head()

151   5772.37
152   5772.37
153   5772.37
154   5772.37
155   5772.37
dtype: float64
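
The two predictions above can also be reproduced by hand from the coefficient table: the reference month is the intercept alone, and any other month is the intercept plus that month's coefficient. A minimal sketch, with the coefficients copied from the Model1 summary as displayed (months 10 to 12 are not shown there, so the hypothetical helper `predict_cnt` deliberately raises an error for them instead of guessing):

```python
# Coefficients copied from the Model1 summary as displayed above.
intercept = 2176.3387  # January, the reference month
month_coef = {
    2: 478.9595, 3: 1515.9194, 4: 2308.5613, 5: 3173.4355,
    6: 3596.0280, 7: 3387.3387, 8: 3488.0806, 9: 3590.1780,
}

def predict_cnt(month):
    """Predicted cnt for a month number (hypothetical helper)."""
    if month == 1:
        return intercept              # reference month: intercept only
    return intercept + month_coef[month]  # KeyError for months not displayed

print(round(predict_cnt(1), 2))  # 2176.34
print(round(predict_cnt(6), 2))  # 5772.37
```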

Model2: 40 points

a)	Use a code chunk to run a multiple linear regression model where the dependent variable is cnt and the independent variables are temp and month_name. Save the model as Model2. 10 points

b)	Use a code chunk to report the summary for Model2. Below the code chunk use regular text to comment on the R-squared. Please explain why the R-squared is different from the two simple regression models. 10 points (2 points for the summary, 8 points for the explanation)

c)	Compare the coefficient estimates for the month_nameJan variable in Model1 and Model2. With regular text explain why the coefficient estimates are different. 10 points (3 points for the comparison, 7 points for the explanation)

d)	With either a code chunk or regular text, use the coefficient estimates from Model2 to report the predicted cnt for the month of January when the temperature is .25. 10 points

In [24]:
model2 = smf.ols(formula='cnt ~ temp + C(month_name)', data=df2).fit()

In [25]:
model2.summary()

0,1,2,3
Dep. Variable:,cnt,R-squared:,0.447
Model:,OLS,Adj. R-squared:,0.438
Method:,Least Squares,F-statistic:,48.35
Date:,"Sun, 26 Dec 2021",Prob (F-statistic):,4.26e-84
Time:,08:50:15,Log-Likelihood:,-6353.2
No. Observations:,731,AIC:,12730.0
Df Residuals:,718,BIC:,12790.0
Df Model:,12,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,702.0767,252.546,2.780,0.006,206.261,1197.893
C(month_name)[T.2],87.5015,270.471,0.324,0.746,-443.507,618.510
C(month_name)[T.3],555.1158,284.086,1.954,0.051,-2.623,1112.855
C(month_name)[T.4],852.3127,313.412,2.719,0.007,236.999,1467.627
C(month_name)[T.5],939.0435,369.315,2.543,0.011,213.977,1664.110
C(month_name)[T.6],804.8452,419.310,1.919,0.055,-18.376,1628.066
C(month_name)[T.7],151.1336,459.776,0.329,0.742,-751.532,1053.800
C(month_name)[T.8],544.2343,432.051,1.260,0.208,-304.000,1392.469
C(month_name)[T.9],1220.5672,382.162,3.194,0.001,470.279,1970.856

0,1,2,3
Omnibus:,18.533,Durbin-Watson:,0.512
Prob(Omnibus):,0.0,Jarque-Bera (JB):,9.819
Skew:,-0.04,Prob(JB):,0.00738
Kurtosis:,2.438,Cond. No.,24.9


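The R-squared has increased from 0.391 in Model1 to 0.447 in Model2 (adjusted R-squared from 0.381 to 0.438). Adding temp lets the model explain day-to-day variation in cnt within a month, which the month indicators alone cannot capture, so a larger share of the variance is explained.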
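
Because R-squared never decreases when a predictor is added, the adjusted R-squared is the fairer comparison between the two models. As a sketch, it can be recomputed from the figures in the two summaries (R-squared, number of observations, residual degrees of freedom). The function name `adjusted_r2` is an assumption for illustration, and the results agree with the reported values only up to rounding, since the inputs are themselves rounded to three decimals.

```python
def adjusted_r2(r2, n_obs, df_resid):
    """Adjusted R-squared: 1 - (1 - R2) * (n - 1) / residual degrees of freedom."""
    return 1 - (1 - r2) * (n_obs - 1) / df_resid

# Inputs taken from the Model1 and Model2 summaries.
adj_model1 = adjusted_r2(0.391, n_obs=731, df_resid=719)
adj_model2 = adjusted_r2(0.447, n_obs=731, df_resid=718)

print(round(adj_model1, 3), round(adj_model2, 3))
```

The adjusted value also rises from Model1 to Model2, so the improvement is not just an artifact of having one more parameter.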

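January is the reference month, so it has no separate month_nameJan coefficient; its estimate is the intercept, which falls from 2176.34 in Model1 to 702.08 in Model2. In Model1 the intercept is the predicted cnt for January; in Model2 it is the predicted cnt for January when temp is 0. The other month coefficients shrink as well (June, for example, from 3596.03 to 804.85), because each one is now the difference from January with temp held constant.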

Month and temp are correlated, so part of the effect that Model1 attributed to month is attributed to temp in Model2. All coefficients are estimated jointly, so adding a predictor that is correlated with those already in the model changes their estimates.
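
The effect of adding a correlated predictor can be illustrated on synthetic data. In the sketch below (all names and numbers are made up and unrelated to day.csv), the outcome depends only on temp, but temp is higher in "summer". A model with the summer dummy alone credits summer with a large effect; once temp is included, the summer coefficient collapses toward zero.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400

# Synthetic setup: temp is higher in summer, and cnt depends on temp only.
summer = rng.integers(0, 2, n).astype(float)
temp = 0.3 + 0.4 * summer + rng.normal(0, 0.05, n)
cnt = 1000 + 6000 * temp + rng.normal(0, 200, n)

# Model A: cnt ~ summer. Model B: cnt ~ summer + temp.
X_a = np.column_stack([np.ones(n), summer])
X_b = np.column_stack([np.ones(n), summer, temp])
beta_a, *_ = np.linalg.lstsq(X_a, cnt, rcond=None)
beta_b, *_ = np.linalg.lstsq(X_b, cnt, rcond=None)

print("summer coefficient without temp:", round(beta_a[1], 1))
print("summer coefficient with temp:   ", round(beta_b[1], 1))
print("temp coefficient:               ", round(beta_b[2], 1))
```

This mirrors the pattern in Model1 and Model2, where the month coefficients drop sharply once temp enters the model.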

Report the predicted cnt for the month of January when the temperature is .25

In [26]:
X2 = pd.Series({'temp': 0.25, 'month_name': 1})

In [27]:
X2

temp         0.25
month_name   1.00
dtype: float64

In [28]:
model2.predict(X2)

0   2260.86
dtype: float64
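
By hand, this prediction is the intercept plus the temp coefficient times 0.25; January is the reference month, so no month term is added. The temp row does not appear in the Model2 summary as displayed above, so the sketch below backs out the coefficient implied by the reported prediction rather than quoting it. The result is approximate because 2260.86 is rounded.

```python
intercept = 702.0767  # Model2 intercept from the summary
predicted = 2260.86   # model2.predict output for January at temp = 0.25
temp = 0.25

# predicted = intercept + temp_coef * temp, solved for temp_coef.
implied_temp_coef = (predicted - intercept) / temp
print(round(implied_temp_coef, 2))

# Normalized temp of 0.25 expressed in degrees Celsius (temp is divided by 41).
print(temp * 41)
```

Read this way, a normalized temp of 0.25 corresponds to about 10 degrees Celsius, and the implied coefficient is the change in predicted cnt per unit of normalized temp.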

#### Python code done by DL