# Testing Knowledge 

In this notebook we will build several regression models on the bike-sharing dataset.

First, let's import the dataset.

In [1]:
import pandas as pd
df=pd.read_csv("data/day.csv")
df.head()

Unnamed: 0,instant,dteday,season,yr,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
0,1,2011-01-01,1,0,1,0,6,0,2,0.344167,0.363625,0.805833,0.160446,331,654,985
1,2,2011-01-02,1,0,1,0,0,0,2,0.363478,0.353739,0.696087,0.248539,131,670,801
2,3,2011-01-03,1,0,1,0,1,1,1,0.196364,0.189405,0.437273,0.248309,120,1229,1349
3,4,2011-01-04,1,0,1,0,2,1,1,0.2,0.212122,0.590435,0.160296,108,1454,1562
4,5,2011-01-05,1,0,1,0,3,1,1,0.226957,0.22927,0.436957,0.1869,82,1518,1600


Now we can drop the unused variables.

In [2]:
# .copy() so that new columns can be added later without a SettingWithCopyWarning
df=df[["season","temp","hum","windspeed","cnt"]].copy()
df.head()

Unnamed: 0,season,temp,hum,windspeed,cnt
0,1,0.344167,0.805833,0.160446,985
1,1,0.363478,0.696087,0.248539,801
2,1,0.196364,0.437273,0.248309,1349
3,1,0.2,0.590435,0.160296,1562
4,1,0.226957,0.436957,0.1869,1600


#### 1 Let's find which predictor drives bike usage

In [5]:
import statsmodels.formula.api as smf
formula=["cnt ~ temp","cnt ~ hum","cnt ~ windspeed","cnt ~ season"]
for i in formula:
    result=smf.ols(formula=i,data=df).fit()
    print("\n")
    print("{} : R squared= {} ".format(i,result.rsquared))
    
    



cnt ~ temp : R squared= 0.3937487313729242 


cnt ~ hum : R squared= 0.010132146131519248 


cnt ~ windspeed : R squared= 0.05501135581553118 


cnt ~ season : R squared= 0.16491751116278974 


Of the four models, cnt ~ temp explains the most variation in the outcome cnt (R squared ≈ 0.39).
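To make the R squared figure less of a black box, here is a minimal sketch that computes it by hand for a one-predictor least-squares line, as 1 minus the residual sum of squares over the total sum of squares. It uses only the five rows printed by `df.head()` above, so its value will differ from the full-dataset result; the helper name `r_squared` is ours, not part of statsmodels.

```python
# First five rows of the dataset, copied from the df.head() output above.
temp = [0.344167, 0.363478, 0.196364, 0.2, 0.226957]
cnt = [985, 801, 1349, 1562, 1600]


def r_squared(x, y):
    """R squared of the least-squares line y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Closed-form least-squares estimates for a single predictor.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Share of the variation in y that the fitted line does NOT explain.
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


r2_head = r_squared(temp, cnt)
print("R squared on the five sample rows: {:.3f}".format(r2_head))
```

Applied to the full `temp` and `cnt` columns, the same function should agree with `result.rsquared` for cnt ~ temp up to floating-point error.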


#### 2 Looking at the R squared of each univariate model for each season

In [13]:
season={1:"spring",2:"summer",3:"fall",4:"winter"}
variable=["temp","hum","windspeed"]
for i in range(1,5):
    print("---"*10)
    print("\n {}".format(season[i]))
    for predictor in variable:
        formula="cnt ~ {}".format(predictor)
        result=smf.ols(formula=formula,data=df[df["season"]==i]).fit()
        print("\n")
        print("cnt ~ {} : R squared= {} ".format(predictor,result.rsquared))
    
    

------------------------------

 spring


cnt ~ temp : R squared= 0.44783156923271294 


cnt ~ hum : R squared= 0.0015466046229688502 


cnt ~ windspeed : R squared= 0.007738534637298122 
------------------------------

 summer


cnt ~ temp : R squared= 0.22741853032512516 


cnt ~ hum : R squared= 0.10235334043914202 


cnt ~ windspeed : R squared= 0.05259634252037981 
------------------------------

 fall


cnt ~ temp : R squared= 0.0010770708700372777 


cnt ~ hum : R squared= 0.1025613557566093 


cnt ~ windspeed : R squared= 0.04007661515159755 
------------------------------

 winter


cnt ~ temp : R squared= 0.15813792312746722 


cnt ~ hum : R squared= 0.08296421842555601 


cnt ~ windspeed : R squared= 0.022203851487793136 


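Interpretation

* temp has the most influence in spring (R squared ≈ 0.45) and almost none in fall (R squared ≈ 0.001).
* Humidity is the most important factor in fall (R squared ≈ 0.10).
* windspeed has very little influence on bike usage (R squared of at most about 0.05 in any season).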



#### 3 Looking at the p-value of each model for each season


In [18]:
for i in range(1,5):
    print("---"*10)
    print("\n {}".format(season[i]))
    for predictor in variable:
        formula="cnt ~ {}".format(predictor)
        result=smf.ols(formula=formula,data=df[df["season"]==i]).fit()
        print("\n")
        print("cnt ~ {} :  pvalue[{}]= {:.3f} ".format(predictor,predictor,result.pvalues[predictor]))

------------------------------

 spring


cnt ~ temp :  pvalue[temp]= 0.000 


cnt ~ hum :  pvalue[hum]= 0.599 


cnt ~ windspeed :  pvalue[windspeed]= 0.239 
------------------------------

 summer


cnt ~ temp :  pvalue[temp]= 0.000 


cnt ~ hum :  pvalue[hum]= 0.000 


cnt ~ windspeed :  pvalue[windspeed]= 0.002 
------------------------------

 fall


cnt ~ temp :  pvalue[temp]= 0.655 


cnt ~ hum :  pvalue[hum]= 0.000 


cnt ~ windspeed :  pvalue[windspeed]= 0.006 
------------------------------

 winter


cnt ~ temp :  pvalue[temp]= 0.000 


cnt ~ hum :  pvalue[hum]= 0.000 


cnt ~ windspeed :  pvalue[windspeed]= 0.047 


* temp is not significant in fall (p = 0.655).
* hum (p = 0.599) and windspeed (p = 0.239) are not significant in spring.
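Since twelve tests are run at once, it can help to check which conclusions survive a stricter threshold. The sketch below copies the p-values printed above (rounded to three decimals, so 0.000 means below 0.0005) and applies a 0.05 level and a Bonferroni-corrected level; both the 0.05 level and the choice of Bonferroni are our assumptions, not something the notebook prescribes.

```python
# p-values copied from the output above (rounded to three decimals).
pvalues = {
    "spring": {"temp": 0.000, "hum": 0.599, "windspeed": 0.239},
    "summer": {"temp": 0.000, "hum": 0.000, "windspeed": 0.002},
    "fall": {"temp": 0.655, "hum": 0.000, "windspeed": 0.006},
    "winter": {"temp": 0.000, "hum": 0.000, "windspeed": 0.047},
}

ALPHA = 0.05  # assumed significance level
n_tests = sum(len(row) for row in pvalues.values())
bonferroni_alpha = ALPHA / n_tests  # stricter per-test level for 12 tests

# Predictors that fail the plain 0.05 threshold.
not_significant = [
    (season_name, predictor)
    for season_name, row in pvalues.items()
    for predictor, p in row.items()
    if p >= ALPHA
]

# Predictors that pass 0.05 but fail the Bonferroni-corrected threshold.
lost_under_bonferroni = [
    (season_name, predictor)
    for season_name, row in pvalues.items()
    for predictor, p in row.items()
    if bonferroni_alpha <= p < ALPHA
]

print("not significant at 0.05:", not_significant)
print("significant only without correction:", lost_under_bonferroni)
```

With these rounded values, the windspeed effect in fall and winter is the borderline case: it passes at 0.05 but not after the correction.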

#### 4 Compare two models

In [27]:
predictors=["temp","temp + windspeed + hum"]
for predictor in predictors:
    formula="cnt ~ {}".format(predictor)
    result=smf.ols(formula=formula,data=df).fit()
    print("\n")
    print("cnt ~ {} :  Rsquared= {:.3f}  ".format(predictor,result.rsquared))
    print("cnt ~ {} :  log likelihood= {:.3f} ".format(predictor,result.llf))
    print("cnt ~ {} :  Adj R squared= {:.3f} ".format(predictor,result.rsquared_adj))




cnt ~ temp :  Rsquared= 0.394  
cnt ~ temp :  log likelihood= -6386.768 
cnt ~ temp :  Adj R squared= 0.393 


cnt ~ temp + windspeed + hum :  Rsquared= 0.461  
cnt ~ temp + windspeed + hum :  log likelihood= -6343.864 
cnt ~ temp + windspeed + hum :  Adj R squared= 0.459 


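The model cnt ~ temp + windspeed + hum is better than cnt ~ temp because:
* R squared and adjusted R squared both increased (0.394 to 0.461 and 0.393 to 0.459).
* The log likelihood increased (-6386.768 to -6343.864).
* The confidence interval for the temperature coefficient is tighter (not printed above; `result.conf_int()` displays it).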
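R squared can never decrease when predictors are added, so the adjusted R squared and an information criterion are the fairer comparison. The sketch below recomputes both from the printed numbers. It assumes the standard day.csv with 731 rows (the row count is not printed in this notebook) and counts the intercept plus the slopes as parameters for the AIC; the helper names are ours.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R squared: penalises R squared for the number of predictors."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


def aic(log_likelihood, n_params):
    """Akaike information criterion; lower is better."""
    return 2 * n_params - 2 * log_likelihood


N_OBS = 731  # assumption: number of rows in day.csv, not shown above

# Values copied from the output above.
adj_small = adjusted_r2(0.394, N_OBS, 1)  # cnt ~ temp
adj_big = adjusted_r2(0.461, N_OBS, 3)    # cnt ~ temp + windspeed + hum

aic_small = aic(-6386.768, 2)  # intercept + temp
aic_big = aic(-6343.864, 4)    # intercept + temp + windspeed + hum

print("adjusted R squared: {:.3f} vs {:.3f}".format(adj_small, adj_big))
print("AIC: {:.3f} vs {:.3f}".format(aic_small, aic_big))
```

Both measures favour the three-predictor model, so the improvement is not just an artefact of adding columns.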

#### 6 Analysis of the model cnt ~ temp + season + hum + windspeed

In [28]:
result=smf.ols(formula="cnt ~ temp + season + hum + windspeed",data=df).fit()
print("log likelihood= {:.3f} ".format(result.llf))

log likelihood= -6310.231 


Adding season improves the model: the log likelihood increases from -6343.864 (cnt ~ temp + windspeed + hum) to -6310.231.
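The log likelihood of a nested model always rises when a predictor is added, so the size of the increase matters. A likelihood ratio test is one way to judge it; the sketch below is ours, not part of the original analysis. It assumes season enters as a single numeric column, so the two models differ by one parameter, and it uses the fact that a chi-square tail probability with one degree of freedom equals `erfc(sqrt(x / 2))`.

```python
import math


def lr_test_1df(llf_restricted, llf_full):
    """Likelihood ratio test for nested models differing by ONE parameter.

    Returns the test statistic and its chi-square(1) p-value.
    """
    stat = 2.0 * (llf_full - llf_restricted)
    # Survival function of chi-square with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Log likelihoods copied from the outputs above.
stat, p_value = lr_test_1df(llf_restricted=-6343.864, llf_full=-6310.231)
print("LR statistic = {:.3f}, p-value = {:.3g}".format(stat, p_value))
```

If season were coded as a categorical variable, the models would differ by more than one parameter and this one-degree-of-freedom shortcut would no longer apply.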

#### 10 New variables with the squares of the predictors

In [33]:
import numpy as np
df["hum2"]=np.square(df.hum)
df["temp2"]=np.square(df.temp)
df["windspeed2"]=np.square(df.windspeed)
df.head()

Unnamed: 0,season,temp,hum,windspeed,cnt,hum^2,temp^2,windspeed^2,hum2,temp2,windspeed2
0,1,0.344167,0.805833,0.160446,985,0.649367,0.118451,0.025743,0.649367,0.118451,0.025743
1,1,0.363478,0.696087,0.248539,801,0.484537,0.132116,0.061772,0.484537,0.132116,0.061772
2,1,0.196364,0.437273,0.248309,1349,0.191208,0.038559,0.061657,0.191208,0.038559,0.061657
3,1,0.2,0.590435,0.160296,1562,0.348613,0.04,0.025695,0.348613,0.04,0.025695
4,1,0.226957,0.436957,0.1869,1600,0.190931,0.051509,0.034932,0.190931,0.051509,0.034932


In [35]:
model=["temp + temp2 + hum + windspeed","temp + hum + hum2 + windspeed", "temp + hum + windspeed + windspeed2","temp + hum + windspeed"]
for M in model:
    formula="cnt ~ {}".format(M)
    result=smf.ols(formula=formula,data=df).fit()
    print("\n")
    print("{} :  Rsquared= {:.3f}  ".format(formula,result.rsquared))
    print("{} :  log likelihood= {:.3f} ".format(formula,result.llf))
    print("{} :  Adj R squared= {:.3f} ".format(formula,result.rsquared_adj))



cnt ~ temp + temp2 + hum + windspeed :  Rsquared= 0.561  
cnt ~ temp + temp2 + hum + windspeed :  log likelihood= -6268.531 
cnt ~ temp + temp2 + hum + windspeed :  Adj R squared= 0.559 


cnt ~ temp + hum + hum2 + windspeed :  Rsquared= 0.480  
cnt ~ temp + hum + hum2 + windspeed :  log likelihood= -6330.366 
cnt ~ temp + hum + hum2 + windspeed :  Adj R squared= 0.478 


cnt ~ temp + hum + windspeed + windspeed2 :  Rsquared= 0.461  
cnt ~ temp + hum + windspeed + windspeed2 :  log likelihood= -6343.827 
cnt ~ temp + hum + windspeed + windspeed2 :  Adj R squared= 0.458 


cnt ~ temp + hum + windspeed :  Rsquared= 0.461  
cnt ~ temp + hum + windspeed :  log likelihood= -6343.864 
cnt ~ temp + hum + windspeed :  Adj R squared= 0.459 


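R squared and the log likelihood increase when we add the square of temp or hum, but barely change when we add the square of windspeed (log likelihood -6343.827 versus -6343.864, and the adjusted R squared drops from 0.459 to 0.458).

So we can conclude that adding the squares of temp and hum improves the model, while the square of windspeed has no effect. The gain is largest for temp2, where R squared rises from 0.461 to 0.561.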
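The squared columns do not have to be stored in the DataFrame: the square can also be written inside the formula with `I(...)`. The sketch below checks that both spellings give the same fit. It runs on synthetic data, so the `demo` frame and its coefficients are made up for illustration and are not the bike dataset.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data (NOT the bike dataset): cnt is quadratic in temp.
rng = np.random.default_rng(0)
demo = pd.DataFrame({"temp": rng.uniform(0.0, 1.0, 200)})
demo["cnt"] = (
    1000 + 8000 * demo["temp"] - 5000 * demo["temp"] ** 2
    + rng.normal(0, 300, 200)
)

# Variant 1: explicit squared column, as in the cells above.
demo["temp2"] = np.square(demo["temp"])
fit_column = smf.ols(formula="cnt ~ temp + temp2", data=demo).fit()

# Variant 2: the square is computed inside the formula.
fit_inline = smf.ols(formula="cnt ~ temp + I(temp ** 2)", data=demo).fit()

print("R squared with explicit column: {:.6f}".format(fit_column.rsquared))
print("R squared with I(temp ** 2):    {:.6f}".format(fit_inline.rsquared))
```

The inline form avoids leftover helper columns when a cell is re-run with different names.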
