# Subject: Classical Data Analysis

## Session 1 - Regression

### Exercise 1 



Considering the OLS presented in Demo 2, develop a new regression analysis based on the independent variable LSTAT (“percentage of lower status of the population”).

- Interpret and discuss the OLS Regression Results. 
- Commit your scripts to your GitHub account. You should export your solution code (.ipynb notebook) and push it to your repository “ClassicalDataAnalysis”.


The following are the tasks that you should complete and synchronize with your repository “ClassicalDataAnalysis” by October 13. Please note that none of these tasks is graded; however, it is important that you understand and complete them correctly so that you do not run into problems with later assignments.



# Linear Regression in Statsmodels

### Regression model with Statsmodels and without a constant:

In [13]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
from sklearn import datasets

In [3]:
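# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell needs an older scikit-learn release.
data = datasets.load_boston()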

In [11]:
print(data['feature_names'])

['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
 'B' 'LSTAT']


In [14]:
df = pd.DataFrame(data.data, columns=data['feature_names'])

In [15]:
df.head()

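| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 |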


In [17]:
target = pd.DataFrame(data.target, columns=['MEDV'])

In [26]:
x = df['LSTAT']
y = target['MEDV']

In [30]:
model = sm.OLS(y, x).fit()

In [31]:
model.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | MEDV | R-squared: | 0.449 |
| Model: | OLS | Adj. R-squared: | 0.448 |
| Method: | Least Squares | F-statistic: | 410.9 |
| Date: | Sun, 08 Oct 2017 | Prob (F-statistic): | 2.71e-67 |
| Time: | 12:39:52 | Log-Likelihood: | -2182.4 |
| No. Observations: | 506 | AIC: | 4367.0 |
| Df Residuals: | 505 | BIC: | 4371.0 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| LSTAT | 1.1221 | 0.055 | 20.271 | 0.000 | 1.013 | 1.231 |

| | | | |
|---|---|---|---|
| Omnibus: | 1.113 | Durbin-Watson: | 0.369 |
| Prob(Omnibus): | 0.573 | Jarque-Bera (JB): | 1.051 |
| Skew: | 0.112 | Prob(JB): | 0.591 |
| Kurtosis: | 3.009 | Cond. No. | 1.0 |


### Interpreting the Table 

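- The R-squared is only 0.449, so LSTAT alone, with the line forced through the origin, does not predict house values well. Because this model has no constant, statsmodels reports the uncentered R-squared (measured against zero rather than against the mean of MEDV), so the value is not directly comparable with the R-squared of a model that includes an intercept.
- The coefficient is 1.1221: when LSTAT increases by 1, the predicted value of MEDV increases by 1.1221. A positive sign is implausible here, since a larger share of lower-status population would be expected to go with lower house values; it is most likely a consequence of omitting the constant.
- The p-value for LSTAT is reported as 0.000, well below 0.05, so we reject the null hypothesis that the coefficient is zero. LSTAT therefore has a statistically significant association with MEDV.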
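
To see why a model without a constant can report a positive slope for a falling relationship, here is a minimal sketch in plain Python. The data are synthetic and chosen only for illustration (they are not the Boston dataset), and the helper names `fit_through_origin` and `uncentered_r2` are hypothetical. The formulas are the standard least-squares slope for a line through the origin and the uncentered R-squared.

```python
# Synthetic data (an assumption for illustration, not the Boston dataset):
# y falls as x rises, following exactly y = 35 - x.
xs = [2.0, 5.0, 10.0, 15.0, 20.0, 30.0]
ys = [33.0, 30.0, 25.0, 20.0, 15.0, 5.0]


def fit_through_origin(xs, ys):
    """Least-squares slope for y = b * x (no intercept): b = sum(x*y) / sum(x*x)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / sxx


def uncentered_r2(xs, ys, slope):
    """R-squared measured against zero instead of against the mean of y."""
    ss_res = sum((y - slope * x) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum(y * y for y in ys)
    return 1.0 - ss_res / ss_tot


slope = fit_through_origin(xs, ys)
r2 = uncentered_r2(xs, ys, slope)

# The true slope is -1, yet the line forced through the origin slopes upward.
print(f"slope through origin: {slope:.3f}")
print(f"uncentered R-squared: {r2:.3f}")
```

Since every `x` and `y` here is positive, `sum(x*y)` is positive and so is the fitted slope, whatever the true direction of the relationship.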

### Regression model with Statsmodels and with a constant:

In [33]:
x = sm.add_constant(x)

In [35]:
model = sm.OLS(y,x).fit()

In [36]:
model.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | MEDV | R-squared: | 0.544 |
| Model: | OLS | Adj. R-squared: | 0.543 |
| Method: | Least Squares | F-statistic: | 601.6 |
| Date: | Sun, 08 Oct 2017 | Prob (F-statistic): | 5.08e-88 |
| Time: | 13:01:24 | Log-Likelihood: | -1641.5 |
| No. Observations: | 506 | AIC: | 3287.0 |
| Df Residuals: | 504 | BIC: | 3295.0 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 34.5538 | 0.563 | 61.415 | 0.000 | 33.448 | 35.659 |
| LSTAT | -0.9500 | 0.039 | -24.528 | 0.000 | -1.026 | -0.874 |

| | | | |
|---|---|---|---|
| Omnibus: | 137.043 | Durbin-Watson: | 0.892 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 291.373 |
| Skew: | 1.453 | Prob(JB): | 5.36e-64 |
| Kurtosis: | 5.319 | Cond. No. | 29.7 |


### Interpreting the Table 


- The R-squared is 0.544, so LSTAT explains about 54% of the variance in MEDV. That is higher than the 0.449 of the model without a constant, but it is still a modest value.
- The coefficient for LSTAT is -0.95: when LSTAT increases by 1, the predicted value of MEDV decreases by 0.95. The sign has flipped relative to the model without a constant.
- The coefficient for the constant is 34.5538, which is the predicted value of MEDV when LSTAT is zero.
- The p-value for LSTAT is again reported as 0.000, so we still reject the null hypothesis that the coefficient is zero. LSTAT has a statistically significant association with MEDV.
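
As a rough check on this reading, the sketch below plugs the coefficients from the summary table into the fitted line and approximately reproduces the t statistic and the 95% confidence interval for LSTAT. The function name `predict_medv` is hypothetical, and the critical value 1.96 is a normal approximation to the t distribution with 504 degrees of freedom, so the results match the table only up to rounding.

```python
# Values copied from the summary table (model with a constant).
CONST = 34.5538
LSTAT_COEF = -0.9500
LSTAT_SE = 0.039


def predict_medv(lstat):
    """Predicted MEDV from the fitted line: const + coef * LSTAT."""
    return CONST + LSTAT_COEF * lstat


# t statistic = coefficient / standard error. The table reports -24.528;
# the small difference comes from using the rounded standard error here.
t_stat = LSTAT_COEF / LSTAT_SE

# Approximate 95% confidence interval (assumption: normal critical value 1.96).
Z_95 = 1.96
ci_low = LSTAT_COEF - Z_95 * LSTAT_SE
ci_high = LSTAT_COEF + Z_95 * LSTAT_SE

print(f"predicted MEDV at LSTAT = 10: {predict_medv(10):.4f}")
print(f"t statistic: {t_stat:.3f}")
print(f"95% CI for LSTAT: [{ci_low:.3f}, {ci_high:.3f}]")
```

The interval lies entirely below zero, which is consistent with rejecting the null hypothesis that the LSTAT coefficient is zero.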

# Linear Regression in SKLearn 

In [38]:
from sklearn import linear_model

In [40]:
x = df['LSTAT']
y = target['MEDV']

In [54]:
x = x.values.reshape(-1,1)

In [None]:
lm = linear_model.LinearRegression()
model = lm.fit(x,y)

In [57]:
lm.score(x,y)

0.54414629758647992

In [58]:
lm.coef_

array([-0.95004935])

In [61]:
lm.intercept_

34.55384087938311
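
These values match the statsmodels model with a constant because `LinearRegression` fits an intercept by default, and `score` returns the R-squared of the fit. The sketch below shows in plain Python what `coef_`, `intercept_` and `score` correspond to for a single predictor. The data are synthetic and the helper names `simple_ols` and `r_squared` are hypothetical; this illustrates the closed-form formulas and is not scikit-learn's actual implementation.

```python
# Synthetic data (an assumption for illustration, not the Boston dataset).
xs = [2.0, 5.0, 10.0, 15.0, 20.0, 30.0]
ys = [34.0, 29.0, 26.0, 19.0, 16.0, 4.0]


def simple_ols(xs, ys):
    """Closed-form OLS for y = a + b * x; returns (slope, intercept)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept


def r_squared(xs, ys, slope, intercept):
    """Centered R-squared: 1 - SS_res / SS_tot, with SS_tot taken around the mean of y."""
    y_mean = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - y_mean) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


slope, intercept = simple_ols(xs, ys)          # analogous to lm.coef_[0], lm.intercept_
r2 = r_squared(xs, ys, slope, intercept)       # analogous to lm.score(x, y)

print(f"slope: {slope:.4f}")
print(f"intercept: {intercept:.4f}")
print(f"R-squared: {r2:.4f}")
```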