< [Notebook 13](PartIV2.ipynb) | [PyFinLab Index](ALWAYS-START-HERE.ipynb) | [Notebook 15](PartV1.ipynb) >

<a id = "ref00"></a>

<a><img src="figures/UUBS.png" width="180" height="180" border="10" /></a>

<hr>

### Notebook 14: Simple linear regression diagnostics

In this notebook, we consider how to diagnose simple linear regression models and validate them against their underlying assumptions.


In [None]:
import pandas as pd 
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
housing = pd.read_csv('data/housing.csv',index_col=0)
housing.head()

In [None]:
model = smf.ols(formula='MEDV~LSTAT',data=housing).fit()

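# intercept and slope parameters
# found by least squares estimation
# (label-based access; positional indexing such as params[0]
# is deprecated for a labelled pandas Series)
b0 = model.params['Intercept']
b1 = model.params['LSTAT']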

housing['BestResponse'] = b0 + b1*housing['LSTAT']
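
As a cross-check, fitted values computed by hand should match what statsmodels stores on the results object. The sketch below uses a small synthetic data set (the column names mirror the housing data, but the numbers are made up for illustration, and the `demo_` names are hypothetical) and compares the manual calculation with `fittedvalues`.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# synthetic stand-in for the housing data (values are illustrative only)
rng = np.random.default_rng(0)
demo = pd.DataFrame({'LSTAT': rng.uniform(2, 35, size=200)})
demo['MEDV'] = 34.0 - 0.9*demo['LSTAT'] + rng.normal(0, 5, size=200)

demo_model = smf.ols(formula='MEDV~LSTAT', data=demo).fit()

# manual fitted values from the estimated intercept and slope
demo_b0 = demo_model.params['Intercept']
demo_b1 = demo_model.params['LSTAT']
manual_fit = demo_b0 + demo_b1*demo['LSTAT']

# statsmodels stores the same quantity as `fittedvalues`
same_fit = np.allclose(manual_fit, demo_model.fittedvalues)
print(same_fit)
```

The same idea should carry over to the housing model: `model.fittedvalues` is expected to agree with the `BestResponse` column.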

### Key assumptions behind the linear regression model
1. Linearity 
2. Independence
3. Normality
4. Equal Variance
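
In symbols, and using generic notation that is not taken from the notebook itself, the four assumptions can be summarised in one model statement for observation $i$:

$$
y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad \varepsilon_i \overset{\text{iid}}{\sim} N(0, \sigma^2)
$$

Linearity concerns the form of the mean $\beta_0 + \beta_1 x_i$; independence and normality describe the errors $\varepsilon_i$; equal variance corresponds to the single $\sigma^2$ shared by all observations.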

### Linearity

In [None]:
# we can view the scatter plot for a quick check
housing.plot(kind='scatter',x='LSTAT',y='MEDV',figsize=(10, 5),color='g')

### Independence

In [None]:
# observing the errors (residuals)
housing['error'] = housing['MEDV'] - housing['BestResponse']

In [None]:
# method 1: residual vs order plot as a quick visual check

plt.figure(figsize=(12, 6))
plt.title('Residual vs order')
plt.plot(housing.index, housing['error'],color='purple')
plt.axhline(y=0,color='red')
plt.show()

In [None]:
# method 2: Durbin-Watson test
# rule of thumb: a Durbin-Watson statistic in the range 1.5-2.5
# is generally taken as evidence of independence; the value is
# reported in the lower part of the summary table
model.summary()
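
The Durbin-Watson statistic can also be computed directly rather than read off the summary table. The sketch below is illustrative only: it uses two synthetic error series (one independent, one positively autocorrelated, both made up here) and `durbin_watson` from `statsmodels.stats.stattools` to show how the statistic behaves relative to the 1.5-2.5 rule of thumb.

```python
import numpy as np
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(1)

# independent errors: the statistic should be close to 2
independent = rng.normal(size=500)

# positively autocorrelated errors (AR(1) with coefficient 0.8):
# the statistic should fall well below 2
correlated = np.zeros(500)
for t in range(1, 500):
    correlated[t] = 0.8*correlated[t - 1] + rng.normal()

dw_independent = durbin_watson(independent)
dw_correlated = durbin_watson(correlated)

# the statistic is sum((e_t - e_{t-1})^2) / sum(e_t^2)
dw_manual = np.sum(np.diff(correlated)**2) / np.sum(correlated**2)

print(dw_independent, dw_correlated, dw_manual)
```

For the housing model, `durbin_watson(model.resid)` should give the same figure that appears in the summary.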

### Normality

In [None]:
import scipy.stats as stats
z = (housing['error'] - housing['error'].mean())/housing['error'].std(ddof=1)

stats.probplot(z,dist='norm',plot=plt)
plt.title('Normal Q-Q plot')
plt.show()
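
A Q-Q plot is a visual check; a formal test can complement it. One option, sketched here on made-up right-skewed residuals rather than the housing errors, is the Jarque-Bera test from `scipy.stats`, whose null hypothesis is that the sample comes from a normal distribution. A small p-value is evidence against normality.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(2)

# synthetic right-skewed "residuals" (illustrative only), centred at zero
skewed = rng.exponential(scale=1.0, size=500) - 1.0

# null hypothesis: the sample comes from a normal distribution
jb_stat, jb_pvalue = stats.jarque_bera(skewed)
print(jb_stat, jb_pvalue)
```

The statsmodels summary table also reports a Jarque-Bera statistic for the model residuals, so the same reading applies there.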

### Equal variance

In [None]:
# Residual vs predictor plot
housing.plot(kind='scatter',x='LSTAT',y='error',figsize=(15, 8),color='green')
plt.title('Residual vs predictor')
plt.axhline(y=0,color='red')
plt.show()
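
The residual vs predictor plot can be backed up with a numerical test. The sketch below assumes the Breusch-Pagan test (`het_breuschpagan` in `statsmodels.stats.diagnostic`) is an acceptable choice; it is not part of the original notebook. The data are synthetic, with an error spread that grows with the predictor, and the names `hetero` and `hetero_model` are hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(3)
x = rng.uniform(1, 10, size=500)
# error standard deviation grows with x, so the variance is not constant
y = 2.0 + 0.5*x + rng.normal(0, 0.5*x)
hetero = pd.DataFrame({'x': x, 'y': y})

hetero_model = smf.ols(formula='y~x', data=hetero).fit()

# null hypothesis: constant error variance
# second argument: regressors including the intercept column
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(
    hetero_model.resid, hetero_model.model.exog)
print(lm_pvalue)
```

A small p-value suggests the equal variance assumption does not hold, which is what a funnel-shaped residual plot would indicate visually.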

<div align="left">
<a href="#G2" class="btn btn-default" data-toggle="collapse"><b>In conclusion!</b></a>

</div>
<div id="G2" class="collapse">
<br>

Based on these diagnostics, the regression model (MEDV~LSTAT) appears to violate all four assumptions. Therefore, we cannot rely on this model to draw statistical inference on the association between these variables.

</div>


<div align="right"><a href="#ref00">back to top</a></div>