In [None]:
import pandas as pd
import numpy as np

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score, explained_variance_score

import statsmodels.formula.api as smf
from statsmodels.stats.outliers_influence import variance_inflation_factor
from patsy import dmatrices

import seaborn as sns
%matplotlib inline
sns.set(style="darkgrid", 
        rc={"figure.figsize":(12,8), 
            "axes.labelsize":14, 
            "xtick.labelsize":12, 
            "ytick.labelsize":12})

## A. Potential Problems with Linear Regression
 
1. Correlation of error terms
1. Non-linear relationship between $Y$ and $X$
1. Heteroscedasticity: non-constant variance of error terms
1. High-leverage points
1. Outliers
1. Collinearity

### Boston Housing Data

In [None]:
df = pd.read_csv("../../data/csv/Boston.csv")
df.head()

In [None]:
df.columns.tolist()

CRIM - per capita crime rate by town  
ZN - proportion of residential land zoned for lots over 25,000 sq.ft.  
INDUS - proportion of non-retail business acres per town  
CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)  
NOX - nitric oxides concentration (parts per 10 million)  
RM - average number of rooms per dwelling  
AGE - proportion of owner-occupied units built prior to 1940  
DIS - weighted distances to five Boston employment centres  
RAD - index of accessibility to radial highways  
TAX - full-value property-tax rate per \$10,000  
PTRATIO - pupil-teacher ratio by town  
BLACK - $1000(Bk - 0.63)^2$, where Bk is the proportion of blacks by town  
LSTAT - percentage lower status of the population  
MEDV - median value of owner-occupied homes in \$1000's

In [None]:
df.describe()

#### 2. Non-linear Relationship Between $Y$ and $X$

Identification:  
The plot of residuals vs. fitted (predicted) values $\hat{y}_i$ shows a systematic curve, such as a U-shape, instead of a random scatter about 0

Solution:  
Apply a non-linear transformation to $X$, such as $\log X$, $\sqrt{X}$, or $X^2$
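
As a minimal sketch of this diagnosis, the block below fits a straight line to synthetic data that is truly quadratic in $x$, then refits after adding $x^2$ as a predictor. The data, the helper `fit_residuals`, and the use of the correlation between the residuals and $x^2$ as a numeric stand-in for "the residual plot has a curve" are all assumptions made for illustration; none of it comes from the Boston data.

```python
import numpy as np

rng = np.random.default_rng(0)          # synthetic data, not the Boston set
x = rng.uniform(-3, 3, size=200)
y = 1.0 + 2.0 * x + 1.5 * x**2 + rng.normal(scale=0.5, size=200)

def fit_residuals(design, response):
    """Ordinary least-squares fit; returns the residuals."""
    coef, *_ = np.linalg.lstsq(design, response, rcond=None)
    return response - design @ coef

ones = np.ones_like(x)
resid_linear = fit_residuals(np.column_stack([ones, x]), y)        # y ~ x
resid_quad = fit_residuals(np.column_stack([ones, x, x**2]), y)    # y ~ x + x^2

# Correlation of the residuals with x^2: a large value means curvature is left over
pattern_linear = abs(np.corrcoef(resid_linear, x**2)[0, 1])
pattern_quad = abs(np.corrcoef(resid_quad, x**2)[0, 1])
print(pattern_linear, pattern_quad)
```

The straight-line fit leaves residuals that track $x^2$ closely, while the fit that includes $x^2$ leaves residuals uncorrelated with it by construction of least squares.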

#### 3. Heteroscedasticity

Identification:  
The plot of residuals vs. fitted values has a funnel shape: the spread of the residuals changes as the fitted values increase

Solution:  
Transform $Y$ with a concave function such as $\log Y$ or $\sqrt{Y}$
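
The sketch below illustrates the idea on synthetic data with multiplicative noise, where a log transform is known to stabilize the variance. The data-generating model and the helper `spread_trend` (the correlation between the absolute residuals and the fitted values, used here as a rough numeric proxy for a funnel shape) are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)                 # synthetic data, not the Boston set
n = 1000
x = rng.uniform(0, 4, size=n)
y = np.exp(1.0 + 0.5 * x + rng.normal(scale=0.4, size=n))   # multiplicative noise

def spread_trend(response):
    """Correlation between |residual| and fitted value for response ~ x."""
    design = np.column_stack([np.ones(n), x])
    coef, *_ = np.linalg.lstsq(design, response, rcond=None)
    fitted = design @ coef
    return np.corrcoef(np.abs(response - fitted), fitted)[0, 1]

trend_raw = spread_trend(y)          # positive: residual spread grows with the fitted values
trend_log = spread_trend(np.log(y))  # near zero: log(Y) has constant error variance here
print(round(trend_raw, 2), round(trend_log, 2))
```

A value near zero does not prove constant variance, so this number is a complement to the residual plot, not a replacement for it.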

In [None]:
# Predictors: every column except the last one (the response)
x = df.iloc[:,:-1]
x.head()

In [None]:
# Fit the linear model
lm = smf.ols(formula = "medv ~ x", data = df).fit()

In [None]:
fig = sns.regplot(x=lm.predict(), y=lm.resid, order=2)
fig.set(xlabel='Fitted Values', ylabel='Residuals', title='Response: Y');

The plot of residuals vs. fitted values has a parabolic shape. Let's try fitting the model with $\log(Y)$ and $\sqrt{Y}$:

In [None]:
lmLog = smf.ols(formula = "np.log(medv) ~ x", data = df).fit()
fig = sns.regplot(x=lmLog.predict(), y=lmLog.resid, fit_reg=False)
fig.set(xlabel='Fitted Values', ylabel='Residuals', title=r'Response: $\log(Y)$');

In [None]:
lmSqrt = smf.ols(formula = "np.sqrt(medv) ~ x", data = df).fit()
fig = sns.regplot(x=lmSqrt.predict(), y=lmSqrt.resid)
fig.set(xlabel='Fitted Values', ylabel='Residuals', title=r'Response: $\sqrt{Y}$');

The plot with $\log(Y)$ has a fan-in funnel shape. With $\sqrt{Y}$, the residuals scatter more randomly about 0.

#### 4. High-leverage Points

Definition:  
An observation whose predictor value $x_i$ doesn't follow the pattern of the remaining predictor values, and which therefore has a disproportionate effect on the estimated regression line.

Identification:  
$x_i$ is said to have high leverage if its leverage statistic is well above the average leverage $\frac{p+1}{n}$, where $p$ is the number of predictors and $n$ is the number of observations

Solution:  
Consider removing this $x_i$ from the overall dataset, particularly if it is also an outlier


In [None]:
df.shape

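With $p=13$ predictors and $n=506$ observations, the average leverage is $\frac{p+1}{n}=\frac{13+1}{506}=0.0277$. This notebook flags every observation whose leverage statistic exceeds that average, which is a lenient cutoff; stricter rules of thumb use two or three times the average.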
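
To see where the $\frac{p+1}{n}$ figure comes from, the sketch below computes leverage statistics directly as the diagonal of the hat matrix $H = X(X^TX)^{-1}X^T$ for a small synthetic design matrix. The sizes, the random data, and the planted unusual row are assumptions for illustration; the QR route is one standard way to get the diagonal without forming $H$.

```python
import numpy as np

rng = np.random.default_rng(1)                 # synthetic data, not the Boston set
n, p = 50, 3                                   # hypothetical sizes
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])   # intercept + p predictors
X[0, 1:] = [8.0, -8.0, 8.0]                    # plant one unusual predictor row

# With X = QR (reduced QR), H = Q Q^T, so each leverage is a row sum of Q**2
Q, _ = np.linalg.qr(X)
leverage = np.sum(Q**2, axis=1)

print(leverage.mean(), (p + 1) / n)            # the average leverage equals (p + 1) / n
print(np.argmax(leverage))                     # index of the planted row
```

Because the leverages always sum to $p+1$, roughly half of the observations in a typical dataset sit above the average, which is why a multiple of the average is often preferred as a cutoff.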

In [None]:
influence = lm.get_influence()

In [None]:
# Calculate Leverage Statistic
leverage = influence.hat_matrix_diag
dfRes = pd.concat([df, pd.Series(leverage, name="leverage")], axis=1)
print(dfRes.shape)
dfRes.head()

In [None]:
# Top 5 high leverage data points
dfRes[dfRes["leverage"] > 0.0277].sort_values(by = "leverage", ascending = False).head()

#### 5. Outliers

Definition:  
$x_i$ is an outlier if the corresponding $y_i$ is far from the value predicted by the model

Identification:  
$x_i$ is an outlier if the absolute value of its studentized residual exceeds 3. The studentized residual is $\frac{e_i}{SE(e_i)}$

Solution:  
Consider removing this $x_i$ from the overall dataset
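
A minimal sketch of the calculation, on synthetic data with one planted outlier: the internally studentized residual divides $e_i$ by $s\sqrt{1-h_i}$, and the externally studentized version is obtained from it with the standard leave-one-out identity. The data and variable names are assumptions for illustration; the result is expected to agree with what a regression library reports as externally studentized residuals, though that is not checked here.

```python
import numpy as np

rng = np.random.default_rng(2)                 # synthetic data, not the Boston set
n = 60
x = rng.normal(size=n)
y = 2.0 + 3.0 * x + rng.normal(scale=1.0, size=n)
y[10] += 10.0                                  # planted outlier at index 10

X = np.column_stack([np.ones(n), x])
k = X.shape[1]                                 # number of parameters, incl. intercept
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

Q, _ = np.linalg.qr(X)
h = np.sum(Q**2, axis=1)                       # leverage of each observation

s2 = resid @ resid / (n - k)                   # residual variance estimate
internal = resid / np.sqrt(s2 * (1 - h))       # internally studentized residuals
# Leave-one-out version: the variance estimate excludes observation i
external = internal * np.sqrt((n - k - 1) / (n - k - internal**2))

flagged = np.flatnonzero(np.abs(external) > 3)
print(flagged)
```

The external version is larger in magnitude than the internal one whenever the internal value exceeds 1 in absolute value, which makes genuine outliers stand out more.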

In [None]:
# Calculate Studentized Residuals
studentRes = influence.resid_studentized_external
dfRes = pd.concat([dfRes, pd.Series(studentRes, name="studentRes")], axis=1)
dfRes.head()

In [None]:
# Data points with high studentized residuals
dfRes[np.absolute(dfRes["studentRes"]) > 3]

The above 8 data points have both large studentized residuals (absolute value > 3) and high leverage (> 0.0277).

#### 6. Collinearity

Definition:  
Collinearity = two or more predictors are related to one another  
Multicollinearity = three or more predictors are related to one another

Identification:  
Large absolute off-diagonal values in the correlation matrix detect collinearity  
Large VIF (variance inflation factor) detects multicollinearity, where minimum VIF value is 1   
Convention: VIF > 10 is considered large and VIF > 5 is moderate

Solution:  
Either drop one of the correlated variables, or  
combine the collinear variables to form a new variable
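
The sketch below shows why VIF can catch what the correlation matrix misses. It uses three synthetic predictors where the third is nearly the sum of the first two, and computes each VIF as the corresponding diagonal entry of the inverse correlation matrix, which equals $\frac{1}{1-R_j^2}$ for a regression of predictor $j$ on the others with an intercept. The data and variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)                 # synthetic data, not the Boston set
n = 500
a = rng.normal(size=n)
b = rng.normal(size=n)
c = a + b + rng.normal(scale=0.1, size=n)      # nearly an exact combination of a and b
X = np.column_stack([a, b, c])

corr = np.corrcoef(X, rowvar=False)
vif = np.diag(np.linalg.inv(corr))             # VIF_j = j-th diagonal of the inverse correlation matrix
max_pairwise = np.max(np.abs(corr - np.eye(3)))  # largest off-diagonal |correlation|

print(np.round(vif, 1), round(max_pairwise, 2))
```

Here no single pair of predictors is very strongly correlated, yet every VIF is far above 10, because each predictor is almost fully explained by the other two together.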

In [None]:
# Correlation Matrix
corr = df.corr()
corr

In [None]:
# Correlation Heatmap

# generate a mask for the upper triangle
mask = np.zeros_like(corr, dtype=bool)  # np.bool was removed from NumPy; use the builtin bool
mask[np.triu_indices_from(mask)] = True

# generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)

# draw the heatmap with the mask and correct aspect ratio; detecting absolute correlations >= 0.7
sns.heatmap(np.absolute(corr), 
            mask=mask, 
            cmap=cmap, 
            vmax=.7, 
            center=0, 
            square=True, 
            linewidths=.5, 
            cbar_kws={"shrink": .5});

Lots of collinear variables (absolute correlation value of >=0.7):   
Indus, Nox, Age, Dis, Rad, Tax are all correlated with each other, and  
Medv, Lstat, Rm are correlated.  
We see strong patterns (not necessarily linear) in the graphs of each combination of these variables.