# Multiple Regression

This notebook uses the Variance Inflation Factor (VIF) to find collinear features and tries to improve a multiple regression model by removing them.

The variance inflation factor measures how much the variance of an estimated regression coefficient is inflated by the correlation between its independent variable and the other independent variables.

VIF can be used as a feature selection step: features with high VIF values are removed, which reduces the dimensionality of the model's inputs.
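
To make the definition concrete, here is a minimal sketch of computing a VIF by hand with NumPy: each column is regressed on the other columns (with an intercept), and the resulting R-squared is turned into `1 / (1 - R^2)`. The function name `vif_from_scratch` and the synthetic data are assumptions made for illustration; they are not part of the notebook.

```python
import numpy as np

def vif_from_scratch(X):
    """Return the VIF of every column of X (shape: n_samples x n_features)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    vifs = np.empty(p)
    for j in range(p):
        target = X[:, j]
        # Regress column j on all other columns plus an intercept term.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        residuals = target - others @ beta
        r_squared = 1.0 - residuals.var() / target.var()
        vifs[j] = 1.0 / (1.0 - r_squared)
    return vifs

# Synthetic data (an assumption for illustration, not the Boston dataset).
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = a + 0.1 * rng.normal(size=500)  # nearly a copy of `a`
c = rng.normal(size=500)            # unrelated to `a` and `b`

demo_vifs = vif_from_scratch(np.column_stack([a, b, c]))
print(demo_vifs)
```

In this toy setup the two nearly identical columns should receive large VIFs, while the unrelated column should stay close to 1.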

In [77]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
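# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this import only works with an older scikit-learn release.
from sklearn.datasets import load_boston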
from sklearn.model_selection import cross_val_score

In [44]:
b = load_boston()
df = pd.DataFrame(b.data, columns=b.feature_names)
y = b.target

In [45]:
df.columns

Index(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX',
       'PTRATIO', 'B', 'LSTAT'],
      dtype='object')

In [46]:
model = LinearRegression()
model.fit(df, y)
# Only the first two of the 13 coefficients (CRIM and ZN) are printed here.
m1, m2, c = model.coef_[0], model.coef_[1], model.intercept_
print(f"The Coefficients are {m1}, {m2} and the intercept is {c}")

The Coefficients are -0.10801135783679647, 0.046420458366875736 and the intercept is 36.45948838509001


In [109]:
prediction = model.predict(df)
results = pd.DataFrame({
    'Actual MEDV': y,
    'Predicted MEDV': prediction,
    'Difference': np.abs(y - prediction),  # absolute error per row
})
results.head()

   Actual MEDV  Predicted MEDV  Difference
0         24.0       30.003843    6.003843
1         21.6       25.025562    3.425562
2         34.7       30.567597    4.132403
3         33.4       28.607036    4.792964
4         36.2       27.943524    8.256476


In [131]:
# cross_val_score returns one R^2 score per fold (5 folds by default); average them.
R2 = np.mean(cross_val_score(model, df, y))
print(f'The R-Squared of the model is {R2}')

The R-Squared of the model is 0.35327592439587857
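
The figure above is a mean cross-validated R-squared, which is usually lower than the R-squared measured on the training rows. With its default settings `cross_val_score` also splits the data into consecutive, unshuffled folds, so if the rows are ordered in some systematic way the score can drop further. The sketch below contrasts an in-sample score with a shuffled 5-fold score; the synthetic data and the names `X_demo`, `y_demo`, `demo_model`, and `folds` are assumptions for illustration, not the notebook's data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic linear data (an assumption for illustration).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=200)

demo_model = LinearRegression()

# In-sample R^2: the model is scored on the same rows it was fitted on.
in_sample_r2 = demo_model.fit(X_demo, y_demo).score(X_demo, y_demo)

# Cross-validated R^2: each fold is scored on rows held out from fitting.
# shuffle=True randomizes row order before the folds are formed.
folds = KFold(n_splits=5, shuffle=True, random_state=0)
fold_r2 = cross_val_score(demo_model, X_demo, y_demo, cv=folds, scoring="r2")

print(f"In-sample R^2:       {in_sample_r2:.3f}")
print(f"Mean 5-fold CV R^2:  {fold_r2.mean():.3f}")
```

Passing an explicit `KFold` object makes the fold count and shuffling visible in the code rather than leaving them to the defaults.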


### Checking the Variance Inflation of each feature

Interpreting the Variance Inflation Factor

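Variance inflation factors start at 1 and have no upper bound. The value tells you by what factor the variance (i.e. the squared standard error) of a coefficient is inflated compared with a model whose features are uncorrelated. A VIF of 1.9, for example, means the variance of that coefficient is 90% larger than it would be without the correlation.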

A rule of thumb for interpreting the variance inflation factor:

    1 = not correlated.
    Between 1 and 5 = moderately correlated.
    Greater than 5 = highly correlated.
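
These thresholds can be related to how well a feature is explained by the others. As a reference point (this is the standard definition, which the notebook does not spell out), the VIF of feature $j$ is computed from $R_j^2$, the R-squared obtained by regressing feature $j$ on all remaining features:

$$
\mathrm{VIF}_j = \frac{1}{1 - R_j^2}
$$

Plugging in a few values:

$$
R_j^2 = 0 \;\Rightarrow\; \mathrm{VIF}_j = 1, \qquad
R_j^2 = 0.8 \;\Rightarrow\; \mathrm{VIF}_j = 5, \qquad
R_j^2 = 0.9 \;\Rightarrow\; \mathrm{VIF}_j = 10
$$

So a cutoff of 5 corresponds to 80% of a feature's variance being explained by the other features.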


In [49]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

In [123]:
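def vif(df):
    """Return a table with the VIF of every column in df."""
    VIF = pd.DataFrame()
    VIF['Variables'] = df.columns
    # Column i is regressed on the remaining columns exactly as passed in;
    # no intercept column is added here.
    VIF['Variance Inflation Factor'] = [variance_inflation_factor(df.values, i) for i in range(df.shape[1])]
    return VIF

VIFs = vif(df)
VIFs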

  Variables  Variance Inflation Factor
0      CRIM                   2.100373
1        ZN                   2.844013
2     INDUS                  14.485758
3      CHAS                   1.152952
4       NOX                  73.894947
5        RM                  77.948283
6       AGE                   21.38685
7       DIS                  14.699652
8       RAD                  15.167725
9       TAX                  61.227274


Many columns have a VIF greater than 5. The next cell drops only the most extreme ones, those with a VIF above 70.

In [124]:
# Note: the cutoff applied here is 70, not 5, even though later variable names end in `thresh_5`.
col_to_drop = VIFs[VIFs['Variance Inflation Factor'] > 70]['Variables']
col_to_drop

4         NOX
5          RM
10    PTRATIO
Name: Variables, dtype: object

In [125]:
vif_df_thresh_5 = df.drop(col_to_drop, axis=1)

In [126]:
model_thresh_5 = LinearRegression()
model_thresh_5.fit(vif_df_thresh_5, y)

LinearRegression()

In [127]:
prediction_thresh_5 = model_thresh_5.predict(vif_df_thresh_5)

In [132]:
R2_thresh_5 = np.mean(cross_val_score(model_thresh_5, vif_df_thresh_5, y))
print(f'The R-Squared of the model is {R2_thresh_5}')

The R-Squared of the model is 0.4154282179101509


In [130]:
vif(vif_df_thresh_5)

  Variables  Variance Inflation Factor
0      CRIM                   2.095375
1        ZN                   2.432618
2     INDUS                   13.14918
3      CHAS                   1.131756
4       AGE                  14.795487
5       DIS                   9.234022
6       RAD                  15.045607
7       TAX                  54.461654
8         B                  14.233397
9     LSTAT                   8.396362
