# Multicollinearity
Multicollinearity is the occurrence of high intercorrelations among two or more independent variables in a multiple regression model. 
In general, multicollinearity inflates the standard errors of the affected coefficient estimates, which widens their confidence intervals and makes conclusions about the effect of individual independent variables less reliable. That is, the statistical inferences from a model with multicollinearity may not be dependable.
It also makes the model harder to interpret and can contribute to overfitting. When independent variables are highly correlated, a change in one tends to be accompanied by a change in another, so the model cannot cleanly separate their individual effects. As a result, the coefficient estimates are unstable and can vary a lot given a small change in the data or the model.
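
The instability described above can be illustrated with a small simulation. This is a minimal sketch on synthetic data, not part of the original notebook: the helper `coef_spread` and its parameters are hypothetical names chosen for this example. It refits an ordinary least squares model on bootstrap resamples and measures how much the coefficient on `x1` moves when `x2` is nearly a copy of `x1`, compared with when the two are only weakly correlated.

```python
import numpy as np

def coef_spread(noise_scale, n=200, n_resamples=200, seed=0):
    """Std. dev. of the OLS coefficient on x1 across bootstrap resamples.

    x2 = x1 + noise_scale * noise, so a small noise_scale makes x1 and x2
    almost perfectly correlated (hypothetical setup for illustration).
    """
    rng = np.random.default_rng(seed)
    x1 = rng.normal(size=n)
    x2 = x1 + noise_scale * rng.normal(size=n)
    y = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.normal(size=n)
    A = np.column_stack([np.ones(n), x1, x2])  # intercept, x1, x2

    coefs = []
    for _ in range(n_resamples):
        idx = rng.integers(0, n, size=n)       # bootstrap sample of the rows
        beta, *_ = np.linalg.lstsq(A[idx], y[idx], rcond=None)
        coefs.append(beta[1])                  # coefficient on x1
    return float(np.std(coefs))

spread_collinear = coef_spread(noise_scale=0.01)  # corr(x1, x2) close to 1
spread_weak = coef_spread(noise_scale=10.0)       # corr(x1, x2) close to 0.1
print(spread_collinear, spread_weak)
```

Under these assumptions, the spread of the `x1` coefficient should be far larger in the nearly collinear case, even though the data-generating model is the same.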


### How to check for Multicollinearity:
#### 1. Checking Correlation Matrix


In [56]:
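import pandas as pd
import statsmodels.api as sm
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older version.
from sklearn.datasets import load_boston

# Load the Boston housing data: features into X, target (MEDV) into Y
boston = load_boston()
X = pd.DataFrame(data=boston.data, columns=boston.feature_names)
Y = pd.DataFrame(data=boston.target, columns=['MEDV'])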

In [57]:
corr=X.corr()
corr.style.background_gradient(cmap='coolwarm')

| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CRIM | 1.0 | -0.200469 | 0.406583 | -0.0558916 | 0.420972 | -0.219247 | 0.352734 | -0.37967 | 0.625505 | 0.582764 | 0.289946 | -0.385064 | 0.455621 |
| ZN | -0.200469 | 1.0 | -0.533828 | -0.0426967 | -0.516604 | 0.311991 | -0.569537 | 0.664408 | -0.311948 | -0.314563 | -0.391679 | 0.17552 | -0.412995 |
| INDUS | 0.406583 | -0.533828 | 1.0 | 0.062938 | 0.763651 | -0.391676 | 0.644779 | -0.708027 | 0.595129 | 0.72076 | 0.383248 | -0.356977 | 0.6038 |
| CHAS | -0.0558916 | -0.0426967 | 0.062938 | 1.0 | 0.0912028 | 0.0912512 | 0.0865178 | -0.0991758 | -0.00736824 | -0.0355865 | -0.121515 | 0.0487885 | -0.0539293 |
| NOX | 0.420972 | -0.516604 | 0.763651 | 0.0912028 | 1.0 | -0.302188 | 0.73147 | -0.76923 | 0.611441 | 0.668023 | 0.188933 | -0.380051 | 0.590879 |
| RM | -0.219247 | 0.311991 | -0.391676 | 0.0912512 | -0.302188 | 1.0 | -0.240265 | 0.205246 | -0.209847 | -0.292048 | -0.355501 | 0.128069 | -0.613808 |
| AGE | 0.352734 | -0.569537 | 0.644779 | 0.0865178 | 0.73147 | -0.240265 | 1.0 | -0.747881 | 0.456022 | 0.506456 | 0.261515 | -0.273534 | 0.602339 |
| DIS | -0.37967 | 0.664408 | -0.708027 | -0.0991758 | -0.76923 | 0.205246 | -0.747881 | 1.0 | -0.494588 | -0.534432 | -0.232471 | 0.291512 | -0.496996 |
| RAD | 0.625505 | -0.311948 | 0.595129 | -0.00736824 | 0.611441 | -0.209847 | 0.456022 | -0.494588 | 1.0 | 0.910228 | 0.464741 | -0.444413 | 0.488676 |
| TAX | 0.582764 | -0.314563 | 0.72076 | -0.0355865 | 0.668023 | -0.292048 | 0.506456 | -0.534432 | 0.910228 | 1.0 | 0.460853 | -0.441808 | 0.543993 |


As we can see, the features TAX and INDUS are highly correlated (0.72), as are AGE and NOX (0.73). Among the values shown, the strongest pairwise correlation is between RAD and TAX (0.91).
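
Reading pairs off a large matrix by eye is error-prone, so it can help to list the pairs above a cutoff programmatically. The following is a minimal sketch using NumPy on synthetic stand-in data; the helper `high_corr_pairs` and the 0.7 threshold are assumptions for illustration, not part of the original notebook. With the DataFrame used here, the same function could be called with `X.values` and `list(X.columns)`.

```python
import numpy as np

def high_corr_pairs(data, names, threshold=0.7):
    """Return (name_a, name_b, r) for each pair with |r| >= threshold."""
    corr = np.corrcoef(data, rowvar=False)    # columns are variables
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):    # upper triangle only, no duplicates
            if abs(corr[i, j]) >= threshold:
                pairs.append((names[i], names[j], float(corr[i, j])))
    # strongest correlations first
    return sorted(pairs, key=lambda p: abs(p[2]), reverse=True)

# Synthetic stand-in data: "b" is built from "a", "c" is independent of both.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = a + 0.3 * rng.normal(size=500)
c = rng.normal(size=500)

pairs = high_corr_pairs(np.column_stack([a, b, c]), ["a", "b", "c"])
print(pairs)
```

Only the `("a", "b")` pair should be reported here, since `c` is generated independently.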

#### 2. Variance Inflation Factor (VIF)
VIF is a measure of multicollinearity among the independent variables of a multiple regression. The higher a variable's VIF, the more strongly it is correlated with the remaining variables.
In the VIF method, we pick each feature in turn and regress it against all of the other features. For each regression, the factor is calculated as:
$$ VIF=\frac{1}{1-R^2}$$
where $R^2$ is the coefficient of determination of that regression. Generally, a VIF above 5 indicates high multicollinearity.
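
To make the definition concrete, the VIF can be computed by hand: regress each column on the others, take the $R^2$ of that regression, and apply the formula. This is a minimal NumPy sketch on synthetic data; the function name `vif` and the test columns are hypothetical, and each auxiliary regression here includes an intercept.

```python
import numpy as np

def vif(data):
    """VIF of each column: regress it on all other columns plus an intercept."""
    n, k = data.shape
    out = []
    for i in range(k):
        y = data[:, i]                               # feature treated as the response
        others = np.delete(data, i, axis=1)          # all remaining features
        A = np.column_stack([np.ones(n), others])    # add an intercept column
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
        out.append(1.0 / (1.0 - r2))                 # VIF = 1 / (1 - R^2)
    return np.array(out)

# Synthetic stand-in data: "b" is built from "a", "c" is independent of both.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = a + 0.3 * rng.normal(size=500)
c = rng.normal(size=500)

vifs = vif(np.column_stack([a, b, c]))
print(vifs.round(2))
```

In this setup the first two columns should come out well above the threshold of 5, while the independent column stays close to 1.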

In [44]:
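# `dataset` is not defined in this section; it is assumed to be a pandas DataFrame.
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']

# Keep only the numeric columns and fill missing values with the column means
newdf = dataset.select_dtypes(include=numerics)
newdf = newdf.fillna(newdf.mean())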

In [59]:
from statsmodels.stats.outliers_influence import variance_inflation_factor 
# Add an intercept column; variance_inflation_factor does not add one itself
X = sm.add_constant(X)
# VIF dataframe 
vif_data = pd.DataFrame() 
vif_data["feature"] = X.columns 

# calculating VIF for each feature 
vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(len(X.columns))] 

print(vif_data)


    feature    VIF
0     const 585.27
1      CRIM   1.79
2        ZN   2.30
3     INDUS   3.99
4      CHAS   1.07
5       NOX   4.39
6        RM   1.93
7       AGE   3.10
8       DIS   3.96
9       RAD   7.48
10      TAX   9.01
11  PTRATIO   1.80
12        B   1.35
13    LSTAT   2.94


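The VIF becomes infinite when $R^2$ is 1, which indicates perfect multicollinearity.
As we can see, RAD (7.48) and TAX (9.01) have VIFs above 5, so they are highly correlated with the other variables. The large value for `const` belongs to the intercept column added by `add_constant`, not to a feature, so it is not read as a sign of multicollinearity among the features.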

One approach is to remove some of the correlated regressors. Another is principal component analysis (PCA), which replaces the original features with a set of uncorrelated components.
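
The PCA option can be sketched as follows. This is a minimal example on synthetic data that computes the principal components with NumPy's SVD rather than a dedicated library; the variable names are hypothetical. It shows that the resulting component scores are mutually uncorrelated, which is what removes the multicollinearity. The trade-off is that each component is a mixture of the original features and is therefore harder to interpret.

```python
import numpy as np

# Synthetic stand-in data: "b" is built from "a", "c" is independent of both.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = a + 0.3 * rng.normal(size=500)
c = rng.normal(size=500)
data = np.column_stack([a, b, c])

# Standardize each column so that scale does not dominate the components
Z = (data - data.mean(axis=0)) / data.std(axis=0)

# Rows of Vt are the principal directions; singular values are sorted descending
U, S, Vt = np.linalg.svd(Z, full_matrices=False)
scores = Z @ Vt.T                          # data expressed in component space
explained = S**2 / np.sum(S**2)            # share of variance per component

comp_corr = np.corrcoef(scores, rowvar=False)
print(explained.round(3))
print(comp_corr.round(6))                  # off-diagonal entries are ~0
```

The component scores could then be used as regressors in place of the original features, optionally keeping only the leading components.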
