<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Outlier-detection-using-VIF" data-toc-modified-id="Outlier-detection-using-VIF-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Outlier detection using VIF</a></span></li><li><span><a href="#Using-numpy" data-toc-modified-id="Using-numpy-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Using numpy</a></span></li><li><span><a href="#Using-statsmodels" data-toc-modified-id="Using-statsmodels-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Using statsmodels</a></span></li><li><span><a href="#Using-scikit-learn" data-toc-modified-id="Using-scikit-learn-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Using scikit-learn</a></span></li></ul></div>

# Outlier detection using VIF
A correlation matrix shows whether any two features are pairwise correlated,
but it can miss multicollinearity, where one feature is close to a linear
combination of several others. The variance inflation factor (VIF) checks for
this. As a rule of thumb, features with VIF > 10 are considered suspicious.
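
For reference, the usual definition of VIF for feature $j$ is given here. It matches the `1/(1 - r_squared)` formula used in the scikit-learn section of this notebook. $R_j^2$ is the R-squared obtained by regressing feature $j$ on all the other features (with an intercept):

$$
\mathrm{VIF}_j = \frac{1}{1 - R_j^2},
\qquad
\mathrm{Tolerance}_j = 1 - R_j^2 = \frac{1}{\mathrm{VIF}_j}
$$

Under this definition, the VIF > 10 rule of thumb corresponds to $R_j^2 > 0.9$.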

In linear regression, collinearity can harm model performance and gives
unusually wide, unstable confidence intervals for the fitted parameters. It's
good practice to remove highly correlated features.
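
One way to act on this is to drop the feature with the largest VIF and recompute until every remaining VIF is below the threshold. The following is a minimal sketch of that idea, not a definitive recipe. The helper names `vif_from_corr` and `drop_high_vif` and the default `threshold=10.0` are our own choices for illustration, and the data are the toy columns used elsewhere in this notebook.

```python
import numpy as np


def vif_from_corr(X):
    """VIF of each column: diagonal of the inverse correlation matrix."""
    corr = np.corrcoef(X, rowvar=False)
    return np.linalg.inv(corr).diagonal()


def drop_high_vif(X, names, threshold=10.0):
    """Drop the highest-VIF column until all remaining VIFs <= threshold.

    `threshold=10.0` follows the rule of thumb above; it is an assumption,
    not a universal cutoff.
    """
    X = np.asarray(X, dtype=float)
    names = list(names)
    while X.shape[1] > 1:
        vif = vif_from_corr(X)
        worst = int(np.argmax(vif))
        if vif[worst] <= threshold:
            break
        # remove the worst offender and recompute on the reduced matrix
        X = np.delete(X, worst, axis=1)
        del names[worst]
    return names, X


a = [1, 1, 2, 3, 4]
b = [2, 2, 3, 2, 1]
c = [4, 6, 7, 8, 9]
d = [4, 3, 4, 5, 4]

X = np.c_[a, b, c, d]
kept, X_kept = drop_high_vif(X, ['a', 'b', 'c', 'd'])
print(kept)
print(vif_from_corr(X_kept))
```

Dropping one feature at a time matters because removing a single column can lower the VIF of the others, so recomputing after each removal avoids discarding more features than necessary.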

# Using numpy

This approach inverts the correlation matrix using numpy's linear algebra
module; the diagonal of the inverse holds the VIF values. It is convenient for
small datasets, but matrix inversion can become computationally expensive for
large ones. The alternative is statsmodels' `variance_inflation_factor` (after
adding a constant term), which fits each feature against the others with OLS
and uses the R-squared value to get the VIF.
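
To see that the two routes agree, the sketch below computes VIF both ways with numpy only: once from the inverse correlation matrix, and once from the R-squared of a least-squares fit of each column on the others plus an intercept. The function name `vif_by_regression` is ours, and the least-squares fit stands in for the OLS step that statsmodels performs.

```python
import numpy as np

a = [1, 1, 2, 3, 4]
b = [2, 2, 3, 2, 1]
c = [4, 6, 7, 8, 9]
d = [4, 3, 4, 5, 4]
X = np.c_[a, b, c, d].astype(float)

# Route 1: diagonal of the inverse correlation matrix
vif_inv = np.linalg.inv(np.corrcoef(X, rowvar=False)).diagonal()


def vif_by_regression(X):
    """Route 2: VIF_j = 1 / (1 - R_j^2), regressing column j on the rest."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])  # intercept + other columns
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        centered = y - y.mean()
        r_squared = 1.0 - (resid @ resid) / (centered @ centered)
        out[j] = 1.0 / (1.0 - r_squared)
    return out


vif_reg = vif_by_regression(X)
print(vif_inv)
print(vif_reg)
```

Both arrays should match the values printed by the cells in this notebook for columns `a`, `b`, `c`, `d`.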

In [1]:
import numpy as np
import pandas as pd


a = [1, 1, 2, 3, 4]
b = [2, 2, 3, 2, 1]
c = [4, 6, 7, 8, 9]
d = [4, 3, 4, 5, 4]

X = np.c_[a, b, c, d]

cc = np.corrcoef(X, rowvar=False)
matrix_vif = np.linalg.inv(cc)
arr_vif = matrix_vif.diagonal()


# pandas series
ser_vif = pd.Series(arr_vif, index='a b c d'.split())
print(ser_vif)

# pandas dataframe: the diagonal of the inverse correlation matrix holds the VIFs
df = pd.DataFrame({'a': a, 'b': b, 'c': c, 'd': d})
df_cor = df.corr()

df_vif = pd.DataFrame(np.linalg.inv(df_cor.values),
                      index=df_cor.index,
                      columns=df_cor.columns)

df_vif

a    22.95
b     3.00
c    12.95
d     3.00
dtype: float64


           a         b          c         d
a  22.950000  6.453681 -16.301917 -6.453681
b   6.453681  3.000000  -4.080441 -2.000000
c -16.301917 -4.080441  12.950000  4.080441
d  -6.453681 -2.000000   4.080441  3.000000


# Using statsmodels

In [2]:
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant
import statsmodels.api as sm
import statsmodels.formula.api as smf
import statsmodels.robust as smrb # smrb.mad() etc

In [3]:
df = pd.DataFrame(
    {'a': [1, 1, 2, 3, 4],
     'b': [2, 2, 3, 2, 1],
     'c': [4, 6, 7, 8, 9],
     'd': [4, 3, 4, 5, 4]}
)

In [4]:
X = df.values
X1 = sm.add_constant(X) # we need to add constant to get VIF
X1

array([[1., 1., 2., 4., 4.],
       [1., 1., 2., 6., 3.],
       [1., 2., 3., 7., 4.],
       [1., 3., 2., 8., 5.],
       [1., 4., 1., 9., 4.]])

In [5]:
# loop over the columns (features) of X1, not over its rows
vif = [variance_inflation_factor(X1, i) for i in range(X1.shape[1])]
vif

[136.87499999999918,
 22.950000000000042,
 2.9999999999999987,
 12.950000000000006,
 3.000000000000005]

In [6]:
ser_vif = pd.Series(vif, index='constant a b c d'.split())
ser_vif

constant    136.875
a            22.950
b             3.000
c            12.950
d             3.000
dtype: float64

In [7]:
# using pandas to add constant
df_X1 = df.assign(const=1.0)
vif = [variance_inflation_factor(df_X1.values, i) for i in range(df_X1.shape[1])]
ser_vif = pd.Series(vif, index='a b c d constant'.split())
ser_vif

a            22.950
b             3.000
c            12.950
d             3.000
constant    136.875
dtype: float64
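
The sketch below illustrates why the constant column matters. It assumes, to our understanding of statsmodels, that when the design matrix has no constant column the OLS R-squared is computed against zero (uncentered) instead of against the mean, so the resulting VIF no longer matches the usual definition. The helper `vif_ols` is a hypothetical numpy stand-in written for this illustration, not the statsmodels implementation.

```python
import numpy as np


def vif_ols(exog, idx, centered):
    """VIF = TSS / SSR for column `idx` regressed on the remaining columns.

    centered=True  -> TSS around the mean (design matrix includes a constant)
    centered=False -> TSS around zero (assumed behaviour without a constant)
    """
    y = exog[:, idx]
    A = np.delete(exog, idx, axis=1)
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    tss = ((y - y.mean()) ** 2).sum() if centered else (y ** 2).sum()
    return tss / (resid @ resid)  # equals 1 / (1 - R^2)


X = np.array([[1, 2, 4, 4],
              [1, 2, 6, 3],
              [2, 3, 7, 4],
              [3, 2, 8, 5],
              [4, 1, 9, 4]], dtype=float)

# with a constant column prepended: skip index 0 (the constant itself)
X1 = np.column_stack([np.ones(len(X)), X])
vif_with_const = np.array([vif_ols(X1, i, centered=True) for i in range(1, 5)])

# without a constant column
vif_no_const = np.array([vif_ols(X, i, centered=False) for i in range(4)])

print(vif_with_const)
print(vif_no_const)
```

Only the first array is expected to reproduce the values for `a`, `b`, `c`, `d` shown above. The VIF reported for the constant column itself is usually not of interest and can be ignored.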

# Using scikit-learn
Reference: https://stackoverflow.com/questions/42658379/variance-inflation-factor-in-python

In [11]:
def my_sklearn_vif(exogs, data):
    """Return a DataFrame with the VIF and tolerance of each column in `exogs`."""
    import pandas as pd
    from sklearn.linear_model import LinearRegression

    # initialize dictionaries
    vif_dict, tolerance_dict = {}, {}

    # form input data for each exogenous variable
    for exog in exogs:
        not_exog = [i for i in exogs if i != exog]
        X, y = data[not_exog], data[exog]

        # extract r-squared from the fit
        r_squared = LinearRegression().fit(X, y).score(X, y)

        # calculate VIF
        vif = 1/(1 - r_squared)
        vif_dict[exog] = vif

        # calculate tolerance
        tolerance = 1 - r_squared
        tolerance_dict[exog] = tolerance

    # return VIF DataFrame
    df_vif = pd.DataFrame({'VIF': vif_dict, 'Tolerance': tolerance_dict})

    return df_vif

In [12]:
import seaborn as sns

df = sns.load_dataset('car_crashes')
exogs = ['alcohol', 'speeding', 'no_previous', 'not_distracted']
my_sklearn_vif(exogs=exogs, data=df)

                     VIF  Tolerance
alcohol         3.436072   0.291030
speeding        1.884340   0.530690
no_previous     3.113984   0.321132
not_distracted  2.668456   0.374749


In [14]:
df = pd.DataFrame(
    {'a': [1, 1, 2, 3, 4],
     'b': [2, 2, 3, 2, 1],
     'c': [4, 6, 7, 8, 9],
     'd': [4, 3, 4, 5, 4]}
)

exogs = ['a', 'b', 'c', 'd']
my_sklearn_vif(exogs=exogs, data=df)

     VIF  Tolerance
a  22.95   0.043573
b   3.00   0.333333
c  12.95   0.077220
d   3.00   0.333333
