# Variance Inflation Factor

### Collinearity is the state where two variables are highly correlated and contain similar information about the variance within a given dataset. To detect collinearity among variables, create a correlation matrix and look for off-diagonal entries with large absolute values. In R, use the cor function; in Python, this can be accomplished with numpy's corrcoef function.
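
### As a minimal sketch of the correlation-matrix check, the snippet below applies numpy's corrcoef to the sample values that are loaded from the file further down in this notebook. The 0.8 cutoff and the variable names `threshold` and `high_pairs` are illustrative assumptions, not a fixed rule.

```python
import numpy as np

# Sample values copied from the data frame used later in this notebook.
data = np.array([
    [5, 2, 23, 3],
    [6, 4, 4, 2],
    [7, 9, 5, 7],
    [8, 6, 8, 7],
    [9, 8, 9, 9],
], dtype=float)
names = ["alpha", "bravo", "charlie", "delta"]

# rowvar=False tells corrcoef that each column (not each row) is a variable.
corr = np.corrcoef(data, rowvar=False)

# Collect each pair whose absolute correlation exceeds the assumed cutoff.
threshold = 0.8
high_pairs = [
    (names[i], names[j], round(float(corr[i, j]), 3))
    for i in range(len(names))
    for j in range(i + 1, len(names))
    if abs(corr[i, j]) > threshold
]
print(high_pairs)
```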

### Multicollinearity, on the other hand, is more troublesome to detect because it emerges when three or more variables that are jointly highly correlated are included in a model. To make matters worse, multicollinearity can emerge even when no isolated pair of variables is collinear.
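
### The sketch below illustrates that last point with synthetic data (an assumption for illustration, not part of the sample file): nine independent columns plus a tenth that is almost exactly their sum. No single pairwise correlation looks alarming, yet the tenth column is almost perfectly explained by the other nine together.

```python
import numpy as np

rng = np.random.default_rng(0)

# Nine independent predictors plus a tenth that is almost exactly their sum.
n_samples, n_independent = 2000, 9
base = rng.normal(size=(n_samples, n_independent))
combo = base.sum(axis=1) + rng.normal(scale=0.1, size=n_samples)
X = np.column_stack([base, combo])

# Largest absolute pairwise correlation, ignoring the diagonal of ones.
corr = np.corrcoef(X, rowvar=False)
off_diagonal = corr[~np.eye(corr.shape[0], dtype=bool)]
max_pairwise = float(np.abs(off_diagonal).max())

# R^2 from regressing the tenth column on the other nine plus an intercept.
design = np.column_stack([np.ones(n_samples), base])
coef, *_ = np.linalg.lstsq(design, combo, rcond=None)
residuals = combo - design @ coef
r_squared = 1.0 - (residuals @ residuals) / ((combo - combo.mean()) ** 2).sum()

print(f"max |pairwise correlation|: {max_pairwise:.2f}")
print(f"R^2 of the combined column: {r_squared:.4f}")
```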

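### A common diagnostic for multicollinearity when testing regression assumptions is the variance inflation factor (VIF), available in R through functions such as "VIF()". Unlike many statistical concepts, its formula is straightforward. For the j-th variable, regress it on all the other variables and take the resulting $R_j^2$:

### $$ \mathrm{VIF}_j = \frac{1}{1 - R_j^2} $$

### A VIF of 1 means the variable is unrelated to the others, and the value grows without bound as $R_j^2$ approaches 1.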
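
### To make the formula concrete, here is a minimal sketch that computes the VIF directly from the R-squared of each auxiliary regression using only numpy. The function name `vif_from_r_squared` and the synthetic data are hypothetical, and the sketch assumes each regression includes an intercept.

```python
import numpy as np

def vif_from_r_squared(X):
    """Return one VIF per column of X via 1 / (1 - R^2)."""
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    vifs = []
    for j in range(n_cols):
        target = X[:, j]
        # Regress column j on an intercept plus all remaining columns.
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n_rows), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        total = ((target - target.mean()) ** 2).sum()
        r_squared = 1.0 - (residuals @ residuals) / total
        vifs.append(1.0 / (1.0 - r_squared))
    return np.array(vifs)

# Hypothetical data: x2 is roughly twice x0, while x1 is unrelated noise.
rng = np.random.default_rng(42)
x0 = rng.normal(size=200)
x1 = rng.normal(size=200)
x2 = 2.0 * x0 + rng.normal(scale=0.3, size=200)
example_vifs = vif_from_r_squared(np.column_stack([x0, x1, x2]))
print(example_vifs)
```

### In this synthetic example we would expect the first and third columns to show large VIFs and the second to stay close to 1.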

### In the example below we will demonstrate how to calculate the VIF using the variance_inflation_factor function from statsmodels.

In [16]:
#making the required imports
import pandas as pd
import numpy as np

from statsmodels.stats.outliers_influence import variance_inflation_factor


In [17]:
#reading the sample file into a data frame
df = pd.read_csv('vif_file.csv')

In [18]:
df.shape

(5, 4)

In [19]:
df.values

array([[ 5,  2, 23,  3],
       [ 6,  4,  4,  2],
       [ 7,  9,  5,  7],
       [ 8,  6,  8,  7],
       [ 9,  8,  9,  9]], dtype=int64)

In [20]:
df

   alpha  bravo  charlie  delta
0      5      2       23      3
1      6      4        4      2
2      7      9        5      7
3      8      6        8      7
4      9      8        9      9


In [21]:
#getting the VIF values for individual columns into a data frame.
vif = pd.DataFrame()

In [22]:
vif['VIF value'] = [variance_inflation_factor(df.values, i) for i in range(df.shape[1])]

vif['Feature'] = df.columns

In [23]:
vif

   VIF value  Feature
0  29.802022    alpha
1  33.305592    bravo
2   3.864328  charlie
3  24.446943    delta


### As we can see, the VIF value of every column is high except for 'charlie'. The other columns explain much of the same variance in the data set, while the outlier value (23) in 'charlie' weakens its linear relationship with them.

### Now we will change the one outlier value in column 'charlie', and we expect that its VIF will rise.

In [24]:
#changing the value from 23 to 3, using .loc to avoid chained assignment.
df.loc[0, 'charlie'] = 3

In [25]:
df

   alpha  bravo  charlie  delta
0      5      2        3      3
1      6      4        4      2
2      7      9        5      7
3      8      6        8      7
4      9      8        9      9


In [26]:
#again calculate the VIF and put values per column into a data frame. 
vif = pd.DataFrame()
vif['VIF value'] = [variance_inflation_factor(df.values, i) for i in range(df.shape[1])]

vif['Feature'] = df.columns

vif

   VIF value  Feature
0  46.465275    alpha
1  30.245177    bravo
2  65.490179  charlie
3  53.718695    delta


### In summary, VIF is a good way to detect multicollinearity in a dataset. A common rule of thumb is to consider dropping columns with a high VIF value (> 5), one at a time, recomputing the VIFs after each removal.
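
### The dropping advice can be sketched as a small loop. This version computes the VIFs as the diagonal of the inverse correlation matrix, which is a swapped-in shortcut rather than the statsmodels call used earlier, and it assumes no column is an exact linear combination of the others. The function name `drop_high_vif`, the default threshold, and the synthetic columns are all hypothetical.

```python
import numpy as np

def drop_high_vif(X, names, threshold=5.0):
    """Drop the highest-VIF column repeatedly until all VIFs <= threshold."""
    X = np.asarray(X, dtype=float)
    names = list(names)
    while X.shape[1] > 1:
        # The diagonal of the inverse correlation matrix gives the VIFs.
        vifs = np.diag(np.linalg.inv(np.corrcoef(X, rowvar=False)))
        worst = int(np.argmax(vifs))
        if vifs[worst] <= threshold:
            break
        # Remove only the single worst column, then recompute from scratch.
        X = np.delete(X, worst, axis=1)
        names.pop(worst)
    return X, names

# Hypothetical data: c is nearly a + b, so the three columns are multicollinear.
rng = np.random.default_rng(1)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + b + rng.normal(scale=0.05, size=500)
kept, kept_names = drop_high_vif(np.column_stack([a, b, c]), ["a", "b", "c"])
print(kept_names)
```

### Removing one column at a time matters because dropping a single variable can bring the VIFs of the remaining ones back under the threshold.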