### Variance Inflation Factor (VIF)
The variance inflation factor (VIF) quantifies the severity of multicollinearity among variables in an ordinary least squares regression analysis. It provides an index that measures how much the variance (the square of the estimate's standard deviation) of an estimated regression coefficient is increased because of collinearity.
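
For reference, the index described above is usually written in terms of an auxiliary regression. The notation here is an assumption, not taken from the notebook: $R_j^2$ denotes the coefficient of determination obtained when predictor $j$ is regressed on all the other predictors.

$$
\mathrm{VIF}_j = \frac{1}{1 - R_j^2}
$$

A predictor that is unrelated to the others has $R_j^2 = 0$ and therefore a VIF of 1. As $R_j^2$ approaches 1, the VIF grows without bound.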

#### Collinearity
Collinearity is the state where two variables are highly correlated and contain similar information about the variance within a given dataset. Multicollinearity, on the other hand, is harder to detect because it emerges when three or more correlated variables are included in a model. To make matters worse, multicollinearity can emerge even when no isolated pair of variables is collinear.
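
The last point can be illustrated with a small synthetic sketch. The data and the names `base`, `combo`, and `max_pairwise` are made up for illustration and are not part of the gas price dataset. One variable is built as a noisy sum of four independent ones, so no pair is strongly correlated, yet every VIF is large. The VIFs are read off the diagonal of the inverse correlation matrix, which is equivalent to the auxiliary-regression definition.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Four independent variables plus a fifth that is (almost) their sum
base = rng.normal(size=(n, 4))
combo = base.sum(axis=1) + rng.normal(scale=0.1, size=n)
X = np.column_stack([base, combo])

# Pairwise view: the largest absolute off-diagonal correlation
corr = np.corrcoef(X, rowvar=False)
max_pairwise = np.abs(corr[np.triu_indices(X.shape[1], k=1)]).max()

# Multivariate view: VIFs are the diagonal of the inverse correlation matrix
vifs = np.diag(np.linalg.inv(corr))

print("largest pairwise correlation:", round(max_pairwise, 2))
print("VIFs:", np.round(vifs, 1))
```

A pairwise correlation matrix alone would not raise an alarm here, which is why a VIF check is worth running even when no single correlation looks extreme.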

#### Steps for Implementing VIF

1. Run a multiple regression.
2. Calculate the VIF factors.
3. Inspect the VIF for each predictor variable. As a rule of thumb, a VIF above 5 suggests that multicollinearity is likely present, and a VIF above 10 indicates a serious problem. In either case, consider dropping the variable.

In [10]:
#Imports
import pandas as pd
import numpy as np
from patsy import dmatrices
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

In [11]:
df=pd.read_excel("data/DataSet_GasPrice_ Monthly.xlsx")
df.head()

| | Days | Date | AveCoalPrice | OilPrice | GrossGasProd | TotGasCons | GasPrice | Weather | GasPriceStatus |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 31 | 2008-05-31 | 75.85 | 125.4 | 2153.316 | 1576.387 | 11.27 | SPRING | HIGH |
| 1 | 61 | 2008-06-30 | 81.18 | 133.88 | 2118.791 | 1604.249 | 12.69 | SUMMER | HIGH |
| 2 | 92 | 2008-07-31 | 89.19 | 133.37 | 2205.26 | 1708.641 | 11.09 | SUMMER | HIGH |
| 3 | 123 | 2008-08-31 | 87.05 | 116.67 | 2193.566 | 1682.924 | 8.26 | SUMMER | HIGH |
| 4 | 153 | 2008-09-30 | 85.63 | 104.11 | 1919.52 | 1460.924 | 7.67 | FALL | HIGH |


In [12]:
df = df[['GasPrice','Days', 'AveCoalPrice', 'OilPrice', 'GrossGasProd', 'TotGasCons']].dropna() #subset the dataframe

##### Step 1: Run a multiple regression

In [13]:
%%capture
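# Join all column names into the right-hand side of the formula.
# Note: df.columns still includes the response 'GasPrice', so it also enters X
# as a predictor and shows up in the VIF table. To exclude it, build the
# string from df.columns.drop('GasPrice') instead.
features = "+".join(df.columns)

# Build the response y and the design matrix X (with an intercept) from the formula
y, X = dmatrices('GasPrice ~' + features, df, return_type='dataframe')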

##### Step 2: Calculate VIF Factors

In [14]:
# For each column of X, calculate the VIF and save it in a dataframe
vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif["features"] = X.columns

##### Step 3: Inspect VIF Factors

In [15]:
vif.round(1)

| | VIF Factor | features |
|---|---|---|
| 0 | 577.1 | Intercept |
| 1 | 3.7 | GasPrice |
| 2 | 7.7 | Days |
| 3 | 4.7 | AveCoalPrice |
| 4 | 1.6 | OilPrice |
| 5 | 6.6 | GrossGasProd |
| 6 | 1.2 | TotGasCons |
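
The inspection step can also be done programmatically. The sketch below rebuilds the table above by hand so that it runs on its own, then keeps only the predictors whose VIF exceeds a cutoff. The name `THRESHOLD` and its value of 5 are assumptions based on the rule of thumb, not a fixed standard, and the intercept row is excluded because it is the constant term rather than a predictor.

```python
import pandas as pd

# VIF values copied from the table above
vif = pd.DataFrame({
    "VIF Factor": [577.1, 3.7, 7.7, 4.7, 1.6, 6.6, 1.2],
    "features": ["Intercept", "GasPrice", "Days", "AveCoalPrice",
                 "OilPrice", "GrossGasProd", "TotGasCons"],
})

THRESHOLD = 5.0  # assumed rule-of-thumb cutoff; adjust to taste

# Ignore the constant term, then flag predictors above the cutoff
predictors = vif[vif["features"] != "Intercept"]
flagged = predictors[predictors["VIF Factor"] > THRESHOLD]
flagged = flagged.sort_values("VIF Factor", ascending=False)

print(flagged)
```

With the values shown, this flags `Days` and `GrossGasProd`.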


##### Interpretation
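Days (7.7) and GrossGasProd (6.6) have the highest VIFs among the predictors. This is expected: gross gas production has been increasing over the period, so it moves closely with Days, and both variables are likely to "explain" the same variance within this dataset. The large value for Intercept belongs to the constant term rather than to a predictor and is usually ignored. We would need to discard one of these two variables before moving on to model building, or risk building a model with high multicollinearity.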
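
One way to act on this is to drop the predictor with the highest VIF, recompute, and repeat until every VIF falls below a cutoff. The following is a minimal sketch under stated assumptions: `vif_table` and `drop_high_vif` are hypothetical helpers, the cutoff of 5 is a rule of thumb, and the data is synthetic (a time index, a series that trends with it, and an unrelated series) rather than the gas price dataset.

```python
import numpy as np
import pandas as pd

def vif_table(X: pd.DataFrame) -> pd.Series:
    """VIF per column, each from an auxiliary least-squares fit with an intercept."""
    out = {}
    for col in X.columns:
        y = X[col].to_numpy()
        others = X.drop(columns=col).to_numpy()
        A = np.column_stack([np.ones(len(X)), others])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - resid.var() / y.var()
        out[col] = 1.0 / (1.0 - r2)
    return pd.Series(out)

def drop_high_vif(X: pd.DataFrame, threshold: float = 5.0) -> pd.DataFrame:
    """Repeatedly drop the column with the largest VIF until all are <= threshold."""
    X = X.copy()
    while X.shape[1] > 1:
        vifs = vif_table(X)
        if vifs.max() <= threshold:
            break
        X = X.drop(columns=vifs.idxmax())
    return X

# Synthetic stand-in: 'production' trends with 'days', 'unrelated' does not
rng = np.random.default_rng(42)
n = 120
days = np.arange(n, dtype=float)
synthetic = pd.DataFrame({
    "days": days,
    "production": 2.0 * days + rng.normal(scale=5.0, size=n),
    "unrelated": rng.normal(size=n),
})

reduced = drop_high_vif(synthetic)
print(vif_table(synthetic).round(1))
print(list(reduced.columns))
```

Dropping by VIF alone is a mechanical choice. In practice, which of two collinear variables to keep is often better decided by domain knowledge about which one is more meaningful for the model.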