#### Import libraries

In [1]:
import pandas as pd
import numpy as np

from scipy.stats import zscore
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

#### Load the data

In [2]:
raw = pd.read_csv('data.csv')
raw.head()

data = raw.copy()
data.head()

Unnamed: 0,id,price_000,yr_2001,yr_2002,yr_2003,yr_2004,yr_2005,yr_2006,apt,floor,...,pcnt_indu,pcnt_com,pcnt_insti,pcnt_vacant,pcn_green,homicides,house,ses_bin,lnprice,price_hi
0,40003,60000,1,0,0,0,0,0,1,5,...,0.0,0.0,0.49,0.0,1.74,39.92,0,0,11.0021,0
1,40007,140000,0,1,0,0,0,0,0,0,...,0.0,15.41,1.32,0.0,0.54,46.0,1,0,11.8494,1
2,40008,38000,0,1,0,0,0,0,1,1,...,0.0,8.16,5.57,0.0,1.55,45.87,0,0,10.54534,0
3,40010,110000,0,1,0,0,0,0,0,0,...,0.0,8.11,5.53,0.0,1.58,45.88,1,0,11.60824,1
4,40011,120000,0,1,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,0.4,46.0,1,0,11.69525,1


##### Recoding Dummy Variables by Hand (hint: use .get_dummies() instead)

In [46]:
data['year'] = np.where(data['yr_2001']==1, '2001',
                        np.where(data['yr_2002']==1, '2002',
                                np.where(data['yr_2003']==1, '2003',
                                        np.where(data['yr_2004']==1, '2004',
                                                np.where(data['yr_2005']==1, '2005',
                                                        np.where(data['yr_2006']==1, '2006', ''))))))
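
The nested `np.where` calls above can be collapsed into one step. The following is a minimal sketch, assuming every row has exactly one year dummy set to 1: `idxmax(axis=1)` returns the name of the column holding the largest value in each row, and stripping the `yr_` prefix leaves the year. Note that `pd.get_dummies()` goes in the opposite direction (one categorical column to dummy columns). The toy frame `demo` is made up for illustration. Unlike the `np.where` version, a row with no dummy set would get the first column's year rather than an empty string.

```python
import pandas as pd

# Hypothetical toy data: one-hot year dummies, exactly one 1 per row
demo = pd.DataFrame({
    'yr_2001': [1, 0, 0],
    'yr_2002': [0, 1, 0],
    'yr_2003': [0, 0, 1],
})

year_cols = ['yr_2001', 'yr_2002', 'yr_2003']

# idxmax(axis=1) gives the label of the column with the largest value in each row
demo['year'] = demo[year_cols].idxmax(axis=1).str.replace('yr_', '', regex=False)

# pd.get_dummies reverses the step: one categorical column -> dummy columns
dummies = pd.get_dummies(demo['year'], prefix='yr')

print(demo['year'].tolist())
print(list(dummies.columns))
```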

##### Standardizing Your Dataset

Now, let's run the same regression as we did in Lab 4, but let's standardize all the independent variables!

In [48]:
var_list = ['price_000','pop_dens','ses','house','area_m2','num_bath','pcn_green','homicides', 'year']
data = data[var_list].copy()

data.head()

Unnamed: 0,price_000,pop_dens,ses,house,area_m2,num_bath,pcn_green,homicides,year
0,60000,830.78,4,0,70,2,1.74,39.92,2001
1,140000,826.75,4,1,257,4,0.54,46.0,2002
2,38000,914.15,4,0,115,4,1.55,45.87,2002
3,110000,911.25,4,1,270,4,1.58,45.88,2002
4,120000,757.37,4,1,268,2,0.4,46.0,2002


In [49]:
y = data['price_000'] #Dependent Variable
ind_var = ['ses', 'house', 'area_m2', 'num_bath', 'pcn_green', 'homicides'] #Independent Variables

x_zscore = data[ind_var].apply(zscore).assign(Intercept = 1)
x_zscore.head(5)

Unnamed: 0,ses,house,area_m2,num_bath,pcn_green,homicides,Intercept
0,0.637882,-0.691424,-0.599922,-0.200787,-0.207483,-0.838889,1
1,0.637882,1.446291,1.571573,1.732213,-0.359941,-0.733409,1
2,0.637882,-0.691424,-0.077369,1.732213,-0.231622,-0.735664,1
3,0.637882,1.446291,1.722533,1.732213,-0.227811,-0.735491,1
4,0.637882,1.446291,1.699308,-0.200787,-0.377728,-0.733409,1
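
To make the standardization step concrete, here is a minimal sketch of what a z-score does to a single column: subtract the column mean and divide by the standard deviation. `scipy.stats.zscore` uses the population standard deviation (`ddof=0`) by default, so the sketch does the same. The series `s` is a made-up toy column, not the lab data.

```python
import pandas as pd

# Hypothetical toy column standing in for one independent variable
s = pd.Series([70.0, 257.0, 115.0, 270.0, 268.0], name='area_m2')

# z-score: subtract the mean, divide by the population standard deviation (ddof=0)
z = (s - s.mean()) / s.std(ddof=0)

# After standardizing, the column has mean 0 and standard deviation 1
# (up to floating-point rounding)
print(z.mean())
print(z.std(ddof=0))
```

Because every predictor ends up on the same scale, each coefficient in the regression below can be read as the change in price associated with a one-standard-deviation change in that predictor.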


In [50]:
model = sm.OLS(y, x_zscore).fit() ### Let's save the results under model. This will be useful for other functions below.
model.summary2()

0,1,2,3
Model:,OLS,Adj. R-squared:,0.501
Dependent Variable:,price_000,AIC:,97841.9134
Date:,2020-02-19 13:27,BIC:,97885.9296
No. Observations:,3976,Log-Likelihood:,-48914.0
Df Model:,6,F-statistic:,666.9
Df Residuals:,3969,Prob (F-statistic):,0.0
R-squared:,0.502,Scale:,2844000000.0

0,1,2,3,4,5,6
,Coef.,Std.Err.,t,P>|t|,[0.025,0.975]
ses,18048.1090,926.9705,19.4700,0.0000,16230.7260,19865.4921
house,-13049.3826,1170.1614,-11.1518,0.0000,-15343.5563,-10755.2088
area_m2,49034.2471,1262.2083,38.8480,0.0000,46559.6097,51508.8845
num_bath,10849.3381,1004.8950,10.7965,0.0000,8879.1793,12819.4968
pcn_green,-780.9053,940.7346,-0.8301,0.4065,-2625.2736,1063.4631
homicides,2500.6565,985.1252,2.5384,0.0112,569.2576,4432.0554
Intercept,93511.0455,845.7550,110.5652,0.0000,91852.8905,95169.2005

0,1,2,3
Omnibus:,3250.518,Durbin-Watson:,0.846
Prob(Omnibus):,0.0,Jarque-Bera (JB):,132799.568
Skew:,3.619,Prob(JB):,0.0
Kurtosis:,30.372,Condition No.:,3.0


##### VIF - Let's unpack this

In [51]:
# We created a list of independent variables
ind_var = ['ses', 'house', 'area_m2', 'num_bath', 'pcn_green', 'homicides'] 

# We assigned them to the variable x, and included a column of ones (the intercept)
x = data[ind_var].assign(Intercept = 1) #Independent Variables

x.head(5)

Unnamed: 0,ses,house,area_m2,num_bath,pcn_green,homicides,Intercept
0,4,0,70,2,1.74,39.92,1
1,4,1,257,4,0.54,46.0,1
2,4,0,115,4,1.55,45.87,1
3,4,1,270,4,1.58,45.88,1
4,4,1,268,2,0.4,46.0,1


Take a look at the `variance_inflation_factor` function. What does it require as inputs? Use the `shift`+`tab` command to examine the documentation or run the next line of code.

In [52]:
variance_inflation_factor?

So we can see that the function needs 2 things:
    1. the values defined as x above
    2. the variable (column number) for which you want to run the VIF

However, what happens when I run the following? 

In [53]:
variance_inflation_factor (x, 0)

TypeError: '(slice(None, None, None), 0)' is an invalid key

Yes - it errors out. Why?

It turns out that `x` is a dataframe, and in the statsmodels version used here the VIF function indexes its input like a NumPy array, which a dataframe does not support. It wants the values of the dataframe in array format. Look below: the numbers are the same, but the format is different.

In [54]:
x.head(2)

Unnamed: 0,ses,house,area_m2,num_bath,pcn_green,homicides,Intercept
0,4,0,70,2,1.74,39.92,1
1,4,1,257,4,0.54,46.0,1


In [55]:
x.values[0:2]

array([[  4.  ,   0.  ,  70.  ,   2.  ,   1.74,  39.92,   1.  ],
       [  4.  ,   1.  , 257.  ,   4.  ,   0.54,  46.  ,   1.  ]])
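
The difference between the two formats shows up in how they are indexed. As a small sketch with a made-up two-row frame `df`: the expression `[:, 0]` (all rows, first column) works on a NumPy array but not directly on a dataframe, which needs `.iloc` for positional indexing. The exact exception type raised by the dataframe depends on the pandas version, so the sketch catches a generic `Exception`.

```python
import pandas as pd

# Hypothetical toy frame, not the lab data
df = pd.DataFrame({'ses': [4, 4], 'house': [0, 1]})

arr = df.values            # NumPy array holding the same numbers
first_col = arr[:, 0]      # array-style indexing: all rows, column 0

try:
    df[:, 0]               # the same expression applied to the dataframe
    frame_indexing_ok = True
except Exception:          # exception type varies across pandas versions
    frame_indexing_ok = False

first_col_df = df.iloc[:, 0]   # positional indexing on a dataframe goes through .iloc

print(first_col, frame_indexing_ok)
```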

Now, let's examine the second parameter - the column number. If I input the column number manually, I will get one result. Play around with the second parameter and see if the VIF changes. What happens if you choose a number higher than 6? Why?

In [56]:
variance_inflation_factor(x.values, 5)

1.3567308654658161

In [57]:
variance_inflation_factor(x.values, 10)

IndexError: index 10 is out of bounds for axis 1 with size 7

That's right! There are only 7 columns and remember, the first column is assigned an index of 0. Therefore, if you use a number higher than 6, it will error out.
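
It may help to see what a single VIF value represents. The sketch below is an illustration, not the statsmodels source: it regresses one column on all the other columns and returns 1 / (1 - R²). It assumes the chosen column is not the constant and that the other columns include a column of ones, so the usual centered R² applies. The helper `vif_by_hand` and the random toy data are hypothetical.

```python
import numpy as np

def vif_by_hand(X, i):
    """VIF of column i: 1 / (1 - R^2) from regressing column i on the other columns.

    Assumes column i is not the constant and the other columns include one.
    """
    target = X[:, i]
    others = np.delete(X, i, axis=1)
    # Least-squares fit of the target column on the remaining columns
    coef, *_ = np.linalg.lstsq(others, target, rcond=None)
    resid = target - others @ coef
    ss_res = np.sum(resid ** 2)
    ss_tot = np.sum((target - target.mean()) ** 2)
    r_squared = 1.0 - ss_res / ss_tot
    return 1.0 / (1.0 - r_squared)

# Hypothetical toy data: b is built to be correlated with a
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = 0.8 * a + 0.6 * rng.normal(size=200)
X_demo = np.column_stack([a, b, np.ones(200)])

vif_a = vif_by_hand(X_demo, 0)
print(vif_a)
```

The more of a column's variation the other columns can explain, the closer R² gets to 1 and the larger the VIF becomes. A column that is uncorrelated with the rest has a VIF of 1.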

Now that we understand the function's parameters, the second part of this line of code is the "list logic" (a list comprehension). This is similar to running a for loop, but written in a single line. What we are saying is: run `variance_inflation_factor` for every value of `i` from 0 to 6. How does it know when to stop? It uses the dataframe's `shape` attribute.

In [58]:
x.shape # Outputs the shape of the dataframe as (rows, columns)
x.shape[0] # Accesses the first number - number of rows
x.shape[1] # Accesses the second number - number of columns
print(x.shape[0])
print(x.shape[1])

3976
7
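
The "list logic" described above can be compared with an ordinary for loop. In this minimal sketch, the made-up function `square` stands in for `variance_inflation_factor(x.values, i)` and `n_cols` plays the role of `x.shape[1]`. Both versions build the same list.

```python
def square(i):
    # Stand-in for variance_inflation_factor(x.values, i)
    return i * i

n_cols = 7  # plays the role of x.shape[1]

# Explicit for loop: start with an empty list and append one result per column index
results_loop = []
for i in range(n_cols):
    results_loop.append(square(i))

# The same thing written as a list comprehension, in one line
results_comp = [square(i) for i in range(n_cols)]

print(results_loop == results_comp)
```

`range(n_cols)` produces 0 through 6, which is why the loop stops at the last valid column index.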


Now we can run the function for every column and see the results.

In [59]:
vif = [variance_inflation_factor(x.values, i) for i in range(x.shape[1])]
vif

[1.2012757242368917,
 1.9142662712925385,
 2.227270076371359,
 1.4117317371948093,
 1.2372146659377956,
 1.3567308654658161,
 35.8375267326744]

Add the column names as the series index (or row name)!

In [60]:
pd.Series(vif, index=x.columns)

ses           1.201276
house         1.914266
area_m2       2.227270
num_bath      1.411732
pcn_green     1.237215
homicides     1.356731
Intercept    35.837527
dtype: float64