**ECO 5100 Introductory Statistics and Econometrics (Fall 2019)**

**Variance Inflation Factor (VIF)**

1) Let's study the relationship between height and wages among British men.

**Data from:**
Persico, Nicola, Andrew Postlewaite, and Dan Silverman. 2004. The Effect of Adolescent Experience on Labor Market Outcomes: The Case of Height. Journal of Political Economy 112 (5): 1019–1053.

**Variables for Height and Wages Data in Britain**

**gwage33**: Hourly wages (in British pounds) at age 33

**height33**: Height (in inches) measured at age 33

**height16**: Height (in inches) measured at age 16

**height07**: Height (in inches) measured at age 7

**momed**: Education of mother, measured in years

**daded**: Education of father, measured in years

**siblings**: Number of siblings

**Ht16Noisy**: Height (in inches) measured at age 16 with measurement error added in


In [0]:
import pandas as pd
file = "https://github.com/VitorKamada/ECO6100/raw/master/Data/heightwage_british_males_multivariate.csv" 
df = pd.read_csv(file)

In [2]:
df.head()

Unnamed: 0,siblings,height07,height16,height33,gwage33,momed,daded,Ht16Noisy
0,,47.007847,65.354324,66.929131,12.428572,9.5,9.5,70.244843
1,4.0,49.999981,,70.078743,7.027027,,,
2,4.0,49.015743,,70.078743,5.121951,,,
3,3.0,47.992119,,68.897636,8.333333,10.5,10.5,
4,,45.984222,,66.141731,11.333333,9.5,9.5,


(a)	Persico, Postlewaite, and Silverman (2004) argue that adolescent height is most relevant because it is height at adolescent ages that affects the self-confidence to develop interpersonal skills at a young age. Estimate a model with wages at age 33 as the dependent variable and height both at age 33 and at age 16 as independent variables. What happens to the coefficient on height at age 33? Explain what is going on here.

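The coefficient on adolescent height (height16) is statistically significant (p = 0.015), with the coefficient indicating that an additional inch of height at age 16 is associated with wages that are 0.32 British pounds per hour higher at age 33. In the naive model with height33 alone, adult height is statistically significant (the F-test p-value is 0.0002); once height16 is added, there is no discernible effect of adult height on wages.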

In [0]:
import numpy as np
df['const'] = 1  # intercept column used by every OLS model below

In [4]:
import statsmodels.api as sm
naive = sm.OLS(df['gwage33'], df[['const', 'height33']],
               missing='drop').fit()
print(naive.summary())

                            OLS Regression Results                            
Dep. Variable:                gwage33   R-squared:                       0.004
Model:                            OLS   Adj. R-squared:                  0.003
Method:                 Least Squares   F-statistic:                     13.58
Date:                Sat, 05 Oct 2019   Prob (F-statistic):           0.000232
Time:                        23:01:02   Log-Likelihood:                -14062.
No. Observations:                3597   AIC:                         2.813e+04
Df Residuals:                    3595   BIC:                         2.814e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        -10.2042      5.303     -1.924      0.0

In [5]:
result1 = sm.OLS(df['gwage33'], df[['const', 'height33', 'height16']],
                    missing='drop').fit()
print(result1.summary())

                            OLS Regression Results                            
Dep. Variable:                gwage33   R-squared:                       0.005
Model:                            OLS   Adj. R-squared:                  0.004
Method:                 Least Squares   F-statistic:                     6.423
Date:                Sat, 05 Oct 2019   Prob (F-statistic):            0.00165
Time:                        23:01:02   Log-Likelihood:                -10301.
No. Observations:                2592   AIC:                         2.061e+04
Df Residuals:                    2589   BIC:                         2.063e+04
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         -9.5891      6.763     -1.418      0.1

(b)	Let’s keep going. Add height at age 7 to the above model and discuss the results. Be sure to note changes in sample size (and its possible effects), and discuss the implications of adding a variable with the statistical significance observed for the “height at age 7” variable.

The results are different because now the coefficient on adolescent height (height16) is no longer statistically significant. The coefficients on the other two height variables are also not statistically significant.
However, note that the coefficient on adolescent height is close to conventional significance levels (p = 0.104) and that we have introduced two factors that likely reduce our statistical power (and thereby increase our standard errors). First, the sample size is smaller: it falls from 2,592 to 2,199 because height at age 7 is missing for almost 400 people. Second, we have added a variable that may be correlated with the other height variables (more on this below).
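The shrinking sample comes from listwise deletion: with `missing='drop'`, a row is discarded if any variable in the model is missing. The sketch below counts complete cases per specification. It runs on simulated stand-in data, not the real file; the frame `toy`, the helper `complete_cases`, and the missingness rates are assumptions made only for illustration.

```python
import numpy as np
import pandas as pd

# Simulated stand-in for the height/wage data (NOT the real dataset).
rng = np.random.default_rng(0)
n = 1000
toy = pd.DataFrame({
    "gwage33": rng.normal(9, 4, n),
    "height33": rng.normal(69, 3, n),
    "height16": rng.normal(67, 3, n),
    "height07": rng.normal(48, 2, n),
})
# Knock out values at random to mimic missing height measurements.
toy.loc[rng.random(n) < 0.25, "height16"] = np.nan
toy.loc[rng.random(n) < 0.15, "height07"] = np.nan


def complete_cases(data, columns):
    """Count rows with no missing value in any of the given columns."""
    return int(data[columns].notna().all(axis=1).sum())


# Variables used by each specification (dependent variable included).
specs = {
    "naive": ["gwage33", "height33"],
    "part_a": ["gwage33", "height33", "height16"],
    "part_b": ["gwage33", "height33", "height16", "height07"],
}
counts = {name: complete_cases(toy, cols) for name, cols in specs.items()}
for name, k in counts.items():
    print(name, k)
```

On the real data, the same count for each specification should match the "No. Observations" line in the corresponding regression summary.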


In [6]:
result2 = sm.OLS(df['gwage33'], df[['const', 'height33', 'height16', 'height07']],
                    missing='drop').fit()
print(result2.summary())

                            OLS Regression Results                            
Dep. Variable:                gwage33   R-squared:                       0.004
Model:                            OLS   Adj. R-squared:                  0.003
Method:                 Least Squares   F-statistic:                     3.094
Date:                Sat, 05 Oct 2019   Prob (F-statistic):             0.0260
Time:                        23:01:02   Log-Likelihood:                -8667.2
No. Observations:                2199   AIC:                         1.734e+04
Df Residuals:                    2195   BIC:                         1.737e+04
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         -7.9765      7.212     -1.106      0.2

(c)	Is there multicollinearity in the model from part (b)? Qualify the degree of multicollinearity and indicate its consequences. Specify whether the multicollinearity will bias coefficients or have some other effect.

First, it seems plausible that the various height measures will be collinear because people who are tall as adults probably tended to be taller at each age. One direct way to assess multicollinearity is the variance inflation factor (the “vif” command in Stata). The VIFs computed below indicate how much the variance of each coefficient estimate is inflated because of multicollinearity. The variance of the coefficient on height16, for example, is 2.99 times larger than it would be if height16 were not at all linearly correlated with the other two independent variables.
Note this does not say the variance on this variable is incorrect, only that it is larger than it would be if we had independent variables that were not linearly related. Multicollinearity also does not bias the coefficient estimates.


We can also use an auxiliary regression to assess the degree of multicollinearity directly. Below we show the results when height33 is the dependent variable and the other two height variables are the independent variables. The other two variables are strong linear predictors of height33: the R2 is 0.6335. The R2 from this equation is tied to the VIF: as computed below the regression results, the VIF for height33 equals 1/(1 - R2), where R2 comes from the auxiliary regression in which height33 is the dependent variable.
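As a worked check of this relationship, here is the standard textbook expression for the variance of an OLS coefficient, with the VIF isolated as its second factor. The symbols $\hat\sigma^2$ (estimated error variance), $x_{ij}$ (value of regressor $j$ for person $i$), and $R_j^2$ (R2 from the auxiliary regression of regressor $j$ on the others) are notation introduced here, not taken from the notebook.

```latex
\operatorname{var}(\hat\beta_j)
  = \frac{\hat\sigma^2}{\sum_i (x_{ij} - \bar{x}_j)^2} \cdot \frac{1}{1 - R_j^2},
\qquad
\mathrm{VIF}_j = \frac{1}{1 - R_j^2}

% For height33, using the auxiliary R^2 reported above:
\mathrm{VIF}_{\text{height33}} = \frac{1}{1 - 0.6335} \approx 2.73,
\qquad
\sqrt{\mathrm{VIF}_{\text{height33}}} \approx 1.65
```

Because the standard error is the square root of the variance, a VIF of about 2.73 corresponds to a standard error roughly 1.65 times larger than it would be with uncorrelated regressors.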

In [7]:
result3 = sm.OLS(df['height33'], df[['const', 'height16', 'height07']],
                    missing='drop').fit()
print(result3.summary())

                            OLS Regression Results                            
Dep. Variable:               height33   R-squared:                       0.633
Model:                            OLS   Adj. R-squared:                  0.633
Method:                 Least Squares   F-statistic:                     1898.
Date:                Sat, 05 Oct 2019   Prob (F-statistic):               0.00
Time:                        23:01:02   Log-Likelihood:                -4135.1
No. Observations:                2199   AIC:                             8276.
Df Residuals:                    2196   BIC:                             8293.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         20.7149      0.805     25.744      0.0

In [8]:
vif_height33 = 1/(1-result3.rsquared)
vif_height33

2.7283033771949476

In [9]:
result4 = sm.OLS(df['height16'], df[['const', 'height33', 'height07']],
                    missing='drop').fit()
vif_height16 = 1/(1-result4.rsquared)
vif_height16

2.9896686747771

In [10]:
result5 = sm.OLS(df['height07'], df[['const', 'height33', 'height16']],
                    missing='drop').fit()
vif_height07 = 1/(1-result5.rsquared)
vif_height07

2.304329526258211

(d)	Perhaps characteristics of parents affect height (some parents force kids to eat veggies, whereas others give them only French fries and Fanta). Add the two parental education variables to the model and discuss results. Include only height at age 16 (meaning we do not include the height at ages 33 and 7 for this question—although feel free to include them too on your own; the results are interesting). 

We see that mother’s education is significantly associated with higher wages (p = 0.024), but that father’s education is not. This may be surprising given that this is a sample of men. The adolescent height variable is statistically significant (p = 0.002) and roughly the same magnitude as we have seen earlier.  

In [11]:
result6 = sm.OLS(df['gwage33'], df[['const', 'height16', 'momed', 'daded']],
                    missing='drop').fit()
print(result6.summary())

                            OLS Regression Results                            
Dep. Variable:                gwage33   R-squared:                       0.012
Model:                            OLS   Adj. R-squared:                  0.010
Method:                 Least Squares   F-statistic:                     8.594
Date:                Sat, 05 Oct 2019   Prob (F-statistic):           1.14e-05
Time:                        23:01:02   Log-Likelihood:                -8738.4
No. Observations:                2190   AIC:                         1.748e+04
Df Residuals:                    2186   BIC:                         1.751e+04
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        -16.1086      6.306     -2.554      0.0

(e)	Perhaps kids had their food stolen by greedy siblings. Add the number of siblings to the model and discuss results.

The sibling variable does indeed indicate that wages are lower for British men with more siblings. The effect is statistically significant (p = 0.024). This is consistent with the idea that parents are less able to invest time and money into children the more children they have. There are also other possibilities (perhaps subcultures that have more children also are less well off), so we should consider it suggestive rather than definitive. The coefficient on height continues to be statistically significant, but the coefficient on mother’s education is no longer significant. This suggests that one reason that mother’s education was predictive of higher wages was that women with more education tended to have fewer children.

In [12]:
result7 = sm.OLS(df['gwage33'], df[['const', 'height16', 'momed', 'daded', 'siblings']],
                 missing='drop').fit()
print(result7.summary())

                            OLS Regression Results                            
Dep. Variable:                gwage33   R-squared:                       0.012
Model:                            OLS   Adj. R-squared:                  0.010
Method:                 Least Squares   F-statistic:                     6.088
Date:                Sat, 05 Oct 2019   Prob (F-statistic):           7.23e-05
Time:                        23:01:02   Log-Likelihood:                -7859.8
No. Observations:                1970   AIC:                         1.573e+04
Df Residuals:                    1965   BIC:                         1.576e+04
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        -11.5214      6.924     -1.664      0.0