## Introduction

- The fundamental assumption for consistency of least-squares estimators is that the model error term is unrelated to the regressors, i.e., $E(u|x) = 0$.
- If this assumption fails, the ordinary least-squares (OLS) estimator is inconsistent and can no longer be given a causal interpretation.

- The instrumental-variables (IV) estimator is consistent under the very strong assumption that valid instruments exist, where the instruments z are variables that are correlated with the regressors x and that satisfy $E(u|z) = 0$.
- The IV approach is the original and leading approach for estimating the parameters of models with endogenous regressors and of errors-in-variables models.
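
The contrast between OLS and IV can be illustrated with a small simulation. This is a minimal sketch on hypothetical simulated data (not the MEPS extract used below); the data-generating process, the coefficient values, and the helper `cov` are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Hypothetical data-generating process (not the MEPS data).
z = rng.normal(size=n)                    # instrument: moves x, unrelated to u
v = rng.normal(size=n)                    # first-stage error
u = 0.8 * v + 0.6 * rng.normal(size=n)    # structural error, correlated with v
x = 1.0 * z + v                           # endogenous regressor: E(u|x) != 0
y = 2.0 * x + u                           # true slope is 2.0


def cov(a, b):
    """Sample covariance of two 1-D arrays."""
    return np.mean((a - a.mean()) * (b - b.mean()))


# OLS slope converges to 2 + cov(x, u) / var(x) = 2 + 0.8 / 2 = 2.4.
beta_ols = cov(x, y) / cov(x, x)
# Simple IV slope converges to the true value because cov(z, u) = 0.
beta_iv = cov(z, y) / cov(z, x)

print(f"OLS slope: {beta_ols:.3f}")
print(f"IV slope:  {beta_iv:.3f}")
```

In this setup the OLS slope settles near 2.4 rather than 2.0, while the IV slope stays close to the true value.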

## IV example

- We consider a model with one endogenous regressor, several exogenous regressors, and one or more excluded exogenous variables that serve as the identifying instruments.
- The dataset is an extract from the Medical Expenditure Panel Survey (MEPS) of individuals over the age of 65 years, similar to the dataset described before.

- The equation to be estimated has the dependent variable ldrugexp, the log of total out-of-pocket expenditures on prescribed medications. 
- The regressors are

- an indicator for whether the individual holds either employer or union-sponsored health insurance (hi_empunion), 
- number of chronic conditions (totchr), 
- and four sociodemographic variables: age in years (age), indicators for whether female (female) and whether black or Hispanic (blhisp), and the natural logarithm of annual household income in thousands of dollars (linc).

- We treat the health insurance variable hi_empunion as endogenous.
- The intuitive justification is that having such supplementary insurance on top of the near universal Medicare insurance for the elderly may be a choice variable.
- Even though most individuals in the sample are no longer working, those who expected high future medical expenses might have been more likely to choose a job when they were working that would provide supplementary health insurance upon retirement.

In [1]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf 

In [2]:
data=pd.read_stata('data/mus06data.dta')
controls = ['totchr', 'age', 'female', 'blhisp', 'linc']
data[['ldrugexp','hi_empunion']+controls].describe([])

| | ldrugexp | hi_empunion | totchr | age | female | blhisp | linc |
|---|---|---|---|---|---|---|---|
| count | 10391.0 | 10391.0 | 10391.0 | 10391.0 | 10391.0 | 10391.0 | 10089.0 |
| mean | 6.479664 | 0.379655 | 1.860745 | 75.046386 | 0.579732 | 0.17034 | 2.743271 |
| std | 1.363393 | 0.485324 | 1.290131 | 6.69368 | 0.493626 | 0.375956 | 0.913144 |
| min | 0.0 | 0.0 | 0.0 | 65.0 | 0.0 | 0.0 | -6.907755 |
| 50% | 6.677083 | 0.0 | 2.0 | 74.0 | 1.0 | 0.0 | 2.74316 |
| max | 10.180172 | 1.0 | 9.0 | 91.0 | 1.0 | 1.0 | 5.744476 |


In [3]:
data=data.dropna(subset=['linc'])

### Available instruments

- We consider four potential instruments for hi_empunion. 
- Two reflect the income status of the individual and two are based on employer characteristics.

- The `ssiratio` instrument is the ratio of an individual's social security income to the individual's income from all sources, with high values indicating a significant income constraint.
- The `lowincome` instrument is a qualitative indicator of low-income status.
- The `firmsz` instrument measures the size of the firm's employed labor force, 
- and the `multlc` instrument indicates whether the firm is a large operator with multiple locations. 

In [4]:
instruments = ['ssiratio', 'lowincome', 'multlc', 'firmsz']
data.loc[:,instruments].describe([])

| | ssiratio | lowincome | multlc | firmsz |
|---|---|---|---|---|
| count | 10089.0 | 10089.0 | 10089.0 | 10089.0 |
| mean | 0.536544 | 0.187432 | 0.062048 | 0.140528 |
| std | 0.36782 | 0.390277 | 0.241254 | 2.170324 |
| min | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 0.504522 | 0.0 | 0.0 | 0.0 |
| max | 9.25062 | 1.0 | 1.0 | 50.0 |


- We have four available instruments for one endogenous regressor. 
- The obvious approach is to use all available instruments, because in theory this leads to the most efficient estimator. 
- In practice, it may lead to larger small-sample bias, because the small-sample bias of IV estimators increases with the number of instruments (Hahn and Hausman 2002).

- At a minimum, it is informative to view the gross correlation between endogenous variables and instruments and between instruments. 
- When multiple instruments are available, then it is actually the partial correlation after controlling for other available instruments that matters. 
- This important step is deferred to later sections.

### IV estimation of an exactly identified model

- We begin with IV regression of ldrugexp on the endogenous regressor `hi_empunion`, instrumented by the single instrument `ssiratio`, and several exogenous regressors.

In [5]:
from linearmodels.iv import IV2SLS

formula = 'ldrugexp ~ 1 + totchr + female + age + linc + blhisp + [hi_empunion ~ ssiratio]'
res = IV2SLS.from_formula(formula, data).fit(cov_type='robust')
print(res.first_stage)
print('\n'.join(res.summary.as_text().split('\n')[8:]))


    First Stage Estimation Results    
                           hi_empunion
--------------------------------------
R-squared                       0.0761
Partial R-squared               0.0179
Shea's R-squared                0.0179
Partial F-statistic             65.806
P-value (Partial F-stat)     4.441e-16
Partial F-stat Distn           chi2(1)
Intercept                       1.0290
                              (17.705)
totchr                          0.0128
                              (3.4896)
female                         -0.0734
                             (-7.6226)
age                            -0.0086
                             (-12.184)
linc                            0.0484
                              (7.3266)
blhisp                         -0.0627
                             (-5.1084)
ssiratio                       -0.1916
                             (-8.1121)
--------------------------------------

T-stats reported in parentheses
T-stats use same covariance typ
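
To make the mechanics of the exactly identified case concrete, the following sketch computes 2SLS by hand on hypothetical simulated data and checks that it coincides with the simple IV formula $(Z'X)^{-1}Z'y$. The variable names (`w`, `d`, `z`) and coefficient values are assumptions for illustration; this is not how `linearmodels` is implemented internally.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5_000

# Hypothetical simulated data: one exogenous control w, one endogenous
# regressor d, and one excluded instrument z.
w = rng.normal(size=n)
z = rng.normal(size=n)
v = rng.normal(size=n)
u = 0.7 * v + rng.normal(size=n)      # structural error, correlated with d via v
d = 0.5 * z + 0.3 * w + v
y = 1.0 + 0.5 * w - 1.0 * d + u       # true coefficient on d is -1.0

const = np.ones(n)
X = np.column_stack([const, w, d])    # structural regressors
Z = np.column_stack([const, w, z])    # instruments: exogenous regressors plus z

# Stage 1: project every column of X on Z (exogenous columns are reproduced).
X_hat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]
# Stage 2: regress y on the fitted regressors.
beta_2sls = np.linalg.lstsq(X_hat, y, rcond=None)[0]

# With as many instruments as regressors, 2SLS collapses to (Z'X)^{-1} Z'y.
beta_iv = np.linalg.solve(Z.T @ X, Z.T @ y)

print(beta_2sls)
print(beta_iv)
```

Note that the second-stage standard errors from a naive regression on `X_hat` would be wrong, which is one reason to rely on a dedicated IV routine in practice.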

### IV estimation of an overidentified model

- We next consider estimation of an overidentified model. 
- Then different estimates are obtained by 2SLS estimation and by different variants of GMM.

- We use two instruments, `ssiratio` and `multlc`, for `hi_empunion`, the endogenous regressor. 
- The first estimator is 2SLS, with standard errors that correct for heteroskedasticity.
- The second estimator is optimal GMM given heteroskedastic errors.

- The third estimator, continuously updating GMM, simultaneously optimizes the moment conditions and the weighting matrix.
- The fourth estimator is one that illustrates optimal GMM with clustered errors by clustering on age. 
- The final estimator is the same as the first but reports default standard errors that do not adjust for heteroskedasticity.
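
The difference between 2SLS and optimal GMM comes down to the weighting matrix applied to the moment conditions. The sketch below, on hypothetical simulated data with heteroskedastic errors, computes both in closed form; the helper `linear_gmm` and all variable names are assumptions for illustration rather than the library's implementation.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5_000

# Hypothetical simulated data: two excluded instruments (z1, z2) for one
# endogenous regressor d, with heteroskedastic structural errors.
w = rng.normal(size=n)
z1 = rng.normal(size=n)
z2 = rng.normal(size=n)
v = rng.normal(size=n)
u = (0.7 * v + rng.normal(size=n)) * np.exp(0.3 * w)
d = 0.5 * z1 + 0.4 * z2 + 0.3 * w + v
y = 1.0 + 0.5 * w - 1.0 * d + u       # true coefficient on d is -1.0

const = np.ones(n)
X = np.column_stack([const, w, d])
Z = np.column_stack([const, w, z1, z2])


def linear_gmm(X, Z, y, W):
    """Minimize (Z'(y - Xb))' W (Z'(y - Xb)) over b (closed form)."""
    ZX, Zy = Z.T @ X, Z.T @ y
    return np.linalg.solve(ZX.T @ W @ ZX, ZX.T @ W @ Zy)


# Step 1: 2SLS is GMM with weighting matrix (Z'Z)^{-1}.
beta_2sls = linear_gmm(X, Z, y, np.linalg.inv(Z.T @ Z))

# Step 2: re-weight by the inverse of the heteroskedasticity-robust estimate
# of the moment covariance, S = sum_i u_i^2 z_i z_i' / n.
resid = y - X @ beta_2sls
S = (Z * resid[:, None] ** 2).T @ Z / n
beta_gmm = linear_gmm(X, Z, y, np.linalg.inv(S))

print("2SLS:", beta_2sls)
print("GMM: ", beta_gmm)
```

Both estimators are consistent here, so the point estimates should be close; the gain from the optimal weighting matrix is in efficiency.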

In [6]:
from linearmodels.iv import IVGMM, IVGMMCUE
from linearmodels.iv.results import compare

formula = 'ldrugexp ~ 1 + totchr + female + age + linc + blhisp + [hi_empunion ~ ssiratio+multlc]'
res_2sls = IV2SLS.from_formula(formula, data).fit(cov_type='robust')
res_gmm = IVGMM.from_formula(formula, data).fit(cov_type='robust')
res_gmm_cue = IVGMMCUE.from_formula(formula, data).fit(cov_type='robust')
res_gmm_clustered = IVGMM.from_formula(formula, data).fit(cov_type='clustered', clusters=data.age)
res_2sls_std = IV2SLS.from_formula(formula, data).fit(cov_type='unadjusted')
compare([res_2sls,res_gmm,res_gmm_cue,res_gmm_clustered,res_2sls_std])

| | Model 0 | Model 1 | Model 2 | Model 3 | Model 4 |
|---|---|---|---|---|---|
| Dep. Variable | ldrugexp | ldrugexp | ldrugexp | ldrugexp | ldrugexp |
| Estimator | IV-2SLS | IV-GMM | IV-GMM | IV-GMM | IV-2SLS |
| No. Observations | 10089 | 10089 | 10089 | 10089 | 10089 |
| Cov. Est. | robust | robust | robust | clustered | unadjusted |
| R-squared | 0.0414 | 0.0406 | 0.0388 | 0.0406 | 0.0414 |
| Adj. R-squared | 0.0409 | 0.0400 | 0.0382 | 0.0400 | 0.0409 |
| F-statistic | 1955.4 | 1952.6 | 1949.2 | 1613.6 | 1882.3 |
| P-value (F-stat) | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |


###  Testing for regressor endogeneity

- The Hausman test principle provides a way to test whether a regressor is endogenous.
- The test usually compares just the coefficients of the endogenous variables in IV and OLS estimations.
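
One common way to operationalize this idea is a regression-based (control-function) test: add the first-stage residual to the structural equation and test whether its coefficient is zero. The sketch below does this on hypothetical simulated data with homoskedastic standard errors; it is an illustration of the principle under stated assumptions, not a description of what `wu_hausman()` computes internally.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5_000

# Hypothetical simulated data in which d is endogenous by construction.
w = rng.normal(size=n)
z = rng.normal(size=n)
v = rng.normal(size=n)
u = 0.7 * v + rng.normal(size=n)
d = 0.5 * z + 0.3 * w + v
y = 1.0 + 0.5 * w - 1.0 * d + u

const = np.ones(n)
X = np.column_stack([const, w, d])
Z = np.column_stack([const, w, z])

# First stage: residual of the suspect regressor after projecting on all instruments.
v_hat = d - Z @ np.linalg.lstsq(Z, d, rcond=None)[0]

# Augmented regression: y on X plus the first-stage residual.
X_aug = np.column_stack([X, v_hat])
coef = np.linalg.lstsq(X_aug, y, rcond=None)[0]
resid = y - X_aug @ coef
sigma2 = resid @ resid / (n - X_aug.shape[1])
se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X_aug.T @ X_aug)))

# A large |t| on v_hat is evidence against exogeneity of d.
t_vhat = coef[-1] / se[-1]
print(f"t-statistic on first-stage residual: {t_vhat:.2f}")
```

A useful side effect is that the coefficient on `d` in the augmented regression equals the 2SLS estimate.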

In [7]:
res.wu_hausman()

Wu-Hausman test of exogeneity
H0: All endogenous variables are exogenous
Statistic: 25.3253
P-value: 0.0000
Distributed: F(1,10081)
WaldTestStatistic, id: 0x26571d90bc8

### Tests of overidentifying restrictions

- The validity of an instrument cannot be tested in a just-identified model. 
- But it is possible to test the validity of overidentifying instruments in an overidentified model provided that the parameters of the model are estimated using optimal GMM.
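
The overidentification statistic can be sketched directly from the two-step GMM objective: evaluate the sample moments at the optimal GMM estimate and form $J = n\,\bar g' S^{-1} \bar g$. The function `hansen_j` and the simulated data below are assumptions for illustration; the second outcome deliberately violates the exclusion restriction for one instrument.

```python
import numpy as np


def hansen_j(X, Z, y):
    """Two-step GMM J statistic: n * gbar' S^{-1} gbar at the GMM estimate."""
    n = len(y)
    ZX, Zy = Z.T @ X, Z.T @ y

    def gmm(W):
        return np.linalg.solve(ZX.T @ W @ ZX, ZX.T @ W @ Zy)

    b1 = gmm(np.linalg.inv(Z.T @ Z))           # first step: 2SLS
    e1 = y - X @ b1
    S = (Z * e1[:, None] ** 2).T @ Z / n       # robust moment covariance
    b2 = gmm(np.linalg.inv(S))                 # second step: optimal GMM
    gbar = Z.T @ (y - X @ b2) / n              # sample moments at the estimate
    return n * gbar @ np.linalg.solve(S, gbar)


rng = np.random.default_rng(4)
n = 5_000
z1, z2, v = rng.normal(size=(3, n))
u = 0.7 * v + rng.normal(size=n)
d = 0.5 * z1 + 0.5 * z2 + v
const = np.ones(n)
X = np.column_stack([const, d])
Z = np.column_stack([const, z1, z2])

y_valid = 1.0 - 1.0 * d + u                 # both instruments excluded from y
y_invalid = 1.0 - 1.0 * d + 0.5 * z2 + u    # z2 enters y directly: invalid

j_valid = hansen_j(X, Z, y_valid)       # approximately chi2(1) under H0
j_invalid = hansen_j(X, Z, y_invalid)   # should be far out in the tail

print(f"J with valid instruments:   {j_valid:.2f}")
print(f"J with an invalid instrument: {j_invalid:.2f}")
```

The degrees of freedom equal the number of overidentifying restrictions, here two excluded instruments minus one endogenous regressor.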

In [8]:
res_gmm.j_stat

H0: Expected moment conditions are equal to 0
Statistic: 1.0475
P-value: 0.3061
Distributed: chi2(1)
WaldTestStatistic, id: 0x26571c1cc08

- When all four instruments are included, the story changes: at least one of the additional instruments (`lowincome` or `firmsz`) appears to be endogenous.

In [9]:
formula='ldrugexp ~ 1 + totchr + female + age + linc + blhisp + [hi_empunion ~ ssiratio+multlc+lowincome+firmsz]'
res_gmm_all = IVGMM.from_formula(formula,data).fit(cov_type='robust')
res_gmm_all.j_stat

H0: Expected moment conditions are equal to 0
Statistic: 11.5903
P-value: 0.0089
Distributed: chi2(3)
WaldTestStatistic, id: 0x2657444c308

## Weak instruments

- Our concern is with whether the instrument is weak, because then asymptotic theory can provide a poor guide to actual finite-sample distributions.

### Weak instruments

- There are several approaches for investigating the weak IV problem, based on analysis of the first-stage reduced-form equation(s) and, particularly, the F statistic for the joint significance of the key instruments.
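
As a rough illustration of the first-stage F statistic, the sketch below computes the classical (homoskedastic) F test for excluding the instruments from the first-stage regression, on hypothetical simulated data with a strong and a weak instrument. The helper `first_stage_f` is an assumption for illustration; the diagnostics reported by `linearmodels` in this document use a robust, chi-squared-distributed version instead.

```python
import numpy as np


def first_stage_f(d, exog, instr):
    """Homoskedastic F statistic for excluding `instr` from the first stage."""
    n = len(d)
    full = np.column_stack([exog, instr])

    def rss(M):
        resid = d - M @ np.linalg.lstsq(M, d, rcond=None)[0]
        return resid @ resid

    q = instr.shape[1]                     # number of excluded instruments
    rss_r, rss_u = rss(exog), rss(full)    # restricted vs. unrestricted fit
    return ((rss_r - rss_u) / q) / (rss_u / (n - full.shape[1]))


rng = np.random.default_rng(5)
n = 2_000
exog = np.column_stack([np.ones(n), rng.normal(size=n)])
z = rng.normal(size=(n, 1))
v = rng.normal(size=n)

d_strong = 0.50 * z[:, 0] + v    # instrument explains a good share of d
d_weak = 0.02 * z[:, 0] + v      # instrument barely moves d

f_strong = first_stage_f(d_strong, exog, z)
f_weak = first_stage_f(d_weak, exog, z)

print(f"F (strong instrument): {f_strong:.1f}")
print(f"F (weak instrument):   {f_weak:.1f}")
```

A widely used rule of thumb treats a first-stage F below about 10 as a warning sign, although that threshold is only a rough guide.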

####  Diagnostics for weak instruments

- The simplest method is to use the pairwise correlations between any endogenous regressor and instruments.

In [10]:
data.loc[:,['hi_empunion']+instruments].corr()

| | hi_empunion | ssiratio | lowincome | multlc | firmsz |
|---|---|---|---|---|---|
| hi_empunion | 1.0 | -0.212431 | -0.116419 | 0.119849 | 0.037352 |
| ssiratio | -0.212431 | 1.0 | 0.253946 | -0.190433 | -0.044578 |
| lowincome | -0.116419 | 0.253946 | 1.0 | -0.062465 | -0.008232 |
| multlc | 0.119849 | -0.190433 | -0.062465 | 1.0 | 0.187275 |
| firmsz | 0.037352 | -0.044578 | -0.008232 | 0.187275 | 1.0 |


### Just-identified model

- We consider a just-identified model with one endogenous regressor, with `hi_empunion` instrumented by one variable, `ssiratio`.

In [11]:
res.first_stage.diagnostics

| | rsquared | partial.rsquared | shea.rsquared | f.stat | f.pval | f.dist |
|---|---|---|---|---|---|---|
| hi_empunion | 0.076055 | 0.017921 | 0.017921 | 65.805858 | 4.440892e-16 | chi2(1) |


### Overidentified model

- For a model with a single endogenous regressor that is overidentified, the output is of the same format as the previous example. 
- The F statistic will now be a joint test for the several instruments.

In [12]:
res_gmm_all.first_stage.diagnostics

| | rsquared | partial.rsquared | shea.rsquared | f.stat | f.pval | f.dist |
|---|---|---|---|---|---|---|
| hi_empunion | 0.082054 | 0.024298 | 0.024298 | 179.469746 | 0.0 | chi2(4) |


### Sensitivity to choice of instruments

- Is this result sensitive to the choice of the instrument?
- To address this question, we compare results for four just-identified specifications, each estimated using just one of the four available instruments. 
- We present a table with the structural-equation estimates for OLS and for the four IV estimations, followed by the F statistic for each of the four IV estimations.

In [13]:
res_ols = IV2SLS.from_formula('ldrugexp ~ 1 + hi_empunion + totchr + female + age + linc + blhisp', data).fit(cov_type='robust')
res_2sls1 = IV2SLS.from_formula('ldrugexp ~ 1 + totchr + female + age + linc + blhisp + [hi_empunion ~ ssiratio]', data).fit(cov_type='robust')
res_2sls2 = IV2SLS.from_formula('ldrugexp ~ 1 + totchr + female + age + linc + blhisp + [hi_empunion ~ lowincome]', data).fit(cov_type='robust')
res_2sls3 = IV2SLS.from_formula('ldrugexp ~ 1 + totchr + female + age + linc + blhisp + [hi_empunion ~ multlc]', data).fit(cov_type='robust')
res_2sls4 = IV2SLS.from_formula('ldrugexp ~ 1 + totchr + female + age + linc + blhisp + [hi_empunion ~ firmsz]', data).fit(cov_type='robust')
compare([res_ols,res_2sls1,res_2sls2,res_2sls3,res_2sls4])

| | Model 0 | Model 1 | Model 2 | Model 3 | Model 4 |
|---|---|---|---|---|---|
| Dep. Variable | ldrugexp | ldrugexp | ldrugexp | ldrugexp | ldrugexp |
| Estimator | OLS | IV-2SLS | IV-2SLS | IV-2SLS | IV-2SLS |
| No. Observations | 10089 | 10089 | 10089 | 10089 | 10089 |
| Cov. Est. | robust | robust | robust | robust | robust |
| R-squared | 0.1770 | 0.0640 | 0.1768 | -0.0644 | -0.9053 |
| Adj. R-squared | 0.1765 | 0.0634 | 0.1763 | -0.0651 | -0.9064 |
| F-statistic | 2262.6 | 2000.9 | 2250.6 | 1734.1 | 950.64 |
| P-value (F-stat) | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |


In [14]:
df=pd.concat([res_2sls1.first_stage.diagnostics,res_2sls2.first_stage.diagnostics,res_2sls3.first_stage.diagnostics,res_2sls4.first_stage.diagnostics])
df.index=instruments
df

| | rsquared | partial.rsquared | shea.rsquared | f.stat | f.pval | f.dist |
|---|---|---|---|---|---|---|
| ssiratio | 0.076055 | 0.017921 | 0.017921 | 65.805858 | 4.440892e-16 | chi2(1) |
| lowincome | 0.064237 | 0.00536 | 0.00536 | 59.265091 | 1.376677e-14 | chi2(1) |
| multlc | 0.064313 | 0.005441 | 0.005441 | 52.673484 | 3.939071e-13 | chi2(1) |
| firmsz | 0.060123 | 0.000987 | 0.000987 | 14.390114 | 0.0001485803 | chi2(1) |


## Better inference with weak instruments

- A different approach when weak instruments are present is either to apply an alternative asymptotic theory that may be more appropriate when instruments are weak, or to use estimators other than 2SLS for which the usual asymptotic theory may provide a more reasonable approximation.

### LIML estimator

- The literature suggests several alternative estimators that are asymptotically equivalent to 2SLS but may have better finite sample properties than 2SLS.
- The leading example is the LIML estimator.
- It is an ML estimator because it is derived under the assumption that the errors in the structural and first-stage equations are jointly normal, and it is a limited-information estimator when compared with a full-information approach that specifies structural equations (rather than first-stage equations) for all endogenous variables in the model.
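
LIML can also be written as a k-class estimator, $\hat\beta = [X'(I - \kappa M_Z)X]^{-1} X'(I - \kappa M_Z)y$, where $\kappa$ is the smallest eigenvalue of $(Y'M_ZY)^{-1}Y'M_{X_1}Y$, $Y$ stacks the dependent and endogenous variables, and $M_Z$ and $M_{X_1}$ are the residual makers for all instruments and for the included exogenous regressors. The sketch below implements this on hypothetical simulated data; the helpers `resid` and `liml` are assumptions for illustration, not the `IVLIML` implementation.

```python
import numpy as np


def resid(A, B):
    """Residuals from regressing each column of A on B."""
    return A - B @ np.linalg.lstsq(B, A, rcond=None)[0]


def liml(y, endog, exog, instr):
    """k-class estimator with kappa set to the LIML value."""
    Z = np.column_stack([exog, instr])     # all instruments
    X = np.column_stack([exog, endog])     # structural regressors
    Y = np.column_stack([y, endog])        # dependent and endogenous variables
    R1, RZ = resid(Y, exog), resid(Y, Z)
    A = R1.T @ R1                          # Y' M_exog Y
    B = RZ.T @ RZ                          # Y' M_Z Y
    kappa = np.min(np.linalg.eigvals(np.linalg.solve(B, A)).real)
    # X'(I - kappa * M_Z) = X' - kappa * (M_Z X)'
    XtK = X.T - kappa * resid(X, Z).T
    beta = np.linalg.solve(XtK @ X, XtK @ y)
    return beta, kappa


rng = np.random.default_rng(6)
n = 2_000
exog = np.column_stack([np.ones(n), rng.normal(size=n)])
instr = rng.normal(size=(n, 3))
v = rng.normal(size=n)
u = 0.7 * v + rng.normal(size=n)
endog = instr @ np.array([0.4, 0.3, 0.2]) + v
y = exog @ np.array([1.0, 0.5]) - 1.0 * endog + u   # true coefficient is -1.0

beta_liml, kappa = liml(y, endog, exog, instr)               # overidentified
beta_just, kappa_just = liml(y, endog, exog, instr[:, :1])   # just-identified

print("LIML coefficients:", beta_liml, "kappa:", kappa)
print("Just-identified kappa:", kappa_just)
```

Setting $\kappa = 1$ gives 2SLS, which is why LIML and 2SLS coincide in the just-identified case, where $\kappa$ equals one.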

In [15]:
from linearmodels.iv import IVLIML

res_liml = IVLIML.from_formula(formula, data).fit(cov_type='robust')
print('\n'.join(res_liml.summary.as_text().split('\n')[8:]))

                                                                              
                              Parameter Estimates                              
             Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
-------------------------------------------------------------------------------
Intercept       6.8043     0.2538     26.811     0.0000      6.3069      7.3017
totchr          0.4504     0.0102     44.139     0.0000      0.4304      0.4705
female         -0.0219     0.0316    -0.6912     0.4895     -0.0838      0.0401
age            -0.0134     0.0029    -4.6842     0.0000     -0.0190     -0.0078
linc            0.0884     0.0214     4.1347     0.0000      0.0465      0.1303
blhisp         -0.2186     0.0391    -5.5888     0.0000     -0.2953     -0.1420
hi_empunion    -0.9156     0.1989    -4.6037     0.0000     -1.3054     -0.5258

Endogenous: hi_empunion
Instruments: ssiratio, multlc, lowincome, firmsz
Robust Covariance (Heteroskedastic)
Debiased: F

- We compare estimators for an overidentified model with all four instruments for `hi_empunion`.

In [16]:
res_2sls = IV2SLS.from_formula(formula, data).fit(cov_type='robust')
res_gmm = IVGMM.from_formula(formula, data).fit(cov_type='robust')
compare([res_2sls,res_liml,res_gmm])

| | Model 0 | Model 1 | Model 2 |
|---|---|---|---|
| Dep. Variable | ldrugexp | ldrugexp | ldrugexp |
| Estimator | IV-2SLS | IV-LIML | IV-GMM |
| No. Observations | 10089 | 10089 | 10089 |
| Cov. Est. | robust | robust | robust |
| R-squared | 0.0720 | 0.0597 | 0.0829 |
| Adj. R-squared | 0.0715 | 0.0592 | 0.0824 |
| F-statistic | 2017.7 | 1990.4 | 2042.1 |
| P-value (F-stat) | 0.0000 | 0.0000 | 0.0000 |


## 3SLS systems estimation

- The preceding estimators are asymmetric in that they specify a structural equation for only one variable, rather than for all endogenous variables. 
- For example, we specified a structural model for `ldrugexp`, but not one for `hi_empunion`. 
- A more complete model specifies structural equations for all endogenous variables.

- For the example below, we need to provide a structural model for `hi_empunion` in addition to the structural model already specified for `ldrugexp`. 
- We suppose that `hi_empunion` depends on the single instrument `ssiratio`, on `ldrugexp`, and on `totchr`, `female` and `blhisp`.

In [17]:
from linearmodels import IV3SLS

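# Note: neither formula uses the bracket syntax `[endog ~ instruments]`, so every
# right-hand-side variable, including ldrugexp and hi_empunion, enters as exogenous.
ldrugexp='ldrugexp ~ 1 + hi_empunion + totchr + age + female + linc + blhisp'
hi_empunion='hi_empunion ~ 1 + ldrugexp + totchr + female + blhisp + ssiratio'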
equations = dict(ldrugexp=ldrugexp, hi_empunion=hi_empunion)
res_3sls = IV3SLS.from_formula(equations, data).fit(cov_type='unadjusted')
print('\n'.join(res_3sls.summary.as_text().split('\n')[8:]))

                                                                              
                Equation: ldrugexp, Dependent Variable: ldrugexp               
             Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
-------------------------------------------------------------------------------
Intercept       5.8251     0.1531     38.058     0.0000      5.5251      6.1251
hi_empunion     0.1532     0.0261     5.8716     0.0000      0.1020      0.2043
totchr          0.4399     0.0096     45.971     0.0000      0.4212      0.4587
age            -0.0034     0.0019    -1.7983     0.0721     -0.0071      0.0003
female          0.0652     0.0252     2.5909     0.0096      0.0159      0.1145
linc            0.0071     0.0139     0.5128     0.6081     -0.0202      0.0345
blhisp         -0.1457     0.0338    -4.3105     0.0000     -0.2119     -0.0794
            Equation: hi_empunion, Dependent Variable: hi_empunion            
            Parameter  Std. Err.     T-sta