It should be clear by now that omitted variables can seriously bias estimates of causal effects.

Omitted variables can cause _endogeneity_ problems, in which a model's disturbances (error terms) are correlated with one or more causal variables.  The result is biased estimates.

A problematic omitted variable can be of various types. It can be a known (to the researcher) variable for which data are actually available, but which is not included in a model's specification.  It can be a known variable for which data aren't available. Or, it can be an unknown variable that a causal variable and an outcome variable are both related to.  What's omitted might also be measurement error in a causal variable that's not modeled.  Other sources of endogeneity problems include an outcome variable causing a causal variable, called "reverse" causality or "simultaneity."

When endogeneity exists, it may be possible to mitigate it using one or more _instrumental variables_, "IVs."  

### What's an Instrumental Variable?

Consider a treatment or exposure variable X, and an outcome variable Y.  Suppose that regressing Y on X results in model disturbances $\epsilon_i$ that are correlated with X.  

An IV is a variable that:  

* does not directly cause Y
* is not correlated with $\epsilon$
* "causes" X

There must be at least as many IVs as there are X's needing instruments in a given model.
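
The three conditions above can be checked directly when the data are simulated, because the disturbance is then known. The following is a minimal sketch, not part of the exercise data: the data-generating process, the coefficient values, and the names `u`, `z`, `x`, `y` are all assumptions made for illustration. It compares the ordinary regression slope with the simple IV slope, the ratio of Cov(Z, Y) to Cov(Z, X).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000

# Hypothetical data-generating process (all names and numbers are assumptions)
u = rng.normal(size=n)                       # unobserved confounder
z = rng.normal(size=n)                       # candidate instrument
x = 0.8 * z + u + rng.normal(size=n)         # Z "causes" X; U also drives X
y = 2.0 * x + 3.0 * u + rng.normal(size=n)   # true effect of X on Y is 2.0

# Disturbance of Y given X (known here only because the data are simulated)
eps = y - 2.0 * x

corr_x_eps = np.corrcoef(x, eps)[0, 1]   # far from zero: X is endogenous
corr_z_eps = np.corrcoef(z, eps)[0, 1]   # near zero: Z is uncorrelated with eps
corr_z_x = np.corrcoef(z, x)[0, 1]       # far from zero: Z is related to X

# Ordinary regression slope vs. the simple IV (ratio-of-covariances) slope
ols_slope = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
iv_slope = np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]

print(f"corr(X, eps) = {corr_x_eps:.3f}")
print(f"corr(Z, eps) = {corr_z_eps:.3f}")
print(f"corr(Z, X)   = {corr_z_x:.3f}")
print(f"OLS slope    = {ols_slope:.3f}")
print(f"IV slope     = {iv_slope:.3f}")
```

With real data the disturbance is not observed, so the second condition cannot be verified this way and has to be argued from subject-matter knowledge.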

### Draw a DAGitty Graph

Graph a model in which X is a Tx/exposure, Y is an outcome variable, and IV is an instrument.  Connect X and Y through an unobserved variable U.
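
One way to sanity-check the graph before drawing it is to write the arrows down as an edge list. The sketch below is only an illustration in plain Python: the helper functions `parents` and `directed_paths` are hypothetical names introduced here, and the printed text is meant to be DAGitty-style model code rather than a guaranteed match for any particular tool's input format.

```python
# Edges of the graph described above: IV -> X -> Y, with unobserved U -> X and U -> Y
edges = [("IV", "X"), ("X", "Y"), ("U", "X"), ("U", "Y")]


def parents(node, edges):
    """Return the set of nodes with an arrow pointing into `node`."""
    return {src for src, dst in edges if dst == node}


def directed_paths(start, end, edges, path=None):
    """Enumerate all directed paths from `start` to `end`."""
    path = (path or []) + [start]
    if start == end:
        return [path]
    found = []
    for src, dst in edges:
        if src == start and dst not in path:
            found.extend(directed_paths(dst, end, edges, path))
    return found


# The instrument should reach Y only by going through X
iv_paths = directed_paths("IV", "Y", edges)
print("Directed paths from IV to Y:", iv_paths)
print("Parents of Y:", parents("Y", edges))
print("Parents of IV:", parents("IV", edges))

# DAGitty-style model text built from the same edge list
dagitty_text = "dag {\n" + "\n".join(f"  {s} -> {d}" for s, d in edges) + "\n}"
print(dagitty_text)
```

The checks mirror the IV conditions: IV has no arrow straight into Y, and U has no arrow into IV.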

### Load Some Data

There are some "toy" data in the file 'cholExer.csv'. The variables are:

* cholDrop = drop in total cholesterol after participating in a four-week exercise program in which participants exercised three days a week, and recorded the length of time they exercised on each day in a diary.
* exercise = average number of minutes/day across the three days in the four weeks of the program
* secretSauce = you'll see

### Get Needed Stuff

In [1]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
import pandas as pd
import numpy as np

In [2]:
# This can just be commented out.  It's a kluge for one of my environ
# import os
# os.environ['R_HOME']='/home/lynd/anaconda36/lib/R'

In [2]:
import rpy2
import rpy2.robjects as robjects
from rpy2.robjects.packages import importr
import rpy2.interactive as r
import rpy2.interactive.packages
base=importr('base')
utils=importr('utils')
stats=importr('stats')

In [5]:
# input the data from cholExer.csv
cholEx=pd.read_csv('data/cholExer.csv')  
# Note that this is a pandas DataFrame.  Below you'll need to convert it to an R data.frame.  See the Regression notebook.

### Bivariate and Two Stage Least Squares (2SLS) Regressions.

**EXERCISE** Estimate a regression model using the R function lm() that predicts cholDrop with exercise. Extract the residuals, and plot them by the exercise variable.  

TIP: You can use seaborn to plot residuals vs. exercise, as in the Regression notebook.

**EXERCISE** Now, imagine that "secretSauce" might be an instrument for exercise. First, regress exercise on secretSauce. Then, regress cholDrop on the values of exercise predicted by secretSauce.

This is a basic example of using a *two stage least squares* (2SLS) procedure. secretSauce is used as an instrument for exercise. How does the regression coefficient estimate for predicted exercise in the second stage model compare to the estimate you got with your earlier model, the model that regressed cholDrop on the "uninstrumented" exercise predictor?
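
The two stages can be sketched in plain numpy. This is an illustration on simulated data, not a solution using the cholExer data: the data-generating process, the true slope of 0.5, and the helper `ols` are assumptions made for the example.

```python
import numpy as np


def ols(y, x):
    """Return (intercept, slope) from a least-squares fit of y on x."""
    design = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[0], coef[1]


rng = np.random.default_rng(1)
n = 20_000

# Hypothetical data: u is an unobserved confounder, z is the instrument
u = rng.normal(size=n)
z = rng.normal(size=n)
x = 1.0 * z + u + rng.normal(size=n)         # endogenous predictor
y = 0.5 * x - 2.0 * u + rng.normal(size=n)   # true slope on x is 0.5

# "Uninstrumented" regression of y on x
_, naive_slope = ols(y, x)

# Stage 1: regress x on z and keep the predicted values
a0, a1 = ols(x, z)
x_hat = a0 + a1 * z

# Stage 2: regress y on the predicted values of x
_, tsls_slope = ols(y, x_hat)

print(f"naive slope = {naive_slope:.3f}")
print(f"2SLS slope  = {tsls_slope:.3f}")
```

In this simulation the confounder pushes the naive slope away from the true value, while the second-stage slope lands close to it.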


### Using Software for IV Regression

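Running 2SLS by hand as above gives the right coefficient estimates, but the standard errors reported by the second-stage regression are wrong: they are computed from residuals based on the predicted, rather than the observed, values of the instrumented predictors. The variances and covariances can be adjusted manually, but it's more convenient to use software that takes care of that. Let's try using the ivreg function in the R package AER.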
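
To get a feel for the size of the adjustment, here is a rough numpy sketch on simulated data. The data-generating process and variable names are assumptions, and the formulas are the textbook ones for a single instrument under homoskedastic errors, not a description of any package's internals. The only difference between the two standard errors is which version of the predictor is used to compute the residuals.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5_000

# Hypothetical data: true slope on x is 1.0, u is an unobserved confounder
u = rng.normal(size=n)
z = rng.normal(size=n)
x = z + u + rng.normal(size=n)
y = 1.0 * x + 2.0 * u + rng.normal(size=n)

Z = np.column_stack([np.ones(n), z])   # instrument design matrix
X = np.column_stack([np.ones(n), x])   # observed regressors

# Stage 1 predicted regressors, then stage 2 coefficients
X_hat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]
beta = np.linalg.lstsq(X_hat, y, rcond=None)[0]

xtx_inv = np.linalg.inv(X_hat.T @ X_hat)
dof = n - X.shape[1]

# Unadjusted: residuals use the predicted regressor (what a plain second stage reports)
resid_naive = y - X_hat @ beta
se_naive = np.sqrt(resid_naive @ resid_naive / dof * np.diag(xtx_inv))

# Adjusted: residuals use the observed regressor
resid_2sls = y - X @ beta
se_2sls = np.sqrt(resid_2sls @ resid_2sls / dof * np.diag(xtx_inv))

print(f"slope estimate      = {beta[1]:.3f}")
print(f"unadjusted SE       = {se_naive[1]:.4f}")
print(f"adjusted (2SLS) SE  = {se_2sls[1]:.4f}")
```

The coefficient is the same either way; only the standard errors change, which is the part dedicated IV software handles.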

In [4]:
rAER=importr('AER')

In [8]:
cholIVFormula='cholDrop~exercise | secretSauce'  
# Just one endogenous regressor.  See the AER doc for specifying mixes of exogenous and endogenous predictors

In [7]:
# Here's what might work for you:
# cholIVResults=rAER.ivreg(cholIVFormula,data=yourRDataFrame)  # yourRDataFrame = your converted R data.frame
# base.summary(cholIVResults)

**QUESTION**  

How do your results from using AER's ivreg function differ from your 2SLS results?  

**QUESTION**

Finally, given what the cholesterol and exercise variables appear to be, what variables might fill the role of `secretSauce` as an IV?