# Instrumental variables

## IV example on mock dataset

### Constructing the dataset

Create four random series of length $N=100000$:

- $x$: education
- $y$: salary
- $z$: ambition
- $q$: early smoking 

such that:

1. $x$ and $z$ cause $y$
2. $z$ causes $x$
3. $q$ is correlated with $x$, not with $z$


A problem arises when the confounding factor $z$ is not observed: a regression of $y$ on $x$ alone then attributes part of the effect of $z$ to $x$. In that case, we can still estimate the direct effect of $x$ on $y$ by using $q$ as an instrument.

Run the following code to create a mock dataset.

In [40]:
import numpy as np
import pandas as pd

In [41]:
N = 100000
ϵ_z = np.random.randn(N)*0.1
ϵ_x = np.random.randn(N)*0.1
ϵ_q = np.random.randn(N)*0.01
ϵ_y = np.random.randn(N)*0.01

In [54]:
z = 0.1 + ϵ_z
x = 0.1 + z + ϵ_x
q = 0.5 + 0.1234*ϵ_x + ϵ_q
y  = 1.0 + 0.9*x + 0.4*z + ϵ_y

In [55]:
df = pd.DataFrame({
    "x": x,
    "y": y,
    "z": z,
    "q": q
})

__Describe the dataframe. Compute the correlations between the variables. Are they compatible with the hypotheses for IV?__
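
One possible way to approach this question is sketched below. It rebuilds the same mock dataset with a seeded generator (the seed and the ASCII names `e_z`, `e_x`, `e_q`, `e_y` are assumptions made here for reproducibility), then prints summary statistics and the correlation matrix. For $q$ to be a usable instrument, it should be clearly correlated with $x$ and roughly uncorrelated with $z$.

```python
import numpy as np
import pandas as pd

# Seeded generator: an assumption for reproducibility, not part of the exercise.
rng = np.random.default_rng(0)
N = 100_000

e_z = rng.standard_normal(N) * 0.1
e_x = rng.standard_normal(N) * 0.1
e_q = rng.standard_normal(N) * 0.01
e_y = rng.standard_normal(N) * 0.01

# Same equations as in the cells above.
z = 0.1 + e_z
x = 0.1 + z + e_x
q = 0.5 + 0.1234 * e_x + e_q
y = 1.0 + 0.9 * x + 0.4 * z + e_y

df = pd.DataFrame({"x": x, "y": y, "z": z, "q": q})

# Summary statistics (count, mean, std, quantiles) for each column.
print(df.describe())

# Pairwise Pearson correlations.
corr = df.corr()
print(corr.round(3))
```

The entries to look at are `corr.loc["q", "x"]` (instrument relevance) and `corr.loc["q", "z"]` (the instrument should not be related to the unobserved confounder).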

### OLS Regression

Use `linearmodels` to run a regression estimating the effect of $x$ on $y$ (note the slight API change w.r.t. `statsmodels`). Comment.

In [58]:
from linearmodels import OLS

In [59]:
model = OLS.from_formula("y ~ x", df)
res = model.fit()
res.summary

| | | | |
|---|---|---|---|
| Dep. Variable: | y | R-squared: | 0.9643 |
| Estimator: | OLS | Adj. R-squared: | 0.9643 |
| No. Observations: | 100000 | F-statistic: | 2.677e+06 |
| Date: | Tue, Feb 13 2024 | P-value (F-stat) | 0.0000 |
| Time: | 22:35:15 | Distribution: | chi2(1) |
| Cov. Estimator: | robust | | |

| | Parameter | Std. Err. | T-stat | P-value | Lower CI | Upper CI |
|---|---|---|---|---|---|---|
| Intercept | 0.9998 | 0.0002 | 6088.5 | 0.0000 | 0.9995 | 1.0001 |
| x | 1.1003 | 0.0007 | 1636.2 | 0.0000 | 1.0990 | 1.1016 |


> your comment
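
As a hint for the comment, the estimated slope (about 1.10 instead of the true 0.9) can be checked against the usual omitted-variable bias formula. The derivation below uses only the data-generating equations of the mock dataset, where $\operatorname{Var}(\epsilon_z) = \operatorname{Var}(\epsilon_x) = 0.1^2 = 0.01$ and $\operatorname{Cov}(x, z) = \operatorname{Var}(\epsilon_z)$ because $x = 0.1 + z + \epsilon_x$:

$$
\begin{aligned}
\operatorname{plim}\,\hat\beta_{OLS}
  &= 0.9 + 0.4\,\frac{\operatorname{Cov}(x, z)}{\operatorname{Var}(x)} \\
  &= 0.9 + 0.4\,\frac{\operatorname{Var}(\epsilon_z)}{\operatorname{Var}(\epsilon_z) + \operatorname{Var}(\epsilon_x)} \\
  &= 0.9 + 0.4\,\frac{0.01}{0.01 + 0.01} = 1.1
\end{aligned}
$$

The regression of $y$ on $x$ alone therefore overstates the effect of $x$ by about $0.2$, which is the share of the effect of $z$ that is picked up through its correlation with $x$.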

__Assume briefly that `z` is known and control for `z` in the regression. What happens?__

In [8]:
# your code

### Regress again $y$ on $x$, this time controlling for the missing variable $z$.



In [None]:
from linearmodels import IV2SLS
formula = (
    # your formula
)
mod = IV2SLS.from_formula(formula, df)
res = mod.fit()
res

| | | | |
|---|---|---|---|
| Dep. Variable: | y | R-squared: | 0.6068 |
| Estimator: | OLS | Adj. R-squared: | 0.6067 |
| No. Observations: | 100000 | F-statistic: | 1.518e+05 |
| Date: | Mon, Mar 27 2023 | P-value (F-stat) | 0.0000 |
| Time: | 21:54:44 | Distribution: | chi2(2) |
| Cov. Estimator: | robust | | |

| | Parameter | Std. Err. | T-stat | P-value | Lower CI | Upper CI |
|---|---|---|---|---|---|---|
| Intercept | 0.9997 | 0.0005 | 1818.9 | 0.0000 | 0.9987 | 1.0008 |
| x | 0.9007 | 0.0032 | 283.90 | 0.0000 | 0.8945 | 0.9069 |
| z | 0.3993 | 0.0035 | 113.00 | 0.0000 | 0.3924 | 0.4062 |


In [60]:
# Comment

### Instrumental variable

Make a causality graph, summarizing what you know from the equations.

Use $q$ to instrument the effect of $x$ on $y$. Comment.

In [None]:
from linearmodels import IV2SLS
formula = (
    # your formula
)
mod = IV2SLS.from_formula(formula, df)
res = mod.fit()
res

> comment
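
As a cross-check that does not depend on `linearmodels`, the IV estimate can also be computed by hand. The sketch below is only an illustration under stated assumptions: it regenerates the mock data with a fixed seed (an assumption made here), then compares the naive OLS slope, the ratio-of-covariances IV estimator $\operatorname{Cov}(q, y) / \operatorname{Cov}(q, x)$, and a manual two-stage least squares. With a single instrument and a single endogenous regressor, the last two should coincide. In `linearmodels` formula syntax, the endogenous regressor and its instrument go in square brackets, for example `y ~ 1 + [x ~ q]`.

```python
import numpy as np

# Seeded generator: an assumption for reproducibility.
rng = np.random.default_rng(0)
N = 100_000

e_z = rng.standard_normal(N) * 0.1
e_x = rng.standard_normal(N) * 0.1
e_q = rng.standard_normal(N) * 0.01
e_y = rng.standard_normal(N) * 0.01

# Same data-generating equations as the mock dataset.
z = 0.1 + e_z
x = 0.1 + z + e_x
q = 0.5 + 0.1234 * e_x + e_q
y = 1.0 + 0.9 * x + 0.4 * z + e_y


def cov(a, b):
    """Sample covariance between two 1-D arrays."""
    return np.cov(a, b)[0, 1]


# Naive OLS slope of y on x (z is omitted, so this is biased).
beta_ols = cov(x, y) / np.var(x, ddof=1)

# IV estimator with one instrument: ratio of covariances.
beta_iv = cov(q, y) / cov(q, x)

# Manual 2SLS. First stage: regress x on a constant and q.
Z = np.column_stack([np.ones(N), q])
pi, *_ = np.linalg.lstsq(Z, x, rcond=None)
x_hat = Z @ pi

# Second stage: regress y on a constant and the fitted values of x.
X_hat = np.column_stack([np.ones(N), x_hat])
beta_2sls, *_ = np.linalg.lstsq(X_hat, y, rcond=None)

print("OLS slope :", beta_ols)
print("IV slope  :", beta_iv)
print("2SLS slope:", beta_2sls[1])
```

The manual second stage gives the right point estimate but not valid standard errors, which is one reason to use a dedicated estimator such as `IV2SLS` for inference.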

## Return on Education

We follow the excellent R [tutorial](https://www.econometrics-with-r.org/12-6-exercises-10.html) from the (excellent) *Econometrics with R* book.

The goal is to measure the effect of schooling on earnings, while correcting the endogeneity bias by using distance to college as an instrument.

__Download the college distance dataset with `statsmodels`. Describe the dataset and extract the dataframe.__

https://vincentarelbundock.github.io/Rdatasets/datasets.html

In [64]:
import statsmodels.api as sm
ds = sm.datasets.get_rdataset("CollegeDistance", "AER")

In [65]:
# describe dataset

In [66]:
df = ds.data

In [14]:
# describe dataframe

__How is `income` encoded? Create a binary variable `income_binary` to replace it.__
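
A two-level categorical column can be turned into a 0/1 variable with a simple comparison. The sketch below uses a toy dataframe rather than the real dataset: the column name `income` and the level names `low` and `high` are assumptions to be checked against the actual data (for instance with `value_counts()`).

```python
import pandas as pd

# Toy data: column and level names are hypothetical stand-ins.
toy = pd.DataFrame({"income": ["low", "high", "high", "low"]})

# Inspect the levels before recoding.
print(toy["income"].value_counts())

# 1 when the level is "high", 0 otherwise.
toy["income_binary"] = (toy["income"] == "high").astype(int)
print(toy)
```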

__Plot a histogram of the distance to college.__

__Run the naive regression $\text{income}_{\text{binary}} = \beta_0 + \beta_1 \, \text{education} + u$.__



__Augment the regression with `unemp`, `hispanic`, `af-am`, `female` and `urban`. Notice that categorical variables are encoded automatically. What are the reference (treatment) levels? Change them using the syntax `C(var, Treatment('ref'))`.__

__Comment on the results and explain the selection problem.__

__Explain why distance to college might be used to instrument the effect of schooling.__

__Run an IV regression, where `distance` is used to instrument schooling.__

See the `linearmodels` documentation (two-stage least squares): https://bashtage.github.io/linearmodels/
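
As a reminder of what the estimator does, two-stage least squares can be written as two regressions. In the sketch below, $y_i$ stands for whichever outcome variable is used in the regressions above, and the additional controls are left out for brevity (both are simplifying assumptions of this illustration):

$$
\begin{aligned}
\text{education}_i &= \pi_0 + \pi_1\,\text{distance}_i + v_i \\
y_i &= \beta_0 + \beta_1\,\widehat{\text{education}}_i + u_i
\end{aligned}
$$

The first stage isolates the variation in schooling that is explained by distance to college, and the second stage uses only that variation to estimate $\beta_1$. This is valid only if distance affects the outcome through schooling alone.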

__Comment on the results. Compare with the R tutorial.__