# Instrumental variables

## Functions and Loops

Go back to the first notebook. Learn how to write functions and loops.

In [1]:
import pandas

In [6]:
fname = "dataset.csv"
df = pandas.read_csv(fname)
display(df.describe())
df


Unnamed: 0,gdp,date
count,4.0,4.0
mean,552.5,2000.5
std,519.623261,0.57735
min,100.0,2000.0
25%,103.75,2000.0
50%,552.5,2000.5
75%,1001.25,2001.0
max,1005.0,2001.0


Unnamed: 0,country,gdp,date
0,usa,1000,2000
1,usa,1005,2001
2,france,100,2000
3,france,105,2001


In [16]:
def import_and_print(fname, print_statistics=True):
    # here is the body of the function
    df = pandas.read_csv(fname)
    if print_statistics:
        display("Summary Statistics")
        display(df.describe())
    return df

In [17]:
import_and_print("dataset.csv")

'Summary Statistics'

Unnamed: 0,gdp,date
count,4.0,4.0
mean,552.5,2000.5
std,519.623261,0.57735
min,100.0,2000.0
25%,103.75,2000.0
50%,552.5,2000.5
75%,1001.25,2001.0
max,1005.0,2001.0


Unnamed: 0,country,gdp,date
0,usa,1000,2000
1,usa,1005,2001
2,france,100,2000
3,france,105,2001


In [18]:
import_and_print("dataset_2.csv", False)

Unnamed: 0,country,gdp,date
0,usa,1000,2000
1,usa,1005,2001
2,france,100,2000
3,france,105,2001


In [20]:
def f(x): # no side effect
    return x**2 + 1

In [22]:
import time

In [25]:
def g(x): # that one has side effects
    print("Calculating...")
    time.sleep(10)
    return x**2 + 1

In [24]:
g(1)

Calculating...


2

In [27]:
def h(x): # that one has only side effects
    print("Calculating...")
    time.sleep(10)
    y =  x**2 + 1
    print(f"Found it! {y}")

In [28]:
h(1)

Calculating...
Found it! 2


Documenting code:

- add comments: `# ...`
- add docstrings: a string placed on the first line of the function body, just after the `def` line, explaining what the function does

In [32]:
def import_and_print(fname, print_statistics=True):
    "Import a dataframe from a filename, print the main statistics and return the dataframe."
    
    
    # import the file to get a dataframe
    df = pandas.read_csv(fname)
    
    if print_statistics:
        display("Summary Statistics")
        # we print default summary statistics computed by pandas
        display(df.describe())
        
    return df

In [34]:
import_and_print?

Signature: import_and_print(fname, print_statistics=True)
Docstring: Import a dataframe from a filename, print the main statistics and return the dataframe.
File:      ~/Teaching/dbe/session_6/<ipython-input-32-fb18d0d82cec>
Type:      function


In [35]:
def import_and_print(fname, print_statistics=True):
    """Import a dataframe from a filename.
    
    fname (string): filename
    print_statistics (boolean): if True print summary statistics
    
    """
    
    
    # import the file to get a dataframe
    df = pandas.read_csv(fname)
    
    if print_statistics:
        display("Summary Statistics")
        # we print default summary statistics computed by pandas
        display(df.describe())
        
    return df

In [36]:
import_and_print?

Signature: import_and_print(fname, print_statistics=True)
Docstring:
Import a dataframe from a filename.

fname (string): filename
print_statistics (boolean): if True print summary statistics
File:      ~/Teaching/dbe/session_6/<ipython-input-35-97c3580f10df>
Type:      function


In [37]:
(lambda x: x**2 - 1)(3)

8

In [38]:
f = (lambda x: x**2-1)
f(3)

8

In [40]:
## anonymous functions are useful for the groupby pandas function

In [41]:
df

Unnamed: 0,country,gdp,date
0,usa,1000,2000
1,usa,1005,2001
2,france,100,2000
3,france,105,2001


In [52]:
# naive approach
l = []
for country in df["country"].unique():
    print(f"Selecting country {country}")
    sel = df["country"]==country
    sdf = df[sel] # sub dataframe with the right country
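    # numeric_only=True skips the text column "country" (required by recent pandas)
    print(sdf.mean(numeric_only=True))
    l.append(sdf.mean(numeric_only=True))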

Selecting country usa
 gdp     1002.5
 date    2000.5
dtype: float64
Selecting country france
 gdp      102.5
 date    2000.5
dtype: float64


In [61]:
def todo(sdf):
    # numeric_only=True skips the text column "country"
    print(sdf.mean(numeric_only=True))

In [62]:
df.groupby("country").apply( todo )

 gdp      102.5
 date    2000.5
dtype: float64
 gdp     1002.5
 date    2000.5
dtype: float64


In [63]:
df.groupby("country").apply(lambda sdf: sdf.mean(numeric_only=True))

country,gdp,date
france,102.5,2000.5
usa,1002.5,2000.5
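
For a plain per-group mean, the same table can be obtained without a custom function. The sketch below rebuilds the small GDP table inline (instead of reading `dataset.csv`) so that it runs on its own; the names `means` and `means_apply` are only illustrative.

```python
import pandas

# Rebuild the small example table inline so the snippet is self-contained.
df = pandas.DataFrame({
    "country": ["usa", "usa", "france", "france"],
    "gdp": [1000, 1005, 100, 105],
    "date": [2000, 2001, 2000, 2001],
})

# Select the numeric columns, then average within each country.
means = df.groupby("country")[["gdp", "date"]].mean()

# Same computation with an anonymous function applied to each sub-dataframe.
means_apply = df.groupby("country")[["gdp", "date"]].apply(lambda sdf: sdf.mean())

print(means)
```

Selecting the numeric columns before aggregating keeps the text column `country` out of the mean, which recent pandas versions would otherwise reject.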


## Baby example on mock dataset

### Constructing the dataset

Create four random series of length $N$ (the code below uses $N=100000$)

- $x$: education
- $y$: salary
- $z$: ambition
- $q$: early smoking 

such that:

1. $x$ and $z$ cause $y$
2. $z$ causes $x$
3. $q$ is correlated with $x$, not with $z$

(all relations are linear, add random shocks where needed)

Create a dataset `df`


In [64]:
import numpy

In [128]:
N = 100000

In [129]:
ϵ_z = numpy.random.randn(N)*0.01
ϵ_x = numpy.random.randn(N)*0.01
ϵ_q = numpy.random.randn(N)*0.01
ϵ_y = numpy.random.randn(N)*0.01

In [130]:
z = 0.1 + ϵ_z
x = 0.1 + z + ϵ_x
q = 0.5 + 0.1234*ϵ_x + ϵ_q
y  = 1.0 + 0.9*x + 0.4*z + ϵ_y

In [131]:
df = pandas.DataFrame({
    "x": x,
    "y": y,
    "z": z,
    "q": q
})

In [132]:
df.corr()


Unnamed: 0,x,y,z,q
x,1.0,0.831152,0.708497,0.079905
y,0.831152,1.0,0.694133,0.05407
z,0.708497,0.694133,1.0,-0.006022
q,0.079905,0.05407,-0.006022,1.0


### Naive approach

Run a regression to estimate the effect of $x$ on $y$, then control for $z$.
What happens?

In [133]:
import linearmodels
from statsmodels.formula import api

In [134]:
model = api.ols("y ~ x", df)
res = model.fit()
res.summary()

0,1,2,3
Dep. Variable:,y,R-squared:,0.691
Model:,OLS,Adj. R-squared:,0.691
Method:,Least Squares,F-statistic:,223400.0
Date:,"Wed, 09 Mar 2022",Prob (F-statistic):,0.0
Time:,12:14:59,Log-Likelihood:,314880.0
No. Observations:,100000,AIC:,-629800.0
Df Residuals:,99998,BIC:,-629700.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,1.0005,0.000,2150.251,0.000,1.000,1.001
x,1.0974,0.002,472.679,0.000,1.093,1.102

0,1,2,3
Omnibus:,2.727,Durbin-Watson:,1.991
Prob(Omnibus):,0.256,Jarque-Bera (JB):,2.727
Skew:,0.008,Prob(JB):,0.256
Kurtosis:,3.019,Cond. No.,73.5


In [135]:
model = api.ols("y ~ x + z", df)
res = model.fit()
res.summary()

0,1,2,3
Dep. Variable:,y,R-squared:,0.713
Model:,OLS,Adj. R-squared:,0.713
Method:,Least Squares,F-statistic:,124300.0
Date:,"Wed, 09 Mar 2022",Prob (F-statistic):,0.0
Time:,12:15:00,Log-Likelihood:,318620.0
No. Observations:,100000,AIC:,-637200.0
Df Residuals:,99997,BIC:,-637200.0
Df Model:,2,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,1.0007,0.000,2232.389,0.000,1.000,1.002
x,0.8997,0.003,283.880,0.000,0.893,0.906
z,0.3938,0.004,88.054,0.000,0.385,0.403

0,1,2,3
Omnibus:,2.38,Durbin-Watson:,1.99
Prob(Omnibus):,0.304,Jarque-Bera (JB):,2.379
Skew:,0.007,Prob(JB):,0.304
Kurtosis:,3.019,Cond. No.,166.0
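
The gap between the two estimates is the usual omitted-variable bias, and it can be worked out from the data-generating process. With $y = 1 + 0.9x + 0.4z + \epsilon_y$, $x = 0.1 + z + \epsilon_x$ and $z = 0.1 + \epsilon_z$, where the shocks are independent with a common variance $\sigma^2$, the slope of the regression of $y$ on $x$ alone converges to:

$$
\frac{\operatorname{Cov}(x, y)}{\operatorname{Var}(x)}
= 0.9 + 0.4\,\frac{\operatorname{Cov}(x, z)}{\operatorname{Var}(x)}
= 0.9 + 0.4\,\frac{\sigma^2}{2\sigma^2}
= 1.1
$$

This is consistent with the estimate of about 1.097 in the regression of $y$ on $x$ alone. Once $z$ is included, the coefficient on $x$ comes back to roughly 0.9.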


### Instrumental variable

Use $q$ to instrument the effect of $x$ on $y$. Comment.

In [136]:
# difference between linearmodels and statsmodels:
# linearmodels does not include the constant by default

In [137]:
from linearmodels import IV2SLS

In [138]:
formula = (
    "y ~ 1 + [x ~ q]"
)
mod = IV2SLS.from_formula(formula, df)
res = mod.fit()
res

0,1,2,3
Dep. Variable:,y,R-squared:,0.6670
Estimator:,IV-2SLS,Adj. R-squared:,0.6669
No. Observations:,100000,F-statistic:,883.43
Date:,"Wed, Mar 09 2022",P-value (F-stat),0.0000
Time:,12:15:00,Distribution:,chi2(1)
Cov. Estimator:,robust,,
,,,

0,1,2,3,4,5,6
,Parameter,Std. Err.,T-stat,P-value,Lower CI,Upper CI
Intercept,1.0413,0.0060,173.27,0.0000,1.0295,1.0531
x,0.8934,0.0301,29.723,0.0000,0.8345,0.9523
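
With a single instrument and a single endogenous regressor, the 2SLS slope reduces to a ratio of covariances, $\operatorname{Cov}(q, y) / \operatorname{Cov}(q, x)$. The sketch below checks this by hand with `numpy` on data simulated with the same equations as the mock dataset. The fixed seed and the names `rng`, `beta_ols` and `beta_iv` are assumptions added for reproducibility; they are not part of the original notebook.

```python
import numpy

# Assumption: a fixed seed, only so that the numbers are reproducible.
rng = numpy.random.default_rng(0)
N = 100_000

# Same data-generating process as the mock dataset.
e_z, e_x, e_q, e_y = (rng.standard_normal(N) * 0.01 for _ in range(4))
z = 0.1 + e_z
x = 0.1 + z + e_x
q = 0.5 + 0.1234 * e_x + e_q
y = 1.0 + 0.9 * x + 0.4 * z + e_y

# OLS slope of y on x alone: biased upward because z is omitted.
beta_ols = numpy.cov(x, y)[0, 1] / numpy.var(x, ddof=1)

# IV slope: q is related to x only through e_x, so the ratio isolates
# the causal coefficient on x.
beta_iv = numpy.cov(q, y)[0, 1] / numpy.cov(q, x)[0, 1]

print(f"OLS slope: {beta_ols:.3f}, IV slope: {beta_iv:.3f}")
```

The IV estimate is much noisier than OLS because $q$ is only weakly correlated with $x$ (about 0.08 in the correlation table), which is also why the standard error reported by `IV2SLS` is roughly ten times larger than the OLS one.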


## Return on Education

We follow the excellent R [tutorial](https://www.econometrics-with-r.org/12-6-exercises-10.html) from the (excellent) *Econometrics with R* book.

The goal is to measure the effect of schooling on earnings, while correcting the endogeneity bias by using distance to college as an instrument.

__Download the college distance dataset and make a nice dataframe. Describe the dataset. Plot a histogram of distance.__

https://vincentarelbundock.github.io/Rdatasets/datasets.html

__Run the naive regression $\log(\text{wage})=\beta_0 + \beta_1 \text{education} + u$__



__Augment the regression with `unemp`, `hispanic`, `af-am`, `female` and `urban`__

__Comment on the results and explain the selection problem.__

__Explain why distance to college might be used to instrument the effect of schooling.__

__Run an IV regression, where `distance` is used to instrument schooling.__

See the `linearmodels` documentation on two-stage least squares: https://bashtage.github.io/linearmodels/
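
To make the two stages concrete, here is a minimal sketch of 2SLS done by hand with `numpy.linalg.lstsq` on synthetic data. It is not the `linearmodels` implementation: the variable names (`ability`, `control`, `instrument`, `educ`, `logwage`) and all coefficients are made up for illustration and do not come from the college distance dataset. The standard errors of a manually run second stage would also be wrong, which is one reason to rely on `IV2SLS` in practice.

```python
import numpy

rng = numpy.random.default_rng(1)  # assumption: fixed seed for reproducibility
n = 50_000

# Hypothetical data: 'ability' is unobserved and biases the naive regression.
ability = rng.standard_normal(n)
control = rng.standard_normal(n)      # observed exogenous control
instrument = rng.standard_normal(n)   # shifts educ, no direct effect on logwage
educ = 1.0 * ability - 0.5 * instrument + 0.3 * control + rng.standard_normal(n)
logwage = 2.0 + 0.1 * educ + 0.5 * ability + 0.2 * control + rng.standard_normal(n)


def ols(X, y):
    """Return the least-squares coefficients of y on the columns of X."""
    return numpy.linalg.lstsq(X, y, rcond=None)[0]


const = numpy.ones(n)

# Stage 1: regress the endogenous variable on the instrument and the controls.
Z = numpy.column_stack([const, control, instrument])
educ_hat = Z @ ols(Z, educ)

# Stage 2: replace educ by its fitted value in the outcome equation.
beta_2sls = ols(numpy.column_stack([const, control, educ_hat]), logwage)

# Naive OLS for comparison (uses educ itself).
beta_ols = ols(numpy.column_stack([const, control, educ]), logwage)

print("OLS  coefficient on educ:", round(beta_ols[2], 3))
print("2SLS coefficient on educ:", round(beta_2sls[2], 3))
```

In the `linearmodels` formula syntax, the same specification would read `"logwage ~ 1 + control + [educ ~ instrument]"`, with the exogenous controls listed outside the brackets.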

__Comment on the results. Compare with the R tutorial.__