<a href="https://colab.research.google.com/github/brighamfrandsen/econ484/blob/master/examples/movies%20IV.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Import necessary libraries
!git clone https://github.com/brighamfrandsen/econ484.git
%cd econ484/utilities
from preamble import *
%cd ../data

# Example: Instrumental Variables Estimation of the Effect of Social Spillovers on Movie-going

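This notebook illustrates instrumental variables (IV) estimation of the effect of opening-weekend movie attendance on attendance in later weekends, using the weather on opening weekend as instruments. It compares OLS, manual two-stage least squares (2SLS), and 2SLS with machine-learning first stages (random forest and post-Lasso).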

## Figure out your question

What is the effect of opening-weekend attendance on subsequent weekend attendance at a movie?
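
One way to formalize this question, assuming a linear model with a constant effect (the symbols $\alpha$, $\beta$, $\pi_0$, $\pi$, $\varepsilon_i$, and $v_i$ are introduced here for illustration and do not appear in the original notebook), is:

$$
y_i = \alpha + \beta d_i + \varepsilon_i, \qquad d_i = \pi_0 + Z_i'\pi + v_i,
$$

where $y_i$ is the outcome (later attendance), $d_i$ is the treatment (opening-weekend attendance), and $Z_i$ is the vector of instruments. OLS of $y_i$ on $d_i$ is inconsistent for $\beta$ when $d_i$ is correlated with $\varepsilon_i$. IV relies on $Z_i$ being correlated with $d_i$ but uncorrelated with $\varepsilon_i$.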

## Obtain a labeled dataset

In [None]:
import pandas as pd
import numpy as np

In [None]:
moviedata=pd.read_csv('./movies_cleaned.csv')
print(moviedata.head())
print("Shape: {}".format(str(moviedata.shape)))

Let's define our "label" (y) vector, our "treatment" vector (d), and our instrument matrix (Z):

In [None]:
y = moviedata.loc[:,'y_ticketsales']
d = moviedata.loc[:,['x_openingsales']]
Z = moviedata.filter(like='open_',axis=1)
print('our y vector is:\n',y.head())
print('our d vector is:\n',d.head())
print('our instrument matrix is:\n',Z.head())

## Start with OLS of y on d. Be sure to import the necessary packages and print out the coefficients!

### Try yourself first!

### Cheat if you need to

In [None]:
from sklearn import linear_model

ols = linear_model.LinearRegression().fit(d,y)
print('OLS coefficient: ',ols.coef_)

## Now do "manual" two-stage least squares where you first regress d on Z, obtain predicted values, then regress y on the predicted values. Be sure to print out final coefficient on d-hat!

### Try yourself first

### Cheat if you need to

In [None]:

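# first stage: regress d on Z and keep the fitted values
# (a fresh LinearRegression each time, so earlier fits are not overwritten)
ols_fs = linear_model.LinearRegression().fit(Z,d)
dhat = ols_fs.predict(Z)
# second stage: regress y on the first-stage fitted values
tsls = linear_model.LinearRegression().fit(dhat,y)

print('2SLS coefficient: ',tsls.coef_)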
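
As a sanity check on the manual two-step procedure, the sketch below runs the same steps on simulated data where the true effect is known. The data-generating process, the names `Z_sim`, `d_sim`, `y_sim`, `u`, and `beta_true`, and the sample size are all assumptions made for illustration; none of them come from the movie data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 5000
beta_true = 2.0  # hypothetical true effect of d on y

# Simulated data: u is an unobserved confounder that moves both d and y
Z_sim = rng.normal(size=(n, 3))
u = rng.normal(size=n)
d_sim = Z_sim @ np.array([1.0, 0.5, -0.5]) + u + rng.normal(size=n)
y_sim = beta_true * d_sim + 2.0 * u + rng.normal(size=n)

# OLS of y on d: biased here because d is correlated with u
ols_sim = LinearRegression().fit(d_sim.reshape(-1, 1), y_sim)

# First stage: regress d on Z and keep the fitted values
first_stage = LinearRegression().fit(Z_sim, d_sim)
dhat_sim = first_stage.predict(Z_sim).reshape(-1, 1)

# Second stage: regress y on the first-stage fitted values
second_stage = LinearRegression().fit(dhat_sim, y_sim)

ols_coef = ols_sim.coef_[0]
tsls_coef = second_stage.coef_[0]
print('OLS coefficient (simulated): ', ols_coef)
print('2SLS coefficient (simulated): ', tsls_coef)
```

In this simulated setup, the OLS coefficient should sit noticeably above `beta_true`, while the 2SLS coefficient should land close to it.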


## Now do ML-augmented two-stage least squares using Random Forest to obtain the fitted values

### Try yourself first

In [None]:
# import necessary packages and create prediction "object"

# first grow random forest:

# now get random forest predictions to use as instrument:

# do "first stage" using random forest predictions as instrument:

# finally, 2nd stage regression:


### Cheat

In [None]:
# import necessary packages and create prediction "object"
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(max_depth=2,max_features='sqrt')
# first grow random forest (ravel d so the target is a 1-D array, as sklearn expects):
rf.fit(Z,d.values.ravel())
# now get random forest predictions to use as instrument:
dtilde=rf.predict(Z).reshape(-1,1)
# do "first stage" using random forest predictions as instrument:
fs_rf=linear_model.LinearRegression().fit(dtilde,d)
dhat_rf=fs_rf.predict(dtilde)
# finally, 2nd stage regression:
tsls_rf=linear_model.LinearRegression().fit(dhat_rf,y)
print('2SLS+Random Forest coefficient: ',tsls_rf.coef_)

## Now do Belloni, Chernozhukov, and Hansen Post-Lasso 2SLS

### Try yourself first

In [None]:
# hint 1: don't forget to scale the Zs before doing Lasso
# hint 2: to select the columns of a NumPy array corresponding to a set of nonzero coefficients, you can do something like:
# Z_selected = Z_scaled[:,model.coef_!=0]

### Cheat

In [None]:
# Lasso tends to work better with standardized variables, so:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
Z_scaled = scaler.fit_transform(Z)

# create lasso object; LassoCV chooses the penalty parameter by cross-validation
lasso=linear_model.LassoCV()

# predict d using Z_scaled (ravel d so the target is a 1-D array):
lasso.fit(Z_scaled,d.values.ravel())
print(lasso.coef_)
# grab just the Zs with nonzero coeffs
Z_selected=Z_scaled[:,lasso.coef_!=0]

# do the first stage regression via OLS using the selected Zs and get the fitted values:
postlasso_fs = linear_model.LinearRegression().fit(Z_selected,d)
dhat_postlasso = postlasso_fs.predict(Z_selected)

# do 2nd stage regression using the post-lasso fitted values:
tsls_postlasso = linear_model.LinearRegression().fit(dhat_postlasso,y)
print('Post-Lasso 2SLS coefficient: ',tsls_postlasso.coef_)


In [None]:
Z_selected.shape

## Now go back to ML-augmented 2SLS and try with several different prediction methods
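
One possible way to organize this exercise is to wrap the three steps (ML prediction, first stage, second stage) in a small helper that accepts any scikit-learn regressor. The helper name `ml_tsls`, the simulated data, and the particular learners and hyperparameters below are assumptions for illustration; this is a sketch, not the notebook's prescribed solution.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

def ml_tsls(model, Z, d, y):
    """Hypothetical helper: 2SLS using model's prediction of d from Z as the instrument."""
    d = np.asarray(d).ravel()
    y = np.asarray(y).ravel()
    # predict d from Z with the supplied learner (in-sample predictions)
    dtilde = model.fit(Z, d).predict(Z).reshape(-1, 1)
    # first stage: OLS of d on the ML prediction
    dhat = LinearRegression().fit(dtilde, d).predict(dtilde).reshape(-1, 1)
    # second stage: OLS of y on the first-stage fitted values
    return LinearRegression().fit(dhat, y).coef_[0]

# Simulated data (assumed for illustration) with a known effect of 2.0
rng = np.random.default_rng(1)
n = 3000
beta_true = 2.0
Z_sim = rng.normal(size=(n, 3))
u = rng.normal(size=n)  # unobserved confounder
d_sim = Z_sim @ np.array([1.0, 0.5, -0.5]) + u + rng.normal(size=n)
y_sim = beta_true * d_sim + 2.0 * u + rng.normal(size=n)

models = {
    'Linear (plain 2SLS)': LinearRegression(),
    'Random Forest': RandomForestRegressor(max_depth=2, max_features='sqrt',
                                           n_estimators=50, random_state=0),
    'Gradient Boosting': GradientBoostingRegressor(max_depth=2, n_estimators=50,
                                                   random_state=0),
}

results = {name: ml_tsls(model, Z_sim, d_sim, y_sim) for name, model in models.items()}
for name, coef in results.items():
    print('2SLS+{} coefficient: {}'.format(name, coef))
```

With the notebook's own data, a call might look like `ml_tsls(RandomForestRegressor(max_depth=2, max_features='sqrt'), Z, d, y)`. Passing `LinearRegression()` as the learner reproduces plain 2SLS, which gives a baseline for comparing the other methods.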