# Causal Inference
# School of Information, University of Michigan 
## Week 2

### Resources:
- Course Manual, which can be found in Coursera
- ["Nike Says Its \$250 Running Shoes Will Make You Run Much Faster. What if That’s Actually True?"](assets/NYT_Nike.pdf)

## Part 1

### Background
Nike claims that its $250 running shoes called “Vaporfly” will make you run much faster!

### Data 

The data file “lecture2_match_reg.csv” contains 5 variables for 24,699 runners that qualified for and ran the same marathon. Below are the descriptions of each variable in the data: 

- `age`: age of runner (min value: 18, max value: 55)
- `male`: dummy variable for gender; equal to 0 if female, 1 if male 
- `marathoner_type`: 
    - “seasoned” if runner has at least 3 prior completed marathons, 
    - “enthusiastic” if runner has 1 or 2 prior completed marathons, 
    - “first_timer” if this is a runner’s first time running a marathon 
- `vaporfly`: 1 if a runner’s racing shoe is Nike Vaporfly, 0 otherwise 
- `race_time`: marathon completion time in seconds

In [1]:
#Import Statements. Run this cell.
import pandas as pd
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
from causalinference import CausalModel

In [2]:
#Uploading data for assignment. Run this cell.
data_marathon = pd.read_csv('assets/lecture2_match_reg.csv')

#Display the first five lines of the dataframe.
data_marathon.head()

Unnamed: 0,age,marathoner_type,vaporfly,race_time,male
0,41,enthusiastic,1,11755.176,1
1,42,enthusiastic,1,14980.95,0
2,39,enthusiastic,0,12342.542,1
3,29,enthusiastic,0,13142.107,1
4,34,enthusiastic,1,13255.874,0


## Questions

Using race time data for marathon runners, we want to test this claim. Specifically, Nike claims that runners run 4% faster with Vaporfly shoes.

**Note**: You can refer to the manual for the methods we use in the assignment if you need to. 

**Use the data_marathon dataframe uploaded above to answer the questions below unless otherwise specified.**

**1.** In order to be able to interpret our results in the same format (i.e. percentage change), we want to conduct our analysis over the (natural) log of race times. Transform the original race_time variable by taking its natural log, call the new variable `ln_race_time`, and add it to the data_marathon dataframe. We will use *ln_race_time* as our outcome variable throughout the entire assignment. (1 pt)

**Tip**: Use the `.head()` or `.sample()` methods to check your work.

In [3]:
data_marathon['ln_race_time'] = np.log(data_marathon['race_time'])
#raise NotImplementedError()

In [4]:
# Hidden Tests, checking the value of ln_race_time.
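
A small sketch of why the log transform gives a percentage-change reading: for modest changes, a difference in natural logs is close to the relative change. The two finishing times below are hypothetical and are not taken from the dataset.

```python
import math

# Hypothetical finishing times in seconds (not from the marathon data).
time_without = 13000.0
time_with = 12480.0  # exactly 4% faster than time_without

# Exact relative change in race time.
pct_change = (time_with - time_without) / time_without

# Difference in natural logs, the scale used for ln_race_time.
log_diff = math.log(time_with) - math.log(time_without)

print(f"exact change:   {pct_change:.4f}")
print(f"log difference: {log_diff:.4f}")
```

The two numbers differ only in the third decimal place here, and the gap grows as the change gets larger.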

**2.** Compute the means of ln_race_time for runners who wore Vaporfly and for those who did not. What is the difference between the means across those two groups? Assign the value to the variable `mean_diff1_2`. Ensure that the response is correct to at least 4 decimal places and that its data type is float. (1 pt)

In [5]:
mean_diff1_2 = data_marathon[data_marathon['vaporfly'] == 1]['ln_race_time'].mean() - data_marathon[data_marathon['vaporfly'] == 0]['ln_race_time'].mean()
#raise NotImplementedError()

In [6]:
# Hidden Tests, checking the value of mean_diff1_2.
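
The raw difference in means is the same number a regression of the outcome on a constant and the treatment dummy would report, which is one way to see that it uses no controls. The sketch below checks this on synthetic data; the sample size and coefficients are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (hypothetical, not the marathon dataset).
treated = rng.integers(0, 2, size=500)  # 0/1 treatment dummy
outcome = 9.5 - 0.05 * treated + rng.normal(0, 0.1, size=500)

# Raw difference in group means.
mean_diff = outcome[treated == 1].mean() - outcome[treated == 0].mean()

# OLS of the outcome on a constant and the dummy, with no other controls.
X = np.column_stack([np.ones(len(treated)), treated])
beta, *_ = np.linalg.lstsq(X, outcome, rcond=None)

print(mean_diff, beta[1])
```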

**3.** Suppose the only thing that matters for race time, besides potentially shoes, is age. We want to estimate the average treatment effect (ATE) of wearing Nike Vaporfly shoes, using nearest neighbor matching on the variable *age* with respect to Euclidean distance. Use the `CausalModel` module and assign the ATE to the variable `ate1_3`. Ensure that the data type of *ate1_3* is float. (Round to four decimal places.) (2 pts)

**Use the data_sample1_3 dataframe for your response below (Question 3); data_sample1_3 is a subsample of our marathon data, which will save computational time for this question.**

In [7]:
data_sample1_3 = data_marathon.sample(n = 2000, random_state = 123)

In [8]:
model = CausalModel(Y = data_sample1_3.ln_race_time.values, #outcome
                    D = data_sample1_3.vaporfly.values, #treatment
                    X = data_sample1_3.age.values) #covariates

model.est_via_matching()
ate1_3 = round(model.estimates['matching']['ate'], 4)
#raise NotImplementedError()

In [9]:
# Hidden Tests, checking the value of ate1_3.
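
To make the idea of nearest neighbor matching concrete, here is a simplified sketch on a single covariate. It is not the `causalinference` implementation, which may differ in details such as tie handling. The function name `nn_matching_ate` and the toy arrays are hypothetical.

```python
import numpy as np

def nn_matching_ate(y, d, x):
    """One-nearest-neighbor matching on one covariate, with replacement."""
    y, d, x = map(np.asarray, (y, d, x))
    treated = np.flatnonzero(d == 1)
    control = np.flatnonzero(d == 0)
    effects = np.empty(len(y))
    for i in range(len(y)):
        # Search the opposite group for the unit closest in x.
        pool = control if d[i] == 1 else treated
        j = pool[np.argmin(np.abs(x[pool] - x[i]))]
        # Treated-minus-control outcome difference for this matched pair.
        effects[i] = y[i] - y[j] if d[i] == 1 else y[j] - y[i]
    return effects.mean()

# Hypothetical toy data: ages, shoe indicator, log race times.
age = np.array([25, 26, 40, 41, 55, 54])
shoe = np.array([1, 0, 1, 0, 1, 0])
ln_time = np.array([9.40, 9.45, 9.30, 9.36, 9.20, 9.27])

ate_sketch = nn_matching_ate(ln_time, shoe, age)
print(ate_sketch)
```

Each runner is compared with the runner of the opposite shoe status who is closest in age, and the ATE estimate is the average of those pairwise differences.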

**4.** We want to conduct propensity score matching to estimate the ATE of running with Nike Vaporfly shoes.

**Note: We are back to using the original dataframe data_marathon.**

**4a.** Create binary variables called `seasoned` and `enthusiastic` that are equal to 1 for the corresponding *marathoner_type* values and 0 otherwise. Add these variables to the data_marathon dataframe. (1 pt)

**Tip**: Use the `np.where()` method to create the binary variables.

In [10]:
data_marathon['seasoned'] = np.where(data_marathon['marathoner_type'] == 'seasoned', 1, 0)
data_marathon['enthusiastic'] = np.where(data_marathon['marathoner_type'] == 'enthusiastic', 1, 0)
#raise NotImplementedError()

In [11]:
# Hidden Tests, checking the values of seasoned and enthusiastic.

**4b.** Use propensity score matching with a logit model to estimate the ATE of wearing Nike Vaporfly shoes. Use the variables *age*, *male*, *seasoned*, *enthusiastic* and the method `est_via_matching()` from the CausalModel module. Assign the ATE to the variable `ate1_4b` and ensure that its data type is float. (Round to four decimal places.) (2 pts)

**Use the data_sample1_4b dataframe for your response below (Question 4b); data_sample1_4b is a subsample of our marathon data, which will save computational time for this question.**

In [12]:
data_sample1_4b = data_marathon.sample(n = 2000, random_state = 123)

In [13]:
#Stack the four covariates as columns of an (n, 4) array
covariate_array = np.column_stack([data_sample1_4b.age.values,
                                   data_sample1_4b.male.values,
                                   data_sample1_4b.seasoned.values,
                                   data_sample1_4b.enthusiastic.values])

#Model
model = CausalModel(Y = data_sample1_4b.ln_race_time.values, #outcome
                    D = data_sample1_4b.vaporfly.values, #treatment
                    X = covariate_array) #covariates

#Estimating propensity scores and matching
model.est_propensity()
model.est_via_matching()
#Extract the ATE estimate
ate1_4b = round(model.estimates['matching']['ate'], 4)
#raise NotImplementedError()

In [14]:
#Hidden Tests, checking the value of ate1_4b.
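
For intuition about what a logit propensity model estimates, the sketch below fits one by Newton's method on synthetic data and computes each unit's predicted probability of treatment. It is an illustration under made-up assumptions (the helper `fit_logit`, the covariates, and the coefficients are all hypothetical), not the routine that `est_propensity()` runs.

```python
import numpy as np

def fit_logit(X, d, n_iter=25):
    """Fit a logit model by Newton's method; intercept is the first coefficient."""
    Z = np.column_stack([np.ones(len(d)), X])
    beta = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Z @ beta))
        grad = Z.T @ (d - p)                        # score vector
        hess = (Z * (p * (1 - p))[:, None]).T @ Z   # negative Hessian of the log-likelihood
        beta += np.linalg.solve(hess, grad)
    return beta

rng = np.random.default_rng(1)
n = 2000
# Synthetic covariates (hypothetical): age and a male dummy.
age = rng.uniform(18, 55, size=n)
male = rng.integers(0, 2, size=n)
true_index = -3.0 + 0.06 * age + 0.3 * male
d = (rng.uniform(size=n) < 1 / (1 + np.exp(-true_index))).astype(int)

X = np.column_stack([age, male])
beta = fit_logit(X, d)

# Propensity score: predicted probability of treatment given the covariates.
pscore = 1 / (1 + np.exp(-(np.column_stack([np.ones(n), X]) @ beta)))
print(beta)
print(pscore.min(), pscore.max())
```

Scores like these reduce several covariates to a single number per runner, which could then serve as the matching variable.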

## Part 2

> “Although regression is a many-splendored thing, we think of it as an automated matchmaker. Specifically, regression estimates are weighted averages of multiple matched comparisons of the sort constructed for the groups in our stylized matching matrix.” (Mastering Metrics, p. 49)

We now want to use controlled regression to conduct our analysis. 

----

Note that Question 1 asks you to estimate a mis-specified regression model; Question 2 requires you to describe the OVB in words using the information provided; Question 3 asks you to calculate the OVB (which normally is not possible).

**1.** Using robust standard errors, regress *ln_race_time* on variables *vaporfly*, *male*, *seasoned*, and *enthusiastic*. Assign the results (using the `.get_robustcov_results()` method) to the variable `ols_robust2_1`.  (2 pts)

In [15]:
#Results of base model
reg = smf.ols(formula = 'ln_race_time ~ vaporfly + male + seasoned + enthusiastic', data = data_marathon).fit()

#Add robust standard errors to reg
ols_robust2_1 = reg.get_robustcov_results(cov_type= 'HC1')

#View summary table
print(ols_robust2_1.summary())
#raise NotImplementedError()

                            OLS Regression Results                            
Dep. Variable:           ln_race_time   R-squared:                       0.304
Model:                            OLS   Adj. R-squared:                  0.303
Method:                 Least Squares   F-statistic:                     2437.
Date:                Fri, 11 Feb 2022   Prob (F-statistic):               0.00
Time:                        21:39:44   Log-Likelihood:                 17850.
No. Observations:               24699   AIC:                        -3.569e+04
Df Residuals:                   24694   BIC:                        -3.565e+04
Df Model:                           4                                         
Covariance Type:                  HC1                                         
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept        9.6532      0.001   7842.522   

In [16]:
# Hidden Tests, checking the coefficients and standard errors of ols_robust2_1.
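
As a rough illustration of what the `HC1` covariance type computes, the sketch below builds the heteroskedasticity-robust "sandwich" estimator by hand with NumPy. The helper `ols_hc1` and the synthetic data are hypothetical; in the assignment itself statsmodels does this work.

```python
import numpy as np

def ols_hc1(X, y):
    """OLS coefficients with HC1 heteroskedasticity-robust standard errors."""
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    # Sandwich "meat": sum over observations of e_i^2 * x_i x_i'.
    meat = (X * resid[:, None] ** 2).T @ X
    # HC1 applies the n / (n - k) small-sample scaling to the HC0 sandwich.
    cov = XtX_inv @ meat @ XtX_inv * n / (n - k)
    return beta, np.sqrt(np.diag(cov))

rng = np.random.default_rng(2)
n = 1000
x = rng.integers(0, 2, size=n).astype(float)
# The error spread depends on x, so the errors are heteroskedastic.
y = 9.6 - 0.05 * x + rng.normal(0, 0.05 + 0.10 * x, size=n)

X = np.column_stack([np.ones(n), x])
beta, se = ols_hc1(X, y)
print(beta, se)
```

Robust standard errors leave the coefficients unchanged; only the estimated uncertainty around them differs from the default OLS output.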

**2.** Vaporfly shoes are pretty expensive, with a retail price of $250. Younger runners, who we expect would have lower income levels, may find it too expensive to purchase. Furthermore, it has been documented that in endurance type races, such as marathons, older runners are actually faster than their younger peers. Answer the following questions based on this information. 

**2a.** Based on the information above, would you expect the effect of the omitted variable *age* on the outcome variable *ln_race_time* to be positive or negative? (1 pt)

**Note**: This question will be manually graded.

The effect of the omitted variable age on the outcome variable ln_race_time would be negative. Older marathon runners are actually faster than younger runners, so ln_race_time is expected to decrease as age increases.

**2b.** If we were to regress the omitted variable *age* on the treatment variable *vaporfly*, would you expect the coefficient in front of the variable *vaporfly* to be positive or negative? (1 pt)

**Note**: This question will be manually graded.

If we were to regress the omitted variable age (Y variable) on vaporfly (X variable), we would expect the coefficient (slope) on vaporfly to be positive. Vaporfly shoes are expensive and less likely to be purchased by younger runners, so runners who wear them would be older on average than runners who do not.

**2c.** State the OVB (omitted variable bias) formula. Based on it, would you expect the omitted variable bias to be positive or negative? Explain. (2 pts) 

**Note**: This question will be manually graded.

The OVB formula is: OVB = (coefficient on vaporfly in the short regression) - (coefficient on vaporfly in the long regression) = (coefficient on vaporfly when the omitted variable is regressed on the included variables) * (coefficient on the omitted variable in the long regression). Here the long regression is Y = a + (B1 * P1 + ... + Bi * Pi) + y * A + u, where Y is the outcome variable (ln_race_time), a is the intercept, u is the error term, P1...Pi are the variables included in the model (vaporfly, male, etc.), B1...Bi are their coefficients, A is the omitted variable (age), and y is the coefficient age would have if it were included. The short regression is the same model without the y * A term.
Based on this formula, I would expect the omitted variable bias to be negative. The effect of age on ln_race_time is negative because older runners perform better in endurance races (2a), and the relationship between age and vaporfly is positive because older runners are more likely to have the income to purchase the shoes (2b). A positive term times a negative term gives a negative bias, so the short regression makes the vaporfly coefficient more negative than it should be.
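
The OVB relationship (short coefficient minus long coefficient equals the omitted-on-included slope times the omitted variable's own coefficient) can be checked numerically. The sketch below uses synthetic data with made-up coefficients in which treated units tend to be older and age lowers the outcome, mirroring the story in Question 2.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5000

# Synthetic data (hypothetical): treated units are older, and age lowers y.
d = rng.integers(0, 2, size=n).astype(float)
age = 30 + 5 * d + rng.normal(0, 8, size=n)
y = 10.0 - 0.04 * d - 0.006 * age + rng.normal(0, 0.1, size=n)

def ols(X, target):
    """Least-squares coefficients of target on the columns of X."""
    return np.linalg.lstsq(X, target, rcond=None)[0]

const = np.ones(n)
short_reg = ols(np.column_stack([const, d]), y)       # omits age
long_reg = ols(np.column_stack([const, d, age]), y)   # controls for age
aux_reg = ols(np.column_stack([const, d]), age)       # omitted regressed on included

ovb = short_reg[1] - long_reg[1]
print(ovb, aux_reg[1] * long_reg[2])
```

The two printed values agree up to floating-point error, and both are negative because a positive slope (treated units are older) multiplies a negative age coefficient.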

**3.** Let’s confirm the omitted variable bias. Run the regression from Part 2, Question 1 but this time also control for runners’ age by including the additional control variable age. 

In [17]:
#Results of base model
reg_2 = smf.ols(formula = 'ln_race_time ~ vaporfly + male + seasoned + enthusiastic + age', data = data_marathon).fit()

#Add robust standard errors to reg
omitted_age = reg_2.get_robustcov_results(cov_type= 'HC1')

#View summary table
print(omitted_age.summary())
#raise NotImplementedError()

                            OLS Regression Results                            
Dep. Variable:           ln_race_time   R-squared:                       0.464
Model:                            OLS   Adj. R-squared:                  0.464
Method:                 Least Squares   F-statistic:                     3824.
Date:                Fri, 11 Feb 2022   Prob (F-statistic):               0.00
Time:                        21:39:44   Log-Likelihood:                 21081.
No. Observations:               24699   AIC:                        -4.215e+04
Df Residuals:                   24693   BIC:                        -4.210e+04
Df Model:                           5                                         
Covariance Type:                  HC1                                         
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept        9.9345      0.003   2861.720   

Compare the two coefficients in front of the treatment variable vaporfly. What do you observe? Based on your observation, are the results consistent with your previous intuitive analysis of OVB? (2 pts) 

**Note**: This question will be manually graded.

Following the inclusion of the omitted variable age, the vaporfly coefficient increased from -.0658 to -.0426, moving closer to zero. This result is consistent with the previous intuitive analysis of OVB. The coefficient without age minus the coefficient with age is -.0658 - (-.0426) = -.0232, a negative bias as expected. Age has a negative relationship with ln_race_time, which was confirmed above, and the increase in the vaporfly coefficient shows that omitting age led the initial model to overstate the effect of vaporfly. The original model was biased because it attributed part of the age effect on ln_race_time to the shoes.
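
The arithmetic behind this comparison can be written out using the two coefficients quoted above. Because the outcome is in logs, `exp(b) - 1` converts a coefficient `b` into the exact percentage change it implies. The variable names are illustrative only.

```python
import math

coef_short = -0.0658  # vaporfly coefficient without age, as quoted above
coef_long = -0.0426   # vaporfly coefficient with age, as quoted above

# Short-minus-long difference: the implied omitted variable bias.
ovb = coef_short - coef_long
print(f"implied OVB: {ovb:.4f}")

# Exact percentage change implied by a coefficient on a log outcome.
for label, b in [("short", coef_short), ("long", coef_long)]:
    print(f"{label}: {100 * (math.exp(b) - 1):.2f}%")
```

Once age is controlled for, the implied change is a little over 4%, which is close to the 4% figure in Nike's claim.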