# Investigating the Effect of Nike Vaporfly Shoes on Marathon Performance

## Introduction

Nike claims that its $250 running shoes, the Vaporfly, will make you run much faster. This project aims to explore the accuracy of this claim using data from marathon runners. Specifically, we will investigate whether runners using Vaporfly shoes run significantly faster than those who do not, using methods such as nearest neighbor matching, propensity score matching, and controlled regression.

## Data Description

We use the dataset `lecture2_match_reg.csv`, which contains information on 24,699 runners who qualified for and ran the same marathon. The dataset includes the following variables:
- **age**: Age of the runner (min: 18, max: 55)
- **male**: Gender (0 = Female, 1 = Male)
- **marathoner_type**: Experience level ("seasoned", "enthusiastic", "first_timer")
- **vaporfly**: 1 if the runner wore Nike Vaporfly shoes, 0 otherwise
- **race_time**: Marathon completion time in seconds

## Methodology

To evaluate Nike's claim, we take the natural logarithm of race time so that estimated effects can be read as approximate percentage changes. We then estimate the average treatment effect (ATE) of wearing Vaporfly shoes with three methods:
1. **Nearest Neighbor Matching** on the variable `age`
2. **Propensity Score Matching** using variables `age`, `male`, `seasoned`, and `enthusiastic`
3. **Controlled Regression** with robust standard errors

## Analysis and Results

In [2]:
# Import necessary libraries
import pandas as pd
import numpy as np
from causalinference import CausalModel
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Load the dataset
data_marathon = pd.read_csv('assets/lecture2_match_reg.csv')

# Transform race_time to its natural logarithm
data_marathon["ln_race_time"] = np.log(data_marathon["race_time"])

# Display the first few rows to verify the transformation
data_marathon.head()

| Unnamed: 0 | age | marathoner_type | vaporfly | race_time | male | ln_race_time |
|---|---|---|---|---|---|---|
| 0 | 41 | enthusiastic | 1 | 11755.176 | 1 | 9.372049 |
| 1 | 42 | enthusiastic | 1 | 14980.95 | 0 | 9.614535 |
| 2 | 39 | enthusiastic | 0 | 12342.542 | 1 | 9.420807 |
| 3 | 29 | enthusiastic | 0 | 13142.107 | 1 | 9.483577 |
| 4 | 34 | enthusiastic | 1 | 13255.874 | 0 | 9.492196 |


In [3]:
# Raw (unadjusted) difference in mean ln_race_time: Vaporfly minus non-Vaporfly runners
vap_mean = data_marathon.loc[data_marathon["vaporfly"] == 1, "ln_race_time"].mean()
no_vap_mean = data_marathon.loc[data_marathon["vaporfly"] == 0, "ln_race_time"].mean()
mean_diff1_2 = round(vap_mean - no_vap_mean, 4)
mean_diff1_2

-0.064
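
A difference in log race time is only approximately a percentage change; the exact conversion is `exp(d) - 1`. The sketch below applies it to the raw difference reported above. The helper name `log_diff_to_pct` is hypothetical and not part of the original notebook.

```python
import math

def log_diff_to_pct(log_diff):
    """Convert a difference in natural logs to an exact percentage change."""
    return 100.0 * (math.exp(log_diff) - 1.0)

# Raw difference in mean ln_race_time taken from the cell output above
raw_pct = log_diff_to_pct(-0.064)
print(f"{raw_pct:.2f}%")
```

For small differences such as this one, the log difference and the exact percentage change are close, which is why the log scale is convenient here.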

In [4]:
# Sample a subset for computational efficiency
data_sample1_3 = data_marathon.sample(n=2000, random_state=123)

# Create the causal model
model_c = CausalModel(Y=np.array(data_sample1_3["ln_race_time"]),
                      D=np.array(data_sample1_3["vaporfly"]),
                      X=np.array(data_sample1_3["age"]))

# Estimate ATE via matching
model_c.est_via_matching()
ate1_3 = round(model_c.estimates["matching"]["ate"], 4)
ate1_3

-0.0372
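
To make the matching step less of a black box, here is a minimal NumPy sketch of one-to-one nearest neighbor matching (with replacement) on a single covariate. It is an illustration under stated assumptions, not the `causalinference` implementation: the data are synthetic, the function name `nn_match_ate` is hypothetical, and the true effect of -0.04 is chosen only for the demonstration.

```python
import numpy as np

def nn_match_ate(y, d, x):
    """ATE from 1-nearest-neighbor matching (with replacement) on one covariate."""
    y, d, x = np.asarray(y, float), np.asarray(d, int), np.asarray(x, float)
    treated = np.flatnonzero(d == 1)
    control = np.flatnonzero(d == 0)
    # Pairwise |x_i - x_j| between treated (rows) and control (columns) units
    dist = np.abs(x[treated][:, None] - x[control][None, :])
    match_for_treated = control[dist.argmin(axis=1)]  # closest control per treated unit
    match_for_control = treated[dist.argmin(axis=0)]  # closest treated per control unit
    # Unit-level effects: observed outcome minus the matched counterfactual
    effects_treated = y[treated] - y[match_for_treated]
    effects_control = y[match_for_control] - y[control]
    return np.concatenate([effects_treated, effects_control]).mean()

# Synthetic data (NOT the marathon dataset): older runners are assumed to be
# faster and more likely to wear the shoes; the true effect on log time is -0.04.
rng = np.random.default_rng(0)
n = 2000
age = rng.uniform(18, 55, n)
d = (rng.random(n) < 0.2 + 0.5 * (age - 18) / 37).astype(int)
y = 9.9 - 0.008 * age - 0.04 * d + rng.normal(0, 0.05, n)

naive = y[d == 1].mean() - y[d == 0].mean()
matched = nn_match_ate(y, d, age)
print(f"naive: {naive:.4f}, matched: {matched:.4f}")
```

In this synthetic setup the naive difference overstates the speed-up because treated runners are older on average, while the matched estimate compares runners of similar age.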

In [6]:
# Create binary variables for seasoned and enthusiastic
data_marathon["seasoned"] = np.where(data_marathon["marathoner_type"] == "seasoned", 1, 0)
data_marathon["enthusiastic"] = np.where(data_marathon["marathoner_type"] == "enthusiastic", 1, 0)

In [7]:
# Sample a subset for computational efficiency
data_sample1_4b = data_marathon.sample(n=2000, random_state=123)

# Create the causal model with additional covariates
model_c_4 = CausalModel(Y=np.array(data_sample1_4b["ln_race_time"]),
                        D=np.array(data_sample1_4b["vaporfly"]),
                        X=np.array(data_sample1_4b[["age", "male", "seasoned", "enthusiastic"]]))

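# Estimate propensity scores, then the ATE via matching.
# Note: est_via_matching() appears to match on the covariates in X,
# not on the propensity scores estimated by est_propensity().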
model_c_4.est_propensity()
model_c_4.est_via_matching()
ate1_4b = round(model_c_4.estimates["matching"]["ate"], 4)
ate1_4b

-0.0374
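
As far as I can tell, `est_via_matching()` in `causalinference` matches on the covariates in `X` rather than on the fitted propensity score, so the cell above is closer to covariate matching. The sketch below shows what matching directly on an estimated propensity score could look like. It is a hedged illustration: the data are synthetic, the helpers `fit_logit` and `match_on_score` are hypothetical, and the logistic regression is a plain Newton's method fit rather than any library routine.

```python
import numpy as np

def fit_logit(X, d, n_iter=25):
    """Logistic regression via Newton's method; X must include a constant column."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        gradient = X.T @ (d - p)
        hessian = (X * (p * (1.0 - p))[:, None]).T @ X
        beta += np.linalg.solve(hessian, gradient)
    return beta

def match_on_score(y, d, score):
    """ATE from 1-nearest-neighbor matching (with replacement) on a scalar score."""
    treated, control = np.flatnonzero(d == 1), np.flatnonzero(d == 0)
    dist = np.abs(score[treated][:, None] - score[control][None, :])
    effects_treated = y[treated] - y[control[dist.argmin(axis=1)]]
    effects_control = y[treated[dist.argmin(axis=0)]] - y[control]
    return np.concatenate([effects_treated, effects_control]).mean()

# Synthetic data (NOT the marathon dataset); the true effect on log time is -0.04.
rng = np.random.default_rng(1)
n = 2000
age = rng.uniform(18, 55, n)
male = rng.integers(0, 2, n)
age_z = (age - age.mean()) / age.std()  # standardize age for a stable fit
true_p = 1.0 / (1.0 + np.exp(-(-0.5 + 0.8 * age_z + 0.5 * male)))
d = (rng.random(n) < true_p).astype(int)
y = 9.9 - 0.008 * age - 0.10 * male - 0.04 * d + rng.normal(0, 0.05, n)

# Step 1: estimate P(treated | covariates); step 2: match on that single number
X = np.column_stack([np.ones(n), age_z, male])
pscore = 1.0 / (1.0 + np.exp(-X @ fit_logit(X, d)))
ate_ps = match_on_score(y, d, pscore)
print(f"ATE matched on propensity score: {ate_ps:.4f}")
```

The design point is that the propensity score collapses several covariates into one dimension, so runners are paired by their estimated probability of wearing the shoes rather than by each covariate separately.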

In [8]:
# Fit the regression model without age
model_reg = smf.ols(formula='ln_race_time ~ vaporfly + male + seasoned + enthusiastic', data=data_marathon).fit()
ols_robust2_1 = model_reg.get_robustcov_results(cov_type='HC1')
print(ols_robust2_1.summary())

                            OLS Regression Results                            
Dep. Variable:           ln_race_time   R-squared:                       0.304
Model:                            OLS   Adj. R-squared:                  0.303
Method:                 Least Squares   F-statistic:                     2437.
Date:                Wed, 24 Jul 2024   Prob (F-statistic):               0.00
Time:                        14:51:29   Log-Likelihood:                 17850.
No. Observations:               24699   AIC:                        -3.569e+04
Df Residuals:                   24694   BIC:                        -3.565e+04
Df Model:                           4                                         
Covariance Type:                  HC1                                         
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept        9.6532      0.001   7842.522   
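
The `HC1` covariance type requested above is a heteroskedasticity-robust "sandwich" estimator with a degrees-of-freedom correction. A minimal NumPy sketch of that calculation follows; the data are synthetic and the function name `ols_hc1` is hypothetical, so treat it as an illustration of the formula rather than the statsmodels code path.

```python
import numpy as np

def ols_hc1(X, y):
    """OLS coefficients with HC1 (heteroskedasticity-robust) standard errors."""
    n, k = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ X.T @ y
    resid = y - X @ beta
    # Sandwich "meat": sum over observations of e_i^2 * x_i x_i'
    meat = (X * resid[:, None] ** 2).T @ X
    cov_hc0 = xtx_inv @ meat @ xtx_inv
    cov_hc1 = cov_hc0 * n / (n - k)  # degrees-of-freedom correction
    return beta, np.sqrt(np.diag(cov_hc1))

# Synthetic data with noise variance that grows with x (heteroskedasticity)
rng = np.random.default_rng(2)
n = 5000
x = rng.uniform(0, 1, n)
y = 1.0 + 2.0 * x + rng.normal(0, 0.1 + x, n)
X = np.column_stack([np.ones(n), x])

beta, se_hc1 = ols_hc1(X, y)
print(beta, se_hc1)
```

Robust standard errors leave the coefficients unchanged; only the estimated uncertainty around them differs from the classical formula.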

In [9]:
# Fit the regression model with age included
model_reg_with_age = smf.ols(formula='ln_race_time ~ vaporfly + male + seasoned + enthusiastic + age', data=data_marathon).fit()
ols_robust2_3 = model_reg_with_age.get_robustcov_results(cov_type='HC1')
print(ols_robust2_3.summary())

                            OLS Regression Results                            
Dep. Variable:           ln_race_time   R-squared:                       0.464
Model:                            OLS   Adj. R-squared:                  0.464
Method:                 Least Squares   F-statistic:                     3824.
Date:                Wed, 24 Jul 2024   Prob (F-statistic):               0.00
Time:                        14:51:36   Log-Likelihood:                 21081.
No. Observations:               24699   AIC:                        -4.215e+04
Df Residuals:                   24693   BIC:                        -4.210e+04
Df Model:                           5                                         
Covariance Type:                  HC1                                         
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept        9.9345      0.003   2861.720   

### Omitted Variable Bias (OVB) Analysis

#### a. Expected Effect of Omitted Variable `age` on `ln_race_time`

Younger runners may find the expensive Vaporfly shoes unaffordable. Furthermore, older runners are often reported to perform better in endurance races like marathons. If that holds in this sample (ages 18 to 55), I expect the coefficient on the omitted variable `age` in a regression of `ln_race_time` to be negative: an additional year of age is associated with a lower log race time.

#### b. Expected Coefficient of `vaporfly` on `age`

As age increases, runners might have more disposable income to afford the Vaporfly shoes. Thus, when regressing the omitted variable `age` on the treatment variable `vaporfly`, I expect a positive coefficient on `vaporfly`: Vaporfly wearers are, on average, older.

#### c. OVB Formula and Expected Bias

The formula for omitted variable bias (OVB) is given by:

* $\hat{\beta}^{s} = \hat{\beta}^{l} + \hat{\theta}^{l}\hat{\pi}_{1}$
* $A_{i} = \pi_{0} + \pi_{1}S_{i} + v_{i}$ (plus the other covariates as controls)
* OVB $= \hat{\theta}^{l}\hat{\pi}_{1}$ = ("effect" of age on ln_race_time) × ("effect" of vaporfly on age)

Here $\hat{\beta}^{s}$ is the coefficient on `vaporfly` ($S_{i}$) in the short regression that omits `age`, and $\hat{\beta}^{l}$ is the same coefficient in the long regression that includes it. $\hat{\theta}^{l}$ is the coefficient on `age` ($A_{i}$) in the long regression of `ln_race_time` on `vaporfly`, `age`, and the other covariates, and $\hat{\pi}_{1}$ is the coefficient on `vaporfly` in the regression of `age` on `vaporfly` and the other covariates.

Given that the "effect" of `age` on `ln_race_time` is negative and the "effect" of `vaporfly` on `age` is positive, I expect the OVB to be negative. In other words, the short regression should make the Vaporfly coefficient look more negative (a larger speed-up) than the long regression does.
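
The identity $\hat{\beta}^{s} = \hat{\beta}^{l} + \hat{\theta}^{l}\hat{\pi}_{1}$ holds exactly in-sample, which can be checked numerically. The sketch below does so on synthetic data; the data-generating numbers and the helper `ols` are assumptions made for illustration and are not estimates from the marathon dataset.

```python
import numpy as np

def ols(X, y):
    """Least-squares coefficients of y on the columns of X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Synthetic data (NOT the marathon dataset): age is assumed to raise the
# chance of treatment and to lower log race time.
rng = np.random.default_rng(3)
n = 1000
age = rng.uniform(18, 55, n)
male = rng.integers(0, 2, n).astype(float)
s = (rng.random(n) < 0.2 + 0.5 * (age - 18) / 37).astype(float)
y = 9.9 - 0.008 * age - 0.10 * male - 0.04 * s + rng.normal(0, 0.05, n)

const = np.ones(n)
# Short regression: age omitted
beta_short = ols(np.column_stack([const, s, male]), y)[1]
# Long regression: age included
long_coefs = ols(np.column_stack([const, s, male, age]), y)
beta_long, theta_long = long_coefs[1], long_coefs[3]
# Auxiliary regression: age on treatment and the same controls
pi_1 = ols(np.column_stack([const, s, male]), age)[1]

ovb = theta_long * pi_1
print(f"short - long: {beta_short - beta_long:.6f}, theta * pi: {ovb:.6f}")
```

With the models fitted in this notebook, the same comparison would be `model_reg.params["vaporfly"] - model_reg_with_age.params["vaporfly"]`, which should equal the product of the `age` coefficient in the long regression and the `vaporfly` coefficient in the auxiliary regression.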

### Confirming Omitted Variable Bias



In [11]:
# Refit the regression with age included, to compare its vaporfly coefficient
# with the one from the model that omits age (In [8])
model_reg_with_age = smf.ols(formula='ln_race_time ~ vaporfly + male + seasoned + enthusiastic + age', data=data_marathon).fit()
ols_robust2_3 = model_reg_with_age.get_robustcov_results(cov_type='HC1')
print(ols_robust2_3.summary())

                            OLS Regression Results                            
Dep. Variable:           ln_race_time   R-squared:                       0.464
Model:                            OLS   Adj. R-squared:                  0.464
Method:                 Least Squares   F-statistic:                     3824.
Date:                Wed, 24 Jul 2024   Prob (F-statistic):               0.00
Time:                        15:06:37   Log-Likelihood:                 21081.
No. Observations:               24699   AIC:                        -4.215e+04
Df Residuals:                   24693   BIC:                        -4.210e+04
Df Model:                           5                                         
Covariance Type:                  HC1                                         
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept        9.9345      0.003   2861.720   

## Conclusion

Our analysis shows that wearing Nike Vaporfly shoes is associated with faster marathon completion times. The raw difference in mean log race time is -0.064 (roughly 6%), while nearest neighbor matching and propensity score matching, each run on a random sample of 2,000 runners, give ATE estimates of -0.0372 and -0.0374 (roughly 3.7%). Both matching estimates are negative, meaning shorter race times, which supports Nike's claim that Vaporfly shoes can help runners achieve faster times, although the effect is smaller than the raw comparison suggests. The regression analysis further confirms these findings and highlights the importance of controlling for potential confounders such as age.

## References

- ["Nike Says Its $250 Running Shoes Will Make You Run Much Faster. What if That’s Actually True?"](assets/NYT_Nike.pdf)