# Lecture worksheet 14 solutions

## Problem 1 (racial discrimination): 

In this problem, we analyze the resumes dataset from [Bertrand, M. and Mullainathan, S. (2004)](https://www.nber.org/system/files/working_papers/w9873/w9873.pdf).

We have defined functions to implement the estimators that we talked about in lecture. Read through the code and make sure you understand what the functions are doing.

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import scipy.stats as stats

In [2]:
# Functions for lecture worksheet

def get_Neyman_ATE(data_df, outcome_var, treatment_var, treatment_label=1):
    
    treated_indicator = (data_df[treatment_var] == treatment_label)
    Y_t = data_df[treated_indicator][outcome_var]
    Y_c = data_df[~treated_indicator][outcome_var]
    tau_hat = Y_t.mean() - Y_c.mean()
    
    return tau_hat

def get_Neyman_var(data_df, outcome_var, treatment_var, treatment_label=1):
    
    treated_indicator = (data_df[treatment_var] == treatment_label)
    Y_t = data_df[treated_indicator][outcome_var]
    Y_c = data_df[~treated_indicator][outcome_var]
    s1_sq = np.var(Y_t, ddof=1) # Originally a typo here, we had written np.std instead of np.var
    s0_sq = np.var(Y_c, ddof=1)
    n_1 = len(Y_t)
    n_0 = len(Y_c)
    V_Neyman_hat = s1_sq/n_1 + s0_sq/n_0
    
    return V_Neyman_hat

def get_symmetric_CI(estimate, sd, coverage=0.95):
    
    alpha = (1-coverage)/2
    z_alpha = stats.norm.ppf(1-alpha)
    CI = (estimate - z_alpha*sd, estimate + z_alpha*sd)
    
    return CI

def get_Neyman_null_p_val(estimate, sd):
    
    p_val = 1 - stats.norm.cdf(estimate / sd)
    
    return p_val

In [3]:
resumes_df = pd.read_csv("resume.csv")

1a) What is the difference-in-means estimate of the ATE? How would you interpret this?

In [4]:
ATE = get_Neyman_ATE(resumes_df, "call", "race", "white")
np.round(ATE, 3)

0.032

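**Solution**: The callback rate for resumes with stereotypically white names was 3.2 percentage points higher than for resumes with stereotypically black names. Since names were randomly assigned to resumes, we can interpret this difference as the causal effect of the name on the callback rate.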

1b) Compute a 95\% confidence interval for the ATE.

In [5]:
sd = np.sqrt(get_Neyman_var(resumes_df, "call", "race", "white"))
get_symmetric_CI(ATE, sd)

(0.01677459474666388, 0.04729111367222729)
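Because `call` is a 0/1 outcome, the sample variances used in `get_Neyman_var` have a simple closed form: $s^2 = \frac{n}{n-1}\hat{p}(1-\hat{p})$, where $\hat{p}$ is the callback rate in the group. The sketch below checks this identity on synthetic binary data (not the resume data; the sample size and rate are arbitrary choices for illustration).

```python
import numpy as np

# Synthetic 0/1 outcomes, for illustration only (NOT the resume dataset).
rng = np.random.default_rng(0)
y = rng.binomial(1, 0.1, size=500)

n = len(y)
p_hat = y.mean()

# Sample variance as computed inside get_Neyman_var
s_sq_numpy = np.var(y, ddof=1)

# Closed form for binary data: n/(n-1) * p_hat * (1 - p_hat)
s_sq_closed = n / (n - 1) * p_hat * (1 - p_hat)

print(s_sq_numpy, s_sq_closed)
```

So for binary outcomes, the Neyman variance estimate depends on the data only through the two group sizes and the two callback rates.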

1c) Compute the Neyman null p-value and compare it to the null p-value from the Fisher exact test. Which is smaller? Why do you think this is the case? Does this mean that one test is better than the other?

In [6]:
p_val = get_Neyman_null_p_val(ATE, sd)
p_val

1.9383722123067493e-05

In [7]:
tab = pd.crosstab(resumes_df["call"], resumes_df["race"])
_, p_val = stats.fisher_exact(tab, alternative="greater")
p_val

2.379373553950655e-05

**Solution**: Here, as opposed to lecture, we use a one-sided Fisher's exact test (so the alternative is that some units had a positive treatment effect). The two $p$-values are quite similar, but the exact test $p$-value is slightly larger. Contrary to my (Yan Shuo's) original belief, this is [pretty representative](https://projecteuclid.org/journals/statistical-science/volume-32/issue-3/A-Paradox-from-Randomization-Based-Causal-Inference/10.1214/16-STS571.full) of what usually happens, and the underlying reasons are more complex than I thought (hence way beyond the scope of the course).

Although the Fisher null is more stringent than the Neyman null (it requires a zero effect for every unit, not just a zero effect on average), the Neyman test is only asymptotically valid, and can be shown to be anti-conservative in smaller samples.

Both tests have their uses. Fisher's exact test is finite-sample valid, and is especially useful for small sample sizes, whereas the Neyman null is often the more interesting hypothesis.

*Note: There was originally a typo in the code which made the Neyman $p$-value much larger.*
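To see the two tests side by side without relying on `stats.fisher_exact`, one can approximate the randomization distribution under the Fisher (sharp) null by shuffling the treatment labels. The following is a minimal sketch on synthetic binary data, not the resume data; the group sizes, callback rates, and number of permutations are assumptions chosen for illustration, and `diff_in_means` is a helper introduced here. For a binary outcome, the one-sided permutation $p$-value should roughly agree with the one-sided Fisher exact $p$-value, up to Monte Carlo error.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(0)

# Synthetic data (assumption): 200 treated and 200 control units, binary outcomes.
n_1, n_0 = 200, 200
treated = np.r_[np.ones(n_1, dtype=bool), np.zeros(n_0, dtype=bool)]
outcome = np.r_[rng.binomial(1, 0.15, n_1), rng.binomial(1, 0.08, n_0)]

def diff_in_means(y, t):
    """Difference in mean outcome between treated (t) and control (~t)."""
    return y[t].mean() - y[~t].mean()

tau_hat = diff_in_means(outcome, treated)

# Under the sharp null, outcomes are fixed and only the labels are random,
# so we re-randomize the labels and recompute the statistic each time.
n_perm = 5000
perm_stats = np.array(
    [diff_in_means(outcome, rng.permutation(treated)) for _ in range(n_perm)]
)

# One-sided Monte Carlo p-value (the +1 terms keep it strictly positive).
frt_p_val = (1 + np.sum(perm_stats >= tau_hat)) / (1 + n_perm)

# One-sided Neyman p-value from the normal approximation, as in get_Neyman_null_p_val.
se = np.sqrt(
    np.var(outcome[treated], ddof=1) / n_1 + np.var(outcome[~treated], ddof=1) / n_0
)
neyman_p_val = 1 - stats.norm.cdf(tau_hat / se)

print(frt_p_val, neyman_p_val)
```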

1d) (**optional**) Compute the two p-values again, but for the subgroup of men.

In [8]:
men_resumes_df = resumes_df[resumes_df["sex"] == "male"]

In [9]:
ATE = get_Neyman_ATE(men_resumes_df, "call", "race", "white")
np.round(ATE, 3)

0.03

In [10]:
sd = np.sqrt(get_Neyman_var(men_resumes_df, "call", "race", "white"))
get_symmetric_CI(ATE, sd)

(-1.779562305029292e-05, 0.060833507985448315)

In [11]:
p_val = get_Neyman_null_p_val(ATE, sd)
p_val

0.02506707445738776

In [12]:
tab = pd.crosstab(men_resumes_df["call"], men_resumes_df["race"])
_, p_val = stats.fisher_exact(tab, alternative="greater")
p_val

0.032876140169162425

## Problem 2 (compliance): 

In lecture, we talked about the problem of compliance, where individuals do not comply with their assigned treatment, and cross over to the other treatment group (e.g. individuals randomized into Medicaid don't enroll in Medicaid and vice versa). Whenever this happens, researchers usually have two choices:
1. Analyze the individuals as randomized (intention-to-treat analysis), or 
2. Analyze the individuals according to the treatment they actually received (per protocol analysis).

Discuss the pros and cons of both approaches.

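**Solution**: Under per protocol analysis, treatment is no longer randomized, so there may be confounding: whatever drives individuals to comply may also affect the outcome. Under intention-to-treat analysis, randomization is preserved, but we are measuring the effect of being assigned to treatment rather than the effect of the treatment itself.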
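The trade-off can be illustrated with a small simulation. This is a sketch under strong, made-up assumptions: an unobserved `health` variable drives both compliance and the outcome, noncompliance is one-sided (only individuals assigned to treatment can fail to take it), and the true effect of receiving treatment is 1 for everyone. None of these quantities come from a real study.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical unobserved confounder and a randomized assignment.
health = rng.normal(size=n)
assigned = rng.integers(0, 2, size=n).astype(bool)

# Assumption: healthier individuals are more likely to comply, and
# only those assigned to treatment can actually receive it.
complies = rng.random(n) < 1 / (1 + np.exp(-health))
received = assigned & complies

# Assumption: constant effect of 1.0 from actually receiving treatment.
true_effect = 1.0
outcome = health + true_effect * received + rng.normal(size=n)

# Intention-to-treat: compare groups as randomized.
itt = outcome[assigned].mean() - outcome[~assigned].mean()

# Per protocol: keep only individuals whose received treatment matches
# their assignment, then compare treated vs control among them.
followed = received == assigned
per_protocol = (
    outcome[followed & assigned].mean() - outcome[followed & ~assigned].mean()
)

print(f"true effect:  {true_effect:.2f}")
print(f"ITT estimate: {itt:.2f}")
print(f"per protocol: {per_protocol:.2f}")
```

In this setup the intention-to-treat estimate is diluted toward zero because only about half of those assigned to treatment take it, while the per protocol estimate overshoots because the compliers are healthier than the control group they are compared with.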

## Problem 3 (hypothetical randomized experiments): 

Think of a hypothesis you would like to test, and construct a hypothetical randomized experiment to test it. Are there hypotheses that are fundamentally untestable?

**Solution**: I'm interested in the effect of working from home on mental health. Hypothetically, we could randomize individuals into working from home vs not.

Some interventions may not be well-defined. For instance, it may not be meaningful to test the effect of obesity on mortality, because obesity is so fundamentally tied to other health aspects that it cannot be isolated.