# Lecture worksheet 14

## Problem 1 (racial discrimination): 

In this problem, we analyze the resumes dataset from [Bertrand, M. and Mullainathan, S. (2004)](https://www.nber.org/system/files/working_papers/w9873/w9873.pdf).

We have defined functions to implement the estimators that we talked about in lecture. Read through the code and make sure you understand what the functions are doing.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import scipy.stats as stats

In [None]:
# Functions for lecture worksheet

def get_Neyman_ATE(data_df, outcome_var, treatment_var, treatment_label=1):
    
    treated_indicator = (data_df[treatment_var] == treatment_label)
    Y_t = data_df[treated_indicator][outcome_var]
    Y_c = data_df[~treated_indicator][outcome_var]
    tau_hat = Y_t.mean() - Y_c.mean()
    
    return tau_hat

def get_Neyman_var(data_df, outcome_var, treatment_var, treatment_label=1):
    
    treated_indicator = (data_df[treatment_var] == treatment_label)
    Y_t = data_df[treated_indicator][outcome_var]
    Y_c = data_df[~treated_indicator][outcome_var]
    s1_sq = np.var(Y_t, ddof=1)  # sample variance of treated outcomes
    s0_sq = np.var(Y_c, ddof=1)  # sample variance of control outcomes
    n_1 = len(Y_t)
    n_0 = len(Y_c)
    V_Neyman_hat = s1_sq/n_1 + s0_sq/n_0
    
    return V_Neyman_hat

def get_symmetric_CI(estimate, sd, coverage=0.95):
    
    alpha = (1-coverage)/2
    z_alpha = stats.norm.ppf(1-alpha)
    CI = (estimate - z_alpha*sd, estimate + z_alpha*sd)
    
    return CI

def get_Neyman_null_p_val(estimate, sd):
    
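    # One-sided p-value (alternative: ATE > 0) from the normal approximation;
    # `sd` is the standard error of the estimate, not its variance
    p_val = 1 - stats.norm.cdf(estimate / sd)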
    
    return p_val
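To check your reading of the functions above, it can help to reproduce the same quantities by hand on a tiny dataset. The sketch below does this directly with pandas and SciPy. The data and the column names `treated` and `callback` are made up for illustration and are not the columns of the resumes dataset. Note that the confidence interval and the p-value take a standard error, which is the square root of the Neyman variance estimate.

```python
import numpy as np
import pandas as pd
import scipy.stats as stats

# Hypothetical toy data: `treated` and `callback` are illustrative column names
toy_df = pd.DataFrame({
    "treated":  [1, 1, 1, 1, 0, 0, 0, 0],
    "callback": [1, 0, 1, 1, 0, 0, 1, 0],
})

y_t = toy_df.loc[toy_df["treated"] == 1, "callback"]
y_c = toy_df.loc[toy_df["treated"] == 0, "callback"]

# Difference-in-means estimate of the ATE
tau_hat = y_t.mean() - y_c.mean()

# Neyman variance estimate: sample variances (ddof=1) divided by group sizes
var_hat = y_t.var(ddof=1) / len(y_t) + y_c.var(ddof=1) / len(y_c)
se_hat = np.sqrt(var_hat)  # standard error = square root of the variance

# 95% normal-approximation confidence interval
z = stats.norm.ppf(0.975)
ci = (tau_hat - z * se_hat, tau_hat + z * se_hat)

# One-sided p-value under the null of zero average effect
p_one_sided = 1 - stats.norm.cdf(tau_hat / se_hat)

print(tau_hat, var_hat, ci, p_one_sided)
```

With only eight rows the normal approximation is not trustworthy; the point is only to trace how each quantity feeds into the next.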

1a) What is the difference-in-means estimate of the ATE? How would you interpret it?

1b) Compute a 95\% confidence interval for the ATE.

1c) Compute the Neyman null p-value and compare it to the null p-value from the Fisher exact test. Which is smaller? Why do you think this is the case? Does this mean that one test is better than the other?

1d) (**optional**) Compute the two p-values again, but for the subgroup of men.
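For the comparison with the Fisher test, here is a minimal sketch of a randomization test. It assumes that "Fisher exact test" refers to testing the sharp null of no effect for any unit by re-randomizing the treatment labels, and it approximates the exact p-value by Monte Carlo rather than enumerating every assignment. The function name `permutation_p_value` and the toy data are hypothetical, and the test is one-sided to match `get_Neyman_null_p_val`.

```python
import numpy as np

def permutation_p_value(outcomes, treated, n_permutations=10_000, seed=0):
    """Monte Carlo approximation to a one-sided randomization-test p-value."""
    rng = np.random.default_rng(seed)
    outcomes = np.asarray(outcomes, dtype=float)
    treated = np.asarray(treated, dtype=bool)

    # Observed difference in means
    observed = outcomes[treated].mean() - outcomes[~treated].mean()

    # Under the sharp null the outcomes are fixed, so only the labels are reshuffled
    count = 0
    for _ in range(n_permutations):
        shuffled = rng.permutation(treated)
        stat = outcomes[shuffled].mean() - outcomes[~shuffled].mean()
        if stat >= observed:
            count += 1

    return observed, count / n_permutations

# Hypothetical toy data, not the resumes dataset
toy_outcomes = [1, 0, 1, 1, 0, 0, 1, 0]
toy_treated = [1, 1, 1, 1, 0, 0, 0, 0]
observed_diff, p_perm = permutation_p_value(toy_outcomes, toy_treated)
print(observed_diff, p_perm)
```

For a subgroup analysis, the same function can be applied after filtering the data to the rows of interest.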

## Problem 2 (compliance): 

In lecture, we talked about the problem of noncompliance, where individuals do not comply with their assigned treatment and cross over to the other treatment group (e.g., individuals randomized into Medicaid do not enroll, or individuals not randomized into Medicaid enroll anyway). When this happens, researchers usually have two choices:
1. Analyze the individuals as randomized (intention-to-treat analysis), or 
2. Analyze the individuals according to the treatment group (per protocol analysis).

Discuss the pros and cons of each approach.
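To make the trade-off concrete, the following is a small simulation sketch on synthetic data. Everything in it is an assumption chosen for illustration: the compliance rule (assigned units take the treatment only when an unobserved trait `u` is positive, and control units never take it), the true effect of 1.0, and the reading of per-protocol analysis as keeping only units whose received treatment matches their assignment.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
true_effect = 1.0  # hypothetical effect of actually receiving the treatment

u = rng.normal(size=n)          # unobserved baseline trait that also affects the outcome
z = rng.integers(0, 2, size=n)  # randomized assignment (1 = assigned to treatment)

# Assumed compliance rule: assigned units take up treatment only if u > 0;
# control units never receive the treatment
d = (z == 1) & (u > 0)

# Outcome depends on treatment received, the trait, and noise
y = true_effect * d + u + rng.normal(size=n)

# Intention-to-treat: compare groups as randomized
itt = y[z == 1].mean() - y[z == 0].mean()

# Per-protocol: drop assigned units that did not take up the treatment
per_protocol = y[(z == 1) & d].mean() - y[z == 0].mean()

print(f"ITT estimate:          {itt:.2f}")
print(f"Per-protocol estimate: {per_protocol:.2f}")
```

In this setup the intention-to-treat estimate is smaller than the effect of receiving the treatment, because only about half of the assigned units take it up. The per-protocol estimate is larger, because the units that take it up differ systematically in `u` from the control group. Other compliance rules can push the per-protocol estimate in the opposite direction.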

## Problem 3 (hypothetical randomized experiments): 

Think of a hypothesis you would like to test, and construct a hypothetical randomized experiment to test it. Are there hypotheses that are fundamentally untestable?