### Hypothesis testing on Bertrand and Mullainathan data

Previously, we used bootstrap hypothesis testing to check whether two samples come from the same distribution, $F=G$. Now we will use the same method to check whether white resumes get more callbacks than black resumes, by a margin that would be unlikely to arise by chance.

Start by reading in the data.

In [4]:
import pandas as pd 
df = pd.read_stata("lakisha_aer.dta")

Warmup: make a smaller data frame using the `race` and `call` variables. These fields store the race of the resume, along with an indicator showing if the resume received a callback.

Use pandas to make one list of the `call` values for white resumes and one list of the `call` values for black resumes. `white_call` and `black_call` should each be a list of integers.

In [1]:
white_call = None
black_call = None
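As a hint for the warmup, here is a minimal sketch of the filtering pattern on a small toy data frame. The values in `toy`, the `"w"`/`"b"` race codes, and the `toy_*` names are assumptions made up for illustration. Check how `race` and `call` are actually coded in `df` before adapting it.

```python
import pandas as pd

# Toy data with made-up values (NOT the real dataset).
# Assumption: race is coded as "w"/"b" and call as 0.0/1.0.
toy = pd.DataFrame({
    "race": ["w", "b", "w", "b", "w", "b"],
    "call": [1.0, 0.0, 0.0, 0.0, 1.0, 1.0],
})

# Keep only the two columns of interest.
small = toy[["race", "call"]]

# Boolean mask on race, select the call column, convert to a list of ints.
toy_white_call = small.loc[small["race"] == "w", "call"].astype(int).to_list()
toy_black_call = small.loc[small["race"] == "b", "call"].astype(int).to_list()

print(toy_white_call)
print(toy_black_call)
```

The `.astype(int)` step only matters if the indicator is stored as a float, which is an assumption here.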

- Make a histogram of white callbacks and a histogram of black callbacks. Show the histograms side by side.
- Be sure to use the appropriate number of [bins](https://stackoverflow.com/questions/54918651/controlling-bin-widths-in-altair)

In [2]:
import altair as alt


Eyeballing the histograms, does it seem like the distributions are different? Does the difference look big or small? Could it have arisen by chance?

[Your answer here]

## Testing a hypothesis

Use the bootstrap hypothesis testing code to check whether the distributions of callbacks differ between the two groups.
- Note: I have adjusted the code a little bit to make it run faster for this example.
- Note: You will need around 10,000 bootstrap samples (`B`) to get a non-zero p-value!
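
For reference, the quantity returned by the function in the next cell can be written as follows. This is a reading of that code, not an additional definition. The symbols $y^*_b$, $z^*_b$ and $T^*_b$ are notation introduced here: on batch $b$, $z^*_b$ is the first $N$ resampled values, $y^*_b$ is the remaining $M$, and $T^*_b$ is the test statistic computed on them.

$$
T^*_b = \texttt{test\_stat\_function}(y^*_b, z^*_b), \qquad
\hat{p} = \frac{1}{B} \sum_{b=1}^{B} \mathbf{1}\left[\, T^*_b \ge T_{\text{obs}} \,\right]
$$

With `TS` as the test statistic, $T^*_b = \bar{y}^*_b - \bar{z}^*_b$, the difference in sample means. Because $\hat{p}$ is a count divided by $B$, its smallest non-zero value is $1/B$, which is why a small $B$ can report a p-value of exactly zero.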

In [3]:
from tqdm.notebook import tqdm  # notebook progress bar; tqdm_notebook is deprecated
import numpy as np
from random import shuffle

def bootstrap_hypothesis_testing(x, N, M, test_stat_function, T_obs, B=100):
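    '''
    Implement bootstrap hypothesis testing by computing B samples and returning a p-value
    
    inputs:
        x [list]: a sample from F
        N [int]: the number of samples in the original treatment group
        M [int]: the number of samples in the original control group
        test_stat_function [callable]: takes two lists and returns a test statistic
        T_obs [float]: the test statistic observed on the original data
        B [int]: the number of batches to sample
    returns:
        p [float]: the fraction of the B batches with a test statistic >= T_obs
    '''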
    c = 0
    df = pd.DataFrame({"data": x})
    for b in tqdm(range(B), total=B): # a cool thing to learn, not essential
        sample = df.sample(replace=True, n = N + M)
        zstar = sample["data"].iloc[0:N].to_list()  # first N resampled values
        ystar = sample["data"].iloc[N:].to_list()   # remaining M resampled values
        T = test_stat_function(ystar, zstar)
        if T >= T_obs:
            c += 1
    p = c/B
    return p

def TS(x, y):
    return np.mean(x) - np.mean(y)
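
To see how the pieces fit together before running on the real data, here is a stand-alone sketch of the same idea on toy data. It is a NumPy-only variation, not the course function: `pooled_bootstrap_p_value`, `mean_diff`, the toy groups, and the choice to pool the two groups inside the function are all assumptions made for illustration.

```python
import numpy as np

def mean_diff(a, b):
    """Difference in sample means, a minus b."""
    return np.mean(a) - np.mean(b)

def pooled_bootstrap_p_value(group_a, group_b, B=2000, seed=0):
    """Estimate a one-sided p-value by resampling from the pooled data.

    Hypothetical helper for illustration only.
    """
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([group_a, group_b])
    n_a = len(group_a)
    t_obs = mean_diff(group_a, group_b)

    count = 0
    for _ in range(B):
        # Resample N + M values with replacement from the pooled data,
        # then split into two pseudo-groups of the original sizes.
        resample = rng.choice(pooled, size=len(pooled), replace=True)
        t_star = mean_diff(resample[:n_a], resample[n_a:])
        if t_star >= t_obs:
            count += 1
    return count / B

# Toy callback indicators with made-up rates (30% vs 10%).
toy_a = [1] * 30 + [0] * 70
toy_b = [1] * 10 + [0] * 90

p = pooled_bootstrap_p_value(toy_a, toy_b)
print(p)
```

Under the null hypothesis the group labels carry no information, which is why both pseudo-groups are drawn from the pooled data. A small p-value means that a difference as large as the observed one rarely appears when the labels are ignored.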

## Interpret your results 

What p-value did you compute? How would you interpret this p-value? 

[Your answer here]