# Examining Racial Discrimination in the US Job Market

### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés to black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes when presented to the employer.

### Exercises
You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.

Answer the following questions **in this notebook below and submit to your Github account**. 

   1. What test is appropriate for this problem? Does CLT apply?
   2. What are the null and alternate hypotheses?
   3. Compute margin of error, confidence interval, and p-value. Try using both the bootstrapping and the frequentist statistical approaches.
   4. Write a story describing the statistical significance in the context of the original problem.
   5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

You can include written notes in notebook cells using Markdown: 
   - In the control panel at the top, choose Cell > Cell Type > Markdown
   - Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet

#### Resources
+ Experiment information and data source: http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states
+ Scipy statistical methods: http://docs.scipy.org/doc/scipy/reference/stats.html 
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet
+ Formulas for the Bernoulli distribution: https://en.wikipedia.org/wiki/Bernoulli_distribution

In [1]:
import pandas as pd
import numpy as np
from scipy import stats

In [2]:
data = pd.io.stata.read_stata('data/us_job_market_discrimination.dta')

In [3]:
# number of callbacks for white-sounding names
sum(data[data.race=='w'].call)

235.0

In [4]:
data.head()

Unnamed: 0,id,ad,education,ofjobs,yearsexp,honors,volunteer,military,empholes,occupspecific,...,compreq,orgreq,manuf,transcom,bankreal,trade,busservice,othservice,missind,ownership
0,b,1,4,2,6,0,0,0,1,17,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
1,b,1,3,3,6,0,1,1,0,316,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
2,b,1,4,1,6,0,0,0,0,19,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
3,b,1,3,4,6,0,1,0,1,313,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
4,b,1,3,3,22,0,0,0,0,313,...,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,Nonprofit


<div class="span5 alert alert-success">
<p>Your answers to Q1 and Q2 here</p>
</div>

## Q1
The most appropriate test for this situation is the two-sample t-test, which is used to test the difference between two population means. A common application is to determine whether the means of two groups are equal; in this case the groups are résumés with 'black-sounding' and 'white-sounding' names. Because 'call' takes only the values 0 and 1, each group mean is the proportion of résumés that received a callback.

More specifically, we're going to test whether these two groups have similar call back rates from prospective employers.

Additionally, the Central Limit Theorem tells us that if we repeatedly draw samples of size n and plot the distribution of their means, that distribution will be approximately normal. As the sample size (n) increases, the distribution of sample means looks more and more like a normal distribution.

The sample size (n) has to be large (usually n >= 30) if the population from which the sample is taken is non-normal. If the population follows the normal distribution, then the sample size n can be either small or large. Based on this, as long as we keep our sample size (n) greater than 30, the Central Limit Theorem will hold.
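
As a rough illustration of the Central Limit Theorem for 0/1 data like 'call', the sketch below simulates many samples and compares the spread of their means with the theoretical value sqrt(p(1 - p)/n). The callback probability, sample size, and number of samples are assumed values chosen for illustration, not figures from the dataset.

```python
import numpy as np

# Assumed values for illustration only; not taken from the dataset.
rng = np.random.default_rng(0)
p_callback = 0.08      # hypothetical Bernoulli success probability
n = 100                # size of each sample
n_samples = 10_000     # number of repeated samples

# Each row is one sample of n Bernoulli(p) draws; take the mean of each row.
draws = rng.binomial(1, p_callback, size=(n_samples, n))
sample_means = draws.mean(axis=1)

# CLT prediction: mean p and standard deviation sqrt(p * (1 - p) / n).
predicted_sd = np.sqrt(p_callback * (1 - p_callback) / n)

print('mean of sample means:', sample_means.mean())
print('sd of sample means:  ', sample_means.std())
print('CLT predicted sd:    ', predicted_sd)
```

A histogram of `sample_means` should look roughly bell-shaped even though each individual observation is only 0 or 1.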

## Q2
We're trying to determine whether there is a difference between the two population means (i.e. whether the difference d = μ1 - μ2 is 0). When the null hypothesis states that there is no difference, the null and alternative hypotheses are often stated as follows:

- Ho: μ1 = μ2

- Ha: μ1 ≠ μ2

With Ho being the null hypothesis and Ha being the alternative hypothesis. 

In [5]:
w = data[data.race=='w']
b = data[data.race=='b']

# Solution to Q3

To do the two-sample bootstrap test, we shift both arrays to have the same mean, since we are simulating the hypothesis that their means are, in fact, equal. 

We then draw bootstrap samples out of the shifted arrays and compute the difference in means. This constitutes a bootstrap replicate, and we generate many of them. 

The p-value is the fraction of replicates with a difference in means greater than or equal to what was observed. 

### Quick note: since our sample size is greater than 30, the t-distribution and z-distribution will look approximately the same. 

In this example, we'll be assessing a 95% two-tailed confidence interval, which corresponds to z-scores of -1.96 and 1.96.
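
A minimal sketch of how the margin of error and 95% confidence interval for a difference in two proportions could be computed with the 1.96 critical value. The function name `diff_proportion_ci` is hypothetical, and the counts in the example call are illustrative only, not the notebook's data.

```python
import numpy as np
from scipy import stats

def diff_proportion_ci(successes_1, n_1, successes_2, n_2, confidence=0.95):
    """Return (difference, margin of error, (low, high)) for p1 - p2."""
    p_1 = successes_1 / n_1
    p_2 = successes_2 / n_2
    diff = p_1 - p_2
    # Unpooled standard error, the usual choice for a confidence interval.
    se = np.sqrt(p_1 * (1 - p_1) / n_1 + p_2 * (1 - p_2) / n_2)
    # Two-tailed critical value: about 1.96 for 95% confidence.
    z_crit = stats.norm.ppf(1 - (1 - confidence) / 2)
    margin = z_crit * se
    return diff, margin, (diff - margin, diff + margin)

# Illustrative counts only, not the notebook's data.
diff, margin, ci = diff_proportion_ci(30, 200, 18, 200)
print('difference:', diff)
print('margin of error:', margin)
print('95% CI:', ci)
```

In this notebook the same function could presumably be called with `w['call'].sum(), len(w), b['call'].sum(), len(b)`. If the resulting interval excludes 0, that agrees with rejecting the null hypothesis at the 5% level in a two-tailed test.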

### Bootstrap Method

In [6]:
# compute the mean callback rate of all the observations
population_mean = np.mean(data['call'])
empirical_diff_means = np.mean(w['call']) - np.mean(b['call'])
print('The mean callback rate of the population is ', population_mean)
print('The difference of means between white-sounding and black-sounding is', empirical_diff_means)

The mean callback rate of the population is  0.08049281686544418
The difference of means between white-sounding and black-sounding is 0.03203285485506058


In [7]:
# generate shifted arrays
w_shifted = w['call'] - np.mean(w['call']) + population_mean
b_shifted = b['call'] - np.mean(b['call']) + population_mean

In [8]:
# function to draw random sample, of size that is equal to the length of the data
def bootstrap_replicate_1d(data, func):
    return func(np.random.choice(data, size=len(data)))

# function to store means of the bootstrap replicates, with default size of 1 
def draw_bs_reps(data, func, size=1):
    """Draw bootstrap replicates."""

    # Initialize array of replicates: bs_replicates
    bs_replicates = np.empty(size)

    # Generate replicates
    for i in range(size):
        bs_replicates[i] = bootstrap_replicate_1d(data, func)

    return bs_replicates

In [9]:
# compute 1000 bootstrap replicates from shifted arrays
bs_replicates_w = draw_bs_reps(w_shifted, np.mean, size=1000)
bs_replicates_b = draw_bs_reps(b_shifted, np.mean, size=1000)

# get replicates of difference of means
bs_replicates = bs_replicates_w - bs_replicates_b

# compute and print p-value
p = np.sum(bs_replicates >= empirical_diff_means) / len(bs_replicates)

print('p-value =', p)

p-value = 0.0
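
Question 3 also asks for a confidence interval and margin of error. One possible bootstrap approach is a percentile interval on the difference in means, computed from the unshifted data. The sketch below uses synthetic 0/1 arrays as stand-ins for `w['call']` and `b['call']`; the callback rates, array sizes, and the helper name `bootstrap_diff_means` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-ins for w['call'] and b['call']; rates are assumed values.
w_calls = rng.binomial(1, 0.10, size=2000)
b_calls = rng.binomial(1, 0.06, size=2000)

def bootstrap_diff_means(x, y, size, rng):
    """Resample x and y with replacement and return differences of means."""
    reps = np.empty(size)
    for i in range(size):
        x_resampled = rng.choice(x, size=len(x))
        y_resampled = rng.choice(y, size=len(y))
        reps[i] = x_resampled.mean() - y_resampled.mean()
    return reps

reps = bootstrap_diff_means(w_calls, b_calls, size=2000, rng=rng)

# Percentile interval: the middle 95% of the bootstrap differences.
ci_low, ci_high = np.percentile(reps, [2.5, 97.5])
# Half the interval width serves as an approximate margin of error.
margin_of_error = (ci_high - ci_low) / 2

print('95% bootstrap CI:', (ci_low, ci_high))
print('margin of error:', margin_of_error)
```

Note that a reported p-value of 0.0 from 1000 replicates only means that no replicate reached the observed difference, so the p-value is best read as smaller than about 1/1000 rather than exactly zero.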


### Frequentist method

In [10]:
# take random sample from both white-sounding and black-sounding data sets
w_random = w['call'].sample(n = 100, replace = True, random_state = 5)
b_random = b['call'].sample(n = 100, replace = True, random_state = 5)

Let p1 represent the population callback proportion for white-sounding names and p2 represent the population callback proportion for black-sounding names.

Null Hypothesis
H0: p1 = p2 (or H0: p1 – p2 = 0)

Alternative Hypothesis
Ha : p1 > p2 (or Ha : p1 – p2 > 0)

Use a significance level of α = 0.05

In [11]:
# get callback mean proportion from both samples
w_random_mean = np.mean(w_random)
b_random_mean = np.mean(b_random)

# get standard deviation of the samples
w_random_std = np.std(w_random)
b_random_std = np.std(b_random)

# standard error of the samples
# standard error = standard deviation / sqrt(sample size)
w_random_se = w_random_std / np.sqrt(len(w_random))
b_random_se = b_random_std / np.sqrt(len(b_random))

print('The difference between the two sample proportions is', w_random_mean - b_random_mean)

# calculate the overall sample proportion
sample_mean = (w_random_mean + b_random_mean) / 2

print('The combined sample proportion is', sample_mean)

# calculate the combined standard error
sample_se = np.sqrt((sample_mean * (1 - sample_mean)) * (1/len(w_random) + 1 / len(b_random)))

print('The combined standard error is', sample_se)

# calculate the test statistic
sample_test_statistic = (w_random_mean - b_random_mean - 0) / sample_se

print('The test statistic is', sample_test_statistic)
print(sample_se)

The difference between the two sample proportions is 0.08000000193715096
The combined sample proportion is 0.09999999962747097
The combined standard error is 0.042426406800948106
The test statistic is 1.8856181319452014
0.042426406800948106


The p-value is the probability of being at or beyond (in this case, to the right of) the test statistic of ~1.89. Our significance level of 0.05 with 99 degrees of freedom corresponds to a one-tailed critical t-value of ~1.66. Now what does this mean?

We can reject our null hypothesis that the difference in callback proportions is 0 if our p-value is less than the significance level. In this case, with a t-score of ~1.89 and 99 degrees of freedom, our p-value is equal to 0.0315. Since 0.0315 < 0.05, we can reject our null hypothesis in favor of the alternative hypothesis, that the callback proportion for white-sounding names is greater than that of black-sounding names. 
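
Rather than reading the p-value from a table, it can be computed from the test statistic. The sketch below uses the survival functions in `scipy.stats`. The statistic is the value printed by the cell above and the 99 degrees of freedom follow the text; the normal (z) version is included only for comparison.

```python
from scipy import stats

test_statistic = 1.8856181319452014  # value printed above
degrees_of_freedom = 99              # as used in the text

# Survival function = 1 - CDF, i.e. the right-tail probability.
p_value_t = stats.t.sf(test_statistic, df=degrees_of_freedom)
p_value_z = stats.norm.sf(test_statistic)

# One-tailed critical value at alpha = 0.05.
t_critical = stats.t.ppf(0.95, df=degrees_of_freedom)

print('one-tailed p-value (t):', p_value_t)
print('one-tailed p-value (z):', p_value_z)
print('critical t-value:', t_critical)
```

Both p-values fall below 0.05, and the test statistic exceeds the critical value, so either distribution leads to the same decision here.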

<div class="span5 alert alert-success">
<p> Your answers to Q4 and Q5 here </p>
</div>

# Q4 

In statistical hypothesis testing, a result has statistical significance when it is very unlikely to have occurred given the null hypothesis. More precisely, a study's defined significance level, α, is the probability of the study rejecting the null hypothesis, given that it were true; and the p-value of a result, p, is the probability of obtaining a result at least as extreme, given that the null hypothesis were true. The result is statistically significant when p < α.

In any observation that involves drawing a sample from a population, there is always the possibility that an observed effect would have occurred due to sampling error alone. However, if the p-value of an observed effect is less than the significance level, one may conclude that the effect reflects the characteristics of the whole population, thereby rejecting the null hypothesis. (*https://en.wikipedia.org/wiki/Statistical_significance)

In the specific case of this assignment, the probability of observing a difference in callback rates between white-sounding and black-sounding names at least as large as the one in our sample, assuming the null hypothesis is true, is 0.0315 (i.e. our p-value). The significance level we established at the start was 0.05, and since 0.0315 < 0.05, the result is statistically significant, meaning that the difference is unlikely to be due to sampling error alone. 

# Q5

Because we are able to reject the possibility that the observed effect was due to sampling error alone, the data support the alternative hypothesis that callback rates for white-sounding names are higher than those for black-sounding names. 

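This indicates that race/name does play a role in the hiring process and provides evidence of racial discrimination in the United States labor market. It does not, by itself, show that race/name is the most important factor in callback success, because the test compares callback rates by race only. Assessing that would require comparing the effect of race against the other résumé attributes in the dataset.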
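
One possible way to amend the analysis is to compare the callback gap associated with race against the gap associated with another résumé attribute. The sketch below does this with a `groupby` on a synthetic DataFrame. The column names 'race', 'honors', and 'call' mirror the dataset, but every value is randomly generated for illustration, so the printed gaps say nothing about the real data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 1000

# Synthetic stand-in for the notebook's DataFrame; values are random.
df = pd.DataFrame({
    'race': rng.choice(['b', 'w'], size=n),
    'honors': rng.integers(0, 2, size=n),
    'call': rng.binomial(1, 0.08, size=n),
})

# Callback rate within each race/honors combination.
rates = df.groupby(['race', 'honors'])['call'].mean().unstack('honors')
print(rates)

# Gap in callback rate associated with each factor taken alone.
race_rates = df.groupby('race')['call'].mean()
honors_rates = df.groupby('honors')['call'].mean()
print('race gap:', race_rates['w'] - race_rates['b'])
print('honors gap:', honors_rates[1] - honors_rates[0])
```

On the real data, the same comparison could be repeated for columns such as 'education' or 'yearsexp'. A regression of 'call' on several attributes at once would be a more thorough follow-up.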