# Examining Racial Discrimination in the US Job Market

### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés to black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes when presented to the employer.

<div class="span5 alert alert-info">
### Exercises
You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.

Answer the following questions **in this notebook below and submit to your GitHub account**.

   1. What test is appropriate for this problem? Does CLT apply?
   2. What are the null and alternate hypotheses?
   3. Compute margin of error, confidence interval, and p-value.
   4. Write a story describing the statistical significance in the context of the original problem.
   5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

You can include written notes in notebook cells using Markdown: 
   - In the control panel at the top, choose Cell > Cell Type > Markdown
   - Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet


#### Resources
+ Experiment information and data source: http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states
+ Scipy statistical methods: http://docs.scipy.org/doc/scipy/reference/stats.html 
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet
</div>
****

In [39]:
import pandas as pd
import numpy as np
from scipy import stats
import math

In [2]:
# pd.read_stata is the public entry point for loading Stata .dta files
data = pd.read_stata('data/us_job_market_discrimination.dta')

In [3]:
# number of callbacks for black-sounding names
sum(data[data.race=='b'].call)

157.0

In [4]:
data.head()

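|  | id | ad | education | ofjobs | yearsexp | honors | volunteer | military | empholes | occupspecific | ... | compreq | orgreq | manuf | transcom | bankreal | trade | busservice | othservice | missind | ownership |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | b | 1 | 4 | 2 | 6 | 0 | 0 | 0 | 1 | 17 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |  |
| 1 | b | 1 | 3 | 3 | 6 | 0 | 1 | 1 | 0 | 316 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |  |
| 2 | b | 1 | 4 | 1 | 6 | 0 | 0 | 0 | 0 | 19 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |  |
| 3 | b | 1 | 3 | 4 | 6 | 0 | 1 | 0 | 1 | 313 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |  |
| 4 | b | 1 | 3 | 3 | 22 | 0 | 0 | 0 | 0 | 313 | ... | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | Nonprofit |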


###### What test is appropriate for this problem? Does CLT apply?

We can test whether the callback rate for resumes with white-sounding names differs significantly from the rate for resumes with black-sounding names using a hypothesis test for a difference between two proportions (a two-proportion z-test). Each resume is a Bernoulli trial (callback or no callback), so the number of callbacks in each group follows a binomial distribution. The central limit theorem applies: for sufficiently large n and values of p that are not too close to 0 or 1, the binomial distribution is well approximated by a normal distribution. A common rule of thumb is that both n\*p and n\*(1-p) should be larger than 10.

In [13]:
sample_mean_black = sum(data[data.race=='b'].call)/len(data[data.race=='b'])
sample_mean_black

0.064476386036960986

In [15]:
sample_mean_white = sum(data[data.race=='w'].call)/len(data[data.race=='w'])
sample_mean_white

0.096509240246406572

In [20]:
n_black = len(data[data.race=='b'])
n_black

2435

In [19]:
n_white = len(data[data.race=='w'])
n_white

2435

In [44]:
print(n_black * sample_mean_black > 10)
print(n_black * (1 - sample_mean_black) > 10)
print(n_white * sample_mean_white > 10)
print(n_white * (1 - sample_mean_white) > 10)

True
True
True
True


Both n\*p and n\*(1-p) are larger than 10 for each group, so the normal approximation to the binomial distribution is reasonable here.
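As a rough illustration of how close the approximation is, the sketch below compares the exact binomial CDF with its normal approximation for the black-sounding group (n = 2435, p = 157/2435, both taken from the cells above). The helper names `binom_cdf` and `normal_approx_cdf` are made up for this example, and it uses only the standard library (`statistics.NormalDist`) rather than scipy.

```python
import math
from statistics import NormalDist


def binom_cdf(k, n, p):
    """Exact P(X <= k) for X ~ Binomial(n, p), summed in log space to avoid overflow."""
    total = 0.0
    for i in range(k + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                   + i * math.log(p) + (n - i) * math.log(1 - p))
        total += math.exp(log_pmf)
    return total


def normal_approx_cdf(k, n, p):
    """Normal approximation to P(X <= k), with a continuity correction of 0.5."""
    mean = n * p
    sd = math.sqrt(n * p * (1 - p))
    return NormalDist(mean, sd).cdf(k + 0.5)


# Group size and callback rate for black-sounding names, from the cells above.
n, p = 2435, 157 / 2435

# Print exact vs. approximate cumulative probabilities at a few callback counts.
for k in (140, 157, 175):
    print(k, round(binom_cdf(k, n, p), 4), round(normal_approx_cdf(k, n, p), 4))
```

The two columns should agree closely near the mean, with small gaps that reflect the slight skew of a binomial distribution when p is far from 0.5.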

###### What are the null and alternate hypotheses?

The null hypothesis is that the probability of a callback for a resume with a white-sounding name equals the probability of a callback for a resume with a black-sounding name; equivalently, the difference between the two probabilities is zero. The alternative hypothesis is that the two probabilities differ, which makes this a two-sided test.
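In symbols, writing $p_w$ and $p_b$ for the callback probabilities of white-sounding and black-sounding names (notation introduced here for convenience, not taken from the dataset), the two hypotheses can be stated as:

$$
\begin{aligned}
H_0 &: p_w - p_b = 0 \\
H_A &: p_w - p_b \neq 0
\end{aligned}
$$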

###### Compute margin of error, confidence interval, and p-value.
We will compute the observed difference between the white and black callback rates and the standard error of that difference under the null hypothesis. We will then use these to compute a Z-score, which measures how far the observed difference lies from zero in units of standard error. For this exercise, let's use a significance level of 5%.

In [28]:
sample_mean_diff = sample_mean_white - sample_mean_black
sample_mean_diff

0.032032854209445585

The variance of the difference between two independent sample proportions is the sum of their individual variances. The variance of a sample proportion based on n Bernoulli trials is p\*(1-p)/n. Under the null hypothesis both groups share the same callback probability p, and each group contains n = 2435 resumes, so the variance of the difference simplifies to 2p\*(1-p)/2435. We estimate p with the pooled callback rate (total callbacks divided by total resumes) and substitute that value into the variance formula.

In [34]:
sample_mean_all = sum(data.call)/len(data)
sample_mean_all

0.080492813141683772

In [37]:
# n_black == n_white == 2435, so this equals 2*p*(1-p)/2435
variance_diff = sample_mean_all*(1-sample_mean_all)*(1/n_white + 1/n_black)
round(variance_diff, 8)

6.079e-05

In [41]:
standard_error_diff = math.sqrt(variance_diff)
round(standard_error_diff, 6)

0.007797

We can now compute the Z-score: the observed difference divided by its standard error under the null hypothesis.

In [43]:
Z = (sample_mean_diff - 0) / standard_error_diff
round(Z, 3)

4.108

For a two-sided test at a significance level of 5%, the critical value is 1.96. Our Z-score of 4.108 exceeds it, so we reject the null hypothesis and conclude that the callback rate for white-sounding names differs from that for black-sounding names.

The margin of error is the standard error times the critical z-value for the chosen confidence level (1.96 for 95%).

In [46]:
margin_of_error = 1.96 * standard_error_diff
round(margin_of_error, 6)

0.015282

The confidence interval is the observed difference plus or minus the margin of error.

In [49]:
round(sample_mean_diff - margin_of_error, 6), round(sample_mean_diff + margin_of_error, 6)

(0.016751, 0.047315)

The p-value is the probability, assuming the null hypothesis is true, of observing a difference at least as extreme as the one in our sample. It is not the probability that the null hypothesis is correct. For a two-sided test it is twice the upper-tail probability:

P(|Z| > 4.108) = 2 × P(Z > 4.108)

In [51]:
# norm.sf is the survival function, 1 - cdf
p_value = 2 * stats.norm.sf(abs(Z))
round(p_value, 7)

3.98e-05
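The steps in this section can be collected into one helper. The sketch below is a minimal version, not a definitive implementation: the function name `two_proportion_ztest` is made up for this example, and it uses the standard library's `statistics.NormalDist` in place of a Z-table. It assumes the group size n = 2435 found above belongs in the standard error, and that the white-sounding group received 235 callbacks (the rate 0.0965 computed above times 2435). The test statistic uses the pooled standard error, which is valid under the null hypothesis. The confidence interval uses the unpooled standard error, a common convention because the interval does not assume the two rates are equal.

```python
import math
from statistics import NormalDist


def two_proportion_ztest(x1, n1, x2, n2, alpha=0.05):
    """Two-sided z-test for H0: p1 == p2, plus a confidence interval for p1 - p2."""
    p1, p2 = x1 / n1, x2 / n2
    diff = p1 - p2

    # Pooled proportion: under H0 both groups share a single callback rate.
    pooled = (x1 + x2) / (n1 + n2)
    se_pooled = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = diff / se_pooled

    # Two-sided p-value: probability mass in both tails beyond |z|.
    p_value = 2 * NormalDist().cdf(-abs(z))

    # Unpooled standard error for the interval, which does not assume H0.
    se_unpooled = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    margin = z_crit * se_unpooled

    return {
        "diff": diff,
        "z": z,
        "p_value": p_value,
        "margin": margin,
        "ci": (diff - margin, diff + margin),
    }


# White-sounding callbacks (235 of 2435) vs. black-sounding callbacks (157 of 2435).
result = two_proportion_ztest(235, 2435, 157, 2435)
for name, value in result.items():
    print(name, value)
```

Because the interval here is built from the unpooled standard error, its endpoints differ slightly from an interval built from the pooled standard error.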

###### Write a story describing the statistical significance in the context of the original problem.

###### Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?