
### Examining racial discrimination in the US job market

#### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

#### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes.

#### Exercise
You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.

Answer the following questions **in this notebook below and submit to your GitHub account**.

   1. What test is appropriate for this problem? Does CLT apply?
   2. What are the null and alternate hypotheses?
   3. Compute margin of error, confidence interval, and p-value.
   4. Discuss statistical significance.

You can include written notes in notebook cells using Markdown: 
   - In the control panel at the top, choose Cell > Cell Type > Markdown
   - Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet


#### Resources
+ Experiment information and data source: http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states
+ Scipy statistical methods: http://docs.scipy.org/doc/scipy/reference/stats.html 
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet

****

In [112]:
import pandas as pd
import numpy as np
from scipy import stats
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
from math import sqrt  # only sqrt is used below; avoids a wildcard import
import statsmodels.api as sm

In [12]:
data = pd.read_stata('data/us_job_market_discrimination.dta')

In [13]:
data[['race','call']].head()

|   | race | call |
|---|------|------|
| 0 | w    | 0.0  |
| 1 | w    | 0.0  |
| 2 | b    | 0.0  |
| 3 | b    | 0.0  |
| 4 | w    | 0.0  |


In [14]:
# number of callbacks for black-sounding names
sum(data[data.race=='b'].call)

157.0

# 1. What test is appropriate? Does CLT apply?

A two-sample z-test for the difference between two population proportions is appropriate, because the question is whether the proportion of resumes with black-sounding names that received responses differs from the proportion of resumes with white-sounding names that received responses.

Let's check to see if the CLT applies:

In [46]:
n_b=len(data[data.race=='b'])
n_w=len(data[data.race=='w'])
p_b=sum(data[data.race=='b'].call)/n_b
p_w=sum(data[data.race=='w'].call)/n_w
print('Sample size of black resumes is: %r\n'
      'Sample size of white resumes is: %r\n'
      'Proportion of black resumes that received responses is %.3f\n'
      'Proportion of white resumes that received responses is %.3f' % (n_b, n_w, p_b, p_w))

Sample size of black resumes is: 2435
Sample size of white resumes is: 2435
Proportion of black resumes that received responses is 0.064
Proportion of white resumes that received responses is 0.097


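The CLT applies to a sample proportion when the observations are independent and each group contains enough responses and non-responses. Independence is reasonable here because the names were assigned to resumes at random. For the size condition, a common rule of thumb is that the number of white resume responses, white resume non-responses, black resume responses, and black resume non-responses should each be greater than 5 (some texts ask for at least 10). As the counts below show, all four are far above that threshold, so the normal approximation is justified and we can proceed with the hypothesis test.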

In [49]:
n_b*p_b, n_b*(1-p_b),n_w*p_w, n_w*(1-p_w)

(157.0, 2278.0, 235.0, 2200.0)
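
The same check can be wrapped in a small reusable helper. This is only a sketch: the function name `normal_approx_ok` and the `threshold` parameter are hypothetical, and the counts are copied from the outputs above rather than read from the dataset.

```python
def normal_approx_ok(successes, n, threshold=5):
    """Return True when both the success and failure counts exceed the threshold.

    Hypothetical helper; threshold=5 mirrors the rule of thumb used in this notebook.
    """
    failures = n - successes
    return successes > threshold and failures > threshold


# (callbacks, sample size) per group, taken from the notebook outputs above
groups = [("b", 157, 2435), ("w", 235, 2435)]
for label, calls, n in groups:
    print("race=%s: normal approximation ok? %s" % (label, normal_approx_ok(calls, n)))
```

Raising `threshold` to 10 applies the stricter version of the rule without changing anything else.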

# 2. What are the null and alternate hypotheses?

**Null**: proportion of black resumes that received responses = proportion of white resumes that received responses

**Alternative**: proportion of black resumes that received responses != proportion of white resumes that received responses
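
In symbols, the two hypotheses and the test statistic can be sketched as follows. The notation is an assumption introduced here: $p_w$ and $p_b$ are the population response proportions, $\hat{p}_w$ and $\hat{p}_b$ the sample proportions, $n_w$ and $n_b$ the group sizes, and $\hat{p}$ the pooled sample proportion.

$$
H_0:\; p_w - p_b = 0 \qquad\qquad H_A:\; p_w - p_b \neq 0
$$

$$
\hat{p} = \frac{n_w \hat{p}_w + n_b \hat{p}_b}{n_w + n_b},
\qquad
z = \frac{(\hat{p}_w - \hat{p}_b) - 0}{\sqrt{\hat{p}\,(1-\hat{p})\left(\frac{1}{n_b} + \frac{1}{n_w}\right)}}
$$

The pooled proportion appears in the denominator because, under the null hypothesis, both groups share a single response rate.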


# 3. Find margin of error, confidence interval, and p-value.

In [103]:
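# pooled proportion and standard error under the null hypothesis (p_w == p_b)
p_avg=(n_w*p_w+n_b*p_b)/(n_w+n_b)
standard_error=sqrt((p_avg*(1.0-p_avg))*((1.0/n_b)+(1.0/n_w)))
z=((p_w-p_b)-0)/standard_error
# 95% margin of error, using the unpooled standard error of the difference
margin_of_error=1.96*sqrt((p_b*(1.0-p_b)/n_b)+(p_w*(1.0-p_w)/(n_w)))
conf_lower=(p_w-p_b)-margin_of_error
conf_upper=(p_w-p_b)+margin_of_error
# two-sided p-value; abs(z) keeps this correct even if the difference were negative
p_value=2*stats.norm.sf(abs(z))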

In [102]:
print('z: %.2f \n'
      'margin of error: %.3f \n'
      '95%% confidence interval: (%.3f, %.3f) \n'
      'p value: %.5f' % (z, margin_of_error, conf_lower, conf_upper, p_value))

z: 4.11 
margin of error: 0.015 
95% confidence interval: (0.017, 0.047) 
p value: 0.00004
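
As a cross-check, the same quantities can be recomputed from the raw counts using only the standard library. This is a sketch under stated assumptions: the function name `two_proportion_ztest` is hypothetical, the counts (235 of 2435 and 157 of 2435) are copied from the outputs above, and the two-sided normal p-value is obtained from the identity 2(1 − Φ(|z|)) = erfc(|z|/√2) instead of `scipy.stats`.

```python
from math import sqrt, erfc


def two_proportion_ztest(x1, n1, x2, n2, z_crit=1.96):
    """Two-sided z-test for p1 - p2 (hypothetical helper, not a library API).

    Returns (z, margin_of_error, (ci_lower, ci_upper), p_value).
    """
    p1 = float(x1) / n1
    p2 = float(x2) / n2
    diff = p1 - p2

    # Pooled standard error: used for the test statistic under H0 (p1 == p2)
    p_pool = float(x1 + x2) / (n1 + n2)
    se_pooled = sqrt(p_pool * (1.0 - p_pool) * (1.0 / n1 + 1.0 / n2))
    z = diff / se_pooled

    # Unpooled standard error: used for the confidence interval
    se_unpooled = sqrt(p1 * (1.0 - p1) / n1 + p2 * (1.0 - p2) / n2)
    margin = z_crit * se_unpooled

    # Two-sided p-value from the standard normal distribution
    p_value = erfc(abs(z) / sqrt(2.0))
    return z, margin, (diff - margin, diff + margin), p_value


# white-sounding names first, so the difference is p_w - p_b as in the cell above
z_check, margin_check, ci_check, p_check = two_proportion_ztest(235, 2435, 157, 2435)
print("z: %.2f, margin of error: %.3f, 95%% CI: (%.3f, %.3f), p value: %.5f"
      % (z_check, margin_check, ci_check[0], ci_check[1], p_check))
```

The pooled standard error is used for the test statistic and the unpooled one for the interval, matching the calculation in the notebook cell.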


# 4. Discuss statistical significance.

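We are 95% confident that the population proportion of white resumes that receive responses is between 0.017 and 0.047 (that is, 1.7 to 4.7 percentage points) greater than the population proportion of black resumes that receive responses. Because this interval excludes zero, it is consistent with a real difference in response rates.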

If the difference in population proportions were zero, there would be only about a 0.004% chance (p ≈ 0.00004) of observing a difference in sample proportions at least as extreme as the one observed. This p-value is far below the conventional 0.05 significance level, so we reject the null hypothesis: the data indicate that resumes with white-sounding names receive responses at a higher rate than resumes with black-sounding names.
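
One way to build intuition for what this p-value means is a permutation simulation. Note that this is a different technique from the z-test used above, shown only as an illustrative sketch: the function name `permutation_extreme_count`, the number of simulations, and the seed are all hypothetical choices, and the counts are copied from the earlier outputs. The idea is to pool all 392 responses, reshuffle the race labels at random many times, and count how often a shuffled difference is at least as large as the observed one.

```python
import random


def permutation_extreme_count(calls_w, calls_b, n_w, n_b, n_sims=300, seed=0):
    """Count shuffles whose |difference in proportions| >= the observed one.

    Hypothetical helper illustrating the null hypothesis: if race labels did
    not matter, reshuffling them should often reproduce the observed gap.
    """
    rng = random.Random(seed)
    total_calls = calls_w + calls_b
    # 1 = resume received a call, 0 = it did not
    pool = [1] * total_calls + [0] * (n_w + n_b - total_calls)
    observed = abs(float(calls_w) / n_w - float(calls_b) / n_b)

    extreme = 0
    for _ in range(n_sims):
        rng.shuffle(pool)
        sim_w = sum(pool[:n_w])        # calls landing in the "white" group
        sim_b = total_calls - sim_w    # remaining calls go to the "black" group
        if abs(float(sim_w) / n_w - float(sim_b) / n_b) >= observed:
            extreme += 1
    return extreme


n_sims = 300
extreme = permutation_extreme_count(235, 157, 2435, 2435, n_sims=n_sims)
print("shuffles at least as extreme as observed: %d of %d" % (extreme, n_sims))
```

With a p-value as small as the one computed above, one would expect few if any of the shuffles to match the observed gap, which is the sense in which the observed difference is unlikely under the null hypothesis.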