
### Examining racial discrimination in the US job market

#### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

#### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes.

#### Exercise
Perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.


#### Resources
+ Experiment information and data source: http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states
+ Scipy statistical methods: http://docs.scipy.org/doc/scipy/reference/stats.html 

****

In [1]:
import pandas as pd
import numpy as np
from scipy import stats

In [6]:
data = pd.read_stata('us_job_market_discrimination.dta')

In [7]:
# number of callbacks for black-sounding names
sum(data[data.race=='b'].call)

157.0

****

# Exercise

   1. What test is appropriate for this problem? Does CLT apply?
   2. What are the null and alternate hypotheses?
   3. Compute margin of error, confidence interval, and p-value.
   4. Discuss statistical significance.
    
You can include written notes in notebook cells using Markdown: 
   - In the control panel at the top, choose Cell > Cell Type > Markdown
   - Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet
   
****

In [8]:
data.head()

Unnamed: 0,id,ad,education,ofjobs,yearsexp,honors,volunteer,military,empholes,occupspecific,...,compreq,orgreq,manuf,transcom,bankreal,trade,busservice,othservice,missind,ownership
0,b,1,4,2,6,0,0,0,1,17,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
1,b,1,3,3,6,0,1,1,0,316,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
2,b,1,4,1,6,0,0,0,0,19,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
3,b,1,3,4,6,0,1,0,1,313,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
4,b,1,3,3,22,0,0,0,0,313,...,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,Nonprofit


1) What test is appropriate for this problem? Does CLT apply?

Let's split the data by race.

In [9]:
race_data = data[['race','call']]
race_data_white = race_data[race_data.race=='w']
race_data_black = race_data[race_data.race=='b']

len(race_data_black)

2435

In [10]:
len(race_data_white)

2435

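Each group contains 2435 résumés, and the names were assigned at random, so the observations can be treated as independent. Samples of this size are large enough for the CLT to apply to the sample proportions, so a two-sample z-test for a difference in proportions is appropriate.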
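Sample size alone is not the whole condition for proportions: a common rule of thumb also asks for at least 10 successes and 10 failures in each group. The sketch below checks that rule. The helper name `clt_ok` and the threshold of 10 are assumptions made for illustration; the count of 157 comes from the cell above, and 235 is the white-sounding callback count implied by the callback ratio this notebook reports (0.0965 × 2435).

```python
def clt_ok(successes, n, threshold=10):
    """Success-failure rule of thumb: both counts must reach the threshold."""
    failures = n - successes
    return successes >= threshold and failures >= threshold

# (callbacks, group size) per group; counts taken from this notebook's outputs
groups = {"b": (157, 2435), "w": (235, 2435)}
checks = {name: clt_ok(s, n) for name, (s, n) in groups.items()}
print(checks)
```

Both groups pass comfortably, which supports using the normal approximation here.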

2) What are the null and alternate hypotheses?

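Let p(W) be the probability that a résumé with a white-sounding name receives a callback (a success), and let p(B) be the corresponding probability for a résumé with a black-sounding name. The sample callback ratios computed below are our estimates of these two probabilities.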

In [11]:
nB=len(data[data.race=='b'])
nW=len(data[data.race=='w'])

In [12]:
callBackRatioBlack=sum(data[data.race=='b'].call)/nB
callBackRatioWhite=sum(data[data.race=='w'].call)/nW
print ('callBackRatioBlack: ', callBackRatioBlack)
print ('callBackRatioWhite: ', callBackRatioWhite)

('callBackRatioBlack: ', 0.064476386036960986)
('callBackRatioWhite: ', 0.096509240246406572)


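H0: p(B) = p(W), i.e. the true callback rate is the same for both groups

HA: p(B) < p(W), i.e. the true callback rate is lower for black-sounding names

callBackRatioBlack and callBackRatioWhite are the sample estimates of p(B) and p(W).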

We will use a one-tailed test at a significance level of 0.05.
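Written out, the statistic computed in the next cell takes the standard pooled two-proportion form. The symbols here are notation introduced for illustration: $\hat{p}_B$ and $\hat{p}_W$ stand for `callBackRatioBlack` and `callBackRatioWhite`, $\hat{p}$ for the pooled ratio `prop`, and $x_B$, $x_W$ for the callback counts in each group.

```latex
\hat{p} = \frac{x_B + x_W}{n_B + n_W},
\qquad
z = \frac{\hat{p}_W - \hat{p}_B}
         {\sqrt{\hat{p}\,(1-\hat{p})\left(\frac{1}{n_B} + \frac{1}{n_W}\right)}},
\qquad
\text{p-value} = P(Z \ge z), \quad Z \sim \mathcal{N}(0, 1)
```

The pooled ratio is used in the standard error because, under H0, both groups share a single callback rate.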

In [13]:
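# pooled callback proportion: under H0 both groups share one rate
prop=sum(data.call)/len(data)

# standard error of the difference in proportions under H0
SE=((prop*(1-prop)/nB)+(prop*(1-prop)/nW))**0.5
zScore=(callBackRatioWhite-callBackRatioBlack)/SE
p_value = stats.norm.sf(zScore)  # one-sided: P(Z > zScore) under H0
p_value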

1.9919434187925383e-05

The p-value is far below 0.05, so we reject the null hypothesis: race has a statistically significant impact on the rate of callbacks for résumés.
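The same test can be reproduced from the four summary counts alone, which makes a convenient cross-check. This is a minimal sketch, not the notebook's own method: the function name `two_proportion_z_test` is hypothetical, it uses `math.erfc` from the standard library in place of `stats.norm.sf`, and the white-sounding callback count of 235 is derived from the ratio printed above (0.0965 × 2435).

```python
import math

def two_proportion_z_test(x_a, n_a, x_b, n_b):
    """One-sided z-test of H0: p_a = p_b against HA: p_a > p_b (pooled SE)."""
    p_a, p_b = x_a / n_a, x_b / n_b
    pooled = (x_a + x_b) / (n_a + n_b)  # shared rate assumed under H0
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail area P(Z > z)
    return z, p_value

# group a = white-sounding names, group b = black-sounding names
z, p_value = two_proportion_z_test(235, 2435, 157, 2435)
print(z, p_value)
```

Working from counts rather than the full DataFrame keeps the calculation easy to audit by hand.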

3) Compute margin of error, confidence interval, and p-value.

In [14]:
critical_value=stats.norm.ppf(1-0.05/2)  # two-sided 95% critical value
# unpooled standard error: the interval does not assume H0, so each group keeps its own ratio
SE_c = ( (callBackRatioWhite*(1-callBackRatioWhite) / nW) + (callBackRatioBlack*(1-callBackRatioBlack) / nB) ) ** 0.5
#Margin of Error
Margin_of_Error=critical_value*SE_c
Margin_of_Error

0.015255126028214831

In [17]:
#Confidence Interval
ME = Margin_of_Error
Confidence_Interval = ((callBackRatioWhite-callBackRatioBlack) - ME, (callBackRatioWhite-callBackRatioBlack) + ME)
Confidence_Interval

(0.016777728181230755, 0.047287980237660412)
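The margin of error and confidence interval can be wrapped into one reusable helper. The sketch below is an illustration under stated assumptions: the name `diff_ci` is hypothetical, `statistics.NormalDist().inv_cdf` (standard library, Python 3.8+) stands in for `stats.norm.ppf`, and the count of 235 white-sounding callbacks is derived from the ratio printed earlier.

```python
import math
from statistics import NormalDist

def diff_ci(x_a, n_a, x_b, n_b, level=0.95):
    """Margin of error and CI for p_a - p_b, using the unpooled standard error."""
    p_a, p_b = x_a / n_a, x_b / n_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    crit = NormalDist().inv_cdf(1 - (1 - level) / 2)  # two-sided critical value
    margin = crit * se
    diff = p_a - p_b
    return margin, (diff - margin, diff + margin)

# group a = white-sounding names, group b = black-sounding names
margin, (lower, upper) = diff_ci(235, 2435, 157, 2435)
print(margin, lower, upper)
```

Passing a different `level`, for example 0.99, widens the interval without changing the point estimate.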

4) Discuss statistical significance:

We are 95% confident that the true callback rate for résumés with white-sounding names exceeds the rate for résumés with black-sounding names by between 1.68 and 4.73 percentage points. The interval excludes zero, which agrees with the very small p-value: the difference is statistically significant at the 0.05 level.
We can state this with a degree of confidence (95%) and a 100% sense of disappointment: a candidate with a white-sounding name is more likely to be called back than an otherwise identical candidate with a black-sounding name.