# Examining Racial Discrimination in the US Job Market
## Molly McNamara

### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés to black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes when presented to the employer.

### Exercises
You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.

Answer the following questions **in this notebook below and submit to your Github account**. 

   1. What test is appropriate for this problem? Does CLT apply?
   2. What are the null and alternate hypotheses?
   3. Compute margin of error, confidence interval, and p-value.
   4. Write a story describing the statistical significance in the context of the original problem.
   5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

You can include written notes in notebook cells using Markdown: 
   - In the control panel at the top, choose Cell > Cell Type > Markdown
   - Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet


#### Resources
+ Experiment information and data source: http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states
+ Scipy statistical methods: http://docs.scipy.org/doc/scipy/reference/stats.html 
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet
****

In [1]:
import pandas as pd
import numpy as np
from scipy import stats
pd.set_option("display.max_columns", 500)

In [2]:
# Load the Stata file with the top-level pandas reader
data = pd.read_stata('data/us_job_market_discrimination.dta')

In [3]:
data.head()

Unnamed: 0,id,ad,education,ofjobs,yearsexp,honors,volunteer,military,empholes,occupspecific,occupbroad,workinschool,email,computerskills,specialskills,firstname,sex,race,h,l,call,city,kind,adid,fracblack,fracwhite,lmedhhinc,fracdropout,fraccolp,linc,col,expminreq,schoolreq,eoe,parent_sales,parent_emp,branch_sales,branch_emp,fed,fracblack_empzip,fracwhite_empzip,lmedhhinc_empzip,fracdropout_empzip,fraccolp_empzip,linc_empzip,manager,supervisor,secretary,offsupport,salesrep,retailsales,req,expreq,comreq,educreq,compreq,orgreq,manuf,transcom,bankreal,trade,busservice,othservice,missind,ownership
0,b,1,4,2,6,0,0,0,1,17,1,0,0,1,0,Allison,f,w,0.0,1.0,0.0,c,a,384.0,0.98936,0.0055,9.527484,0.274151,0.037662,8.706325,1.0,5,,1.0,,,,,,,,,,,,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
1,b,1,3,3,6,0,1,1,0,316,6,1,1,1,0,Kristen,f,w,1.0,0.0,0.0,c,a,384.0,0.080736,0.888374,10.408828,0.233687,0.087285,9.532859,0.0,5,,1.0,,,,,,,,,,,,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
2,b,1,4,1,6,0,0,0,0,19,1,1,0,1,0,Lakisha,f,b,0.0,1.0,0.0,c,a,384.0,0.104301,0.83737,10.466754,0.101335,0.591695,10.540329,1.0,5,,1.0,,,,,,,,,,,,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
3,b,1,3,4,6,0,1,0,1,313,5,0,1,1,1,Latonya,f,b,1.0,0.0,0.0,c,a,384.0,0.336165,0.63737,10.431908,0.108848,0.406576,10.412141,0.0,5,,1.0,,,,,,,,,,,,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
4,b,1,3,3,22,0,0,0,0,313,5,1,1,1,0,Carrie,f,w,1.0,0.0,0.0,c,a,385.0,0.397595,0.180196,9.876219,0.312873,0.030847,8.728264,0.0,some,,1.0,9.4,143.0,9.4,143.0,0.0,0.204764,0.727046,10.619399,0.070493,0.369903,10.007352,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,Nonprofit


In [4]:
# Race and call seem to be the relevant variables here so this can be made a bit simpler by subsetting the dataset to just these columns.
df = data[['race','call']]
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4870 entries, 0 to 4869
Data columns (total 2 columns):
race    4870 non-null object
call    4870 non-null float32
dtypes: float32(1), object(1)
memory usage: 95.1+ KB


## 1. What test is appropriate for this problem? Does CLT apply?

In [5]:
# data for white sounding names
white = df[df.race=='w'].call
# data for black sounding names
black = df[df.race=='b'].call
# number of callbacks for white-sounding names
num_w_cb = sum(white)
# number of callbacks for black-sounding names
num_b_cb = sum(black)
# number of white-sounding names
num_w = len(white)
# number of black-sounding names
num_b = len(black)

In [6]:
# Calculate proportion for each group that received a call back
p_w = num_w_cb/num_w
print("Percent of white-sounding names that received a call back =", p_w * 100)
p_b = num_b_cb/num_b
print("Percent of black-sounding names that received a call back =", p_b * 100)

Percent of white-sounding names that received a call back = 9.65092402464
Percent of black-sounding names that received a call back = 6.4476386037


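About 9.7% of resumes with white-sounding names received a callback, compared with about 6.4% of resumes with black-sounding names, a gap of roughly 3.2 percentage points. The question is whether this difference is statistically significant or could plausibly be due to chance.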

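Each callback is a binary outcome (call or no call), so the number of callbacks in each group can be modeled as binomial, and the question becomes whether the two callback proportions differ. A two-sample z-test for proportions is appropriate. The population variances are not known, but with samples this large they are estimated well from the sample proportions. The CLT applies: the individual 0/1 outcomes are not normally distributed, but with thousands of resumes and well over 10 callbacks and 10 non-callbacks in each group, the sampling distribution of each sample proportion (and of their difference) is approximately normal.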
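The rule of thumb behind the CLT claim can be checked directly. The sketch below uses only the standard library; the counts are not printed in the notebook but are inferred from the percentages above, assuming the 4870 resumes are split evenly between the two groups (in the notebook, `len(white)`, `len(black)`, `num_w_cb` and `num_b_cb` give the actual values). The helper name `clt_ok` is hypothetical.

```python
# Counts inferred from the printed callback rates, ASSUMING an even split of
# the 4870 resumes (2435 per group); verify against num_w / num_b in the notebook.
n_w, n_b = 2435, 2435
calls_w, calls_b = 235, 157  # about 9.65% and 6.45% of 2435

def clt_ok(successes, n, threshold=10):
    """Success-failure rule of thumb: at least `threshold` successes and failures."""
    failures = n - successes
    return successes >= threshold and failures >= threshold

# Both groups should pass comfortably if the inferred counts are right.
print("white-sounding group OK:", clt_ok(calls_w, n_w))
print("black-sounding group OK:", clt_ok(calls_b, n_b))
```

If either check failed, the normal approximation would be doubtful and an exact test would be the safer choice.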

## 2. What are the null and alternate hypotheses?


The null hypothesis is that the callback rate is the same for resumes with black-sounding and white-sounding names, i.e. the difference between the two proportions is zero. The alternative hypothesis is that the two callback rates differ (a two-sided test).
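Written symbolically, with $p_b$ and $p_w$ denoting the true callback rates for black-sounding and white-sounding names (notation introduced here for illustration, mirroring the `p_b` and `p_w` variables computed above), the two-sided test can be stated as:

```latex
H_0:\; p_b - p_w = 0
\qquad\text{versus}\qquad
H_A:\; p_b - p_w \neq 0
```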

## 3. Compute margin of error, confidence interval, and p-value.

In [7]:
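# Standard error of the difference in callback rates (unpooled, from each group's sample variance)
SE = np.sqrt((black.std()**2)/black.count() + (white.std()**2)/white.count())
# Observed difference in callback rates (black minus white, so a negative value favors white-sounding names)
Diff = black.mean() - white.mean()
print('Standard Error =', SE)
# Margin of error at the 95% level (critical z value of 1.96)
print('Margin of error =', 1.96 * SE)
# 95% confidence interval for the difference
print('95% Confidence Interval =', (-1.96 * SE + Diff, 1.96 * SE + Diff))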

Standard Error = 0.00778490691981
Margin of error = 0.0152584175628
95% Confidence Interval = (-0.047291272417895609, -0.016774437292225546)
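For reference, the quantities printed above appear to correspond to the following formulas. The symbols are introduced here for illustration: $\hat{p}_b$ and $\hat{p}_w$ are the sample callback rates, $s_b^2$ and $s_w^2$ the sample variances of the 0/1 `call` values, and $n_b$ and $n_w$ the group sizes. The numbers are rounded from the output.

```latex
\begin{aligned}
SE &= \sqrt{\frac{s_b^2}{n_b} + \frac{s_w^2}{n_w}} \approx 0.00778,
\qquad s^2 \approx \hat{p}\,(1-\hat{p}) \text{ for 0/1 data} \\
ME &= 1.96 \times SE \approx 0.0153 \\
CI_{95\%} &= (\hat{p}_b - \hat{p}_w) \pm ME \approx -0.0320 \pm 0.0153 \approx (-0.0473,\; -0.0168)
\end{aligned}
```

Because the interval excludes zero, it already indicates a difference that is significant at the 5% level.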


In [8]:
# z statistic for the observed difference
Z = Diff / SE
# Two-sided p-value from the standard normal survival function
p_value = stats.norm.sf(abs(Z))*2
print('p value =', p_value)

p value = 3.87618889548e-05
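The cell above uses the unpooled standard error, which is the usual choice for a confidence interval. A common textbook variant for the hypothesis test pools the two groups, since the null hypothesis assumes a single shared callback rate. The sketch below shows that variant using only the standard library. The function name `two_proportion_ztest` is hypothetical, and the counts are inferred from the printed percentages under the assumption of 2435 resumes per group.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_ztest(x1, n1, x2, n2):
    """Two-sided z-test for equal proportions using the pooled standard error."""
    p1, p2 = x1 / n1, x2 / n2
    # Under H0 both groups share one rate, estimated from the combined data.
    p_pool = (x1 + x2) / (n1 + n2)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# ASSUMED counts: 157 of 2435 (black-sounding) and 235 of 2435 (white-sounding).
z, p = two_proportion_ztest(157, 2435, 235, 2435)
print("z =", z)
print("p value =", p)
```

With samples this large, the pooled and unpooled versions should give nearly the same z statistic and lead to the same conclusion.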


## 4. Write a story describing the statistical significance in the context of the original problem.


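In the context of the original problem, the test shows a statistically significant difference in callback rates between resumes with black-sounding and white-sounding names. Resumes with black-sounding names received callbacks about 3.2 percentage points less often (95% confidence interval: roughly 1.7 to 4.7 percentage points), and a p-value of about 0.00004 means a gap this large would be very unlikely if names had no effect. Because names were assigned to the resumes at random, this points to discrimination on the basis of perceived race and warrants further investigation.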

## 5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

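No. The analysis tested only race and did not consider any of the many other factors in the dataset, so it cannot rank race against them. Because names were assigned to resumes at random, characteristics such as years of experience and education should be balanced across the two groups on average, so they are unlikely to explain the gap. The test, however, says nothing about how large the effect of race is compared with the effects of those other characteristics. A better analysis would model callback status on all of these factors together (for example with logistic regression) and compare which are the strongest predictors.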
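One way such an amended analysis might look is a logistic regression of `call` on race together with other resume characteristics. The sketch below is an illustration under stated assumptions, not a finished analysis: it assumes `statsmodels` is installed, the helper names `build_formula` and `fit_callback_model` are hypothetical, and the predictor list is just one plausible selection of columns from the dataset.

```python
def build_formula(outcome, predictors, categorical=()):
    """Build a formula string, wrapping categorical columns in C(...)."""
    terms = [f"C({name})" if name in categorical else name for name in predictors]
    return f"{outcome} ~ " + " + ".join(terms)

def fit_callback_model(data, predictors, categorical=("race", "sex")):
    """Fit a logistic regression of `call` on the chosen predictors."""
    # Imported lazily so the formula helper works without statsmodels installed.
    import statsmodels.formula.api as smf
    formula = build_formula("call", predictors, categorical)
    return smf.logit(formula, data=data).fit()

# ASSUMED selection of columns; adjust after inspecting the data.
predictors = ["race", "sex", "education", "yearsexp", "honors",
              "volunteer", "military", "computerskills", "specialskills"]
print(build_formula("call", predictors, categorical=("race", "sex")))

# In the notebook, with `data` loaded as above:
# result = fit_callback_model(data, predictors)
# print(result.summary())
```

Comparing the fitted coefficients (ideally on comparable scales) would give a rough sense of how the race effect compares with the other predictors.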