# Examining Racial Discrimination in the US Job Market

### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés to black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes when presented to the employer.

#### Resources
+ Experiment information and data source: http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states
****

In [12]:
import pandas as pd
import numpy as np
import statsmodels.stats.proportion as pstats
import scipy.stats as stats

In [4]:
data = pd.read_stata('data/us_job_market_discrimination.dta')

In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4870 entries, 0 to 4869
Data columns (total 65 columns):
id                    4870 non-null object
ad                    4870 non-null object
education             4870 non-null int8
ofjobs                4870 non-null int8
yearsexp              4870 non-null int8
honors                4870 non-null int8
volunteer             4870 non-null int8
military              4870 non-null int8
empholes              4870 non-null int8
occupspecific         4870 non-null int16
occupbroad            4870 non-null int8
workinschool          4870 non-null int8
email                 4870 non-null int8
computerskills        4870 non-null int8
specialskills         4870 non-null int8
firstname             4870 non-null object
sex                   4870 non-null object
race                  4870 non-null object
h                     4870 non-null float32
l                     4870 non-null float32
call                  4870 non-null float32
city        

In [6]:
print("Number of callbacks for black-sounding names is", sum(data[data.race=='b'].call))
print("Number of callbacks for white-sounding names is", sum(data[data.race=='w'].call))
print("Total number of resumes with black-sounding names is", sum(data.race=='b'))
print("Total number of resumes with white-sounding names is", sum(data.race=='w'))

Number of callbacks for black-sounding names is 157.0
Number of callbacks for white-sounding names is 235.0
Total number of resumes with black-sounding names is 2435
Total number of resumes with white-sounding names is 2435


In [7]:
# Extract the columns of interest, "race" and "call", into dataframe df,
# then split it into two groups by race.
df = data.loc[:,['race','call']]
df_b = df[df.race=='b']
df_w = df[df.race=='w']

__ Q1. What test is appropriate for this problem? Does CLT apply? __

The two-sample hypothesis test for a difference of proportions is appropriate here, because the outcome (callback or no callback) is binary and we are comparing two groups.

When we carry out inference on population proportions (building a confidence interval or doing a significance test), the accuracy of our methods depends on a few conditions. It's important to check whether these conditions have been met; otherwise the calculations and conclusions that follow aren't valid.

The conditions are:

+ Independence: this can be checked with the 10% rule, which says that if the sample size is less than 10% of the population size, the observations can be treated as independent.
+ Normality: the sampling distribution needs to be approximately normal, which requires at least 10 successes and 10 failures in each group.
+ Randomization: the data need to come from a random sample or a randomized experiment.

In our case, all conditions are met. The 2,435 resumes in each group are a small fraction of all job applications, each group has well over 10 callbacks (157 and 235) and 10 non-callbacks (2,278 and 2,200), and the names were assigned to resumes at random. The CLT therefore applies to the difference of the two sample proportions.
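The success/failure condition above can be checked directly from the counts printed earlier in this notebook. The snippet below is a minimal sketch: the function name `meets_success_failure` and the `groups` dictionary are introduced here for illustration, and the counts are hard-coded so that the cell runs without the dataset.

```python
# Counts reported earlier in this notebook (hard-coded so the cell is self-contained).
groups = {
    "white-sounding": {"callbacks": 235, "resumes": 2435},
    "black-sounding": {"callbacks": 157, "resumes": 2435},
}

def meets_success_failure(callbacks, resumes, minimum=10):
    """Return True if there are at least `minimum` successes and failures."""
    failures = resumes - callbacks
    return callbacks >= minimum and failures >= minimum

checks = {}
for name, g in groups.items():
    failures = g["resumes"] - g["callbacks"]
    checks[name] = meets_success_failure(g["callbacks"], g["resumes"])
    print("%s: %d callbacks, %d non-callbacks, condition met: %s"
          % (name, g["callbacks"], failures, checks[name]))
```

The independence and randomization conditions depend on the study design, so they cannot be verified from the counts alone.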

__ Q2. What are the null and alternate hypotheses? __

The null hypothesis is that a resume with a white-sounding name has the same callback success rate as a resume with a black-sounding name. The alternative hypothesis is that the name does affect the callback success rate; the test below uses the one-sided form, namely that a white-sounding name has the higher callback rate.
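In symbols, writing $p_w$ and $p_b$ (notation introduced here, not in the original study) for the true callback rates of resumes with white-sounding and black-sounding names, the one-sided test run in this notebook with `alternative='larger'` corresponds to:

$$H_0: p_w - p_b = 0 \qquad \text{versus} \qquad H_A: p_w - p_b > 0$$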

__ Q3. Compute margin of error, confidence interval, and p-value. __

Let's explicitly set the significance level to 5% beforehand. 

In [8]:
# number of callback success in df_w.
s1 = sum(df_w.call)
# total number of trial in df_w.
n1 = df_w.race.size
# number of callback success in df_b.
s2 = sum(df_b.call)
# total number of trial in df_b.
n2 = df_b.race.size

print("Number of callbacks for white-sounding names:      %.f"%s1)
print("Total number of resumes with white-sounding names: %d" %n1)
print("Number of callbacks for black-sounding names:      %.f"%s2)
print("Total number of resumes with black-sounding names: %d" %n2)

Number of callbacks for white-sounding names:      235
Total number of resumes with white-sounding names: 2435
Number of callbacks for black-sounding names:      157
Total number of resumes with black-sounding names: 2435


In [9]:
# callback success rate of a resume with a white-sounding name. 
rate_w = s1/n1
# callback success rate of a resume with a black-sounding name. 
rate_b = s2/n2
# the observed difference of proportion
prop_diff = rate_w - rate_b

print('Callback success rate of a resume with a white-sounding name is %.2f.' %rate_w)
print('Callback success rate of a resume with a black-sounding name is %.2f.' %rate_b)
# prop_diff is a difference of proportions, so scale by 100 to report percentage points.
print('A resume with a white-sounding name has a callback success rate %.2f percentage points higher than that of '
      'a resume with a black-sounding name.' % (100*prop_diff))

Callback success rate of a resume with a white-sounding name is 0.10.
Callback success rate of a resume with a black-sounding name is 0.06.
A resume with a white-sounding name has a callback success rate 3.20 percentage points higher than that of a resume with a black-sounding name.


Assuming the null hypothesis is true, let's calculate the probability of seeing a difference in callback success rates at least as large as the one observed.

In [37]:
z,p = pstats.proportions_ztest([s1,s2],[n1,n2],alternative='larger')
# the standard error of the difference of proportions (unpooled).
theta = np.sqrt(rate_w*(1-rate_w)/n1+rate_b*(1-rate_b)/n2)
# confidence interval:
low,high = stats.norm.interval(0.95,loc=prop_diff,scale=theta)
# margin of error:
moe = (high-low)/2

# The p-value is the probability of a difference at least this large if the null
# hypothesis is true; it is not the probability that the null hypothesis is true.
print('The one-sided p-value is',p)
print("The 95%% confidence interval of the difference of callback success rate is from %.3f to %.3f."%(low,high))
print("The margin of error is %.3f." %moe)

The one-sided p-value is 1.99194341879e-05
The 95% confidence interval of the difference of callback success rate is from 0.017 to 0.047.
The margin of error is 0.015.


This tells us that a resume with a white-sounding name does have a higher callback success rate: the p-value is far below the 5% significance level, and even the lower end of the 95% confidence interval is greater than 0.
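As a cross-check on the numbers above, the z-statistic, p-value, and confidence interval can be recomputed by hand with only the standard library. This is a sketch under two assumptions: that the z-test uses the pooled proportion for its standard error (which, as far as I know, is the default behavior of `proportions_ztest`), and that the confidence interval uses the unpooled standard error, as in the cell above. The variable names are introduced here for illustration.

```python
from math import sqrt
from statistics import NormalDist  # Python 3.8+

# Counts reported earlier in this notebook.
s_w, n_w = 235, 2435   # callbacks / resumes, white-sounding names
s_b, n_b = 157, 2435   # callbacks / resumes, black-sounding names

p_w = s_w / n_w
p_b = s_b / n_b
diff = p_w - p_b

# z-test: under the null hypothesis both groups share one rate, so pool them.
p_pool = (s_w + s_b) / (n_w + n_b)
se_pool = sqrt(p_pool * (1 - p_pool) * (1 / n_w + 1 / n_b))
z = diff / se_pool
p_value = NormalDist().cdf(-z)   # one-sided upper-tail probability

# 95% confidence interval: no assumption of equal rates, so use the unpooled SE.
se_unpooled = sqrt(p_w * (1 - p_w) / n_w + p_b * (1 - p_b) / n_b)
z_crit = NormalDist().inv_cdf(0.975)
moe = z_crit * se_unpooled
low, high = diff - moe, diff + moe

print("z = %.3f, one-sided p-value = %.3g" % (z, p_value))
print("95%% CI: %.3f to %.3f (margin of error %.3f)" % (low, high, moe))
```

The pooled and unpooled standard errors are nearly identical here because the two groups are the same size and both rates are small.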

__ Q4. Does the analysis mean that race/name is the most important factor in callback success? If not, how would you amend the analysis? __

This analysis does not mean that race/name is the most important factor in callback success. It only shows that race/name has an effect on callback success. To find the most important factor, I would calculate the Pearson correlation coefficient between the callback outcome and the other quantities in the dataset, and look for the quantity that is most positively correlated with callback success.
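The ranking idea described above could be sketched as follows. The column names `yearsexp` and `honors` come from the dataset, but the values in `toy` are HYPOTHETICAL and exist only to make the cell runnable; the `pearson` helper is written out by hand for illustration and is not part of the original analysis.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# HYPOTHETICAL toy values, not taken from the dataset.
toy = {
    "call":     [1, 0, 0, 1, 0, 1, 0, 0],
    "yearsexp": [8, 2, 3, 7, 4, 6, 1, 5],
    "honors":   [0, 0, 1, 0, 0, 1, 0, 0],
}

# Correlate every other column with the callback outcome.
correlations = {col: pearson(toy[col], toy["call"])
                for col in toy if col != "call"}

# Sort from most positively correlated to least.
ranked = sorted(correlations.items(), key=lambda kv: kv[1], reverse=True)
for col, r in ranked:
    print("%-10s r = %+.3f" % (col, r))
```

On the real data, the same ranking would be computed over the numeric columns of `data` rather than a toy dictionary. A correlation with a 0/1 outcome only measures linear association, so it is a rough screening step rather than a measure of importance.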