# Examining Racial Discrimination in the US Job Market

### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés to black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating a black-sounding or white-sounding name. The 'call' column has two values, 1 and 0, indicating whether or not the resume received a call from an employer.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes when presented to the employer.

<div class="span5 alert alert-info">
### Exercises
You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.

Answer the following questions **in this notebook below and submit to your Github account**. 

   1. What test is appropriate for this problem? Does CLT apply?
   2. What are the null and alternative hypotheses?
   3. Compute margin of error, confidence interval, and p-value. Try using both the bootstrapping and the frequentist statistical approaches.
   4. Write a story describing the statistical significance in the context of the original problem.
   5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

You can include written notes in notebook cells using Markdown: 
   - In the control panel at the top, choose Cell > Cell Type > Markdown
   - Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet


#### Resources
+ Experiment information and data source: http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states
+ Scipy statistical methods: http://docs.scipy.org/doc/scipy/reference/stats.html 
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet
+ Formulas for the Bernoulli distribution: https://en.wikipedia.org/wiki/Bernoulli_distribution
</div>
****

In [1]:
import pandas as pd
import numpy as np
from scipy import stats

In [4]:
data = pd.io.stata.read_stata('data/us_job_market_discrimination.dta')

In [8]:
# number of callbacks for white-sounding names
sum(data[data.race=='w'].call)

235.0

In [6]:
data.head()

Unnamed: 0,id,ad,education,ofjobs,yearsexp,honors,volunteer,military,empholes,occupspecific,...,compreq,orgreq,manuf,transcom,bankreal,trade,busservice,othservice,missind,ownership
0,b,1,4,2,6,0,0,0,1,17,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
1,b,1,3,3,6,0,1,1,0,316,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
2,b,1,4,1,6,0,0,0,0,19,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
3,b,1,3,4,6,0,1,0,1,313,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
4,b,1,3,3,22,0,0,0,0,313,...,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,Nonprofit


<div class="span5 alert alert-success">

The question of whether or not race is a significant factor in the callback rates on a resume requires comparing the proportion of resumes with black-sounding names and the proportion of resumes with white-sounding names that got callbacks.

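Each callback is a Bernoulli (0/1) outcome, and both groups are large and contain many callbacks and non-callbacks, so the Central Limit Theorem applies: the sampling distribution of the difference in callback proportions is approximately normal. I do not know the standard deviation for the entire population, so I estimate it from the samples and perform a two-sample t-test. With samples this large, the t distribution is practically indistinguishable from the normal, so a two-proportion z-test would give nearly the same result.

<br>
**H0**: Race and callback are independent of each other (the callback rates are equal) <br>
**H1**: Race and callback are dependent on each other (the callback rates differ)
</div>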
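
As a cross-check on the choice of test, here is a minimal sketch of a pooled two-proportion z-test together with the usual "at least 10 successes and 10 failures" rule of thumb for the CLT. The function names are hypothetical helpers, not part of the notebook. The white callback count (235) comes from the cell above; the group size of 2435 and the black callback count of 157 are assumptions chosen to be consistent with the proportions printed later, and should be replaced with `len(w)`, `len(b)` and `b.call.sum()` from the real data.

```python
import math

def clt_conditions_met(successes, n, minimum=10):
    """Rule of thumb: at least `minimum` successes and failures in the sample."""
    return successes >= minimum and (n - successes) >= minimum

def two_proportion_ztest(x1, n1, x2, n2):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)          # common rate under H0
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) for the standard normal
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))
    return z, p_two_sided

# ASSUMED counts (see the note above): 235 of 2435 vs. 157 of 2435
z, p = two_proportion_ztest(235, 2435, 157, 2435)
print('CLT conditions met:', clt_conditions_met(235, 2435) and clt_conditions_met(157, 2435))
print('z statistic:', round(z, 3))
print('two-sided p-value:', p)
```

The pooled standard error is used here because the null hypothesis states that both groups share one callback rate.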

In [14]:
w = data[data.race=='w']
b = data[data.race=='b']

In [58]:
# Compute the callback proportion for each group, using each group's own size
n_w = len(w)
n_b = len(b)
Pw = len(w[w.call == 1])/n_w
Pb = len(b[b.call == 1])/n_b
print('The proportion of white-sounding names that received a callback:', round(Pw, 3))
print('The proportion of black-sounding names that received a callback:', round(Pb, 3))

# Compute the margin of error (1.96 is the critical value for 95% confidence)
varw = np.var(w.call)
varb = np.var(b.call)
std_err = np.sqrt((varw/n_w) + (varb/n_b))
margin_of_error = 1.96 * std_err
print('Margin of Error:', round(margin_of_error, 3))

# Compute the confidence interval for the difference in proportions
diff_of_mean = Pw - Pb
CI = diff_of_mean + (np.array([-1, 1])) * margin_of_error
print('Difference of Means:', round(diff_of_mean, 3))
print('95% Confidence Interval:', CI)

# Compute the one-sided p-value (upper tail: white rate greater than black rate)
t = diff_of_mean / std_err
df = n_w + n_b - 2
p = 1 - stats.t.cdf(t, df=df)
print('t statistic:', round(t, 3))
print('p-value:', p)

The proportion of white-sounding names that received a callback: 0.097
The proportion of black-sounding names that received a callback: 0.064
Margin of Error: 0.015
Difference of Means: 0.032
95% Confidence Interval: [ 0.01677757  0.04728814]
t statistic: 4.116
p-value: 1.96293959963e-05
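
The exercise also asks for a bootstrapping approach. The following is a minimal sketch, not the notebook's own method: a percentile bootstrap for the confidence interval and a permutation test for the p-value. To stay self-contained it rebuilds the 0/1 callback arrays from counts. The 235 white callbacks come from the output above, while the group size of 2435 and the 157 black callbacks are assumptions consistent with the printed proportions; with the real data, use `w.call.values` and `b.call.values` instead.

```python
import numpy as np

# ASSUMED counts (see the note above); replace with the real arrays if available
n_w, n_b = 2435, 2435
calls_w, calls_b = 235, 157
w_call = np.r_[np.ones(calls_w), np.zeros(n_w - calls_w)]
b_call = np.r_[np.ones(calls_b), np.zeros(n_b - calls_b)]
observed = w_call.mean() - b_call.mean()

rng = np.random.default_rng(42)
n_reps = 10_000

# Bootstrap: resample each group with replacement and recompute the difference
boot = np.empty(n_reps)
for i in range(n_reps):
    boot[i] = (rng.choice(w_call, size=n_w, replace=True).mean()
               - rng.choice(b_call, size=n_b, replace=True).mean())
ci_low, ci_high = np.percentile(boot, [2.5, 97.5])

# Permutation test: under H0 the race labels are exchangeable, so shuffle them
pooled = np.concatenate([w_call, b_call])
perm = np.empty(n_reps)
for i in range(n_reps):
    shuffled = rng.permutation(pooled)
    perm[i] = shuffled[:n_w].mean() - shuffled[n_w:].mean()
p_value = np.mean(np.abs(perm) >= abs(observed))  # two-sided

print('Observed difference:', round(observed, 3))
print('Bootstrap 95% CI:', ci_low, ci_high)
print('Permutation p-value:', p_value)
```

With only 10,000 replicates the permutation p-value cannot resolve values much below 1/10,000, so a printed 0.0 should be read as "smaller than about 0.0001", not as exactly zero.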


<div class="span5 alert alert-success">
<p> A group of researchers recently conducted a study where they randomly assigned either a black-sounding or white-sounding name to identical resumes and counted how many resumes received a callback. They wanted to know if racial bias played a part in the callback rates based on a job candidate's name. Using the data from the study, I was able to test whether or not race and name played a significant role in a candidate's callback success. </p>

<p>The study showed that resumes with a white-sounding name received a callback 9.7% of the time, while resumes with black-sounding names received a callback 6.4% of the time. I performed a t-test on the two samples and created a 95% confidence interval to test whether this difference in callback rates is significant. </p>

<p> The t-test resulted in a one-sided p-value of 0.0000196 (about 0.0000393 when doubled for a two-sided test). Either value is well below the 0.05 significance level, indicating a statistically significant difference between the callback rates for white-sounding and black-sounding names. The observed difference in callback rates was 0.032, with a 95% confidence interval of 0.017 to 0.047. Because this interval does not contain zero, it supports the same conclusion. </p>

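<p>Since the p-value is below the significance level and the 95% confidence interval for the difference excludes zero, I reject the null hypothesis: resumes with white-sounding names are more likely to get a callback than otherwise identical resumes with black-sounding names. If the experiment were repeated many times, about 95% of intervals built this way would contain the true difference in callback rates.</p>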

<p>The statistical analysis in this report shows that race most likely plays a role in callback decisions; however, it cannot tell us whether race is the <em>most important</em> factor in callback success. To conclude that, we would have to test the additional features in the dataset.</p>
</div>
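
One simple way to start testing the additional features is to repeat the same rate comparison for each binary column (for example `honors`, which appears in the `data.head()` output) and compare the size of the gaps. The sketch below is a hypothetical helper, and the toy numbers are invented purely for illustration; they are not taken from the dataset. A one-feature-at-a-time comparison also ignores how features interact, so a regression on all features together would be the more complete follow-up.

```python
import math

def callback_rate_gap(feature, call):
    """Gap in callback rate between rows where a 0/1 feature is 1 vs. 0.

    Returns (gap, z), where z is the gap divided by its unpooled standard error.
    """
    n1 = sum(feature)
    n0 = len(feature) - n1
    c1 = sum(c for f, c in zip(feature, call) if f == 1)
    c0 = sum(c for f, c in zip(feature, call) if f == 0)
    p1, p0 = c1 / n1, c0 / n0
    se = math.sqrt(p1 * (1 - p1) / n1 + p0 * (1 - p0) / n0)
    return p1 - p0, (p1 - p0) / se

# TOY DATA (invented): 100 resumes with the feature, 900 without
toy_feature = [1] * 100 + [0] * 900
toy_call = [1] * 15 + [0] * 85 + [1] * 72 + [0] * 828

gap, z = callback_rate_gap(toy_feature, toy_call)
print('Gap in callback rate:', round(gap, 3))
print('z statistic:', round(z, 3))
```

With the real data, the same function could be called as `callback_rate_gap(list(data.honors), list(data.call))` for each binary column, and the resulting gaps compared with the one found for race.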