# Examining Racial Discrimination in the US Job Market

### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés to black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes when presented to the employer.

### Exercises
You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.

Answer the following questions **in this notebook below and submit to your GitHub account**.

   1. What test is appropriate for this problem? Does CLT apply?
   2. What are the null and alternate hypotheses?
   3. Compute margin of error, confidence interval, and p-value.
   4. Write a story describing the statistical significance in the context of the original problem.
   5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

You can include written notes in notebook cells using Markdown: 
   - In the control panel at the top, choose Cell > Cell Type > Markdown
   - Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet


#### Resources
+ Experiment information and data source: http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states
+ Scipy statistical methods: http://docs.scipy.org/doc/scipy/reference/stats.html 
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet

In [17]:
import pandas as pd
import numpy as np
import math
from scipy import stats

In [4]:
data = pd.io.stata.read_stata('data/us_job_market_discrimination.dta')

In [5]:
# number of callbacks for black-sounding names
sum(data[data.race=='b'].call)

157.0

In [6]:
data.head()

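| | id | ad | education | ofjobs | yearsexp | honors | volunteer | military | empholes | occupspecific | ... | compreq | orgreq | manuf | transcom | bankreal | trade | busservice | othservice | missind | ownership |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | b | 1 | 4 | 2 | 6 | 0 | 0 | 0 | 1 | 17 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | |
| 1 | b | 1 | 3 | 3 | 6 | 0 | 1 | 1 | 0 | 316 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | |
| 2 | b | 1 | 4 | 1 | 6 | 0 | 0 | 0 | 0 | 19 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | |
| 3 | b | 1 | 3 | 4 | 6 | 0 | 1 | 0 | 1 | 313 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | |
| 4 | b | 1 | 3 | 3 | 22 | 0 | 0 | 0 | 0 | 313 | ... | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | Nonprofit |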


# What test is appropriate for this problem? Does CLT apply?

In [7]:
data.race.value_counts()

b    2435
w    2435
Name: race, dtype: int64

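Because `call` is a 0/1 indicator, each group's mean is its callback rate, so the question is whether two proportions differ. A two-sample t-test on the means works here; with 2,435 résumés per group it gives essentially the same answer as a two-proportion z-test. Yes, the CLT applies: the names were assigned at random, the samples are large, and each group contains far more than 10 callbacks and 10 non-callbacks (for example, 157 callbacks among the black-sounding names).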
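The sample-size condition behind the CLT can be checked with a few lines. This is a minimal sketch: the helper name `clt_conditions_met` and the threshold of 10 are illustrative assumptions (a common rule of thumb), not part of the original exercise.

```python
def clt_conditions_met(successes, n, threshold=10):
    """Return True when both successes and failures reach the threshold.

    The threshold of 10 is a common rule of thumb, not a hard requirement.
    """
    failures = n - successes
    return successes >= threshold and failures >= threshold


# 157 callbacks out of 2,435 résumés with black-sounding names (from the cells above)
print(clt_conditions_met(157, 2435))
```

The same check could be run for the white-sounding group using `w_apps.call.sum()` once that group has been split out.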

# What are the null and alternate hypotheses?

Null: The mean callback rates for résumés with white-sounding and black-sounding names are equal. 
<br>Alternative: The mean callback rates for the two groups differ.

# Compute margin of error, confidence interval, and p-value.

In [8]:
b_apps = data[data.race == 'b']
w_apps = data[data.race == 'w']

In [12]:
# Run Welch's t-test (equal_var=False does not assume equal variances)
bw_results = stats.ttest_ind(b_apps.call, w_apps.call, equal_var=False)

In [27]:
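# Group sizes and degrees of freedom for the pooled two-sample interval
N1 = len(b_apps)
N2 = len(w_apps)
df = N1 + N2 - 2
# Sample standard deviations of the 0/1 callback indicator
std1 = b_apps.call.std()
std2 = w_apps.call.std()
# Pooled standard deviation across both groups
std_N1N2 = math.sqrt(((N1 - 1) * std1**2 + (N2 - 1) * std2**2) / df)
# Difference in callback rates (black-sounding minus white-sounding)
diff_mean = b_apps.call.mean() - w_apps.call.mean()

# 95% margin of error: t critical value times the standard error of the difference
MoE = stats.t.ppf(0.975, df) * std_N1N2 * math.sqrt(1/N1 + 1/N2)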
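
For reference, the calculation in the cell above corresponds to the pooled two-sample formula, where $s_1, s_2$ are the sample standard deviations (`std1`, `std2`) and $N_1, N_2$ are the group sizes:

$$
s_p = \sqrt{\frac{(N_1 - 1)\,s_1^2 + (N_2 - 1)\,s_2^2}{N_1 + N_2 - 2}},
\qquad
\mathrm{MoE} = t_{0.975,\;N_1 + N_2 - 2}\; s_p \sqrt{\frac{1}{N_1} + \frac{1}{N_2}}
$$

When $N_1 = N_2 = N$, as in this dataset, the standard error $s_p\sqrt{2/N}$ reduces to $\sqrt{(s_1^2 + s_2^2)/N}$, which is the same standard error that the unequal-variance (Welch) test uses.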

In [34]:
print("Margin of error: {0}".format(MoE))
print("Confidence interval: {0} to {1}".format(diff_mean - MoE, diff_mean + MoE))
print("p-value: {0}".format(bw_results[1]))
print("Mean calls Black:{0:.2f}; White:{1:.2f}".format(b_apps.call.mean(), w_apps.call.mean()))


Margin of error: 0.015261931850025749
Confidence interval: -0.04729478670508633 to -0.016770923005034827
p-value: 3.942941513645935e-05
Mean calls Black:0.06; White:0.10
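
Since the outcome is binary, the t-test result can be cross-checked with a two-proportion z-test. The sketch below uses only the standard library; the function name `two_proportion_ztest` is a hypothetical helper written for this notebook, not a SciPy function, and it assumes the pooled proportion is strictly between 0 and 1.

```python
import math


def two_proportion_ztest(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test for equality of two proportions (pooled standard error)."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal: P(|Z| > |z|) = erfc(|z| / sqrt(2))
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# In the notebook the counts would come from the data, for example:
# z, p = two_proportion_ztest(b_apps.call.sum(), len(b_apps),
#                             w_apps.call.sum(), len(w_apps))
```

With samples this large, the z-test p-value should land close to the t-test p-value printed above.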


# Write a story describing the statistical significance in the context of the original problem.

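We compared callback rates for résumés with black-sounding and white-sounding names.  <br> 
The results show a statistically significant difference: résumés with white-sounding names received callbacks about 10% of the time, versus about 6% for black-sounding names, a gap of roughly 3.2 percentage points (95% confidence interval: 1.7 to 4.7 points). With a p-value of about 3.9e-05, a gap this large would be very unlikely to arise by chance if the true callback rates were equal.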
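
One way to make "unlikely to arise by chance" concrete is a permutation test: shuffle the race labels many times and count how often the shuffled gap is at least as large as the observed one. This is a minimal standard-library sketch; `permutation_p_value`, its defaults, and the add-one correction are illustrative choices, not part of the original analysis.

```python
import random


def permutation_p_value(calls_a, calls_b, n_permutations=2000, seed=0):
    """Estimate a two-sided p-value for the difference in callback rates.

    calls_a and calls_b are sequences of 0/1 outcomes, one per résumé.
    """
    calls_a, calls_b = list(calls_a), list(calls_b)
    n_a, n_b = len(calls_a), len(calls_b)
    observed = abs(sum(calls_a) / n_a - sum(calls_b) / n_b)

    pooled = calls_a + calls_b
    rng = random.Random(seed)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # reassign group labels at random
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        if diff >= observed:
            at_least_as_extreme += 1
    # Add-one correction keeps the estimate strictly above zero
    return (at_least_as_extreme + 1) / (n_permutations + 1)


# In the notebook: permutation_p_value(b_apps.call, w_apps.call)
```

Because the loop is pure Python, a run over all 4,870 résumés may take several seconds; the smallest p-value it can report is 1 / (n_permutations + 1).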

# Does your analysis mean that race/name is the most important factor in call back success? Why or why not? If not, how would you amend your analysis?

Based on the results above, the name on a résumé does play a role in callbacks. <br> However, this analysis cannot tell us that it is the most important factor, let alone the only one, because we have not looked at any other variables. It's possible that other candidates are filtered out before the callback stage, and location within the United States may also play a role. To amend the analysis, we would compare the effect of race with that of the other columns in the dataset, such as `education` and `yearsexp`.
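
A first, purely descriptive step toward that comparison is to compute the callback rate within each level of a factor and look at the spread between the highest and lowest rates. The helpers below (`callback_rate_by_group`, `rate_spread`) are hypothetical names introduced for this sketch; the spread is a rough summary that can be noisy for factors with small groups, and a regression model would be a more principled follow-up.

```python
from collections import defaultdict


def callback_rate_by_group(groups, calls):
    """Return {group label: callback rate} for paired sequences of labels and 0/1 outcomes."""
    totals = defaultdict(int)
    callbacks = defaultdict(float)
    for group, call in zip(groups, calls):
        totals[group] += 1
        callbacks[group] += call
    return {group: callbacks[group] / totals[group] for group in totals}


def rate_spread(groups, calls):
    """Largest minus smallest callback rate across the groups of one factor."""
    rates = callback_rate_by_group(groups, calls)
    return max(rates.values()) - min(rates.values())


# In the notebook, the same helper could be applied to several columns, for example:
# for column in ["race", "education", "yearsexp"]:
#     print(column, rate_spread(data[column], data.call))
```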