# Examining Racial Discrimination in the US Job Market

### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés to black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes when presented to the employer.

<div class="span5 alert alert-info">
### Exercises
You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.

Answer the following questions **in this notebook below and submit to your GitHub account**.

   1. What test is appropriate for this problem? Does CLT apply?
   2. What are the null and alternate hypotheses?
   3. Compute margin of error, confidence interval, and p-value.
   4. Write a story describing the statistical significance in the context of the original problem.
   5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

You can include written notes in notebook cells using Markdown: 
   - In the control panel at the top, choose Cell > Cell Type > Markdown
   - Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet


#### Resources
+ Experiment information and data source: http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states
+ Scipy statistical methods: http://docs.scipy.org/doc/scipy/reference/stats.html 
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet
</div>
****

In [1]:
import pandas as pd
import numpy as np
from scipy import stats

In [2]:
data = pd.read_stata('data/us_job_market_discrimination.dta')

In [3]:
# number of callbacks for black-sounding names
sum(data[data.race=='b'].call)

157.0

In [4]:
data.head()

Unnamed: 0,id,ad,education,ofjobs,yearsexp,honors,volunteer,military,empholes,occupspecific,...,compreq,orgreq,manuf,transcom,bankreal,trade,busservice,othservice,missind,ownership
0,b,1,4,2,6,0,0,0,1,17,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
1,b,1,3,3,6,0,1,1,0,316,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
2,b,1,4,1,6,0,0,0,0,19,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
3,b,1,3,4,6,0,1,0,1,313,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
4,b,1,3,3,22,0,0,0,0,313,...,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,Nonprofit


In [5]:
data.race.unique()

array(['w', 'b'], dtype=object)

In [6]:
b_data = data[data.race == 'b']
w_data = data[data.race == 'w']
print(len(b_data), len(w_data))

2435 2435


In [7]:
b_data.call.mean()

0.064476386

In [8]:
w_data.call.mean()

0.096509241

### What test is appropriate for this problem? Does CLT apply?

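Each callback outcome is a Bernoulli trial (1 = callback, 0 = no callback). By the central limit theorem (CLT), the sampling distribution of the sample mean is approximately normal when the sample is large. Each group contains 2,435 résumés, which is large enough for the CLT to apply.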
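
A common rule of thumb for the normal approximation is that each group has at least 10 expected successes and 10 expected failures. The sketch below applies that check to the group sizes and callback rates computed above. The helper name `clt_ok` and the threshold of 10 are assumptions for illustration, not part of the original analysis.

```python
def clt_ok(n, p_hat, min_count=10):
    """Rule-of-thumb check: at least `min_count` expected successes and failures."""
    successes = n * p_hat
    failures = n * (1 - p_hat)
    return successes >= min_count and failures >= min_count

# Sample sizes and callback rates taken from the cells above
n_b, rate_b = 2435, 0.064476386
n_w, rate_w = 2435, 0.096509241

print(clt_ok(n_b, rate_b), clt_ok(n_w, rate_w))
```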

Since we're trying to compare two independent samples to understand whether the difference in the samples is statistically significant, the most appropriate test is a **two sample t-test**.

Null hypothesis: $H_0: p_b = p_w$, where $p_b$ and $p_w$ are the true callback rates for black-sounding and white-sounding names

Alternate hypothesis: $H_a: p_b \neq p_w$

From the calculations above, résumés with black-sounding names received callbacks 6.45% of the time, and résumés with white-sounding names received callbacks 9.65% of the time. The t-test determines whether this difference is statistically significant.

The t-statistic can be calculated as follows:
$$ t = \frac{X_w - X_b}{\sqrt{\frac{\mathrm{Var}_w}{\mathrm{Num}_w} + \frac{\mathrm{Var}_b}{\mathrm{Num}_b}}}$$

# <font color='red'>FEEDBACK</font>
Since N > 30, you don't need a t-test. You can use a z-test.
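
A two-proportion z-test along the lines the feedback suggests could look like the sketch below. It uses the pooled callback rate under the null hypothesis for the standard error, which is one common convention and differs slightly from the unpooled formula above. The function name `two_proportion_z` is hypothetical, and the standard-library `NormalDist` is used here only to keep the example self-contained. The callback counts are 157 (from the cell above) and 235 (derived from the 9.65% rate over 2,435 résumés).

```python
import math
from statistics import NormalDist

def two_proportion_z(success_1, n_1, success_2, n_2):
    """z-statistic for H0: p_1 == p_2, using the pooled proportion."""
    p_1 = success_1 / n_1
    p_2 = success_2 / n_2
    p_pool = (success_1 + success_2) / (n_1 + n_2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_1 + 1 / n_2))
    return (p_1 - p_2) / se

# Group 1: white-sounding names, group 2: black-sounding names
z = two_proportion_z(235, 2435, 157, 2435)
z_critical = NormalDist().inv_cdf(0.975)  # two-sided, 95% confidence level

print("z = {:.3f}, critical value = {:.3f}".format(z, z_critical))
print("Reject H0:", abs(z) > z_critical)
```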

In [23]:
import math

# Sample callback rates
X_w = w_data.call.mean()
X_b = b_data.call.mean()

# Variance of a Bernoulli variable: p * (1 - p)
var_w = X_w * (1 - X_w)
var_b = X_b * (1 - X_b)

num_w = len(w_data)
num_b = len(b_data)

# Standard error of the difference in means (unpooled, despite the variable name)
pooled_std = math.sqrt(float(var_b/num_b) + float(var_w/num_w))

mean_diff = float(X_w - X_b)

t_statistic = mean_diff/pooled_std

# Two-sided critical values at the 95% confidence level
t_critical_min = stats.t.ppf(0.025, num_w-1)
t_critical_max = stats.t.ppf(0.975, num_w-1)

print("T-critical range: ({},{})\nT-statistic: {}".format(t_critical_min, t_critical_max, t_statistic))

T-critical range: (-1.9609391001008838,1.9609391001008833)
T-statistic: 4.115550519002299
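
The exercise also asks for a p-value. A minimal sketch, assuming the t-statistic printed above and the same degrees of freedom used for the critical values (`num_w - 1`, i.e. 2434): the two-sided p-value is twice the upper-tail probability of the t distribution.

```python
from scipy import stats

t_statistic = 4.115550519002299  # value printed above
dof = 2435 - 1                   # same degrees of freedom as the critical values

# Two-sided p-value: probability of a statistic at least this extreme under H0
p_value = 2 * stats.t.sf(abs(t_statistic), dof)
print("p-value: {:.2e}".format(p_value))
```

A p-value below 0.05 leads to the same decision as a t-statistic outside the 95% critical range.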


### Margin of error, Confidence Interval

In [24]:
margin_of_error = t_critical_max * pooled_std
print(margin_of_error)

0.0152627157128


In [25]:
# 95% confidence interval for the difference in callback rates (white - black)
print(mean_diff - margin_of_error, mean_diff + margin_of_error)

0.0167701391423 0.0472955705678


Because the t-statistic (4.12) lies far outside the critical range (±1.96), we reject the null hypothesis: the difference in callback rates between résumés with black-sounding names and résumés with white-sounding names is statistically significant. The 95% confidence interval puts the gap at roughly 1.7 to 4.7 percentage points in favor of white-sounding names.

### Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

While the difference is statistically significant, this analysis alone does not show that race/name is the most important factor in callback success, because it examines only one variable. A résumé has many other attributes, so to compare their importance I would need to repeat the analysis using other independent variables in the dataset, such as education and number of jobs.
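
One way to start amending the analysis is to compare the callback gap within levels of another attribute. The sketch below assumes a DataFrame with the `race` and `call` columns described above plus one other column. The helper `callback_gap_by` and the tiny frame used to exercise it are made up for illustration and are not the study data.

```python
import pandas as pd

def callback_gap_by(df, column):
    """Callback rate per race within each level of `column`, plus the w - b gap."""
    rates = df.groupby([column, 'race'])['call'].mean().unstack('race')
    rates['gap'] = rates['w'] - rates['b']
    return rates

# Toy data (hypothetical), only to show the shape of the result
toy = pd.DataFrame({
    'race':      ['b', 'b', 'w', 'w', 'b', 'b', 'w', 'w'],
    'education': [3, 3, 3, 3, 4, 4, 4, 4],
    'call':      [0, 1, 1, 1, 0, 0, 0, 1],
})
print(callback_gap_by(toy, 'education'))
```

On the real data, the call would be `callback_gap_by(data, 'education')`, repeated for each attribute of interest.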