# Examining Racial Discrimination in the US Job Market

## C. Bonfield (Data Science Career Track)

### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés to black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes when presented to the employer.

### Exercises
You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.

Answer the following questions **in this notebook below and submit to your GitHub account**.

   1. What test is appropriate for this problem? Does CLT apply?
   2. What are the null and alternate hypotheses?
   3. Compute margin of error, confidence interval, and p-value.
   4. Write a story describing the statistical significance in the context of the original problem.
   5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

You can include written notes in notebook cells using Markdown: 
   - In the control panel at the top, choose Cell > Cell Type > Markdown
   - Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet


#### Resources
+ Experiment information and data source: http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states
+ Scipy statistical methods: http://docs.scipy.org/doc/scipy/reference/stats.html 
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet
****

In [1]:
# Import statements 
import pandas as pd
import numpy as np
from scipy import stats

In [2]:
data = pd.read_stata('data/us_job_market_discrimination.dta')

Let's start by taking a quick peek at the data to see what we're working with here.

In [3]:
data.head()

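| | id | ad | education | ofjobs | yearsexp | honors | volunteer | military | empholes | occupspecific | ... | compreq | orgreq | manuf | transcom | bankreal | trade | busservice | othservice | missind | ownership |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | b | 1 | 4 | 2 | 6 | 0 | 0 | 0 | 1 | 17 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | |
| 1 | b | 1 | 3 | 3 | 6 | 0 | 1 | 1 | 0 | 316 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | |
| 2 | b | 1 | 4 | 1 | 6 | 0 | 0 | 0 | 0 | 19 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | |
| 3 | b | 1 | 3 | 4 | 6 | 0 | 1 | 0 | 1 | 313 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | |
| 4 | b | 1 | 3 | 3 | 22 | 0 | 0 | 0 | 0 | 313 | ... | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | Nonprofit |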


In [4]:
print('Number of Applicants: ', len(data.index))
print('Number of "Black-Sounding" Names: ', len(data[data.race=='b'].index))
print('Number of "White-Sounding" Names: ', len(data[data.race=='w'].index))

Number of Applicants:  4870
Number of "Black-Sounding" Names:  2435
Number of "White-Sounding" Names:  2435


Interesting! While we have oodles of features, we're really only interested in the `race` and `call` columns.

### 1. What test is appropriate for this problem? Does CLT apply?

For this problem, the most appropriate test is a two-proportion z-test, where the proportions of interest are the fractions of resumes with black- and white-sounding names that received a callback. The `call` column is a binary (Bernoulli) outcome, and as we saw above, each group contains 2,435 resumes. That is large enough for the CLT to apply (the usual rule of thumb is at least 10 successes and 10 failures in each group), so the sampling distribution of each proportion, and of their difference, is approximately normal, and it is reasonable to rely on that in the statistical tests that follow.
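The rule of thumb can be checked directly from the group counts. The sketch below is a minimal illustration, not part of the original analysis: `clt_conditions_met` is a hypothetical helper, and the callback counts (157 and 235) are assumed values, chosen because they are consistent with the confidence interval reported in Section 3; the notebook itself never prints them.

```python
def clt_conditions_met(successes, n, threshold=10):
    """Rule of thumb for the normal approximation to a proportion:
    at least `threshold` successes and `threshold` failures."""
    failures = n - successes
    return successes >= threshold and failures >= threshold

# ASSUMED counts (callbacks, resumes) per group; not printed in the notebook.
groups = {'b': (157, 2435), 'w': (235, 2435)}

for race, (calls, total) in groups.items():
    print(race, clt_conditions_met(calls, total))
```

With the real data, the same check would take `b_calls`, `b_total`, `w_calls`, and `w_total` as computed in Section 3.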

### 2. What are the null and alternate hypotheses?

Here, the null hypothesis is that resumes with black- and white-sounding names receive interview requests at the same rate (p<sub>b</sub> = p<sub>w</sub>). The alternate hypothesis is that the two callback rates differ (p<sub>b</sub> ≠ p<sub>w</sub>), which makes this a two-sided test.

### 3. Compute margin of error, confidence interval, and p-value.

Here, we will calculate the 95% confidence interval for the difference between p<sub>b</sub> and p<sub>w</sub>.
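
For reference, these are the standard definitions of the quantities this section asks for. The notation is an assumption introduced here: $\hat{p}_b$ and $\hat{p}_w$ are the observed callback fractions (`b_success`, `w_success`), $x_b$ and $x_w$ the callback counts (`b_calls`, `w_calls`), $n_b$ and $n_w$ the group sizes (`b_total`, `w_total`), and $\Phi$ the standard normal CDF. The value 1.96 is the usual critical value for a 95% interval.

$$
\begin{aligned}
\mathrm{SE} &= \sqrt{\frac{\hat{p}_b (1 - \hat{p}_b)}{n_b} + \frac{\hat{p}_w (1 - \hat{p}_w)}{n_w}} \\
\mathrm{MoE} &= 1.96 \cdot \mathrm{SE} \\
\mathrm{CI}_{95\%} &= (\hat{p}_b - \hat{p}_w) \pm \mathrm{MoE} \\
z &= \frac{\hat{p}_b - \hat{p}_w}{\mathrm{SE}} \\
p\text{-value} &= 2 \left( 1 - \Phi(|z|) \right)
\end{aligned}
$$

The pooled variant of the test statistic replaces the standard error with one based on the combined proportion $\hat{p} = (x_b + x_w) / (n_b + n_w)$:

$$
z_{\mathrm{alt}} = \frac{\hat{p}_b - \hat{p}_w}{\sqrt{\hat{p} (1 - \hat{p}) \left( \frac{1}{n_b} + \frac{1}{n_w} \right)}}
$$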

In [5]:
# Number of calls for black/white-sounding names, number of white/black applicants, 
# and number of total applicants. 
b_calls = sum(data[data.race=='b'].call)
w_calls = sum(data[data.race=='w'].call)
b_total = len(data[data.race=='b'].index)
w_total = len(data[data.race=='w'].index)
total_applicants = len(data.index)

# Fraction of successes. 
b_success = b_calls / b_total
w_success = w_calls / w_total

In [6]:
# Compute margin of error, confidence interval, and p-value. 
se = np.sqrt((b_success * (1. - b_success) / b_total) + (w_success * (1. - w_success) / w_total))
moe = 1.96 * se
l_ci = (b_success - w_success) - moe
u_ci = (b_success - w_success) + moe

z = (b_success - w_success) / se
# Two-sided p-value: the area in both tails beyond |z| (survival function),
# not the height of the density at z.
p = stats.norm.sf(abs(z)) * 2.0

# Alternate definition of z using the pooled proportion (returns nearly identical result)
frac_combined = (b_calls + w_calls) / (b_total + w_total)
z_alt = (b_success - w_success) / np.sqrt(frac_combined * (1. - frac_combined) * ((1./b_total) + (1./w_total)))
p_alt = stats.norm.sf(abs(z_alt)) * 2.0

print('Margin of error: ', moe)
print('Lower Bound (95% confidence interval): ', l_ci)
print('Upper Bound (95% confidence interval): ', u_ci)
print('z: ', z)
print('p-value: ', format(p, '.2e'))

Margin of error:  0.0152554063499
Lower Bound (95% confidence interval):  -0.0472882605593
Upper Bound (95% confidence interval):  -0.0167774478596
z:  -4.11555043573
p-value:  3.86e-05
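
As a cross-check that does not need the dataset or SciPy, the same test can be sketched with the Python standard library (`statistics.NormalDist`, Python 3.8+). This is an illustrative sketch rather than the notebook's own method: `two_proportion_ztest` is a hypothetical helper, and the callback counts (157 and 235) are assumed values that happen to be consistent with the interval and z statistic printed above. Note that the two-sided p-value is a tail area taken from the normal CDF.

```python
import math
from statistics import NormalDist

def two_proportion_ztest(x1, n1, x2, n2, confidence=0.95):
    """Two-sided z-test and confidence interval for p1 - p2,
    using the unpooled standard error."""
    p1, p2 = x1 / n1, x2 / n2
    diff = p1 - p2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    std_normal = NormalDist()
    # Critical value for the requested confidence level (about 1.96 for 95%).
    z_crit = std_normal.inv_cdf(0.5 + confidence / 2)
    moe = z_crit * se
    z = diff / se
    # Area in both tails beyond |z|.
    p_value = 2 * std_normal.cdf(-abs(z))
    return {'diff': diff, 'moe': moe, 'ci': (diff - moe, diff + moe),
            'z': z, 'p_value': p_value}

# ASSUMED counts: callbacks and resumes for black- and white-sounding names.
result = two_proportion_ztest(157, 2435, 235, 2435)
for name, value in result.items():
    print(name, value)
```

Because the critical value comes from `inv_cdf` rather than the rounded 1.96, the margin of error can differ from the cell above in the sixth decimal place.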


### 4. Write a story describing the statistical significance in the context of the original problem.

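Our findings in the previous section lead us to reject the null hypothesis in favor of the alternate hypothesis at the 5% significance level (and at much stricter levels, for that matter): the 95% confidence interval for p<sub>b</sub> - p<sub>w</sub>, roughly (-0.047, -0.017), excludes zero, and the p-value is far below 0.05. In the context of the original problem, resumes with black-sounding names received callbacks at a rate about 3.2 percentage points lower than resumes with white-sounding names (plausibly anywhere from 1.7 to 4.7 points lower). Because the names were assigned to resumes at random, this gap is unlikely to be explained by differences in the resumes themselves, although this analysis only considers the `race` column.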

### 5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

This, however, does not necessarily mean that race/name is the *most important* factor in callback success. The test above only shows that the name has an effect; it says nothing about how that effect compares with the effects of other resume features. The dataset contains 65 columns, meaning that we have 50+ other reported resume features (education, years of experience, honors, and so on) that could matter more or less than race, and some of them are likely correlated with one another. In my opinion, the quick and dirty thing to do in this situation would be to standardize all of the other features in the dataset, perform logistic regression (with `call` as the variable being predicted and `race` included as a predictor), and compare the relative magnitudes of the coefficients returned by the regression. Other, more sophisticated methods exist for such a task (Boruta is one that I've tinkered with in the past), but regardless, the important takeaway here is that one should not draw conclusions before examining all of the available data!
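
The standardize-then-compare idea can be sketched on synthetic data. Everything below is an assumption made for illustration: the two features (`strong`, `weak`), their effect sizes, and the hand-rolled gradient-descent fit are not from the original analysis, and in practice a library implementation such as scikit-learn's `LogisticRegression` would be the natural choice. The fit is written out here only so the sketch runs with the standard library.

```python
import math
import random

def standardize(column):
    """Rescale a list of numbers to zero mean and unit standard deviation."""
    mean = sum(column) / len(column)
    std = math.sqrt(sum((x - mean) ** 2 for x in column) / len(column))
    return [(x - mean) / std for x in column]

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def fit_logistic(rows, labels, lr=0.5, n_iter=300):
    """Fit an intercept and coefficients by batch gradient descent on the log-loss."""
    n, k = len(rows), len(rows[0])
    intercept, coefs = 0.0, [0.0] * k
    for _ in range(n_iter):
        grad_b, grad_w = 0.0, [0.0] * k
        for x, y in zip(rows, labels):
            err = sigmoid(intercept + sum(w * xi for w, xi in zip(coefs, x))) - y
            grad_b += err
            for j in range(k):
                grad_w[j] += err * x[j]
        intercept -= lr * grad_b / n
        coefs = [w - lr * g / n for w, g in zip(coefs, grad_w)]
    return intercept, coefs

# Synthetic stand-in for the resume data (NOT the real dataset): one feature
# with a large effect on the outcome and one, on a different scale, with a small effect.
random.seed(0)
n_rows = 1000
strong = [random.gauss(0.0, 1.0) for _ in range(n_rows)]
weak = [random.gauss(10.0, 5.0) for _ in range(n_rows)]
calls = [
    1 if random.random() < sigmoid(-2.0 + 1.0 * s + 0.02 * (w - 10.0)) else 0
    for s, w in zip(strong, weak)
]

# Standardizing first puts both coefficients on a "per standard deviation" scale,
# which is what makes their magnitudes comparable.
features = list(zip(standardize(strong), standardize(weak)))
intercept, (coef_strong, coef_weak) = fit_logistic(features, calls)
print('strong feature coefficient:', round(coef_strong, 2))
print('weak feature coefficient:', round(coef_weak, 2))
```

Coefficient magnitudes are only a rough guide: strongly correlated features can share or trade off weight, so a ranking built this way should be read with that caveat in mind.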