# Examining Racial Discrimination in the US Job Market

### Introduction
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés to black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes when presented to the employer.

### Exercise

We will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.

### Initial Data Exploration

In [19]:
import pandas as pd
import numpy as np
from scipy import stats

In [20]:
data = pd.read_stata('data/us_job_market_discrimination.dta')

In [21]:
data.head()

Unnamed: 0,id,ad,education,ofjobs,yearsexp,honors,volunteer,military,empholes,occupspecific,...,compreq,orgreq,manuf,transcom,bankreal,trade,busservice,othservice,missind,ownership
0,b,1,4,2,6,0,0,0,1,17,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
1,b,1,3,3,6,0,1,1,0,316,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
2,b,1,4,1,6,0,0,0,0,19,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
3,b,1,3,4,6,0,1,0,1,313,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
4,b,1,3,3,22,0,0,0,0,313,...,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,Nonprofit


In [29]:
# Extract data associated with white-sounding names
w = data[data.race=='w']

# Extract data associated with black-sounding names
b = data[data.race=='b']



# Number of resumes with white-sounding names
wr = len(w)

# Number of resumes with black-sounding names
br = len(b)



# Number of callbacks for white-sounding names
wcb = sum(w.call)

# Number of callbacks for black-sounding names
bcb = sum(b.call)



print('number of resumes with white-sounding names:', wr)
print('number of callbacks for white-sounding names:', wcb)
print('\n')
print('number of resumes with black-sounding names:', br)
print('number of callbacks for black-sounding names:', bcb)

number of resumes with white-sounding names: 2435
number of callbacks for white-sounding names: 235.0


number of resumes with black-sounding names: 2435
number of callbacks for black-sounding names: 157.0


In [30]:
# Conditions of np >= 10 & n(1-p) >= 10
print(br*(bcb/br) >= 10)
print(br*(1-bcb/br) >= 10)
print(wr*(wcb/wr) >= 10)
print(wr*(1-wcb/wr) >= 10)

True
True
True
True


### 1. What test is appropriate for this problem? Does Central Limit Theorem (CLT) apply?

The Central Limit Theorem (CLT) states that, for sufficiently large samples drawn from a population with finite variance, the sampling distribution of the sample mean is approximately normal and centered on the population mean, regardless of the shape of the population distribution. A callback rate is the mean of 0/1 outcomes, so the CLT covers sample proportions as well.
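As a rough illustration of what the CLT claims for proportions, the sketch below simulates many samples of 0/1 callbacks and compares the spread of the sample proportions with the normal approximation. The true callback probability `p = 0.08` and the number of simulated samples are assumptions chosen only for illustration; they are not estimated from the study.

```python
import numpy as np

rng = np.random.default_rng(0)

p = 0.08           # assumed true callback probability (illustrative only)
n = 2435           # resumes per group, as in the dataset
n_samples = 10000  # number of simulated samples (arbitrary choice)

# Each draw is the number of callbacks in one simulated sample of n resumes
sample_props = rng.binomial(n, p, size=n_samples) / n

# Standard deviation the CLT predicts for a sample proportion
expected_std = np.sqrt(p * (1 - p) / n)

print('mean of sample proportions:', sample_props.mean())
print('std of sample proportions: ', sample_props.std())
print('CLT-predicted std:         ', expected_std)

# Share of sample proportions within 1.96 standard errors of p
within = np.mean(np.abs(sample_props - p) <= 1.96 * expected_std)
print('fraction within 1.96 SE:   ', within)
```

If the normal approximation holds, the last figure should be close to 0.95.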


In order to determine whether the rate of callbacks depends on race, we need to compare the rate of callbacks on résumés with white-sounding names against the rate of callbacks on résumés with black-sounding names. Because there are two groups to compare, the appropriate approach is a **two-sample test** for a difference in proportions.


For the CLT to apply, the samples must satisfy the random, normal, and independent conditions:


1. Random:

As stated above in the *introduction*, the résumés are **randomly** assigned to a white-sounding or black-sounding name for the experiment, thus making it a random, unbiased sample.


2. Normal:

For the sampling distribution of each proportion to be approximately normal, the sample size must be large enough that:
- np >= 10
- n(1-p) >= 10

As shown above, both the white-sounding and black-sounding samples meet the normal condition.


3. Independent:

The researchers in this study used just under 5,000 resumes sent in response to over 1,300 newspaper ads for sales, administrative and clerical jobs in the Chicago and Boston areas. It is safe to assume that the two cities combined have over 50,000 job seekers, so by the 10% rule (the sample is smaller than 10% of the population) the observations can be treated as independent.


The three conditions of CLT are met and so CLT applies in this case.

### 2. What are the null and alternate hypotheses?

Null hypothesis: 
- *Race has no effect on the rate of callbacks on resumes; the callback rates for black-sounding and white-sounding names are equal.*

Alternative hypothesis: 
- *Race does affect the rate of callbacks on resumes; the two callback rates differ.*
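
In symbols, writing $p_b$ and $p_w$ for the true callback rates of resumes with black-sounding and white-sounding names (notation introduced here for convenience; it does not appear in the dataset), the two-sided hypotheses can be stated as:

$$
H_0:\; p_b - p_w = 0 \qquad \text{vs.} \qquad H_A:\; p_b - p_w \neq 0
$$

The test statistic is then the observed difference in sample proportions, $\hat{p}_b - \hat{p}_w$.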

### 3. Compute margin of error, confidence interval, and p-value. Try using both the bootstrapping and the frequentist statistical approaches.

#### Bootstrap Approach

Method: two-sample test that resamples the pooled callbacks with replacement, assuming the null hypothesis is true

Significance level: $\alpha$=0.05

In [31]:
# Calculate the proportion of callbacks for white-sounding names
wp = sum(w.call)/len(w)

# Calculate proportion of callbacks for black-sounding names
bp = sum(b.call)/len(b)

# Calculate the difference in proportions
difference = bp - wp

#-------------------------------------------------------

# Concatenate the 'call' columns from both races while assuming the null hypothesis is true
cb = np.concatenate([w.call, b.call])

# Initialize array of replicates
bs_reps = np.empty(10000)

#-------------------------------------------------------

# Generate replicates
np.random.seed(10)
for i in range(10000):
    # Generate sample replicates
    bs = np.random.choice(cb, size=len(cb))
    
    # Split the resampled array into two samples matching the original group sizes
    bs_b = bs[:len(b)]
    bs_w = bs[len(b):]
    
    # Calculate the difference in bootstrap sample proportions
    bs_reps[i] = sum(bs_b)/len(bs_b) - sum(bs_w)/len(bs_w)

#-------------------------------------------------------    
    
# Calculate margin of error for 95% confidence level, z = 1.96
error = 1.96*np.std(bs_reps)
print('Margin of error is {}'.format(error))

# Calculate the central 95% interval of the replicates (centered on zero under the null hypothesis)
conf_int = np.percentile(bs_reps, [2.5, 97.5])
print('95% confidence interval is {}'.format(conf_int))

# Calculate p-value for number of replicates that are more extreme than what is observed in the sample
p = np.sum(bs_reps <= difference)/len(bs_reps)

# Convert to two-tailed test
p = 2*p
print('p-value is {}'.format(p))

Margin of error is 0.015359902171738776
95% confidence interval is [-0.01560575  0.01560575]
p-value is 0.0


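The p-value is below $\alpha$: none of the 10,000 replicates was as extreme as the observed difference of about -0.032, so the estimated p-value is 0 (that is, less than 1 in 10,000). The interval printed above is centered on zero because the replicates are generated under the null hypothesis, and the observed difference falls well outside it. The null hypothesis is therefore rejected: race does significantly affect the rate of callbacks on resumes.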
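
For comparison, a strict permutation test shuffles the pooled outcomes without replacement instead of resampling them. The sketch below is one way this might look; it rebuilds the 0/1 outcomes from the counts printed in the data exploration (235 and 157 callbacks out of 2,435 resumes each) so that it runs without the data file. The variable names and the seed are illustrative choices, not part of the original analysis.

```python
import numpy as np

# Counts reported in the data exploration above
n_w, calls_w = 2435, 235
n_b, calls_b = 2435, 157

# Rebuild the 0/1 callback outcomes for each group from the counts
w_call = np.concatenate([np.ones(calls_w), np.zeros(n_w - calls_w)])
b_call = np.concatenate([np.ones(calls_b), np.zeros(n_b - calls_b)])

observed = b_call.mean() - w_call.mean()
pooled = np.concatenate([w_call, b_call])

rng = np.random.default_rng(10)
n_perm = 10000
perm_reps = np.empty(n_perm)
for i in range(n_perm):
    # Shuffle without replacement, then treat the first n_b entries as group "b"
    shuffled = rng.permutation(pooled)
    perm_reps[i] = shuffled[:n_b].mean() - shuffled[n_b:].mean()

# Two-sided p-value: share of replicates at least as far from zero as the observed difference
p_value = np.mean(np.abs(perm_reps) >= abs(observed))
print('observed difference:', observed)
print('permutation p-value:', p_value)
```

Because every shuffle keeps exactly 392 callbacks in total, the replicates reflect only the random relabeling of names, which mirrors how the experiment assigned names to resumes.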


#### Frequentist approach

Method: two-sample z-test for a difference in proportions

Significance level: $\alpha$=0.05

In [35]:
# Calculate the proportion of callbacks for white-sounding names
wp = sum(w.call)/len(w)

# Calculate proportion of callbacks for black-sounding names
bp = sum(b.call)/len(b)

# Calculate the difference in proportions
difference = bp - wp

# Null Hypothesized difference in proportions is zero
h_difference = 0

#-------------------------------------------------------

# Calculate variance of sampling distribution for white-sounding names
w_var = wp*(1-wp)/len(w)

# Calculate variance of sampling distribution for black-sounding names
b_var = bp*(1-bp)/len(b)

# Calculate standard deviation of the difference in sample proportions
std_diff = np.sqrt(b_var + w_var)

#-------------------------------------------------------

# Calculate margin of error for 95% confidence level, z = 1.96
error = 1.96*std_diff
print('Margin of error is {}'.format(error))

# Calculate the 95% interval around the null-hypothesized difference
conf_int = h_difference + np.array([-1,1])*error
print('95% confidence interval is {}'.format(conf_int))

# Calculate the z-statistic (the p-value is then read from a z-table)
z = (difference - h_difference)/std_diff
print('The z-score is {}'.format(z))

Margin of error is 0.015255406349886438
95% confidence interval is [-0.01525541  0.01525541]
The z-score is -4.11555043573


According to the z-table, the p-value for a two-tailed test is:

2 * P(Z <= -4.116) < 0.0001

In [36]:
# Using scipy to compute the p-value
p = stats.norm.cdf(difference, h_difference, std_diff)

# Convert to two-tailed test
p = 2*p
print('The p-value is {}'.format(p))

The p-value is 3.862565207522622e-05


The p-value is far below $\alpha$, so the null hypothesis is rejected and race does significantly affect the rate of callbacks on resumes.
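
Two common textbook variants of this calculation are sketched below, working directly from the counts reported earlier. The first uses a pooled standard error, which is the usual choice when the null hypothesis says the two proportions are equal. The second centers the 95% confidence interval on the observed difference rather than on zero. Neither is part of the original analysis, and the variable names are illustrative.

```python
import numpy as np
from scipy import stats

# Counts reported in the data exploration above
n_w, calls_w = 2435, 235
n_b, calls_b = 2435, 157

p_w = calls_w / n_w
p_b = calls_b / n_b
diff = p_b - p_w

# Pooled standard error: under H0 both groups share one callback rate
p_pool = (calls_w + calls_b) / (n_w + n_b)
se_pooled = np.sqrt(p_pool * (1 - p_pool) * (1 / n_w + 1 / n_b))
z_pooled = diff / se_pooled
p_value = 2 * stats.norm.sf(abs(z_pooled))  # two-tailed

# Unpooled standard error for an interval around the observed difference
se_unpooled = np.sqrt(p_w * (1 - p_w) / n_w + p_b * (1 - p_b) / n_b)
z_crit = stats.norm.ppf(0.975)
ci = diff + np.array([-1, 1]) * z_crit * se_unpooled

print('pooled z-score:', z_pooled)
print('two-tailed p-value:', p_value)
print('95% CI for the difference:', ci)
```

An interval that excludes zero leads to the same conclusion as the hypothesis test, and it also conveys the plausible size of the gap.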

### 4. Write a story describing the statistical significance in the context of the original problem.

Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés to black-sounding or white-sounding names and observing the impact on requests for interviews from employers.  

The study shows that 9.65% of résumés with white-sounding names received callbacks, while only 6.45% of résumés with black-sounding names received callbacks. The difference of 3.2 percentage points is statistically significant, with a z-score of -4.12 and a two-tailed p-value of about 0.00004. The analysis concludes that race has a significant impact on callback rates when all other candidate qualifications are equal.

### 5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

Although the analysis has shown that race/name has a significant effect on callback success rates, it does not show that it is the most important factor in callback success rates.  

Other variables that need to be considered include education, experience, proficiency and skills, volunteer work, military experience and more. To determine whether race/name is the most important factor, the relationship between callback rates and each of these other variables must also be estimated, and the resulting effects compared and ranked.
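
A first pass at such a comparison might compute, for each binary column, the gap in callback rate between resumes that have the attribute and those that do not. The sketch below shows the idea with a hypothetical helper, `callback_gap`, applied to a tiny made-up frame. The toy values are invented purely to demonstrate the call pattern and say nothing about the study data; `honors` and `volunteer` are used only because columns with those names appear in the dataset.

```python
import pandas as pd

def callback_gap(df, column, outcome='call'):
    """Callback rate where `column` == 1 minus callback rate where `column` == 0."""
    rates = df.groupby(column)[outcome].mean()
    return rates.loc[1] - rates.loc[0]

# Tiny made-up frame purely to show the call pattern; NOT the study data
toy = pd.DataFrame({
    'call':      [1, 0, 0, 1, 0, 1, 0, 0],
    'honors':    [1, 0, 0, 1, 0, 1, 1, 0],
    'volunteer': [0, 1, 0, 1, 1, 0, 0, 1],
})

# Gap in callback rate for each attribute, ranked by absolute size
gaps = {col: callback_gap(toy, col) for col in ['honors', 'volunteer']}
ranked = sorted(gaps.items(), key=lambda kv: abs(kv[1]), reverse=True)
for col, gap in ranked:
    print(col, round(gap, 3))
```

Raw gaps like these ignore overlap between attributes, so a regression that includes all variables at once would give a more reliable ranking.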

#### Resources
+ Experiment information and data source: http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states
+ Scipy statistical methods: http://docs.scipy.org/doc/scipy/reference/stats.html 
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet
+ Formulas for the Bernoulli distribution: https://en.wikipedia.org/wiki/Bernoulli_distribution