# Examining Racial Discrimination in the US Job Market

### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés to black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes when presented to the employer.

In [2]:
import pandas as pd
import numpy as np
from scipy import stats

In [3]:
data = pd.io.stata.read_stata('Desktop/project4/EDA_racial_discrimination/data/us_job_market_discrimination.dta')

In [4]:
# number of callbacks for white-sounding names
sum(data[data.race=='w'].call)

235.0

In [5]:
data.head()

Unnamed: 0,id,ad,education,ofjobs,yearsexp,honors,volunteer,military,empholes,occupspecific,...,compreq,orgreq,manuf,transcom,bankreal,trade,busservice,othservice,missind,ownership
0,b,1,4,2,6,0,0,0,1,17,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
1,b,1,3,3,6,0,1,1,0,316,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
2,b,1,4,1,6,0,0,0,0,19,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
3,b,1,3,4,6,0,1,0,1,313,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
4,b,1,3,3,22,0,0,0,0,313,...,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,Nonprofit



# 1. What test is appropriate for this problem? Does CLT apply?
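- Each résumé has a binary outcome (callback or no callback), so the question is a comparison of proportions. Because the sample is large, a z-test for proportions is appropriate.
- The central limit theorem applies: the sample is large and the race labels were assigned at random, so the sampling distribution of the callback rate is approximately normal.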
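A common rule of thumb for the normal approximation to a proportion is that the expected numbers of successes and failures are both at least 10. The sketch below checks that condition; the function name, the threshold of 10, and the group size of 2000 are assumptions chosen for illustration, not values taken from the dataset.

```python
def clt_conditions_met(n, p, threshold=10):
    """Rule-of-thumb check: expected successes and failures are both >= threshold."""
    return n * p >= threshold and n * (1 - p) >= threshold

# Illustrative inputs: a callback rate of about 0.08 and a
# hypothetical group size of 2000 résumés.
print(clt_conditions_met(n=2000, p=0.08))
```

With the notebook's data, `n` would be the number of rows in a group and `p` its callback rate.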

# 2. What are the null and alternate hypotheses?
- As described in the problem, the analysis focuses on the 'call' and 'race' columns.
- Null hypothesis: race has no effect on callbacks, so résumés with black-sounding and white-sounding names have the same callback probability.
- Alternative hypothesis: race has an effect on callbacks.

In [27]:
call_mean = data.call.mean()
print(call_mean)

0.08049281686544418


# 3. Compute margin of error, confidence interval, and p-value.
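- We use a z-statistic for question 3.
- We focus on callbacks for black-sounding names and compare their callback rate with the overall callback rate.
- The p-value is the probability, assuming the null hypothesis is true, of observing a callback rate for black-sounding names at least as far below the overall callback rate as the one in this sample.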

In [56]:
w = data[data.race=='w'] # white-sounding names
b = data[data.race=='b'] # black-sounding names
call_prob = data.call.mean() # overall callback rate; each call is a Bernoulli trial
std_err = np.sqrt(call_prob*(1-call_prob)/b.shape[0]) # standard error under the null
print(f'The standard error is: {std_err:.4f}')
b_call_prob = b.call.mean()

z = (b_call_prob - call_prob) / std_err
print(f'The z-score is: {z:.4f}')

z_crit = stats.norm.ppf(0.975) # critical value for a 95% interval (about 1.96)
margin_err = z_crit*std_err
print(f'The margin of error is: {margin_err:.4f}')
conf = (call_prob-margin_err, call_prob+margin_err) # interval around the overall rate
print(f'The 95% confidence interval is: ({conf[0]:.4f}, {conf[1]:.4f})')
p_value = stats.norm.cdf(z) # one-sided (lower-tail) p-value
print(f'The p-value is: {p_value:.4f}')

The standard error is: 0.0055
The z-score is: -2.9051
The margin of error is: 0.0108
The 95% confidence interval is: (0.0697, 0.0913)
The p-value is: 0.0018
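The cell above compares the black-sounding group with the overall callback rate. An alternative that compares the two groups directly is the pooled two-proportion z-test. The following is a minimal sketch of that test, not the method used in this notebook; the function name `two_proportion_ztest` and the counts in the usage line are made up for illustration.

```python
import numpy as np
from scipy import stats

def two_proportion_ztest(calls_a, n_a, calls_b, n_b):
    """Two-sided z-test for H0: both groups share the same callback probability."""
    p_a, p_b = calls_a / n_a, calls_b / n_b
    p_pool = (calls_a + calls_b) / (n_a + n_b)  # pooled rate under H0
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = 2 * stats.norm.sf(abs(z))  # two-sided tail area
    return z, p_value

# Made-up counts for illustration only (not taken from the dataset).
z_demo, p_demo = two_proportion_ztest(calls_a=30, n_a=200, calls_b=18, n_b=200)
print(f"z = {z_demo:.3f}, p = {p_demo:.4f}")
```

With the notebook's frames, the call would be `two_proportion_ztest(w.call.sum(), len(w), b.call.sum(), len(b))`.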


# 4. Write a story describing the statistical significance in the context of the original problem.
The data set contains the same number of résumés with white-sounding and black-sounding names, and with more than 4000 rows it is large enough for the normal approximation to hold. Splitting the data by race shows that résumés with white-sounding names receive more callbacks from employers. The z-test in question 3 gives a p-value well below 0.05, so the lower callback rate for black-sounding names is unlikely to be due to chance alone.
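The split by race described above can be summarized with a single `groupby`. The sketch below uses a toy frame with invented values and the same 'race' and 'call' column names as the dataset; the output column names are assumptions chosen for readability.

```python
import pandas as pd

# Toy data, invented for illustration; only the column names match the dataset.
toy = pd.DataFrame({
    "race": ["w", "w", "w", "w", "b", "b", "b", "b"],
    "call": [1, 0, 1, 0, 0, 0, 1, 0],
})

# Callbacks, number of résumés, and callback rate for each group.
summary = toy.groupby("race")["call"].agg(
    callbacks="sum", resumes="count", callback_rate="mean"
)
print(summary)
```

Replacing `toy` with `data` would give the per-group counts and rates for the real résumés.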

# 5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?
Applicants with white-sounding names tend to get more callbacks, but this analysis does not show that race/name is the most important factor in callback success. In question 3, the z-test on applicants with black-sounding names gave a p-value well below the 5% significance level, so we reject the null hypothesis that race has no effect. That result establishes only that race matters, not how much it matters relative to other factors. To amend the analysis, we can take education and work experience into consideration: most US employers present themselves as equal-opportunity employers, and many companies use applicant tracking systems that sort applicants largely by education and experience.
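One simple way to take another factor into account is to compare callback rates by race within each level of that factor. The sketch below stratifies by 'education' on a toy frame with invented values; the `gap_w_minus_b` column name is an assumption, and a regression model would be the more thorough follow-up.

```python
import pandas as pd

# Toy data, invented for illustration; column names match the dataset.
toy_strat = pd.DataFrame({
    "race":      ["w", "b", "w", "b", "w", "b", "w", "b"],
    "education": [3, 3, 3, 3, 4, 4, 4, 4],
    "call":      [1, 0, 0, 0, 1, 1, 1, 0],
})

# Callback rate for each race within each education level.
rates = toy_strat.groupby(["education", "race"])["call"].mean().unstack("race")
rates["gap_w_minus_b"] = rates["w"] - rates["b"]
print(rates)
```

If the gap between groups persists within every stratum, education alone does not explain the difference in callbacks.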