# Examining Racial Discrimination in the US Job Market

### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés to black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes when presented to the employer.

In [1]:
import pandas as pd
import numpy as np
from scipy import stats

In [2]:
# load the Stata file through the public pandas entry point
data = pd.read_stata('data/us_job_market_discrimination.dta')
len(data)

4870

In [19]:
len(data[data.race=='w']), len(data[data.race=='b'])

(2435, 2435)

In [3]:
# proportion of callbacks for black-sounding names
sum(data[data.race=='b'].call) / len(data[data.race=='b'])

0.064476386036960986

In [4]:
# proportion of callbacks for white-sounding names
sum(data[data.race=='w'].call) / len(data[data.race=='w'])

0.096509240246406572

In [5]:
# overall (pooled) proportion of callbacks across both groups
sum(data.call) / len(data)

0.080492813141683772

## What test is appropriate for this problem? Does CLT apply?

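In this problem we are comparing two proportions: the callback rate for resumes with white-sounding names and the callback rate for resumes with black-sounding names. Because the names were randomly assigned, the two groups can be treated as independent samples. Each group is also large (2,435 resumes, with well over 10 callbacks and 10 non-callbacks in each), so the central limit theorem applies to the sample proportions and a two-sample z-test for proportions is appropriate. Let $p_1$ be the callback proportion for resumes with white-sounding names, and $p_2$ be the callback proportion for resumes with black-sounding names.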
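The large-sample condition can be checked directly. The following is a minimal sketch, assuming the common rule of thumb of at least 10 successes and 10 failures per group; the helper name `clt_ok` is hypothetical, and the callback counts are the ones implied by the proportions computed above (0.0645 × 2435 = 157 and 0.0965 × 2435 = 235).

```python
# Group sizes and callback counts implied by the proportions reported above.
n_b, n_w = 2435, 2435
calls_b, calls_w = 157, 235


def clt_ok(successes, n, threshold=10):
    """Rule-of-thumb check (an assumption, not a theorem):
    at least `threshold` successes and `threshold` failures."""
    failures = n - successes
    return successes >= threshold and failures >= threshold


print(clt_ok(calls_b, n_b), clt_ok(calls_w, n_w))  # True True
```

Both groups clear the threshold by a wide margin, which supports using the normal approximation here.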

## What are the null and alternate hypotheses?

Null hypothesis: $p_1 - p_2 = 0$

Alternative Hypothesis: $p_1 - p_2 > 0$

## Compute margin of error, confidence interval, and p-value.

In [22]:
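# One-sided p-value for the pooled two-proportion z-test
z_num = 0.096509240246406572 - 0.064476386036960986  # observed difference p1 - p2
# squared standard error under the null hypothesis, using the pooled callback rate
z_den_sq = 0.080492813141683772*(1-0.080492813141683772)*(1/len(data[data.race=='b']) + 1/len(data[data.race=='w']))
z = z_num/np.sqrt(z_den_sq)  # z-score, approximately 4.1084
1-stats.norm.cdf(z)  # upper-tail probability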

1.9919434187887219e-05
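As a cross-check on the cell above, the same test can be written as a small function that works from raw counts rather than hard-coded proportions. This is a sketch using only the standard library; the function name `two_proportion_z_test` is hypothetical, and the counts 235 and 157 are the callback counts implied by the proportions computed earlier.

```python
import math


def two_proportion_z_test(x1, n1, x2, n2):
    """One-sided pooled z-test of H0: p1 - p2 = 0 against H1: p1 - p2 > 0.

    x1, x2 are success counts; n1, n2 are group sizes.
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)  # common rate assumed under H0
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Upper-tail standard normal probability via the complementary error function.
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value


# White-sounding names first (p1), black-sounding names second (p2).
z, p_value = two_proportion_z_test(235, 2435, 157, 2435)
print(z, p_value)
```

Using `math.erfc` for the tail probability avoids the loss of precision that `1 - cdf(z)` can suffer when the p-value is very small.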

In [7]:
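# Standard error of the difference in proportions (unpooled).
# The margin of error is the critical value times this quantity.
me = np.sqrt(0.096509240246406572*(1-0.096509240246406572)/len(data[data.race=='w']) + 0.064476386036960986*(1-0.064476386036960986)/len(data[data.race=='b']))
me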

0.0077833705866767544

In [23]:
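# 99% confidence interval: observed difference +/- margin of error,
# where the margin of error is 2.58 * me (approximately 0.0201)
upper = z_num+2.58*me
lower = z_num-2.58*me
lower, upper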

(0.011951758095819557, 0.052113950323071617)
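The interval calculation can also be packaged so that the standard error, critical value, and margin of error are each named explicitly. This is a sketch under stated assumptions: the function name `diff_proportion_ci` is hypothetical, the counts 235 and 157 are those implied by the proportions above, and the critical value comes from `statistics.NormalDist` (about 2.576) rather than the rounded 2.58 used in the cell above, so the endpoints differ slightly in the fourth decimal place.

```python
import math
from statistics import NormalDist


def diff_proportion_ci(x1, n1, x2, n2, confidence=0.99):
    """Normal-approximation CI for p1 - p2 using the unpooled standard error.

    Returns (lower, upper, margin_of_error).
    """
    p1, p2 = x1 / n1, x2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    # Two-sided critical value, e.g. about 2.576 for 99% confidence.
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z_star * se
    diff = p1 - p2
    return diff - margin, diff + margin, margin


lower, upper, margin = diff_proportion_ci(235, 2435, 157, 2435)
print(lower, upper, margin)
```

The unpooled standard error is used for the interval because, unlike the hypothesis test, the interval does not assume the two proportions are equal.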

## Write a story describing the statistical significance in the context of the original problem.

We explored a dataset of resumes that either received or did not receive callbacks from potential employers. The resumes were randomly divided into two groups and assigned black-sounding or white-sounding names. We compared the proportion of callbacks in each group to see whether resumes with white-sounding names had a higher callback rate than resumes with black-sounding names. To do this, we conducted a one-tailed z-test for the difference in proportions. The z-score was 4.1084, with a p-value of about 2.0e-05, so we reject the null hypothesis at any conventional significance level. We have statistical evidence that there truly is a difference in callback rates, and that resumes with white-sounding names have a higher callback rate than resumes with black-sounding names. Additionally, we are 99% confident that the true difference in callback rates lies within the range (0.012, 0.052), that is, between roughly 1.2 and 5.2 percentage points.

## Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

In [25]:
data.columns

Index(['id', 'ad', 'education', 'ofjobs', 'yearsexp', 'honors', 'volunteer',
       'military', 'empholes', 'occupspecific', 'occupbroad', 'workinschool',
       'email', 'computerskills', 'specialskills', 'firstname', 'sex', 'race',
       'h', 'l', 'call', 'city', 'kind', 'adid', 'fracblack', 'fracwhite',
       'lmedhhinc', 'fracdropout', 'fraccolp', 'linc', 'col', 'expminreq',
       'schoolreq', 'eoe', 'parent_sales', 'parent_emp', 'branch_sales',
       'branch_emp', 'fed', 'fracblack_empzip', 'fracwhite_empzip',
       'lmedhhinc_empzip', 'fracdropout_empzip', 'fraccolp_empzip',
       'linc_empzip', 'manager', 'supervisor', 'secretary', 'offsupport',
       'salesrep', 'retailsales', 'req', 'expreq', 'comreq', 'educreq',
       'compreq', 'orgreq', 'manuf', 'transcom', 'bankreal', 'trade',
       'busservice', 'othservice', 'missind', 'ownership'],
      dtype='object')

As shown above, there are clearly many variables at play in determining resume callbacks. We cannot say that race is the only factor in callback success, or even the most important factor. To make that claim, we would need to study each of the other variables that might affect callback rates. For example, one would expect industry, employer, role, and location to matter. However, because the names were assigned at random, we can conclude with statistical evidence that the perceived race of the name is one important factor in resume callbacks, and that racial discrimination is an issue in the job market.
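One simple way to start studying the other variables is to compare callback rates by race within the levels of a second column. The sketch below is illustrative only: the helper `callback_rates_by` is hypothetical, and the rows are toy values made up to show the call shape, not figures from the dataset.

```python
from collections import defaultdict


def callback_rates_by(rows, group_key, race_key="race", call_key="call"):
    """Return {group: {race: callback rate}} for an iterable of dict-like rows."""
    # totals[group][race] holds [number of callbacks, number of resumes]
    totals = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for row in rows:
        cell = totals[row[group_key]][row[race_key]]
        cell[0] += row[call_key]
        cell[1] += 1
    return {
        group: {race: calls / n for race, (calls, n) in by_race.items()}
        for group, by_race in totals.items()
    }


# Toy rows, made up purely for illustration; NOT taken from the dataset.
toy = [
    {"race": "w", "sex": "f", "call": 1},
    {"race": "w", "sex": "f", "call": 0},
    {"race": "b", "sex": "f", "call": 0},
    {"race": "b", "sex": "f", "call": 0},
    {"race": "w", "sex": "m", "call": 1},
    {"race": "b", "sex": "m", "call": 1},
]
rates = callback_rates_by(toy, "sex")
print(rates)
```

With the notebook's `data` frame, the same helper could be called as `callback_rates_by(data[['race', 'sex', 'call']].to_dict('records'), 'sex')`, or the equivalent table could be produced with `data.groupby(['sex', 'race']).call.mean()`. Comparing many variables at once would call for a regression model instead of separate tables.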