# Examining Racial Discrimination in the US Job Market

### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés to black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes when presented to the employer.

### Exercises
You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.

Answer the following questions **in this notebook below and submit to your Github account**. 

   1. What test is appropriate for this problem? Does CLT apply?
   2. What are the null and alternate hypotheses?
   3. Compute margin of error, confidence interval, and p-value.
   4. Write a story describing the statistical significance in the context of the original problem.
   5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

You can include written notes in notebook cells using Markdown: 
   - In the control panel at the top, choose Cell > Cell Type > Markdown
   - Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet


#### Resources
+ Experiment information and data source: http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states
+ Scipy statistical methods: http://docs.scipy.org/doc/scipy/reference/stats.html 
****

In [1]:
import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

In [2]:
data = pd.read_stata('/racial_disc/data/us_job_market_discrimination.dta')

In [3]:
print(data.head())
print(data.shape)

  id ad  education  ofjobs  yearsexp  honors  volunteer  military  empholes  \
0  b  1          4       2         6       0          0         0         1   
1  b  1          3       3         6       0          1         1         0   
2  b  1          4       1         6       0          0         0         0   
3  b  1          3       4         6       0          1         0         1   
4  b  1          3       3        22       0          0         0         0   

   occupspecific    ...      compreq  orgreq  manuf  transcom  bankreal trade  \
0             17    ...          1.0     0.0    1.0       0.0       0.0   0.0   
1            316    ...          1.0     0.0    1.0       0.0       0.0   0.0   
2             19    ...          1.0     0.0    1.0       0.0       0.0   0.0   
3            313    ...          1.0     0.0    1.0       0.0       0.0   0.0   
4            313    ...          1.0     1.0    0.0       0.0       0.0   0.0   

  busservice othservice  missind  owne

### 1. What test is appropriate for this problem? Does CLT apply?

The `call` variable is binary (0 or 1), so it cannot be normally distributed, and the normality test below accordingly rejects the null hypothesis that the data come from a normal distribution. A non-parametric test is therefore used: the Mann-Whitney U test, comparing the two groups (black-sounding and white-sounding names).

The CLT applies to the sample callback rates because each group is large (2,435 résumés per group, far more than 30) and the names were assigned to the résumés at random.
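
A common rule of thumb for applying the CLT to a proportion is that each group should contain at least 10 callbacks and at least 10 non-callbacks. The sketch below checks that condition using the counts printed by the next cell; the threshold of 10 and the variable names are assumptions for illustration, not part of the original exercise.

```python
# Callback counts per group, copied from the value_counts() output in this notebook
counts = {
    'b': {'call': 157, 'no_call': 2278},
    'w': {'call': 235, 'no_call': 2200},
}

MIN_COUNT = 10  # rule-of-thumb threshold (an assumption, not from the exercise)

# For each group, check that both outcomes occur at least MIN_COUNT times
clt_ok = {
    race: min(c['call'], c['no_call']) >= MIN_COUNT
    for race, c in counts.items()
}
print(clt_ok)
```

Both groups clear the threshold by a wide margin, which supports treating the sample callback rates as approximately normal.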

In [4]:
# Length of the data
print(data.race.value_counts())

# Response to resumes
print(data.groupby('race')['call'].value_counts())

b    2435
w    2435
Name: race, dtype: int64
race  call
b     0.0     2278
      1.0      157
w     0.0     2200
      1.0      235
Name: call, dtype: int64


In [5]:
# Check for normality in the data
print(stats.normaltest(data[data.race=='w'].call))
print(stats.normaltest(data[data.race=='b'].call))

NormaltestResult(statistic=1304.8637469446803, pvalue=4.4919770957666643e-284)
NormaltestResult(statistic=1743.1541461306329, pvalue=0.0)


### 2. What are the null and alternate hypotheses?  

Null hypothesis: there is no difference in callback rate between white-sounding and black-sounding names. Put differently, both groups receive the same callback rate.

Alternate hypothesis: white-sounding and black-sounding names receive different callback rates.
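
In symbols, writing $p_w$ and $p_b$ for the callback rates of white-sounding and black-sounding names (this notation is introduced here for convenience and does not appear in the original exercise), the two hypotheses are:

$$
H_0:\; p_w = p_b \qquad\qquad H_1:\; p_w \neq p_b
$$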

### 3. Compute margin of error, confidence interval, and p-value.

In [6]:
# number of callbacks for black-sounding names
black_call = sum(data[data.race=='b'].call)
print('black-sounding name calls:',black_call)
print('black-sounding name resumes:', len(data[data.race=='b']))

# number of callbacks for white-sounding names
white_call = sum(data[data.race == 'w'].call)
print('white-sounding name call:', white_call)
print('white-sounding name resumes:', len(data[data.race=='w']))

black-sounding name calls: 157.0
black-sounding name resumes: 2435
white-sounding name call: 235.0
white-sounding name resumes: 2435


In [7]:
# Subset data into groups
white = data[data.race=='w']
black = data[data.race=='b']

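# Mann-Whitney U test (no sorting needed: the test ranks the values internally)
# Note: SciPy's default `alternative` has differed across versions; pass
# alternative='two-sided' explicitly to match the two-sided hypotheses above.
statistic, p_value = stats.mannwhitneyu(white.call, black.call)
print('U statistic:', statistic)
print('p_value =', p_value)

U statistic: 2869647.5
p_value = 1.9957709596809747e-05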
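
Because `call` is binary, the same hypotheses can also be checked with a two-proportion z-test, which is the test that the CLT discussion in question 1 points toward. The sketch below is a cross-check under that assumption, not the method used above. It recomputes the test from the callback counts using only the standard library, and the variable names are illustrative.

```python
import math

# Callback counts and group sizes, copied from the output above
calls_w, n_w = 235, 2435
calls_b, n_b = 157, 2435

p_w = calls_w / n_w
p_b = calls_b / n_b

# Under the null hypothesis both groups share one callback rate, so pool them
p_pool = (calls_w + calls_b) / (n_w + n_b)
se_pool = math.sqrt(p_pool * (1 - p_pool) * (1 / n_w + 1 / n_b))

z = (p_w - p_b) / se_pool
# Two-sided p-value from the standard normal: P(|Z| >= |z|) = erfc(|z| / sqrt(2))
p_two_sided = math.erfc(abs(z) / math.sqrt(2))

print('z =', round(z, 3))
print('two-sided p-value =', p_two_sided)
```

With these counts the z statistic is about 4.1 and the two-sided p-value is far below 0.05, which agrees with the conclusion of the Mann-Whitney U test.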


In [8]:
# Confidence intervals
import statsmodels.stats.api as sms
print('Call rate(white) confidence interval (95%):', sms.DescrStatsW(white.call).tconfint_mean())
print('Call rate(black) confidence interval (95%):', sms.DescrStatsW(black.call).tconfint_mean())

Call rate(white) confidence interval (95%): (0.08477242880649906, 0.10824605168631408)
Call rate(black) confidence interval (95%): (0.05471454925653359, 0.07423822281738839)


In [9]:
# Proportions
proportion_w = white_call / len(data[data.race=='w'])
proportion_b = black_call / len(data[data.race=='b'])

# SE of difference of proportions
SE = ((proportion_w * (1 - proportion_w)) / len(data[data.race=='w']) + 
      (proportion_b * (1 - proportion_b)) / len(data[data.race=='b'])) ** 0.5

# Margin of error
# 1.96 = z-value for 95% confidence interval (see z_value table)
# margin of error = critical value * standard error
me = 1.96 * SE
print(me)

0.015255406349886438
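
The margin of error above belongs to the difference between the two callback rates, so the matching 95% confidence interval is that difference plus or minus the margin. A minimal sketch, recomputed from the callback counts so that it runs on its own (the variable names are illustrative):

```python
import math

calls_w, n_w = 235, 2435  # white-sounding names: callbacks, résumés
calls_b, n_b = 157, 2435  # black-sounding names: callbacks, résumés

p_w = calls_w / n_w
p_b = calls_b / n_b
diff = p_w - p_b

# Unpooled standard error of the difference of two proportions
se = math.sqrt(p_w * (1 - p_w) / n_w + p_b * (1 - p_b) / n_b)
margin = 1.96 * se  # 1.96 = z critical value for 95% confidence

ci_low, ci_high = diff - margin, diff + margin
print('difference:', round(diff, 4))
print('95% CI:', (round(ci_low, 4), round(ci_high, 4)))
```

The interval lies entirely above zero, which is consistent with a higher callback rate for white-sounding names.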


### 4. Write a story describing the statistical significance in the context of the original problem.

This project examines whether there is evidence of racial discrimination in the US labor market. To investigate this, identical résumés were randomly assigned white-sounding or black-sounding names, and the rate of interview requests from employers was recorded. Résumés with white-sounding names received callbacks about 9.7% of the time (235 of 2,435), compared with about 6.4% (157 of 2,435) for black-sounding names, and the test above gives a p-value of roughly 2e-05, so the difference is statistically significant. Because the names were assigned at random to otherwise identical résumés, the data indicate racial discrimination in favor of white-sounding names over black-sounding ones.

### 5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

Based on the analysis above, it cannot be concluded that race is the most important factor in callback success. The analysis tests only whether the perceived race of the name affects callbacks; it does not compare that effect with the effect of any other variable. Variables such as 'sex', 'city', 'manager', and 'computerskills', among others, as well as their interactions, could also play a role in the callback rate. Including those variables in the analysis, for example in a regression model, would help identify the most important factor in callback success.
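
One simple way to start on that amendment is to compare callback rates within the levels of another variable, so that the race gap can be set beside the gaps for other factors. The helper below is a hypothetical sketch: `callback_rate_by` is not part of the original notebook, and the small DataFrame is made-up toy data used only to show the call, not results from the study.

```python
import pandas as pd

def callback_rate_by(df, columns):
    """Return callback rate and résumé count for each combination of `columns`."""
    return (
        df.groupby(columns)['call']
          .agg(rate='mean', n='count')
          .reset_index()
    )

# Toy data for illustration only (NOT from the study)
toy = pd.DataFrame({
    'race': ['b', 'b', 'w', 'w', 'b', 'w'],
    'sex':  ['f', 'm', 'f', 'm', 'f', 'f'],
    'call': [0, 1, 1, 0, 0, 1],
})

rates = callback_rate_by(toy, ['sex', 'race'])
print(rates)

# On the real data this would be, for example:
# callback_rate_by(data, ['sex', 'race'])
```

Grouped rates like these only describe the data. A regression model, such as a logistic regression of `call` on race and the other variables, would be a more complete way to compare their effects.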