
### Examining racial discrimination in the US job market

#### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

#### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes.

#### Exercise
You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.

Answer the following questions **in this notebook below and submit to your GitHub account**.

   1. What test is appropriate for this problem? Does CLT apply?
   2. What are the null and alternate hypotheses?
   3. Compute margin of error, confidence interval, and p-value.
   4. Discuss statistical significance.

You can include written notes in notebook cells using Markdown: 
   - In the control panel at the top, choose Cell > Cell Type > Markdown
   - Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet


#### Resources
+ Experiment information and data source: http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states
+ Scipy statistical methods: http://docs.scipy.org/doc/scipy/reference/stats.html 
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet

****

In [67]:
%matplotlib inline
import math
import pandas as pd
import numpy as np
from scipy import stats as st

In [52]:
# Read in the US job market data
data = pd.read_stata('data/us_job_market_discrimination.dta')

# Split the 'call' column into two samples: resumes with black-sounding
# names and resumes with white-sounding names
df = data[['race','call']]
black = df[df.race=='b']['call']
white = df[df.race=='w']['call']

**1 - What test is appropriate for this problem? Does CLT apply?**

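Answer: A two-sample hypothesis test (z-test) comparing two population proportions. Yes, the CLT applies: the race labels were assigned to the resumes at random and both samples are large, with plenty of callbacks and non-callbacks in each group, so the two sample proportions and their difference are approximately normally distributed.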
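
A common rule of thumb for applying the CLT to a proportion is that a sample should contain at least 10 successes and 10 failures. A minimal sketch of that check follows; the helper name `clt_ok`, the default threshold, and the example counts are illustrative assumptions, not values taken from the dataset.

```python
def clt_ok(successes, n, threshold=10):
    """Return True if a sample has at least `threshold` successes and failures.

    This is the usual success-failure rule of thumb for treating a sample
    proportion as approximately normal. The threshold of 10 is a convention,
    not a hard requirement.
    """
    failures = n - successes
    return successes >= threshold and failures >= threshold


# Hypothetical counts for illustration only (not taken from the dataset).
print(clt_ok(successes=30, n=200))  # enough callbacks and non-callbacks
print(clt_ok(successes=3, n=200))   # too few callbacks
```

In this notebook the check could be run once per group, for example `clt_ok(int(black.sum()), len(black))`.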

**2 - What are the null and alternate hypotheses?**

Let p1 = the proportion of resumes with black-sounding names that receive a call

Let p2 = the proportion of resumes with white-sounding names that receive a call

The null hypothesis, H0, is that there is no difference between the two population proportions, i.e. p1-p2=0

The alternate hypothesis, H1, is that the two population proportions differ, i.e. p1-p2≠0

In [50]:
# Calculate p1 and p2
p1 = float(sum(black==1))/float(len(black))
p2 = float(sum(white==1))/float(len(white))
p1, p2

(0.06447638603696099, 0.09650924024640657)

In [56]:
# Enter confidence level as a %
conf_lev = 95.0 # Two-sided, so 2.5% of the probability lies in each tail

# Calculate the critical probability (e.g. a 95% conf interval gives crit_prob = 0.975)
crit_prob = (50 + conf_lev/2)/100

# Calculate the z value (critical value): under the null hypothesis, a z score
# within 0 +/- this value is not significant at the chosen confidence level
z_value = st.norm.ppf(crit_prob)
z_value

1.959963984540054
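
The same critical value can be obtained without SciPy. The sketch below wraps the calculation in a small helper built on the standard-library `statistics.NormalDist` (Python 3.8 or later); the function name `z_critical` is an illustrative assumption.

```python
from statistics import NormalDist


def z_critical(conf_level_pct):
    """Two-sided critical z value for a confidence level given in percent."""
    if not 0 < conf_level_pct < 100:
        raise ValueError("confidence level must be between 0 and 100")
    # Split the leftover probability equally between the two tails.
    tail = (1 - conf_level_pct / 100) / 2
    # Inverse CDF of the standard normal at the upper cutoff.
    return NormalDist().inv_cdf(1 - tail)


print(round(z_critical(95.0), 2))  # 1.96
```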

In [78]:
# Under the null hypothesis p1 = p2 = p, so we estimate p by pooling both
# groups, i.e. the proportion of calls across the whole sample. Calculate p
p = float(sum(data.call))/float(len(data))

# Calculate the standard deviation of the difference of p1_hat and p2_hat.
# With p = p1 = p2 this is sqrt(p*(1-p)*(1/n1 + 1/n2)), where n1 and n2 are
# the sizes of the two groups (not the size of the whole sample)
n1, n2 = len(black), len(white)
omega = math.sqrt(p*(1-p)*(1.0/n1 + 1.0/n2))

# Calculate Z score under the null hypothesis
z = (p1-p2 - 0)/omega

# Two-sided p-value based on z score (formatted for display):
'%.2e' % (2*st.norm.sf(abs(z)))

'3.98e-05'

In [81]:
# Calculate the margin of error (rounded for display)
margin_of_err = z_value * omega
round(margin_of_err, 4)

0.0153

In [83]:
# Calculate the lower and upper bounds of the confidence interval for p1-p2
# (rounded for display)
lower, upper = -margin_of_err + p1-p2, margin_of_err + p1-p2
round(lower, 4), round(upper, 4)

(-0.0473, -0.0168)

**3 - Compute margin of error, confidence interval, and p-value.**

Answer - see previous 3 cells.
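
The calculations in those cells can be collected into a single function. The sketch below is a standard pooled two-proportion z-test written against the standard library only; the function name, the returned tuple, and the example counts are illustrative assumptions. It uses the pooled standard error sqrt(p(1-p)(1/n1 + 1/n2)) for both the test and the interval, following this notebook's choice of a pooled estimate, and reports a two-sided p-value.

```python
import math
from statistics import NormalDist


def two_proportion_ztest(x1, n1, x2, n2, conf_level=0.95):
    """Pooled two-proportion z-test.

    x1, x2: number of successes (callbacks) in each group.
    n1, n2: number of observations (resumes) in each group.
    Returns (z, two_sided_p_value, margin_of_error, (lower, upper)).
    """
    p1_hat, p2_hat = x1 / n1, x2 / n2
    diff = p1_hat - p2_hat

    # Under H0 both groups share one proportion, estimated by pooling.
    p_pool = (x1 + x2) / (n1 + n2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))

    z = diff / se
    # Two-sided p-value: probability mass in both tails beyond |z|.
    p_value = math.erfc(abs(z) / math.sqrt(2))

    # Critical value and margin of error for the requested confidence level.
    z_crit = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)
    margin = z_crit * se
    return z, p_value, margin, (diff - margin, diff + margin)


# Hypothetical counts for illustration only (not taken from the dataset).
z, p_value, margin, ci = two_proportion_ztest(30, 200, 50, 200)
print(z, p_value, margin, ci)
```

In this notebook it could be called as `two_proportion_ztest(black.sum(), len(black), white.sum(), len(white))`. An unpooled standard error is a common alternative for the confidence interval, since the interval does not assume the null hypothesis.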

**4 - Discuss statistical significance.**

Based on the above, we can reject the null hypothesis and say that yes, there is an effect: resumes with white-sounding names receive more calls than resumes with black-sounding names, as the p-value is < 1%. Consistent with this, the 95% confidence interval for p1-p2 lies entirely below zero. It is very unlikely that a difference this large between the black and white sample proportions would arise if there were no effect.