# Examining Racial Discrimination in the US Job Market

### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés to black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes when presented to the employer.

### Exercises
You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.

Answer the following questions **in this notebook below and submit to your GitHub account**.

   1. What test is appropriate for this problem? Does CLT apply?
   2. What are the null and alternate hypotheses?
   3. Compute margin of error, confidence interval, and p-value. Try using both the bootstrapping and the frequentist statistical approaches.
   4. Write a story describing the statistical significance in the context of the original problem.
   5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

You can include written notes in notebook cells using Markdown: 
   - In the control panel at the top, choose Cell > Cell Type > Markdown
   - Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet


#### Resources
+ Experiment information and data source: http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states
+ Scipy statistical methods: http://docs.scipy.org/doc/scipy/reference/stats.html 
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet
+ Formulas for the Bernoulli distribution: https://en.wikipedia.org/wiki/Bernoulli_distribution

In [1]:
import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
import os
os.getcwd()

'C:\\Users\\Frank'

### Read in data and examine structure of the data

In [3]:
os.chdir('C:/Users/Frank/Desktop/Python_Springboard/EDA_discrimination/1522241501_dsc_racial_disc/EDA_racial_discrimination/data')

In [4]:
data = pd.io.stata.read_stata('us_job_market_discrimination.dta')

In [7]:
# number of callbacks for black-sounding names
sum(data[data.race=='b'].call)

157.0

In [8]:
data.head()

Unnamed: 0,id,ad,education,ofjobs,yearsexp,honors,volunteer,military,empholes,occupspecific,...,compreq,orgreq,manuf,transcom,bankreal,trade,busservice,othservice,missind,ownership
0,b,1,4,2,6,0,0,0,1,17,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
1,b,1,3,3,6,0,1,1,0,316,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
2,b,1,4,1,6,0,0,0,0,19,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
3,b,1,3,4,6,0,1,0,1,313,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
4,b,1,3,3,22,0,0,0,0,313,...,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,Nonprofit


In [9]:
df = pd.DataFrame([data.call, data.race])
df = df.transpose()
df.head(10)

Unnamed: 0,call,race
0,0,w
1,0,w
2,0,b
3,0,b
4,0,w
5,0,w
6,0,w
7,0,b
8,0,b
9,0,b


## Separate data by race

In [10]:

w = data[data.race=='w']
b = data[data.race=='b']
len(b), len(w)



(2435, 2435)

## Calculate proportions

In [12]:
prop_b = sum(b.call) / len(b.call)
prop_w = sum(w.call) / len(w.call)
prop_b, prop_w

(0.06447638603696099, 0.09650924024640657)

## What test is appropriate for this problem? Does CLT apply?

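A two-sample z-test for a difference in proportions is appropriate here: we compare the observed difference between two sample proportions (a point estimate) with the null value of zero, and both samples are large (n = 2435 each). Categorical data differ from numerical data in that the statistic is a proportion, the number of 'successes' divided by the sample size n, rather than a mean of measurements. The Central Limit Theorem does apply to the sampling distribution of a sample proportion, provided the observations are independent and each group contains at least 10 successes and 10 failures. Both conditions hold: names were assigned to resumes at random, and the two groups have 157 and 235 callbacks out of 2435 resumes each. The sampling distribution of the difference in proportions is therefore nearly normal.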
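The large-sample condition can be checked directly. The sketch below uses only the counts reported in this notebook (157 callbacks out of 2435 for black-sounding names, and 235 out of 2435 for white-sounding names, which is what the proportion 0.0965 implies). It applies the common rule of thumb of at least 10 successes and 10 failures per group. The helper name `success_failure_ok` and the threshold of 10 are illustrative assumptions, not part of the original exercise.

```python
def success_failure_ok(successes, n, minimum=10):
    """Return True if a sample has at least `minimum` successes and failures."""
    failures = n - successes
    return successes >= minimum and failures >= minimum

# Counts taken from the cells above: (callbacks, resumes) per group.
groups = {"b": (157, 2435), "w": (235, 2435)}
for race, (callbacks, n) in groups.items():
    # Prints the group, its successes, its failures, and the check result.
    print(race, callbacks, n - callbacks, success_failure_ok(callbacks, n))
```

Both groups pass comfortably, so the normal approximation used in the rest of the analysis is reasonable.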

## What are the alternate and null hypotheses?

#### The null hypothesis:
The callback rate is the same for CVs with white-sounding names and CVs with black-sounding names: p_w - p_b = 0.

#### The alternate hypothesis:
The callback rate differs between CVs with white-sounding names and CVs with black-sounding names: p_w - p_b ≠ 0.


## Compute the margin of error

In [13]:
se_CI = np.sqrt((prop_b*(1 - prop_b)/(len(b))) + (prop_w*(1 - prop_w) /(len(w))))
se_CI

0.0077833705866767544

In [14]:
critical = 1.96  # z critical value for a two-sided 95% confidence level
margin = abs(critical*se_CI)
print("Margin of error for the difference in proportions (95%% confidence): +/- %0.6F" % margin)

Margin of error for the difference in proportions (95% confidence): +/- 0.015255
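The critical value 1.96 is the 97.5th percentile of the standard normal distribution, which corresponds to a two-sided 95% confidence level. A minimal sketch, using only the standard library and the counts above, derives the critical value instead of hard-coding it. The helper name `margin_of_error` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def margin_of_error(x1, n1, x2, n2, confidence=0.95):
    """Margin of error for a difference of two proportions (unpooled SE)."""
    p1, p2 = x1 / n1, x2 / n2
    # Unpooled standard error: each group keeps its own proportion.
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    critical = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return critical * se

# 235 of 2435 callbacks ('w') versus 157 of 2435 callbacks ('b').
print(margin_of_error(235, 2435, 157, 2435))
```

Passing a different `confidence`, such as 0.99, widens the margin accordingly.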


## Compute the confidence interval

In [15]:
point_est = prop_w - prop_b
point_est

0.032032854209445585

In [16]:
CI = [point_est - margin, point_est + margin]  # [lower bound, upper bound]
print("The difference in proportions is %0.6F +/- %0.5F" % (point_est, margin))
print("The proportion of CVs with white-sounding names that receive a call is between %0.6F and %0.6F higher than the proportion of CVs with black-sounding names" % (CI[0], CI[1]))

The difference in proportions is 0.032033 +/- 0.01526
The proportion of CVs with white-sounding names that receive a call is between 0.016777 and 0.047288 higher than the proportion of CVs with black-sounding names
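The exercise also asks for a bootstrap approach. One possible sketch is a percentile bootstrap: resample each group with replacement, recompute the difference in callback rates, and take the 2.5th and 97.5th percentiles of the replicates. The function name, the number of replicates, and the seed are assumptions chosen for illustration. The 0/1 vectors are rebuilt from the counts above so that the block runs without the data file.

```python
import numpy as np

def bootstrap_diff_ci(x_w, n_w, x_b, n_b, n_reps=10_000, confidence=0.95, seed=0):
    """Percentile bootstrap CI for p_w - p_b, resampling each group separately."""
    rng = np.random.default_rng(seed)
    # Rebuild the 0/1 callback vectors from the counts.
    w_calls = np.repeat([1, 0], [x_w, n_w - x_w])
    b_calls = np.repeat([1, 0], [x_b, n_b - x_b])
    diffs = np.empty(n_reps)
    for i in range(n_reps):
        w_star = rng.choice(w_calls, size=n_w, replace=True)
        b_star = rng.choice(b_calls, size=n_b, replace=True)
        diffs[i] = w_star.mean() - b_star.mean()
    alpha = 1 - confidence
    low, high = np.percentile(diffs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return low, high

low, high = bootstrap_diff_ci(235, 2435, 157, 2435)
print(f"95% bootstrap CI for p_w - p_b: [{low:.4f}, {high:.4f}]")
```

In the notebook itself, `w.call.values` and `b.call.values` could be resampled directly instead of the rebuilt vectors. The interval should land close to the normal-approximation interval above, with small differences from run to run depending on the seed.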


## Calculate the p-value

In [18]:
null = 0

p_pool = (sum(data.call)/(len(data.call)))
p_pool

0.08049281314168377

In [20]:
se_ht = np.sqrt((p_pool*(1 - p_pool)/(len(b))) + (p_pool*(1 - p_pool) /(len(w))))
se_ht

0.007796894036170457

In [21]:
z = (point_est - null)/se_ht  # uses the pooled standard error se_ht computed above
p_values = stats.norm.sf(abs(z))*2  # two-sided p-value
print("Z-score is equal to : %6.3F  p-value equal to: %6.7F" % (z,p_values))

Z-score is equal to :  4.108  p-value equal to: 0.0000398
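As a simulation-based check on the normal-approximation p-value, a permutation test shuffles the callback outcomes across the two groups. This mimics the null hypothesis that the name on the resume has no effect. The sketch below is one way to do it under stated assumptions: the function name, the number of permutations, and the seed are hypothetical, and the outcomes are rebuilt from the counts above.

```python
import numpy as np

def permutation_p_value(x_w, n_w, x_b, n_b, n_perms=10_000, seed=0):
    """Two-sided permutation p-value for a difference in callback rates."""
    rng = np.random.default_rng(seed)
    total_calls = x_w + x_b
    # All outcomes in one vector: 1 = callback, 0 = no callback.
    calls = np.repeat([1, 0], [total_calls, n_w + n_b - total_calls])
    observed = x_w / n_w - x_b / n_b
    extreme = 0
    for _ in range(n_perms):
        shuffled = rng.permutation(calls)
        # The first n_w shuffled outcomes play the role of the 'w' group.
        diff = shuffled[:n_w].mean() - shuffled[n_w:].mean()
        if abs(diff) >= abs(observed) - 1e-12:
            extreme += 1
    return extreme / n_perms

p_perm = permutation_p_value(235, 2435, 157, 2435)
print(f"Permutation p-value: {p_perm}")
```

With a z-test p-value this small, few or none of the shuffled differences are expected to be as large as the observed one. A result of 0 should be read as "fewer than 1 in `n_perms`", not as an exact zero.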


In [22]:
from statsmodels.stats.weightstats import ztest
z_test = ztest(data.call[data.race == 'w'],data.call[data.race == 'b'], alternative = 'two-sided')
print("Z-score is equal to : %6.3F  p-value equal to: %6.7F" % (abs(z_test[0]),z_test[1]))

Z-score is equal to :  4.115  p-value equal to: 0.0000388
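The small gap between the two z-scores (4.108 versus 4.115) comes from the standard error. The manual calculation uses the pooled proportion, while `ztest` treats the 0/1 values as numeric samples and pools the two within-group sample variances. A standard-library sketch, working from the counts above and assuming that description of `ztest` holds, reproduces both numbers:

```python
from math import sqrt

x_w, x_b, n = 235, 157, 2435  # callbacks per group and group size
p_w, p_b = x_w / n, x_b / n
diff = p_w - p_b

# Manual approach: one pooled proportion shared by both groups.
p_pool = (x_w + x_b) / (2 * n)
se_pooled_prop = sqrt(p_pool * (1 - p_pool) * (2 / n))

# Means-based approach: pool the sample variances (ddof=1) of each 0/1 group.
var_w = p_w * (1 - p_w) * n / (n - 1)
var_b = p_b * (1 - p_b) * n / (n - 1)
var_pooled = ((n - 1) * var_w + (n - 1) * var_b) / (2 * n - 2)
se_pooled_var = sqrt(var_pooled * (2 / n))

z_prop = diff / se_pooled_prop
z_var = diff / se_pooled_var
print(f"pooled proportion: z = {z_prop:.3f}, pooled variance: z = {z_var:.3f}")
```

Either version leads to the same conclusion here. The pooled-proportion form is the usual textbook z-test for two proportions.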


## Conclusion:
If the null hypothesis were true, there would be only a 0.00388% chance of observing a difference in callback rates at least as large as the one in this sample.
This is strong evidence that callback rates differ between resumes with white-sounding names and resumes with black-sounding names.
The p-value is far below the conventional 0.05 significance threshold, and the 95% confidence interval for the difference (0.016777 to 0.047288) excludes zero.