# Examining Racial Discrimination in the US Job Market

### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés to black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes when presented to the employer.

<div class="span5 alert alert-info">
### Exercises
You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.

Answer the following questions **in this notebook below and submit to your GitHub account**. 

   1. What test is appropriate for this problem? Does CLT apply?
   2. What are the null and alternate hypotheses?
   3. Compute margin of error, confidence interval, and p-value. Try using both the bootstrapping and the frequentist statistical approaches.
   4. Write a story describing the statistical significance in the context of the original problem.
   5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

You can include written notes in notebook cells using Markdown: 
   - In the control panel at the top, choose Cell > Cell Type > Markdown
   - Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet


#### Resources
+ Experiment information and data source: http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states
+ Scipy statistical methods: http://docs.scipy.org/doc/scipy/reference/stats.html 
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet
+ Formulas for the Bernoulli distribution: https://en.wikipedia.org/wiki/Bernoulli_distribution
</div>
****

In [43]:
import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
%matplotlib notebook

In [2]:
data = pd.read_stata('data/us_job_market_discrimination.dta')

In [3]:
# number of callbacks for white-sounding names
sum(data[data.race=='w'].call)

235.0

In [4]:
data.head()

Unnamed: 0,id,ad,education,ofjobs,yearsexp,honors,volunteer,military,empholes,occupspecific,...,compreq,orgreq,manuf,transcom,bankreal,trade,busservice,othservice,missind,ownership
0,b,1,4,2,6,0,0,0,1,17,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
1,b,1,3,3,6,0,1,1,0,316,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
2,b,1,4,1,6,0,0,0,0,19,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
3,b,1,3,4,6,0,1,0,1,313,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
4,b,1,3,3,22,0,0,0,0,313,...,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,Nonprofit


### Answer to Questions

#### Question 1 & 2

<div class="span5 alert alert-success">
<p>Your answers to Q1 and Q2 here</p>
</div>

Q1. What test is appropriate for this problem? Does CLT apply?
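<li> A two-sample z-test for the difference between two proportions is appropriate, because the outcome ('call') is binary and the two groups are compared on their callback rates.
<li> The Central Limit Theorem (CLT) applies because the following conditions are met: 
    <ul>
    <li> Randomization condition: identical résumés were randomly assigned black-sounding or white-sounding names.
    <li> Independence assumption: the samples are independent of each other.
    <li> Sample size condition: each group contains 2,435 résumés, far more than 30, and each group has well over 10 callbacks and 10 non-callbacks.
    </ul>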
    
Q2. What are the null and alternate hypotheses?
<li> Null hypothesis: the callback rate for black-sounding names (p1) equals the callback rate for white-sounding names (p2), i.e. p1 = p2
<li> Alternative hypothesis: the two callback rates differ, i.e. p1 != p2
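
The sample size condition in Q1 can be checked with a few lines of code. The sketch below is illustrative only: the helper name `normal_approx_ok` and the threshold of 10 successes and 10 failures per group are assumptions (a common rule of thumb), and the counts are the ones reported by the groupby table in Question 3.

```python
def normal_approx_ok(successes, n, min_count=10):
    """Return True if a group has at least `min_count` successes and failures.

    `min_count=10` is a rule-of-thumb threshold, not a value from the exercise.
    """
    failures = n - successes
    return successes >= min_count and failures >= min_count


# (callbacks, resumes) per group, taken from the groupby table in Question 3
groups = {'b': (157, 2435), 'w': (235, 2435)}

checks = {race: normal_approx_ok(calls, n) for race, (calls, n) in groups.items()}
print(checks)
```

Both groups pass comfortably, which supports using the normal approximation behind the z-test.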

In [6]:
w = data[data.race=='w']
b = data[data.race=='b']

In [7]:
# Your solution to Q3 here

#### Question 3: Compute margin of error, confidence interval, and p-value. Try using both the bootstrapping and the frequentist statistical approaches.

In [15]:
# calculate the proportion of callbacks for each sample group
df = data.loc[:,['race','call']].groupby('race').agg(['count','sum','mean'])
df

Unnamed: 0_level_0,call,call,call
Unnamed: 0_level_1,count,sum,mean
race,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
b,2435,157.0,0.064476
w,2435,235.0,0.096509


##### Bootstrapping approach

In [29]:
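def _bootstrap_replicate_2d(df, func):
    """Draw one bootstrap sample of rows from a DataFrame and apply func to it."""
    # use len(df) rather than the global `data` so the function works on any DataFrame
    bs_sample = df.iloc[np.random.randint(len(df), size=len(df))]
    return func(bs_sample)

In [36]:
def _draw_bs_reps(data, func, size=1):
    """Return an array of `size` bootstrap replicates of func(data)."""
    return np.array([_bootstrap_replicate_2d(data, func) for _ in range(size)])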

In [37]:
def _diff_bw_proportion(df):
    df2 = df.loc[:,['race','call']].groupby('race').agg(['count','sum','mean'])
    return (df2.loc['b',('call','mean')] - df2.loc['w',('call','mean')])

In [40]:
# bootstrapping: difference in callback proportions (black-sounding names - white-sounding names)
bs_replicates = _draw_bs_reps(data, _diff_bw_proportion, size=1000)

In [49]:
conf_int = np.percentile(bs_replicates,[2.5, 97.5])  # 95% confidence level
print('The confidence interval with 95% CL is {:.2f}% to {:.2f}%.'
      .format(conf_int[0]*100,conf_int[1]*100))

The confidence interval with 95% CL is -4.73% to -1.71%.
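
Question 3 also asks for a margin of error and a p-value. A minimal resampling sketch for both follows, under stated assumptions: the 0/1 callback arrays are rebuilt from the counts in the groupby table (157 and 235 callbacks out of 2,435 résumés each), each group is resampled separately (a variant of the row-wise bootstrap used above), the margin of error is taken as 1.96 bootstrap standard errors, and the p-value comes from a permutation test, which is a swapped-in technique rather than something computed in the original cells. The seed and the number of replicates are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed, only for reproducibility

# Rebuild the 0/1 callback arrays from the counts in the groupby table
n = 2435
calls_b = np.concatenate([np.ones(157), np.zeros(n - 157)])
calls_w = np.concatenate([np.ones(235), np.zeros(n - 235)])
observed_diff = calls_b.mean() - calls_w.mean()

n_reps = 5000

# Bootstrap: resample each group with replacement, record the difference
boot_diffs = np.empty(n_reps)
for i in range(n_reps):
    boot_b = rng.choice(calls_b, size=n)  # replace=True by default
    boot_w = rng.choice(calls_w, size=n)
    boot_diffs[i] = boot_b.mean() - boot_w.mean()

# Margin of error at the 95% level, assuming an approximately normal
# bootstrap distribution
margin_of_error = 1.96 * boot_diffs.std()

# Permutation test: under H0 the race labels are exchangeable, so shuffle
# the pooled outcomes and split them into two groups of size n
pooled = np.concatenate([calls_b, calls_w])
perm_diffs = np.empty(n_reps)
for i in range(n_reps):
    shuffled = rng.permutation(pooled)
    perm_diffs[i] = shuffled[:n].mean() - shuffled[n:].mean()

# Two-sided p-value: share of shuffles at least as extreme as the observed gap
p_value = np.mean(np.abs(perm_diffs) >= abs(observed_diff))

print('observed difference = {:.4f}'.format(observed_diff))
print('bootstrap margin of error = {:.4f}'.format(margin_of_error))
print('permutation p-value = {:.4f}'.format(p_value))
```

With only a few thousand shuffles the permutation p-value can be reported as exactly 0; that means no shuffle matched the observed gap, not that the true p-value is zero.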


##### Frequentist statistical approach: two-sample z-test

In [68]:
from statsmodels.stats.proportion import proportions_ztest  # Test for proportions based on normal (z) test
# from statsmodels.stats.proportion import proportion_confint  # confidence interval (for one sample only...)

In [55]:
# z-test and p-value
z_stat, p_val = proportions_ztest([df.loc['b',('call','sum')], df.loc['w',('call','sum')]], 
                                  [df.loc['b',('call','count')], df.loc['w',('call','count')]])
print('z statistics = {}, p value = {}'.format(z_stat,p_val))
if p_val < 0.05:
    print('Reject H0, callback rates are different between black sounding names and white sounding names.')
else:
    print('Cannot reject H0, no significant difference in callback rates was detected.')

z statistics = -4.108412152434346, p value = 3.983886837585077e-05
Reject H0, callback rates are different between black sounding names and white sounding names.
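
The z-test above reports a test statistic and a p-value but no margin of error or confidence interval. The following sketch fills that gap by hand, as an illustration rather than a definitive implementation. It assumes the usual textbook convention of an unpooled standard error for the interval and a pooled standard error for the test, and it uses `statistics.NormalDist` from the standard library in place of `scipy.stats` so that the block runs on its own. The counts come from the groupby table above.

```python
from math import sqrt
from statistics import NormalDist

# Counts from the groupby table above
n_b, calls_b = 2435, 157   # black-sounding names
n_w, calls_w = 2435, 235   # white-sounding names

p_b, p_w = calls_b / n_b, calls_w / n_w
diff = p_b - p_w

# Confidence interval: unpooled standard error of the difference
se_unpooled = sqrt(p_b * (1 - p_b) / n_b + p_w * (1 - p_w) / n_w)
z_crit = NormalDist().inv_cdf(0.975)           # about 1.96 for a 95% level
margin_of_error = z_crit * se_unpooled
ci_low, ci_high = diff - margin_of_error, diff + margin_of_error

# Hypothesis test: pooled standard error, because H0 assumes p_b == p_w
p_pool = (calls_b + calls_w) / (n_b + n_w)
se_pooled = sqrt(p_pool * (1 - p_pool) * (1 / n_b + 1 / n_w))
z_stat = diff / se_pooled
p_value = 2 * NormalDist().cdf(-abs(z_stat))   # two-sided

print('difference = {:.4f}'.format(diff))
print('margin of error = {:.4f}'.format(margin_of_error))
print('95% CI = ({:.4f}, {:.4f})'.format(ci_low, ci_high))
print('z = {:.4f}, p = {:.3g}'.format(z_stat, p_value))
```

The hand-computed z-statistic and p-value should agree with the `proportions_ztest` output, and the interval should be close to the bootstrap interval from the previous section.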


References:
* Comparing Two Proportions: https://onlinecourses.science.psu.edu/stat414/node/268/
* Two sample z-test in python: http://knowledgetack.com/python/statsmodels/two-sample-hypothesis-testing-in-python-with-statsmodels/
* One Sample Test of Proportions: http://sphweb.bumc.bu.edu/otlt/MPH-Modules/BS/SAS/SAS6-CategoricalData/SAS6-CategoricalData2.html
* ENH: confidence intervals for two proportions, difference, odds ratio and risk ratio #2605: https://github.com/statsmodels/statsmodels/issues/2605

<div class="span5 alert alert-success">
<p> Your answers to Q4 and Q5 here </p>
</div>

#### Question 4. Write a story describing the statistical significance in the context of the original problem.


Based on the bootstrap analysis above, the 95% confidence interval for the difference in callback rates (black-sounding names minus white-sounding names) runs from -4.73 to -1.71 percentage points. The interval does not include zero.

The z-statistic of -4.11 and the p-value of 3.98e-05 indicate that the callback rate for black-sounding names (about 6.4%) is significantly lower than that for white-sounding names (about 9.7%).

Because the names were randomly assigned to otherwise identical résumés, this gap can be attributed to the perceived race of the name. It therefore appears that there is a significant level of racial discrimination in the labor market. 

#### Question 5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

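This finding does not, however, mean that race/name is the most important factor in callback success. The analysis above tests only whether race has an effect on callbacks, not how that effect compares with the effect of other factors.

One way to gauge the importance of other features is to look at their correlations with the target ('call'). The Pearson correlation coefficient between every numeric variable and the target is calculated using the .corr DataFrame method, and the three most positively and most negatively correlated features are listed. Because 'race' is stored as the strings 'b' and 'w', it is not a numeric column and does not appear in this ranking. Features such as specialskills, honors, and empholes are also correlated with callbacks, which suggests that race/name is not the only factor that matters. All of these correlations are weak, however, and a correlation on its own does not establish how important a feature is. 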


In [71]:
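# Pearson correlation of every numeric column with 'call', sorted ascending
# (non-numeric columns such as 'race' are excluded explicitly)
correlations = data.select_dtypes(include='number').corr()['call'].sort_values()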

# Display correlations
print('Most Positive Correlations:\n', correlations.tail(4))
print('\nMost Negative Correlations:\n', correlations.head(3))

Most Positive Correlations:
 empholes         0.071888
honors           0.071951
specialskills    0.111074
call             1.000000
Name: call, dtype: float64

Most Negative Correlations:
 fracdropout        -0.056671
lmedhhinc_empzip   -0.049879
req                -0.041699
Name: call, dtype: float64
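
The ranking above leaves out 'race', since that column holds the strings 'b' and 'w'. One way to put race on the same scale, sketched here under stated assumptions, is to encode it as 0/1 and compute its Pearson correlation with 'call'. The arrays are rebuilt from the counts in the Question 3 groupby table rather than read from the data file, and the variable names `is_black` and `race_call_corr` are illustrative.

```python
import numpy as np

n = 2435  # resumes per group, from the Question 3 groupby table

# 1 = black-sounding name, 0 = white-sounding name (illustrative encoding)
is_black = np.concatenate([np.ones(n), np.zeros(n)])

# 0/1 callback outcomes in the same order as the labels above
call = np.concatenate([
    np.ones(157), np.zeros(n - 157),   # black-sounding names: 157 callbacks
    np.ones(235), np.zeros(n - 235),   # white-sounding names: 235 callbacks
])

# Pearson correlation between the 0/1 race indicator and the 0/1 outcome
race_call_corr = np.corrcoef(is_black, call)[0, 1]
print('correlation between race indicator and call: {:.4f}'.format(race_call_corr))
```

The result can be read against the table above: in magnitude it is comparable to the listed negative correlations and smaller than the one for specialskills. A fuller amendment to the analysis would model 'call' on race together with the other features, for example with a logistic regression, which is beyond this sketch.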

