# Examining Racial Discrimination in the US Job Market

### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés to black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes when presented to the employer.

### Exercises
You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.

Answer the following questions **in this notebook below and submit to your Github account**. 

   1. What test is appropriate for this problem? Does CLT apply?
   2. What are the null and alternate hypotheses?
   3. Compute margin of error, confidence interval, and p-value.
   4. Write a story describing the statistical significance in the context of the original problem.
   5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

You can include written notes in notebook cells using Markdown: 
   - In the control panel at the top, choose Cell > Cell Type > Markdown
   - Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet


#### Resources
+ Experiment information and data source: http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states
+ Scipy statistical methods: http://docs.scipy.org/doc/scipy/reference/stats.html 
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet
****

In [1]:
import pandas as pd
import numpy as np
from scipy import stats

In [2]:
df = pd.read_stata('us_job_market_discrimination.dta')
df.head()

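|  | id | ad | education | ofjobs | yearsexp | honors | volunteer | military | empholes | occupspecific | ... | compreq | orgreq | manuf | transcom | bankreal | trade | busservice | othservice | missind | ownership |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | b | 1 | 4 | 2 | 6 | 0 | 0 | 0 | 1 | 17 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |  |
| 1 | b | 1 | 3 | 3 | 6 | 0 | 1 | 1 | 0 | 316 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |  |
| 2 | b | 1 | 4 | 1 | 6 | 0 | 0 | 0 | 0 | 19 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |  |
| 3 | b | 1 | 3 | 4 | 6 | 0 | 1 | 0 | 1 | 313 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |  |
| 4 | b | 1 | 3 | 3 | 22 | 0 | 0 | 0 | 0 | 313 | ... | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | Nonprofit |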


In [3]:
print(df.columns)
print('-------------------------------------------------------------------')
print('Number of Observations:', df.shape[0])

## We will use only columns of interest
df = df[['race','call']]
print('-------------------------------------------------------------------')
print(df.head(10))

Index(['id', 'ad', 'education', 'ofjobs', 'yearsexp', 'honors', 'volunteer',
       'military', 'empholes', 'occupspecific', 'occupbroad', 'workinschool',
       'email', 'computerskills', 'specialskills', 'firstname', 'sex', 'race',
       'h', 'l', 'call', 'city', 'kind', 'adid', 'fracblack', 'fracwhite',
       'lmedhhinc', 'fracdropout', 'fraccolp', 'linc', 'col', 'expminreq',
       'schoolreq', 'eoe', 'parent_sales', 'parent_emp', 'branch_sales',
       'branch_emp', 'fed', 'fracblack_empzip', 'fracwhite_empzip',
       'lmedhhinc_empzip', 'fracdropout_empzip', 'fraccolp_empzip',
       'linc_empzip', 'manager', 'supervisor', 'secretary', 'offsupport',
       'salesrep', 'retailsales', 'req', 'expreq', 'comreq', 'educreq',
       'compreq', 'orgreq', 'manuf', 'transcom', 'bankreal', 'trade',
       'busservice', 'othservice', 'missind', 'ownership'],
      dtype='object')
-------------------------------------------------------------------
Number of Observations: 4870
------------

In [4]:
## Checking for 'NaN' entries
df.isnull().sum()

print('Number of missing values in the columns of interest is',df.isnull().sum().sum())

Number of missing values in the columns of interest is 0


## Problem
We want to know whether race affects the callbacks applicants receive. If we treat a callback as a success, then for the two races 'b' and 'w' we can compute the proportions of successes and failures. The question therefore reduces to a two-sample proportion test: is the difference in the proportion of successes (callbacks) between the two races statistically significant? We will use a two-sample proportion z-test to answer it.

### Checking Assumptions
1. Independent Random Samples: We are told that race is assigned randomly. We also assume that one résumé receiving a callback does not affect whether another résumé of the same race receives one, i.e., observations are independent within each sample.
2. Normality: For the sampling distribution of each success proportion to be approximately normal (by the CLT), each group needs more than 10 successes and more than 10 failures.  
Let's check this assumption.

In [5]:
### Race: 'b'
b_num_succ = sum(df[df.race=='b'].call)
b_num_trails = len(df[df.race == 'b'])
b_prop_succ = b_num_succ/b_num_trails
print('Race b callback (success) proportion is: ', b_prop_succ)

## Race: 'w'
w_num_succ = sum(df[df.race=='w'].call)
w_num_trails = len(df[df.race == 'w'])
w_prop_succ = w_num_succ/w_num_trails
print('Race w callback (success) proportion is: ', w_prop_succ)

Race b callback (success) proportion is:  0.064476386037
Race w callback (success) proportion is:  0.0965092402464


In [6]:
# Checking Normality assumptions

#Race: 'b'
b_test = dict({'Number_of_success': b_prop_succ * b_num_trails,
              'Number_of_failure' : (1 - b_prop_succ) * b_num_trails
              })
print('race b success and failure:')
print(b_test)

print('-------------------------------------------------------------------')
# Race: 'w'
w_test = dict({'Number_of_success': w_prop_succ * w_num_trails,
              'Number_of_failure' : (1 - w_prop_succ) * w_num_trails
              })
print('race w success and failure:')
print(w_test)

race b success and failure:
{'Number_of_success': 157.0, 'Number_of_failure': 2278.0}
-------------------------------------------------------------------
race w success and failure:
{'Number_of_success': 235.0, 'Number_of_failure': 2200.0}


## Hypothesis Formulation
We observe that the numbers of successes and failures in each group are greater than 10. Hence, each success proportion is approximately normally distributed, the requirements are met, and we can perform hypothesis testing.

We now formulate the null and alternate hypotheses.

$p_b$ = population success proportion for race 'b', estimated by the sample proportion $\hat{p}_b$  
$p_w$ = population success proportion for race 'w', estimated by the sample proportion $\hat{p}_w$

**Null Hypothesis:** $H_0: p_b - p_w = 0$

**Alternate Hypothesis:** $H_A: p_b - p_w \neq 0$


Let's set the **significance level ($\alpha$):** 0.05; for any p-value below the significance level we reject the null hypothesis. Since we are checking for any difference between the two proportions, we use a two-tailed test. If we instead wanted to check whether one race gets more callbacks than the other, we would perform a one-tailed test.

## Margin of error
The standard error (SE) of the difference in sample proportions is given by $SE = \sqrt{\frac{\hat{p}_b(1-\hat{p}_b)}{n_b} + \frac{\hat{p}_w(1-\hat{p}_w)}{n_w}}$

Since the sampling distribution is approximately normal, we use the z-statistic to compute the confidence interval. Hence, the margin of error is $Z_{\alpha/2} \times SE$. For a 95% confidence interval, the z-value is 1.96.
The confidence interval, subsequently, is $(\hat{p}_b - \hat{p}_w) \pm Z_{\alpha/2} \times SE$
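
The z-value of 1.96 is the 97.5th percentile of the standard normal distribution. As a minimal sketch, it can be recovered from $\alpha$ as shown below. This uses the standard library's `statistics.NormalDist` rather than the notebook's `scipy.stats` import, which is a choice made for this illustration and not part of the original analysis.

```python
from statistics import NormalDist

alpha = 0.05  # significance level used in this notebook

# Two-sided critical value: the (1 - alpha/2) quantile of the standard normal
z_crit = NormalDist().inv_cdf(1 - alpha / 2)

print("z critical value for a 95% CI:", round(z_crit, 2))
```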

In [7]:
## Calculating margin of error
z = 1.96
margin_error = z * np.sqrt(( b_prop_succ*(1-b_prop_succ) / 
                      b_num_trails) + (w_prop_succ*(1-w_prop_succ)/w_num_trails))

print("Margin of error = ", margin_error)
print('-------------------------------------------------------------------')
print('95% confidence interval for difference in proportion of success(call back) for races b and w is',
     round((b_prop_succ-w_prop_succ)-margin_error,4),'to',
      round((b_prop_succ-w_prop_succ)+margin_error,4))

Margin of error =  0.0152554063499
-------------------------------------------------------------------
95% confidence interval for difference in proportion of success(call back) for races b and w is -0.0473 to -0.0168


The 95% confidence interval (CI) for the difference in callback proportions (race b minus race w) lies entirely below 0, which indicates that the callback rate for race w is higher than for race b. Because the 95% CI does not cover 0, we expect the difference to be statistically significant at the 0.05 level; we confirm this with a hypothesis test.

In [8]:
## Hypothesis Testing two sample proportion
from statsmodels.stats.proportion import proportions_ztest as p_z
z_test_statistic, p_value = p_z([b_num_succ,w_num_succ],[b_num_trails,w_num_trails],
                                value=0, alternative='two-sided')
print('p_value obtained from test for difference of proportion of call back for races b and w is', p_value)

p_value obtained from test for difference of proportion of call back for races b and w is 3.98388683759e-05
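
As a cross-check, the same test can be reproduced by hand from the success and failure counts printed earlier (157 of 2435 for race b, 235 of 2435 for race w). The sketch below assumes the pooled-proportion standard error, which I believe is what `proportions_ztest` uses by default; the helper name `two_proportion_ztest` is hypothetical.

```python
from math import erfc, sqrt

def two_proportion_ztest(succ_1, n_1, succ_2, n_2):
    """Two-sided z-test for equality of two proportions (pooled SE)."""
    p_1, p_2 = succ_1 / n_1, succ_2 / n_2
    # Under H0 both groups share one proportion, so pool the counts
    p_pool = (succ_1 + succ_2) / (n_1 + n_2)
    se_pool = sqrt(p_pool * (1 - p_pool) * (1 / n_1 + 1 / n_2))
    z = (p_1 - p_2) / se_pool
    # Two-sided p-value: P(|Z| > |z|) for a standard normal Z
    p_value = erfc(abs(z) / sqrt(2))
    return z, p_value

z_manual, p_manual = two_proportion_ztest(157, 2435, 235, 2435)
print("z =", z_manual, " p-value =", p_manual)
```

The resulting p-value should agree with the `statsmodels` output above up to floating-point error.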


We observe that the p-value is less than the significance level ($\alpha$) of 0.05. Hence, we have evidence to reject the null hypothesis. This implies that there is a statistically significant difference in callback rates between white-sounding names and black-sounding names. This does not mean that race is the most important factor for getting callbacks, as here we are only testing for a difference in callbacks between the two races, without any other factors.

In reality, many other factors matter, such as a person's qualifications for the job, experience, and job fit. So, to amend the analysis, we would need to examine how callbacks relate to these other factors. The hypothesis test used here only checks our assumption about the difference between two proportions; it does not tell us how strongly callbacks depend on each variable.
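
One hedged way to start looking at other factors is to compare callback rates by race within each level of another résumé attribute. The sketch below assumes a DataFrame that still contains the other columns (the notebook's `df` was reduced to `race` and `call`, so the file would need to be re-read). The function name `callback_rate_by` and the variable `df_full` are hypothetical.

```python
import pandas as pd

def callback_rate_by(data: pd.DataFrame, factor: str) -> pd.DataFrame:
    """Callback rate per race within each level of `factor`, plus the w - b gap."""
    # Mean of the 0/1 'call' column is the callback rate for each (factor, race) cell
    rates = data.groupby([factor, 'race'])['call'].mean().unstack('race')
    rates['gap_w_minus_b'] = rates['w'] - rates['b']
    return rates

# Hypothetical usage, assuming the full dataset is re-read into df_full:
# df_full = pd.read_stata('us_job_market_discrimination.dta')
# print(callback_rate_by(df_full, 'education'))
```

This is only a descriptive breakdown; judging the relative importance of several factors at once would need a model that includes them jointly.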