# Examining Racial Discrimination in the US Job Market

**Submitted by: Pan Chen**

### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés to black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes when presented to the employer.

### Exercises
You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.

Answer the following questions **in this notebook below and submit to your Github account**. 

   1. What test is appropriate for this problem? Does CLT apply?
   2. What are the null and alternate hypotheses?
   3. Compute margin of error, confidence interval, and p-value.
   4. Write a story describing the statistical significance in the context of the original problem.
   5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

You can include written notes in notebook cells using Markdown: 
   - In the control panel at the top, choose Cell > Cell Type > Markdown
   - Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet


#### Resources
+ Experiment information and data source: http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states
+ Scipy statistical methods: http://docs.scipy.org/doc/scipy/reference/stats.html 
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet
****

In [1]:
import pandas as pd
import numpy as np
from scipy import stats

In [2]:
data = pd.read_stata('data/us_job_market_discrimination.dta')
# Keep only the columns needed for this analysis
df = data[['race', 'call']]


# 1. What test is appropriate for this problem? Does CLT apply?
A chi-square test for independence is the natural first choice for this problem: whether a résumé receives a callback is a categorical variable, and the test checks whether the distribution of one categorical variable differs across the levels of another.

However, the third question asks for a margin of error and a confidence interval, so I will use a two-tailed t-test for independent samples. I chose t rather than z because the standard deviation of the callback variable is unknown for both populations (black-sounding and white-sounding names). With samples this large, the two distributions give nearly identical results.

The CLT applies because the sample is large and the observations are independent and identically distributed within each group: each data point is independent of the others and drawn from the same probability distribution.
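
A common rule of thumb for treating a sample proportion as approximately normal is that each group contains at least 10 successes and 10 failures. The sketch below applies that check. The helper name `clt_ok`, the threshold of 10, and the counts (2435 résumés per group, with 157 and 235 callbacks) are assumptions for illustration: the counts are not printed in this notebook and were chosen because they are consistent with the confidence interval reported in section 3, so substitute values computed from `df`.

```python
def clt_ok(successes, n, threshold=10):
    """Rule-of-thumb check: at least `threshold` successes and failures."""
    failures = n - successes
    return successes >= threshold and failures >= threshold

# Assumed counts (not shown in the notebook): race -> (callbacks, resumes).
# With the real data they could come from
# df.groupby('race')['call'].agg(['sum', 'count'])
groups = {"b": (157, 2435), "w": (235, 2435)}

for race, (calls, n) in groups.items():
    rate = calls / n
    print("{}: callback rate = {:.4f}, rule satisfied: {}".format(
        race, rate, clt_ok(calls, n)))
```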

# 2. What are the null and alternate hypotheses?
**H0**: People with black-sounding names have the same chance of being called back as people with white-sounding names.

**HA**: People with black-sounding names do not have the same chance of being called back as people with white-sounding names.
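
In symbols, writing $p_b$ and $p_w$ for the true callback rates of black-sounding and white-sounding names (notation introduced here for illustration, not taken from the exercise), the hypotheses read:

$$H_0: p_b = p_w \qquad\qquad H_A: p_b \neq p_w$$

The alternative is two-sided, which is why the p-value in the next section doubles the tail probability.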



# 3. Compute margin of error, confidence interval, and p-value.
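Assume a 95% confidence level.

$s_1$ = standard deviation of the callback indicator for black-sounding names  
$s_2$ = standard deviation of the callback indicator for white-sounding names

Margin of error:

$$\text{ME} = t_{\alpha/2}\sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}$$

Lower and upper confidence limits:

$$\text{LL} = \bar{x}_1-\bar{x}_2-\text{ME} \qquad \text{UL} = \bar{x}_1-\bar{x}_2+\text{ME}$$

Test statistic:

$$t = \frac{\bar{x}_1-\bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1}+\dfrac{s_2^2}{n_2}}}$$

Two-sided p-value, where $T$ follows a t distribution:

$$p\text{-value} = 2\,P(T > |t|)$$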

In [3]:
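# Callback indicator (1 = call, 0 = no call) for each name group
black = df[df.race == "b"]['call']
white = df[df.race == "w"]['call']

n1 = len(black)
n2 = len(white)
xbar1 = np.mean(black)
xbar2 = np.mean(white)

# Standard deviations; note that np.std defaults to ddof=0 (population formula)
s1 = np.std(black)
s2 = np.std(white)

# Standard error of the difference in means (variances not pooled)
se = np.sqrt(s1**2/n1 + s2**2/n2)

# Degrees of freedom as n1+n2-2; Welch's test would use the
# Welch-Satterthwaite approximation, which matters little with samples this large
dfreedom = n1 + n2 - 2

# Critical value for a two-sided 95% interval, then the test statistic
t = stats.t.ppf(1-0.025, dfreedom)
test = (xbar1-xbar2)/se

# p-value = 2*P(T > |test statistic|)
pvalue = 2*(1-stats.t.cdf(abs(test), dfreedom))

# Margin of error and confidence limits
me = t*se
ll = xbar1-xbar2-me
ul = xbar1-xbar2+me

print('margin of error is: {}, confidence interval is: {}, p-value is: {}'.format(me,[ll,ul],pvalue))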

margin of error is: 0.015258797951013666, confidence interval is: [-0.047291652806074246, -0.016774056904046909], p-value is: 3.92587953657042e-05


In [62]:
# Shortcut: Welch's t-test (equal_var=False) directly from scipy
import scipy.stats as stats
stats.ttest_ind(df[df.race=="b"]['call'], df[df.race=="w"]['call'],equal_var=False)

Ttest_indResult(statistic=-4.1147052908617514, pvalue=3.9429415136459352e-05)
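
The small gap between the hand-computed p-value and the `ttest_ind` result is consistent with two details: `np.std` uses the population formula (`ddof=0`) by default, and Welch's test uses the Welch-Satterthwaite degrees of freedom rather than `n1+n2-2`. The sketch below recomputes the test from summary statistics with both adjustments and cross-checks it against `scipy.stats.ttest_ind_from_stats`. The helpers `sample_std` and `welch_from_counts` are hypothetical, and the counts (157 and 235 callbacks out of 2435 résumés per group) are assumptions that are not printed in this notebook; replace them with values computed from `df`.

```python
import numpy as np
from scipy import stats

def sample_std(k, n):
    """Sample standard deviation (ddof=1) of a 0/1 variable with k ones out of n."""
    p = k / n
    return np.sqrt(p * (1 - p) * n / (n - 1))

def welch_from_counts(k1, n1, k2, n2):
    """Welch's t-test for two 0/1 samples given success counts and sizes."""
    mean1, mean2 = k1 / n1, k2 / n2
    # Squared standard error contributed by each group
    se2_1 = sample_std(k1, n1) ** 2 / n1
    se2_2 = sample_std(k2, n2) ** 2 / n2
    t_stat = (mean1 - mean2) / np.sqrt(se2_1 + se2_2)
    # Welch-Satterthwaite degrees of freedom
    dof = (se2_1 + se2_2) ** 2 / (se2_1 ** 2 / (n1 - 1) + se2_2 ** 2 / (n2 - 1))
    # Two-sided p-value from the survival function
    p_value = 2 * stats.t.sf(abs(t_stat), dof)
    return t_stat, dof, p_value

# Assumed counts (not shown in the notebook)
k_b, n_b = 157, 2435   # black-sounding names: callbacks, resumes
k_w, n_w = 235, 2435   # white-sounding names: callbacks, resumes

t_stat, dof, p_value = welch_from_counts(k_b, n_b, k_w, n_w)
print("t = {:.4f}, dof = {:.1f}, p-value = {:.3g}".format(t_stat, dof, p_value))

# Cross-check against scipy's implementation working from summary statistics
check = stats.ttest_ind_from_stats(
    mean1=k_b / n_b, std1=sample_std(k_b, n_b), nobs1=n_b,
    mean2=k_w / n_w, std2=sample_std(k_w, n_w), nobs2=n_w,
    equal_var=False)
print(check)
```

Working from counts is possible here only because the outcome is binary, so the mean and standard deviation of each group are fully determined by the number of callbacks and the group size.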

In [64]:
#chi-square test
#from scipy.stats import chi2_contingency
#do an observed table
#tab = pd.crosstab(df.race, df.call, margins = True)
#tab.columns = ["no_call","call",'total']
#tab=tab.iloc[0:2,0:2]

#chi2_contingency(np.array(tab))
#output is chi^2, p-value, degree of freedom, and expected value table
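
The commented-out cell above outlines a chi-square test of independence. A runnable sketch follows, using an assumed 2x2 table in place of `pd.crosstab(df.race, df.call)`. The counts (157 and 235 callbacks out of 2435 résumés per group) are assumptions that are not printed in this notebook. Note that `chi2_contingency` applies Yates' continuity correction to 2x2 tables by default.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Assumed observed table: rows = race (b, w), columns = (no_call, call).
# With the real data: observed = pd.crosstab(df.race, df.call).values
observed = np.array([[2435 - 157, 157],
                     [2435 - 235, 235]])

# Returns the chi-square statistic, the p-value, the degrees of freedom,
# and the table of expected counts under independence
chi2, p_value, dof, expected = chi2_contingency(observed)

print("chi2 = {:.3f}, p-value = {:.3g}, dof = {}".format(chi2, p_value, dof))
print("expected counts under independence:")
print(expected)
```

Because the data form a 2x2 table, this test addresses the same question as the two-sample comparison of callback rates, so a similarly small p-value is to be expected.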

# 4. Write a story describing the statistical significance in the context of the original problem.
The p-value is about 3.9e-05, far below the 0.05 threshold, and the test statistic is negative, so we reject the null hypothesis. The 95% confidence interval for the difference in callback rates (black minus white), roughly [-0.047, -0.017], lies entirely below zero. In other words, résumés with black-sounding names were significantly less likely to receive a callback than the same résumés with white-sounding names.

# 5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?
It does not mean that race/name is the **most** important factor in callback success, because:

1. There is no validity check that the recruiters actually perceived the "white-sounding names" as belonging to white applicants, as the researchers expected.
2. Other factors may also affect the callback rate, so these would have to be analyzed and ranked alongside race/name to test which feature is the **most** important in callback success.

