# Examining Racial Discrimination in the US Job Market

### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés to black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes when presented to the employer.

<div class="span5 alert alert-info">
### Exercises
You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.

Answer the following questions **in this notebook below and submit to your Github account**. 

   1. What test is appropriate for this problem? Does CLT apply?
   2. What are the null and alternate hypotheses?
   3. Compute margin of error, confidence interval, and p-value.
   4. Write a story describing the statistical significance in the context of the original problem.
   5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

You can include written notes in notebook cells using Markdown: 
   - In the control panel at the top, choose Cell > Cell Type > Markdown
   - Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet


#### Resources
+ Experiment information and data source: http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states
+ Scipy statistical methods: http://docs.scipy.org/doc/scipy/reference/stats.html 
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet
</div>
****

In [3]:
import pandas as pd
import numpy as np
from scipy import stats

In [5]:
data = pd.io.stata.read_stata('us_job_market_discrimination.dta')
print(data.head(10))

  id ad  education  ofjobs  yearsexp  honors  volunteer  military  empholes  \
0  b  1          4       2         6       0          0         0         1   
1  b  1          3       3         6       0          1         1         0   
2  b  1          4       1         6       0          0         0         0   
3  b  1          3       4         6       0          1         0         1   
4  b  1          3       3        22       0          0         0         0   
5  b  1          4       2         6       1          0         0         0   
6  b  1          4       2         5       0          1         0         0   
7  b  1          3       4        21       0          1         0         1   
8  b  1          4       3         3       0          0         0         0   
9  b  1          4       2         6       0          1         0         0   

   occupspecific    ...      compreq  orgreq  manuf  transcom  bankreal trade  \
0             17    ...          1.0     0.0    1

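# Qn. 1 Appropriate test and CLT for the problem
There are two categories of race ('b' and 'w') to compare, and the outcome is binary (callback or no callback), so a two-sample z-test for the difference in proportions is appropriate. Each group contains far more than 30 résumés, and race is assigned to the résumés at random, so the CLT applies: the sampling distribution of the difference in callback rates is approximately normal.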

In [6]:
# finding out number of callbacks and no callbacks for black-sounding names
bpop_callback=sum(data[data.race=='b'].call)
print('bpop callback: ' + str(bpop_callback))
bpop_nocallback=sum(1 - data[data.race=='b'].call)
print('bpop no callback: ' + str(bpop_nocallback))

bpop callback: 157.0
bpop no callback: 2278.0


In [7]:
wpop_callback=sum(data[data.race=='w'].call)
print('whitepop callback: ' + str(wpop_callback))
wpop_nocallback=sum(1 - data[data.race=='w'].call)
print('wpop no callback: ' + str(wpop_nocallback))

whitepop callback: 235.0
wpop no callback: 2200.0
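
A common rule of thumb for applying the CLT to proportions is that each group should contain at least 10 successes and 10 failures. The sketch below checks that condition with the counts printed above; the helper name `clt_ok` and the threshold of 10 are illustrative assumptions, not part of the original analysis.

```python
# Counts copied from the notebook output above.
counts = {
    'b': {'callback': 157, 'no_callback': 2278},
    'w': {'callback': 235, 'no_callback': 2200},
}

def clt_ok(successes, failures, minimum=10):
    """Rule of thumb (an assumption here): at least `minimum` successes and failures."""
    return successes >= minimum and failures >= minimum

for race, c in counts.items():
    n = c['callback'] + c['no_callback']
    print(race, 'n =', n, 'CLT condition met:', clt_ok(c['callback'], c['no_callback']))
```

Both groups clear the threshold by a wide margin, which supports using a normal approximation for the callback rates.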


With this apparent difference in callbacks between the two groups (157 versus 235, out of 2,435 résumés each), we can proceed to test whether the difference is statistically significant.

# Qn. 2 The null and alternate hypotheses:

The null hypothesis is that the callback rate for white-sounding names equals the callback rate for black-sounding names, that is, race has no effect on callbacks. (The callback rate is the mean of the 0/1 'call' column within each group.) This case is ideally what we want. The alternate hypothesis is the opposite: the callback rate for white-sounding names is NOT equal to that for black-sounding names.
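
Written formally, with $p_w$ and $p_b$ denoting the true callback probabilities for white-sounding and black-sounding names (symbols introduced here for illustration), the two-sided hypotheses are:

```latex
H_0:\; p_w - p_b = 0
\qquad \text{versus} \qquad
H_A:\; p_w - p_b \neq 0
```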

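# Qn. 3 Computing the margin of error, confidence interval and p-value
We can test for an association between race and callbacks with a chi-square test on the 2×2 table of callback and no-callback counts for black-sounding and white-sounding names. The margin of error and confidence interval are then computed for the difference in callback proportions.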

In [8]:
bwlist = np.array([[bpop_callback,bpop_nocallback],[wpop_callback,wpop_nocallback]])
# we can use the function for chi square test with the argument of the array above
stats.chi2_contingency(bwlist)

(16.449028584189371, 4.9975783899632552e-05, 1, array([[  196.,  2239.],
        [  196.,  2239.]]))
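
The tuple above holds the chi-square statistic, the p-value, the degrees of freedom, and the expected counts. To see where the first two numbers come from, the statistic can be reproduced by hand. The sketch below assumes that `chi2_contingency` applied Yates' continuity correction, which is its default for a 2×2 table, and uses only the standard library; the function name `yates_chi2_2x2` is hypothetical.

```python
import math

def yates_chi2_2x2(table):
    """Chi-square statistic with Yates' correction, and its p-value, for a 2x2 table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            # Yates' correction: shrink each |observed - expected| by 0.5
            # (valid here because every deviation is larger than 0.5)
            chi2 += (abs(table[i][j] - expected) - 0.5) ** 2 / expected
    # With 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Rows: black-sounding, white-sounding; columns: callback, no callback
chi2, p_value = yates_chi2_2x2([[157, 2278], [235, 2200]])
print(chi2, p_value)
```

The expected count of 196 callbacks per group is simply the overall total of 392 callbacks split evenly across two groups of equal size.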

In [9]:
# total number of résumés in each group, needed for the proportions and the margin of error
all_bpopcall=bpop_callback +bpop_nocallback
all_wpopcall=wpop_callback +wpop_nocallback

In [10]:
#finding the proportion of callback
bcallprop=bpop_callback/all_bpopcall
wcallprop=wpop_callback/all_wpopcall
#we know that the white proportion is bigger
prop_diff =wcallprop-bcallprop
print(prop_diff)

0.0320328542094


In [11]:
# critical z value for a two-sided 95% confidence level, using the stats.norm.ppf function
critz =stats.norm.ppf(0.975)
print(critz)

1.95996398454


In [12]:
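# standard error of the difference in callback proportions (unpooled)
se = np.sqrt(((bcallprop*(1-bcallprop))/all_bpopcall) + ((wcallprop*(1-wcallprop))/all_wpopcall))
print(se)
# the margin of error is the critical z value times the standard error;
# the 95% confidence interval is the observed difference plus or minus that margin
moe = critz * se
low, upp = prop_diff - moe, prop_diff + moe
print('margin of error: %.4f' % moe)
print('95%% confidence interval: (%.4f, %.4f)' % (low, upp))

0.00778337058668
margin of error: 0.0153
95% confidence interval: (0.0168, 0.0473)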
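
The chi-square cell above supplies a p-value, but the z-test named in Qn. 1 can also produce one directly. The following is a minimal sketch of a two-sided two-proportion z-test, assuming the usual pooled standard error under the null hypothesis; the function name `two_proportion_ztest` is hypothetical, and the counts are those printed earlier in the notebook.

```python
import math

def two_proportion_ztest(x1, n1, x2, n2):
    """Two-sided z-test for H0: p1 == p2, using the pooled proportion."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|))
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# White-sounding names first, so a positive z means a higher white callback rate
z, p_value = two_proportion_ztest(235, 2435, 157, 2435)
print(z, p_value)
```

Without a continuity correction, the square of this z statistic equals the uncorrected chi-square statistic for the same 2×2 table, so the two tests lead to the same conclusion.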


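# Qn. 4
The results from Qn. 3 show a statistically significant difference in callback rates between black-sounding and white-sounding names in this dataset. The chi-square test gives a p-value of about 0.00005, far below 0.05, so we reject the null hypothesis. Résumés with white-sounding names received callbacks at a rate about 3.2 percentage points higher than résumés with black-sounding names, which indicates that applicants with black-sounding names receive unfair treatment in the job market.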

# Qn. 5
While the unfairness may be real, this analysis does not show that race is the most important factor in callback success. The dataset contains other factors, such as years of experience, honors, and other job specifications, that also enter into hiring decisions, and the test above compares only race. For future analysis, the comparison of black-sounding and white-sounding names could be repeated within groups of résumés that have the same amount of experience and education. If white-sounding names still receive priority within those groups, that strengthens the evidence of racial discrimination in employment.
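
The within-group comparison proposed above can be sketched as follows. This is a minimal illustration using only the standard library; the function name `callback_rates_by_stratum` is hypothetical, and the rows are made up for the example rather than taken from the dataset.

```python
from collections import defaultdict

def callback_rates_by_stratum(rows, stratum_key):
    """Return {stratum: {race: callback rate}} for rows with 'race' and 'call' keys."""
    totals = defaultdict(lambda: defaultdict(lambda: [0, 0]))  # [callbacks, resumes]
    for row in rows:
        cell = totals[row[stratum_key]][row['race']]
        cell[0] += row['call']
        cell[1] += 1
    return {
        stratum: {race: calls / n for race, (calls, n) in by_race.items()}
        for stratum, by_race in totals.items()
    }

# Made-up rows for illustration only; they are NOT taken from the dataset.
toy_rows = [
    {'race': 'b', 'education': 4, 'call': 0},
    {'race': 'b', 'education': 4, 'call': 1},
    {'race': 'w', 'education': 4, 'call': 1},
    {'race': 'w', 'education': 4, 'call': 1},
    {'race': 'b', 'education': 3, 'call': 0},
    {'race': 'w', 'education': 3, 'call': 0},
]
rates = callback_rates_by_stratum(toy_rows, 'education')
print(rates)
```

With the notebook's DataFrame, a comparable table could be obtained with `data.groupby(['education', 'race'])['call'].mean()`.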