# Examining Racial Discrimination in the US Job Market

### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés to black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes when presented to the employer.

### Exercises
You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.

Answer the following questions **in this notebook below and submit to your Github account**. 

   1. What test is appropriate for this problem? Does CLT apply?
   2. What are the null and alternate hypotheses?
   3. Compute margin of error, confidence interval, and p-value.
   4. Write a story describing the statistical significance in the context of the original problem.
   5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

You can include written notes in notebook cells using Markdown: 
   - In the control panel at the top, choose Cell > Cell Type > Markdown
   - Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet


#### Resources
+ Experiment information and data source: http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states
+ Scipy statistical methods: http://docs.scipy.org/doc/scipy/reference/stats.html 
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet
****

In [2]:
import pandas as pd
import numpy as np
from scipy import stats

data = pd.read_stata('data/us_job_market_discrimination.dta')

In [None]:
Answers, Work Below

1. The appropriate test for this problem is an independent two-sample test. A t-test is used here because we only have the sample
standard deviations to work with; with samples this large it gives nearly the same result as a z-test. An independent test is used
rather than a paired one because, although identical resumes are used, they are sent to different employers. The CLT does apply
because the observations are independent, the names are assigned randomly, and the total sample size of 4870 is less than 10% of
the total population of the United States.
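
The CLT argument above can be supplemented with the usual success/failure check for binary data. The sketch below is illustrative only: the helper name `clt_conditions_met` and the threshold of 10 are assumptions, and the group size of 2435 assumes the 4870 resumes are split evenly between the two groups.

```python
def clt_conditions_met(successes, n, threshold=10):
    """Return True when both successes and failures reach the threshold."""
    failures = n - successes
    return successes >= threshold and failures >= threshold

# 157 callbacks for black-sounding names comes from the notebook output.
# The group size (half of 4870) is an assumption about equal group sizes.
n_black = 4870 // 2
print(clt_conditions_met(157, n_black))
```

With 157 callbacks and well over 2000 non-callbacks in the group, both counts are far above the threshold.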

2. The goal of this analysis is to find out whether being associated with a black-sounding or a white-sounding name affects employer
call back rates. As such, the null hypothesis is that the mean call back rate is equal for both samples. The alternative hypothesis
is that the rates are not equal. With these hypotheses, a low p-value is statistically significant evidence against the null
hypothesis, which would mean that name-based racial discrimination does in fact occur when considering employer call back rates.
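
The two hypotheses can be written compactly. The symbols $p_b$ and $p_w$ are introduced here for illustration and denote the call back rates for black-sounding and white-sounding names.

```latex
H_0 : p_b = p_w
\qquad \text{vs.} \qquad
H_1 : p_b \neq p_w
```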

3. Margin of error: 0.0152552843854
    P-value: 3.92587953657e-05
    Confidence Interval Black Sounding: (0.049221381646604188, 0.079731389778643011)
    Confidence Interval White Sounding: (0.081254236501664759, 0.11176424463370359)
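
Because `call` is a 0/1 variable, the same quantities can also be approximated with a two-proportion z-test. This is a sketch of an alternative calculation, not the method used in the cells below. The function name `two_proportion_ztest` is hypothetical. The 157 black-sounding callbacks are printed in the notebook, while the group sizes of 2435 and the 235 white-sounding callbacks are inferred from the printed means and should be treated as assumptions.

```python
import numpy as np
from scipy import stats

def two_proportion_ztest(x1, n1, x2, n2, confidence=0.95):
    """Two-sided z-test for a difference in proportions, plus a CI for the difference."""
    p1, p2 = x1 / n1, x2 / n2
    diff = p1 - p2
    # Pooled proportion: under the null hypothesis both groups share one rate.
    p_pool = (x1 + x2) / (n1 + n2)
    se_pool = np.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = diff / se_pool
    p_value = 2 * stats.norm.sf(abs(z))
    # Unpooled standard error for the confidence interval of the difference.
    se_diff = np.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z_crit = stats.norm.ppf(1 - (1 - confidence) / 2)
    margin = z_crit * se_diff
    return z, p_value, margin, (diff - margin, diff + margin)

# Counts: 157 is from the notebook; 235 and the group sizes are inferred (assumptions).
z, p_value, margin, ci = two_proportion_ztest(157, 2435, 235, 2435)
print(z, p_value, margin, ci)
```

The interval here is for the difference in call back rates rather than for each group separately, so a conclusion of "different rates" corresponds to the interval excluding zero.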
        
4. Racial discrimination is a persistently pervasive issue. To shed light on how it affects job seekers, we analyze a dataset of
employer call back rates for resumes that were randomly assigned a black-sounding or white-sounding name. If statistically
significant discrimination exists, this design should reveal it. The null hypothesis for this test is that there is no difference
in the mean employer call back rate between the black and white sample groups. An independent two sample t-test was used because
each employer-resume-name data point is independent of the others. After finding the confidence intervals and p-value shown above,
we conclude that racial discrimination is in fact prevalent during a job search. The difference between white-sounding and
black-sounding names is so large that the confidence intervals for their employer call back rates do not even overlap.

5. Even though the analysis clearly shows that racial discrimination occurs in employer call back rates, many other factors may
also matter. This study specifically isolated race to test its importance, but variables such as level of education, skills
(computer and special), and military experience may also have a large impact on employer call back rate, and the current analysis
offers no way to compare them to tell which variable has the largest impact. One possible way to amend this is to make the
assignment of resumes less random, or to extend the analysis by looking only at white-sounding or only at black-sounding names in
order to focus on these other variables. For example, a two sample t-test could be performed on military, which is coded as 0/1.
It would also be interesting to redo the analysis using a paired t-test rather than an independent one. By pairing samples by
employer, we take as the null hypothesis that the pair differences are zero. This represents the idea that an employer who calls
back a black-sounding high-level resume would do the same for a white-sounding one, and vice versa.
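
The follow-up comparison on another 0/1 column could look roughly like this. It is a minimal sketch: `compare_callback_by` is a hypothetical helper, the toy DataFrame is made up purely for illustration (it is not the study data), and Welch's t-test is chosen here as one reasonable option rather than taken from the notebook.

```python
import pandas as pd
from scipy import stats

def compare_callback_by(df, column, outcome="call"):
    """Welch t-test of the outcome between the two levels (0/1) of a binary column."""
    group0 = df.loc[df[column] == 0, outcome]
    group1 = df.loc[df[column] == 1, outcome]
    t_stat, p_value = stats.ttest_ind(group1, group0, equal_var=False)
    return group0.mean(), group1.mean(), t_stat, p_value

# Toy data, invented for illustration only; it does not come from the dataset.
toy = pd.DataFrame({
    "military": [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
    "call":     [0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0],
})
rate0, rate1, t_stat, p_value = compare_callback_by(toy, "military")
print(rate0, rate1, t_stat, p_value)
```

On the real data the same call would be `compare_callback_by(data, "military")`, optionally after filtering `data` to a single value of `race`.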
    

In [8]:
# number of callbacks for black-sounding names, and total number of resumes
print(sum(data[data.race=='b'].call), len(data))


157.0 4870


In [41]:
# split the call column into black-sounding and white-sounding groups
black = data[data.race =='b'].call
white = data[data.race == 'w'].call
black.head()

2    0.0
3    0.0
7    0.0
8    0.0
9    0.0
Name: call, dtype: float32

In [56]:
# sample standard deviations (np.std defaults to ddof=0) and call back rates
std_black, std_white = np.std(black), np.std(white)
mean_black, mean_white = np.mean(black), np.mean(white)
# pooled standard deviation and standard error of the difference in means
std_pooled = np.sqrt(((len(black)-1)*std_black**2 + (len(white)-1)*std_white**2)/(len(black)+len(white)-2))
sempooled = np.sqrt(std_pooled**2*(1/len(black) + 1/len(white)))
# t statistic, 95% margin of error, and two-sided p-value (valid for either sign of t)
t = (mean_black - mean_white)/sempooled
error_margin = 1.96*sempooled
p = stats.t.sf(abs(t), df=(len(black)+len(white)-2))*2
# note: both intervals use the standard error of the difference as the scale for each group mean
conf_int_black = stats.norm.interval(0.95, loc=mean_black, scale=sempooled)
conf_int_white = stats.norm.interval(0.95, loc=mean_white, scale=sempooled)
print(t, error_margin, p, conf_int_black, conf_int_white)

-4.11558342208 0.0152552843854 3.92587953657e-05 (0.049221381646604188, 0.079731389778643011) (0.081254236501664759, 0.11176424463370359)
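
As a cross-check on the hand-rolled calculation, `scipy.stats.ttest_ind` can run the pooled-variance t-test directly. The sketch below rebuilds 0/1 arrays from counts instead of loading the data file. The helper `binary_sample` is hypothetical, and the counts other than 157 (group sizes of 2435 and 235 white-sounding callbacks) are inferred from the printed means, so treat them as assumptions. SciPy uses the ddof=1 variance, so the statistic differs very slightly from the value above.

```python
import numpy as np
from scipy import stats

def binary_sample(successes, n):
    """Build a 0/1 array with the given number of ones out of n entries."""
    return np.concatenate([np.ones(successes), np.zeros(n - successes)])

# 157 is printed in the notebook; 235 and the group sizes are inferred (assumptions).
black_calls = binary_sample(157, 2435)
white_calls = binary_sample(235, 2435)

# equal_var=True requests the pooled-variance (Student) t-test.
t_stat, p_value = stats.ttest_ind(black_calls, white_calls, equal_var=True)
print(t_stat, p_value)
```

With the actual Series from the earlier cell, the equivalent call would be `stats.ttest_ind(black, white, equal_var=True)`.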


In [57]:
data.columns

Index(['id', 'ad', 'education', 'ofjobs', 'yearsexp', 'honors', 'volunteer',
       'military', 'empholes', 'occupspecific', 'occupbroad', 'workinschool',
       'email', 'computerskills', 'specialskills', 'firstname', 'sex', 'race',
       'h', 'l', 'call', 'city', 'kind', 'adid', 'fracblack', 'fracwhite',
       'lmedhhinc', 'fracdropout', 'fraccolp', 'linc', 'col', 'expminreq',
       'schoolreq', 'eoe', 'parent_sales', 'parent_emp', 'branch_sales',
       'branch_emp', 'fed', 'fracblack_empzip', 'fracwhite_empzip',
       'lmedhhinc_empzip', 'fracdropout_empzip', 'fraccolp_empzip',
       'linc_empzip', 'manager', 'supervisor', 'secretary', 'offsupport',
       'salesrep', 'retailsales', 'req', 'expreq', 'comreq', 'educreq',
       'compreq', 'orgreq', 'manuf', 'transcom', 'bankreal', 'trade',
       'busservice', 'othservice', 'missind', 'ownership'],
      dtype='object')

In [51]:
data.military.unique()

array([0, 1], dtype=int64)