# Examining Racial Discrimination in the US Job Market

### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés to black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes when presented to the employer.

### Exercises
You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.

Answer the following questions **in this notebook below and submit to your GitHub account**.

   1. What test is appropriate for this problem? Does CLT apply?
   2. What are the null and alternate hypotheses?
   3. Compute margin of error, confidence interval, and p-value.
   4. Write a story describing the statistical significance in the context of the original problem.
   5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

You can include written notes in notebook cells using Markdown: 
   - In the control panel at the top, choose Cell > Cell Type > Markdown
   - Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet


#### Resources
+ Experiment information and data source: http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states
+ Scipy statistical methods: http://docs.scipy.org/doc/scipy/reference/stats.html 
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet


In [37]:
import pandas as pd
import numpy as np
import math
import scipy.stats as st
import statsmodels.stats as smstats
pd.options.display.max_columns = None



In [2]:
# pd.read_stata is the public entry point for the same Stata reader
data = pd.read_stata('data/us_job_market_discrimination.dta')

### Summary info about the dataset

In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4870 entries, 0 to 4869
Data columns (total 65 columns):
id                    4870 non-null object
ad                    4870 non-null object
education             4870 non-null int8
ofjobs                4870 non-null int8
yearsexp              4870 non-null int8
honors                4870 non-null int8
volunteer             4870 non-null int8
military              4870 non-null int8
empholes              4870 non-null int8
occupspecific         4870 non-null int16
occupbroad            4870 non-null int8
workinschool          4870 non-null int8
email                 4870 non-null int8
computerskills        4870 non-null int8
specialskills         4870 non-null int8
firstname             4870 non-null object
sex                   4870 non-null object
race                  4870 non-null object
h                     4870 non-null float32
l                     4870 non-null float32
call                  4870 non-null float32
city        

In [4]:
data.head()

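| | id | ad | education | ofjobs | yearsexp | honors | volunteer | military | empholes | occupspecific | ... | compreq | orgreq | manuf | transcom | bankreal | trade | busservice | othservice | missind | ownership |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | b | 1 | 4 | 2 | 6 | 0 | 0 | 0 | 1 | 17 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | |
| 1 | b | 1 | 3 | 3 | 6 | 0 | 1 | 1 | 0 | 316 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | |
| 2 | b | 1 | 4 | 1 | 6 | 0 | 0 | 0 | 0 | 19 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | |
| 3 | b | 1 | 3 | 4 | 6 | 0 | 1 | 0 | 1 | 313 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | |
| 4 | b | 1 | 3 | 3 | 22 | 0 | 0 | 0 | 0 | 313 | ... | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | Nonprofit |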


In [5]:
# number of callbacks for white-sounding names
data[data.race == 'w'].call.sum()

235.0

In [6]:
# number of callbacks for black-sounding names
data[data.race == 'b'].call.sum()

157.0

In [7]:
# number of callbacks by race and sex 
data.groupby(by=['race','sex'])['call'].sum().unstack()

| race | sex: f | sex: m |
|---|---|---|
| b | 125.0 | 32.0 |
| w | 184.0 | 51.0 |


In raw counts, female names receive more callbacks than male names in both the 'white' and 'black' groups. Let's check whether male and female candidates are evenly represented.

In [8]:
data.groupby(by=['race', 'sex']).size().unstack()

| race | sex: f | sex: m |
|---|---|---|
| b | 1886 | 549 |
| w | 1860 | 575 |


There were far more female candidates than male candidates in both the black and white categories, which largely explains the higher raw callback counts for females. Let's move on to a statistical analysis of the impact of race on the number of callbacks.
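
Because the four race/sex cells differ so much in size, callback rates are easier to compare than raw counts. The following is a minimal sketch that uses only the counts printed in the two tables above; the dictionary names are illustrative and not part of the original notebook.

```python
# Callback counts and resume counts copied from the two tables above.
callbacks = {("b", "f"): 125, ("b", "m"): 32, ("w", "f"): 184, ("w", "m"): 51}
resumes = {("b", "f"): 1886, ("b", "m"): 549, ("w", "f"): 1860, ("w", "m"): 575}

# Callback rate per (race, sex) cell = callbacks received / resumes sent.
rates = {group: callbacks[group] / resumes[group] for group in callbacks}

for (race, sex), rate in sorted(rates.items()):
    print(f"race={race} sex={sex} callback rate={rate:.4f}")
```

With the full dataset loaded, `data.groupby(['race', 'sex'])['call'].mean().unstack()` should give the same rates directly.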

### Does Race have a significant impact on the rate of callbacks for resumes?


#### 1. What test is appropriate for this problem? Does CLT apply?

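Each resume either receives a callback or does not, so we are comparing two proportions. The sample is large (4870 resumes, 2435 per group), race was assigned at random, and each group has well over 10 callbacks and 10 non-callbacks, so the CLT applies and the sampling distribution of the difference in proportions is approximately normal.
We will use a two-sample proportion test, carried out here as a chi-square test on the 2x2 table of race by callback. The quantity of interest is the difference in the proportion of callbacks between the 'white' group and the 'black' group, and the hypothesis test checks whether the observed difference is statistically significant.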
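
The usual rule of thumb for the normal approximation is that each group has at least about 10 successes and 10 failures. A minimal sketch of that check follows, using the callback counts from earlier cells and the group size of 2435 implied by the race/sex table. The threshold of 10 is a common convention, not a strict rule, and the helper name is hypothetical.

```python
# Callbacks per group (from the earlier cells) and resumes per group
# (1886 + 549 = 1860 + 575 = 2435, from the race/sex table).
group_size = 2435
calls = {"b": 157, "w": 235}

THRESHOLD = 10  # rule-of-thumb minimum count, an assumption


def normal_approx_ok(successes, n, threshold=THRESHOLD):
    """Return True when both successes and failures reach the threshold."""
    failures = n - successes
    return successes >= threshold and failures >= threshold


checks = {race: normal_approx_ok(k, group_size) for race, k in calls.items()}
print(checks)
```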

#### 2. What are the null and alternate hypotheses?

A widely held view is that Black applicants face discrimination in job applications. This study tests whether the data support that view. The null and alternative hypotheses are stated below.

* H0 -> There is no difference in the proportion of callbacks between white-sounding and black-sounding names, i.e., Pw - Pb = 0

* Ha -> White-sounding names receive a higher proportion of callbacks than black-sounding names, i.e., Pw - Pb > 0
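
A one-sided alternative of this form can be tested directly with a pooled two-proportion z statistic. This is a sketch under stated assumptions, not the notebook's own method: it uses the callback counts from the earlier cells, the group size of 2435 from the race/sex table, and only the Python standard library.

```python
import math
from statistics import NormalDist

# Counts from earlier cells; group sizes from the race/sex table.
calls_w, calls_b = 235, 157
n_w = n_b = 2435

p_w = calls_w / n_w
p_b = calls_b / n_b

# Under H0 both groups share one callback rate, so pool the two samples.
p_pool = (calls_w + calls_b) / (n_w + n_b)
se_pool = math.sqrt(p_pool * (1 - p_pool) * (1 / n_w + 1 / n_b))

# z statistic for Pw - Pb, and the upper-tail p-value for Ha: Pw - Pb > 0.
z = (p_w - p_b) / se_pool
p_one_sided = NormalDist().cdf(-z)

print(f"z = {z:.3f}, one-sided p-value = {p_one_sided:.2e}")
```

The square of this z statistic equals the chi-square statistic of the 2x2 table computed without a continuity correction, which is why the two tests are often treated as interchangeable.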

#### 3. Compute margin of error, confidence interval, and p-value.

In [9]:
tab = pd.crosstab(data.race, data.call)
# reindex_axis was removed from pandas; reindex(columns=...) puts the callback column (1.0) first
tab = tab.reindex(columns=[1.0, 0.0])
tab

| race | call: 1.0 | call: 0.0 |
|---|---|---|
| b | 157 | 2278 |
| w | 235 | 2200 |


In [22]:
chi2, p, dof, exp = st.chi2_contingency(tab)
print( "chi2 = %f \n p-value = %f \n dof = %d \n Expected = %s " %(chi2, p, dof, exp))

chi2 = 16.449029 
 p-value = 0.000050 
 dof = 1 
 Expected = [[  196.  2239.]
 [  196.  2239.]] 


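The p-value (about 0.00005) is far below the conventional 0.05 threshold, so we reject the null hypothesis. Note that `chi2_contingency` applies Yates' continuity correction to 2x2 tables by default and that its p-value is two-sided; for the one-sided alternative Pw - Pb > 0, the p-value would be roughly half as large.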
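
The reported statistic can be reproduced by hand, which makes the role of the expected counts and the continuity correction explicit. The sketch below assumes the 2x2 table shown above and uses only the standard library. For one degree of freedom the chi-square upper-tail probability equals `erfc(sqrt(x / 2))`.

```python
import math

# Observed 2x2 table: rows are race (b, w), columns are (callback, no callback).
observed = [[157, 2278], [235, 2200]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

# Expected count under independence = row total * column total / grand total.
expected = [[r * c / total for c in col_totals] for r in row_totals]

# Yates' continuity correction shrinks each |observed - expected| by 0.5.
chi2 = sum(
    (abs(o - e) - 0.5) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

# Upper-tail probability of a chi-square variable with 1 degree of freedom.
p_value = math.erfc(math.sqrt(chi2 / 2))

print(expected)
print(f"chi2 = {chi2:.6f}, p-value = {p_value:.6f}")
```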

Let's calculate the margin of error and confidence interval 

In [15]:
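# This call raises an AttributeError: `import statsmodels.stats as smstats` does not load the
# `proportion` submodule, which has to be imported explicitly (import statsmodels.stats.proportion).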
smstats.proportion.proportion_confint([157, 235], [2435, 2435])

AttributeError: module 'statsmodels.stats' has no attribute 'proportion'

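 Using the confidence interval from the R command below. Note that the call passes n = 4870 (the full sample) for both groups, although each group contains 2435 resumes. As a result, the sample estimates it reports (0.048 and 0.032) are half the actual callback rates (235/2435 ≈ 0.0965 and 157/2435 ≈ 0.0645). The interval is also one-sided because of alternative = 'g'.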
 
 prop.test(x = c(235, 157), n = c(4870, 4870), alternative= 'g')

	2-sample test for equality of proportions with continuity correction

data:  c(235, 157) out of c(4870, 4870)
X-squared = 15.759, df = 1, p-value = 3.597e-05
alternative hypothesis: greater
95 percent confidence interval:
 0.009265324 1.000000000
sample estimates:
    prop 1     prop 2 
0.04825462 0.03223819 

In [24]:
# Confidence interval for the difference in proportions
ci = (0.009265324 , 1.0)
ci

(0.009265324, 1.0)

In [25]:
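# margin of error, taken here as half the interval width; the interval above is one-sided
# (upper bound fixed at 1.0), so this value is not a usable margin of error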
lower , upper = ci
moe = (upper - lower) / 2
moe

0.495367338
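
Since each group holds 2435 resumes (see the race/sex table), a two-sided 95% interval for Pw - Pb gives a more interpretable margin of error. The following is a minimal sketch using the unpooled (Wald) standard error and only the standard library. It is an alternative calculation under these assumptions, not a reproduction of the R output above.

```python
import math
from statistics import NormalDist

# Callback counts from earlier cells; 2435 resumes per group.
calls_w, calls_b = 235, 157
n_w = n_b = 2435

p_w = calls_w / n_w
p_b = calls_b / n_b
diff = p_w - p_b

# Unpooled standard error of the difference between two proportions.
se = math.sqrt(p_w * (1 - p_w) / n_w + p_b * (1 - p_b) / n_b)

# Two-sided 95% interval: critical value is the 97.5th normal percentile.
z_crit = NormalDist().inv_cdf(0.975)
moe = z_crit * se
ci = (diff - moe, diff + moe)

print(f"difference = {diff:.4f}")
print(f"margin of error = {moe:.4f}")
print(f"95% CI = ({ci[0]:.4f}, {ci[1]:.4f})")
```

An interval that lies entirely above zero is consistent with rejecting the null hypothesis of equal callback rates.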

#### 4. Write a story describing the statistical significance in the context of the original problem.
#### 5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

The p-value of the two-sample proportion test indicates that the observed difference in callback success between white-sounding names and black-sounding names is statistically significant.

The confidence interval for the difference in proportions, Pw - Pb, lies entirely above zero, suggesting that white-sounding names tend to get more callbacks.

However, other factors could be contributing to this inequality.
If all other attributes were identical and race was the only difference between the samples, we could attribute the higher callback rate for white-sounding names to race. Even then, the test would not show that race is the most important factor, because it does not compare race against any other attribute.

It is therefore worthwhile to study how other attributes compare between the two samples before drawing a final conclusion. Below is a quick initial comparison of other attributes between the two groups.


In [31]:
#How do the two groups vary in average years of experience

data.groupby(by = 'race')['yearsexp'].mean()

race
b    7.829569
w    7.856263
Name: yearsexp, dtype: float64

As expected under random assignment of names, the values are very close.

In [32]:
#How do the two groups vary in education

data.groupby(by = 'race')['education'].mean()

race
b    3.616016
w    3.620945
Name: education, dtype: float64

The values are very close again

In [38]:

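# group means of every numeric column; numeric_only avoids errors on string columns in recent pandas
crosstab = data.groupby(by='race').mean(numeric_only=True)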
crosstab

race,education,ofjobs,yearsexp,honors,volunteer,military,empholes,occupspecific,occupbroad,workinschool,email,computerskills,specialskills,h,l,call,adid,fracblack,fracwhite,lmedhhinc,fracdropout,fraccolp,linc,col,eoe,parent_sales,parent_emp,branch_sales,branch_emp,fed,fracblack_empzip,fracwhite_empzip,lmedhhinc_empzip,fracdropout_empzip,fraccolp_empzip,linc_empzip,manager,supervisor,secretary,offsupport,salesrep,retailsales,req,expreq,comreq,educreq,compreq,orgreq,manuf,transcom,bankreal,trade,busservice,othservice,missind
b,3.616016,3.658316,7.829569,0.051335,0.414374,0.101848,0.445996,216.744969,3.487885,0.560986,0.479671,0.832444,0.32731,0.502259,0.497741,0.064476,651.777832,0.313214,0.540329,10.143023,0.185319,0.21264,9.547022,0.722793,0.29117,587.686462,2287.05127,196.050659,755.416992,0.114765,0.079096,0.843762,10.65568,0.101692,0.333873,10.031505,0.151951,0.077207,0.33306,0.118686,0.151129,0.167967,0.787269,0.435318,0.124846,0.106776,0.437372,0.07269,0.082957,0.03039,0.08501,0.213963,0.267762,0.154825,0.165092
w,3.620945,3.664476,7.856263,0.054209,0.408624,0.092402,0.450103,214.530595,3.475154,0.558111,0.47885,0.808624,0.330185,0.502259,0.497741,0.096509,651.777832,0.308439,0.545211,10.151353,0.186026,0.214998,9.554592,0.716222,0.29117,587.686462,2287.05127,196.050659,755.416992,0.114765,0.079096,0.843762,10.65568,0.101692,0.333873,10.031505,0.152361,0.077207,0.332649,0.118686,0.151129,0.167967,0.787269,0.435318,0.124846,0.106776,0.436961,0.07269,0.082957,0.03039,0.08501,0.213963,0.267762,0.154825,0.165092


This quick comparison shows that the other attributes are closely matched between the two groups, as expected given the random assignment. For a more detailed analysis, we could run a statistical test on each attribute to check for differences between the groups.

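So the conclusion is that the race signaled by a name has a statistically significant effect on callback rates for job applications in the United States. That makes race an important factor, though this analysis alone does not show that it is the most important one.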