# Examining Racial Discrimination in the US Job Market

### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés to black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes when presented to the employer.

<div class="span5 alert alert-info">
### Exercises
You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.

Answer the following questions **in this notebook below and submit to your Github account**. 

   1. What test is appropriate for this problem? Does CLT apply?
   2. What are the null and alternate hypotheses?
   3. Compute margin of error, confidence interval, and p-value.
   4. Write a story describing the statistical significance in the context of the original problem.
   5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

You can include written notes in notebook cells using Markdown: 
   - In the control panel at the top, choose Cell > Cell Type > Markdown
   - Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet


#### Resources
+ Experiment information and data source: http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states
+ Scipy statistical methods: http://docs.scipy.org/doc/scipy/reference/stats.html 
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet
</div>
****

In [1]:
import pandas as pd
import numpy as np
from scipy import stats

In [2]:
data = pd.read_stata('C:/MyBriefCase/SpringBoard/CapstoneProject/Examine Racial Discrimination/racial_disc/data/us_job_market_discrimination.dta')
len(data)

4870

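1. The data splits into two independent groups on the 'race' attribute (black-sounding or white-sounding names). Within each group, the 'call' variable follows a Bernoulli distribution: it takes the value 1 with probability 'p' and 0 with probability 'q' (1 - p). Each group holds 2,435 resumes, with far more than 10 callbacks and 10 non-callbacks, so the CLT applies and a two-sample z-test for the difference in proportions is appropriate.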
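As a quick check of the Bernoulli claim, the sketch below uses a small made-up array of 0/1 outcomes (hypothetical values, not taken from the dataset) to show that the sample proportion is the mean and that `p * (1 - p)` matches the variance of the 0/1 data.

```python
import numpy as np

# Hypothetical 0/1 callback outcomes, used only to illustrate the formulas.
calls = np.array([1, 0, 0, 0, 1, 0, 0, 0, 0, 0])

p_hat = calls.mean()               # sample proportion of 1s (callback rate)
var_formula = p_hat * (1 - p_hat)  # Bernoulli variance p * q
var_numpy = calls.var()            # variance of the 0/1 data with ddof=0

print(p_hat, var_formula, var_numpy)
```

The two variance figures agree because, for 0/1 data, the mean of the squares equals the mean itself.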

##### The Sample Mean and Variance of the 'Black' group

In [3]:
dataBlack = data[data.race == 'b']
dataBlack.head()

Unnamed: 0,id,ad,education,ofjobs,yearsexp,honors,volunteer,military,empholes,occupspecific,...,compreq,orgreq,manuf,transcom,bankreal,trade,busservice,othservice,missind,ownership
2,b,1,4,1,6,0,0,0,0,19,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
3,b,1,3,4,6,0,1,0,1,313,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
7,b,1,3,4,21,0,1,0,1,313,...,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,Nonprofit
8,b,1,4,3,3,0,0,0,0,316,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,Private
9,b,1,4,2,6,0,1,0,0,263,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,Private


The number of callbacks for black-sounding names divided by the sample size is the sample mean of the distribution, i.e. the callback rate.

In [4]:
callBack = sum(data[data.race == 'b'].call)
SizeB = len(dataBlack)
SMeanB = callBack / SizeB
VarB = SMeanB * (1 - SMeanB)

SMeanB, VarB, SizeB

(0.064476386036960986, 0.060319181680573764, 2435)

##### The Sample Mean and Variance of the 'White' group

In [5]:
dataWhite = data[data.race == 'w']
dataWhite.head()

Unnamed: 0,id,ad,education,ofjobs,yearsexp,honors,volunteer,military,empholes,occupspecific,...,compreq,orgreq,manuf,transcom,bankreal,trade,busservice,othservice,missind,ownership
0,b,1,4,2,6,0,0,0,1,17,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
1,b,1,3,3,6,0,1,1,0,316,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
4,b,1,3,3,22,0,0,0,0,313,...,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,Nonprofit
5,b,1,4,2,6,1,0,0,0,266,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,Private
6,b,1,4,2,5,0,1,0,0,13,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,Private


The number of callbacks for white-sounding names divided by the sample size is the sample mean of the distribution, i.e. the callback rate.

In [6]:
callBack = sum(data[data.race == 'w'].call)
SizeW = len(dataWhite)
SMeanW = callBack / SizeW  # divide by the white group's own sample size
VarW = SMeanW * (1 - SMeanW)

SMeanW, VarW, SizeW

(0.096509240246406572, 0.087195206793467941, 2435)
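The per-group rate and variance can also be computed in one pass with a `groupby`. The sketch below is only an illustration: `toy` is a hypothetical miniature frame with made-up rows that borrows the 'race' and 'call' column names from the dataset.

```python
import pandas as pd

# Hypothetical miniature of the dataset: same column names, made-up rows.
toy = pd.DataFrame({
    "race": ["b", "b", "b", "b", "w", "w", "w", "w"],
    "call": [0, 0, 0, 1, 0, 1, 0, 1],
})

# Callback rate (mean of the 0/1 column) and group size per race value.
summary = toy.groupby("race")["call"].agg(["mean", "count"])

# Bernoulli variance p * (1 - p) for each group.
summary["var"] = summary["mean"] * (1 - summary["mean"])

print(summary)
```

Computing both groups from the same expression avoids copy-paste slips such as dividing one group's count by the other group's size.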

##### The difference in means between the 2 groups

In [7]:
MeanD = SMeanW - SMeanB
MeanD

0.032032854209445585

##### The Standard Error of the difference in means (SED) between the White and Black groups

Each group's mean has a sampling distribution with its own Standard Error (SE). The SED is the square root of the sum of the squared SE of the White group and the squared SE of the Black group.

In [8]:
SED = np.sqrt(VarW/SizeW + VarB/SizeB)
SED

0.0077833705866767544
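The same calculation can be wrapped in a small helper. This is a minimal sketch: the function name `se_diff_proportions` is hypothetical, and the callback counts 235 and 157 are assumptions derived by multiplying the rates printed above by the group size of 2,435.

```python
import math

def se_diff_proportions(p1, n1, p2, n2):
    """Unpooled standard error of the difference between two sample proportions."""
    return math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

n = 2435            # resumes per group, as reported in the notebook
p_w = 235 / n       # white-sounding callback rate (count derived from the rate)
p_b = 157 / n       # black-sounding callback rate (count derived from the rate)

se = se_diff_proportions(p_w, n, p_b, n)
print(se)
```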

To build a 95% confidence interval we first need the Margin of Error. From the z table, the two-tailed critical z value at the 95% confidence level is 1.96.

##### The Margin of Error (MOE) of the difference in means between the 2 groups

In [9]:
zCVal = 1.96
MOED = zCVal * SED
MOED

0.015255406349886438
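Instead of reading 1.96 from a table, the critical value can be obtained from the standard normal quantile function in SciPy. The sketch below is illustrative: the helper name `z_critical` is hypothetical, and the standard error is copied from the output above.

```python
from scipy import stats

def z_critical(confidence):
    """Two-tailed critical z value for a given confidence level."""
    alpha = 1 - confidence
    return stats.norm.ppf(1 - alpha / 2)

sed = 0.0077833705866767544   # standard error of the difference, from the notebook output
z95 = z_critical(0.95)
moe = z95 * sed

for level in (0.90, 0.95, 0.99):
    print(level, round(z_critical(level), 3))
print(moe)
```

The unrounded 95% critical value differs from 1.96 only in the fifth decimal place, so the margin of error is essentially unchanged.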

##### The 95% Confidence Interval for the difference in means between the 2 groups
The lower end of the Confidence Interval (CIL) is the difference minus the Margin of Error

In [10]:
CIL = MeanD - MOED  # subtract the margin of error, not the bare standard error
round(CIL, 4)

0.0168

The upper end of the Confidence Interval (CIU) is the difference plus the Margin of Error

In [11]:
CIU = MeanD + MOED
round(CIU, 4)

0.0473

With 95% confidence, we can say that the callback rate for white-sounding names exceeds the callback rate for black-sounding names by between 1.7 and 4.7 percentage points, based on the sample data. The interval does not contain zero.
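The interval construction can be sketched end to end as a single function. The name `diff_proportion_ci` is hypothetical, and the callback counts 235 and 157 are assumptions derived from the rates and group sizes printed earlier in the notebook.

```python
import math
from scipy import stats

def diff_proportion_ci(x1, n1, x2, n2, confidence=0.95):
    """Confidence interval for p1 - p2 using the unpooled standard error."""
    p1, p2 = x1 / n1, x2 / n2
    diff = p1 - p2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z_crit = stats.norm.ppf(1 - (1 - confidence) / 2)
    moe = z_crit * se              # margin of error = critical value * standard error
    return diff - moe, diff + moe

# White-sounding group first, black-sounding group second.
low, high = diff_proportion_ci(235, 2435, 157, 2435)
print(low, high)
```

The interval is the point estimate plus or minus the margin of error, so its half-width is about 1.96 standard errors rather than one.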

### Hypothesis Testing with 1% Significance Level

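##### Null Hypothesis (H0): The mean callback rate of the Black group (MUB) equals the mean callback rate of the White group (MUW). MUB = MUW

##### Alternative Hypothesis (H1): The mean callback rate of the Black group (MUB) differs from the mean callback rate of the White group (MUW). MUB != MUW

Assume H0 is true, so (MUW - MUB) = 0. The test statistic is the observed difference divided by its standard error. We don't know the SD of the population, so it is estimated from the samples; with 2,435 resumes per group, the t distribution is practically identical to the standard normal.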

In [12]:
tval = (MeanD - 0)/SED
tval

4.1155504357300003

This means the observed difference in sample means (0.032032854209445585) is 4.1155 standard errors away from its expected value under the null hypothesis (H0), i.e. zero (0). To reject H0, a result at least this extreme must be less probable than the significance level (alpha) of 1%.

This is a two-tailed test: an extreme value either far above or far below the mean allows us to reject H0.
The P-value is the probability that the t-score is less than -4.1155 or greater than 4.1155.

Because the sampling distribution of the sample mean is approximately normal, we have to find P(t < -4.1155) and P(t > 4.1155).

###### Thus, the P-value 

In [13]:
PL = 0.0001  # rounded upper bound for P(t < -4.1155)
PR = 0.0001  # rounded upper bound for P(t > 4.1155)
pval = PL + PR
pval

0.0002
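The tail probabilities above are hard-coded. As a sketch under the normal approximation, they can be computed with `scipy.stats.norm.sf` (the survival function, 1 - CDF); the test statistic below is copied from the earlier output.

```python
from scipy import stats

z = 4.1155504357300003              # test statistic from the notebook output

p_left = stats.norm.cdf(-abs(z))    # P(Z < -|z|)
p_right = stats.norm.sf(abs(z))     # P(Z > |z|)
p_two_sided = p_left + p_right      # two-tailed P-value

alpha = 0.01
print(p_two_sided, p_two_sided < alpha)
```

The computed value is smaller than the rounded 0.0002 used above, so the decision at the 1% level is the same.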

Since the P-value (0.0002) is less than the significance level alpha (0.01), we reject H0.
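A common variant of this test, which is not the calculation used in this notebook, pools the two groups to estimate the standard error under H0, since H0 assumes a single shared callback rate. The sketch below uses the hypothetical name `pooled_two_proportion_ztest` and callback counts (235 and 157) derived from the rates and group sizes reported above.

```python
import math
from scipy import stats

def pooled_two_proportion_ztest(x1, n1, x2, n2):
    """Two-sided z-test for p1 == p2 using the pooled proportion under H0."""
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)   # shared rate assumed by H0
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * stats.norm.sf(abs(z))
    return z, p_value

z_pooled, p_pooled = pooled_two_proportion_ztest(235, 2435, 157, 2435)
print(z_pooled, p_pooled)
```

With equal group sizes the pooled and unpooled standard errors are very close, so both versions lead to rejecting H0 at the 1% level.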

#### Conclusion
We cannot be 100% certain, but the data statistically support a difference between the proportion of callbacks for resumes with white-sounding names and the proportion of callbacks for resumes with black-sounding names.

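The results of this study indicate that, all other things being equal, race is still a factor in the American labor market. Because the names were assigned to resumes at random, the gap can be attributed to the perceived race of the applicant: on average, an applicant's race affects his or her employment prospects. Resumes with white-sounding names received more callbacks than those with black-sounding names, and the difference is statistically significant at the 1% level. This analysis compares callback rates by name only; it does not show that race is the most important factor in callback success, because the other resume attributes in the dataset were not examined.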