# Examining Racial Discrimination in the US Job Market

### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés to black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes when presented to the employer.

<div class="span5 alert alert-info">
### Exercises
You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.

Answer the following questions **in this notebook below and submit to your GitHub account**.

   1. What test is appropriate for this problem? Does CLT apply?
   2. What are the null and alternate hypotheses?
   3. Compute margin of error, confidence interval, and p-value.
   4. Write a story describing the statistical significance in the context of the original problem.
   5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

You can include written notes in notebook cells using Markdown: 
   - In the control panel at the top, choose Cell > Cell Type > Markdown
   - Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet


#### Resources
+ Experiment information and data source: http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states
+ Scipy statistical methods: http://docs.scipy.org/doc/scipy/reference/stats.html 
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet
</div>
****

In [1]:
import pandas as pd
import numpy as np
from scipy import stats

In [2]:
# load the Stata file through the public pandas entry point
data = pd.read_stata('data/us_job_market_discrimination.dta')

In [3]:
# number of callbacks for black-sounding names
sum(data[data.race=='b'].call)

157.0

In [26]:
df = data.describe()
df.to_csv('DescriptionData.csv')

In [10]:
# number of callbacks for white-sounding names
sum(data[data.race=='w'].call)

235.0

In [40]:
data[data.race=='b'].call.describe()

count    2435.000000
mean        0.064476
std         0.245649
min         0.000000
25%         0.000000
50%         0.000000
75%         0.000000
max         1.000000
Name: call, dtype: float64

In [39]:
data[data.race=='w'].call.describe()

count    2435.000000
mean        0.096509
std         0.295346
min         0.000000
25%         0.000000
50%         0.000000
75%         0.000000
max         1.000000
Name: call, dtype: float64

In [16]:
sum(data.race=='w')

2435

In [17]:
sum(data.race=='b')

2435
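The counts above were gathered one cell at a time. As a hedged sketch, the same summary can be produced in one step with a `groupby`. The frame `demo` below is a stand-in rebuilt from the counts reported above (it is not the original file), and the output column names `resumes`, `callbacks`, and `rate` are illustrative choices.

```python
import numpy as np
import pandas as pd

# Stand-in frame rebuilt from the counts reported above (157 and 235 callbacks
# out of 2435 resumes per group). With the real file loaded, use `data` instead.
demo = pd.DataFrame({
    "race": ["b"] * 2435 + ["w"] * 2435,
    "call": np.r_[np.ones(157), np.zeros(2435 - 157),
                  np.ones(235), np.zeros(2435 - 235)],
})

# One row per group: number of resumes, number of callbacks, callback rate.
summary = demo.groupby("race")["call"].agg(
    resumes="count", callbacks="sum", rate="mean"
)
print(summary)
```

With the actual dataset, replacing `demo` with `data` should give the same three columns without rebuilding anything by hand.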

__What test is appropriate for this problem? Does CLT apply?__

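I believe a two-sample z-test for the difference in proportions is the best test for this problem: the outcome ('call') is binary, and we are comparing the callback rates of two independent groups.

The CLT applies. Each group has 2,435 resumes, and both the number of callbacks (157 and 235) and the number of non-callbacks are far above 10, so the sampling distribution of each callback rate is approximately normal. Because the names were assigned to resumes at random, the observations can be treated as independent. I don't know whether this data was sampled from one particular geographic location, which would limit how far the results generalize.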
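A common rule of thumb for using the normal approximation with proportions is that each group should have at least 10 observed successes and 10 observed failures. The sketch below checks that condition using the counts reported earlier. The helper name `normal_approx_ok` and the threshold of 10 are assumptions made for illustration.

```python
# Callback counts and group sizes as reported in the cells above.
groups = {"b": (157, 2435), "w": (235, 2435)}


def normal_approx_ok(successes, n, threshold=10):
    """Return True when both successes and failures meet the threshold."""
    failures = n - successes
    return successes >= threshold and failures >= threshold


for label, (calls, n) in groups.items():
    # Print the group, its callbacks, its non-callbacks, and the check result.
    print(label, calls, n - calls, normal_approx_ok(calls, n))
```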

__What are the null and alternate hypotheses?__

The null hypothesis is that the callback rate for 'w' and 'b' candidates is the same. The alternate hypothesis is that the callback rate is not the same for 'w' and 'b' candidates. 
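In symbols, writing $p_w$ and $p_b$ for the true callback rates of white-sounding and black-sounding names (notation introduced here for illustration), the hypotheses and the usual pooled two-proportion test statistic can be written as:

$$
H_0: p_w - p_b = 0 \qquad H_A: p_w - p_b \neq 0
$$

$$
z = \frac{\hat{p}_w - \hat{p}_b}{\sqrt{\hat{p}\,(1-\hat{p})\left(\frac{1}{n_w} + \frac{1}{n_b}\right)}}, \qquad \hat{p} = \frac{x_w + x_b}{n_w + n_b}
$$

Here $x_w$ and $x_b$ are the callback counts, $n_w$ and $n_b$ are the group sizes, and $\hat{p}$ is the pooled callback rate under the null hypothesis.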

__Compute margin of error, confidence interval, and p-value.__

In [3]:
# Margin of error and 95% confidence interval for white-sounding names
import math

# observed callback rate: 235 callbacks out of 2435 resumes
ro_Wc = 235/2435

# margin of error at 95% confidence (z = 1.96)
moe_Wc = 1.96*math.sqrt((ro_Wc*(1-ro_Wc))/2435)
print('Margin of Error "White Name Callback":', moe_Wc)

# confidence interval from the mean and std reported by describe() above
con_Interval_w_high = 0.096509 + 1.96 * (.295346) / math.sqrt(2435)
con_Interval_w_low = 0.096509 - 1.96 * (.295346) / math.sqrt(2435)
print('Confidence Interval:', con_Interval_w_low, 'to', con_Interval_w_high)


Margin of Error "White Name Callback": 0.011728781469131009
Confidence Interval: 0.08477792849501696 to 0.10824007150498303


In [4]:
# Margin of error and 95% confidence interval for black-sounding names

# observed callback rate: 157 callbacks out of 2435 resumes
ro_Bc = 157/2435

# margin of error at 95% confidence (z = 1.96)
moe_Bc = 1.96*math.sqrt((ro_Bc*(1-ro_Bc))/2435)
print('Margin of Error "Black Name Callback":', moe_Bc)

# confidence interval from the mean and std reported by describe() above
con_Interval_b_high = 0.064476 + 1.96 * (.245649) / math.sqrt(2435)
con_Interval_b_low =  0.064476 - 1.96 * (.245649) / math.sqrt(2435)
print('Confidence Interval:', con_Interval_b_low, 'to', con_Interval_b_high)


Margin of Error "Black Name Callback": 0.009755158027911414
Confidence Interval: 0.054718881284569365 to 0.07423311871543065
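The cells above give a margin of error and interval for each group separately, but the exercise also asks for a p-value, which requires testing the difference between the two rates. The following is a minimal sketch of a two-proportion z-test using only the counts already reported in this notebook. The variable names are illustrative, and the two-sided p-value is computed with `math.erfc`, which is equivalent to doubling the upper tail of the standard normal.

```python
import math

# Counts reported earlier in the notebook.
calls_w, n_w = 235, 2435
calls_b, n_b = 157, 2435

p_w = calls_w / n_w
p_b = calls_b / n_b
diff = p_w - p_b

# 95% confidence interval for the difference: unpooled standard error.
se_unpooled = math.sqrt(p_w * (1 - p_w) / n_w + p_b * (1 - p_b) / n_b)
moe_diff = 1.96 * se_unpooled
ci_diff = (diff - moe_diff, diff + moe_diff)

# Test statistic: pooled standard error, because H0 assumes equal rates.
p_pool = (calls_w + calls_b) / (n_w + n_b)
se_pooled = math.sqrt(p_pool * (1 - p_pool) * (1 / n_w + 1 / n_b))
z_stat = diff / se_pooled

# Two-sided p-value for a standard normal: P(|Z| > |z|) = erfc(|z| / sqrt(2)).
p_value = math.erfc(abs(z_stat) / math.sqrt(2))

print("Difference in callback rates:", diff)
print("Margin of error (difference):", moe_diff)
print("95% confidence interval (difference):", ci_diff)
print("z statistic:", z_stat)
print("p-value:", p_value)
```

On these counts the z statistic is about 4.1, the p-value is far below 0.05, and the interval for the difference excludes zero, so the null hypothesis of equal callback rates would be rejected at the 5% level.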


__Write a story describing the statistical significance in the context or the original problem.__

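The 95% confidence intervals show that resumes with a "white" name received a callback roughly 8.5% to 10.8% of the time, while resumes with a "black" name received a callback roughly 5.5% to 7.4% of the time. Because the resumes were otherwise identical and the two intervals do not overlap, the gap in callback rates is unlikely to be due to chance alone.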
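The size of the gap can be expressed either as an absolute difference in callback rates or as a ratio between them. The short sketch below computes both from the counts reported earlier; the variable names are illustrative.

```python
# Counts reported earlier in the notebook.
calls_w, calls_b, n = 235, 157, 2435

rate_w = calls_w / n
rate_b = calls_b / n

# Absolute gap in percentage points, and relative gap as a ratio of rates.
abs_gap_points = 100 * (rate_w - rate_b)
rel_ratio = rate_w / rate_b

print("Absolute gap (percentage points):", abs_gap_points)
print("Callback ratio (white / black):", rel_ratio)
```

On these counts the absolute gap is about 3.2 percentage points, while the ratio shows that white-sounding names received roughly 1.5 times as many callbacks.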

__Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?__

I believe the statistics do not show that race/name is the most important factor in callback success: the callback rate differs by only about 3 percentage points between the two groups, and this analysis looked at race/name alone, so it cannot rank it against other factors. I would amend my analysis to check whether other features, such as qualifications or certifications for the job, are more significant factors for callbacks, while controlling for poverty and income status. Also, labeling names as "white-sounding" and "black-sounding" somewhat reinforces racial stereotypes; a name might be more of an indication of socioeconomic status and upbringing than of race.