# Examining Racial Discrimination in the US Job Market

### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés to black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes when presented to the employer.

### Exercises
You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.

Answer the following questions **in this notebook below and submit to your Github account**. 

   1. What test is appropriate for this problem? Does CLT apply?
   2. What are the null and alternate hypotheses?
   3. Compute margin of error, confidence interval, and p-value.
   4. Write a story describing the statistical significance in the context of the original problem.
   5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

You can include written notes in notebook cells using Markdown: 
   - In the control panel at the top, choose Cell > Cell Type > Markdown
   - Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet


#### Resources
+ Experiment information and data source: http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states
+ Scipy statistical methods: http://docs.scipy.org/doc/scipy/reference/stats.html 
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet
****

In [1]:
import pandas as pd
import numpy as np
from scipy import stats

In [2]:
data = pd.read_stata('data/us_job_market_discrimination.dta')

In [3]:
# number of callbacks for black-sounding and white-sounding names
print('Black names call: {}'.format(np.sum(data[data.race=='b'].call)))
print('White names call: {}'.format(np.sum(data[data.race=='w'].call)))

Black names call: 157.0
White names call: 235.0


### Answer to Questions 1 & 2:
1. Given the large number of samples, a z-test is appropriate. Since the names are randomly assigned, I am going to assume the CLT applies. (I do not know how to correctly test this assumption; in the last mini project I could use body temperature, but I do not know which feature to analyze in this one.)

2. The null hypothesis is that there is no racial discrimination in callbacks, i.e. the callback rate is the same for black-sounding and white-sounding names. The alternative hypothesis is that race does affect the callback rate.
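
One common rule of thumb for checking the CLT assumption with a binary outcome such as `call` is the success/failure condition: each group should contain at least 10 callbacks and at least 10 non-callbacks. The sketch below applies it to the counts reported in this notebook (157 and 235 callbacks, 2,435 resumes per group). The helper name `clt_condition_met` and the threshold of 10 are illustrative assumptions, not part of the original analysis.

```python
# Success/failure rule of thumb for a binary outcome (assumed threshold: 10).
# Counts are the ones reported in this notebook.
n_per_group = 2435
callbacks = {'b': 157, 'w': 235}


def clt_condition_met(successes, n, threshold=10):
    """Return True if both successes and failures reach the threshold."""
    failures = n - successes
    return successes >= threshold and failures >= threshold


for race, count in sorted(callbacks.items()):
    print('{}: callbacks={}, non-callbacks={}, condition met={}'.format(
        race, count, n_per_group - count,
        clt_condition_met(count, n_per_group)))
```

Both groups clear the threshold by a wide margin, which supports using a normal approximation for the callback proportions.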

In [None]:
white = data[data.race=='w'].call
black = data[data.race=='b'].call

# observed difference in the number of callbacks (white minus black)
diff = np.sum(white) - np.sum(black)
print('Actual Difference: {}'.format(diff))

# bootstrap: resample each group with replacement and recompute the difference
bs_times = 10000
bs_diff = np.empty(bs_times)

for i in range(bs_times):
    bs_white = np.random.choice(white, len(white))
    bs_black = np.random.choice(black, len(black))
    bs_diff[i] = np.sum(bs_white) - np.sum(bs_black)

print('95% Confidence Interval: {}'.format(np.percentile(bs_diff, [2.5, 97.5])))

Actual Difference: 78.0
95% Confidence Interval: [  41.  115.]


In [None]:
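permu_times = 100000

# pool both groups; under the null hypothesis the race labels are exchangeable
conc_sample = np.concatenate((white, black))

perm_diff = np.empty(permu_times)
for i in range(permu_times):
    perm_all = np.random.permutation(conc_sample)
    permu_white = perm_all[:len(white)]
    permu_black = perm_all[len(white):]
    perm_diff[i] = np.sum(permu_white) - np.sum(permu_black)

# one-sided p-value: fraction of permuted differences at least as large as the observed one
# (np.mean returns a float, avoiding integer division of the count by the number of permutations)
print('p value for null hypothesis: {}'.format(np.mean(perm_diff >= diff)))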


### Answer to Questions 3, 4 & 5:
3. I am not sure how to compute the margin of error. The actual difference in the number of callbacks between white-sounding and black-sounding names is 78, the 95% confidence interval from the bootstrap is [41, 115], and the p-value is close to 0.

4. The dataset contains 2,435 resumes with black-sounding names and 2,435 with white-sounding names; 157 of the former and 235 of the latter received a callback, a difference of 78 callbacks (about 6.4% versus 9.7%). The bootstrap gives a 95% confidence interval of [41, 115] for this difference, which excludes 0. A permutation test with 100,000 permutations gives a p-value of essentially 0 for the null hypothesis that race does not affect the callback rate. The null hypothesis is therefore very unlikely to hold, and race does influence the callback rate.

5. That said, we cannot conclude that race/name is the most important factor in callback success, because this analysis examines only race and does not compare its effect with that of any other feature. In addition, although the names are randomly assigned, resumes with white-sounding names could by chance have more work experience on average than those with black-sounding names. To check this, I would test the null hypothesis that resumes with white-sounding and black-sounding names have the same distribution of other features (education, work experience, etc.).
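
For the margin of error asked for in question 3, one standard option is the normal approximation for a difference of two proportions. The sketch below is not the bootstrap and permutation approach used in this notebook; it works on callback rates rather than counts and uses only the totals reported above (157 and 235 callbacks out of 2,435 resumes each). The function name `two_proportion_summary` and its return keys are illustrative assumptions.

```python
import numpy as np
from scipy import stats


def two_proportion_summary(x1, n1, x2, n2, confidence=0.95):
    """Normal-approximation summary for the difference p1 - p2."""
    p1, p2 = x1 / float(n1), x2 / float(n2)
    diff = p1 - p2

    # Unpooled standard error, used for the confidence interval
    se = np.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z_crit = stats.norm.ppf(1 - (1 - confidence) / 2.0)
    margin = z_crit * se

    # Pooled standard error, used for testing the null of equal rates
    p_pool = (x1 + x2) / float(n1 + n2)
    se_pool = np.sqrt(p_pool * (1 - p_pool) * (1.0 / n1 + 1.0 / n2))
    z = diff / se_pool
    p_value = 2 * stats.norm.sf(abs(z))  # two-sided

    return {'diff': diff,
            'margin_of_error': margin,
            'ci': (diff - margin, diff + margin),
            'z': z,
            'p_value': p_value}


# white-sounding names first, so the difference is white minus black
result = two_proportion_summary(235, 2435, 157, 2435)
for key in ['diff', 'margin_of_error', 'ci', 'z', 'p_value']:
    print('{}: {}'.format(key, result[key]))
```

Multiplying the rate difference and its margin of error by 2,435 puts them back on the callback-count scale used by the bootstrap interval, which makes the two approaches easy to compare.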