# Examining Racial Discrimination in the US Job Market

### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés to black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes when presented to the employer.

### Exercises
You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.

Answer the following questions **in this notebook below and submit to your Github account**. 

   1. What test is appropriate for this problem? Does CLT apply?
   2. What are the null and alternate hypotheses?
   3. Compute margin of error, confidence interval, and p-value. Try using both the bootstrapping and the frequentist statistical approaches.
   4. Write a story describing the statistical significance in the context of the original problem.
   5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

You can include written notes in notebook cells using Markdown: 
   - In the control panel at the top, choose Cell > Cell Type > Markdown
   - Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet

#### Resources
+ Experiment information and data source: http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states
+ Scipy statistical methods: http://docs.scipy.org/doc/scipy/reference/stats.html 
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet
+ Formulas for the Bernoulli distribution: https://en.wikipedia.org/wiki/Bernoulli_distribution

In [1]:
import pandas as pd
import numpy as np
from scipy import stats

In [2]:
data = pd.io.stata.read_stata('data/us_job_market_discrimination.dta')

In [3]:
# number of callbacks for white-sounding names
sum(data[data.race=='w'].call)

235.0

In [4]:
data.head()

Unnamed: 0,id,ad,education,ofjobs,yearsexp,honors,volunteer,military,empholes,occupspecific,...,compreq,orgreq,manuf,transcom,bankreal,trade,busservice,othservice,missind,ownership
0,b,1,4,2,6,0,0,0,1,17,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
1,b,1,3,3,6,0,1,1,0,316,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
2,b,1,4,1,6,0,0,0,0,19,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
3,b,1,3,4,6,0,1,0,1,313,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
4,b,1,3,3,22,0,0,0,0,313,...,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,Nonprofit


In [5]:
data.describe()

Unnamed: 0,education,ofjobs,yearsexp,honors,volunteer,military,empholes,occupspecific,occupbroad,workinschool,...,educreq,compreq,orgreq,manuf,transcom,bankreal,trade,busservice,othservice,missind
count,4870.0,4870.0,4870.0,4870.0,4870.0,4870.0,4870.0,4870.0,4870.0,4870.0,...,4870.0,4870.0,4870.0,4870.0,4870.0,4870.0,4870.0,4870.0,4870.0,4870.0
mean,3.61848,3.661396,7.842916,0.052772,0.411499,0.097125,0.448049,215.637782,3.48152,0.559548,...,0.106776,0.437166,0.07269,0.082957,0.03039,0.08501,0.213963,0.267762,0.154825,0.165092
std,0.714997,1.219126,5.044612,0.223601,0.492156,0.296159,0.497345,148.127551,2.038036,0.496492,...,0.308866,0.496083,0.259649,0.275854,0.171677,0.278932,0.410141,0.442847,0.361773,0.371308
min,0.0,1.0,1.0,0.0,0.0,0.0,0.0,7.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,3.0,3.0,5.0,0.0,0.0,0.0,0.0,27.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,4.0,4.0,6.0,0.0,0.0,0.0,0.0,267.0,4.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,4.0,4.0,9.0,0.0,1.0,0.0,1.0,313.0,6.0,1.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
max,4.0,7.0,44.0,1.0,1.0,1.0,1.0,903.0,6.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


## 1) What test is appropriate for this problem? Does CLT apply?
## 2) What are the null and alternate hypotheses?

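* A general rule of thumb for the Large Enough Sample Condition is that n≥30; here n=4870, with 2435 resumes in each group.
* We assume that the independence condition is satisfied as well, since names were assigned to resumes at random.  
--> Therefore the CLT applies.  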
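
For proportions, a commonly used form of the large-sample condition asks for at least 10 successes and 10 failures in each group. The sketch below checks that, assuming the threshold of 10 (a convention, not a hard rule) and hard-coding the per-group counts printed in the cells further down; the helper name `large_enough` is illustrative.

```python
# Callbacks and resumes per group, as printed later in this notebook.
groups = {"b": (157, 2435), "w": (235, 2435)}
THRESHOLD = 10  # conventional rule of thumb, assumed here

def large_enough(calls, n, threshold=THRESHOLD):
    """True if both callbacks and non-callbacks reach the threshold."""
    return calls >= threshold and (n - calls) >= threshold

checks = {race: large_enough(calls, n) for race, (calls, n) in groups.items()}
print(checks)
```

Both groups pass comfortably, which supports using the normal approximation for the sample proportions.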
  
$H_0$ -- rate of callbacks for resumes is the same for both groups ($r_b = r_w$)  
$H_A$ -- rate of callbacks for resumes is not the same ($r_b \neq r_w$)

## 3) Compute margin of error, confidence interval, and p-value. Try using both the bootstrapping and the frequentist statistical approaches.

In [6]:
w = data[data.race=='w']
b = data[data.race=='b']
print("Total calls back:", data.call.values.sum())
print("Total calls back to 'w':", w.call.values.sum())
print("Total calls back to 'b':", b.call.values.sum())

Total calls back: 392.0
Total calls back to 'w': 235.0
Total calls back to 'b': 157.0


In [7]:
data.race.value_counts()

w    2435
b    2435
Name: race, dtype: int64

Both groups, 'b' and 'w', are equally represented in our data, with 2435 resumes each.

In [8]:
len(data[data.race=='w'].call)

2435

In [9]:
df_rates = data.groupby(['race','call']).size().apply(lambda x: x/2435.).unstack()
df_rates.columns = ['no call', 'call']
df_rates

race,no call,call
b,0.935524,0.064476
w,0.903491,0.096509


We are dealing with a Bernoulli distribution. We will try to answer the question of whether there is a difference in callback rates between black-sounding and white-sounding names (we are looking for $p_w - p_b = ?$). Our confidence level will be $95\%$, which corresponds to a significance level of $\alpha = 0.05$.    
Let's write down the equations that are valid for this case, where $i \in \{b, w\}$:  
$\mu_i = p_i$  
$\sigma_i^2 = \frac{p_i * (1 - p_i)}{\textrm{sample size}}$  
$p_i = \frac{\textrm{received calls}}{2435}$

For the difference of the two samples (variances of independent samples add):  
$\mu = p_w - p_b$  
$\sigma^2 = \sigma_w^2 + \sigma_b^2$
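
The p-value asked for in question 3 can be obtained with a pooled two-proportion z-test. The following is a minimal sketch rather than this notebook's own method: it hard-codes the counts printed above (157 and 235 callbacks out of 2435 resumes each), pools the two groups because $H_0$ assumes one common rate, and uses the standard library's `NormalDist` in place of `scipy.stats`. All variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

# Counts taken from the cells above: callbacks and resumes per group.
calls_b, calls_w, n = 157, 235, 2435

p_b, p_w = calls_b / n, calls_w / n

# Under H0 both groups share one callback rate, estimated by pooling.
p_pool = (calls_b + calls_w) / (2 * n)
se_pool = sqrt(p_pool * (1 - p_pool) * (1 / n + 1 / n))

# Test statistic: observed gap measured in pooled standard errors.
z = (p_w - p_b) / se_pool

# Two-sided p-value: chance of a gap at least this large in either direction.
p_value = 2 * (1 - NormalDist().cdf(abs(z)))
print(f"z = {z:.2f}, p-value = {p_value:.1e}")
```

With these counts the z statistic comes out at roughly 4.1 and the two-sided p-value is far below $0.05$, so $H_0$ would be rejected at the $5\%$ level.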

In [11]:
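# sample callback rates; each group has len(b) == len(w) == 2435 resumes
p_b = b.call.values.sum() / len(b)
p_w = w.call.values.sum() / len(w)
# variance of each sample proportion: p * (1 - p) / n
var_b = (p_b * (1 - p_b)) / len(b)
var_w = (p_w * (1 - p_w)) / len(w)

# difference of the two sample proportions
mu = p_w - p_b
var = var_w + var_b        # variances of independent samples add
std_dev = np.sqrt(var)     # standard error of the difference
print(f"{mu:.4f}")

0.0320


The distance from $\mu$ to either end of the interval is $d = z * std\_dev$, and according to the z-table $z=1.96$ for a $95\%$ confidence level. This half-width $d$ is the margin of error.     
Keep in mind that our sample is large enough, so we can use our sample proportions $\approx$ proportions.

In [12]:
z = 1.96
d = z * std_dev
low = mu - d
up = mu + d
margin_error = d
print(f"{margin_error:.4f} {low:.4f} {up:.4f}")

0.0153 0.0168 0.0473

The $95\%$ confidence interval for $p_w - p_b$ is $(0.0168, 0.0473)$ and the margin of error is $0.0153$.  
Because the interval does not contain $0$, we reject $H_0$ at the $5\%$ significance level: resumes with white-sounding names received callbacks at a higher rate than resumes with black-sounding names.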
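
Question 3 also asks for a bootstrapping approach. A possible sketch is shown here, under a few assumptions: it works from the counts printed above instead of the raw column, it uses the fact that resampling $n$ zero/one values with replacement yields a Binomial($n$, $\hat{p}$) number of callbacks, and it adds a permutation test (shuffling the race labels) for the p-value. The seed, the number of replicates, and all names are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)   # fixed seed, only for reproducibility
calls_b, calls_w, n = 157, 235, 2435
reps = 10_000

# Bootstrap: resample each group with replacement and recompute the gap.
boot_b = rng.binomial(n, calls_b / n, size=reps) / n
boot_w = rng.binomial(n, calls_w / n, size=reps) / n
boot_diff = boot_w - boot_b
ci_low, ci_high = np.percentile(boot_diff, [2.5, 97.5])
margin = (ci_high - ci_low) / 2   # half-width of the percentile interval

# Permutation test: under H0 the labels are exchangeable, so the number of
# callbacks landing in group 'w' is hypergeometric.
total_calls = calls_b + calls_w
perm_w = rng.hypergeometric(total_calls, 2 * n - total_calls, n, size=reps)
perm_diff = (2 * perm_w - total_calls) / n      # p_w - p_b per shuffle
obs_diff = (calls_w - calls_b) / n
p_perm = np.mean(np.abs(perm_diff) >= obs_diff)

print(f"bootstrap 95% CI: ({ci_low:.4f}, {ci_high:.4f}), margin {margin:.4f}")
print(f"permutation p-value: {p_perm:.4f}")
```

The bootstrap interval should land close to the interval obtained from the normal approximation, and the permutation p-value should be very small; with only 10,000 shuffles it may even print as zero, which only means it is below the resolution of the simulation.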

## 4) Write a story describing the statistical significance in the context of the original problem.
## 5) Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

We found that resumes with white-sounding names tend to receive more callbacks than resumes with black-sounding names. Now let's look at the names themselves to check whether they signal race. We expect that they do, since applications usually do not state race directly.

In [91]:
names_w = set(w.firstname.values)
names_b = set(b.firstname.values)

In [92]:
names_w

{'Allison',
 'Anne',
 'Brad',
 'Brendan',
 'Brett',
 'Carrie',
 'Emily',
 'Geoffrey',
 'Greg',
 'Jay',
 'Jill',
 'Kristen',
 'Laurie',
 'Matthew',
 'Meredith',
 'Neil',
 'Sarah',
 'Todd'}

In [93]:
names_b

{'Aisha',
 'Darnell',
 'Ebony',
 'Hakim',
 'Jamal',
 'Jermaine',
 'Kareem',
 'Keisha',
 'Kenya',
 'Lakisha',
 'Latonya',
 'Latoya',
 'Leroy',
 'Rasheed',
 'Tamika',
 'Tanisha',
 'Tremayne',
 'Tyrone'}

In [94]:
names_w.intersection(names_b)

set()

The analysis above shows that the first name alone tells us whether a resume belongs to the white-sounding or the black-sounding group, as the two sets of names do not overlap. Because names were assigned to resumes at random, other resume features (education, experience, and so on) should be balanced between the two groups on average, so the gap in callbacks can reasonably be attributed to the name. However, this does not show that race/name is the most important factor in callback success, because we have not compared its effect with the effect of any other feature.  
To find out which features are the most significant for getting a callback, the analysis needs to be extended to the other features as well.
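
One simple way to start such a comparison is to compute, for each binary feature, the callback rate with the feature minus the rate without it. The sketch below only illustrates the shape of that computation: the helper `callback_gap` is hypothetical, and the `toy` rows are made up for the example and are not the study's data. The column names `call`, `honors`, and `volunteer` do exist in the dataset; `race` would first need to be encoded as 0/1 to be compared the same way.

```python
import pandas as pd

def callback_gap(df, feature, outcome="call"):
    """Callback rate where feature == 1 minus the rate where feature == 0."""
    rates = df.groupby(feature)[outcome].mean()
    return rates.loc[1] - rates.loc[0]

# Made-up rows, only to show how the helper is called.
toy = pd.DataFrame({
    "call":      [1, 0, 0, 1, 0, 0, 1, 0],
    "honors":    [1, 1, 0, 0, 0, 0, 1, 0],
    "volunteer": [0, 1, 1, 0, 1, 0, 0, 1],
})

gaps = {col: callback_gap(toy, col) for col in ["honors", "volunteer"]}
print(gaps)
```

On the real data, the same loop could be run over all binary columns and the gaps ranked by size. A regression model that includes all features at once would be a more careful next step, since raw gaps ignore correlations between features.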