# Examining Racial Discrimination in the US Job Market

### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés to black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes when presented to the employer.

In [1]:
import pandas as pd
import numpy as np
from scipy import stats
import seaborn as sns
%matplotlib inline

In [2]:
data = pd.read_stata('data/us_job_market_discrimination.dta')

In [3]:
# number of callbacks for black-sounding names
sum(data[data.race=='b'].call)

157.0

In [4]:
data.head()

Unnamed: 0,id,ad,education,ofjobs,yearsexp,honors,volunteer,military,empholes,occupspecific,...,compreq,orgreq,manuf,transcom,bankreal,trade,busservice,othservice,missind,ownership
0,b,1,4,2,6,0,0,0,1,17,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
1,b,1,3,3,6,0,1,1,0,316,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
2,b,1,4,1,6,0,0,0,0,19,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
3,b,1,3,4,6,0,1,0,1,313,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
4,b,1,3,3,22,0,0,0,0,313,...,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,Nonprofit


In [5]:
data.shape

(4870, 65)

In [6]:
data.race.value_counts()

b    2435
w    2435
Name: race, dtype: int64

In [7]:
data.call.value_counts()

0.0    4478
1.0     392
Name: call, dtype: int64

In [8]:
call_w = data[data.race == 'w'].call
call_w.shape

(2435,)

In [9]:
call_b = data[data.race == 'b'].call
call_b.shape

(2435,)

Since both variables are categorical, we will use **cross-tabulation and a chi-square test**. Cross-tabulation analysis is used for two-way tables and is also known as contingency table analysis.


For a chi-square test for association, the hypotheses are as follows:

**H0==>** Callback and race are **independent**; no association between the variables exists.

**H1==>** Callback and race are **not independent**; an association between the variables exists.


In [10]:
contingency_table = pd.crosstab(data.race, data.call)
contingency_table.columns = ['No','Yes']
contingency_table.index = ['b','w']
contingency_table

Unnamed: 0,No,Yes
b,2278,157
w,2200,235
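
The raw counts are easier to compare as callback rates. The sketch below rebuilds the table from the counts shown above (hard-coded so the block runs without the data file) and divides each row by its row total. The names `counts` and `rates` are introduced here only for illustration.

```python
import pandas as pd

# Counts copied from the contingency table above (hard-coded for illustration).
counts = pd.DataFrame(
    {'No': [2278, 2200], 'Yes': [157, 235]},
    index=['b', 'w'],
)

# Divide each row by its row total to get the share of resumes per outcome.
rates = counts.div(counts.sum(axis=1), axis=0)
print(rates.round(4))
```

With these counts, roughly 6.4% of resumes with black-sounding names and 9.7% of resumes with white-sounding names received a call.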


In [11]:
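# chi2_contingency returns the chi-square statistic, p-value, degrees of freedom, and expected counts.
# For a 2x2 table (dof = 1), SciPy applies Yates' continuity correction by default.
chi2, p_vl, dof, exp = stats.chi2_contingency(contingency_table)
# Critical value at the 5% significance level
chi2_critical = stats.chi2.ppf(q = 0.95, df = dof)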
print("Chi-square Critical value:", chi2_critical)
print('chi2:', chi2)
print('p_val:', p_vl)
print('degree of freedom:', dof)

Chi-square Critical value: 3.84145882069
chi2: 16.4490285842
p_val: 4.99757838996e-05
degree of freedom: 1
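
To make the statistic less of a black box, the following sketch recomputes it by hand from the observed counts. It assumes the same 2x2 table as above and applies Yates' continuity correction, which SciPy uses by default when there is one degree of freedom. The variable names (`observed`, `expected`, `chi2_manual`, `p_manual`) are illustrative and not part of the original notebook.

```python
import numpy as np
from scipy import stats

# Observed counts copied from the contingency table above.
observed = np.array([[2278, 157],
                     [2200, 235]], dtype=float)

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
n = observed.sum()

# Expected counts under independence: (row total * column total) / N.
expected = row_totals @ col_totals / n

# Yates' continuity correction: shrink each |O - E| by 0.5 before squaring.
chi2_manual = ((np.abs(observed - expected) - 0.5) ** 2 / expected).sum()

# Upper-tail probability of the chi-square distribution with 1 degree of freedom.
p_manual = stats.chi2.sf(chi2_manual, df=1)

print('chi2 (manual):', chi2_manual)
print('p-value (manual):', p_manual)
```

The manual value should agree with the `chi2` reported by `stats.chi2_contingency` above, since both use the same correction.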


As calculated above, the **p-value is less than 0.05** and the **calculated chi-square statistic is greater than the chi-square critical value**. On both counts, we can **reject the null hypothesis** in favor of the alternative hypothesis.

We can therefore say that **callback and race are not independent**. However, this test alone cannot tell us how strong the association is.
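
One common way to gauge the strength of an association in a 2x2 table is the phi coefficient, defined as the square root of the uncorrected chi-square statistic divided by the sample size. The sketch below is an illustrative addition, not part of the original analysis; it assumes the counts from the contingency table above and also reports the plain difference in callback rates.

```python
import numpy as np
from scipy import stats

# Observed counts copied from the contingency table above (rows: b, w; columns: No, Yes).
observed = np.array([[2278, 157],
                     [2200, 235]])

# correction=False gives the uncorrected Pearson statistic that phi is defined on.
chi2_uncorrected = stats.chi2_contingency(observed, correction=False)[0]
n = observed.sum()
phi = np.sqrt(chi2_uncorrected / n)

# Difference in callback rates (white-sounding minus black-sounding names).
rate_diff = observed[1, 1] / observed[1].sum() - observed[0, 1] / observed[0].sum()

print('phi:', phi)
print('difference in callback rates:', rate_diff)
```

Phi ranges from 0 (no association) to 1 (perfect association) for a 2x2 table. Here it comes out near 0.06, while the callback rates differ by about 3.2 percentage points, so the association is statistically detectable but small on that scale.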
