# Examining Racial Discrimination in the US Job Market

### Background
Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning identical résumés to black-sounding or white-sounding names and observing the impact on requests for interviews from employers.

### Data
In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating black-sounding and white-sounding. The column 'call' has two values, 1 and 0, indicating whether the resume received a call from employers or not.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes when presented to the employer.

### Exercises
You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.

Answer the following questions **in this notebook below and submit to your GitHub account**.

   1. What test is appropriate for this problem? Does CLT apply?
   2. What are the null and alternate hypotheses?
   3. Compute margin of error, confidence interval, and p-value.
   4. Write a story describing the statistical significance in the context of the original problem.
   5. Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?

You can include written notes in notebook cells using Markdown: 
   - In the control panel at the top, choose Cell > Cell Type > Markdown
   - Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet


#### Resources
+ Experiment information and data source: http://www.povertyactionlab.org/evaluation/discrimination-job-market-united-states
+ Scipy statistical methods: http://docs.scipy.org/doc/scipy/reference/stats.html 
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet
****

In [83]:
import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns

In [84]:
data = pd.read_stata('data/us_job_market_discrimination.dta')

In [85]:
#data = data[data.sex =="m"]
#data[["sex","race","call"]].tail()

# Exploring the dataset

In [86]:
data.head()

Unnamed: 0,id,ad,education,ofjobs,yearsexp,honors,volunteer,military,empholes,occupspecific,...,compreq,orgreq,manuf,transcom,bankreal,trade,busservice,othservice,missind,ownership
0,b,1,4,2,6,0,0,0,1,17,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
1,b,1,3,3,6,0,1,1,0,316,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
2,b,1,4,1,6,0,0,0,0,19,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
3,b,1,3,4,6,0,1,0,1,313,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
4,b,1,3,3,22,0,0,0,0,313,...,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,Nonprofit


In [87]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4870 entries, 0 to 4869
Data columns (total 65 columns):
id                    4870 non-null object
ad                    4870 non-null object
education             4870 non-null int8
ofjobs                4870 non-null int8
yearsexp              4870 non-null int8
honors                4870 non-null int8
volunteer             4870 non-null int8
military              4870 non-null int8
empholes              4870 non-null int8
occupspecific         4870 non-null int16
occupbroad            4870 non-null int8
workinschool          4870 non-null int8
email                 4870 non-null int8
computerskills        4870 non-null int8
specialskills         4870 non-null int8
firstname             4870 non-null object
sex                   4870 non-null object
race                  4870 non-null object
h                     4870 non-null float32
l                     4870 non-null float32
call                  4870 non-null float32
city        

In [88]:
#data.sex.value_counts()

For this analysis our columns of interest are: race and call.
Let us see the main properties of these two columns. 

In [89]:
data.race.value_counts()  # Count of each value in the race column

b    2435
w    2435
Name: race, dtype: int64

In [90]:
data.call.value_counts()
# Count of each value in the call column

0.0    4478
1.0     392
Name: call, dtype: int64

In [91]:
# Number of callbacks received for race = "w"
y_calls_white = data[(data.race == "w") & (data.call == 1.0)].call.count()
# Number of resumes with race = "w" that did not receive a callback
n_calls_white = data[data.race == "w"].call.count() - y_calls_white
# Number of callbacks received for race = "b"
y_calls_black = data[(data.race == "b") & (data.call == 1.0)].call.count()
# Number of resumes with race = "b" that did not receive a callback
n_calls_black = data[data.race == "b"].call.count() - y_calls_black

In [92]:
# Convert series race to type category
data["race"] = data["race"].astype("category")

In [93]:
"""
Redefining column call(with name: phoned) 
where 1.0 -> "yes" and 0.0 -> "no"
"""
data["phoned"]= data.call.map({1.0:"yes", 0.0:"no"})
# Convert series phoned to type category
data["phoned"] = data["phoned"].astype("category")
# New dataframe with columns: race, call and phoned
data_race_vs_calls= data[["race","call","phoned"]]
data_race_vs_calls.tail()

| | race | call | phoned |
|---|---|---|---|
| 4865 | b | 0.0 | no |
| 4866 | b | 0.0 | no |
| 4867 | w | 0.0 | no |
| 4868 | b | 0.0 | no |
| 4869 | w | 0.0 | no |


In [94]:
"""
Contingency table with columns(phoned) yes and no 
and rows(race) b and w
"""
data_race_vs_calls = data.pivot_table(index="race", columns="phoned",
                                      values="call", aggfunc="count")
data_race_vs_calls.head()

| race | phoned = no | phoned = yes |
|---|---|---|
| b | 2278 | 157 |
| w | 2200 | 235 |


In [95]:
b_callbacks_percent =100.0*y_calls_black /(y_calls_black + n_calls_black)
w_callbacks_percent =100.0*y_calls_white /(y_calls_white + n_calls_white)
print("Percent of callbacks received by black applicants: {}".format(b_callbacks_percent))
print("Percent of callbacks received by white applicants: {}".format(w_callbacks_percent))
print("Difference in favor of\
 white applicants: {}".format(w_callbacks_percent - b_callbacks_percent))

Percent of callbacks received by black applicants: 6.447638603696099
Percent of callbacks received by white applicants: 9.650924024640657
Difference in favor of white applicants: 3.2032854209445585


**Is that difference of 3.2 percentage points statistically significant?**

**What test is appropriate for this problem?**

We want to know whether race has a significant impact on the rate of callbacks. To do that, we can check whether the data suggest any dependence between the variables race and call. The classical method for this task is the chi-square test of independence. We can use this method because:
* Assignment is random: the 'b' and 'w' values in race are assigned randomly to the resumes.
* The variables under study are categorical.
* The expected frequency count for each cell of the table is at least 5.
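
The last condition can be checked directly from the observed counts. A minimal sketch, using only the standard library and the counts from the contingency table above (the names `observed` and `expected` are illustrative and not part of the notebook's own code):

```python
# Observed counts taken from the contingency table: rows are race (b, w),
# columns are callback outcome (no, yes).
observed = [[2278, 157],
            [2200, 235]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count under independence: row total * column total / grand total.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

smallest = min(min(row) for row in expected)
print(expected)
print("Smallest expected count:", smallest)
```

If the smallest expected count is at least 5, the usual rule of thumb for the chi-square approximation is met.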

**What are the null and alternate hypotheses?**

Null hypothesis

$H_{0}$: Whether a given applicant receives a callback is not related to their race. In other words, the variables race and call are independent.

Alternative hypothesis

$H_{a}$: variables race and call are not independent.
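
For a 2×2 table, independence is equivalent to equal callback probabilities in the two groups. Writing $p_b$ and $p_w$ (notation introduced here for illustration) for the probability that a resume with a black-sounding or white-sounding name receives a callback, the same hypotheses can be stated as:

$$
H_{0}: p_b = p_w \qquad \text{versus} \qquad H_{a}: p_b \neq p_w
$$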

**Does CLT apply?**

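Yes. Although race and call are categorical, call is a binary (0/1) variable, so the callback rate of each group is a sample proportion, which is the mean of independent 0/1 outcomes. With 2,435 resumes per group, and at least 157 callbacks and 2,200 non-callbacks in each, the CLT implies that these sample proportions are approximately normally distributed. This is what justifies the normal-approximation confidence intervals computed later in this notebook; the chi-square test is itself a large-sample approximation of the same kind.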

**Compute margin of error, confidence interval, and p-value.**

We use the chi-square test of independence implemented in the scipy package:

In [96]:
chi2, p_value, dof, expected_frec = stats.chi2_contingency(
    observed=data_race_vs_calls, correction=True)

In [97]:
print("The p-value associated to this test is: {}".format(p_value))

The p-value associated to this test is: 4.997578389963255e-05


The p-value is the probability of observing a difference at least as extreme as the one in this dataset if race had no effect on the rate of callbacks. Here it is approximately $5 \times 10^{-5}$, so we can reject the null hypothesis at any conventional significance level. **This dataset suggests that the rate of callbacks depends on race.**
Now, let us use hacker statistics (simulation-based methods) to run the same hypothesis test.

In [98]:
def permutation_sample(data1, data2):
    """Return a permutation sample from two data sets."""

    # Concatenate the data sets:
    data = np.concatenate((data1, data2))

    # Randomly permute the concatenated array:
    permuted_data = np.random.permutation(data)

    # Separate the new array into two samples:
    permuted_sample_1 = permuted_data[:len(data1)]
    permuted_sample_2 = permuted_data[len(data1):]

    return permuted_sample_1, permuted_sample_2


def draw_permutation_replicates(data_1, data_2, function, size=1):
    """Draw multiple permutation replicates."""

    # Initialize array of permutation replicates:
    permutation_replicates = np.empty(size)

    for i in range(size):
        # Generate permutation samples:
        permutation_sample_1, permutation_sample_2 = permutation_sample(data_1, data_2)

        # Compute the test statistic:
        permutation_replicates[i] = function(permutation_sample_1, permutation_sample_2)

    return permutation_replicates

In [99]:
# Construct arrays of data: black,white
calls_b = np.array([True] *y_calls_black  + [False] * n_calls_black)
calls_w = np.array([True] * y_calls_white + [False] * n_calls_white)

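def frac(black, white):
    """Compute the fraction of callbacks in the first sample (race = b).

    The second sample is unused; it is accepted only to match the
    two-sample signature that draw_permutation_replicates calls.
    """
    return np.sum(black) / len(black)

# Acquire permutation replicates: perm_replicates
perm_replicates = draw_permutation_replicates(calls_b, calls_w, frac, 100000)

# Observed fraction of callbacks for race = b
observed_frac_b = y_calls_black / (y_calls_black + n_calls_black)

# One-sided p-value: share of replicates at least as low as the observed fraction
p = np.sum(perm_replicates <= observed_frac_b) / len(perm_replicates)
print("The p-value associated to this test is: {}".format(p))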


The p-value associated to this test is: 2e-05


As with the previous method, the p-value suggests that there is an association between race and the number of callbacks.
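
A simulated p-value of 2e-05 corresponds to roughly two of the 100,000 replicates, so the estimate is noisy. Under the same null hypothesis, the number of callbacks that land in the black-sounding group follows a hypergeometric distribution, so the one-sided probability that the simulation approximates can also be computed exactly. A minimal sketch, assuming the counts reported above and using only the standard library (the variable names are illustrative):

```python
from math import comb

total_resumes = 4870      # all resumes in the experiment
total_callbacks = 392     # callbacks across both groups
group_size = 2435         # resumes with black-sounding names
observed_callbacks = 157  # callbacks received by that group

# P(X <= observed) when group_size resumes are drawn at random from the pool:
# sum of C(K, k) * C(N - K, n - k) over k = 0..observed, divided by C(N, n).
favourable = sum(
    comb(total_callbacks, k)
    * comb(total_resumes - total_callbacks, group_size - k)
    for k in range(observed_callbacks + 1)
)
exact_p = favourable / comb(total_resumes, group_size)

print("Exact one-sided p-value: {:.3g}".format(exact_p))
```

This is the lower tail used by Fisher's exact test. Like the permutation p-value, it is one-sided, whereas the chi-square p-value is two-sided, so the two numbers are not expected to match exactly.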

Now, let us estimate the proportion of callbacks for each of the two races, along with the corresponding confidence intervals. In this case, the population for a given race is all people in the US labor market belonging to that race. We can then consider that our two samples (black and white) were chosen randomly from their respective populations.

In [100]:
"""
Black race
"""
prob_callbacks = b_callbacks_percent/100.0

N = data[data.race =="b"].race.count()


# The confidence level is passed positionally because its keyword name differs across SciPy versions
b_interval = stats.norm.interval(0.99,    # Confidence level
                   loc=prob_callbacks,    # Point estimate of proportion
                   scale=np.sqrt((prob_callbacks*(1-prob_callbacks))/N))  # Standard error of the proportion
print("99% confidence interval for \
proportion of callbacks when race = b\n{}".format(b_interval))

99% confidence interval for proportion of callbacks when race = b
(0.051656170777244395, 0.077296601296677578)


In [101]:
"""
White race
"""

prob_callbacks = w_callbacks_percent/100.0

N = data[data.race =="w"].race.count()


# The confidence level is passed positionally because its keyword name differs across SciPy versions
w_interval = stats.norm.interval(0.99,    # Confidence level
                   loc=prob_callbacks,    # Point estimate of proportion
                   scale=np.sqrt((prob_callbacks*(1-prob_callbacks))/N))  # Standard error of the proportion
print("99% confidence interval for \
proportion of callbacks when race = w\n{}".format(w_interval))

99% confidence interval for proportion of callbacks when race = w
(0.081095291775432607, 0.11192318871738054)


These two confidence intervals do not overlap, which suggests that there is a statistically significant difference between the callback proportions of the two races.
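
Comparing two separate intervals is a conservative check. A more direct way to address the margin-of-error question is an interval for the difference in callback rates, together with a two-proportion z-test. A minimal sketch under the normal approximation, using `statistics.NormalDist` from the standard library and the counts from the contingency table (the variable names are illustrative, and this is one common formulation rather than the only one):

```python
from math import sqrt
from statistics import NormalDist

yes_b, n_b = 157, 2435   # callbacks and resumes, black-sounding names
yes_w, n_w = 235, 2435   # callbacks and resumes, white-sounding names

p_b, p_w = yes_b / n_b, yes_w / n_w
diff = p_w - p_b

# 99% confidence interval for the difference (unpooled standard error).
z_crit = NormalDist().inv_cdf(0.995)
se_diff = sqrt(p_b * (1 - p_b) / n_b + p_w * (1 - p_w) / n_w)
margin = z_crit * se_diff
ci = (diff - margin, diff + margin)

# Two-sided z-test of H0: p_b == p_w (pooled standard error).
p_pool = (yes_b + yes_w) / (n_b + n_w)
se_pool = sqrt(p_pool * (1 - p_pool) * (1 / n_b + 1 / n_w))
z = diff / se_pool
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

print("Difference: {:.4f}, margin of error: {:.4f}".format(diff, margin))
print("99% CI for the difference: ({:.4f}, {:.4f})".format(*ci))
print("z = {:.3f}, two-sided p-value = {:.3g}".format(z, p_value))
```

If the interval for the difference excludes zero, the difference is significant at the 1% level under this approximation.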

**Does your analysis mean that race/name is the most important factor in callback success? Why or why not? If not, how would you amend your analysis?**

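Not necessarily. The analysis shows that callbacks are associated with race, but it does not compare the size of that effect with the effect of other resume attributes, such as years of experience or sex, so it cannot rank race as the most important factor. Such variables could also contribute in a way that changes our previous result. For example, let us check whether the variable sex changes this result.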

In [102]:
print(data.sex.value_counts())  # Count of each value in the sex column

f    3746
m    1124
Name: sex, dtype: int64


In [103]:
"""
Contingency table with columns(phoned) yes and no 
and rows(race and sex) with b and w and f and m respectively
"""
data_race_sex_vs_calls = data.pivot_table(index=["race", "sex"], columns="phoned",
                                          values="call", aggfunc="count")
data_race_sex_vs_calls

| race | sex | phoned = no | phoned = yes |
|---|---|---|---|
| b | f | 1761 | 125 |
| b | m | 517 | 32 |
| w | f | 1676 | 184 |
| w | m | 524 | 51 |


Chi-square test of independence.

Null hypothesis: the callback outcome is independent of the combined race and sex group.

Alternative hypothesis: the callback outcome is not independent of the combined race and sex group.

In [104]:
chi2, p_value, dof, expected_frec = stats.chi2_contingency(observed= data_race_sex_vs_calls,correction = True)

print("The p-value associated to this test is: {}".format(p_value))

The p-value associated to this test is: 0.00046857565208271426


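Adding the variable sex does not change the outcome of the independence test: the callback rate still differs significantly across groups. Note, however, that this test compares all four race-sex groups at once, so it does not by itself show that the effect of race holds within each sex. Other variables, such as years of experience, may also change this result.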
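
One way to look at race within each sex, rather than across the four combined groups, is to repeat the two-group comparison separately for women and for men. A minimal sketch, assuming the counts in the race-by-sex table above and a pooled two-proportion z-test (the helper `two_proportion_p_value` is hypothetical, written only for this illustration):

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_p_value(yes_1, n_1, yes_2, n_2):
    """Two-sided p-value for H0: both groups share one callback rate."""
    pooled = (yes_1 + yes_2) / (n_1 + n_2)
    se = sqrt(pooled * (1 - pooled) * (1 / n_1 + 1 / n_2))
    z = (yes_2 / n_2 - yes_1 / n_1) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# (no, yes) counts from the race-by-sex contingency table.
counts = {
    "f": {"b": (1761, 125), "w": (1676, 184)},
    "m": {"b": (517, 32), "w": (524, 51)},
}

p_values = {}
for sex, by_race in counts.items():
    no_b, yes_b = by_race["b"]
    no_w, yes_w = by_race["w"]
    p_values[sex] = two_proportion_p_value(
        yes_b, no_b + yes_b, yes_w, no_w + yes_w)
    print("sex = {}: two-sided p-value = {:.3g}".format(sex, p_values[sex]))
```

Each stratum contains fewer resumes than the full sample, especially for men (1,124 versus 3,746 for women), so the per-stratum tests have less power to detect a given difference.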