# Causal Inference
# School of Information, University of Michigan 
## Week 1

### Resources:
 - Course Manual, which can be found in Coursera
 - [Intro to Pandas Data Structures](assets/IntroPandasStructures.pdf)
 - [Bertrand and Mullainathan, 2004](assets/BertrandMullainathan2004.pdf)

## Part 1

### Background

We will be using two datasets for this assignment. The first one (lecture1_survey.csv) is an observational dataset that comes from a labor market survey conducted in Chicago and Boston in 2001. The second one (lecture1_random_exper.csv) comes from a randomized experiment conducted by Marianne Bertrand and Sendhil Mullainathan, who sent fictitious resumes out to employers in response to job adverts in Boston and Chicago in 2001. The resumes differ in various attributes including the names of the applicants, and different resumes were randomly allocated to job openings. Some of the names are distinctly white sounding and some distinctly black sounding.


### Data

The data file “lecture1_survey.csv” contains 4 variables for 10,593 individuals. Below are descriptions of each variable in the data:

- *black*: dummy variable for race; equal to 0 if white, 1 if black
- *yearsexp*: years of work experience
- *somecol_more*: 1 if college dropout or college degree or college degree and more, 0 otherwise
- *employed*: 1 if employed, 0 if unemployed

In [1]:
#Import statements. Run this cell.

import pandas as pd
import numpy as np
import statsmodels.api as sm
import statsmodels.stats.api as sms
from scipy import stats

In [2]:
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

In [3]:
#Run this cell to get the data for Part 1.
data_survey = pd.read_csv('assets/lecture1_survey.csv')

#The line below lets you see the first five lines of the data_survey dataframe. Uncomment and run to view.
#data_survey.head()

### Questions

We are interested in exploring whether there is racial discrimination against African-Americans in the labor market. We have two sets of data, observational (survey) and experimental, that we can use for our investigation.

**Note:** You can refer to the manual for the methods we use in the assignment if you need to.

**Use the data_survey dataframe uploaded above to answer the questions in Part 1 unless otherwise specified.**

**1.** We will start by checking for covariate balance. Use the statsmodels.stats or scipy stats module to run a (two-sided) t-test to test the following null hypothesis: the mean of *somecol_more* is the same for African-Americans and Whites. Assign the results of the ttest to the variable `ttest1_1`. (1 pt)

**Note:** For this balance test, we could also use a nonparametric test (Chi-squared test for binary variables) that relies on fewer assumptions.
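As a minimal sketch of the chi-squared alternative mentioned in the note, the snippet below builds a contingency table with `pd.crosstab` and passes it to `scipy.stats.chi2_contingency`. The small `toy` dataframe is made up for illustration and only reuses the column names from the survey data; it is not the assignment data.

```python
import pandas as pd
from scipy import stats

# Hypothetical toy data (NOT the survey data); same column names for illustration.
toy = pd.DataFrame({
    "black":        [0] * 10 + [1] * 10,
    "somecol_more": [1, 1, 1, 1, 1, 1, 1, 0, 0, 0,   # 7 of 10 in the first group
                     1, 1, 1, 0, 0, 0, 0, 0, 0, 0],  # 3 of 10 in the second group
})

# 2x2 table of counts: rows are somecol_more, columns are black.
table = pd.crosstab(toy["somecol_more"], toy["black"])

# chi2_contingency returns the statistic, the p-value, the degrees of freedom,
# and the expected counts under independence. For 2x2 tables it applies
# Yates' continuity correction by default.
chi2, p_value, dof, expected = stats.chi2_contingency(table)

print(table)
print(f"chi2 = {chi2:.3f}, p = {p_value:.3f}, dof = {dof}")
```

The same call should work on `pd.crosstab(data_survey['somecol_more'], data_survey['black'])`, assuming neither column has missing values.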

In [4]:
data_survey_black = data_survey.loc[data_survey['black'] == 1]
data_survey_white = data_survey.loc[data_survey['black'] == 0]

data_survey_white.head()

|  | yearsexp | somecol_more | col_more | employed | black |
|---|---|---|---|---|---|
| 2 | 16 | 0 | 0 | 1 | 0 |
| 3 | 8 | 1 | 0 | 1 | 0 |
| 5 | 5 | 0 | 0 | 0 | 0 |
| 6 | 20 | 1 | 0 | 1 | 0 |
| 7 | 19 | 0 | 0 | 1 | 0 |


In [5]:
# YOUR CODE HERE
ttest1_1 = stats.ttest_ind(data_survey_black['somecol_more'], data_survey_white['somecol_more'])



In [6]:
# Hidden Tests, checking t-statistic and p-value of ttest1_1.

**2.** Based on the p-value, do you find evidence that this variable (somecol_more) differs between African-Americans and Whites at the 5% significance level? Explain. (i.e. report BOTH the p-value AND the decision rule based on p-value to determine if they differ at the 5% significance level) (1 pt) 
 
**Note**: This question will be manually graded.

H0: μ somecol_more(white) == μ somecol_more(black)
H1: μ somecol_more(white) != μ somecol_more(black)

t test output: Ttest_indResult(statistic=-11.24095598344323, pvalue=3.7514090837794205e-29)

The p-value for this two-sided t-test is extremely small, about 3.75e-29. If the null hypothesis were true, the probability of observing a difference in means at least this large would be about 3.75e-29, so the observed difference is very unlikely to be due to chance alone. Decision rule: reject the null hypothesis if the p-value is below 0.05. Since 3.75e-29 < 0.05, there is enough evidence to reject the null hypothesis in favor of the alternative at the 5% significance level.

**3.** Use the statsmodels.stats or scipy stats module to run a (two-sided) t-test to test the following null hypothesis: the mean of years of experience (*yearsexp*) is the same for African-Americans and Whites. Assign the results of the ttest to the variable `ttest1_3`. (1 pt)

In [7]:
# YOUR CODE HERE
ttest1_3 = stats.ttest_ind(data_survey_black['yearsexp'], data_survey_white['yearsexp'])
ttest1_3

Ttest_indResult(statistic=-20.665913460336267, pvalue=4.743084195169169e-93)

In [8]:
# Hidden Tests, checking the t-statistic and p-value of ttest1_3.

**4.** Based on the t-statistic, do you find evidence that this variable (yearsexp) differs between African-Americans and Whites at the 5% significance level? Explain. (i.e. report BOTH the t-statistic AND the decision rule based on t-statistic to determine if they differ at the 5% significance level) (1 pt) 

**Note**: This question will be manually graded

H0: μ yearsexp(white) == μ yearsexp(black)
H1: μ yearsexp(white) != μ yearsexp(black)

Ttest_indResult(statistic=-20.665913460336267, pvalue=4.743084195169169e-93)

The t-statistic for this two-sided t-test is about -20.67. Decision rule: reject the null hypothesis at the 5% significance level if the absolute value of the t-statistic exceeds the critical value, which is approximately 1.96 for a sample this large. Since |-20.67| = 20.67 > 1.96, there is enough evidence to reject the null hypothesis in favor of the alternative. The corresponding p-value, about 4.74e-93, leads to the same conclusion.
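The decision rule based on the t-statistic compares |t| with the two-sided critical value. Below is a minimal sketch; the helper `reject_by_t` is hypothetical, and the degrees of freedom assume a pooled-variance test on all 10,593 individuals with no missing values, which is an assumption rather than something checked against the data.

```python
from scipy import stats

def reject_by_t(t_stat, df, alpha=0.05):
    """Two-sided decision rule: reject H0 when |t| exceeds the critical value."""
    t_crit = stats.t.ppf(1 - alpha / 2, df)  # upper alpha/2 quantile of Student's t
    return abs(t_stat) > t_crit, t_crit

# The statistic is the one reported for yearsexp; df is an assumption
# (total sample size of 10,593 minus 2 for a pooled-variance test).
reject, t_crit = reject_by_t(-20.665913460336267, df=10593 - 2)
print(f"critical value = {t_crit:.3f}, reject H0: {reject}")
```

With this many degrees of freedom the critical value is very close to the normal-distribution value of 1.96.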

**5.** Discuss your results from 1 and 3 above. Why do we care about whether these variables look similar by race for the purpose of investigating racial discrimination in the labor market? (1 pt) 

**Note**: This question will be manually graded.

It is important to take differences between these two variables into account amongst participants who self-identify as white and black respectively in order to prevent potential bias in HR recruiting systems, among other things. Specifically, if a recruiting platform is automated to root out applicants below a certain education/work experience threshold, this may have the unintended consequence of excluding cer

**6.** The main outcome variable is the indicator variable for employment status (*employed*). Use the statsmodels.stats or scipy stats module to run a (two-sided) t-test to test the null hypothesis that the means for the variable *employed* are the same for the two races.

In [9]:
# YOUR CODE HERE
ttest1_6 = stats.ttest_ind(data_survey_black['employed'], data_survey_white['employed'])


Based on the p-value above, do you find evidence that this variable differs significantly at the 5% level between races? Explain. (i.e. report BOTH the p-value AND the decision rule based on p-value to determine if they differ at the 5% significance level) (1 pt)

**Note**: This question will be manually graded.

H0: μ employed(white) == μ employed(black)
H1: μ employed(white) != μ employed(black)

Ttest_indResult(statistic=-7.1420030594715485, pvalue=9.802071342176244e-13)

The p-value for this two-sided t-test is extremely small, about 9.80e-13. If the null hypothesis were true, the probability of observing a difference in employment rates at least this large would be about 9.80e-13, so the observed difference is very unlikely to be due to chance alone. Decision rule: reject the null hypothesis if the p-value is below 0.05. Since 9.80e-13 < 0.05, there is enough evidence to reject the null hypothesis in favor of the alternative at the 5% significance level.

**7.** In light of your results, can we conclude from the observational (survey) data that there is racial discrimination against African-Americans in employment? Yes or no? Explain. (1 pt) 

**Note**: This question will be manually graded.

No. Setting aside my personal feelings on the subject (that systemic discrimination within the job market is quite real), the results above show clear differences between white and black individuals in education status, years of prior work experience, and, most importantly, employment status. However, it is difficult to comment on the magnitude of these differences, as the p-values reported carry no information about effect size (for example, (mean1 - mean2) / SD for one of the groups). Furthermore, because the two groups also differ in education and experience, the employment gap cannot be attributed to race alone. It would be more insightful to compare black and white participants with equivalent levels of education and experience, in order to better isolate potential discrimination (employed vs. not employed). In sum, while there is clearly a differential in employment status between the two groups, the observational data alone cannot tell us whether it is due to discrimination or to other factors.
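The effect-size idea mentioned above can be sketched as a standardized mean difference. The snippet uses Cohen's d with a pooled standard deviation, which is one common variant (the answer above describes dividing by a single group's SD instead). The helper `cohens_d` and the toy arrays are hypothetical and are not taken from the survey data.

```python
import numpy as np

def cohens_d(x, y):
    """Standardized mean difference using the pooled sample standard deviation."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1)) / (nx + ny - 2)
    return (x.mean() - y.mean()) / np.sqrt(pooled_var)

# Hypothetical 0/1 employment indicators, not the survey data.
group_a = [1, 1, 1, 1, 0, 1, 1, 0, 1, 1]   # 80% employed
group_b = [1, 0, 1, 0, 1, 1, 0, 1, 0, 1]   # 60% employed
d = cohens_d(group_a, group_b)
print(f"Cohen's d = {d:.2f}")
```

Unlike the p-value, this quantity does not shrink toward zero simply because the sample is large, which is why it is a useful companion to the t-test.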

## Part 2
### Data

The data file “lecture1_random_exper.csv” contains 8 variables for 4,870 observations. Below are descriptions of each variable in the data:

* *id*: a de-identified identifier to represent unique observations (resumes)
* *male*: dummy variable for gender; equal to 0 if female, 1 if male
* *black*: dummy variable for race; equal to 0 if resume has a white sounding name and 1 if resume has a black sounding name
* *education*: 0 if not reported, 1 if high school dropout, 2 if high school graduate, 3 if college dropout, 4 if has college degree or more
* *computerskills*: 1 if resume mentions some computer skills, 0 otherwise
* *ofjobs*: number of jobs listed on the resume
* *yearsexp*: number of years of work experience on the resume
* *call*: 1 if applicant was called back, 0 otherwise

In [10]:
#Run this cell to get the data for Part 2.
data_random = pd.read_csv('assets/lecture1_random_exper.csv')

#The line below lets you see the first five lines of the data_random. Uncomment and run to view.
#data_random.head()

### Questions

In Bertrand and Mullainathan (2004), the researchers were interested in learning whether resumes with black-sounding names receive fewer callbacks for interviews than resumes with white-sounding names.

**Use the data_random dataframe uploaded above to answer the questions in Part 2 unless otherwise specified.**

**1.** What is the treatment variable ($D_{i}$)? Assign the correct data_random column to the variable `d_i`. What is the outcome variable ($Y_{i}$)? Assign the correct data_random column to the variable `y_i`. (1 pt)

In [11]:
data_random.head()

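|  | education | ofjobs | yearsexp | computerskills | call | id | male | black |
|---|---|---|---|---|---|---|---|---|
| 0 | 4 | 2 | 6 | 1 | 0 | 1 | 0 | 0 |
| 1 | 3 | 3 | 6 | 1 | 0 | 2 | 0 | 0 |
| 2 | 4 | 1 | 6 | 1 | 0 | 3 | 0 | 1 |
| 3 | 3 | 4 | 6 | 1 | 0 | 4 | 0 | 1 |
| 4 | 3 | 3 | 22 | 1 | 0 | 5 | 0 | 0 |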


In [12]:
# YOUR CODE HERE
y_i = data_random['call']
d_i = data_random['black']

In [13]:
# Hidden Tests, checking the columns assigned to d_i and y_i.

**2.** Remember the potential outcomes framework. By using the `loc` method, check for the treatment status of observation/individual id = 345. Based on what you see, what is the counterfactual outcome for observation/individual id = 345, $Y_{345}^{black}$ or $Y_{345}^{white}$? (You can type “Y^b_345” for $Y_{345}^{black}$ and “Y^w_345” for $Y_{345}^{white}$.) (1 pt)

**Note**: This question will be manually graded.

In [14]:
data_random_345 = data_random.loc[data_random['id'] == 345]
data_random_345

|  | education | ofjobs | yearsexp | computerskills | call | id | male | black |
|---|---|---|---|---|---|---|---|---|
| 344 | 4 | 3 | 7 | 1 | 0 | 345 | 0 | 1 |


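The resume with id 345 has black = 1, so it was sent with a black-sounding name and its observed ("actual") outcome is Y^b_345. The counterfactual outcome is therefore Y^w_345: the callback outcome the same resume would have had with a white-sounding name, which we never observe.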
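To make the observed versus counterfactual distinction concrete, here is a small sketch of the potential outcomes bookkeeping. The table is entirely hypothetical: in real data only one of the two potential outcomes is ever observed for a given resume, so the columns `y_black` and `y_white` could never both be filled in.

```python
import pandas as pd

# Hypothetical potential outcomes for three resumes (illustration only).
toy = pd.DataFrame({
    "id":      [1, 2, 3],
    "black":   [1, 0, 1],      # treatment status D_i
    "y_black": [0, 0, 1],      # callback if sent with a black-sounding name
    "y_white": [1, 0, 1],      # callback if sent with a white-sounding name
})

# Observed outcome: Y_i = D_i * Y_i^black + (1 - D_i) * Y_i^white
toy["observed"] = toy["black"] * toy["y_black"] + (1 - toy["black"]) * toy["y_white"]

# Counterfactual outcome: the potential outcome under the other treatment status.
toy["counterfactual"] = toy["black"] * toy["y_white"] + (1 - toy["black"]) * toy["y_black"]

print(toy)
```

For a row with `black == 1`, the observed column picks up `y_black` and the counterfactual column picks up `y_white`, mirroring the reasoning for id 345.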

**3.** Researchers designed the experiment such that the treatment and control groups are balanced on all other variables that can affect callback rates.

**3a.** Cross tabulate variables *male*, *computerskills*, *education*, and *ofjobs* with the variable *black*.

**Tip:** Use ``pd.crosstab()``. An example of a cross-tabulated table is shown below.


![Male variable cross-tabulated with black variable](assets/crosstab_example.png)


In [15]:
# YOUR CODE HERE


data_crosstab_male = pd.crosstab(data_random['male'], data_random['black'])

data_crosstab_male

| male | black = 0 | black = 1 |
|---|---|---|
| 0 | 1860 | 1886 |
| 1 | 575 | 549 |


In [16]:
data_crosstab_computerskills = pd.crosstab(data_random['computerskills'], data_random['black'])

data_crosstab_computerskills

| computerskills | black = 0 | black = 1 |
|---|---|---|
| 0 | 466 | 408 |
| 1 | 1969 | 2027 |


In [17]:
data_crosstab_education = pd.crosstab(data_random['education'], data_random['black'])

data_crosstab_education

| education | black = 0 | black = 1 |
|---|---|---|
| 0 | 18 | 28 |
| 1 | 18 | 22 |
| 2 | 142 | 132 |
| 3 | 513 | 493 |
| 4 | 1744 | 1760 |


In [18]:
data_crosstab_ofjobs = pd.crosstab(data_random['ofjobs'], data_random['black'])

data_crosstab_ofjobs

| ofjobs | black = 0 | black = 1 |
|---|---|---|
| 1 | 54 | 56 |
| 2 | 347 | 357 |
| 3 | 726 | 703 |
| 4 | 800 | 811 |
| 5 | 258 | 275 |
| 6 | 243 | 221 |
| 7 | 7 | 12 |


Do these variables look similar by race? (1 pt)

**Note**: This question will be manually graded.

Visually, there do not seem to be any eye-catching differences in these variables by race. For any given gender, computer-skills indicator, education level, and number of jobs listed, the counts for resumes with white-sounding and black-sounding names are close. Notably, however, the vast majority of resumes in the data are female, as shown by the data_crosstab_male table (3,746 female and 1,124 male).
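Raw counts can be hard to compare by eye, so one option is to look at within-group shares instead. The sketch below uses the `normalize` argument of `pd.crosstab`; the `toy` dataframe is hypothetical and only borrows the column names from the experiment data.

```python
import pandas as pd

# Hypothetical toy data (not the experiment data), same column names.
toy = pd.DataFrame({
    "black": [0, 0, 0, 0, 1, 1, 1, 1],
    "male":  [0, 0, 0, 1, 0, 0, 1, 0],
})

# normalize="columns" divides each cell by its column total, so each column
# shows the share of female (0) and male (1) resumes within a race group.
shares = pd.crosstab(toy["male"], toy["black"], normalize="columns")
print(shares)
```

If the groups are balanced, the two columns of shares should be close to each other for every row.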


**3b.** Generate a table with the mean and standard deviation of the variable *yearsexp* for different values of black.

**Tip:** Use the ``groupby()`` and ``describe()`` methods.

In [19]:
# YOUR CODE HERE
yearsexp_df =  data_random[["yearsexp", "black"]]
yearsexp_df = yearsexp_df.groupby('black').describe()
yearsexp_df

yearsexp, by black:

| black | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| 0 | 2435.0 | 7.856263 | 5.079228 | 1.0 | 5.0 | 6.0 | 9.0 | 26.0 |
| 1 | 2435.0 | 7.829569 | 5.010764 | 1.0 | 5.0 | 6.0 | 9.0 | 44.0 |


Does this variable look similar by race? (1 pt)

**Note**: This question will be manually graded.

The number of resumes in each group is identical (2,435), and the respective means and standard deviations are nearly identical. This implies that experience levels are roughly equivalent across the two groups. The only notable difference is the maximum: at least one resume with a black-sounding name lists forty-four years of prior experience, compared with a maximum of twenty-six for resumes with white-sounding names.
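If only the mean and standard deviation are needed, `agg` can replace `describe()`. A minimal sketch on a hypothetical toy dataframe (not the experiment data):

```python
import pandas as pd

# Hypothetical toy data, same column names as data_random.
toy = pd.DataFrame({
    "black":    [0, 0, 0, 1, 1, 1],
    "yearsexp": [2, 4, 6, 1, 4, 7],
})

# One row per value of black, with only the two requested statistics.
# pandas uses the sample standard deviation (ddof=1) by default.
summary = toy.groupby("black")["yearsexp"].agg(["mean", "std"])
print(summary)
```

The same chain should work with `data_random` in place of `toy`.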

**4.** The outcome variable in the data set is the variable *call*, which indicates a callback for an interview. Let’s look at the means of this variable conditional on race.

**4a.** Use the statsmodels.stats or scipy stats module to run a (two-sided) t-test to test the following null hypothesis: the callback rates (means of variable *call*) are the same for resumes with black-sounding names as they are for resumes with white-sounding names. Assign the output to the variable `ttest2_4_1`. (1 pt)



In [20]:
# YOUR CODE HERE

data_random_black = data_random.loc[data_random['black'] == 1]
data_random_white = data_random.loc[data_random['black'] == 0]


ttest2_4_1 = stats.ttest_ind(data_random_white['call'], data_random_black['call'])


In [21]:
# Hidden Tests, checking the t-statistic and p-value of ttest2_4_1.

**4b.** Based on the p-value, do you find evidence that there is discrimination in callback rates? That is, do callback rates differ significantly at the 5% level between resumes with white-sounding names and resumes with black-sounding names? Explain. (i.e. report BOTH the p-value AND the decision rule based on p-value to determine if they differ at the 5% significance level) (1 pt)

**Note**: This question will be manually graded.

H0: μ call(white) == μ call(black)
H1: μ call(white) != μ call(black)

Ttest_indResult(statistic=4.114705266723095, pvalue=3.9408025140695284e-05)

The p-value for this two-sided t-test is small, about 3.94e-05. If the null hypothesis were true, the probability of observing a difference in callback rates at least this large would be about 3.94e-05, so the observed difference is very unlikely to be due to chance alone. Decision rule: reject the null hypothesis if the p-value is below 0.05. Since 3.94e-05 < 0.05, there is enough evidence to reject the null hypothesis in favor of the alternative at the 5% significance level. This finding supports the claim that there is a discriminatory differential in callback rates, especially since education, prior job experience, number of jobs, and computer skills are quite similar across the two groups of resumes.
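The t-test says whether the callback rates differ, but not by how much. One way to report the size of the gap is the difference in rates with a normal-approximation confidence interval. The helper `rate_gap_ci` and the toy callback indicators below are hypothetical and are not the experiment data.

```python
import numpy as np
from scipy import stats

def rate_gap_ci(calls_a, calls_b, alpha=0.05):
    """Difference in callback rates (a minus b) with a normal-approximation CI."""
    a, b = np.asarray(calls_a, dtype=float), np.asarray(calls_b, dtype=float)
    pa, pb = a.mean(), b.mean()
    # Standard error of a difference between two independent proportions.
    se = np.sqrt(pa * (1 - pa) / len(a) + pb * (1 - pb) / len(b))
    z = stats.norm.ppf(1 - alpha / 2)
    gap = pa - pb
    return gap, (gap - z * se, gap + z * se)

# Hypothetical callback indicators, not the experiment data.
white_calls = [1] * 10 + [0] * 90   # 10% callback rate
black_calls = [1] * 6 + [0] * 94    # 6% callback rate
gap, (low, high) = rate_gap_ci(white_calls, black_calls)
print(f"gap = {gap:.3f}, 95% CI = ({low:.3f}, {high:.3f})")
```

With the real data, the inputs would be `data_random_white['call']` and `data_random_black['call']`.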

**5.** Let’s see if the results change when we compare callback rates conditional on gender.

**5a.** Use the statsmodels.stats or scipy stats module to carry out a (two-sided) t-test to test the hypothesis that callback rates are equal for females across races. Assign the result to the variable `ttest2_5_1`. (1 pt)



In [26]:
# YOUR CODE HERE
data_random_black_female = data_random.loc[(data_random["black"] == 1) & (data_random["male"] == 0)]
data_random_white_female = data_random.loc[(data_random["black"] == 0) & (data_random["male"] == 0)]

ttest2_5_1 = stats.ttest_ind(data_random_black_female['call'], data_random_white_female['call'])

In [27]:
ttest2_5_1

Ttest_indResult(statistic=-3.6369213964305627, pvalue=0.0002796319942029361)

In [28]:
# Hidden Tests, checking the t-statistic and p-value of ttest2_5_1.

**5b.** Use the statsmodels.stats or scipy stats module to carry out a (two-sided) t-test to test the hypothesis that callback rates are equal for males across races. Assign the result to the variable ``ttest2_5_2``. (1 pt)

In [29]:
# YOUR CODE HERE
data_random_black_male = data_random.loc[(data_random["black"] == 1) & (data_random["male"] == 1)]
data_random_white_male = data_random.loc[(data_random["black"] == 0) & (data_random["male"] == 1)]

ttest2_5_2 = stats.ttest_ind(data_random_black_male['call'], data_random_white_male['call'])

In [30]:
ttest2_5_2

Ttest_indResult(statistic=-3.6369213964305627, pvalue=0.0002796319942029361)

In [31]:
# Hidden Tests, checking the t-statistic and p-value of ttest2_5_2.
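The two gender-specific tests follow the same pattern, so they can be run through one code path. Below is a minimal sketch; the helper `ttest_by_gender` is hypothetical, and the toy dataframe only borrows the column names from the experiment data.

```python
import pandas as pd
from scipy import stats

def ttest_by_gender(df):
    """Run the black-vs-white callback t-test separately for each value of `male`."""
    results = {}
    for male_value, group in df.groupby("male"):
        black_calls = group.loc[group["black"] == 1, "call"]
        white_calls = group.loc[group["black"] == 0, "call"]
        results[int(male_value)] = stats.ttest_ind(black_calls, white_calls)
    return results

# Hypothetical toy data (not the experiment data), same column names.
toy = pd.DataFrame({
    "male":  [0] * 8 + [1] * 8,
    "black": [0, 0, 0, 0, 1, 1, 1, 1] * 2,
    "call":  [1, 1, 1, 0, 0, 0, 1, 0,    # female resumes
              1, 0, 1, 0, 1, 0, 0, 1],   # male resumes
})

results = ttest_by_gender(toy)
for male_value, res in results.items():
    print(f"male = {male_value}: t = {res.statistic:.3f}, p = {res.pvalue:.3f}")
```

Keeping both subsets in one loop reduces the chance of copy-paste mistakes when filtering, and printing the results side by side makes it easy to confirm that the female and male tests were run on different subsets.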