# Causal Inference
# School of Information, University of Michigan 
## Week 1

### Resources:
 - Course Manual, which can be found in Coursera
 - [Intro to Pandas Data Structures](assets/IntroPandasStructures.pdf)
 - [Bertrand and Mullainathan, 2004](assets/BertrandMullainathan2004.pdf)

## Part 1

### Background

We will be using two datasets for this assignment. The first one (lecture1_survey.csv) is an observational dataset that comes from a labor market survey conducted in Chicago and Boston in 2001. The second one (lecture1_random_exper.csv) comes from a randomized experiment conducted by Marianne Bertrand and Sendhil Mullainathan, who sent fictitious resumes out to employers in response to job adverts in Boston and Chicago in 2001. The resumes differ in various attributes including the names of the applicants, and different resumes were randomly allocated to job openings. Some of the names are distinctly white sounding and some distinctly black sounding.


### Data

The data file “lecture1_survey.csv” contains 4 variables for 10,593 individuals. Below are descriptions of each variable in the data:

- *black*: dummy variable for race; equal to 0 if white, 1 if black
- *yearsexp*: years of work experience
- *somecol_more*: 1 if college dropout or college degree or college degree and more, 0 otherwise
- *employed*: 1 if employed, 0 if unemployed

In [1]:
#Import statements. Run this cell.

import pandas as pd
import numpy as np
import statsmodels.api as sm
import statsmodels.stats.api as sms
from scipy import stats

In [2]:
#Run this cell to get the data for Part 1.
data_survey = pd.read_csv('assets/lecture1_survey.csv')

#The line below shows the first five rows of the data_survey dataframe.
data_survey.head()

| | yearsexp | somecol_more | col_more | employed | black |
|---|---|---|---|---|---|
| 0 | 7 | 0 | 0 | 1 | 1 |
| 1 | 4 | 1 | 0 | 1 | 1 |
| 2 | 16 | 0 | 0 | 1 | 0 |
| 3 | 8 | 1 | 0 | 1 | 0 |
| 4 | 11 | 0 | 0 | 0 | 1 |


### Questions

We are interested in exploring whether there is racial discrimination against African-Americans in the labor market. We have two sets of data, observational (survey) and experimental, that we can use for our investigation.

**Note:** You can refer to the manual for the methods we use in the assignment if you need to.

**Use the data_survey dataframe uploaded above to answer the questions in Part 1 unless otherwise specified.**

**1.** We will start by checking for covariate balance. Use the statsmodels.stats or scipy stats module to run a (two-sided) t-test to test the following null hypothesis: the mean of *somecol_more* is the same for African-Americans and Whites. Assign the results of the t-test to the variable `ttest1_1`. (1 pt)

**Note:** For this balance test, we could also use a nonparametric test (Chi-squared test for binary variables) that relies on fewer assumptions.
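As a rough illustration of the nonparametric alternative mentioned in the note, the sketch below runs a Chi-squared test of independence on a 2x2 table of counts with `scipy.stats.chi2_contingency`. The counts are hypothetical and are not taken from lecture1_survey.csv; with the real data, the table could be built with `pd.crosstab`.

```python
import numpy as np
from scipy import stats

# Hypothetical 2x2 counts (NOT from lecture1_survey.csv):
# rows = group (0, 1), columns = binary covariate (0, 1)
table = np.array([[300, 700],
                  [450, 550]])

# correction=False turns off the Yates continuity correction,
# giving the plain Pearson Chi-squared statistic.
chi2, p_value, dof, expected = stats.chi2_contingency(table, correction=False)

print(f"chi2 = {chi2:.3f}, dof = {dof}, p = {p_value:.3g}")
print("expected counts under independence:")
print(expected)
```

The decision rule is the same as for the t-test: reject the null hypothesis of equal proportions when the p-value is below 0.05.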

In [3]:
black = data_survey[data_survey['black'] == 1]
white = data_survey[data_survey['black'] == 0]

ttest1_1 = stats.ttest_ind(black['somecol_more'], white['somecol_more'])
ttest1_1
#raise NotImplementedError()

Ttest_indResult(statistic=-11.24095598344323, pvalue=3.7514090837794205e-29)

In [4]:
# Hidden Tests, checking t-statistic and p-value of ttest1_1.

**2.** Based on the p-value, do you find evidence that this variable (somecol_more) differs between African-Americans and Whites at the 5% significance level? Explain. (i.e. report BOTH the p-value **AND** the decision rule based on p-value to determine if the difference is significant at the 5% level.) (1 pt)
 
**Note**: This question will be manually graded.

Yes, the somecol_more variable differs between the two groups at the 5% significance level. The p-value for the comparison is 3.75e-29, which is far below 0.05. The decision rule is to reject the null hypothesis when the p-value is less than 0.05: if the group means were truly equal, a difference at least this large would be extremely unlikely to arise by chance alone. As a result, we reject the null hypothesis that the mean of somecol_more is the same for the two groups.
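The p-value decision rule used in this answer can be written as a small helper. This is only a sketch: the function name `decide` is hypothetical, and the second p-value is made up for contrast.

```python
def decide(p_value, alpha=0.05):
    """Return the decision for a test at significance level alpha."""
    return "reject H0" if p_value < alpha else "fail to reject H0"

# p-value reported for ttest1_1 in the notebook output
print(decide(3.7514090837794205e-29))

# Hypothetical p-value, shown only to illustrate the other branch
print(decide(0.20))
```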

**3.** Use the statsmodels.stats or scipy stats module to run a (two-sided) t-test to test the following null hypothesis: the mean of years of experience (*yearsexp*) is the same for African-Americans and Whites. Assign the results of the t-test to the variable `ttest1_3`. (1 pt)

In [5]:
ttest1_3 = stats.ttest_ind(black['yearsexp'], white['yearsexp'])
ttest1_3
#raise NotImplementedError()

Ttest_indResult(statistic=-20.665913460336267, pvalue=4.743084195169169e-93)

In [6]:
# Hidden Tests, checking the t-statistic and p-value of ttest1_3.

**4.** Based on the t-statistic, do you find evidence that this variable (yearsexp) differs between African-Americans and Whites at the 5% significance level? Explain. (i.e. report BOTH the t-statistic **AND** the decision rule based on t-statistic to determine if the difference is significant at the 5% level.) (1 pt) 

**Note**: This question will be manually graded

Yes, the yearsexp variable differs between the two groups at the 5% significance level. The t-statistic for the comparison is -20.67, whose absolute value far exceeds the two-sided critical value of about 1.96 (roughly 2). Under the null hypothesis, we expect the absolute t-statistic to exceed this critical value only 5% of the time by chance, so a value this extreme would be extremely unlikely if the group means were equal. As a result, we reject the null hypothesis that the mean of yearsexp is the same for the two groups.
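The critical value of roughly 2 can be checked with the t distribution in `scipy.stats`. The sketch below assumes a pooled two-sample t-test and that all 10,593 individuals in the data description enter the test with no missing values, so the degrees of freedom are 10,593 - 2.

```python
from scipy import stats

alpha = 0.05
n_total = 10593        # individuals stated in the data description (assumed all used)
df = n_total - 2       # degrees of freedom for a pooled two-sample t-test

# Two-sided test: put alpha / 2 in each tail.
critical_value = stats.t.ppf(1 - alpha / 2, df)

t_statistic = -20.665913460336267   # reported for ttest1_3
reject = abs(t_statistic) > critical_value

print(f"critical value = {critical_value:.3f}, reject H0: {reject}")
```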

**5.** Discuss your results from 1 and 3 above. Why do we care about whether these variables look similar by race for the purpose of investigating racial discrimination in the labor market? (1 pt) 

**Note**: This question will be manually graded.

For the purpose of investigating racial discrimination in the labor market, it is important to understand whether confounding variables influence any observed correlation between race and labor market outcomes. The tests in 1 and 3 show that both somecol_more and yearsexp differ significantly by race, so the two groups are not balanced on these covariates. If one group has higher rates of college education and more work experience, these differences could themselves explain part or all of any gap in employment, confounding the relationship between race and labor market outcomes.
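A small worked example may make the confounding argument concrete. All numbers below are hypothetical: employment depends only on college education, never on group membership, yet the raw employment rates still differ because the groups differ in their share of college-educated members.

```python
# Hypothetical rates and shares, chosen only to illustrate confounding.
employment_rate = {"college": 0.90, "no_college": 0.60}  # identical for both groups
college_share = {"group_a": 0.80, "group_b": 0.40}       # covariate imbalance

def overall_rate(share_college):
    """Group employment rate as a weighted average over education levels."""
    return (share_college * employment_rate["college"]
            + (1 - share_college) * employment_rate["no_college"])

rate_a = overall_rate(college_share["group_a"])
rate_b = overall_rate(college_share["group_b"])
gap = rate_a - rate_b

print(f"group_a: {rate_a:.2f}, group_b: {rate_b:.2f}, gap: {gap:.2f}")
# prints: group_a: 0.84, group_b: 0.72, gap: 0.12
```

The 12-point gap arises entirely from the education imbalance, which is why a raw difference in an observational sample cannot be read as a causal effect of group membership.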

**6.** The main outcome variable is the indicator variable for employment status (*employed*). Use the statsmodels.stats or scipy stats module to run a (two-sided) t-test to test the null hypothesis that the means for the variable *employed* are the same for the two races.

In [7]:
stats.ttest_ind(black['employed'], white['employed'])
#raise NotImplementedError()

Ttest_indResult(statistic=-7.1420030594715485, pvalue=9.802071342176244e-13)

Based on the p-value above, do you find evidence that this variable differs significantly at the 5% level between races? Explain. (i.e. report BOTH the p-value **AND** the decision rule based on p-value to determine if the difference is significant at the 5% level.) (1 pt)

**Note**: This question will be manually graded.

Yes, the employed variable differs between the two groups at the 5% significance level. The p-value for the comparison is 9.80e-13, which is far below 0.05. The decision rule is to reject the null hypothesis when the p-value is less than 0.05: if employment rates were truly equal, a difference at least this large would be extremely unlikely to arise by chance alone. As a result, we reject the null hypothesis that the mean of employed is the same for the two groups.

**7.** In light of your results, can we conclude from the observational (survey) data that there is racial discrimination against African-Americans in employment? Yes or no? Explain. (1 pt) 

**Note**: This question will be manually graded.

No. While employment rates differ by a statistically significant margin between the two groups, we cannot conclude from the observational survey data that there is racial discrimination against African-Americans in employment. The reason is that not all else is equal (ceteris paribus): college education and years of experience also differ significantly between the two groups. Additionally, if the groups differ on these observed variables, they may well differ on unobserved variables too. With these confounding variables present, the employment gap cannot be attributed to discrimination alone.

## Part 2
### Data

The data file “lecture1_random_exper.csv” contains 8 variables for 4,870 observations. Below are descriptions of each variable in the data:

* *id*: a de-identified identifier to represent unique observations (resumes)
* *male*: dummy variable for gender; equal to 0 if female, 1 if male
* *black*: dummy variable for race; equal to 0 if resume has a white sounding name and 1 if resume has a black sounding name
* *education*: 0 if not reported, 1 if high school dropout, 2 if high school graduate, 3 if college dropout, 4 if has college degree or more
* *computerskills*: 1 if resume mentions some computer skills, 0 otherwise
* *ofjobs*: number of jobs listed on the resume
* *yearsexp*: number of years of work experience on the resume
* *call*: 1 if applicant was called back, 0 otherwise

In [8]:
#Run this cell to get the data for Part 2.
data_random = pd.read_csv('assets/lecture1_random_exper.csv')

#The line below shows the first five rows of the data_random dataframe.
data_random.head()

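| | education | ofjobs | yearsexp | computerskills | call | id | male | black |
|---|---|---|---|---|---|---|---|---|
| 0 | 4 | 2 | 6 | 1 | 0 | 1 | 0 | 0 |
| 1 | 3 | 3 | 6 | 1 | 0 | 2 | 0 | 0 |
| 2 | 4 | 1 | 6 | 1 | 0 | 3 | 0 | 1 |
| 3 | 3 | 4 | 6 | 1 | 0 | 4 | 0 | 1 |
| 4 | 3 | 3 | 22 | 1 | 0 | 5 | 0 | 0 |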


### Questions

Bertrand and Mullainathan (2004) were interested in whether resumes with black-sounding names receive fewer callbacks for interviews than resumes with white-sounding names.

**Use the data_random dataframe uploaded above to answer the questions in Part 2 unless otherwise specified.**

**1.** What is the treatment variable ($D_{i}$)? Assign the correct data_random column to the variable `d_i`. What is the outcome variable ($Y_{i}$)? Assign the correct data_random column to the variable `y_i`. (1 pt)

In [9]:
d_i = data_random['black']
y_i = data_random['call']
#raise NotImplementedError()

In [10]:
# Hidden Tests, checking the columns assigned to d_i and y_i.

**2.** Remember the potential outcomes framework. By using the `loc` method, check for the treatment status of observation/individual id = 345. Based on what you see, what is the counterfactual outcome for observation/individual id = 345, $Y_{345}^{black}$ or $Y_{345}^{white}$? (You can type “Y^b_345” for $Y_{345}^{black}$ and “Y^w_345” for $Y_{345}^{white}$.) (1 pt)

**Note**: This question will be manually graded.

In [11]:
data_random.loc[data_random['id'] == 345]

| | education | ofjobs | yearsexp | computerskills | call | id | male | black |
|---|---|---|---|---|---|---|---|---|
| 344 | 4 | 3 | 7 | 1 | 0 | 345 | 0 | 1 |


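Y^w_345. Observation 345 has black = 1, so its observed outcome is Y^b_345 and its counterfactual outcome is Y^w_345.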
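The reasoning behind this answer can be sketched with the switching equation $Y_i = D_i Y_i^{black} + (1 - D_i) Y_i^{white}$: the treatment status selects which potential outcome is observed, and the other one is the counterfactual. The function names below are hypothetical, and the potential outcome values passed to `observed_outcome` are made up, since the counterfactual is never observed in real data.

```python
def observed_outcome(d, y_black, y_white):
    """Switching equation: Y = D * Y^black + (1 - D) * Y^white."""
    return d * y_black + (1 - d) * y_white

def counterfactual_label(d, unit_id):
    """Label of the potential outcome that is NOT observed for this unit."""
    return f"Y^w_{unit_id}" if d == 1 else f"Y^b_{unit_id}"

d_345 = 1  # black = 1 for id 345, as shown in the notebook output
print(counterfactual_label(d_345, 345))

# Hypothetical potential outcomes, for illustration only:
# with d = 1 the observed outcome equals the first argument (Y^black).
print(observed_outcome(d_345, y_black=0, y_white=1))
```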

**3.** Researchers designed the experiment such that the treatment and control groups are balanced on all other variables that can affect callback rates.

**3a.** Cross tabulate variables *male*, *computerskills*, *education*, and *ofjobs* with the variable *black*.

**Tip:** Use ``pd.crosstab()``. An example of a cross-tabulated table is shown below.


![Male variable cross-tabulated with black variable](assets/crosstab_example.png)


In [12]:
pd.crosstab(data_random['male'], data_random['black'])
#raise NotImplementedError()

| male | black = 0 | black = 1 |
|---|---|---|
| 0 | 1860 | 1886 |
| 1 | 575 | 549 |


In [13]:
pd.crosstab(data_random['computerskills'], data_random['black'])

| computerskills | black = 0 | black = 1 |
|---|---|---|
| 0 | 466 | 408 |
| 1 | 1969 | 2027 |


In [14]:
pd.crosstab(data_random['education'], data_random['black'])

| education | black = 0 | black = 1 |
|---|---|---|
| 0 | 18 | 28 |
| 1 | 18 | 22 |
| 2 | 142 | 132 |
| 3 | 513 | 493 |
| 4 | 1744 | 1760 |


In [15]:
pd.crosstab(data_random['ofjobs'], data_random['black'])

| ofjobs | black = 0 | black = 1 |
|---|---|---|
| 1 | 54 | 56 |
| 2 | 347 | 357 |
| 3 | 726 | 703 |
| 4 | 800 | 811 |
| 5 | 258 | 275 |
| 6 | 243 | 221 |
| 7 | 7 | 12 |


Do these variables look similar by race? (1 pt)

**Note**: This question will be manually graded.

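Yes, these variables look similar by race. For example, 575 resumes with white-sounding names and 549 with black-sounding names are male, and the counts for computer skills, education, and number of jobs are similarly close across the two groups. Despite researchers' best efforts, challenges outside their control, such as non-participation or dropout in trials, can still produce imbalances, but these variables look well balanced in this study.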
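Comparing column shares rather than raw counts can make the balance check easier to read. The sketch below rebuilds the *male* by *black* table from the counts reported in the notebook output and converts each column to proportions; with the full dataframe, `pd.crosstab(..., normalize="columns")` should give the same result directly.

```python
import pandas as pd

# Counts copied from the male x black cross-tabulation in the notebook output.
counts = pd.DataFrame(
    {0: [1860, 575], 1: [1886, 549]},
    index=pd.Index([0, 1], name="male"),
)
counts.columns.name = "black"

# Divide each column by its total so the columns are directly comparable.
shares = counts / counts.sum(axis=0)
print(shares.round(3))
```

A gap of about one percentage point in the male share is consistent with random assignment.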

**3b.** Generate a table with the mean and standard deviation of the variable *yearsexp* for different values of black.

**Tip:** Use the ``groupby()`` and ``describe()`` methods.

In [16]:
data_random.groupby(['black'])['yearsexp'].describe()
#raise NotImplementedError()

| black | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| 0 | 2435.0 | 7.856263 | 5.079228 | 1.0 | 5.0 | 6.0 | 9.0 | 26.0 |
| 1 | 2435.0 | 7.829569 | 5.010764 | 1.0 | 5.0 | 6.0 | 9.0 | 44.0 |


Does this variable look similar by race? (1 pt)

**Note**: This question will be manually graded.

Yes, this variable looks very similar by race. The counts are identical (2,435 each), and the means (7.86 vs. 7.83) and standard deviations (5.08 vs. 5.01) are nearly the same. The only notable difference is in the maximum values (26 vs. 44), which reflects outliers.
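The visual comparison can be backed by a formal test computed from the summary statistics alone. This sketch feeds the means, standard deviations, and counts from the `describe()` output into `scipy.stats.ttest_ind_from_stats`, assuming equal variances as in the other t-tests in this notebook.

```python
from scipy import stats

# Summary statistics copied from the describe() output for yearsexp.
result = stats.ttest_ind_from_stats(
    mean1=7.829569, std1=5.010764, nobs1=2435,   # black = 1
    mean2=7.856263, std2=5.079228, nobs2=2435,   # black = 0
    equal_var=True,
)
t_stat, p_value = result.statistic, result.pvalue

print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
```

The absolute t-statistic is well below 1.96, so we fail to reject the null hypothesis that mean years of experience is the same across the two groups.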

**4.** The outcome variable in the data set is the variable *call*, which indicates a callback for an interview. Let’s look at the means of this variable conditional on race.

**4a.** Use the statsmodels.stats or scipy stats module to run a (two-sided) t-test to test the following null hypothesis: the callback rates (means of variable *call*) are the same for resumes with black-sounding names as they are for resumes with white-sounding names. Assign the output to the variable `ttest2_4_1`. (1 pt)



In [17]:
black2 = data_random[data_random['black'] == 1]
white2 = data_random[data_random['black'] == 0]

ttest2_4_1 = stats.ttest_ind(black2['call'], white2['call'])
ttest2_4_1
#raise NotImplementedError()

Ttest_indResult(statistic=-4.114705266723095, pvalue=3.9408025140695284e-05)

In [18]:
# Hidden Tests, checking the t-statistic and p-value of ttest2_4_1.

**4b.** Based on the p-value, do you find evidence that there is discrimination in callback rates? That is, do callback rates differ significantly at the 5% level between resumes with white-sounding names and resumes with black-sounding names? Explain. (i.e. report BOTH the p-value **AND** the decision rule based on p-value to determine if the difference is significant at the 5% level.) (1 pt)

**Note**: This question will be manually graded.

Yes, we do find evidence that there is discrimination in callback rates at the 5% significance level. The p-value for the comparison is 3.94e-05, which is far below 0.05. The decision rule is to reject the null hypothesis when the p-value is less than 0.05: if callback rates were truly equal, a difference at least this large would be extremely unlikely to arise by chance alone. As a result, we reject the null hypothesis that callback rates are the same for resumes with white-sounding and black-sounding names. Additionally, because names were randomly assigned to resumes, all else is equal (ceteris paribus) across the two groups, which supports the conclusion that the names themselves drive the lower callback rates.
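Because *call* is a 0/1 variable, its mean is a callback rate, and the t-test compares two rates. The sketch below illustrates this with hypothetical counts (1,000 resumes per arm, 70 versus 100 callbacks), not the values in lecture1_random_exper.csv.

```python
import numpy as np
from scipy import stats

def binary_outcomes(n, n_ones):
    """Array of n zeros and ones containing exactly n_ones ones."""
    return np.array([1] * n_ones + [0] * (n - n_ones))

# Hypothetical arms: 7% vs. 10% callback rates.
call_treated = binary_outcomes(1000, 70)    # stands in for black == 1
call_control = binary_outcomes(1000, 100)   # stands in for black == 0

# Under random assignment, the difference in means estimates the average effect.
diff = call_treated.mean() - call_control.mean()
t_stat, p_value = stats.ttest_ind(call_treated, call_control)

print(f"difference in callback rates = {diff:.3f}")
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```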

**5.** Let’s see if the results change when we compare callback rates conditional on gender.

**5a.** Use the statsmodels.stats or scipy stats module to carry out a (two-sided) t-test to test the hypothesis that callback rates are equal for females across races. Assign the result to the variable `ttest2_5_1`. (1 pt)



In [19]:
black_fem2 = data_random[(data_random['black'] == 1) & (data_random['male'] == 0)]
white_fem2 = data_random[(data_random['black'] == 0) & (data_random['male'] == 0)]

ttest2_5_1 = stats.ttest_ind(black_fem2['call'], white_fem2['call'])
ttest2_5_1
#raise NotImplementedError()

Ttest_indResult(statistic=-3.6369213964305627, pvalue=0.0002796319942029361)

In [20]:
# Hidden Tests, checking the t-statistic and p-value of ttest2_5_1.

**5b.** Use the statsmodels.stats or scipy stats module to carry out a (two-sided) t-test to test the hypothesis that callback rates are equal for males across races. Assign the result to the variable ``ttest2_5_2``. (1 pt)

In [21]:
black_mal2 = data_random[(data_random['black'] == 1) & (data_random['male'] == 1)]
white_mal2 = data_random[(data_random['black'] == 0) & (data_random['male'] == 1)]

ttest2_5_2 = stats.ttest_ind(black_mal2['call'], white_mal2['call'])
ttest2_5_2
#raise NotImplementedError()

Ttest_indResult(statistic=-1.9501711134984252, pvalue=0.05140448724722174)

In [22]:
# Hidden Tests, checking the t-statistic and p-value of ttest2_5_2.
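
The subgroup comparisons in 5a and 5b repeat the same test within each gender, which can also be written as a loop over `groupby("male")`. The sketch below uses a small hypothetical dataframe with made-up callback counts; the helper `make_cell` is an assumption for illustration and not part of the assignment.

```python
import pandas as pd
from scipy import stats

def make_cell(male, black, n, n_calls):
    """Hypothetical resumes for one (male, black) cell with n_calls callbacks."""
    return pd.DataFrame({
        "male": male,
        "black": black,
        "call": [1] * n_calls + [0] * (n - n_calls),
    })

# Made-up counts, NOT taken from lecture1_random_exper.csv.
df = pd.concat(
    [make_cell(0, 0, 500, 50), make_cell(0, 1, 500, 30),
     make_cell(1, 0, 200, 18), make_cell(1, 1, 200, 14)],
    ignore_index=True,
)

results = {}
for male_value, sub in df.groupby("male"):
    # Same two-sided t-test as in 5a/5b, run within one gender at a time.
    t_stat, p_value = stats.ttest_ind(
        sub.loc[sub["black"] == 1, "call"],
        sub.loc[sub["black"] == 0, "call"],
    )
    results[int(male_value)] = (t_stat, p_value)

for male_value, (t_stat, p_value) in results.items():
    print(f"male = {male_value}: t = {t_stat:.3f}, p = {p_value:.4f}")
```

In this made-up example the smaller subgroup fails to reach significance even though its callback rates differ, which shows how subgroup sample size affects the power of the test.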