In [4]:
import numpy as np 
import pandas as pd 
import statsmodels.api as sm 
import statsmodels.formula.api as smf 
from itertools import combinations 
import plotnine as p

## Randomized Controlled Trial (RCT)

- Key idea: our goal is to find clever ways to remove bias so that the treatment group and the control group are comparable. Once they are, any difference we observe between them is the average effect of the treatment. In other words, we need to make association equal causation:

$$ E[Y|T=1] - E[Y|T=0] = \underbrace{E[Y_1 - Y_0|T=1]}_{ATET} + \underbrace{E[Y_0|T=1] - E[Y_0|T=0]}_{BIAS} $$
    

The first tool we have to make the bias vanish is **Randomized Experiments**.

In short, an RCT is a study design that **randomly assigns** participants to an experimental group or a control group. Because assignment is random, the **only expected difference** between the control and experimental groups is the treatment itself, so any difference in the outcome variable being studied can be attributed to it.

Randomisation annihilates bias by making the potential outcomes independent of the treatment: 

$$ (Y_0, Y_1) \,\bot\, T  $$

Saying that the potential outcomes are independent of the treatment is saying that they would be, in expectation, the same in the treatment or the control group. In simpler terms, it means that treatment and control groups are comparable.
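
To connect this with the decomposition above, here is a short worked step. Under $(Y_0, Y_1) \,\bot\, T$, conditioning on $T$ does not change the expectation of either potential outcome, so the bias term is zero:

$$ E[Y_0|T=1] = E[Y_0|T=0] = E[Y_0] \;\Rightarrow\; BIAS = 0 $$

For the same reason the effect on the treated equals the effect on everyone, and the difference in means identifies the average treatment effect (ATE):

$$ E[Y|T=1] - E[Y|T=0] = E[Y_1 - Y_0|T=1] = E[Y_1 - Y_0] = ATE $$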

Therefore, the treatment is the only thing generating a difference between the outcomes of the treated and control groups.
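
A small simulation can make this concrete. The sketch below is not part of the original notebook: the potential outcomes, the `ability` confounder, and the constant effect of -5 are all invented for illustration. It compares the naive difference in means when units self-select into treatment with the same difference when treatment is assigned by a coin flip.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical potential outcomes (assumption): "ability" raises the untreated
# outcome, and the true treatment effect is a constant -5 for every unit.
ability = rng.normal(0, 1, n)
y0 = 75 + 5 * ability + rng.normal(0, 1, n)
y1 = y0 - 5.0


def naive_difference(t):
    """E[Y|T=1] - E[Y|T=0], using only the outcome we would actually observe."""
    y = np.where(t == 1, y1, y0)
    return y[t == 1].mean() - y[t == 0].mean()


def bias(t):
    """E[Y0|T=1] - E[Y0|T=0]; computable only because this is a simulation."""
    return y0[t == 1].mean() - y0[t == 0].mean()


# Self-selection: high-ability units are more likely to take the treatment.
t_selected = (ability + rng.normal(0, 1, n) > 0).astype(int)
# Randomization: a coin flip that ignores ability.
t_random = rng.integers(0, 2, n)

print("self-selected:", naive_difference(t_selected), bias(t_selected))
print("randomized:   ", naive_difference(t_random), bias(t_random))
```

With self-selection the bias term is large and positive, so the naive difference is far from the true effect of -5. With random assignment the bias is close to zero and the naive difference lands close to -5.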

### Example 1: the ideal experiment

In 2020, the coronavirus pandemic forced businesses to adapt to social distancing. In this context, we want to know whether online learning has a negative or positive impact on students' academic performance.

To answer that, we need to make the treated and untreated comparable. One way to force this is by randomly assigning students to online or face-to-face classes.

Imagine that we've randomized classes: some students were assigned to have face-to-face lectures, others to have only online lessons, and a third group to have a blended format of both online and face-to-face classes. Then, we collect data on a standard exam at the end of the semester:

In [11]:
data = pd.read_csv('https://github.com/matheusfacure/python-causality-handbook/raw/master/causal-inference-for-the-brave-and-true/data/online_classroom.csv')
print(data.shape)
data.head()


(323, 10)


Unnamed: 0,gender,asian,black,hawaiian,hispanic,unknown,white,format_ol,format_blended,falsexam
0,0,0.0,0.0,0.0,0.0,0.0,1.0,0,0.0,63.29997
1,1,0.0,0.0,0.0,0.0,0.0,1.0,0,0.0,79.96
2,1,0.0,0.0,0.0,0.0,0.0,1.0,0,1.0,83.37
3,1,0.0,0.0,0.0,0.0,0.0,1.0,0,1.0,90.01994
4,1,0.0,0.0,0.0,0.0,0.0,1.0,1,0.0,83.3


To estimate the causal effect, we can simply compute the mean score for each of the treatment groups.

In [14]:
# create an extra column that labels each student's class format
data_2 = (data
 .assign(class_format = np.select(
     [data["format_ol"].astype(bool), data["format_blended"].astype(bool)],
     ["online", "blended"],
     default="face_to_face"  # neither flag set: face-to-face
 )))

In [15]:
data_2

Unnamed: 0,gender,asian,black,hawaiian,hispanic,unknown,white,format_ol,format_blended,falsexam,class_format
0,0,0.0,0.0,0.0,0.0,0.0,1.0,0,0.0,63.29997,face_to_face
1,1,0.0,0.0,0.0,0.0,0.0,1.0,0,0.0,79.96000,face_to_face
2,1,0.0,0.0,0.0,0.0,0.0,1.0,0,1.0,83.37000,blended
3,1,0.0,0.0,0.0,0.0,0.0,1.0,0,1.0,90.01994,blended
4,1,0.0,0.0,0.0,0.0,0.0,1.0,1,0.0,83.30000,online
...,...,...,...,...,...,...,...,...,...,...,...
318,0,0.0,0.0,0.0,0.0,0.0,1.0,0,1.0,68.36000,blended
319,1,,,,,,,1,0.0,70.05000,online
320,0,,,,,,,1,0.0,66.69000,online
321,1,,,,,,,1,0.0,83.29997,online


In [16]:
data_2.groupby(["class_format"]).mean() # mean of every column, including the exam score, by class format

class_format,gender,asian,black,hawaiian,hispanic,unknown,white,format_ol,format_blended,falsexam
blended,0.550459,0.217949,0.102564,0.025641,0.012821,0.012821,0.628205,0.0,1.0,77.093731
face_to_face,0.633333,0.20202,0.070707,0.0,0.010101,0.0,0.717172,0.0,0.0,78.547485
online,0.542553,0.228571,0.028571,0.014286,0.028571,0.0,0.7,1.0,0.0,73.635263


In [17]:
# E[Y(1)] - E[Y(0)] = ATE (average treatment effect), a causal quantity
# Y(1) -> online, Y(0) -> face_to_face
73.635263 - 78.547485

-4.912222

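We can see that face-to-face classes yield an average score of 78.55, while online courses yield an average score of 73.64. Not so good news for the proponents of online learning. Because the class formats were randomly assigned, this difference in means estimates the ATE of an online class relative to a face-to-face one (under randomization it coincides with the ATET): -4.91. In other words, online classes cause students to perform about 5 points lower, on average.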
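
Rather than typing the group means by hand, the difference can be computed from the data together with a standard error. The helper below is a minimal sketch: `diff_in_means` is a hypothetical name, not a pandas or statsmodels function, and the `toy` frame holds made-up scores that only demonstrate the call.

```python
import numpy as np
import pandas as pd


def diff_in_means(df, outcome, group, treated, control):
    """Treated-minus-control difference in mean outcome, with a Welch standard error."""
    y1 = df.loc[df[group] == treated, outcome].dropna()
    y0 = df.loc[df[group] == control, outcome].dropna()
    estimate = y1.mean() - y0.mean()
    # Unequal-variance (Welch) standard error of the difference in means
    std_error = np.sqrt(y1.var(ddof=1) / len(y1) + y0.var(ddof=1) / len(y0))
    return estimate, std_error


# Toy frame with made-up scores (assumption), only to show how the helper is called
toy = pd.DataFrame({
    "class_format": ["online"] * 3 + ["face_to_face"] * 3,
    "falsexam": [70.0, 72.0, 74.0, 76.0, 78.0, 80.0],
})
estimate, std_error = diff_in_means(
    toy, "falsexam", "class_format", "online", "face_to_face"
)
print(estimate, std_error)
```

On the notebook's data, the call `diff_in_means(data_2, "falsexam", "class_format", "online", "face_to_face")` should give the same point estimate as the subtraction above, plus a measure of its uncertainty.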

A good sanity check to see if the randomization was done right (or if you are looking at the correct data) is to check whether the treated and untreated groups are similar in their pre-treatment variables. Our data has information on gender and ethnicity, so we can compare those across groups. They look pretty similar for the gender, asian, hispanic, and white variables. The black variable, however, seems a little different: its mean is about 0.10 in the blended group, 0.07 in the face-to-face group, and 0.03 in the online group.
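
One way to make this balance check less of an eyeball exercise is to express each gap in standard-deviation units. The sketch below is an illustration rather than the notebook's method: `balance_table` is a hypothetical helper, and the `toy` frame contains invented values.

```python
import numpy as np
import pandas as pd


def balance_table(df, group, covariates, control):
    """Covariate means by group, plus standardized mean differences versus control."""
    means = df.groupby(group)[covariates].mean()
    variances = df.groupby(group)[covariates].var()
    rows = {}
    for g in means.index:
        if g == control:
            continue
        # Pool the two group variances so the gap is expressed in SD units
        pooled_sd = np.sqrt((variances.loc[g] + variances.loc[control]) / 2)
        rows[g] = (means.loc[g] - means.loc[control]) / pooled_sd
    smd = pd.DataFrame(rows).T  # one row per non-control group
    return means, smd


# Toy frame with invented covariates (assumption), only to show the call
toy = pd.DataFrame({
    "class_format": ["online"] * 4 + ["face_to_face"] * 4,
    "gender": [1, 0, 1, 0, 1, 1, 0, 0],
    "black": [0, 0, 0, 1, 0, 1, 0, 1],
})
means, smd = balance_table(toy, "class_format", ["gender", "black"], "face_to_face")
print(means)
print(smd)
```

A commonly cited rule of thumb treats absolute standardized differences below roughly 0.1 as well balanced, although with small groups some larger gaps are expected by chance alone.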