# 0. Abstract and Outline

This document gives a brief introduction to causal inference, an important methodology for data analysis and data science in a wide range of scientific and practical fields. Causal inference has become standard practice in the quantitative social sciences, and its development was recognized by the 2021 Nobel Prize in economic sciences. We start with the potential outcomes framework to discuss the core challenge in causal inference, missing outcome values, and demonstrate why and how selection bias arises and undermines reliable causal conclusions. Based on this framework, we introduce A/B testing, also referred to as randomized controlled trials or randomized experiments in other contexts (such as medicine or statistics), as the gold standard that rules out selection bias and thus provides an unbiased estimate of the causal effect. We also present demonstration code that estimates the average treatment effect with data from randomized experiments, and introduce experiment design issues.

Below is the outline of this document:

- In [Section 1](#section_1), we discuss the core challenge of causal inference: missing data for the counterfactuals.

- In [Section 2](#section_2), we introduce the potential outcomes model framework and discuss why selection bias would make estimating the average treatment effect very difficult. 

- In [Section 3](#section_3), we introduce randomized experiments and discuss why they address selection bias in estimating the average treatment effect. We also present two approaches to estimating the average treatment effect, with applications to a data set from an A/B test. 

- In [Section 4](#section_4), we discuss some central issues related to experiment design, including internal validity, external validity, and the stable unit treatment value assumption.

<a id='section_1'></a>
# 1. Causal Inference

Recall that, in supervised learning (regression and classification), we are trying to build a model $\hat f(\cdot)$ from the training set $\mathcal D=\{Y_i,X_{ij}:1\le i\le n,1\le j\le p\}$, such that, given a new observation with features $X$, the predicted outcome $\hat f(X)$ is as close to the true outcome $Y$ as possible, i.e., 

$$Y\approx \hat f(X)$$

Note that in supervised learning we only care about making accurate predictions. In this document, we go one step further to do causal inference:

----

<span style="font-family:Comic Sans MS">
    <p style="color:red">
How can we identify, from data, the changes in $Y$ that are caused by changes in $X$?
        </p>
</span>    

----


As we will detail in this document, this is an important yet difficult question. At a high level, addressing the above question would lead to important insights that could help us:

- Identify the cause of an issue/phenomenon (e.g., global warming);
- Predict the effect/consequence of an operational/strategic change (e.g., providing subsidies to App users);
- Inform what changes should be made to improve the status quo (e.g., to facilitate the recovery of the economy from COVID-19). 

Below, we summarize some questions that could be addressed with solid and rigorous causal inference analysis:

- Are human activities **causing** climate change?
- Will more education **lead to** higher incomes?
- Will poking **change** user engagement at WeChat?
- Can this new recommendation algorithm and/or product UI **increase** user retention?

The importance of causal inference was recognized by the Sveriges Riksbank Prize in Economic Sciences in Memory of Alfred Nobel 2021 (a.k.a. the Nobel Prize in Economics).

<img src="Nobel-2021.png" width=350>

## 1.1. Simple Examples

The following figure plots the relationship between chocolate consumption and the number of Nobel laureates per 10 million population.

<img src="Chocolate.png" width=500>

----

<span style="font-family:Comic Sans MS">
    <p style="color:red">
If policy makers want to boost the number of Nobel laureates in a country, should they encourage chocolate consumption?
    </p>
</span>    

----


Let us next consider an example that illustrates the difficulty of causal inference. Consider data from the National Health Interview Survey, in which patients who visited the emergency department (ED) of a hospital were recorded at discharge. Each patient's health status upon leaving the hospital is also recorded, on a scale where 1 refers to poor health and 5 refers to excellent health. We want to examine the effect of hospitalization on patient health and have the following table, where the row "Hospitalized" covers the patients who were hospitalized, and the row "Not hospitalized" covers the patients who were sent home directly without hospitalization:


<table style="width:70%">
  <tr>
    <th>Group </th>
    <th>Sample Size</th> 
    <th>Average Health Status</th>
    <th>Standard Error</th>
  </tr>
  <tr>
    <th>Hospitalized</th>
    <td>7,774</td>
    <td>3.21</td>
    <td>0.014</td>
  </tr>
      <tr>
    <th>Not hospitalized</th>
    <td>90,049</td>
    <td>3.93</td>
    <td>0.003</td>
  </tr>
</table>

From the above table, we observe that the patients who have been hospitalized have a **poorer** health status of **3.21** on average when they leave the hospital, compared with an average of **3.93** upon discharge for the patients who were sent home directly. With the data from this health interview survey, can we conclude that hospitalization actually **makes** patients less healthy? The answer is **No**, because patients who are seriously ill in the first place are far more likely to be hospitalized. Therefore, the patients who are hospitalized are not drawn from the same population distribution as those who are sent home directly. In other words, these two groups of patients are <font color="red"> **not comparable** </font>.

Consider another example: Didi, the largest ride-sharing platform in China, sends out coupons to riders in order to encourage them to use Didi. To understand the effect of coupon promotions, we summarize the data of one day below, where each cell reports the number of users in that category.

<table style="width:70%">
  <tr>
    <th>Group </th>
    <th>Ride with Didi</th> 
    <th>Not Ride with Didi</th>
    <th>Active Rate</th>

  </tr>
  <tr>
    <th>Coupon</th>
    <td>2,000</td>
    <td>23,000</td>
    <td>8.0%</td>
  </tr>
      <tr>
    <th>No Coupon</th>
    <td>10,560</td>
    <td>69,183</td>
    <td>13.2%</td>
  </tr>
</table>


As we observe from the table above, the proportion of users who ride with Didi on that day is **$\frac{2000}{25000}=8.0\%$ among those receiving a coupon**, whereas it is **$\frac{10560}{79743}=13.2\%$ among those not receiving a coupon**. In this case, can we conclude that the coupon will **decrease the user activeness of Didi**? The answer is also **No**. This is because Didi is more likely to offer coupons to users/riders who are less active on the platform. In this regard, users who receive the coupon are not drawn from the same population distribution as those who do not receive the coupon. In other words, the two groups of users are <font color="red"> **not comparable** </font>.

The two examples above have clearly demonstrated that **association** does not necessarily imply **causation**. 

- Hospitalization is **negatively correlated** with health status, but this does not mean that hospitalization will **cause poorer health**. 
- Receiving a coupon is **negatively correlated** with Didi user activeness, but this does not mean that the coupon will **make users less active on Didi**.
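
The active rates in the Didi table can be reproduced directly from the raw counts. The snippet below is a minimal sketch; the variable names are illustrative and do not come from the original analysis.

```python
# Counts taken from the Didi table above: (rode with Didi, did not ride).
counts = {
    "Coupon": (2_000, 23_000),
    "No Coupon": (10_560, 69_183),
}

# Active rate = riders / all users in the group.
active_rate = {
    group: rode / (rode + not_rode)
    for group, (rode, not_rode) in counts.items()
}

for group, rate in active_rate.items():
    print(f"{group}: {rate:.1%}")

# Raw gap between the two groups. This is an association only,
# NOT an estimate of the causal effect of the coupon.
naive_gap = active_rate["Coupon"] - active_rate["No Coupon"]
print(f"Naive difference: {naive_gap:+.1%}")
```

The negative gap reflects who was selected to receive a coupon, not what the coupon does.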

## 1.2. Correlation vs. Causation: The Missing Value Problem

The key problem underlying the difficulty of drawing a causal conclusion from correlations is the **missing value** issue:

----

<span style="font-family:Comic Sans MS">
    <p style="color:red">
We are unable to observe what would have happened to each individual if the alternative action had been applied.
        </p>
</span>    

----

Specifically, for the hospitalization survey case, people who are seriously ill are more likely to be admitted into a hospital in the first place, but we are unable to see:

- What happens to a seriously sick person if not admitted into a hospital?
- What happens to a slightly sick person if admitted into a hospital?

Very likely, both of them would be better off if hospitalized. Analogously, for the Didi coupon case, less active users are more likely to receive the coupon, but we are unable to see:

- What happens to an active user if receiving the coupon?
- What happens to an inactive user if not receiving the coupon?

<a id='section_2'></a>
# 2. Potential Outcomes Model

We call the unseen information about each individual the **counterfactual**. For example, for the Didi users who have received (resp. not received) the coupon, the counterfactual is how they would have behaved had they not received (resp. had they received) the coupon. The key to successful causal inference is **rigorous reasoning about counterfactuals** by constructing **reasonable benchmarks**. 

The most widely used framework to analyze counterfactuals is called the potential outcomes model ([Wiki Page](https://en.wikipedia.org/wiki/Rubin_causal_model)). We consider the case of two possible actions applied to each individual, denoted by $W$:

- $W=1$ refers to the "treatment" action, such as a patient is hospitalized or a Didi user receives a coupon.
- $W=0$ refers to the "control" action, such as a patient is not hospitalized or a Didi user does not receive a coupon.

For each individual, there are two potential outcomes:

- $Y(1)$: The outcome of the individual if treatment is applied. For example, the health status of a patient if hospitalized; or the activeness of a Didi user if receiving the coupon.
- $Y(0)$: The outcome of the individual if control is applied. For example, the health status of a patient if not hospitalized; or the activeness of a Didi user if not receiving the coupon.

We define the causal effect of the treatment as the difference between the potential outcome when the treatment is applied and the potential outcome when the control is applied.

$$\mbox{Causal Effect}=Y(1)-Y(0)$$

The most fundamental problem in causal inference is an issue of missing data: For each individual, we either observe the treatment outcome $Y(1)$, or the control outcome $Y(0)$, but not both. For example, if a patient is hospitalized (resp. not hospitalized), we can only observe his/her health status under hospitalization (resp. without hospitalization) but not that without hospitalization (resp. under hospitalization). Analogously, if a Didi user receives the coupon (resp. does not receive the coupon), we can only observe his/her activeness upon receiving the coupon (resp. without receiving the coupon) but not that without receiving the coupon (resp. upon receiving the coupon).

Thus, we have to infer the counterfactuals from the data of the individuals who received the other action. The core question of causal inference then reduces to:


----

<span style="font-family:Comic Sans MS">
    <p style="color:red">
How can we infer the value of the counterfactual, $Y(1-W)$, and construct a reasonable benchmark from the available data?
        </p>
</span>    

----
To answer this question, we fully formalize our potential outcomes model for the sample. Consider the potential outcomes of all $n$ individuals in the data sample: $\{Y_i(1),Y_i(0):1\le i\le n\}$. We define the assignment mechanism as $\mathcal W=\{W_i\in\{0,1\}:1\le i\le n\}$, where $W_i=1$ means individual $i$ is assigned to the treatment group whereas $W_i=0$ means individual $i$ is assigned to the control group. Therefore, the data we could observe for individual $i$ is $Y_i(W_i)$, but not $Y_i(1-W_i)$. We denote $\mathcal D=\{(Y_i(W_i),W_i):1\le i\le n\}$ as the observed data set. We denote $n_1$ as the number of individuals in the treatment group, and $n_0$ as the number of individuals in the control group. 

In the hospitalization example, individuals are partially self-assigned and partially assigned by doctors. In the Didi coupon example, the platform assigns users to the treatment or control group under a rule that tends to put more active users into control and less active ones into treatment.

Of particular importance is the random assignment mechanism, which we denote as $\mathcal R$. Under the random assignment, each individual is assigned to treatment or control at random.

## 2.1. Revisiting the Example of Didi

Let us revisit the Didi coupon example. We denote $W_i=1$ as sending a coupon to user $i$ and $W_i=0$ as not sending a coupon to user $i$. We have the following table summarizing our sample data of 12 observations, where starred entries ($*$) represent what we observe (the group assignment $W_i$'s are all observable):

<table style="width:70%">
  <tr>
    <th>Individual </th>
    <th>$W_i$</th> 
    <th>$Y_i(1)$</th>
    <th>$Y_i(0)$</th>
    <th>Causal Effect</th> 
  </tr>
  <tr>
    <td>1</td>
    <td>0</td>
    <td>1</td>
    <td>0 ($*$)</td>
    <td>1</td>  
  </tr>
   <tr>
    <td>2</td>
    <td>0</td>
    <td>1</td>
    <td>0 ($*$)</td>
    <td>1</td>  
  </tr>
   <tr>
    <td>3</td>
    <td>0</td>
    <td>1</td>
    <td>0 ($*$)</td>
    <td>1</td>  
  </tr>
   <tr>
    <td>4</td>
    <td>0</td>
    <td>1</td>
    <td>1 ($*$)</td>
    <td>0</td>  
  </tr>
   <tr>
    <td>5</td>
    <td>0</td>
    <td>1</td>
    <td>1 ($*$)</td>
    <td>0</td>  
  </tr>
    <tr>
    <td>6</td>
    <td>0</td>
    <td>1</td>
    <td>1 ($*$)</td>
    <td>0</td>  
  </tr>
    <tr>
    <td>7</td>
    <td>1</td>
    <td>1 ($*$)</td>
    <td>0 </td>
    <td>1</td>  
  </tr>
      <tr>
    <td>8</td>
    <td>1</td>
    <td>1 ($*$)</td>
    <td>0 </td>
    <td>1</td>  
  </tr>
    <tr>
    <td>9</td>
    <td>1</td>
    <td>1 ($*$)</td>
    <td>0 </td>
    <td>1</td>  
  </tr>
    <tr>
    <td>10</td>
    <td>1</td>
    <td>0 ($*$)</td>
    <td>0 </td>
    <td>0</td>  
  </tr>
    <tr>
    <td>11</td>
    <td>1</td>
    <td>0 ($*$)</td>
    <td>0 </td>
    <td>0</td>  
  </tr>
    <tr>
    <td>12</td>
    <td>1</td>
    <td>0 ($*$)</td>
    <td>0 </td>
    <td>0</td>  
  </tr>

</table>

As we can see, on average, the causal effect of sending the coupon to customers is 

$$\frac{\sum_{i=1}^{12}(Y_i(1)-Y_i(0))}{12}=0.5$$

Hence, the coupon is fairly effective at improving the overall activeness of Didi users. However, if we directly compare the activeness of the users who receive the coupon 
$$\frac{\sum_{W_i=1}Y_i(W_i)}{6}=\frac{1+1+1+0+0+0}{6}=0.5,$$ 
with those who do not
$$\frac{\sum_{W_i=0}Y_i(W_i)}{6}=\frac{0+0+0+1+1+1}{6}=0.5,$$
we find that they are the same. So it seems that the coupon is ineffective at improving user activeness. How could we resolve this discrepancy?
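
To make the discrepancy concrete, the two numbers can be reproduced from the 12-user table. This is a minimal sketch: the arrays simply transcribe the table, and the variable names are illustrative. In real data only one of the two potential outcomes would be available for each user.

```python
import numpy as np

# Potential outcomes of the 12 users, transcribed from the table above.
w  = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1])   # coupon assignment W_i
y1 = np.array([1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0])   # Y_i(1)
y0 = np.array([0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0])   # Y_i(0)

# Average causal effect: needs BOTH potential outcomes (infeasible in practice).
true_effect = np.mean(y1 - y0)

# What we actually observe is Y_i(W_i).
y_obs = np.where(w == 1, y1, y0)

# Naive comparison of the observed group means.
naive_diff = y_obs[w == 1].mean() - y_obs[w == 0].mean()

print(f"average causal effect: {true_effect}")
print(f"naive difference     : {naive_diff}")
```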

## 2.2. Selection Bias

Because we cannot observe both potential outcomes for any individual, we need to observe two groups of individuals who are nearly identical, but one group under the treatment condition and the other under the control condition. We will make the term "nearly identical" concrete shortly.

We define a metric of interest called <font color="red"> **Average Treatment Effect** (ATE) </font>as follows:

<font color="red">

$$ATE:=\mathbb E[Y(1)] - \mathbb E[Y(0)]$$

</font>

where the expectation is taken with respect to the underlying true distribution of the potential outcomes $Y(1)$ and $Y(0)$. Note that the average treatment effect is an aggregate metric: we give up on recovering each individual's causal effect, but each of the two expectations, $\mathbb E[Y(1)]$ and $\mathbb E[Y(0)]$, is a population-level quantity that we can hope to estimate.

To estimate $\mathbb E[Y(1)]$, we use the following estimator:

<font color="red">

$$\bar Y(1)=\frac{1}{n_1}\sum_{W_i=1}Y_i(W_i)\approx \mathbb E[Y_i(1)|W_i=1]$$

</font>    
which is the sample average of the outcomes in the treatment group. When the sample size $n_1$ is sufficiently large, the sample average of the treatment group $\bar Y(1)$ is close to the conditional expectation of the outcomes for the subjects in the treatment group if they are treated. Analogously, we use the sample average of the outcomes in the control group to estimate $\mathbb E[Y(0)]$:

<font color="red">

$$\bar Y(0)=\frac{1}{n_0}\sum_{W_i=0}Y_i(W_i)\approx \mathbb E[Y_i(0)|W_i=0],$$

</font>

which converges to the conditional expectation of the outcomes for the subjects in the control group if they are not treated, if the sample size of the control group $n_0$ is large.

Therefore, we estimate average treatment effect using the difference between the sample average of the outcomes in the treatment group and that of the control group:

<font color="red">

$$\widehat{ATE}:=\bar Y(1)- \bar Y(0)=\frac{1}{n_1}\sum_{W_i=1}Y_i(W_i)-\frac{1}{n_0}\sum_{W_i=0}Y_i(W_i)\approx \mathbb E[Y_i(1)|W_i=1] - \mathbb E[Y_i(0)|W_i=0]$$

</font>    
Now we need to address:

-------
<font color="red">

- **When is $\widehat{ATE}$ a good estimate of $ATE$?**

</font>    

-------------

We add and subtract the term $\mathbb E[Y_i(0)|W_i=1]$ (the expected outcome of the subjects in the treatment group had they been under the control condition, which is something we are **unable to estimate**):

--------------

<font color="red">

\begin{equation}
\begin{split}
\widehat{ATE}&\approx \mathbb E[Y_i(1)|W_i=1] - \mathbb E[Y_i(0)|W_i=0]\\
&= \mathbb E[Y_i(1)|W_i=1] - \mathbb E[Y_i(0)|W_i=1] + \mathbb E[Y_i(0)|W_i=1] - \mathbb E[Y_i(0)|W_i=0]\\
&= \underbrace{\mathbb E[Y_i(1)-Y_i(0)|W_i=1]}_{\mbox{Expected Causal Effect for Treated}} + \underbrace{\mathbb E[Y_i(0)|W_i=1] - \mathbb E[Y_i(0)|W_i=0]}_{\mbox{Selection Bias for Control}}
\end{split}
\end{equation}

</font>

--------------

Analogously, we can add and subtract the term $\mathbb E[Y_i(1)|W_i=0]$ (the expected outcome of the subjects in the control group had they been under the treatment condition, which is, again, something we are unable to estimate):

--------------

<font color="red">

\begin{equation}
\begin{split}
\widehat{ATE}&\approx \mathbb E[Y_i(1)|W_i=1] - \mathbb E[Y_i(0)|W_i=0]\\
&= \mathbb E[Y_i(1)|W_i=1] - \mathbb E[Y_i(1)|W_i=0] + \mathbb E[Y_i(1)|W_i=0] - \mathbb E[Y_i(0)|W_i=0]\\
&= \underbrace{\mathbb E[Y_i(1)-Y_i(0)|W_i=0]}_{\mbox{Expected Causal Effect for Control}} + \underbrace{\mathbb E[Y_i(1)|W_i=1] - \mathbb E[Y_i(1)|W_i=0]}_{\mbox{Selection Bias for Treated}}
\end{split}
\end{equation}

</font>

-----------------

Therefore, we have:

----

<span style="font-family:Comic Sans MS">
    <p style="color:red">
If there is no selection bias, i.e., $\mathbb E[Y_i(0)|W_i=1] = \mathbb E[Y_i(0)|W_i=0]$ and $\mathbb E[Y_i(1)|W_i=1] = \mathbb E[Y_i(1)|W_i=0]$, $\widehat{ATE}$ is an unbiased estimate of $ATE$.
        </p>
</span>    

----
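
To see why, here is a short sketch of the missing step, using only the two decompositions above and the law of total expectation. The $ATE$ is a weighted average of the expected causal effects in the two groups:

\begin{equation}
\begin{split}
ATE&=\mathbb E[Y_i(1)-Y_i(0)]\\
&=\mathbb P(W_i=1)\,\mathbb E[Y_i(1)-Y_i(0)|W_i=1]+\mathbb P(W_i=0)\,\mathbb E[Y_i(1)-Y_i(0)|W_i=0]
\end{split}
\end{equation}

If both selection-bias terms are zero, the first decomposition gives $\widehat{ATE}\approx\mathbb E[Y_i(1)-Y_i(0)|W_i=1]$ and the second gives $\widehat{ATE}\approx\mathbb E[Y_i(1)-Y_i(0)|W_i=0]$. The two conditional effects therefore share the same value, and so does their weighted average, $ATE$.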




The <font color="red"> **no selection bias** </font>condition can be interpreted as follows: the assignment $W_i$ carries no information about the expected potential outcomes $Y_i(1)$ and $Y_i(0)$ (i.e., the potential outcomes are mean-independent of $W_i$). Clearly, this is not satisfied for the cases of hospitalization and Didi coupon discussed above. For the case of hospitalization, patients in poorer health conditions are more likely to be assigned to the treatment group (to be hospitalized). For the case of Didi coupon, less active users are more likely to be assigned to the treatment group (to receive the coupon). In fact, many of the challenges in causal inference can ultimately be attributed to selection bias. 

We now revisit the case of the Didi user coupon. The potential outcomes are summarized again in the following table: 

<table style="width:70%">
  <tr>
    <th>Individual </th>
    <th>$W_i$</th> 
    <th>$Y_i(1)$</th>
    <th>$Y_i(0)$</th>
    <th>Causal Effect</th> 
  </tr>
  <tr>
    <td>1</td>
    <td>0</td>
    <td>1</td>
    <td>0 ($*$)</td>
    <td>1</td>  
  </tr>
   <tr>
    <td>2</td>
    <td>0</td>
    <td>1</td>
    <td>0 ($*$)</td>
    <td>1</td>  
  </tr>
   <tr>
    <td>3</td>
    <td>0</td>
    <td>1</td>
    <td>0 ($*$)</td>
    <td>1</td>  
  </tr>
   <tr>
    <td>4</td>
    <td>0</td>
    <td>1</td>
    <td>1 ($*$)</td>
    <td>0</td>  
  </tr>
   <tr>
    <td>5</td>
    <td>0</td>
    <td>1</td>
    <td>1 ($*$)</td>
    <td>0</td>  
  </tr>
    <tr>
    <td>6</td>
    <td>0</td>
    <td>1</td>
    <td>1 ($*$)</td>
    <td>0</td>  
  </tr>
    <tr>
    <td>7</td>
    <td>1</td>
    <td>1 ($*$)</td>
    <td>0 </td>
    <td>1</td>  
  </tr>
      <tr>
    <td>8</td>
    <td>1</td>
    <td>1 ($*$)</td>
    <td>0 </td>
    <td>1</td>  
  </tr>
    <tr>
    <td>9</td>
    <td>1</td>
    <td>1 ($*$)</td>
    <td>0 </td>
    <td>1</td>  
  </tr>
    <tr>
    <td>10</td>
    <td>1</td>
    <td>0 ($*$)</td>
    <td>0 </td>
    <td>0</td>  
  </tr>
    <tr>
    <td>11</td>
    <td>1</td>
    <td>0 ($*$)</td>
    <td>0 </td>
    <td>0</td>  
  </tr>
    <tr>
    <td>12</td>
    <td>1</td>
    <td>0 ($*$)</td>
    <td>0 </td>
    <td>0</td>  
  </tr>
</table>

We can compute the relevant metrics of interest:

- **Average Treatment Effect:** 
$$ATE=\mathbb E[Y(1)]-\mathbb E[Y(0)]\approx\frac{1}{12}\sum_{i=1}^{12}\left(Y_i(1)-Y_i(0)\right)=\frac{6}{12}=0.5$$
- **Estimated Average Treatment Effect:** 
$$\widehat{ATE}=\frac{1}{6}\sum_{i=7}^{12}Y_i(1)-\frac{1}{6}\sum_{i=1}^{6}Y_i(0)=0.5-0.5=0$$
- **Average Treatment Effect for Treated:** 
$$ATT=\mathbb E[Y(1)-Y(0)|W=1]\approx \frac{1}{6}\sum_{i=7}^{12}(Y_i(1)-Y_i(0))=\frac{3}{6}=0.5$$
- **Average Treatment Effect for Control:** 
$$ATC=\mathbb E[Y(1)-Y(0)|W=0]\approx \frac{1}{6}\sum_{i=1}^{6}(Y_i(1)-Y_i(0))=\frac{3}{6}=0.5$$
- **Selection Bias (for Treated):** 
$$\mathbb E[Y(1)|W=1]-\mathbb E[Y(1)|W=0]\approx \frac{1}{6}\sum_{i=7}^{12}Y_i(1)-\frac{1}{6}\sum_{i=1}^{6}Y_i(1)=\frac{3}{6}-\frac{6}{6}=-0.5$$
- **Selection Bias (for Control):** 
$$\mathbb E[Y(0)|W=1]-\mathbb E[Y(0)|W=0]\approx \frac{1}{6}\sum_{i=7}^{12}Y_i(0)-\frac{1}{6}\sum_{i=1}^{6}Y_i(0)=\frac{0}{6}-\frac{3}{6}=-0.5$$
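
The quantities above, and the two decompositions of $\widehat{ATE}$ from this section, can be checked numerically. The following is a minimal sketch that again transcribes the 12-user table; the variable names are illustrative, and the computation is only possible here because the toy table lists both potential outcomes.

```python
import numpy as np

w  = np.array([0] * 6 + [1] * 6)                 # W_i
y1 = np.array([1] * 6 + [1, 1, 1, 0, 0, 0])      # Y_i(1)
y0 = np.array([0, 0, 0, 1, 1, 1] + [0] * 6)      # Y_i(0)

treated, control = (w == 1), (w == 0)
effect = y1 - y0                                 # individual causal effects

ate = effect.mean()
att = effect[treated].mean()
atc = effect[control].mean()

# Naive estimator: uses only outcomes that would actually be observed.
ate_hat = y1[treated].mean() - y0[control].mean()

# Selection-bias terms, labeled as in the text.
bias_for_control = y0[treated].mean() - y0[control].mean()
bias_for_treated = y1[treated].mean() - y1[control].mean()

print(f"ATE={ate}, ATT={att}, ATC={atc}, ATE_hat={ate_hat}")
print(f"selection bias (for control)={bias_for_control}")
print(f"selection bias (for treated)={bias_for_treated}")

# Both decompositions recover the naive estimate.
assert np.isclose(ate_hat, att + bias_for_control)
assert np.isclose(ate_hat, atc + bias_for_treated)
```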

You may refer to [this book chapter](https://matheusfacure.github.io/python-causality-handbook/01-Introduction-To-Causality.html) to see more examples and discussions of causation. 

<a id='section_3'></a>
# 3. Randomized Experiment (A/B Testing)

Based on our discussions above, to obtain an unbiased estimate of the average treatment effect, we need to remove the selection biases. To this end, we adopt randomized experiments (also known as randomized controlled trials, see the [Wiki page](https://en.wikipedia.org/wiki/Randomized_controlled_trial)):

----

<span style="font-family:Comic Sans MS">
<p style="color:red">
Subjects are randomly assigned to the treatment group or the control group. Mathematically, $W_i$ is completely random for each subject $i$.
</p>
</span>    

----

In the case of a completely randomized experiment, the assignment $W$ is independent of the potential outcomes $Y(1)$ and $Y(0)$, which implies that 

<font color="red">
$$\mathbb E[Y(1)|W=1]=\mathbb E[Y(1)|W=0]\mbox{ and }\mathbb E[Y(0)|W=1]=\mathbb E[Y(0)|W=0],\mbox{ i.e., selection bias is completely removed}.$$ 
</font>

Therefore, **the estimated average treatment effect $\widehat{ATE}$ is an unbiased estimate of the true average treatment effect $ATE$**. For this reason, randomized experiments are considered the **gold standard** for causal inference.
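
A small simulation can illustrate this point. The sketch below uses an entirely hypothetical population (the baseline activity levels, the +0.10 coupon lift, and the targeting rule are all made-up assumptions, not Didi data) and compares the naive difference in means under a targeted assignment with the one under a randomized assignment.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical population: each user has a baseline probability of being
# active, and the coupon raises that probability by 0.10.
base = rng.uniform(0.0, 0.8, size=n)
u = rng.uniform(size=n)                  # shared noise, so that y1 >= y0
y0 = (u < base).astype(int)              # outcome without a coupon
y1 = (u < base + 0.10).astype(int)       # outcome with a coupon

true_ate = np.mean(y1 - y0)

def diff_in_means(w):
    """Difference of observed group means under the assignment vector w."""
    y_obs = np.where(w == 1, y1, y0)
    return y_obs[w == 1].mean() - y_obs[w == 0].mean()

# Targeted assignment: coupons go to the less active users (selection bias).
w_targeted = (base < 0.4).astype(int)

# Randomized assignment: a fair coin flip, independent of (y0, y1).
w_random = rng.binomial(1, 0.5, size=n)

est_targeted = diff_in_means(w_targeted)
est_random = diff_in_means(w_random)

print(f"true ATE            : {true_ate:.3f}")
print(f"targeted assignment : {est_targeted:.3f}")
print(f"random assignment   : {est_random:.3f}")
```

Under the targeted rule the estimate even has the wrong sign, while the randomized estimate lands close to the true effect.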

There are two common ways to implement the random assignment of treatment $\mathcal R$, both of which imply that selection bias is removed:

- **Complete Randomization:** Randomly select $n_1$ subjects (without replacement), out of the entire sample of $n$ subjects, into the treatment group. So the probability of each assignment is $\binom{n}{n_1}^{-1}$.
- **Simple (Bernoulli) Randomization:** Each subject is independently assigned to the treatment group with probability $p=\frac{n_1}{n}$. In this case, the number of subjects assigned to the treatment group may not exactly equal $n_1$, but would be very close to $n_1$ (in relative terms) if $n$ is very large. 
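
The two mechanisms can be sketched with NumPy as follows. The sample size, seed, and variable names are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n, n1 = 10, 4

# Complete randomization: exactly n1 treated units, drawn without replacement.
treated_idx = rng.choice(n, size=n1, replace=False)
w_complete = np.zeros(n, dtype=int)
w_complete[treated_idx] = 1

# Simple (Bernoulli) randomization: independent coin flips with p = n1 / n.
w_bernoulli = rng.binomial(1, n1 / n, size=n)

print(w_complete, w_complete.sum())     # the sum is always n1
print(w_bernoulli, w_bernoulli.sum())   # the sum varies from draw to draw
```

Complete randomization fixes the group sizes in advance, whereas Bernoulli randomization is easier to run when users arrive one at a time.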

## 3.1. A/B Testing for Online Platforms

Thanks to Internet technology, randomized experiments of unprecedented scale take place every day at tech giants such as Google, Facebook, Kwai, and Alibaba. Such platforms generally run thousands of experiments simultaneously every day, influencing billions of platform users. A typical context for A/B testing is when the platform wants to examine whether a new product feature, user interface (UI), or algorithm is better than the benchmark. This is implemented in the following manner: users are randomly assigned to the treatment group (new strategy applied) or the control group (baseline strategy applied). The strategy applied is the **only difference** between these two groups. 

The following figure illustrates a typical example. A platform wants to examine which of the following UI designs is better for the platform:

- Design A: An orange button on the right of a page;
- Design B: A blue button on the left of a page.

The platform randomly assigns its users into two groups, the first group (called the control group, 50% of the users) with Design A and the second group (called the treatment group, 50% of the users) with Design B.

<img src="AB1.png" width=500>

To evaluate Design A and Design B, we compare the outcomes (i.e., the conversion rates) of the treatment group and the control group. The difference (40% for the treatment group vs. 20% for the control group) measures the effectiveness of the new strategy (here Design B). Next, we will study how to formally perform causal inference analysis with A/B testing.

## 3.2. Causal Inference Analysis with AB Testing: Direct t-Test 

In this subsection, we demonstrate how to draw causal conclusions from A/B testing. We assume that the data set $\mathcal D$ is generated under the complete randomization mechanism. Two approaches are introduced: (a) directly using $\widehat{ATE}$ and (b) running an OLS regression.

As discussed above, with complete randomization, $\widehat{ATE}$ is a good estimator of the true $ATE$ as long as the sample sizes $n_1$ and $n_0$ are large. In addition to the estimate itself, we are also interested in the standard error of $\widehat{ATE}$, because the estimated standard error lets us construct a confidence interval for $ATE$, which helps us understand whether the average treatment effect is statistically distinguishable from 0. Specifically, the estimated standard error of $\widehat{ATE}$ is given by:

$$\widehat{SE}:= \sqrt{\frac{\hat \sigma^2_1}{n_1}+\frac{\hat \sigma^2_0}{n_0}}$$

where $\hat\sigma_1^2$ is the sample variance of the treatment group

$$\hat\sigma_1^2:=\frac{\sum_{W_i=1}(Y_i(1)-\bar Y(1))^2}{n_1-1}$$

and $\hat\sigma_0^2$ is the sample variance of the control group

$$\hat\sigma_0^2:=\frac{\sum_{W_i=0}(Y_i(0)-\bar Y(0))^2}{n_0-1}$$

With the estimates $\widehat{ATE}$ and $\widehat{SE}$, we can compute the $t$-statistic

$$\hat t:=\frac{\widehat{ATE}}{\widehat{SE}}$$ 

and the $p$-value associated with $\hat t$. The $t$-statistic is used to perform a $t$-test ([Wiki Page](https://en.wikipedia.org/wiki/Student%27s_t-test)) that statistically tests whether the two groups have the same mean outcome (i.e., whether $\mathbb E[Y(1)]=\mathbb E[Y(0)]$, or equivalently $ATE=0$). 


The 95%-confidence interval of $ATE$ is then given by 

$$CI_{0.95}:=[\widehat{ATE}-1.96\widehat{SE},\widehat{ATE}+1.96\widehat{SE}],$$

where 1.96 is the $z$-value ([Wiki page](https://en.wikipedia.org/wiki/Standard_score)) at the 97.5th percentile of a standard normal distribution, as illustrated in the following figure, where the $X$-axis represents the number of standard deviations away from the mean.

<img src="CI_0.95.jpg" width=750>


The statistical interpretation of the 95%-confidence interval is that, if we repeated the randomized experiment many times and constructed the interval from each data sample $\mathcal D$, about 95% of the resulting intervals would cover the true value of $ATE$ (e.g., roughly 95 out of 100 experiments). Therefore, 

- If the lower bound of the 95%-confidence interval satisfies $\widehat{ATE}-1.96\widehat{SE}>0$, we conclude at the 5% significance level that $ATE>0$ (i.e., the treatment effect is positive);
- If the upper bound of the 95%-confidence interval satisfies $\widehat{ATE}+1.96\widehat{SE}<0$, we conclude at the 5% significance level that $ATE<0$ (i.e., the treatment effect is negative). 
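
The coverage interpretation can be checked by simulation. The sketch below repeatedly simulates a hypothetical experiment with a known effect (the effect size, active rates, and sample sizes are made-up assumptions) and counts how often the interval $[\widehat{ATE}-1.96\widehat{SE},\widehat{ATE}+1.96\widehat{SE}]$ covers the true value.

```python
import numpy as np

rng = np.random.default_rng(1)
true_ate, p0 = 0.10, 0.30        # hypothetical effect and control active rate
n1 = n0 = 2_000                  # users per group in each experiment
n_experiments = 1_000

# One row of binary outcomes per simulated experiment.
y_treat = rng.binomial(1, p0 + true_ate, size=(n_experiments, n1))
y_ctrl = rng.binomial(1, p0, size=(n_experiments, n0))

ate_hat = y_treat.mean(axis=1) - y_ctrl.mean(axis=1)
se_hat = np.sqrt(y_treat.var(axis=1, ddof=1) / n1
                 + y_ctrl.var(axis=1, ddof=1) / n0)

# Does each experiment's 95% interval contain the true effect?
lower, upper = ate_hat - 1.96 * se_hat, ate_hat + 1.96 * se_hat
covered = (lower <= true_ate) & (true_ate <= upper)
coverage = covered.mean()

print(f"Empirical coverage over {n_experiments} experiments: {coverage:.3f}")
```

The empirical coverage should be close to, but not exactly, 0.95, since it is itself estimated from a finite number of simulated experiments.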

In general, if we want to construct a $(1-\alpha)$-confidence interval ($0<\alpha<1$, with $\alpha$ close to 0, say $\alpha=5\%$), the interval is given by $[\widehat{ATE}-z_{1-\frac{\alpha}{2}}\widehat{SE},\widehat{ATE}+z_{1-\frac{\alpha}{2}}\widehat{SE}]$, where $z_{1-\frac{\alpha}{2}}$ is the $(1-\frac{\alpha}{2})$-quantile of a standard normal distribution.
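
The formulas of this subsection can be collected into one helper. This is a minimal sketch under the normal approximation: the function name `ate_confidence_interval` and the tiny input arrays are hypothetical, and `scipy.stats.norm.ppf` supplies the quantile $z_{1-\frac{\alpha}{2}}$.

```python
import numpy as np
from scipy.stats import norm

def ate_confidence_interval(y_treat, y_ctrl, alpha=0.05):
    """Difference in means with a normal-approximation (1 - alpha) interval."""
    y_treat = np.asarray(y_treat, dtype=float)
    y_ctrl = np.asarray(y_ctrl, dtype=float)
    ate_hat = y_treat.mean() - y_ctrl.mean()
    # Sample variances use the n - 1 denominator (ddof=1), as in the text.
    se_hat = np.sqrt(y_treat.var(ddof=1) / y_treat.size
                     + y_ctrl.var(ddof=1) / y_ctrl.size)
    z = norm.ppf(1 - alpha / 2)          # about 1.96 for alpha = 0.05
    return ate_hat, se_hat, (ate_hat - z * se_hat, ate_hat + z * se_hat)

# Tiny made-up samples, only to show the call.
y_treat = [1, 0, 1, 1, 0, 1, 1, 0]
y_ctrl = [0, 0, 1, 0, 0, 1, 0, 0]

ate_hat, se_hat, ci_95 = ate_confidence_interval(y_treat, y_ctrl, alpha=0.05)
_, _, ci_99 = ate_confidence_interval(y_treat, y_ctrl, alpha=0.01)

print(f"ATE_hat={ate_hat:.3f}, SE_hat={se_hat:.3f}, t={ate_hat / se_hat:.3f}")
print("95% CI:", ci_95)
print("99% CI:", ci_99)
```

A smaller $\alpha$ yields a wider interval, so the 99% interval contains the 95% interval.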

## 3.3. Coupon Experiment of Didi

Next, we consider an A/B test run by the ride-sharing platform Didi, which sends out coupons to encourage its users to ride. Specifically, we examine the impact of coupons on the activeness and spending of the users. Didi **randomly assigns** 100,000 users into treatment (50%) and control (50%) groups. At 0:00 on the experiment day, the platform sends a coupon to the treatment users and does nothing to the control users. In the language of the potential outcomes model, $W_i=1$ (resp. $W_i=0$) means that user $i$ receives (resp. does not receive) the coupon. The potential outcome $Y_i(W_i)$ is whether user $i$ is active on the day of the A/B test, given the assignment $W_i$. Another potential outcome is the total spending of user $i$ on Didi on the day of the A/B test.

The key questions of interest for the platform are:

- **Does the coupon increase the activeness of the users on the platform?**
- **Which users are most sensitive to the coupon?**

In this section, we discuss how to use causal inference analysis with data from an A/B test to address these questions. Let's first load the data into Python.

In [1]:
# Import necessary packages
import sys 
import numpy as np
import pandas as pd
import statsmodels.api as sm
import sklearn
import scipy as sp
%matplotlib inline 
import matplotlib.pyplot as plt

In [2]:
df_Didi  = pd.read_csv("Didi.csv")
df_Didi.head()

Unnamed: 0,is_active,new_spending,Coupon,Male,ActiveDays,Spending
0,0,0.0,1,1,2,47.22
1,0,0.0,1,1,0,0.0
2,0,0.0,1,0,1,15.72
3,1,23.31,0,0,3,79.08
4,0,0.0,0,1,1,16.15


The data set has 6 variables (the leading `Unnamed: 0` column appears to be just the row index saved with the CSV file):

- **is_active $\in\{0,1\}$:** Whether user $i$ is active on the day of the experiment, 1=active, and 0=inactive;
- **new_spending $\in\mathbb R^+$:** The amount of spending by user $i$ on the day of the experiment;
- **Coupon $\in\{0,1\}$:** Whether user $i$ is in the treatment group (i.e., $W_i$ in the potential outcomes model), 1=treatment, and 0=control;
- **Male $\in\{0,1\}$:** The gender of user $i$, 1=male, and 0=female;
- **ActiveDays $\in\mathbb Z^+$:** The number of active days of user $i$ in the week before the experiment (i.e., days $t-7,t-6,t-5,...,t-1$);
- **Spending $\in\mathbb R^+$:** The amount of spending by user $i$ in the week before the experiment (i.e., days $t-7,t-6,t-5,...,t-1$).

We now examine the effect of the coupon on users' activeness. We first estimate $\widehat{ATE}$ and $\widehat{SE}$, starting with the $t$-test approach.

In [3]:
# Estimate the average treatment effect (ATE) on user activeness as the
# difference in mean outcomes between the treatment and control groups.
treated = df_Didi[df_Didi['Coupon']==1]['is_active']
control = df_Didi[df_Didi['Coupon']==0]['is_active']

ATE = treated.mean()-control.mean()

# Sample standard deviations (pandas uses the n-1 denominator by default)
sd1 = treated.std()
sd0 = control.std()

# Group sizes
n1 = treated.shape[0]
n0 = control.shape[0]

# Estimated standard error of the ATE estimate
ATE_STD=np.sqrt(sd1**2/n1+sd0**2/n0)

Next, we perform the $t-$test, and compute the 95% confidence interval.

In [4]:
t_statistic = ATE/ATE_STD

UCB_95 = ATE+1.96*ATE_STD
LCB_95 = ATE-1.96*ATE_STD

print("The estimated ATE is %0.4f." %ATE)
print("The estimated standard error of ATE is %0.4f." %ATE_STD)
print("The t-statistic is: %0.4f."%t_statistic)
print("The 95% confidence interval for ATE is [",LCB_95,",",UCB_95,"].")

The estimated ATE is 0.0975.
The estimated standard error of ATE is 0.0027.
The t-statistic is: 35.7554.
The 95% confidence interval for ATE is [ 0.09219811220395034 , 0.10289237233567845 ].


The estimate of $ATE$ is 0.0975, with a 95% confidence interval of $[0.0922, 0.1029]$. Since this interval excludes 0, the coupon significantly increases the probability that a user is active on the day of the experiment.
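The manual computation above can be wrapped in a small reusable helper. The following is a minimal sketch, not the code used to produce the results in this document: the function name `diff_in_means` is hypothetical, and the data are synthetic (a binary outcome with an assumed true effect of 0.10), not the Didi data.

```python
import numpy as np

def diff_in_means(y, w, z=1.96):
    """Difference-in-means ATE estimate with a normal-approximation CI."""
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=int)
    y1, y0 = y[w == 1], y[w == 0]
    ate = y1.mean() - y0.mean()
    # Unequal-variance standard error; ddof=1 matches the default of pandas' .std().
    se = np.sqrt(y1.var(ddof=1) / y1.size + y0.var(ddof=1) / y0.size)
    return {"ate": ate, "se": se, "t": ate / se, "ci": (ate - z * se, ate + z * se)}

# Synthetic data (an assumption for illustration): true effect of 0.10 on a binary outcome.
rng = np.random.default_rng(0)
w = rng.integers(0, 2, size=20_000)
y = rng.binomial(1, 0.20 + 0.10 * w)

res = diff_in_means(y, w)
print("ATE = %0.4f, SE = %0.4f, t = %0.2f" % (res["ate"], res["se"], res["t"]))
print("95%% CI = [%0.4f, %0.4f]" % res["ci"])
```

With a data frame such as `df_Didi`, the same helper could be called as `diff_in_means(df_Didi['is_active'], df_Didi['Coupon'])`.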

Actually, in Python, we could directly run a $t-$test for the experiment data. See [this documentation](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html) for a description of the ``ttest_ind`` function.

In [5]:
# import the ttest package

from scipy.stats import ttest_ind


ttest_ind(df_Didi[df_Didi['Coupon']==1]['is_active'], df_Didi[df_Didi['Coupon']==0]['is_active']\
          ,equal_var = False)

Ttest_indResult(statistic=35.755381390360235, pvalue=3.390215581693204e-278)

Here, the $p-$value is the probability, computed under the null hypothesis $ATE=0$, of observing a test statistic at least as extreme as the one obtained from our data. The function ``ttest_ind()`` performs a two-sample $t-$test; its first argument is the first data sample and its second argument is the second data sample. The argument ``equal_var = False`` means that the two samples are not assumed to have equal variances (this variant is known as Welch's $t-$test). The $t-$statistic reported by ``ttest_ind()`` matches our manual computation. Any slight difference in the $p-$value or the confidence interval arises because we use the normal distribution to approximate a Student's $t-$distribution; with a sample this large, the gap is negligible (see [Wiki Page](https://en.wikipedia.org/wiki/Student%27s_t-test)).
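To see why the normal approximation is harmless here, one can compute the degrees of freedom that the unequal-variance $t-$test uses (the Welch–Satterthwaite formula). The sketch below uses only the standard library; the function name `welch_df` and the input numbers are hypothetical placeholders, not values taken from the Didi data.

```python
def welch_df(s1, n1, s0, n0):
    """Welch-Satterthwaite degrees of freedom for an unequal-variance t-test."""
    v1, v0 = s1**2 / n1, s0**2 / n0
    return (v1 + v0) ** 2 / (v1**2 / (n1 - 1) + v0**2 / (n0 - 1))

# Hypothetical standard deviations and group sizes, for illustration only.
df_welch = welch_df(s1=0.46, n1=50_000, s0=0.40, n0=50_000)
print("Welch degrees of freedom: %0.1f" % df_welch)
```

The result always lies between $\min(n_1,n_0)-1$ and $n_1+n_0-2$. With tens of thousands of degrees of freedom, the Student's $t-$distribution is practically indistinguishable from the standard normal, so using 1.96 as the critical value is a safe approximation.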

We also conduct a $t$-test to examine whether sending coupons affects the total spending of a user. A rule of thumb is to reject the null hypothesis (i.e., $ATE=0$) if the $p$-value is <font color="red"> **smaller than 0.05** </font>, which is equivalent to 0 lying outside the <font color="red"> **95% confidence interval** </font>.

In [6]:
ttest_ind(df_Didi[df_Didi['Coupon']==1]['new_spending'], df_Didi[df_Didi['Coupon']==0]['new_spending']\
          ,equal_var = False)

Ttest_indResult(statistic=53.79566140948277, pvalue=0.0)

Hence, the coupon significantly improves the total spending of a user on the platform. The ATE can be estimated as:

In [7]:
ATE_spending = df_Didi[df_Didi['Coupon']==1]['new_spending'].mean()-df_Didi[df_Didi['Coupon']==0]['new_spending'].mean()
print("The estimated ATE for new spending is %0.4f." %ATE_spending)

The estimated ATE for new spending is 3.7734.


Therefore, receiving the coupon raises a user's spending on the platform on the day of the experiment by RMB 3.77 on average.

### Balance Check

The validity of the above analysis relies on the assignment of each user to treatment or control (i.e., $W_i$) being well-randomized, so that we can attribute the changes in the outcome variable to the treatment itself. The bad news is that we cannot rigorously verify that $W_i$ is fully random. The good news is that there is a practical compromise: we can check that the treatment and control groups look alike on observed pre-treatment features and are, therefore, **comparable**. This procedure is called a **balance check**.

To perform the balance check, we identify some important pre-treatment features and run a $t$-test on each to verify that their means do not differ significantly between the two groups. By "important", we mean <font color="red"> features that may influence the outcome of interest </font> during the experiment. In the coupon experiment, we conduct balance checks for the 3 pre-treatment features **Male**, **ActiveDays**, and **Spending**. 

In [8]:
## Balance check for gender.

ttest_ind(df_Didi[df_Didi['Coupon']==1]['Male'], df_Didi[df_Didi['Coupon']==0]['Male']\
          ,equal_var = False)

Ttest_indResult(statistic=1.6171226742564309, pvalue=0.10585495759309685)

Therefore, we cannot reject the null hypothesis. The share of male users in the treatment group is not significantly different from that in the control group.

In [9]:
## Balance check for active days of one week before the experiment.

ttest_ind(df_Didi[df_Didi['Coupon']==1]['ActiveDays'], df_Didi[df_Didi['Coupon']==0]['ActiveDays']\
          ,equal_var = False)

Ttest_indResult(statistic=0.29863680940148807, pvalue=0.7652177970992515)

Therefore, we cannot reject the null hypothesis. The mean of **ActiveDays** in the treatment group is not significantly different from that in the control group.

In [10]:
## Balance check for spending of one week before the experiment.

ttest_ind(df_Didi[df_Didi['Coupon']==1]['Spending'], df_Didi[df_Didi['Coupon']==0]['Spending']\
          ,equal_var = False)

Ttest_indResult(statistic=0.3155021395396201, pvalue=0.7523810852696775)

Therefore, we cannot reject the null hypothesis. The mean of **Spending** in the treatment group is not significantly different from that in the control group.

The balance checks for the pre-treatment features **Male**, **ActiveDays**, and **Spending** show no statistically significant difference in means between the treatment and control groups. This is consistent with a well-randomized assignment, so we can reasonably attribute the changes in the outcomes (**is_active** and **new_spending**) to the coupon.
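When there are many pre-treatment features, the per-feature checks can be run in a loop. The sketch below is one possible way to do this, under stated assumptions: the helper `balance_table` is hypothetical, the $p$-values use the normal approximation rather than the exact $t-$distribution, and the data are synthetic. The column `Leaky` is deliberately constructed to depend on the treatment, to show what a failed balance check looks like.

```python
import numpy as np
from statistics import NormalDist

def balance_table(X, w, names):
    """Unequal-variance t-statistic and normal-approximation p-value per feature."""
    rows = []
    for j, name in enumerate(names):
        x1, x0 = X[w == 1, j], X[w == 0, j]
        se = np.sqrt(x1.var(ddof=1) / x1.size + x0.var(ddof=1) / x0.size)
        t = float((x1.mean() - x0.mean()) / se)
        p = 2 * (1 - NormalDist().cdf(abs(t)))
        rows.append((name, t, p))
    return rows

# Synthetic data (assumption): three features independent of w, one that is not.
rng = np.random.default_rng(5)
n = 10_000
w = rng.integers(0, 2, size=n)
X = np.column_stack([
    rng.integers(0, 2, size=n),        # stand-in for Male
    rng.poisson(1.4, size=n),          # stand-in for ActiveDays
    rng.exponential(28.0, size=n),     # stand-in for Spending
    0.5 * w + rng.normal(size=n),      # deliberately imbalanced feature
])
names = ["Male", "ActiveDays", "Spending", "Leaky"]

table = balance_table(X, w, names)
for name, t, p in table:
    print("%-10s t = %8.3f   p = %0.4f" % (name, t, p))
```

A very small $p$-value for a feature (as for `Leaky` here) signals that the two groups differ on that feature and that the randomization should be investigated.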

## 3.4. Causal Inference Analysis with AB Testing: OLS Regression

Besides performing the $t-$test, an alternative approach to estimating the average treatment effect with data from a randomized experiment is to use OLS linear regression:

<font color="red">

$$Y_i\approx \hat \beta_0+\hat\tau W_i$$
    
</font>

where $\hat\beta_0$ is the average active rate of the users in the control group and $\hat\tau$ is the estimated average treatment effect $\widehat{ATE}$. Thus, $\hat\beta_0+\hat\tau$ is the average active rate of the users in the treatment group.
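The equivalence between the OLS slope on a binary treatment indicator and the difference in group means can be checked numerically. The following sketch uses synthetic data and `numpy.linalg.lstsq` instead of `statsmodels`, purely to keep the example self-contained; the data-generating numbers are assumptions for illustration.

```python
import numpy as np

# Synthetic data (assumption): control mean 2.0, true treatment effect 1.5.
rng = np.random.default_rng(1)
n = 5_000
w = rng.integers(0, 2, size=n)
y = 2.0 + 1.5 * w + rng.normal(size=n)

# OLS of y on an intercept and the treatment indicator.
X = np.column_stack([np.ones(n), w])
beta0_hat, tau_hat = np.linalg.lstsq(X, y, rcond=None)[0]

control_mean = y[w == 0].mean()
diff_means = y[w == 1].mean() - control_mean

print("intercept = %0.6f, control mean        = %0.6f" % (beta0_hat, control_mean))
print("slope     = %0.6f, difference in means = %0.6f" % (tau_hat, diff_means))
```

The two pairs of numbers agree up to floating-point error, which is why the regression reproduces the $ATE$ estimate from the $t-$test approach.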

In [11]:
# Fit a linear regression model.

# OLS specification: is_active ~ beta_0 + beta_1 * Coupon + epsilon

LinearModel = sm.OLS(endog = df_Didi['is_active'], exog = sm.add_constant(df_Didi['Coupon']))

# add_constant means we include the intercept in the linear regression model.

result = LinearModel.fit()
print(result.summary())

                            OLS Regression Results                            
Dep. Variable:              is_active   R-squared:                       0.013
Model:                            OLS   Adj. R-squared:                  0.013
Method:                 Least Squares   F-statistic:                     1279.
Date:                Tue, 31 Jan 2023   Prob (F-statistic):          2.49e-278
Time:                        11:09:27   Log-Likelihood:                -57792.
No. Observations:              100000   AIC:                         1.156e+05
Df Residuals:                   99998   BIC:                         1.156e+05
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.2030      0.002    105.352      0.0

In [12]:
# We do the same thing for new_spending.

# OLS specification: new_spending ~ beta_0 + beta_1 * Coupon + epsilon

LinearModel_new_spending = sm.OLS(endog = df_Didi['new_spending'], exog = sm.add_constant(df_Didi['Coupon']))
result_new_spending = LinearModel_new_spending.fit()
print(result_new_spending.summary())

                            OLS Regression Results                            
Dep. Variable:           new_spending   R-squared:                       0.028
Model:                            OLS   Adj. R-squared:                  0.028
Method:                 Least Squares   F-statistic:                     2897.
Date:                Tue, 31 Jan 2023   Prob (F-statistic):               0.00
Time:                        11:09:28   Log-Likelihood:            -3.8245e+05
No. Observations:              100000   AIC:                         7.649e+05
Df Residuals:                   99998   BIC:                         7.649e+05
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          4.0725      0.050     82.212      0.0

As we can see from the above analysis, for both **is_active** and **new_spending**, linear regression produces the same $ATE$ estimate as the $t-$test.

We can also perform balance checks using OLS regressions.

In [13]:
# Balance check for feature Male.

# OLS specification: Male ~ beta_0 + beta_1 * Coupon + epsilon

BalanceCheck_Male = sm.OLS(endog = df_Didi['Male'], exog = sm.add_constant(df_Didi['Coupon']))
result_BC_Male = BalanceCheck_Male.fit()
print(result_BC_Male.summary())

                            OLS Regression Results                            
Dep. Variable:                   Male   R-squared:                       0.000
Model:                            OLS   Adj. R-squared:                  0.000
Method:                 Least Squares   F-statistic:                     2.615
Date:                Tue, 31 Jan 2023   Prob (F-statistic):              0.106
Time:                        11:09:29   Log-Likelihood:                -72577.
No. Observations:              100000   AIC:                         1.452e+05
Df Residuals:                   99998   BIC:                         1.452e+05
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.4953      0.002    221.652      0.0

In [14]:
# Balance check for feature ActiveDays.

# OLS specification: ActiveDays ~ beta_0 + beta_1 * Coupon + epsilon


BalanceCheck_ActiveDays = sm.OLS(endog = df_Didi['ActiveDays'], exog = sm.add_constant(df_Didi['Coupon']))
result_BC_ActiveDays = BalanceCheck_ActiveDays.fit()
print(result_BC_ActiveDays.summary())

                            OLS Regression Results                            
Dep. Variable:             ActiveDays   R-squared:                       0.000
Model:                            OLS   Adj. R-squared:                 -0.000
Method:                 Least Squares   F-statistic:                   0.08919
Date:                Tue, 31 Jan 2023   Prob (F-statistic):              0.765
Time:                        11:09:29   Log-Likelihood:            -1.4744e+05
No. Observations:              100000   AIC:                         2.949e+05
Df Residuals:                   99998   BIC:                         2.949e+05
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          1.3936      0.005    295.025      0.0

In [15]:
# Balance check for feature Spending.

# OLS specification: Spending ~ beta_0 + beta_1 * Coupon + epsilon

BalanceCheck_Spending = sm.OLS(endog = df_Didi['Spending'], exog = sm.add_constant(df_Didi['Coupon']))
result_BC_Spending = BalanceCheck_Spending.fit()
print(result_BC_Spending.summary())

                            OLS Regression Results                            
Dep. Variable:               Spending   R-squared:                       0.000
Model:                            OLS   Adj. R-squared:                 -0.000
Method:                 Least Squares   F-statistic:                   0.09954
Date:                Tue, 31 Jan 2023   Prob (F-statistic):              0.752
Time:                        11:09:29   Log-Likelihood:            -4.5815e+05
No. Observations:              100000   AIC:                         9.163e+05
Df Residuals:                   99998   BIC:                         9.163e+05
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         27.9550      0.106    264.707      0.0

It is evident that the OLS-based balance checks give the same results as the t-test based balance checks.

Another approach to conducting a **balance check** is to **regress the treatment on relevant features**. Formally, we run the following model specification:

$$W_i\approx \hat\alpha_0+\sum_{j=1}^p\hat\alpha_j X_{ij}\mbox{ for all }i=1,2,...,n$$

The treatment assignment is well-balanced if the $\hat\alpha_j$'s are **insignificant**, which is indicated by **not rejecting the null hypothesis in the [F-test](https://en.wikipedia.org/wiki/F-test)**, i.e., $H_0:\alpha_1=\alpha_2=\cdots=\alpha_p=0$.
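The overall F-statistic of this regression can be computed from its $R^2$ as $F=\frac{R^2/p}{(1-R^2)/(n-p-1)}$. The sketch below illustrates this on synthetic data; the helper `overall_f_stat` is hypothetical, and the two assignments (one random, one that deliberately depends on a feature) are assumptions chosen to contrast a balanced and an imbalanced experiment.

```python
import numpy as np

def overall_f_stat(X, w):
    """F-statistic for H0: all slope coefficients are zero in w ~ 1 + X."""
    n, p = X.shape
    Z = np.column_stack([np.ones(n), X])
    coef = np.linalg.lstsq(Z, w, rcond=None)[0]
    resid = w - Z @ coef
    rss = resid @ resid
    tss = ((w - w.mean()) ** 2).sum()
    r2 = 1 - rss / tss
    return (r2 / p) / ((1 - r2) / (n - p - 1))

rng = np.random.default_rng(6)
n = 5_000
X = rng.normal(size=(n, 3))                      # synthetic pre-treatment features

w_random = rng.integers(0, 2, size=n)            # assignment independent of X
w_leaky = (X[:, 0] + rng.normal(size=n) > 0).astype(int)  # assignment depends on X

f_random = overall_f_stat(X, w_random)
f_leaky = overall_f_stat(X, w_leaky)
print("F (random assignment): %0.3f" % f_random)
print("F (leaky assignment):  %0.3f" % f_leaky)
```

The corresponding $p$-value could be obtained with `scipy.stats.f.sf(F, p, n - p - 1)`; it is omitted here to keep the sketch free of extra dependencies.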

In [16]:
# Balance check based on F-statistic.

# OLS specification: Coupon ~ alpha_0 + alpha_1 * Male + alpha_2 * ActiveDays + alpha_3 * Spending + epsilon

BalanceCheck_all = sm.OLS(endog = df_Didi['Coupon'], exog = sm.add_constant(df_Didi[['Male','ActiveDays','Spending']]))
result_BC_all = BalanceCheck_all.fit()
print(result_BC_all.summary())

                            OLS Regression Results                            
Dep. Variable:                 Coupon   R-squared:                       0.000
Model:                            OLS   Adj. R-squared:                 -0.000
Method:                 Least Squares   F-statistic:                    0.9063
Date:                Tue, 31 Jan 2023   Prob (F-statistic):              0.437
Time:                        11:09:31   Log-Likelihood:                -72578.
No. Observations:              100000   AIC:                         1.452e+05
Df Residuals:                   99996   BIC:                         1.452e+05
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.4961      0.003    162.268      0.0

As shown by the regression table above, the $p$-value of the F-test is 0.437>0.05, so we cannot reject the null hypothesis; the data are consistent with well-balanced treatment and control groups.

To further reduce the variance of the estimates, we include the features in the OLS regression. Formally, let $X_{i}=(X_{i1},X_{i2},...,X_{ip})$ be the features associated with subject $i$. The OLS model specification for causal inference is given by:
$$Y_i\approx \hat\beta_0+\hat\tau W_i + \sum_{j=1}^p\hat\beta_jX_{ij}\mbox{ for all }i=1,2,...,n$$
Here, we are most interested in the estimate of the average treatment effect $\hat \tau$.
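The variance reduction from adding pre-treatment features can be illustrated with a small simulation. This is a sketch under stated assumptions: the data are synthetic, the feature `x` is constructed to explain much of the outcome variance, and the helper `ols_coef_se` (hypothetical) computes classical, non-robust standard errors.

```python
import numpy as np

def ols_coef_se(Z, y):
    """OLS coefficients and classical (homoskedastic) standard errors."""
    coef = np.linalg.lstsq(Z, y, rcond=None)[0]
    resid = y - Z @ coef
    sigma2 = resid @ resid / (Z.shape[0] - Z.shape[1])
    cov = sigma2 * np.linalg.inv(Z.T @ Z)
    return coef, np.sqrt(np.diag(cov))

# Synthetic data (assumption): true treatment effect 0.5, strong feature effect 2.0.
rng = np.random.default_rng(2)
n = 10_000
w = rng.integers(0, 2, size=n)
x = rng.normal(size=n)                       # pre-treatment feature
y = 1.0 + 0.5 * w + 2.0 * x + rng.normal(size=n)

ones = np.ones(n)
coef_a, se_a = ols_coef_se(np.column_stack([ones, w]), y)      # without the feature
coef_b, se_b = ols_coef_se(np.column_stack([ones, w, x]), y)   # with the feature

print("without feature: tau_hat = %0.4f, SE = %0.4f" % (coef_a[1], se_a[1]))
print("with feature:    tau_hat = %0.4f, SE = %0.4f" % (coef_b[1], se_b[1]))
```

Both specifications target the same treatment effect because `w` is randomized, but the specification with the feature leaves less unexplained variation in the outcome and therefore yields a smaller standard error for $\hat\tau$.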


We take the **new_spending** as the outcome variable. You may check the case with **is_active** as another outcome variable.

In [17]:
# OLS model with features.

# OLS specification: new_spending ~ beta_0 + beta_1 * Coupon + beta_2 * Male + beta_3 * ActiveDays + beta_4 * Spending + epsilon


features = ['Coupon','Male','ActiveDays','Spending']

OLS_f_new_spending = sm.OLS(endog = df_Didi['new_spending'], exog = sm.add_constant(df_Didi[features]))
result_f_new_spending = OLS_f_new_spending.fit()
print(result_f_new_spending.summary())

                            OLS Regression Results                            
Dep. Variable:           new_spending   R-squared:                       0.044
Model:                            OLS   Adj. R-squared:                  0.043
Method:                 Least Squares   F-statistic:                     1138.
Date:                Tue, 31 Jan 2023   Prob (F-statistic):               0.00
Time:                        11:09:33   Log-Likelihood:            -3.8165e+05
No. Observations:              100000   AIC:                         7.633e+05
Df Residuals:                   99995   BIC:                         7.634e+05
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          4.0921      0.076     54.146      0.0

Therefore, after controlling for the features, receiving a coupon has a similar estimated impact on the total spending per user (**RMB 3.77**).

### Heterogeneous Treatment Effect (HTE)

Next, we examine the heterogeneous treatment effect (HTE) of the coupon, i.e., how the treatment effect differs across users. To this end, we introduce the interaction term **Spending * Coupon**, whose coefficient captures how the treatment effect of receiving a coupon on total spending varies with a user's prior spending.

Formally, if we are interested in the HTE of $W_i$ with respect to feature $X_k$, the model specification is:

<font color="red">

$$Y_i\approx\hat\beta_0+\hat\tau_1 W_i + \hat\tau_2W_i X_{ik}+\sum_{j=1}^p\hat\beta_jX_{ij}$$

</font>

Therefore, we can interpret the estimates $\hat\tau_1$ and $\hat\tau_2$ as:

------

<font color="red">

- $\hat\tau_1=$The average causal effect of the treatment if $X_k=0$;
- $\hat\tau_2=$The average additional causal effect of the treatment if $X_k$ increases by one unit;
- $\hat\tau_1+\hat\tau_2 X_k=$The average causal effect of the treatment given feature $X_k$.

</font>
    
-------
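The interpretation of $\hat\tau_1$ and $\hat\tau_2$ above can be sketched on synthetic data where the true heterogeneity is known. Everything in this example is an assumption for illustration: the data-generating model (true effect $1.0+0.3X_k$), the variable names, and the helper `cate`.

```python
import numpy as np

# Synthetic data (assumption): the true treatment effect is 1.0 + 0.3 * x.
rng = np.random.default_rng(3)
n = 20_000
w = rng.integers(0, 2, size=n)
x = rng.uniform(0, 10, size=n)
y = 2.0 + 1.0 * w + 0.3 * w * x + 0.5 * x + rng.normal(size=n)

# OLS with an interaction term: y ~ 1 + w + w*x + x.
Z = np.column_stack([np.ones(n), w, w * x, x])
beta0_hat, tau1_hat, tau2_hat, beta1_hat = np.linalg.lstsq(Z, y, rcond=None)[0]

def cate(xk):
    """Estimated average treatment effect for a subject with feature value xk."""
    return tau1_hat + tau2_hat * xk

print("tau1_hat = %0.4f, tau2_hat = %0.4f" % (tau1_hat, tau2_hat))
for xk in (0.0, 5.0, 10.0):
    print("estimated effect at x = %4.1f: %0.4f" % (xk, cate(xk)))
```

Here `tau1_hat` estimates the effect at $X_k=0$, and each additional unit of $X_k$ changes the estimated effect by `tau2_hat`.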

In [18]:
# OLS model with features to estimate the HTE of coupon.

# OLS specification: 
# new_spending ~ beta_0 + beta_1 * Coupon + beta_2 * Coupon * Spending + beta_3 * Spending + epsilon

df_Didi['Coupon*Spending'] = df_Didi['Coupon'] * df_Didi['Spending'] 

features = ['Coupon','Coupon*Spending','Spending']

OLS_hte = sm.OLS(endog = df_Didi['new_spending'], exog = sm.add_constant(df_Didi[features]))
result_hte = OLS_hte.fit()
print(result_hte.summary())

                            OLS Regression Results                            
Dep. Variable:           new_spending   R-squared:                       0.031
Model:                            OLS   Adj. R-squared:                  0.031
Method:                 Least Squares   F-statistic:                     1068.
Date:                Tue, 31 Jan 2023   Prob (F-statistic):               0.00
Time:                        11:09:35   Log-Likelihood:            -3.8230e+05
No. Observations:              100000   AIC:                         7.646e+05
Df Residuals:                   99996   BIC:                         7.646e+05
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                      coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
const               3.5316      0.077     

An immediate observation from the regression result is that the coefficient for **Coupon** and that for **Coupon * Spending** are both significantly positive, implying that:

-------------
<font color="red">

- Receiving a coupon causes a user to spend more on the platform.
- The more a user spent last week, the more receiving a coupon boosts the user's spending on the platform on the day of receipt (an additional boost of RMB 0.0112 for each yuan spent on the platform last week).

</font>

---------------

Next, we examine whether there exists an HTE of the coupon on the outcome variable **is_active**.

In [19]:
# OLS model with features to estimate the HTE of coupon on is_active.

# OLS specification: 
# is_active ~ beta_0 + beta_1 * Coupon + beta_2 * Coupon * Spending + beta_3 * Spending + epsilon

features = ['Coupon','Coupon*Spending','Spending']

OLS_hte_a = sm.OLS(endog = df_Didi['is_active'], exog = sm.add_constant(df_Didi[features]))
result_hte_a = OLS_hte_a.fit()
print(result_hte_a.summary())

                            OLS Regression Results                            
Dep. Variable:              is_active   R-squared:                       0.013
Model:                            OLS   Adj. R-squared:                  0.013
Method:                 Least Squares   F-statistic:                     427.4
Date:                Tue, 31 Jan 2023   Prob (F-statistic):          6.44e-276
Time:                        11:09:38   Log-Likelihood:                -57790.
No. Observations:              100000   AIC:                         1.156e+05
Df Residuals:                   99996   BIC:                         1.156e+05
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                      coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
const               0.2023      0.003     

The regression result above shows that the coefficient for **Coupon * Spending** is insignificant, suggesting that the effect of receiving a coupon on activity does not grow with how much the user spent on the platform in the week prior to the experiment.

Read [this chapter](https://matheusfacure.github.io/python-causality-handbook/02-Randomised-Experiments.html) for more about randomized experiments.

<a id='section_4'></a>
# 4. Experiment Design

## 4.1. Internal Validity and External Validity

In this section, we discuss some fundamental issues in experiment design. The first is the distinction between **<font color="red">internal validity</font>** and **<font color="red">external validity</font>**, which the following figure illustrates:

<img src="validity.png" width=750>



**<font color="red">Internal validity</font>** means that subjects are randomly assigned to treatment and control groups and, as a consequence, there is no difference between the treated subjects and the control subjects, other than the treatment itself, that could affect the outcome. Under internal validity, there are **no confounding factors** in the estimated effect.


In the case above about the A/B testing on Didi platform's coupon (in [Section 3](#section_3)), the internal validity is satisfied. However, for the case of hospitalization and the toy example of Didi coupon (in [Section 1](#section_1)), the internal validity is violated.

Typical threats to internal validity include:

- **<font color="red">Failure of randomization:</font>** For example, the implementing partner of the experiment (say, the hospital that conducts a clinical trial) assigns its favorites to the treatment group. Another example is an imbalance between the treatment and control groups caused by a small sample size.

- **<font color="red">Non-compliance with the experimental protocol:</font>** For example, some members of the control group receive the treatment and some members of the treatment group go untreated.

- **<font color="red">Differential attrition:</font>** For example, in a clinical trial, control group subjects are more likely to drop out of a study than treatment group subjects.

**<font color="red">External validity</font>** means that the sample under evaluation accurately represents the entire population of interest. Under external validity, the experiment is conducted on a random sample of the whole population. External validity fails if the treatment would have a different effect outside the experimental environment.  

Typical threats to external validity include:

- **<font color="red">Non-representative sample:</font>** For example, some lab experiments that examine a management problem use undergraduate students as subjects. Sometimes the subjects are randomly sampled, but not from the population of interest. For instance, a new drug is designed to treat end-stage lung cancer, but the clinical trial recruits early-stage lung cancer patients.

- **<font color="red">Non-representative treatment:</font>** For example, the treatment in the experiment is different from the actual implementation. For instance, in a clinical trial, the patients in the treatment group are very carefully examined and taken care of, which may not be the case once the drug is actually in use.

In general, people believe **<font color="red">internal validity is more important than external validity</font>**. If you do not know the effects of the treatment on the units in your study, you are not well-positioned to infer the effects on units you did not study who live in circumstances you did not study. If randomization is well executed, the internal validity of a randomized experiment is ensured. External validity can be partially addressed by comparing the results of several internally valid studies conducted on different samples, in different circumstances, and at different times. However, fully addressing external validity is impossible as long as your sample is NOT the full population.

## 4.2. Stable Unit Treatment Value Assumption (SUTVA)

Another important aspect of experiment design concerns interference between different subjects in the experiment. The absence of such interference is called the **<font color="red">stable unit treatment value assumption (SUTVA)</font>**:

- **<font color="red">SUTVA:</font>** The potential outcomes for one subject are independent of the assignment of treatment/control for any other subject.

The key to SUTVA is that there should be no interference between different subjects. Designing internally valid randomized experiments when there is interference between subjects is a very important and challenging problem, both academically and practically. 

Let us see an example where SUTVA is violated in a randomized experiment. On the AirBnb platform, we provide a new feature that dramatically streamlines the booking process to a **randomized** group of users. We find that users in the treatment group, who see the new feature, book much more per user than users in the control group, who do not. This difference is actually an **over-estimate** of the true $ATE$. To understand why, we observe that:

- There are a limited number of listings on AirBnb. 
- If users in the treatment group book more, fewer listings will be available for those in the control group.
- Users in the control group will book less. Thus, SUTVA is **violated**: Assigning users to the treatment group negatively affects the potential outcomes of the users in the control group.
- Therefore, the difference between the per-user-booking of the treatment group and that of the control group would be an overestimate of the true average treatment effect. 
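
The argument in the list above can be illustrated with a toy simulation. This is a sketch under strong assumptions, none of which come from real AirBnb data: each user wants to book with a fixed probability (0.3 without the feature, 0.5 with it), users arrive in random order, and only a fixed stock of listings is available. The "true" effect is taken to be the difference between treating everyone and treating no one, and the function name `simulate_bookings` is hypothetical.

```python
import numpy as np

def simulate_bookings(w, rng, n_listings, p_control=0.3, p_treat=0.5):
    """Booking indicator per user when users compete for a limited stock of listings."""
    n = w.size
    wants = rng.random(n) < np.where(w == 1, p_treat, p_control)
    order = rng.permutation(n)                 # random arrival order
    seekers = order[wants[order]]              # users who want to book, in arrival order
    booked = np.zeros(n, dtype=bool)
    booked[seekers[:n_listings]] = True        # only the first n_listings seekers succeed
    return booked

rng = np.random.default_rng(4)
n, n_listings = 10_000, 3_500

# Randomized experiment: half of the users get the new feature.
w_exp = rng.integers(0, 2, size=n)
b_exp = simulate_bookings(w_exp, rng, n_listings)
naive_ate = b_exp[w_exp == 1].mean() - b_exp[w_exp == 0].mean()

# Global comparison: everyone treated versus no one treated.
all_treated = simulate_bookings(np.ones(n, dtype=int), rng, n_listings).mean()
all_control = simulate_bookings(np.zeros(n, dtype=int), rng, n_listings).mean()
global_ate = all_treated - all_control

print("difference in means within the experiment: %0.3f" % naive_ate)
print("all-treated minus all-control:             %0.3f" % global_ate)
```

In this toy setting the within-experiment difference is much larger than the effect of rolling the feature out to everyone, because treated users take listings away from control users during the experiment.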

**Question:** If there are more listings on AirBnb, will SUTVA be more likely to be satisfied?

## 4.3. Minimum Sample Size of A/B Tests

When running A/B tests, we need to determine the right sample size. 

--------------

<font color="red">

- If the sample size is too small, it is not possible to distinguish the true treatment effect from random noise. 
- If the sample size is too large, it may be too costly to affect that many people.

</font>    

--------------

Therefore, the minimum sample size of an A/B test should be the smallest one such that the Type-I error rate (i.e., significance level) is $\alpha$ and the Type-II error rate (i.e., 1-statistical power) is $\beta$. Usually, we set $\alpha=0.05$ and $\beta\le0.2$. This is illustrated in the following figure, where the red curve is the distribution of $\widehat{ATE}$ under the null hypothesis that the treatment has **no effect**, and the blue curve is the distribution of $\widehat{ATE}$ under the alternative hypothesis that the treatment has a **positive effect**. The [power](https://en.wikipedia.org/wiki/Power_of_a_test) of a statistical test is defined as the probability of rejecting the null hypothesis when the alternative is true (here, when the ATE is non-zero).

<img src="minimum-sample-size.png" width=750>

We denote by $n_{min}$ the minimum sample size of the entire experiment, where $n_{min}/2$ of the subjects are in the treatment group and the other $n_{min}/2$ are in the control group. Assume that the treatment does not affect the standard deviation of the potential outcome, which we denote as $\sigma$. In practice, any experiment has a smallest treatment effect that the researcher/experimenter wants to be able to detect (called the minimum detectable effect, MDE), which we denote as $\tau_{min}$. Recall that $z_{1-\frac{\alpha}{2}}$ is the $(1-\frac{\alpha}{2})$-quantile of a standard normal distribution. Similarly, we define $z_{1-\beta}$ as the $(1-\beta)$-quantile of a standard normal distribution. We use $\hat\sigma_{ATE}$ to denote the standard deviation of the estimator $\widehat{ATE}$. Since each group has $n_{min}/2$ subjects and the standard deviation of the potential outcomes is $\sigma$, the standard deviation of $\widehat{ATE}$ is given by:

$$\hat\sigma_{ATE}=\sqrt{\frac{\sigma^2}{n_{min}/2}+\frac{\sigma^2}{n_{min}/2}}=\frac{2\sigma}{\sqrt{n_{min}}}.$$

Under the null hypothesis (i.e., $H_0$) that $ATE=0$, $\widehat{ATE}$ will follow a normal distribution as the red curve above, with mean $0$ and standard deviation $\hat\sigma_{ATE}=\frac{2\sigma}{\sqrt{n_{min}}}$. Since the significance level of the statistical test is $\alpha$, we will not reject $H_0$ if $\widehat{ATE}\le \hat\sigma_{ATE}z_{1-\frac{\alpha}{2}}$ (we focus on the upper tail, since the alternative is a positive effect).

If the alternative hypothesis that $ATE=\tau_{min}$ is true, $\widehat{ATE}$ will follow a normal distribution as the blue curve above, with mean $\tau_{min}$ and standard deviation $\hat\sigma_{ATE}=\frac{2\sigma}{\sqrt{n_{min}}}$. Since the power of the statistical test is $1-\beta$, the probability of failing to reject $H_0$, i.e., of $\widehat{ATE}\le \hat\sigma_{ATE}z_{1-\frac{\alpha}{2}}$, must be at most $\beta$. Under the alternative, $\widehat{ATE}\le \tau_{\min}-\hat\sigma_{ATE}z_{1-\beta}$ holds with probability exactly $\beta$, so the non-rejection threshold must lie at or below this point. Therefore, we must have 

$$\hat\sigma_{ATE}z_{1-\frac{\alpha}{2}}\le \tau_{\min}-\hat\sigma_{ATE}z_{1-\beta}$$

that is

$$\hat\sigma_{ATE}\le\frac{\tau_{min}}{z_{1-\frac{\alpha}{2}}+z_{1-\beta}} $$

which is

$$\frac{2\sigma}{\sqrt{n_{min}}}\le\frac{\tau_{min}}{z_{1-\frac{\alpha}{2}}+z_{1-\beta}}$$

Re-arranging the terms, we have:

$$n_{min}\ge\frac{4\sigma^2(z_{1-\frac{\alpha}{2}}+z_{1-\beta})^2}{(\tau_{min})^2}$$

i.e., the minimum sample size of an A/B test with outcome standard deviation $\sigma$, minimum detectable effect $\tau_{min}$, significance level $\alpha$, and power $1-\beta$ is given by

$$n_{min}=\frac{4\sigma^2(z_{1-\frac{\alpha}{2}}+z_{1-\beta})^2}{(\tau_{min})^2}$$
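
The formula translates directly into a few lines of Python. The sketch below uses only the standard library (`statistics.NormalDist` for the normal quantiles); the function name `min_sample_size` and the example inputs are hypothetical, and the result is rounded up to the next integer.

```python
from math import ceil
from statistics import NormalDist

def min_sample_size(sigma, mde, alpha=0.05, beta=0.2):
    """Minimum total sample size (both groups combined, 50/50 split)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # z_{1 - alpha/2}
    z_beta = NormalDist().inv_cdf(1 - beta)         # z_{1 - beta}
    return ceil(4 * sigma**2 * (z_alpha + z_beta) ** 2 / mde**2)

# Example inputs (assumptions for illustration): sigma = 1, MDE = 0.1.
n_80 = min_sample_size(sigma=1.0, mde=0.1, alpha=0.05, beta=0.2)
n_90 = min_sample_size(sigma=1.0, mde=0.1, alpha=0.05, beta=0.1)
n_half_mde = min_sample_size(sigma=1.0, mde=0.05, alpha=0.05, beta=0.2)

print("power 0.8, MDE 0.10:", n_80)
print("power 0.9, MDE 0.10:", n_90)
print("power 0.8, MDE 0.05:", n_half_mde)
```

Note that the returned value is the total across both groups, so each group receives half of it.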



From the formula of the minimum sample size, we immediately obtain the following insights:

----------------

<font color = "red">

- If the outcome standard deviation $\sigma$ increases, the minimum sample size will increase.
- If the statistical significance $\alpha$ decreases, the minimum sample size will increase.
- If the statistical power $1-\beta$ increases, the minimum sample size will increase.
- If the MDE $\tau_{min}$ decreases, the minimum sample size will increase.

</font>

----------------
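
These comparative statics can also be seen from the other direction, by computing the power achieved at a given sample size. The following sketch follows the derivation above (it counts only rejections in the upper tail); the function name `power_of_test` and the numerical inputs are assumptions for illustration.

```python
from math import sqrt
from statistics import NormalDist

def power_of_test(n, sigma, tau, alpha=0.05):
    """Probability of rejecting H0 when the true ATE is tau (upper tail only)."""
    se = 2 * sigma / sqrt(n)                        # sd of ATE-hat with a 50/50 split
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    # Reject when ATE-hat > se * z_alpha, where ATE-hat ~ Normal(tau, se^2).
    return 1 - NormalDist().cdf(z_alpha - tau / se)

# Example inputs (assumptions for illustration): sigma = 1, true effect 0.1.
for n in (500, 1_000, 3_140, 6_000):
    print("n = %5d  ->  power = %0.3f" % (n, power_of_test(n, sigma=1.0, tau=0.1)))
```

Power rises with the sample size and with the effect size, and falls as $\sigma$ grows, which matches the insights listed above.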

Example: If $\alpha=0.05$ and $\beta=0.025$ (then $z_{1-\frac{\alpha}{2}}=z_{1-\beta}=1.96$), $\tau_{min}=0.1$, and $\sigma=1$, then
$$n_{min}=\frac{4\sigma^2(z_{1-\frac{\alpha}{2}}+z_{1-\beta})^2}{(\tau_{min})^2}=\frac{4\cdot 1\cdot(1.96+1.96)^2}{0.01}\approx 6146.56,$$
so at least 6147 subjects are needed.

For more advanced theory and practice of experiment design, please refer to these [lecture notes](https://artowen.su.domains/courses/363/doenotes.pdf).