## Objective

Explain what CUPED (Controlled-experiment Using Pre-Experiment Data) does, as understandably as possible.

## Dan Frank's explanation on air/cuped

> The basic idea here is simple: we know quite a lot about our users (ie. how likely they are to book) before they are assigned to an experiment. We can therefore use this information to "predict" how many bookings a user will make at the moment they are assigned and then do our experimental statistics on the "difference" between this prediction and the actual number of bookings we observe. The resultant estimators will be unbiased and have lower variance.

## My explanation (longer and involves some stats jargon)

### Large sample variance increases experiment runtime

The number of days an experiment must run is, in part, an increasing function of the variance of the control group's sample mean and the variance of the treatment group's sample mean:

$$
\text{days of experiment runtime} = f\Big(
    Var(\bar{Y}^{\text{(treatment)}}),\ Var(\bar{Y}^{\text{(control)}})
\Big)
$$

where $Y$ is the metric in question, like NALs.
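
To make this relationship concrete, here is a minimal sketch using the standard sample-size approximation for a two-sided z-test. It is an illustration under stated assumptions, not how any particular experimentation platform computes runtime: both groups are assumed to have the same per-user variance and the same constant daily traffic, and the function names and all input numbers are hypothetical.

```python
from statistics import NormalDist


def required_users_per_group(variance, min_detectable_effect, alpha=0.05, power=0.8):
    """Approximate users needed per group for a two-sided z-test.

    Assumes equal group sizes and the same per-user variance in both groups.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return 2 * variance * (z_alpha + z_power) ** 2 / min_detectable_effect ** 2


def runtime_days(variance, min_detectable_effect, users_per_group_per_day):
    """Days needed, assuming constant daily traffic (a simplifying assumption)."""
    users = required_users_per_group(variance, min_detectable_effect)
    return users / users_per_group_per_day


# Hypothetical inputs: same effect size and traffic, only the variance differs.
baseline = runtime_days(variance=4.0, min_detectable_effect=0.1, users_per_group_per_day=1_000)
reduced = runtime_days(variance=2.0, min_detectable_effect=0.1, users_per_group_per_day=1_000)
print(f"baseline: {baseline:.1f} days, half the variance: {reduced:.1f} days")
```

Under these assumptions runtime is proportional to the variance, so halving the variance halves the runtime.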

If either of these two variances increases, so does the variance of the difference between the two sample means, which is the quantity we actually test (Figure 1), and the experiment has to run longer.

![](cuped2.png)

![](cuped1.png)

> Figure 1. When the variance of either the control or the treatment group is greater (a -> d), the variance of their difference is greater too (c -> f). This increases experiment runtime: the upper experiment is statistically significant, while the lower one is not.
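
The link between the group variances and the variance of their difference can be written out in one line. This assumes users are assigned at random, so that the treatment and control sample means are independent:

$$
Var(\bar{Y}^{\text{(treatment)}} - \bar{Y}^{\text{(control)}}) = Var(\bar{Y}^{\text{(treatment)}}) + Var(\bar{Y}^{\text{(control)}})
$$

Both terms on the right are non-negative, so inflating either one inflates the variance of the difference.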

### CUPED reduces sample variance

One way to reduce the variance of a sample group (i.e. (a), (b), (d), or (e) in Figure 1) is to use a "control variate".

Let's look at just one of $\bar{Y}^{\text{(treatment)}}$ or $\bar{Y}^{\text{(control)}}$, which I will call $\bar{Y}$ for short. Suppose we have another dimension or metric, $X$, that is correlated with $Y$ (let's use # active listings per user before the experiment started). Instead of $\bar{Y}$ (i.e. the sample mean), we can then look at a new statistic $\hat{Y}_{cv}$ that also estimates $μ_Y$ but has lower variance than $\bar{Y}$. Let's define our estimator $\hat{Y}_{cv}$ as:

$$
\hat{Y}_{cv} = \bar{Y} - c\bar{X} + cμ_X
$$

In English, our "control variate" estimate of NALs, $\hat{Y}_{cv}$, is the sample mean $\bar{Y}$ minus the sample mean $\bar{X}$ times some constant $c$, plus that constant $c$ times the population mean of # active listings per user before the experiment started, $μ_X$.

It is the case that:
- $\hat{Y}_{cv}$ is an unbiased estimator of $μ_Y$ (like $\bar{Y}$ is), since $\mathbb{E}(\bar{X}) = μ_X$ and therefore $\mathbb{E}(- c\bar{X} + cμ_X) = 0$.
- $Var(\hat{Y}_{cv}) = Var(\bar{Y})(1 - r^2)$ after we choose the value of $c$ that minimizes $Var(\hat{Y}_{cv})$.
    - $r$ is the correlation between $X$ and $Y$, and $r^2$ is the $R^2$ statistic from regressing $Y$ on $X$ (Figure 2)
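
For readers who want the missing step, here is a short sketch of where the $(1 - r^2)$ factor comes from. It assumes users are i.i.d. and uses the fact that $cμ_X$ is a constant and so contributes no variance:

$$
Var(\hat{Y}_{cv}) = Var(\bar{Y}) + c^2 Var(\bar{X}) - 2c\,Cov(\bar{Y}, \bar{X})
$$

This is a quadratic in $c$, minimized at

$$
c = \frac{Cov(\bar{Y}, \bar{X})}{Var(\bar{X})} = \frac{Cov(Y, X)}{Var(X)}
$$

which is also the slope from regressing $Y$ on $X$. Substituting this $c$ back in gives

$$
Var(\hat{Y}_{cv}) = Var(\bar{Y}) - \frac{Cov(\bar{Y}, \bar{X})^2}{Var(\bar{X})} = Var(\bar{Y})(1 - r^2)
$$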

![](cuped4.png)

> Figure 2. We get $r^2$ from regressing $Y$ on $X$, where $r$ is the correlation between $X$ and $Y$.
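
The two properties above can be checked with a small simulation. This is a minimal sketch rather than a production CUPED implementation: the population means, the correlation, and the sample sizes are all made-up values, $X$ and $Y$ are drawn from a normal distribution purely for convenience, and $μ_X$ is treated as known.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population parameters, chosen only for illustration.
mu_x, mu_y = 5.0, 2.0
rho = 0.8                      # true correlation between X and Y
n_users, n_reps = 500, 2_000   # users per sample, number of repeated samples

# Each row is one simulated sample of n_users (X, Y) pairs with correlation rho.
x = rng.normal(mu_x, 1.0, size=(n_reps, n_users))
noise = rng.normal(0.0, 1.0, size=(n_reps, n_users))
y = mu_y + rho * (x - mu_x) + np.sqrt(1 - rho**2) * noise

# Variance-minimizing constant c = Cov(Y, X) / Var(X), estimated from the pooled data.
c = np.cov(x.ravel(), y.ravel())[0, 1] / x.var(ddof=1)
r = np.corrcoef(x.ravel(), y.ravel())[0, 1]

# Plain sample mean vs. control-variate estimator, one value per simulated sample.
y_bar = y.mean(axis=1)
x_bar = x.mean(axis=1)
y_cv = y_bar - c * x_bar + c * mu_x

variance_ratio = y_cv.var() / y_bar.var()
print(f"mean of Y_bar: {y_bar.mean():.3f}, mean of Y_cv: {y_cv.mean():.3f}")
print(f"Var(Y_cv) / Var(Y_bar): {variance_ratio:.3f}, 1 - r^2: {1 - r**2:.3f}")
```

Both estimators should average out close to $μ_Y$, and the variance ratio should land close to $1 - r^2$, up to simulation noise.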

## Conclusion

Huzzah! We have an estimator $\hat{Y}_{cv}$ that:
1. is an unbiased estimator of $μ_Y$
2. has lower variance than $\bar{Y}$ (especially when $X$ is highly correlated with $Y$!)

It looks like this:

![](cuped3.png)

We have thus reduced the variance of our estimate, and with it the experiment runtime.

## Further reading

- [Knowledge Post](https://knowledge.d.musta.ch/post/projects/experimentation/control_variates_intro.kp)
- [Microsoft paper](http://www.exp-platform.com/Documents/2013-02-CUPED-ImprovingSensitivityOfControlledExperiments.pdf) (see Section 3.2.1)