# CUPED

CUPED (Controlled-experiment Using Pre-Experiment Data) applies to experiments where a treatment is randomly assigned to participants and we have pre-experiment data on them, such as a pre-treatment measurement of the outcome.

Treatment - a new product category for users.

We will test the following hypotheses:

$H_0$ - There is no difference in lifetime value (LTV) between the treatment and control groups.

$H_a$ - There is a difference in LTV between the treatment and control groups.

## Data

We will use a data-generating process (DGP) from Causalis. Read more at https://causalis.causalcraft.com/articles/make_cuped_tweedie_26

In [1]:
from causalis.scenarios.cuped.dgp import make_cuped_tweedie_26
from causalis.data_contracts import CausalData

data = make_cuped_tweedie_26(return_causal_data=False, include_oracle=True)
data.head()


| | y | d | tenure_months | avg_sessions_week | spend_last_month | discount_rate | platform_ios | platform_web | m | m_obs | tau_link | g0 | g1 | cate | y_pre |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 14.187461 | 2.0 | 57.3553 | 0.158164 | 1.0 | 0.0 | 0.5 | 0.5 | 0.054645 | 3.694528 | 3.902033 | 0.207505 | 0.0 |
| 1 | 0.0 | 1.0 | 6.352893 | 3.0 | 46.700946 | 0.085722 | 0.0 | 0.0 | 0.5 | 0.5 | 0.016201 | 3.694528 | 3.75487 | 0.060342 | 0.0 |
| 2 | 12.91891 | 0.0 | 18.910153 | 9.0 | 80.136187 | 0.175115 | 1.0 | 0.0 | 0.5 | 0.5 | 0.188082 | 3.694528 | 4.459044 | 0.764516 | 219.374863 |
| 3 | 13.183865 | 1.0 | 7.927627 | 4.0 | 33.718224 | 0.152718 | 1.0 | 0.0 | 0.5 | 0.5 | 0.034502 | 3.694528 | 3.824221 | 0.129693 | 0.0 |
| 4 | 0.0 | 0.0 | 11.106925 | 2.0 | 92.064518 | 0.07739 | 0.0 | 0.0 | 0.5 | 0.5 | 0.029492 | 3.694528 | 3.805111 | 0.110583 | 0.0 |


In [2]:
print(f"Ground truth ATE is {data['cate'].mean()}")
print(f"Ground truth ATTE is {data[data['d'] == 1]['cate'].mean()}")

Ground truth ATE is 0.2700428728660554
Ground truth ATTE is 0.27090216369126935


In [3]:
causaldata = CausalData(df = data,
                        treatment='d',
                        outcome='y',
                        confounders=['tenure_months', 'avg_sessions_week', 'spend_last_month', 'discount_rate', 'platform_ios', 'platform_web', 'y_pre'])
causaldata

CausalData(df=(100000, 9), treatment='d', outcome='y', confounders=['tenure_months', 'avg_sessions_week', 'spend_last_month', 'discount_rate', 'platform_ios', 'platform_web', 'y_pre'])

In [4]:
from causalis.shared import outcome_stats
outcome_stats(causaldata)

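| | treatment | count | mean | std | min | p10 | p25 | median | p75 | p90 | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 49868 | 8.49683 | 20.907555 | 0.0 | 0.0 | 0.0 | 0.0 | 8.903005 | 23.685134 | 637.127367 |
| 1 | 1.0 | 50132 | 9.057308 | 21.789248 | 0.0 | 0.0 | 0.0 | 0.0 | 9.535612 | 25.236609 | 764.333725 |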


In [5]:
from causalis.shared import outcome_plots
outcome_plots(causaldata)

(<Figure size 1540x880 with 1 Axes>, <Figure size 1540x880 with 1 Axes>)

## Monitoring

An upstream system randomly splits users into two equal groups: half receive the treatment and the other half do not. We should check the split with a sample ratio mismatch (SRM) test. Read more at https://causalis.causalcraft.com/articles/srm

In [6]:
from causalis.shared import check_srm

check_srm(assignments=causaldata, target_allocation={0: 0.5, 1: 0.5}, alpha=0.001)

SRMResult(status=no SRM, p_value=0.40381, chi2=0.6970)
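
The chi-square statistic behind an SRM check can be reproduced by hand from the group sizes reported by `outcome_stats` (49,868 control, 50,132 treated). The sketch below is a minimal, standard-library illustration and not the `check_srm` implementation; the helper name `srm_chi2` is hypothetical, and the p-value shortcut assumes exactly two groups (one degree of freedom).

```python
import math


def srm_chi2(observed, target_allocation):
    """Pearson chi-square goodness-of-fit test for a two-group split.

    Hypothetical helper for illustration; the p-value formula is only
    valid for two groups (one degree of freedom).
    """
    total = sum(observed.values())
    chi2 = 0.0
    for group, count in observed.items():
        expected = total * target_allocation[group]
        chi2 += (count - expected) ** 2 / expected
    # With one degree of freedom, the chi-square survival function
    # reduces to erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Group sizes taken from the outcome_stats table.
observed = {0: 49868, 1: 50132}
chi2, p_value = srm_chi2(observed, {0: 0.5, 1: 0.5})
print(f"chi2={chi2:.4f}, p_value={p_value:.5f}")
```

The result should agree with the `SRMResult` above up to rounding. Because the p-value is far above `alpha=0.001`, the observed split is consistent with the 50/50 target.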

In [7]:
from causalis.shared import confounders_balance

confounders_balance(causaldata)

| | confounders | mean_d_0 | mean_d_1 | abs_diff | smd | ks_pvalue |
|---|---|---|---|---|---|---|
| 0 | y_pre | 15420.530166 | 51673.005274 | 36252.475108 | 0.008842 | 0.22562 |
| 1 | tenure_months | 13.756301 | 13.7944 | 0.038099 | 0.005198 | 0.56587 |
| 2 | platform_ios | 0.298909 | 0.300706 | 0.001797 | 0.003922 | 1.0 |
| 3 | platform_web | 0.051215 | 0.050387 | 0.000828 | -0.003772 | 1.0 |
| 4 | avg_sessions_week | 4.995969 | 5.00361 | 0.007641 | 0.001816 | 0.88605 |
| 5 | spend_last_month | 75.17656 | 75.263792 | 0.087232 | 0.001222 | 0.22678 |
| 6 | discount_rate | 0.100197 | 0.100129 | 6.8e-05 | -0.001031 | 0.64488 |


## Inference

### Math of CUPEDModel

The `CUPEDModel` implements the Lin (2013) "interacted adjustment" for ATE (Average Treatment Effect) estimation in randomized controlled trials (RCTs). This method is a robust version of ANCOVA that remains valid even when the treatment effect is heterogeneous with respect to the covariates.


#### 1. Specification
The model fits an Ordinary Least Squares (OLS) regression of the outcome $Y$ on the treatment indicator $D$ and centered pre-treatment covariates $X^c$. The specification includes full interactions between the treatment and the centered covariates:

$$Y_i = \alpha + \tau D_i + \beta^T X_i^c + \gamma^T (D_i \cdot X_i^c) + \epsilon_i$$

Where:
- $Y_i$: Outcome for individual $i$.
- $D_i$: Binary treatment indicator ($D_i \in \{0, 1\}$).
- $X_i$: Vector of pre-treatment covariates.
- $X_i^c = X_i - \bar{X}$: Centered covariates (where $\bar{X}$ is the sample mean).
- $\alpha$: Intercept (represents the mean outcome of the control group when $X = \bar{X}$).
- $\tau$: **Average Treatment Effect (ATE)** or Intent-to-Treat (ITT) effect.
- $\beta$: Vector of coefficients for the main effects of the covariates.
- $\gamma$: Vector of coefficients for the interaction terms between treatment and covariates.
- $\epsilon_i$: Residual error term.
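
To make the specification concrete, the sketch below builds the design matrix $[1, D, X^c, D \cdot X^c]$ by hand and solves it with NumPy least squares. It is an illustration under stated assumptions, not the `CUPEDModel` internals: the synthetic data, the single covariate, and every numeric constant are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Synthetic RCT (all numbers are illustrative assumptions).
x = rng.normal(loc=2.0, scale=1.0, size=n)    # pre-treatment covariate
d = rng.integers(0, 2, size=n).astype(float)  # random 50/50 assignment
true_ate = 0.5
# The effect varies with x but averages to true_ate because E[x] = 2.
y = 1.0 + 2.0 * x + d * (true_ate + 0.3 * (x - 2.0)) + rng.normal(size=n)

# Center the covariate at its sample mean.
xc = x - x.mean()

# Design matrix columns: intercept, D, X^c, D * X^c.
design = np.column_stack([np.ones(n), d, xc, d * xc])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
alpha_hat, tau_hat, beta_hat, gamma_hat = coef

print(f"tau_hat={tau_hat:.3f} (true ATE={true_ate})")
print(f"gamma_hat={gamma_hat:.3f} (true interaction slope=0.3)")
```

The coefficient on the `d` column plays the role of $\tau$, and the coefficient on `d * xc` plays the role of $\gamma$.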


#### 2. Why Centering and Interaction?
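- **Centering ($X^c$):** With interactions in the model, the coefficient on $D$ is the treatment effect at the point where the interacted covariates equal zero. Centering moves that point to $X = \bar{X}$, so $\tau$ directly estimates the ATE rather than the effect at $X = 0$.
- **Interactions ($D \cdot X^c$):** Including interactions (as proposed by Lin, 2013) lets the covariate slopes differ between arms, so the adjustment stays well behaved when the true treatment effect varies with $X$ (heterogeneity): asymptotically, the interacted estimator is at least as precise as both the unadjusted difference in means and ANCOVA without interactions. Traditional ANCOVA without interactions remains consistent under randomization, but under heterogeneity with unequal group sizes it can be less efficient than even the unadjusted estimator.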
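
The role of centering can be checked numerically. The following sketch fits the same interacted regression twice, once with the raw covariate and once with the centered one, on synthetic data; the helper `interacted_ols` and all constants are assumptions made for illustration.

```python
import numpy as np


def interacted_ols(y, d, x):
    """Fit Y ~ 1 + D + X + D*X by least squares and return the coefficients."""
    design = np.column_stack([np.ones_like(y), d, x, d * x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef  # order: intercept, D, X, D*X


rng = np.random.default_rng(1)
n = 20_000
x = rng.normal(loc=2.0, scale=1.0, size=n)
d = rng.integers(0, 2, size=n).astype(float)
# The effect is 0.5 at x = 2 (the population mean) and -0.1 at x = 0.
y = 1.0 + 2.0 * x + d * (0.5 + 0.3 * (x - 2.0)) + rng.normal(size=n)

raw = interacted_ols(y, d, x)                  # uncentered covariate
centered = interacted_ols(y, d, x - x.mean())  # centered covariate

tau_raw, tau_centered, gamma = raw[1], centered[1], centered[3]
print(f"coefficient on D, uncentered: {tau_raw:.3f}")       # effect at x = 0
print(f"coefficient on D, centered:   {tau_centered:.3f}")  # ATE estimate
# Both fits span the same column space, so the two coefficients on D
# differ by exactly gamma * mean(x).
print(f"centered - gamma * mean(x):   {tau_centered - gamma * x.mean():.3f}")
```

In this setup the uncentered coefficient lands near the effect at $x = 0$, while the centered one lands near the ATE, which is the quantity the experiment is meant to estimate.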


#### 3. Inference and Variance Reduction
The model uses robust covariance estimators (defaulting to `HC3`) to calculate standard errors, which account for potential heteroscedasticity:
- **Standard Error ($SE$):** Derived from the robust covariance matrix of the OLS fit.
- **Variance Reduction:** The main goal of CUPED is to reduce the variance of the ATE estimate by "soaking up" explainable variation in $Y$ using pre-treatment data $X$. The variance reduction percentage compares the variance of the adjusted estimate with that of a "naive" model ($Y \sim 1 + D$):
  $$\text{Variance Reduction \%} = \left(1 - \frac{Var(\hat{\tau}_{adjusted})}{Var(\hat{\tau}_{naive})}\right) \times 100\%$$
  Note: For this specific metric, non-robust variances are typically used to directly reflect the reduction in residual sum of squares.
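
As a rough illustration of both quantities, the sketch below computes the classical and HC3 covariance matrices by hand with NumPy and derives the variance reduction from the classical ones. The helper `ols`, the synthetic data, and the constants are assumptions for the example; `CUPEDModel` may compute these differently.

```python
import numpy as np


def ols(design, y):
    """Return coefficients plus classical and HC3 covariance matrices."""
    n, k = design.shape
    xtx_inv = np.linalg.inv(design.T @ design)
    coef = xtx_inv @ design.T @ y
    resid = y - design @ coef
    # Classical (homoskedastic) covariance: sigma^2 * (X'X)^-1.
    cov_classical = (resid @ resid) / (n - k) * xtx_inv
    # HC3: scale each residual by 1 / (1 - leverage) before squaring.
    leverage = np.einsum("ij,jk,ik->i", design, xtx_inv, design)
    weights = (resid / (1.0 - leverage)) ** 2
    meat = (design * weights[:, None]).T @ design
    cov_hc3 = xtx_inv @ meat @ xtx_inv
    return coef, cov_classical, cov_hc3


rng = np.random.default_rng(2)
n = 20_000
x = rng.normal(size=n)
d = rng.integers(0, 2, size=n).astype(float)
# By construction, x explains about 80% of the within-arm variance of y.
y = 1.0 + 0.5 * d + 2.0 * x + rng.normal(size=n)

xc = x - x.mean()
naive = np.column_stack([np.ones(n), d])
adjusted = np.column_stack([np.ones(n), d, xc, d * xc])

_, cov_naive, _ = ols(naive, y)
coef_adj, cov_adj, cov_adj_hc3 = ols(adjusted, y)

# Index 1 is the coefficient on D in both design matrices.
variance_reduction = (1.0 - cov_adj[1, 1] / cov_naive[1, 1]) * 100.0
print(f"variance reduction: {variance_reduction:.1f}%")
print(f"classical SE: {np.sqrt(cov_adj[1, 1]):.4f}")
print(f"HC3 SE:       {np.sqrt(cov_adj_hc3[1, 1]):.4f}")
```

Because the simulated noise is homoskedastic, the classical and HC3 standard errors should be close here; they can diverge on skewed outcomes such as LTV.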


#### 4. Relative Effect
The relative treatment effect is calculated as:
$$\tau_{rel} = \frac{\tau}{\mu_c} \times 100\%$$
Where $\mu_c$ is the mean outcome of the control group ($D=0$).
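
As a quick arithmetic check, the formula can be applied to the rounded numbers this notebook reports: the ATE estimate from `result.summary()` and the control-group mean from `outcome_stats`. The variable names are chosen for illustration only.

```python
# Values copied from this notebook's outputs (rounded as displayed).
tau_hat = 0.4631        # ATE estimate from result.summary()
control_mean = 8.49683  # mean outcome for treatment = 0 from outcome_stats

tau_rel = tau_hat / control_mean * 100
print(f"relative effect: {tau_rel:.2f}%")
```

This gives roughly 5.45%, in line with the reported `value_relative` of 5.4497; the small gap comes from using the rounded ATE.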

In [8]:
from causalis.scenarios.cuped.model import CUPEDModel

model = CUPEDModel().fit(causaldata)

In [9]:
result = model.estimate()
result.summary()

| field | value |
|---|---|
| estimand | ATE |
| model | CUPEDModel |
| value | 0.4631 (ci_abs: 0.1950, 0.7311) |
| value_relative | 5.4497 (ci_rel: 2.2953, 8.6042) |
| alpha | 0.0500 |
| p_value | 0.0007 |
| is_significant | True |
| n_treated | 50132 |
| n_control | 49868 |
| treatment_mean | 9.0573 |
