# CUPED and CUPAC Tutorial

This tutorial demonstrates variance reduction techniques for A/B testing using covariate adjustment methods available in HypEx.

**CUPED** (Controlled-experiment Using Pre-Experiment Data) reduces the variance of a target metric by regressing it on a pre-experiment (historical) feature and subtracting the part that feature explains.

**CUPAC** (Control Using Predictions As Covariates) extends CUPED: a regression model (linear, ridge, lasso, or CatBoost) is fit on several pre-experiment covariates, and its prediction serves as the adjustment covariate.

Both methods help you:
- Detect smaller effects with the same sample size
- Reduce sample size needed to detect the same effect
- Increase statistical power of your experiments
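
For reference, the usual textbook form of the adjustment behind both methods is sketched below. The symbols are notation introduced here rather than HypEx names: $Y$ is the target, $X$ the pre-experiment covariate, and $\rho$ their correlation. HypEx's exact estimator may differ in detail. For CUPAC, $X$ is replaced by a model prediction built from several covariates.

```latex
\begin{aligned}
Y^{\text{cuped}}_i &= Y_i - \theta\,(X_i - \bar{X}),
&\qquad \theta &= \frac{\operatorname{Cov}(Y, X)}{\operatorname{Var}(X)}, \\
\operatorname{Var}\!\left(Y^{\text{cuped}}\right) &= \left(1 - \rho^{2}\right)\operatorname{Var}(Y),
&\qquad \rho &= \operatorname{Corr}(Y, X).
\end{aligned}
```

Because $X$ is measured before treatment, subtracting $\theta\,(X_i - \bar{X})$ leaves the expected treatment effect unchanged while removing a share $\rho^2$ of the variance.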

## Table of Contents
<ul>
  <li><a href="#data-preparation">Data Preparation</a></li>
  <li><a href="#baseline-ab-test">Baseline AB Test</a></li>
  <li><a href="#cuped-implementation">CUPED Implementation</a></li>
  <li><a href="#cupac-implementation">CUPAC Implementation</a></li>
  <li><a href="#best-practices">Best Practices</a></li>
</ul>

## Data Preparation

For CUPED and CUPAC to work, we need historical data that is correlated with our target metric but unaffected by treatment. Let's generate synthetic data with historical features:

In [12]:
from hypex import ABTest
from hypex.dataset import Dataset, InfoRole, TargetRole, TreatmentRole
from hypex.utils.tutorial_data_creation import DataGenerator

In [13]:
# Generate synthetic data with historical features
gen = DataGenerator(
    n_samples=2000,
    distributions={
        "X1": {"type": "normal", "mean": 0, "std": 1},
        "X2": {"type": "bernoulli", "p": 0.5},
        "y0": {"type": "normal", "mean": 5, "std": 1},
    },
    time_correlations={"X1": 0.2, "X2": 0.1, "y0": 0.6},
    effect_size=2.0,
    seed=42
)

df = gen.generate()
# Drop unnecessary columns
df = df.drop(columns=['y0', 'z', 'U', 'D', 'y1', 'y0_lag_2'])

print("Generated columns:", df.columns.tolist())
print("Dataset shape:", df.shape)
df.head()

Generated columns: ['d', 'X1', 'X1_lag', 'X2', 'X2_lag', 'y0_lag_1', 'y']
Dataset shape: (2000, 7)


| | d | X1 | X1_lag | X2 | X2_lag | y0_lag_1 | y |
|---|---|---|---|---|---|---|---|
| 0 | 0 | -0.656927 | -0.674978 | 0 | 0 | 3.895067 | 4.752407 |
| 1 | 0 | -1.00463 | -0.880887 | 1 | 0 | 3.068464 | 4.246357 |
| 2 | 1 | -1.097898 | -0.030517 | 1 | 1 | 4.229747 | 9.546954 |
| 3 | 1 | -0.22364 | 0.350105 | 1 | 0 | 6.413378 | 5.073932 |
| 4 | 0 | 2.107403 | 2.170936 | 1 | 0 | 5.477219 | 3.436856 |


In [14]:
# Create HypEx dataset
data = Dataset(
    roles={
        "d": TreatmentRole(),
        "y": TargetRole(),
    },
    data=df,
    default_role=InfoRole()
)

print("Dataset roles:")
data.roles

Dataset roles:


{'d': Treatment(<class 'int'>),
 'y': Target(<class 'float'>),
 'X1': Info(<class 'float'>),
 'X1_lag': Info(<class 'float'>),
 'X2': Info(<class 'int'>),
 'X2_lag': Info(<class 'int'>),
 'y0_lag_1': Info(<class 'float'>)}

## Baseline AB Test

First, let's run a standard A/B test without any variance reduction to establish a baseline:

In [15]:
# Standard AB test without covariate adjustment
test_baseline = ABTest()
result_baseline = test_baseline.execute(data)

result_baseline.resume

| | feature | group | control mean | test mean | difference | difference % | TTest pass | TTest p-value |
|---|---|---|---|---|---|---|---|---|
| 0 | y | 1 | 4.737214 | 7.961045 | 3.223832 | 68.05333 | OK | 1.012032e-184 |


In [16]:
result_baseline.sizes

| | control size | test size | control size % | test size % | group |
|---|---|---|---|---|---|
| 1 | 1328 | 672 | 66.4 | 33.6 | 1 |
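
The `difference` and `difference %` columns in the baseline result follow directly from the two group means (3.223832 / 4.737214 ≈ 68.05%). As a rough, library-independent sketch of such a summary, the snippet below computes the same quantities plus a two-sided p-value from Welch's unequal-variance statistic with a normal approximation. The function name `ab_summary` and the toy data are hypothetical, and this is not HypEx's implementation, whose t-test may differ in detail.

```python
from statistics import NormalDist, mean, variance


def ab_summary(control, test):
    """Summarise a two-group comparison (illustrative, not HypEx's code)."""
    c_mean, t_mean = mean(control), mean(test)
    diff = t_mean - c_mean
    # Welch standard error: each group contributes its own variance.
    se = (variance(control) / len(control) + variance(test) / len(test)) ** 0.5
    z = diff / se
    # Normal approximation to the t distribution; adequate for large samples.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return {
        "control mean": c_mean,
        "test mean": t_mean,
        "difference": diff,
        "difference %": 100 * diff / c_mean,
        "p-value": p_value,
    }


# Hypothetical toy groups, only to exercise the function.
summary = ab_summary([4.0, 5.0, 6.0, 5.0], [7.0, 8.0, 9.0, 8.0])
print(summary)
```

The standard error is the quantity that CUPED and CUPAC shrink: a smaller outcome variance gives a smaller `se` and therefore a sharper test for the same data.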


## CUPED Implementation

CUPED adjusts the target variable with a single historical feature. In HypEx, pass `cuped_features` as a mapping from the target column to its pre-experiment covariate:

In [17]:
# CUPED with single covariate
test_cuped = ABTest(cuped_features={'y': 'y0_lag_1'})
result_cuped = test_cuped.execute(data)

result_cuped.resume

| | feature | group | control mean | test mean | difference | difference % | TTest pass | TTest p-value |
|---|---|---|---|---|---|---|---|---|
| 0 | y | 1 | 4.737214 | 7.961045 | 3.223832 | 68.05333 | OK | 1.012032e-184 |
| 1 | y_cuped | 1 | 4.851992 | 7.734221 | 2.882229 | 59.403002 | OK | 1.740353e-213 |


In [18]:
# Check variance reduction achieved by CUPED
result_cuped.variance_reduction_report

| | Transformed Metric Name | Variance Reduction (%) |
|---|---|---|
| 0 | y_cuped | 28.79786 |
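
To make the mechanics concrete, here is a minimal from-scratch sketch of the CUPED adjustment using only the Python standard library. The data, the helper names (`covariance`, `cuped_adjust`) and the coefficients are all hypothetical; the snippet illustrates the textbook transform, not HypEx's internal code.

```python
import random
from statistics import mean, variance


def covariance(xs, ys):
    """Sample covariance of two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


def cuped_adjust(y, x):
    """Return y - theta * (x - mean(x)) with theta = cov(y, x) / var(x)."""
    theta = covariance(y, x) / variance(x)
    mx = mean(x)
    return [yi - theta * (xi - mx) for yi, xi in zip(y, x)]


rng = random.Random(0)
# Hypothetical data: x is a pre-experiment metric, y the in-experiment metric.
x = [rng.gauss(5, 1) for _ in range(5000)]
y = [0.6 * xi + rng.gauss(2, 0.8) for xi in x]

y_adj = cuped_adjust(y, x)
reduction = 1 - variance(y_adj) / variance(y)
rho = covariance(y, x) / (variance(x) * variance(y)) ** 0.5
print(f"variance reduction: {reduction:.3f}, rho^2: {rho ** 2:.3f}")
```

With `theta` estimated this way, the in-sample variance reduction equals the squared correlation between target and covariate, and the mean of the metric is unchanged. If HypEx's report is computed the same way (an assumption), the 28.8% reported for `y_cuped` would correspond to a correlation of roughly 0.54 between `y` and `y0_lag_1`.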


## CUPAC Implementation

CUPAC allows multiple covariates and different regression models. Let's try different CUPAC configurations:

In [19]:
# CUPAC with multiple covariates using linear regression
test_cupac_linear = ABTest(
    cupac_features={'y': ['y0_lag_1', 'X1_lag', 'X2_lag']},
    cupac_model='linear'
)
result_cupac_linear = test_cupac_linear.execute(data)

result_cupac_linear.resume

| | feature | group | control mean | test mean | difference | difference % | TTest pass | TTest p-value |
|---|---|---|---|---|---|---|---|---|
| 0 | y | 1 | 4.737214 | 7.961045 | 3.223832 | 68.05333 | OK | 1.012032e-184 |
| 1 | y_cupac | 1 | 4.977168 | 7.486851 | 2.509683 | 50.423924 | OK | 8.271012e-182 |


In [20]:
# CUPAC with ridge regression
test_cupac_ridge = ABTest(
    cupac_features={'y': ['y0_lag_1', 'X1_lag', 'X2_lag']},
    cupac_model='ridge'
)
result_cupac_ridge = test_cupac_ridge.execute(data)

result_cupac_ridge.resume

| | feature | group | control mean | test mean | difference | difference % | TTest pass | TTest p-value |
|---|---|---|---|---|---|---|---|---|
| 0 | y | 1 | 4.737214 | 7.961045 | 3.223832 | 68.05333 | OK | 1.012032e-184 |
| 1 | y_cupac | 1 | 4.977124 | 7.486938 | 2.509814 | 50.426999 | OK | 7.840533e-182 |


In [21]:
# CUPAC with automatic model selection
test_cupac_auto = ABTest(
    cupac_features={'y': ['y0_lag_1', 'X1_lag', 'X2_lag']},
    cupac_model=['linear', 'ridge', 'lasso']  # Will select best performing
)
result_cupac_auto = test_cupac_auto.execute(data)

result_cupac_auto.resume

| | feature | group | control mean | test mean | difference | difference % | TTest pass | TTest p-value |
|---|---|---|---|---|---|---|---|---|
| 0 | y | 1 | 4.737214 | 7.961045 | 3.223832 | 68.05333 | OK | 1.012032e-184 |
| 1 | y_cupac | 1 | 4.977124 | 7.486938 | 2.509814 | 50.426999 | OK | 7.840533e-182 |


In [22]:
# Check variance reduction for the ridge CUPAC model
result_cupac_ridge.variance_reduction_report

| | Transformed Metric Name | Variance Reduction (%) |
|---|---|---|
| 0 | y_cupac | 38.607063 |
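
The gain over single-covariate CUPED (28.8% versus 38.6% in the reports above) comes from letting a model combine several covariates into one predictor. The sketch below illustrates that idea with an ordinary least-squares fit written against the standard library only. All names (`fit_ols`, `predict`, `adjust`) and the synthetic data are hypothetical. For brevity the model is fit on the same rows it adjusts, whereas in practice the model is usually trained on data from before the experiment. This is not HypEx's implementation.

```python
import random
from statistics import mean, variance


def cov(a, b):
    """Sample covariance."""
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (len(a) - 1)


def solve(a, b):
    """Solve the square system a @ w = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [v - f * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_ols(rows, y):
    """Least-squares weights for y ~ 1 + covariates via the normal equations."""
    design = [[1.0] + list(r) for r in rows]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    return solve(xtx, xty)


def predict(weights, rows):
    return [weights[0] + sum(w * v for w, v in zip(weights[1:], r)) for r in rows]


def adjust(y, covariate):
    """CUPED-style adjustment of y by one covariate; returns (adjusted, theta)."""
    theta = cov(y, covariate) / variance(covariate)
    m = mean(covariate)
    return [yi - theta * (ci - m) for yi, ci in zip(y, covariate)], theta


rng = random.Random(1)
# Hypothetical pre-experiment covariates: lagged target, lagged X1, lagged X2.
rows = [(rng.gauss(5, 1), rng.gauss(0, 1), float(rng.random() < 0.5))
        for _ in range(4000)]
y = [0.6 * a + 0.4 * b + 0.3 * c + rng.gauss(0, 0.8) for a, b, c in rows]

weights = fit_ols(rows, y)
y_cupac, theta = adjust(y, predict(weights, rows))   # prediction as covariate
y_cuped, _ = adjust(y, [r[0] for r in rows])         # lagged target only

reduction_cupac = 1 - variance(y_cupac) / variance(y)
reduction_cuped = 1 - variance(y_cuped) / variance(y)
print(f"CUPED: {reduction_cuped:.3f}  CUPAC: {reduction_cupac:.3f}")
```

With an in-sample linear fit, the CUPAC reduction is simply the regression's R², so it can never fall below the single-covariate CUPED figure. Regularized or non-linear models trade some of that in-sample fit for robustness.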


## Best Practices

### When to use CUPED:
- You have **one strong covariate** (correlation > 0.5 with target)
- Need **simple, interpretable** results
- **Small to medium** sample sizes
- **Linear relationship** between covariate and outcome
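
The 0.5 correlation rule of thumb can be translated into expected gains: under the standard CUPED result, a covariate with correlation `rho` removes a fraction `rho**2` of the variance, and the sample size needed for the same power shrinks by the same fraction. The helper names below are hypothetical and the snippet is only a back-of-the-envelope sketch.

```python
def variance_reduction(rho):
    """Fraction of variance removed by CUPED for covariate correlation rho."""
    return rho ** 2


def relative_sample_size(rho):
    """Sample size needed relative to an unadjusted test of equal power."""
    return 1 - rho ** 2


for rho in (0.3, 0.5, 0.7, 0.9):
    print(f"rho={rho:.1f}  variance reduction={variance_reduction(rho):.0%}  "
          f"relative sample size={relative_sample_size(rho):.0%}")
```

A correlation of 0.5 therefore buys about a 25% variance reduction, while weak covariates (0.3 or below) remove less than 10%.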

### When to use CUPAC:
- You have **multiple useful covariates**
- Want **maximum variance reduction**
- **Large sample sizes** available
- May have **non-linear relationships**
- Need **automatic feature selection** (for example, via lasso)

### Important considerations:
1. **Pre-treatment data only**: Covariates must be measured before treatment assignment
2. **Balance check**: Ensure covariates are balanced across treatment groups
3. **Missing data**: Both methods require complete covariate data
4. **Model validation**: Always compare against baseline and simpler methods
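
The balance and missing-data checks can be sketched in a few lines. The example below computes a standardized mean difference (SMD) for one covariate and counts missing entries. The function names, the toy values, and the 0.1 threshold (a commonly used rule of thumb, not a HypEx setting) are assumptions for illustration.

```python
import math
from statistics import mean, variance


def standardized_mean_difference(control, test):
    """Difference in covariate means in units of the pooled standard deviation."""
    pooled_sd = math.sqrt((variance(control) + variance(test)) / 2)
    return (mean(test) - mean(control)) / pooled_sd


def count_missing(values):
    """Count entries that are None or NaN."""
    return sum(
        v is None or (isinstance(v, float) and math.isnan(v)) for v in values
    )


# Hypothetical pre-experiment covariate values per group.
control_cov = [3.9, 3.1, 5.5, 4.8, 5.2, 4.4]
test_cov = [4.2, 6.4, 4.9, 5.1, 4.0, 4.6]

smd = standardized_mean_difference(control_cov, test_cov)
flag = "check randomization" if abs(smd) > 0.1 else "balanced"
print(f"SMD = {smd:.3f} ({flag})")

# Hypothetical raw column with gaps that must be handled before adjustment.
raw_column = [3.9, None, 5.5, float("nan"), 5.2]
print("missing values:", count_missing(raw_column))
```

Running such checks on every covariate before fitting CUPED or CUPAC makes imbalance or gaps visible before they can distort the adjusted estimate.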