# A/A Testing

Before trusting a measurement model with real treatment effects, we need to confirm it behaves correctly when there is **no effect**. An A/A test assigns treatment labels at random without any actual intervention, so the true treatment effect is 0 by construction. Any model applied to this data should return an estimate close to 0, up to sampling noise.

This notebook answers two questions:

1. **Model swappability** — Given the same A/A data, do different cross-sectional models all produce estimates near 0?
2. **Sampling variability** — Is a single estimate reliable, or do we need multiple replications to separate bias from noise?

We use a single-day simulation with random treatment labels assigned to 50% of products. The `quality_boost` is set to 0, so there is no real intervention.
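To make the idea concrete, here is a minimal standalone sketch of an A/A test that does not use the simulator or the measurement library. The function name, the gamma-distributed revenue, and the seed are assumptions chosen for illustration only; the point is that outcomes are generated without reference to the label, so a difference in means should land near 0.

```python
import numpy as np


def aa_difference_in_means(n_units=20_000, treated_fraction=0.5, seed=0):
    """Simulate one A/A test and return the difference in mean outcomes."""
    rng = np.random.default_rng(seed)
    # Outcomes are drawn without any reference to the treatment label.
    revenue = rng.gamma(shape=2.0, scale=50.0, size=n_units)
    # Labels are assigned at random; nothing is done to the "treated" units.
    treated = rng.random(n_units) < treated_fraction
    return revenue[treated].mean() - revenue[~treated].mean()


estimate = aa_difference_in_means()
print(f"A/A difference in means: {estimate:.4f}")
```

With these illustrative settings the standard error of the difference is roughly 1, so a value within a few units of 0 is what an unbiased estimator would be expected to return.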

## Setup

In [None]:
import copy
import os
from pathlib import Path

import numpy as np
import pandas as pd
import yaml
from impact_engine_measure import evaluate_impact, load_results, parse_config_file
from online_retail_simulator import simulate

In [None]:
# Configurable via environment variables for CI (reduced values speed up execution)
NUM_PRODUCTS = int(os.environ.get("IE_DEMO_NUM_PRODUCTS", 20000))
N_REPS = int(os.environ.get("IE_DEMO_N_REPS", 10))

output_path = Path("output/demo_aa_testing")
output_path.mkdir(parents=True, exist_ok=True)

## Shared Data

All models will use the same product catalog.

In [None]:
with open("configs/demo_model_selection_catalog.yaml") as f:
    catalog_config = yaml.safe_load(f)
catalog_config["RULE"]["PRODUCTS"]["PARAMS"]["num_products"] = NUM_PRODUCTS

tmp_catalog = output_path / "catalog_config.yaml"
with open(tmp_catalog, "w") as f:
    yaml.dump(catalog_config, f, default_flow_style=False)

catalog_job = simulate(str(tmp_catalog), job_id="catalog")
products = catalog_job.load_df("products")

print(f"Generated {len(products)} products")
products.head()

## Configuration

Treatment assignment is controlled by the config's `DATA.ENRICHMENT` section. The `product_detail_boost` function randomly assigns 50% of products to treatment (`enrichment_fraction: 0.5`). Because `quality_boost` is `0.0`, the enrichment changes nothing: this is an A/A test, and the true treatment effect is 0 by construction.
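Since the whole notebook depends on `quality_boost` being 0, it may be worth guarding that assumption explicitly before running any model. The sketch below is hypothetical: `assert_aa_config` is not part of any library, and it assumes that `enrichment_fraction` and `quality_boost` sit under `DATA.ENRICHMENT.PARAMS` (only the `seed` key is named at that path in this notebook). The example dictionary and its seed value are made up for illustration.

```python
def assert_aa_config(config):
    """Raise ValueError if the enrichment settings imply a real intervention."""
    # Assumed layout: DATA -> ENRICHMENT -> PARAMS (hypothetical for these keys).
    params = config["DATA"]["ENRICHMENT"]["PARAMS"]
    if params["quality_boost"] != 0:
        raise ValueError(
            f"Not an A/A design: quality_boost={params['quality_boost']}"
        )
    if not 0 < params["enrichment_fraction"] < 1:
        raise ValueError("enrichment_fraction must leave both groups non-empty")
    return params


# Hypothetical config fragment, not loaded from the real YAML file.
example_config = {
    "DATA": {
        "ENRICHMENT": {
            "PARAMS": {"enrichment_fraction": 0.5, "quality_boost": 0.0, "seed": 42}
        }
    }
}
print(assert_aa_config(example_config))
```

A check like this could be applied to the parsed base config, provided the real file follows the assumed layout.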

In [None]:
config_path = "configs/demo_model_selection.yaml"
true_te = 0  # A/A design: no treatment effect by construction

In [None]:
def run_with_override(base_config, measurement_override, storage_url, job_id, source_seed=None):
    """Override MEASUREMENT in base config, write temp YAML, run evaluate_impact().

    Optionally override the data-generating seed for Monte Carlo replications.
    """
    config = copy.deepcopy(base_config)
    config["MEASUREMENT"] = measurement_override
    if source_seed is not None:
        config["DATA"]["SOURCE"]["CONFIG"]["seed"] = source_seed

    tmp_config_path = Path(storage_url) / f"config_{job_id}.yaml"
    tmp_config_path.parent.mkdir(parents=True, exist_ok=True)
    with open(tmp_config_path, "w") as f:
        yaml.dump(config, f, default_flow_style=False)

    job_info = evaluate_impact(str(tmp_config_path), storage_url, job_id=job_id)
    result = load_results(job_info)
    return result.impact_results

In [None]:
base_config = parse_config_file(config_path)

model_overrides = {
    "Experiment (OLS)": {
        "MODEL": "experiment",
        "PARAMS": {"formula": "revenue ~ enriched + price"},
    },
    "Subclassification": {
        "MODEL": "subclassification",
        "PARAMS": {
            "treatment_column": "enriched",
            "covariate_columns": ["price"],
            "n_strata": 5,
            "estimand": "att",
            "dependent_variable": "revenue",
        },
    },
    "Nearest Neighbour Matching": {
        "MODEL": "nearest_neighbour_matching",
        "PARAMS": {
            "treatment_column": "enriched",
            "covariate_columns": ["price"],
            "dependent_variable": "revenue",
            "caliper": 0.2,
            "replace": True,
            "ratio": 1,
        },
    },
}


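def extract_te(results):
    """Extract the treatment effect from model results regardless of model type."""
    estimates = results["data"]["impact_estimates"]
    model_type = results["model_type"]
    if model_type == "experiment":
        params = estimates["params"]
        # Fail loudly if the coefficient is missing: a silent default of 0
        # would look like a perfect estimate in an A/A test.
        for key in ("enriched[T.True]", "enriched"):
            if key in params:
                return params[key]
        raise KeyError(f"No 'enriched' coefficient in params: {list(params.keys())}")
    elif model_type == "nearest_neighbour_matching":
        return estimates["att"]
    else:
        return estimates["treatment_effect"]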

## Part 1: Model Swappability

We load one base config and override `MEASUREMENT` for each model.
Each iteration writes a temporary YAML and calls `evaluate_impact()`.

In [None]:
model_results = {}
model_estimates = {}

for name, measurement in model_overrides.items():
    job_id = measurement["MODEL"]
    results = run_with_override(base_config, measurement, str(output_path), job_id)
    model_results[name] = results
    model_estimates[name] = extract_te(results)
    print(f"{name}: treatment effect = {model_estimates[name]:.4f}")

print(f"\nTrue effect: {true_te:.4f}")

In [None]:
comparison = pd.DataFrame(
    [
        {
            "Model": name,
            "Estimate": est,
            "True Effect": true_te,
            "Absolute Error": abs(est - true_te),
        }
        for name, est in model_estimates.items()
    ]
)

print("=" * 80)
print("CROSS-SECTIONAL MODEL COMPARISON (A/A)")
print("=" * 80)
print(comparison.to_string(index=False, float_format=lambda x: f"{x:.4f}"))
print("=" * 80)

In [None]:
from notebook_support import plot_model_comparison

plot_model_comparison(
    model_names=list(model_estimates.keys()),
    estimates=list(model_estimates.values()),
    true_effect=true_te,
    ylabel="Treatment Effect",
    title="A/A Test: Model Estimates (True Effect = 0)",
)

## Part 2: Monte Carlo Model Comparison

Part 1 used a single random seed, making it impossible to distinguish systematic bias from sampling noise. Here we run all three models across multiple replications with varying outcome seeds to obtain sampling distributions. In the A/A setting, all models should be centered around 0.

**Design**: We vary `DATA.SOURCE.CONFIG.seed` (outcome noise) while keeping `DATA.ENRICHMENT.PARAMS.seed` fixed (same treatment assignment). The spread across replications therefore reflects outcome noise only, conditional on one fixed treatment assignment, and does not include variability from re-randomizing the labels.
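The same design can be sketched in plain NumPy, independent of the simulator. Everything here is an illustrative assumption (the seeds, the normal outcome model, and a difference-in-means estimator standing in for the real models): the labels are drawn once from a fixed seed, and only the outcomes are redrawn in each replication.

```python
import numpy as np

N_UNITS = 5_000
ASSIGNMENT_SEED = 7  # hypothetical; plays the role of the fixed enrichment seed

# Treatment labels are drawn once and reused in every replication.
treated = np.random.default_rng(ASSIGNMENT_SEED).random(N_UNITS) < 0.5


def replicate(outcome_seed):
    """Redraw outcomes only; the treatment labels stay fixed."""
    rng = np.random.default_rng(outcome_seed)
    revenue = rng.normal(loc=100.0, scale=20.0, size=N_UNITS)
    return revenue[treated].mean() - revenue[~treated].mean()


# Outcome seeds are themselves generated from a seeded generator for reproducibility.
outcome_seeds = np.random.default_rng(2024).integers(0, 2**31, size=200).tolist()
estimates = np.array([replicate(s) for s in outcome_seeds])
print(f"mean = {estimates.mean():.4f}, std = {estimates.std(ddof=1):.4f}")
```

Because the labels never change, the standard deviation printed here describes outcome noise for this particular assignment; re-randomizing the labels as well would answer a slightly different question.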

In [None]:
rng = np.random.default_rng(seed=2024)
mc_seeds = rng.integers(low=0, high=2**31, size=N_REPS).tolist()

In [None]:
mc_results = {name: [] for name in model_overrides}

for i, seed in enumerate(mc_seeds):
    for name, measurement in model_overrides.items():
        job_id = f"mc_{measurement['MODEL']}_rep{i}"
        results = run_with_override(
            base_config,
            measurement,
            str(output_path),
            job_id,
            source_seed=seed,
        )
        mc_results[name].append(extract_te(results))

    if (i + 1) % 5 == 0:
        print(f"Completed {i + 1}/{N_REPS} replications")

print(f"Monte Carlo simulation complete: {N_REPS} replications x {len(model_overrides)} models")

In [None]:
mc_summary = pd.DataFrame(
    [
        {
            "Model": name,
            "Mean": np.mean(estimates),
            "Std": np.std(estimates, ddof=1),
            "Bias": np.mean(estimates) - true_te,
            "RMSE": np.sqrt(np.mean([(e - true_te) ** 2 for e in estimates])),
            "Min": np.min(estimates),
            "Max": np.max(estimates),
        }
        for name, estimates in mc_results.items()
    ]
)

print("=" * 90)
print(f"MONTE CARLO MODEL COMPARISON ({N_REPS} replications)")
print("=" * 90)
print(f"True treatment effect: {true_te:.4f}")
print("-" * 90)
print(mc_summary.to_string(index=False, float_format=lambda x: f"{x:.4f}"))
print("=" * 90)

In [None]:
from notebook_support import plot_monte_carlo_distribution

plot_monte_carlo_distribution(
    model_names=list(mc_results.keys()),
    distributions=mc_results,
    true_effect=true_te,
    ylabel="Treatment Effect",
    title=f"A/A Monte Carlo Model Comparison ({N_REPS} replications)",
)

## Key Takeaways

**A/A design.**
- The true treatment effect is 0 by construction (treatment labels are random with no real intervention). This tests whether models correctly report no effect when there is none.

**Model swappability.**
- All three models should produce estimates close to 0 from the same simulated data. A single run can only be consistent with correct behavior under the null; confirming it requires the replications in Part 2.
- Switching models requires only changing the `MEASUREMENT` entry in the config.
- The `evaluate_impact()` interface stays the same regardless of the model.

**Monte Carlo analysis.**
- Running multiple replications reveals the sampling distribution of each estimator, separating bias from variance.
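- Under the A/A design, all three models should be centered around 0. A mean that deviates from 0 by more than a few Monte Carlo standard errors (`Std / sqrt(N_REPS)`) indicates bias; smaller deviations are compatible with noise.
- With the default `N_REPS=10`, the distributions are informative but coarse. For publication-quality analysis, increase to `N_REPS >= 500`.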
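
To judge whether an observed bias is distinguishable from noise, one option is to compare it with its Monte Carlo standard error. The helper below is a sketch and not part of `impact_engine_measure` or `notebook_support`; the name `bias_with_mcse` is hypothetical, and the synthetic normal draws stand in for real replicate estimates.

```python
import numpy as np


def bias_with_mcse(estimates, true_effect=0.0):
    """Return (bias, Monte Carlo standard error, bias / MCSE) for replicate estimates."""
    estimates = np.asarray(estimates, dtype=float)
    bias = estimates.mean() - true_effect
    # Standard error of the mean across replications.
    mcse = estimates.std(ddof=1) / np.sqrt(len(estimates))
    return bias, mcse, bias / mcse


# Synthetic unbiased estimates, used only to show how the MCSE shrinks with N_REPS.
rng = np.random.default_rng(0)
for n_reps in (10, 500):
    draws = rng.normal(loc=0.0, scale=1.0, size=n_reps)
    bias, mcse, z = bias_with_mcse(draws)
    print(f"N_REPS={n_reps:>3}: bias={bias:+.3f}, MCSE={mcse:.3f}, z={z:+.2f}")
```

Applied to the notebook's results, this would be called as `bias_with_mcse(mc_results[name], true_te)` for each model. The MCSE scales with `1 / sqrt(N_REPS)`, which is why 10 replications give only a coarse picture.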