# Model Selection and Parameter Tuning

This notebook demonstrates four capabilities of the measurement framework using an **A/A test**: treatment labels are assigned randomly to 50% of products with no real intervention, so the true treatment effect is 0 by construction.

1. **Model swappability** — Given the same data, switch between cross-sectional models by overriding a single config entry.
2. **Monte Carlo model comparison** — Run models across multiple replications to obtain sampling distributions, separating bias from variance.
3. **Parameter sensitivity** — For a given model, investigate how tuning parameters affect the treatment effect estimate.
4. **Parameter sensitivity with uncertainty** — Add uncertainty bands to parameter sweeps via Monte Carlo replications.

The three cross-sectional models (OLS experiment, subclassification, and nearest neighbour matching) share the same data-generating process: a single-day simulation with random treatment labels assigned to 50% of products (no real intervention).

### Workflow Overview

1. Generate a shared product catalog and define model overrides
2. For each model, override `MEASUREMENT` in the base config, write a temporary YAML file, and call `evaluate_impact()`
3. Compare treatment effect estimates against the known true effect of 0
4. Run Monte Carlo replications (varying outcome seeds, fixed treatment assignment) and visualize sampling distributions
5. Sweep tuning parameters for subclassification and nearest neighbour matching
6. Re-run parameter sweeps across replications and visualize uncertainty bands

## Initial Setup

In [None]:
import copy
from pathlib import Path

import pandas as pd
import yaml
from impact_engine_measure import evaluate_impact, load_results, parse_config_file
from online_retail_simulator import simulate

## Step 1 — Shared Data

All models will use the same product catalog.

In [None]:
catalog_job = simulate("configs/demo_model_selection_catalog.yaml", job_id="catalog")
products = catalog_job.load_df("products")

print(f"Generated {len(products)} products")
products.head()

## Step 2 — Configuration

Point to the impact engine config and set the ground truth.

Treatment assignment is controlled by the config's `DATA.ENRICHMENT` section. The `product_detail_boost` function randomly assigns 50% of products to treatment (`enrichment_fraction: 0.5`). Because `quality_boost: 0.0`, there is no actual effect — this is an A/A test and the true treatment effect is 0 by construction.
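
To make the A/A logic concrete, here is a minimal standalone sketch (standard library only) of what a random 50% assignment with a zero boost implies. The helper `assign_treatment` and the toy revenue model are hypothetical and are not the simulator's `product_detail_boost` implementation; only the names `enrichment_fraction` and `quality_boost` mirror the config.

```python
import random
import statistics


def assign_treatment(product_ids, enrichment_fraction, seed):
    """Randomly mark a fixed fraction of products as treated (hypothetical helper)."""
    rng = random.Random(seed)
    n_treated = int(len(product_ids) * enrichment_fraction)
    return set(rng.sample(product_ids, n_treated))


product_ids = list(range(1000))
treated = assign_treatment(product_ids, enrichment_fraction=0.5, seed=42)

# Toy outcome model: revenue is noise plus quality_boost for treated products.
quality_boost = 0.0  # A/A design: the label changes nothing
noise = random.Random(7)
revenue = {
    pid: noise.gauss(100.0, 10.0) + (quality_boost if pid in treated else 0.0)
    for pid in product_ids
}

treated_mean = statistics.mean(revenue[p] for p in product_ids if p in treated)
control_mean = statistics.mean(revenue[p] for p in product_ids if p not in treated)
diff_in_means = treated_mean - control_mean  # sampling noise only, no real effect
print(f"treated={len(treated)}, difference in means={diff_in_means:.3f}")
```

Any nonzero difference here is sampling noise, which is exactly what the models below should report as "no effect".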

In [None]:
output_path = Path("output/demo_model_selection")
output_path.mkdir(parents=True, exist_ok=True)

config_path = "configs/demo_model_selection.yaml"
true_te = 0  # A/A design: no treatment effect by construction

## Part 1: Model Swappability

We load one base config and override `MEASUREMENT` for each model.
Each iteration writes a temporary YAML and calls `evaluate_impact()`.

In [None]:
def run_with_override(base_config, measurement_override, storage_url, job_id, source_seed=None):
    """Override MEASUREMENT in base config, write temp YAML, run evaluate_impact().

    Optionally override the data-generating seed for Monte Carlo replications.
    """
    config = copy.deepcopy(base_config)
    config["MEASUREMENT"] = measurement_override
    if source_seed is not None:
        config["DATA"]["SOURCE"]["CONFIG"]["seed"] = source_seed

    tmp_config_path = Path(storage_url) / f"config_{job_id}.yaml"
    tmp_config_path.parent.mkdir(parents=True, exist_ok=True)
    with open(tmp_config_path, "w") as f:
        yaml.dump(config, f, default_flow_style=False)

    job_info = evaluate_impact(str(tmp_config_path), storage_url, job_id=job_id)
    result = load_results(job_info)
    return result.impact_results
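
The `copy.deepcopy` call in `run_with_override` matters because the config is a nested dict: a shallow copy would share the inner `DATA` dict, so setting a seed for one replication would leak into `base_config` and every later run. A small standalone illustration, using a toy config that mirrors only the key paths used above (the values are made up):

```python
import copy

# Toy config; only the key paths used by run_with_override are mirrored here.
toy_base = {
    "DATA": {"SOURCE": {"CONFIG": {"seed": 1}}},
    "MEASUREMENT": {"MODEL": "experiment", "PARAMS": {}},
}

# Shallow copy: the nested dicts are shared with the original.
shallow = dict(toy_base)
shallow["DATA"]["SOURCE"]["CONFIG"]["seed"] = 999
leaked_seed = toy_base["DATA"]["SOURCE"]["CONFIG"]["seed"]  # the base was modified

# Deep copy: nested dicts are duplicated, so the base stays untouched.
toy_base["DATA"]["SOURCE"]["CONFIG"]["seed"] = 1  # reset the base
deep = copy.deepcopy(toy_base)
deep["DATA"]["SOURCE"]["CONFIG"]["seed"] = 999
kept_seed = toy_base["DATA"]["SOURCE"]["CONFIG"]["seed"]  # the base is unchanged

print(leaked_seed, kept_seed)
```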

In [None]:
base_config = parse_config_file(config_path)

model_overrides = {
    "Experiment (OLS)": {
        "MODEL": "experiment",
        "PARAMS": {"formula": "revenue ~ enriched + price"},
    },
    "Subclassification": {
        "MODEL": "subclassification",
        "PARAMS": {
            "treatment_column": "enriched",
            "covariate_columns": ["price"],
            "n_strata": 5,
            "estimand": "att",
            "dependent_variable": "revenue",
        },
    },
    "Nearest Neighbour Matching": {
        "MODEL": "nearest_neighbour_matching",
        "PARAMS": {
            "treatment_column": "enriched",
            "covariate_columns": ["price"],
            "dependent_variable": "revenue",
            "caliper": 0.2,
            "replace": True,
            "ratio": 1,
        },
    },
}


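def extract_te(results):
    """Extract the treatment effect from model results regardless of model type."""
    estimates = results["data"]["impact_estimates"]
    model_type = results["model_type"]
    if model_type == "experiment":
        params = estimates["params"]
        # The coefficient name depends on how the formula encodes the treatment column.
        for key in ("enriched[T.True]", "enriched"):
            if key in params:
                return params[key]
        # Fail loudly: a silent default of 0 would look like a perfect A/A estimate.
        raise KeyError("No 'enriched' coefficient found in experiment params")
    elif model_type == "nearest_neighbour_matching":
        return estimates["att"]
    else:
        return estimates["treatment_effect"]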

In [None]:
model_results = {}
model_estimates = {}

for name, measurement in model_overrides.items():
    job_id = measurement["MODEL"]
    results = run_with_override(base_config, measurement, str(output_path), job_id)
    model_results[name] = results
    model_estimates[name] = extract_te(results)
    print(f"{name}: treatment effect = {model_estimates[name]:.4f}")

print(f"\nTrue effect: {true_te:.4f}")

In [None]:
comparison = pd.DataFrame(
    [
        {
            "Model": name,
            "Estimate": est,
            "True Effect": true_te,
            "Absolute Error": abs(est - true_te),
        }
        for name, est in model_estimates.items()
    ]
)

print("=" * 80)
print("CROSS-SECTIONAL MODEL COMPARISON (A/A)")
print("=" * 80)
print(comparison.to_string(index=False, float_format=lambda x: f"{x:.4f}"))
print("=" * 80)

In [None]:
from notebook_support import plot_model_comparison

plot_model_comparison(
    model_names=list(model_estimates.keys()),
    estimates=list(model_estimates.values()),
    true_effect=true_te,
    ylabel="Treatment Effect",
    title="A/A Test: Model Estimates (True Effect = 0)",
)

In [None]:
import numpy as np

N_REPS = 10
rng = np.random.default_rng(seed=2024)
mc_seeds = rng.integers(low=0, high=2**31, size=N_REPS).tolist()

## Part 2: Monte Carlo Model Comparison

Part 1 used a single random seed, making it impossible to distinguish systematic bias from sampling noise. Here we run all three models across multiple replications with varying outcome seeds to obtain sampling distributions. In the A/A setting, all models should be centered around 0.

**Design**: We vary `DATA.SOURCE.CONFIG.seed` (outcome noise) while keeping `DATA.ENRICHMENT.PARAMS.seed` fixed (same treatment assignment). This isolates estimator sampling variability from treatment assignment variability.
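
This design can be pictured as two independent random streams, one fixed and one varying. The sketch below uses the standard library and a hypothetical `one_replication` helper; it illustrates only the seeding scheme, not how the simulator generates data.

```python
import random


def one_replication(assignment_seed, outcome_seed, n_products=200):
    """Return (treated ids, outcomes) for one toy replication (hypothetical helper)."""
    # Stream 1: treatment assignment, driven only by assignment_seed.
    assign_rng = random.Random(assignment_seed)
    treated = frozenset(assign_rng.sample(range(n_products), n_products // 2))
    # Stream 2: outcome noise, driven only by outcome_seed.
    outcome_rng = random.Random(outcome_seed)
    outcomes = [outcome_rng.gauss(100.0, 10.0) for _ in range(n_products)]
    return treated, outcomes


ASSIGNMENT_SEED = 123  # held fixed, playing the role of DATA.ENRICHMENT.PARAMS.seed
reps = [one_replication(ASSIGNMENT_SEED, outcome_seed) for outcome_seed in (1, 2, 3)]

same_assignment = all(treated == reps[0][0] for treated, _ in reps)
distinct_outcomes = len({tuple(outcomes) for _, outcomes in reps}) == len(reps)
print(same_assignment, distinct_outcomes)
```

Because the treated set never changes, any spread in the estimates across replications comes from outcome noise alone.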

In [None]:
mc_results = {name: [] for name in model_overrides}

for i, seed in enumerate(mc_seeds):
    for name, measurement in model_overrides.items():
        job_id = f"mc_{measurement['MODEL']}_rep{i}"
        results = run_with_override(
            base_config,
            measurement,
            str(output_path),
            job_id,
            source_seed=seed,
        )
        mc_results[name].append(extract_te(results))

    if (i + 1) % 5 == 0:
        print(f"Completed {i + 1}/{N_REPS} replications")

print(f"Monte Carlo simulation complete: {N_REPS} replications x {len(model_overrides)} models")

In [None]:
mc_summary = pd.DataFrame(
    [
        {
            "Model": name,
            "Mean": np.mean(estimates),
            "Std": np.std(estimates, ddof=1),
            "Bias": np.mean(estimates) - true_te,
            "RMSE": np.sqrt(np.mean([(e - true_te) ** 2 for e in estimates])),
            "Min": np.min(estimates),
            "Max": np.max(estimates),
        }
        for name, estimates in mc_results.items()
    ]
)

print("=" * 90)
print(f"MONTE CARLO MODEL COMPARISON ({N_REPS} replications)")
print("=" * 90)
print(f"True treatment effect: {true_te:.4f}")
print("-" * 90)
print(mc_summary.to_string(index=False, float_format=lambda x: f"{x:.4f}"))
print("=" * 90)
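
The `Bias`, `Std`, and `RMSE` columns are linked by the identity RMSE² = Bias² + Var, where Var is the population variance (`ddof=0`). Because the table reports the sample standard deviation (`ddof=1`), `Bias² + Std²` comes out slightly larger than `RMSE²` when `N_REPS` is small. A standalone check with made-up estimates (not notebook output), using only the standard library:

```python
import math
import statistics

toy_estimates = [0.4, -0.1, 0.3, 0.2, -0.3]  # made-up values for illustration
truth = 0.0

bias = statistics.mean(toy_estimates) - truth
rmse = math.sqrt(statistics.mean((e - truth) ** 2 for e in toy_estimates))
pop_var = statistics.pvariance(toy_estimates)    # ddof=0
sample_var = statistics.variance(toy_estimates)  # ddof=1, as in the Std column

print(f"RMSE^2               = {rmse ** 2:.4f}")
print(f"Bias^2 + Var(ddof=0) = {bias ** 2 + pop_var:.4f}")     # equals RMSE^2
print(f"Bias^2 + Var(ddof=1) = {bias ** 2 + sample_var:.4f}")  # slightly larger
```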

In [None]:
from notebook_support import plot_monte_carlo_distribution

plot_monte_carlo_distribution(
    model_names=list(mc_results.keys()),
    distributions=mc_results,
    true_effect=true_te,
    ylabel="Treatment Effect",
    title=f"A/A Monte Carlo Model Comparison ({N_REPS} replications)",
)

## Part 3: Parameter Sensitivity

For a given model and data, how sensitive is the treatment effect estimate to tuning parameters?
We use the same override pattern, varying one parameter at a time.

### 3a. Subclassification: `n_strata`

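More strata means finer partitioning of the covariate space.
This can tighten covariate balance within each stratum, but it may leave some strata without common support (no treated or no control units), and those strata are dropped.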
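
As a rough picture of the mechanics, the sketch below implements a toy subclassification ATT on one covariate. It is not the framework's implementation: the equal-count binning, the treated-count weighting, and the function name `subclassify_att` are assumptions made for illustration. It returns the number of strata used and dropped, analogous to the `n_strata` and `n_strata_dropped` fields read in the next cell.

```python
def subclassify_att(rows, n_strata):
    """Toy ATT via subclassification on one covariate.

    rows: list of (covariate, treated: bool, outcome).
    Returns (att, strata_used, strata_dropped).
    """
    ordered = sorted(rows, key=lambda r: r[0])
    n = len(ordered)
    # Equal-count strata along the sorted covariate (quantile-style bins).
    bounds = [round(k * n / n_strata) for k in range(n_strata + 1)]
    weighted_sum, treated_total, used, dropped = 0.0, 0, 0, 0
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        stratum = ordered[lo:hi]
        treated = [y for _, t, y in stratum if t]
        control = [y for _, t, y in stratum if not t]
        if not treated or not control:
            dropped += 1  # no common support in this stratum
            continue
        effect = sum(treated) / len(treated) - sum(control) / len(control)
        weighted_sum += effect * len(treated)  # ATT weights: treated counts
        treated_total += len(treated)
        used += 1
    att = weighted_sum / treated_total if treated_total else float("nan")
    return att, used, dropped


# Made-up data: 12 units with a constant effect of 2.0 on the treated.
treated_ids = {0, 1, 4, 6, 8, 11}
toy_rows = [
    (float(i), i in treated_ids, 50.0 + (2.0 if i in treated_ids else 0.0))
    for i in range(12)
]
coarse = subclassify_att(toy_rows, n_strata=2)  # every stratum has both groups
fine = subclassify_att(toy_rows, n_strata=6)    # some strata lack one group
print(coarse, fine)
```

With six strata, two bins contain only treated or only control units and are dropped, which is the trade-off the sweep below explores.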

In [None]:
n_strata_values = [2, 3, 5, 10, 20, 50, 100]
subclass_estimates = []
strata_used = []
strata_dropped = []

for n in n_strata_values:
    measurement = {
        "MODEL": "subclassification",
        "PARAMS": {
            "treatment_column": "enriched",
            "covariate_columns": ["price"],
            "n_strata": n,
            "estimand": "att",
            "dependent_variable": "revenue",
        },
    }
    results = run_with_override(base_config, measurement, str(output_path), f"subclass_strata_{n}")
    estimates = results["data"]["impact_estimates"]

    subclass_estimates.append(estimates["treatment_effect"])
    strata_used.append(estimates["n_strata"])
    strata_dropped.append(estimates["n_strata_dropped"])

subclass_sensitivity = pd.DataFrame(
    {
        "n_strata (requested)": n_strata_values,
        "Strata Used": strata_used,
        "Strata Dropped": strata_dropped,
        "Treatment Effect": subclass_estimates,
        "Absolute Error": [abs(est - true_te) for est in subclass_estimates],
    }
)

print("Subclassification: n_strata Sensitivity")
print("-" * 70)
print(subclass_sensitivity.to_string(index=False, float_format=lambda x: f"{x:.4f}"))

In [None]:
from notebook_support import plot_parameter_sensitivity

plot_parameter_sensitivity(
    param_values=n_strata_values,
    estimates=subclass_estimates,
    true_effect=true_te,
    xlabel="Number of Strata (n_strata)",
    ylabel="Treatment Effect",
    title="Subclassification: Sensitivity to n_strata",
)

### 3b. Nearest Neighbour Matching: `caliper`

The caliper controls the maximum allowed distance between a treated unit and its matched control.
Smaller values enforce tighter matches but may discard units, while larger values allow more matches with worse balance.
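
The effect of the caliper can be seen in a toy 1-nearest-neighbour matcher with replacement. This is a sketch under stated assumptions, not the framework's estimator: `match_att` is a hypothetical helper, and the caliper here is an absolute distance in raw covariate units, whereas the framework may measure distance differently (for example, on a standardized scale).

```python
def match_att(treated, controls, caliper):
    """Toy 1-nearest-neighbour matching with replacement on one covariate.

    treated, controls: lists of (covariate, outcome).
    Returns (att, n_matched); treated units whose nearest control lies
    outside the caliper are discarded.
    """
    diffs = []
    for x_t, y_t in treated:
        # Closest control by absolute covariate distance (controls can be reused).
        x_c, y_c = min(controls, key=lambda c: abs(c[0] - x_t))
        if abs(x_c - x_t) <= caliper:
            diffs.append(y_t - y_c)
    att = sum(diffs) / len(diffs) if diffs else float("nan")
    return att, len(diffs)


# Made-up data for illustration.
toy_treated = [(1.0, 12.0), (2.0, 14.0), (5.0, 30.0)]
toy_controls = [(1.0, 10.0), (2.5, 13.0), (9.0, 20.0)]

tight = match_att(toy_treated, toy_controls, caliper=0.1)  # keeps only exact-ish matches
loose = match_att(toy_treated, toy_controls, caliper=5.0)  # keeps every treated unit
print(tight, loose)
```

The tight caliper keeps one well-matched pair, while the loose caliper keeps all three treated units, including a pair whose covariates differ by 2.5.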

In [None]:
caliper_values = [0.01, 0.05, 0.1, 0.2, 0.5, 1.0, 2.0]
matching_estimates = []
n_matched_att_list = []

for cal in caliper_values:
    measurement = {
        "MODEL": "nearest_neighbour_matching",
        "PARAMS": {
            "treatment_column": "enriched",
            "covariate_columns": ["price"],
            "dependent_variable": "revenue",
            "caliper": cal,
            "replace": True,
            "ratio": 1,
        },
    }
    results = run_with_override(base_config, measurement, str(output_path), f"matching_caliper_{cal}")
    estimates = results["data"]["impact_estimates"]
    summary = results["data"]["model_summary"]

    matching_estimates.append(estimates["att"])
    n_matched_att_list.append(summary["n_matched_att"])

matching_sensitivity = pd.DataFrame(
    {
        "Caliper": caliper_values,
        "N Matched (ATT)": n_matched_att_list,
        "Treatment Effect (ATT)": matching_estimates,
        "Absolute Error": [abs(est - true_te) for est in matching_estimates],
    }
)

print("Nearest Neighbour Matching: Caliper Sensitivity")
print("-" * 70)
print(matching_sensitivity.to_string(index=False, float_format=lambda x: f"{x:.4f}"))

In [None]:
plot_parameter_sensitivity(
    param_values=caliper_values,
    estimates=matching_estimates,
    true_effect=true_te,
    xlabel="Caliper",
    ylabel="Treatment Effect (ATT)",
    title="Nearest Neighbour Matching: Sensitivity to Caliper",
)

## Part 4: Parameter Sensitivity with Uncertainty

Part 3 showed how estimates change with tuning parameters using a single seed.
Here we add uncertainty bands (mean ± one sample standard deviation across replications) by running each parameter value across multiple replications.
This reveals whether apparent sensitivity is real or just noise.
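
The bands computed in the cells below are the mean ± one sample standard deviation across replications. A minimal standalone sketch of that calculation, with made-up replicate estimates and a hypothetical `one_sd_band` helper, also checks whether each band covers the true effect:

```python
import statistics


def one_sd_band(replicate_estimates):
    """Return (mean, lower, upper) with a band of +/- one sample standard deviation."""
    mean = statistics.mean(replicate_estimates)
    sd = statistics.stdev(replicate_estimates)  # sample std, ddof=1
    return mean, mean - sd, mean + sd


# Made-up replicate estimates for two hypothetical parameter values.
toy_sweep = {
    "param=a": [0.2, -0.1, 0.1, -0.2, 0.0],
    "param=b": [1.1, 0.9, 1.2, 0.8, 1.0],
}
truth = 0.0
covers_truth = {}
for label, values in toy_sweep.items():
    mean, lower, upper = one_sd_band(values)
    covers_truth[label] = lower <= truth <= upper
    print(f"{label}: mean={mean:.3f}, band=[{lower:.3f}, {upper:.3f}]")
```

A ±1 standard deviation band describes the spread of individual estimates; it is not a confidence interval for the mean. A band that excludes the true effect suggests a systematic shift at that parameter value, though with few replications this is only indicative.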

### 4a. Subclassification: `n_strata`

In [None]:
n_strata_mc = {n: [] for n in n_strata_values}

for i, seed in enumerate(mc_seeds):
    for n in n_strata_values:
        measurement = {
            "MODEL": "subclassification",
            "PARAMS": {
                "treatment_column": "enriched",
                "covariate_columns": ["price"],
                "n_strata": n,
                "estimand": "att",
                "dependent_variable": "revenue",
            },
        }
        results = run_with_override(
            base_config,
            measurement,
            str(output_path),
            f"mc_subclass_{n}_rep{i}",
            source_seed=seed,
        )
        n_strata_mc[n].append(results["data"]["impact_estimates"]["treatment_effect"])

    if (i + 1) % 5 == 0:
        print(f"Subclassification sweep: {i + 1}/{N_REPS} replications")

In [None]:
from notebook_support import plot_parameter_sensitivity_mc

strata_means = [np.mean(n_strata_mc[n]) for n in n_strata_values]
strata_stds = [np.std(n_strata_mc[n], ddof=1) for n in n_strata_values]
strata_lower = [m - s for m, s in zip(strata_means, strata_stds)]
strata_upper = [m + s for m, s in zip(strata_means, strata_stds)]

plot_parameter_sensitivity_mc(
    param_values=n_strata_values,
    mean_estimates=strata_means,
    lower_band=strata_lower,
    upper_band=strata_upper,
    true_effect=true_te,
    xlabel="Number of Strata (n_strata)",
    ylabel="Treatment Effect",
    title=f"Subclassification: n_strata Sensitivity ({N_REPS} replications)",
)

### 4b. Nearest Neighbour Matching: `caliper`

In [None]:
caliper_mc = {c: [] for c in caliper_values}

for i, seed in enumerate(mc_seeds):
    for cal in caliper_values:
        measurement = {
            "MODEL": "nearest_neighbour_matching",
            "PARAMS": {
                "treatment_column": "enriched",
                "covariate_columns": ["price"],
                "dependent_variable": "revenue",
                "caliper": cal,
                "replace": True,
                "ratio": 1,
            },
        }
        results = run_with_override(
            base_config,
            measurement,
            str(output_path),
            f"mc_matching_{cal}_rep{i}",
            source_seed=seed,
        )
        caliper_mc[cal].append(results["data"]["impact_estimates"]["att"])

    if (i + 1) % 5 == 0:
        print(f"Matching sweep: {i + 1}/{N_REPS} replications")

In [None]:
cal_means = [np.mean(caliper_mc[c]) for c in caliper_values]
cal_stds = [np.std(caliper_mc[c], ddof=1) for c in caliper_values]
cal_lower = [m - s for m, s in zip(cal_means, cal_stds)]
cal_upper = [m + s for m, s in zip(cal_means, cal_stds)]

plot_parameter_sensitivity_mc(
    param_values=caliper_values,
    mean_estimates=cal_means,
    lower_band=cal_lower,
    upper_band=cal_upper,
    true_effect=true_te,
    xlabel="Caliper",
    ylabel="Treatment Effect (ATT)",
    title=f"Nearest Neighbour Matching: Caliper Sensitivity ({N_REPS} replications)",
)

## Key Takeaways

**A/A design.**
- The true treatment effect is 0 by construction (treatment labels are random with no real intervention). This tests whether models correctly report no effect when there is none.

**Model swappability.**
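- In a single run, all three models should produce estimates close to 0 from the same simulated data. One draw is consistent with correct behavior under the null but cannot confirm it, which is why the Monte Carlo analysis follows.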
- Switching models requires only changing the `MEASUREMENT` entry in the config.
- The `evaluate_impact()` interface stays the same regardless of the model.

**Monte Carlo analysis.**
- Running multiple replications reveals the sampling distribution of each estimator, separating bias from variance.
- Under the A/A design, all three models should be centered around 0. Deviations indicate bias.
- With `N_REPS=10`, the distributions are informative but coarse. For publication-quality analysis, increase to `N_REPS >= 500`.
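
To see why more replications help, recall that the standard error of a Monte Carlo mean shrinks as 1/sqrt(N_REPS), assuming independent replications. The sketch below expresses it as a fraction of the estimator's standard deviation; `mc_standard_error` is a hypothetical helper.

```python
import math


def mc_standard_error(std_of_estimates, n_reps):
    """Standard error of the Monte Carlo mean: std / sqrt(n_reps)."""
    return std_of_estimates / math.sqrt(n_reps)


for n_reps in (10, 100, 500):
    # With std=1.0 the result reads as a fraction of the estimator's std.
    print(n_reps, round(mc_standard_error(1.0, n_reps), 3))
```

With 10 replications the estimated bias is known only to within roughly a third of the estimator's standard deviation; with 500 it is within about 4.5%.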

**Parameter sensitivity.**
- **Subclassification** is relatively stable across `n_strata` values. Very low values may under-partition, while very high values may drop strata with insufficient common support.
- **Nearest neighbour matching** is more sensitive to `caliper`. Very small calipers may discard too many units, while very large calipers degrade match quality.
- Parameter sensitivity plots with uncertainty bands show which apparent patterns are robust to sampling variation and which are noise.