# Linear Regression

This notebook demonstrates **experiment-based** impact estimation via [statsmodels](https://www.statsmodels.org/) [`ols()`](https://www.statsmodels.org/stable/generated/statsmodels.formula.api.ols.html).

The experiment model fits an OLS regression with an R-style formula. The `enriched` indicator acts as the treatment variable, and the coefficient on `enriched` estimates the average treatment effect.
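As a standalone illustration of the idea, the sketch below fits the same kind of formula on synthetic data with a known effect. The data, the variable names (`toy`, `toy_effect`), and the true effect of 5.0 are assumptions made for this example; it does not show how the engine calls statsmodels internally.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data (hypothetical): revenue depends on price plus a fixed boost of 5.0
# for enriched products, with unit-variance noise.
rng = np.random.default_rng(0)
n = 2000
toy = pd.DataFrame(
    {
        "price": rng.uniform(10, 100, n),
        "enriched": rng.random(n) < 0.5,  # roughly 50/50 random assignment
    }
)
toy["revenue"] = 2.0 * toy["price"] + 5.0 * toy["enriched"] + rng.normal(0, 1, n)

# R-style formula: outcome ~ treatment + control
fit = smf.ols("revenue ~ enriched + price", data=toy).fit()

# The treatment coefficient's label depends on how the boolean column is encoded,
# so look it up by prefix rather than hard-coding the name.
coef_name = next(k for k in fit.params.index if k.startswith("enriched"))
toy_effect = float(fit.params[coef_name])

print(f"{coef_name}: {toy_effect:.3f} (SE {fit.bse[coef_name]:.3f})")
```

Because assignment is random in this toy setup, the coefficient on `enriched` should land close to the simulated effect of 5.0.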

## Workflow Overview

1. User provides `products.csv`
2. User configures `DATA.ENRICHMENT` for treatment assignment
3. User calls `evaluate_impact(config.yaml)`
4. Engine handles everything internally (adapter, enrichment, model)

## Initial Setup

In [None]:
from pathlib import Path

import pandas as pd
from impact_engine_measure import evaluate_impact, load_results, parse_config_file
from impact_engine_measure.metrics import create_metrics_manager
from impact_engine_measure.models.factory import get_model_adapter
from online_retail_simulator import simulate

## Step 1 — Product Catalog

This demo simulates a product catalog. In production, you would supply your own `products.csv` instead.

In [None]:
output_path = Path("output/demo_experiment")
output_path.mkdir(parents=True, exist_ok=True)

job_info = simulate("configs/demo_experiment_catalog.yaml", job_id="catalog")
products = job_info.load_df("products")

print(f"Generated {len(products)} products")
print(f"Products catalog: {job_info.get_store().full_path('products.csv')}")
products.head()

## Step 2 — Engine Configuration

Configure the engine with the following sections:

- `ENRICHMENT` — Treatment assignment via quality boost (50/50 split)
- `MODEL` — `experiment` with formula `revenue ~ enriched + price`

The formula specifies an OLS regression where `revenue` is the outcome, `enriched` is the treatment indicator, and `price` is a control variable. The coefficient on `enriched` estimates the treatment effect.
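The label of the treatment coefficient depends on how the `enriched` column is encoded: a boolean column is typically reported as `enriched[T.True]`, while a numeric 0/1 column is reported as `enriched`. The later cells in this notebook handle both cases with a nested `.get()`. The hypothetical helper below (`treatment_coefficient` is not part of the engine) sketches the same lookup in one place.

```python
def treatment_coefficient(params, name="enriched", default=None):
    """Return the coefficient for `name`, trying the boolean-encoded label first.

    `params` is any mapping from coefficient label to value, such as the
    `params` dictionary read from the results later in this notebook.
    """
    for key in (f"{name}[T.True]", name):
        if key in params:
            return params[key]
    return default


# Hypothetical coefficient dictionaries illustrating both encodings.
bool_encoded = {"Intercept": 1.0, "enriched[T.True]": 2.5, "price": 0.3}
int_encoded = {"Intercept": 1.0, "enriched": 2.5, "price": 0.3}

print(treatment_coefficient(bool_encoded))
print(treatment_coefficient(int_encoded))
print(treatment_coefficient({"Intercept": 1.0}, default="N/A"))
```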

In [None]:
config_path = "configs/demo_experiment.yaml"
baseline_config_path = "configs/demo_experiment_baseline.yaml"

## Step 3 — Impact Evaluation

A single call to `evaluate_impact()` runs the full pipeline:

- Engine creates `CatalogSimulatorAdapter`
- Adapter simulates metrics (single-day, cross-sectional)
- Adapter applies enrichment (treatment assignment + revenue boost)
- `ExperimentAdapter` fits OLS regression with the specified formula

In [None]:
job_info = evaluate_impact(config_path, str(output_path), job_id="results")
print(f"Job ID: {job_info.job_id}")

## Step 4 — Review Results

In [None]:
results = load_results(job_info)

data = results.impact_results["data"]
model_params = data["model_params"]
estimates = data["impact_estimates"]
summary = data["model_summary"]

print("=" * 60)
print("EXPERIMENT (OLS REGRESSION) RESULTS")
print("=" * 60)

print(f"\nFormula: {model_params['formula']}")
print(f"R-squared: {summary['rsquared']:.4f}")
print(f"R-squared (adj): {summary['rsquared_adj']:.4f}")
print(f"F-statistic: {summary['fvalue']:.2f} (p={summary['f_pvalue']:.4e})")
print(f"Observations: {summary['nobs']}")

print("\n--- Coefficients ---")
print("-" * 70)
print(f"{'Variable':<15} {'Coef':<12} {'Std Err':<12} {'P-value':<12} {'95% CI'}")
print("-" * 70)
for var in estimates["params"]:
    coef = estimates["params"][var]
    se = estimates["bse"][var]
    pval = estimates["pvalues"][var]
    ci = estimates["conf_int"][var]
    print(f"{var:<15} {coef:<12.4f} {se:<12.4f} {pval:<12.4e} [{ci[0]:.4f}, {ci[1]:.4f}]")

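# The enriched coefficient name depends on how statsmodels encodes the boolean
params = estimates["params"]
treatment_effect = params.get("enriched[T.True]", params.get("enriched", "N/A"))

print("\n" + "=" * 60)
print(f"Treatment effect (enriched coefficient): {treatment_effect}")
print("=" * 60)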

## Step 5 — Model Validation

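Compare the model's estimate against the **true causal effect**, computed as the difference between factual (enriched) and counterfactual (baseline) revenue for the treated products. This quantity is the average treatment effect on the treated (ATT). With randomized assignment, the ATT equals the average treatment effect in expectation, so it can be compared with the coefficient on `enriched`.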
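To make the comparison concrete, here is a minimal sketch on hand-made data. The four products, their revenues, and the model estimate of 9.0 are hypothetical values chosen so the arithmetic is easy to follow; the recovery-accuracy formula is the same one used in the validation cell further down.

```python
import pandas as pd

# Counterfactual: revenue without enrichment (hypothetical values).
toy_baseline = pd.DataFrame(
    {"product_id": [1, 2, 3, 4], "revenue": [100.0, 200.0, 300.0, 400.0]}
)
# Factual: products 1 and 2 are enriched and each earn 10 more.
toy_enriched = pd.DataFrame(
    {
        "product_id": [1, 2, 3, 4],
        "enriched": [True, True, False, False],
        "revenue": [110.0, 210.0, 300.0, 400.0],
    }
)

# True effect: mean revenue of treated products, factual minus counterfactual.
treated_ids = toy_enriched.loc[toy_enriched["enriched"], "product_id"]
factual_mean = toy_enriched.loc[toy_enriched["product_id"].isin(treated_ids), "revenue"].mean()
counterfactual_mean = toy_baseline.loc[toy_baseline["product_id"].isin(treated_ids), "revenue"].mean()
toy_true_effect = factual_mean - counterfactual_mean

# Recovery accuracy for a hypothetical model estimate.
toy_model_estimate = 9.0
toy_accuracy = (1 - abs(1 - toy_model_estimate / toy_true_effect)) * 100

print(f"True effect: {toy_true_effect:.1f}")
print(f"Recovery accuracy: {toy_accuracy:.1f}%")
```

Here the treated products average 160 with enrichment and 150 without, so the true effect is 10; an estimate of 9 recovers 90% of it.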

In [None]:
def calculate_true_effect(
    baseline_metrics: pd.DataFrame,
    enriched_metrics: pd.DataFrame,
) -> dict:
    """Calculate the true ATT (average treatment effect on the treated) from per-product revenue."""
    treated_ids = enriched_metrics[enriched_metrics["enriched"]]["product_id"].unique()

    enriched_treated = enriched_metrics[enriched_metrics["product_id"].isin(treated_ids)]
    baseline_treated = baseline_metrics[baseline_metrics["product_id"].isin(treated_ids)]

    enriched_mean = enriched_treated.groupby("product_id")["revenue"].mean().mean()
    baseline_mean = baseline_treated.groupby("product_id")["revenue"].mean().mean()
    treatment_effect = enriched_mean - baseline_mean

    return {
        "enriched_mean": float(enriched_mean),
        "baseline_mean": float(baseline_mean),
        "treatment_effect": float(treatment_effect),
    }

In [None]:
parsed_baseline = parse_config_file(baseline_config_path)
baseline_manager = create_metrics_manager(parsed_baseline)
baseline_metrics = baseline_manager.retrieve_metrics(products)

parsed_enriched = parse_config_file(config_path)
enriched_manager = create_metrics_manager(parsed_enriched)
enriched_metrics = enriched_manager.retrieve_metrics(products)

print(f"Baseline records: {len(baseline_metrics)}")
print(f"Enriched records: {len(enriched_metrics)}")

In [None]:
true_effect = calculate_true_effect(baseline_metrics, enriched_metrics)

true_te = true_effect["treatment_effect"]
# The enriched coefficient name depends on how statsmodels encodes the boolean
model_te = estimates["params"].get("enriched[T.True]", estimates["params"].get("enriched", 0))

if true_te != 0:
    recovery_accuracy = (1 - abs(1 - model_te / true_te)) * 100
else:
    recovery_accuracy = 100 if model_te == 0 else 0

print("=" * 60)
print("TRUTH RECOVERY VALIDATION")
print("=" * 60)
print(f"True treatment effect:  {true_te:.4f}")
print(f"Model estimate:         {model_te:.4f}")
print(f"Recovery accuracy:      {max(0, recovery_accuracy):.1f}%")
print("=" * 60)

### Convergence Analysis

How does the estimate converge to the true effect as sample size increases?
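The intuition is that the standard error of the estimate shrinks roughly in proportion to one over the square root of the sample size. The sketch below illustrates this with a plain difference in means on simulated data rather than the engine's OLS fit; the function name `diff_in_means`, the true effect of 10.0, and the noise level are all assumptions made for the example.

```python
import random
import statistics


def diff_in_means(n, true_effect=10.0, noise_sd=5.0, seed=0):
    """Simulate n units with a 50/50 split; return treated mean minus control mean."""
    rng = random.Random(seed)
    treated, control = [], []
    for i in range(n):
        # Alternate assignment so both groups are non-empty for every n >= 2.
        if i % 2 == 0:
            treated.append(100.0 + true_effect + rng.gauss(0, noise_sd))
        else:
            control.append(100.0 + rng.gauss(0, noise_sd))
    return statistics.fmean(treated) - statistics.fmean(control)


toy_sizes = [20, 200, 2000, 20000]
toy_estimates = [diff_in_means(n) for n in toy_sizes]

for n, est in zip(toy_sizes, toy_estimates):
    print(f"n={n:>6}: estimate={est:.3f}, abs error={abs(est - 10.0):.3f}")
```

Individual runs can still move away from the truth between adjacent sample sizes; the error shrinks on average, not monotonically.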

In [None]:
sample_sizes = [20, 50, 100, 200, 300, 500, 1500]
estimates_list = []
truth_list = []

parsed = parse_config_file(config_path)
measurement_config = parsed["MEASUREMENT"]
all_product_ids = enriched_metrics["product_id"].unique()

for n in sample_sizes:
    # Restrict both datasets to the first n products.
    subset_ids = all_product_ids[:n]
    enriched_sub = enriched_metrics[enriched_metrics["product_id"].isin(subset_ids)]
    baseline_sub = baseline_metrics[baseline_metrics["product_id"].isin(subset_ids)]

    # Ground truth for this subset, from counterfactual vs factual revenue.
    true = calculate_true_effect(baseline_sub, enriched_sub)
    truth_list.append(true["treatment_effect"])

    # Refit the experiment model on the enriched (factual) subset only.
    model = get_model_adapter("experiment")
    model.connect(measurement_config["PARAMS"])
    result = model.fit(data=enriched_sub)
    coef = result.data["impact_estimates"]["params"]
    # The coefficient name depends on how statsmodels encodes the boolean.
    te = coef.get("enriched[T.True]", coef.get("enriched", 0))
    estimates_list.append(te)

print("Convergence analysis complete.")

In [None]:
from notebook_support import plot_convergence

plot_convergence(
    sample_sizes,
    estimates_list,
    truth_list,
    xlabel="Number of Products",
    ylabel="Treatment Effect",
    title="Experiment: Convergence of Estimate to True Effect",
)