# EarlySign Tutorial 004: Complex A/B Experiment with Guardrails and Adaptive Information Time

This tutorial demonstrates a realistic A/B testing scenario combining:

1. **Primary endpoint**: CTR improvement (A vs B) with group sequential testing (5 looks, both futility and efficacy)
2. **Guardrail metrics**: Two safety metrics monitored via safe testing
3. **Adaptive information time**: Re-estimation based on observed sample sizes
4. **Multiple testing correction**: Bonferroni allocation (α=0.05 for the primary A/B comparison, α=0.05 in total split across the guardrails)
5. **Wall-clock scheduling**: Progress reports every 3 days over a 2-week period

## Business Context

**Scenario**: We want to improve webpage CTR by testing a new design (B) against baseline (A).
- **Primary metric**: Click-through rate (higher is better, so B > A is the alternative)
- **Guardrail 1**: Page load time (B should not significantly increase load time)
- **Guardrail 2**: Bounce rate (B should not significantly increase bounce rate)

**Statistical Design**:
- Primary: Two-sided GST with O'Brien-Fleming spending, α=0.05, 5 looks
- Guardrails: One-sided safe tests, α=0.025 each (Bonferroni: 0.05/2)
- Information time: Adaptive based on daily sample size variation
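
To make the O'Brien-Fleming spending in the design concrete, here is a minimal sketch of the Lan-DeMets O'Brien-Fleming-type spending function, α(t) = 2 − 2Φ(z_{α/2}/√t), evaluated at five equally spaced looks. It uses only the Python standard library; the function name `obf_alpha_spent` is a hypothetical helper for illustration and is not part of EarlySign, whose internal implementation may differ.

```python
from statistics import NormalDist


def obf_alpha_spent(t: float, alpha: float = 0.05) -> float:
    """Cumulative two-sided alpha spent at information fraction t (0 < t <= 1)
    under the Lan-DeMets O'Brien-Fleming-type spending function."""
    if not 0.0 < t <= 1.0:
        raise ValueError("information fraction must be in (0, 1]")
    std_normal = NormalDist()
    z = std_normal.inv_cdf(1.0 - alpha / 2.0)
    return 2.0 * (1.0 - std_normal.cdf(z / t**0.5))


looks = 5
fractions = [k / looks for k in range(1, looks + 1)]
cumulative = [obf_alpha_spent(t) for t in fractions]
# Alpha newly spent at each look is the increment of the cumulative curve.
incremental = [cumulative[0]] + [b - a for a, b in zip(cumulative, cumulative[1:])]

for k, (t, cum, inc) in enumerate(zip(fractions, cumulative, incremental), start=1):
    print(f"look {k}: t={t:.1f}  cumulative alpha={cum:.5f}  spent at this look={inc:.5f}")
```

Very little alpha is spent at early looks and the cumulative total reaches the full α at t = 1, which is why early stopping under this design requires strong evidence.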

In [None]:
# Framework components now available in earlysign.api.ab_test
from earlysign.api.ab_test import (
    ab_test_with_guardrails,
    GuardrailConfig,
    ABTestExperiment,
)
from earlysign.methods.group_sequential.adaptive import (
    AdaptiveInfoTime,
    AdaptiveGSTBoundary,
)

# Runtime components
from earlysign.runtime import SequentialRunner

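# Other imports
import matplotlib.pyplot as plt  # used by the dashboard cell
import numpy as np
import polars as pl
from earlysign.backends.polars.ledger import PolarsLedger

# NOTE: `Namespace` and `LedgerReporter` are also used in later cells and must be
# imported from EarlySign; their module paths are not shown in this tutorial.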

## Problem: Multiple Statistics in Same Namespace

Before building the experiment, let's highlight the current design issue:

**Current Issue**: When multiple statistics are used in the same experiment (e.g., CTR Wald Z, Load Time Safe Test, Bounce Rate Safe Test), they all go to `namespace='stats'`. The `latest(namespace='stats')` call becomes ambiguous.

**Current Workaround**: Use `tag` parameter: `latest(namespace='stats', tag='stat:ctr_waldz')`

**Proposed Solutions**:
1. Add a `metric_id` field for explicit metric identification
2. Use hierarchical `statistic_type` field: `stats.ctr.waldz`, `stats.loadtime.safetest`
3. Enhanced `tag` system with required tags for statistics

For this demo, we'll use the current `tag`-based approach but highlight where the new design would be cleaner.
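
The ambiguity can be reproduced with a toy in-memory ledger. This is a sketch only: `ToyLedger` and `Event` are hypothetical stand-ins written for this illustration, not EarlySign classes, and the payload numbers are made up.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Event:
    namespace: str
    tag: str
    payload: dict


@dataclass
class ToyLedger:
    """Hypothetical append-only ledger used only to illustrate the lookup problem."""

    events: List[Event] = field(default_factory=list)

    def write_event(self, namespace: str, tag: str, payload: dict) -> None:
        self.events.append(Event(namespace, tag, payload))

    def latest(self, namespace: str, tag: Optional[str] = None) -> Optional[Event]:
        # Scan from newest to oldest and return the first match.
        for event in reversed(self.events):
            if event.namespace == namespace and (tag is None or event.tag == tag):
                return event
        return None


toy = ToyLedger()
toy.write_event("stats", "stat:ctr_waldz", {"z": 1.8})
toy.write_event("stats", "stat:loadtime_safe", {"e_value": 2.3})
toy.write_event("stats", "stat:bounce_safe", {"e_value": 0.9})

untagged = toy.latest(namespace="stats")  # whichever statistic was written last
tagged = toy.latest(namespace="stats", tag="stat:ctr_waldz")
print(untagged.tag)
print(tagged.tag)
```

Without a tag, the result depends purely on write order, so a criterion that expects the CTR statistic could silently read a guardrail statistic instead.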

## Experiment Design: Multi-Metric A/B Test Module

We'll use framework components that coordinate multiple statistics and criteria:

In [None]:
from dataclasses import dataclass
from typing import List, Optional, Union
from earlysign.core.ledger import Ledger

# Import the framework components we just created
from earlysign.api.multi_metric import MultiMetricABTest, GuardrailConfig
from earlysign.methods.group_sequential.adaptive import AdaptiveInfoTime

# Example usage with the new framework components
print("✅ Now using framework components for:")
print("  - AdaptiveInfoTime: earlysign.methods.group_sequential.adaptive")
print("  - MultiMetricABTest: earlysign.api.multi_metric")
print("  - GuardrailConfig: earlysign.api.multi_metric")

# Demo the AdaptiveInfoTime component
adaptive_demo = AdaptiveInfoTime(initial_target=1000, looks=5)
print(f"\n📊 AdaptiveInfoTime Demo:")
print(f"  Planned fractions: {adaptive_demo.planned_fractions}")

# At look 3, observed 800 samples instead of planned 600
adapted_t = adaptive_demo.get_info_fraction(current_look=3, observed_n=800)
print(
    f"  Look 3 - Planned: {adaptive_demo.planned_fractions[2]:.3f}, Adapted: {adapted_t:.3f}"
)

# Create a multi-metric experiment using the framework
guardrails = [
    GuardrailConfig(name="loadtime", alpha=0.025, method="safe_test"),
    GuardrailConfig(name="bounce", alpha=0.025, method="safe_test"),
]

print(f"\n🔧 Framework-based Multi-Metric Experiment:")
print(f"  Components moved to: earlysign.api.multi_metric")
print(f"  Adaptive info time moved to: earlysign.methods.group_sequential.adaptive")
print(f"  Ready for reuse across projects!")

## Data Simulation: Multi-Metric Observations

We'll simulate realistic daily data with:
- Variable daily sample sizes
- Correlated metrics (users with different characteristics)
- True effects reflecting business scenario
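
The run loop later calls `simulator.simulate_day(day, rng)` and reads `a_successes`, `a_total`, `b_successes`, and `b_total` for each metric. A minimal sketch of such a simulator follows. Everything in it is an assumption made for illustration: the class name `DailySimulator`, all rates, the weekend dip, the use of the standard library `random.Random` as `rng`, and the encoding of load time as a binary "slow page load" outcome so that it fits the two-proportion payload.

```python
import random
from dataclasses import dataclass


@dataclass
class DailySimulator:
    """Hypothetical daily data generator; every rate here is illustrative."""

    base_n_per_arm: int = 100
    ctr_a: float = 0.10
    ctr_b: float = 0.12        # assumed true CTR lift for B
    slow_a: float = 0.20       # share of page loads above a latency threshold
    slow_b: float = 0.21
    bounce_a: float = 0.40
    bounce_b: float = 0.40

    def _simulate_arm(self, rng: random.Random, n: int, ctr: float,
                      slow: float, bounce: float) -> dict:
        counts = {"ctr": 0, "loadtime": 0, "bounce": 0}
        for _ in range(n):
            # Half the users are "engaged": they click more and bounce less,
            # which correlates the metrics while keeping the average rates.
            engaged = rng.random() < 0.5
            counts["ctr"] += rng.random() < ctr * (1.5 if engaged else 0.5)
            counts["bounce"] += rng.random() < bounce * (0.5 if engaged else 1.5)
            counts["loadtime"] += rng.random() < slow
        return counts

    def simulate_day(self, day: int, rng: random.Random) -> dict:
        # Variable daily traffic: a weekend dip plus random noise per arm.
        scale = 0.7 if day % 7 in (6, 0) else 1.0
        n_a = max(1, round(self.base_n_per_arm * scale * rng.uniform(0.8, 1.2)))
        n_b = max(1, round(self.base_n_per_arm * scale * rng.uniform(0.8, 1.2)))
        arm_a = self._simulate_arm(rng, n_a, self.ctr_a, self.slow_a, self.bounce_a)
        arm_b = self._simulate_arm(rng, n_b, self.ctr_b, self.slow_b, self.bounce_b)
        return {
            metric: {
                "a_successes": arm_a[metric],
                "a_total": n_a,
                "b_successes": arm_b[metric],
                "b_total": n_b,
            }
            for metric in ("ctr", "loadtime", "bounce")
        }


simulator = DailySimulator()
rng = random.Random(42)  # fixed seed for reproducibility
print(simulator.simulate_day(1, rng))
```

With roughly 100 users per arm per day, 14 days yields a per-arm sample in the neighborhood of the 1,400 target used later, though the exact total varies with the seed.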

In [None]:
# Configure the multi-metric A/B test

# Configure guardrails using domain-friendly API
guardrails = [
    GuardrailConfig(name="loadtime", alpha=0.025, method="safe_test"),
    GuardrailConfig(name="bounce", alpha=0.025, method="safe_test"),
]

# Create comprehensive A/B test experiment
experiment = ab_test_with_guardrails(
    experiment_id="exp#web_test",
    primary_alpha=0.05,  # Full alpha for the primary CTR endpoint
    guardrails=guardrails,  # Safety monitoring
    looks=5,
    adaptive_info=True,  # Adapt to real sample sizes
    target_n_per_arm=1000,
)

print(f"✅ A/B test configured with {len(guardrails)} guardrails")
print(f"Primary endpoint: α = {experiment.primary_alpha}")
print(f"Guardrail metrics: {[g.name for g in experiment.guardrails]}")
print(f"Adaptive information timing: {experiment.adaptive_info}")

## Running the Multi-Metric Experiment

Now we'll run the 2-week experiment with progress reports every 3 days, plus a final look on day 14 (five looks in total):

In [None]:
# Experiment setup
exp_id = "exp#web_ctr_improvement"
experiment = MultiMetricABTest(
    experiment_id=exp_id,
    primary_alpha=0.05,
    guardrail_alpha_total=0.05,
    looks=5,
    spending="obf",
    target_n_per_arm=1400,  # Target for 2 weeks
)

runner = SequentialRunner(experiment, PolarsLedger())
print(f"🚀 Experiment initialized: {exp_id}")
print(f"   Components: {len(experiment.components)}")
print(f"   Target sample size: {experiment.target_n_per_arm} per arm")

# Simulation parameters
experiment_days = 14
report_interval = 3  # Every 3 days
report_days = [
    report_interval * i for i in range(1, experiment_days // report_interval + 1)
]
if experiment_days not in report_days:
    report_days.append(experiment_days)

print(f"📅 Schedule: {experiment_days} days, reports on days {report_days}")

In [None]:
# Run the experiment day by day
cumulative_data = {
    "ctr": {"a_success": 0, "a_total": 0, "b_success": 0, "b_total": 0},
    "loadtime": {"a_success": 0, "a_total": 0, "b_success": 0, "b_total": 0},
    "bounce": {"a_success": 0, "a_total": 0, "b_success": 0, "b_total": 0},
}

daily_results = []
look_counter = 0
stopped_early = False
stop_reason = None

for day in range(1, experiment_days + 1):
    # Simulate one day of data; assumes `simulator` and `rng` were created
    # in the data-simulation step
    day_data = simulator.simulate_day(day, rng)

    # Accumulate data
    for metric in ["ctr", "loadtime", "bounce"]:
        cumulative_data[metric]["a_success"] += day_data[metric]["a_successes"]
        cumulative_data[metric]["a_total"] += day_data[metric]["a_total"]
        cumulative_data[metric]["b_success"] += day_data[metric]["b_successes"]
        cumulative_data[metric]["b_total"] += day_data[metric]["b_total"]

    daily_results.append(
        {
            "day": day,
            "total_n": cumulative_data["ctr"]["a_total"]
            + cumulative_data["ctr"]["b_total"],
            "ctr_a_rate": cumulative_data["ctr"]["a_success"]
            / max(1, cumulative_data["ctr"]["a_total"]),
            "ctr_b_rate": cumulative_data["ctr"]["b_success"]
            / max(1, cumulative_data["ctr"]["b_total"]),
        }
    )

    # Progress report and analysis on scheduled days
    if day in report_days and not stopped_early:
        look_counter += 1
        time_index = f"day_{day:02d}"
        step_key = f"look_{look_counter}"

        print(f"\n📊 === Day {day} Analysis (Look {look_counter}) ===")
        print(f"Cumulative sample: {daily_results[-1]['total_n']} total")

        # Add observations to ledger for each metric
        # Note: This highlights the namespace/tag issue - we need different tags for each metric

        for metric_name, data in cumulative_data.items():
            # Each metric gets its own observation event with distinct tag
            runner.ledger.write_event(
                time_index=time_index,
                namespace=Namespace.OBS,
                kind="batch",
                experiment_id=exp_id,
                step_key=step_key,
                payload_type="TwoPropObsBatch",
                payload={
                    "nA": data["a_total"],
                    "mA": data["a_success"],
                    "nB": data["b_total"],
                    "mB": data["b_success"],
                },
                tag=f"obs:{metric_name}",  # Distinct tag per metric
            )

        # Run analysis step
        experiment.step(runner.ledger, exp_id, step_key, time_index)

        # Check for early stopping signals: one decision tag per criterion
        signals = []
        for criterion in ["ctr_gst", "loadtime_safe", "bounce_safe"]:
            latest_signal = runner.ledger.latest(
                namespace=Namespace.SIGNALS, tag=f"{criterion}:decision"
            )
            if latest_signal:
                signals.append(latest_signal)

        # Report current metrics
        for metric_name in ["ctr", "loadtime", "bounce"]:
            data = cumulative_data[metric_name]
            a_rate = data["a_success"] / max(1, data["a_total"])
            b_rate = data["b_success"] / max(1, data["b_total"])
            effect = b_rate - a_rate
            print(
                f"  {metric_name.upper():>8s}: A={a_rate:.3f}, B={b_rate:.3f}, Effect={effect:+.3f}"
            )

        # Check stopping conditions
        if signals:
            print(f"  🔔 {len(signals)} signal(s) detected")
            for signal in signals:
                if signal.payload.get("action") == "stop":
                    stopped_early = True
                    stop_reason = f"Day {day}: {signal.tag}"
                    print(f"  🛑 Early stop triggered: {signal.tag}")

        if not stopped_early:
            print(f"  ✅ Continue to next look")

print(f"\n🏁 === Experiment Complete ===")
if stopped_early:
    print(f"Stopped early: {stop_reason}")
else:
    print(f"Completed full {experiment_days}-day period")

print(f"Final sample size: {daily_results[-1]['total_n']}")
print(f"Total looks executed: {look_counter}")

## Analysis: Ledger Inspection and Multiple Statistics Issue

Let's examine the ledger to see how multiple statistics in the same namespace create ambiguity:

In [None]:
# Inspect the ledger structure
ledger = runner.ledger
reporter = LedgerReporter(ledger.frame())

print("=== Ledger Summary ===")
display(reporter.counts())

print("\n=== Multiple Statistics Problem Demo ===")
print("Current approach using tags to distinguish statistics:")

# Show the ambiguity problem: every statistic lands in the same namespace
stats_events = (
    ledger.frame()
    .filter(pl.col("namespace") == "stats")
    .select(["time_index", "tag", "payload_type"])
    .sort("time_index")
)

print(f"\nStatistics events in 'stats' namespace:")
display(stats_events)

print("\n❌ Problem: ledger.latest(namespace='stats') is ambiguous!")
print("   It could return CTR WaldZ, LoadTime Safe, or Bounce Safe statistic")
print("\n✅ Current workaround: ledger.latest(namespace='stats', tag='stat:ctr_waldz')")
print("   But this makes 'tag' feel mandatory rather than optional")

print("\n💡 Proposed solutions:")
print("   1. Add 'metric_id' field: 'ctr', 'loadtime', 'bounce'")
print("   2. Hierarchical 'statistic_type': 'stats.ctr.waldz', 'stats.loadtime.safe'")
print("   3. Enhanced 'tag' system with required structure")
print("   4. Separate namespaces: 'stats:ctr', 'stats:loadtime', 'stats:bounce'")

## Visualization: Multi-Metric Dashboard

Create a dashboard showing sample-size progression, CTR by arm, the CTR effect size, and planned versus adapted information fractions:

In [None]:
# Create a comprehensive dashboard
fig, axes = plt.subplots(2, 2, figsize=(12, 8))
fig.suptitle(
    f"Multi-Metric A/B Test Dashboard: {exp_id}", fontsize=14, fontweight="bold"
)

# Extract daily progression
days = [r["day"] for r in daily_results]
sample_sizes = [r["total_n"] for r in daily_results]
ctr_a_rates = [r["ctr_a_rate"] for r in daily_results]
ctr_b_rates = [r["ctr_b_rate"] for r in daily_results]

# Plot 1: Sample size progression
axes[0, 0].plot(days, sample_sizes, "o-", color="blue", alpha=0.7)
axes[0, 0].axhline(
    y=experiment.target_n_per_arm * 2,
    color="red",
    linestyle="--",
    alpha=0.5,
    label="Target",
)
axes[0, 0].set_title("Sample Size Over Time")
axes[0, 0].set_xlabel("Day")
axes[0, 0].set_ylabel("Total Sample Size")
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)

# Plot 2: CTR progression
axes[0, 1].plot(days, ctr_a_rates, "o-", label="A (Baseline)", color="orange")
axes[0, 1].plot(days, ctr_b_rates, "o-", label="B (Variant)", color="green")
axes[0, 1].set_title("Click-Through Rate Progression")
axes[0, 1].set_xlabel("Day")
axes[0, 1].set_ylabel("CTR")
axes[0, 1].legend()
axes[0, 1].grid(True, alpha=0.3)

# Plot 3: Effect size progression
ctr_effects = [b - a for a, b in zip(ctr_a_rates, ctr_b_rates)]
axes[1, 0].plot(days, ctr_effects, "o-", color="purple", alpha=0.8)
axes[1, 0].axhline(y=0, color="black", linestyle="-", alpha=0.3)
axes[1, 0].set_title("CTR Effect Size (B - A)")
axes[1, 0].set_xlabel("Day")
axes[1, 0].set_ylabel("Effect Size")
axes[1, 0].grid(True, alpha=0.3)

# Plot 4: Information time adaptation
planned_info_times = experiment.adaptive_info_time.planned_fractions
actual_looks = min(look_counter, len(planned_info_times))
if actual_looks > 0:
    look_indices = list(range(1, actual_looks + 1))
    axes[1, 1].plot(
        look_indices,
        planned_info_times[:actual_looks],
        "o--",
        label="Planned",
        color="blue",
        alpha=0.7,
    )

    # For demo, show some adaptive adjustments
    adapted_fractions = []
    for i in range(actual_looks):
        day_idx = min(
            i * 3 + 2, len(sample_sizes) - 1
        )  # Approximate sample size at look
        adapted_t = experiment.adaptive_info_time.get_info_fraction(
            i + 1, sample_sizes[day_idx] // 2
        )
        adapted_fractions.append(adapted_t)

    axes[1, 1].plot(
        look_indices, adapted_fractions, "o-", label="Adapted", color="red", alpha=0.8
    )

    axes[1, 1].set_title("Information Time: Planned vs Adapted")
    axes[1, 1].set_xlabel("Look Number")
    axes[1, 1].set_ylabel("Information Fraction")
    axes[1, 1].legend()
    axes[1, 1].grid(True, alpha=0.3)
else:
    axes[1, 1].text(
        0.5,
        0.5,
        "No looks completed",
        ha="center",
        va="center",
        transform=axes[1, 1].transAxes,
    )
    axes[1, 1].set_title("Information Time Adaptation")

plt.tight_layout()
plt.show()

# Summary statistics
print(f"\n📈 === Final Results Summary ===")
final_data = cumulative_data
for metric_name, data in final_data.items():
    if data["a_total"] > 0 and data["b_total"] > 0:
        a_rate = data["a_success"] / data["a_total"]
        b_rate = data["b_success"] / data["b_total"]
        effect = b_rate - a_rate
        rel_effect = (effect / a_rate) * 100 if a_rate > 0 else 0
        print(
            f"{metric_name.upper():>10s}: A={a_rate:.4f}, B={b_rate:.4f}, "
            f"Effect={effect:+.4f} ({rel_effect:+.1f}%)"
        )

print(
    f"\nTotal sample size: {final_data['ctr']['a_total'] + final_data['ctr']['b_total']}"
)
print(f"Experiment duration: {experiment_days} days")
print(f"Analysis looks: {look_counter}")

## Key Insights and Design Recommendations

### 1. Multiple Statistics Namespace Problem

**Issue**: When multiple statistics exist in the same experiment (CTR WaldZ, LoadTime Safe Test, Bounce Safe Test), they all use `namespace='stats'`, making `ledger.latest(namespace='stats')` ambiguous.

**Current Workaround**: Use `tag` parameter: `ledger.latest(namespace='stats', tag='stat:ctr_waldz')`

**Proposed Solutions**:

1. **Add `metric_id` field**: 
   ```python
   ledger.latest(namespace='stats', metric_id='ctr')
   ```

2. **Hierarchical `statistic_type`**:
   ```python
   statistic_type='stats.ctr.waldz'
   statistic_type='stats.loadtime.safe_test'
   ```

3. **Structured namespace**:
   ```python
   namespace='stats:ctr', namespace='stats:loadtime'
   ```

### 2. Information Time Adaptation Benefits

- **Flexibility**: Accounts for real-world sample size variability
- **Wall-clock alignment**: Decisions based on calendar time, not sample milestones
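- **Maintains Type I error**: With an alpha-spending function, boundaries are recomputed at the observed information fractions, so overall Type I error stays controlled as long as the timing of looks does not depend on the observed effect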
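
As a rough illustration of the adaptation step, the sketch below derives an information fraction from the observed per-arm sample size rather than from the look number. The helpers `planned_fractions` and `adapted_fraction` are hypothetical and assume information is proportional to sample size; the actual rule inside `AdaptiveInfoTime` is not shown in this tutorial and may differ.

```python
from typing import List


def planned_fractions(looks: int) -> List[float]:
    """Equally spaced planned information fractions: 1/K, 2/K, ..., 1."""
    return [k / looks for k in range(1, looks + 1)]


def adapted_fraction(observed_n: int, target_n: int) -> float:
    """Information fraction from the observed per-arm sample size, capped at 1.

    Assumes information is proportional to sample size.
    """
    if target_n <= 0:
        raise ValueError("target_n must be positive")
    return min(1.0, observed_n / target_n)


planned = planned_fractions(looks=5)
# Same numbers as the earlier demo: look 3 of 5, target 1000, but 800 observed.
adapted = adapted_fraction(observed_n=800, target_n=1000)
print(f"look 3 planned: {planned[2]:.3f}, adapted: {adapted:.3f}")
```

Because traffic ran ahead of plan, the look sits later on the information scale than scheduled, and the spending function would release correspondingly more alpha at that look.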

### 3. Multi-Testing Framework

- **Primary endpoint**: Full α=0.05 for business-critical CTR improvement
- **Guardrails**: Bonferroni allocation of α=0.05 across the two guardrails (α=0.025 each)
- **Independent monitoring**: Safe testing for continuous guardrail surveillance
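
To show how a per-guardrail α turns into a continuous-monitoring rule, here is a deliberately simplified sketch: a one-sample likelihood-ratio e-process for a Bernoulli rate, rejecting H0: p ≤ p0 once the e-value reaches 1/α. This is an assumption-laden illustration, not EarlySign's safe test (which compares two arms); the function name, the rates `p0` and `p1`, and the outcome sequence are all made up.

```python
from typing import Iterable, List


def bernoulli_e_process(outcomes: Iterable[int], p0: float, p1: float) -> List[float]:
    """Running likelihood-ratio e-values for H0: p <= p0 against p = p1 > p0.

    Each factor has expectation at most 1 under any p <= p0, so the running
    product can be checked after every observation.
    """
    if not 0.0 < p0 < p1 < 1.0:
        raise ValueError("require 0 < p0 < p1 < 1")
    e_value = 1.0
    path = []
    for x in outcomes:
        e_value *= (p1 / p0) if x else ((1.0 - p1) / (1.0 - p0))
        path.append(e_value)
    return path


# Bonferroni split of the guardrail budget across two guardrails.
total_guardrail_alpha = 0.05
guardrail_names = ["loadtime", "bounce"]
alpha_each = total_guardrail_alpha / len(guardrail_names)
threshold = 1.0 / alpha_each  # reject when the e-value reaches this level

# Made-up stream of bounce indicators (1 = bounced).
outcomes = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]
path = bernoulli_e_process(outcomes, p0=0.4, p1=0.6)
crossed_at = next((i for i, e in enumerate(path, start=1) if e >= threshold), None)

print(f"alpha per guardrail: {alpha_each}, rejection threshold: {threshold:.0f}")
print(f"final e-value: {path[-1]:.2f}, crossed at observation: {crossed_at}")
```

Splitting α across guardrails raises each rejection threshold (here from 20 to 40), which is the price of monitoring two safety metrics at once.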

### 4. Operational Advantages

- **Audit trail**: Complete event history for regulatory compliance
- **Component separation**: Clear responsibility boundaries
- **Extensibility**: Easy to add new metrics or modify criteria
- **Real-time monitoring**: Daily data ingestion with scheduled analysis