# Test Health & Diagnostics

Before trusting your A/B test results, check for three common issues:
- **Sample Ratio Mismatch (SRM)**: are visitors split between control and variant in the expected proportion?
- **Test health**: is the test adequately powered, and has it run long enough?
- **Novelty/primacy effects**: is the effect fading or growing over time?

In [1]:
from pyexpstats import diagnostics

## Sample Ratio Mismatch Detection

In [2]:
# Healthy test: roughly 50/50 split
srm_ok = diagnostics.check_sample_ratio(
    control_visitors=5023,
    variant_visitors=4977,
    expected_ratio=0.5,
)

print(f"Valid split: {srm_ok.is_valid}")
print(f"Observed ratio: {srm_ok.observed_ratio:.4f}")
print(f"Deviation: {srm_ok.deviation_percent:.2f}%")
print(f"Severity: {srm_ok.severity}")
print(f"P-value: {srm_ok.p_value:.4f}")

Valid split: True
Observed ratio: 0.5023
Deviation: 0.46%
Severity: ok
P-value: 0.6455


In [3]:
# Problem: significant imbalance suggesting a bug
srm_bad = diagnostics.check_sample_ratio(
    control_visitors=5500,
    variant_visitors=4500,
    expected_ratio=0.5,
)

print(f"Valid split: {srm_bad.is_valid}")
print(f"Observed ratio: {srm_bad.observed_ratio:.4f}")
print(f"Deviation: {srm_bad.deviation_percent:.2f}%")
print(f"Severity: {srm_bad.severity}")
print(f"P-value: {srm_bad.p_value:.6f}")
if srm_bad.warning:
    print(f"\nWarning: {srm_bad.warning}")

Valid split: False
Observed ratio: 0.5500
Deviation: 10.00%
Severity: critical
P-value: 0.000000



In [4]:
# Custom split ratio (e.g., 70/30 test)
srm_custom = diagnostics.check_sample_ratio(
    control_visitors=7050,
    variant_visitors=2950,
    expected_ratio=0.7,
)

print(f"Valid 70/30 split: {srm_custom.is_valid}")
print(f"Observed ratio: {srm_custom.observed_ratio:.4f}")
print(f"Severity: {srm_custom.severity}")

Valid 70/30 split: True
Observed ratio: 0.7050
Severity: ok
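
To make the p-values above less opaque, here is a minimal sketch of one standard way to test for SRM: a chi-square goodness-of-fit test with one degree of freedom. The helper name `srm_p_value` is hypothetical, and nothing here confirms that `pyexpstats` computes its p-value this way; it only illustrates the idea using the standard library.

```python
import math


def srm_p_value(control_visitors: int, variant_visitors: int,
                expected_ratio: float = 0.5) -> float:
    """Chi-square goodness-of-fit p-value for a two-arm traffic split.

    expected_ratio is the share of traffic expected in the control arm.
    This is an illustrative helper, not part of pyexpstats.
    """
    total = control_visitors + variant_visitors
    expected_control = total * expected_ratio
    expected_variant = total * (1 - expected_ratio)
    chi2 = ((control_visitors - expected_control) ** 2 / expected_control
            + (variant_visitors - expected_variant) ** 2 / expected_variant)
    # With 1 degree of freedom, the chi-square survival function
    # reduces to erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


# The three splits used in the cells above.
for control, variant, ratio in [(5023, 4977, 0.5),
                                (5500, 4500, 0.5),
                                (7050, 2950, 0.7)]:
    p = srm_p_value(control, variant, ratio)
    print(f"{control}/{variant} vs {ratio:.0%} expected: p = {p:.4g}")
```

For the 5023/4977 split this gives a p-value of about 0.6455, which agrees with the value printed by `check_sample_ratio()` above. The 5500/4500 split yields a p-value far below any usual threshold, and the 70/30 split is not significant.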


## Test Health Dashboard

In [5]:
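# Comprehensive health check for your test.
# Note: the duration appears to be counted from test_start_date to the day the
# cell is run, so the day count in the output depends on when you execute it.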
health = diagnostics.check_health(
    control_visitors=5000,
    control_conversions=250,
    variant_visitors=5000,
    variant_conversions=285,
    expected_visitors_per_variant=5000,
    test_start_date="2025-01-01",
    daily_traffic=5000,
    minimum_sample_per_variant=100,
    minimum_days=7,
)

print(f"Status: {health.overall_status}")
print(f"Score:  {health.score}/100")
print(f"Can trust results: {health.can_trust_results}")
print(f"\nHealth checks:")
for check in health.checks:
    symbol = {"pass": "PASS", "warning": "WARN", "fail": "FAIL"}[check.status]
    print(f"  [{symbol}] {check.name}: {check.message}")

Status: unhealthy
Score:  83/100
Can trust results: True

Health checks:
  [PASS] Sample Ratio: Traffic split is valid (50.0%/50.0%)
  [PASS] Minimum Sample: Sufficient sample size (5,000 >= 100 minimum)
  [PASS] Test Duration: Running for 406 days (>= 7 day minimum)
  [FAIL] Statistical Power: Power is very low (20%)
  [PASS] Peeking Risk: First analysis - no peeking penalty
  [PASS] Sample Progress: Reached planned sample size (100%)


In [6]:
# Print the full health summary
print(health.summary)

## Test Health Report

###  Overall: UNHEALTHY (Score: 83/100)

- **Total visitors:** 10,000
- **Test duration:** 406 days
- **Can trust results:** Yes

### Health Checks

 **Sample Ratio:** Traffic split is valid (50.0%/50.0%)
 **Minimum Sample:** Sufficient sample size (5,000 >= 100 minimum)
 **Test Duration:** Running for 406 days (>= 7 day minimum)
 **Statistical Power:** Power is very low (20%)
   _Test is unlikely to detect the minimum effect size._
 **Peeking Risk:** First analysis - no peeking penalty
 **Sample Progress:** Reached planned sample size (100%)

### Recommendation

TEST HAS CRITICAL ISSUES. Do not trust results until resolved.

Critical issues: Statistical power too low


In [7]:
# Underpowered test example
health_bad = diagnostics.check_health(
    control_visitors=200,
    control_conversions=10,
    variant_visitors=180,
    variant_conversions=12,
    expected_visitors_per_variant=5000,
    daily_traffic=100,
)

print(f"Status: {health_bad.overall_status}")
print(f"Score: {health_bad.score}/100")
print(f"Can trust: {health_bad.can_trust_results}")
if health_bad.primary_issues:
    print(f"\nIssues:")
    for issue in health_bad.primary_issues:
        print(f"  - {issue}")

Status: unhealthy
Score: 66/100
Can trust: True

Issues:
  - Statistical power too low
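
For readers unfamiliar with the Statistical Power check, the sketch below shows a textbook normal-approximation power calculation for a two-sided, two-proportion z-test. The function name and the choice of `alpha=0.05` are assumptions for illustration. It is not claimed to reproduce the power figure reported by `check_health()`, which may be evaluated against a different effect size (the report mentions a "minimum effect size").

```python
from statistics import NormalDist


def two_proportion_power(p_control: float, p_variant: float,
                         n_per_variant: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided two-proportion z-test.

    Uses the unpooled standard error and the normal approximation.
    Illustrative only; not the pyexpstats implementation.
    """
    norm = NormalDist()
    variance = (p_control * (1 - p_control)
                + p_variant * (1 - p_variant)) / n_per_variant
    se = variance ** 0.5
    z_crit = norm.inv_cdf(1 - alpha / 2)        # two-sided critical value
    z_effect = abs(p_variant - p_control) / se  # standardized true effect
    # Probability of landing in either rejection region.
    return norm.cdf(z_effect - z_crit) + norm.cdf(-z_effect - z_crit)


# Rates taken from the first health check above: 250/5000 vs 285/5000.
for n in (5000, 20000):
    power = two_proportion_power(0.05, 0.057, n)
    print(f"n per variant = {n}: power = {power:.2f}")
```

Under these assumptions, quadrupling the per-variant sample size raises the power substantially, which is the usual remedy when a test is flagged as underpowered.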


## Novelty & Primacy Effect Detection

In [8]:
# Build deterministic daily results showing a novelty effect (fading lift)
daily_data = []
for day in range(1, 22):
    control_v = 500
    variant_v = 500
    control_c = int(control_v * 0.050)
    # Novelty: high initial lift that fades over time
    variant_rate = 0.050 + 0.015 * max(0, 1 - day / 14)
    variant_c = int(variant_v * variant_rate)
    daily_data.append({
        "day": day,
        "control_visitors": control_v,
        "control_conversions": control_c,
        "variant_visitors": variant_v,
        "variant_conversions": variant_c,
    })

novelty = diagnostics.detect_novelty_effect(daily_data)

print(f"Effect detected: {novelty.effect_detected}")
print(f"Effect type: {novelty.effect_type}")
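# The lift values appear to be expressed in percent already, so format them
# with ".1f" and a literal "%" (":.1%" would multiply by 100 a second time).
print(f"Initial lift: {novelty.initial_lift:.1f}%")
print(f"Current lift: {novelty.current_lift:.1f}%")
print(f"Trend slope: {novelty.trend_slope:.4f}")

Effect detected: True
Effect type: novelty
Initial lift: 22.7%
Current lift: 0.0%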
Trend slope: -1.3455


In [9]:
# Stable test (no novelty/primacy effect)
stable_data = []
for day in range(1, 22):
    stable_data.append({
        "day": day,
        "control_visitors": 500,
        "control_conversions": 25,
        "variant_visitors": 500,
        "variant_conversions": 30,
    })

stable = diagnostics.detect_novelty_effect(stable_data)

print(f"Effect detected: {stable.effect_detected}")
print(f"Effect type: {stable.effect_type}")
print(f"Recommendation: {stable.recommendation[:200]}")

Effect detected: False
Effect type: stable
Recommendation: The effect (+20.0%) appears consistent over time. You can trust this as a reliable estimate of long-term impact.
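
To show what a trend slope can mean in practice, the sketch below rebuilds the fading-lift data from the novelty example and fits an ordinary least-squares line to the daily relative lift, expressed in percent. The helper names and the three-day window for the initial lift are assumptions chosen for illustration; this is not a description of how `detect_novelty_effect()` works internally.

```python
def daily_lift_percent(row: dict) -> float:
    """Relative lift of variant over control for one day, in percent."""
    control_rate = row["control_conversions"] / row["control_visitors"]
    variant_rate = row["variant_conversions"] / row["variant_visitors"]
    return (variant_rate / control_rate - 1) * 100


def ols_slope(xs: list, ys: list) -> float:
    """Slope of the least-squares line through the points (xs, ys)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Same construction as the novelty cell above: the lift fades to zero by day 14.
daily_data = []
for day in range(1, 22):
    variant_rate = 0.050 + 0.015 * max(0, 1 - day / 14)
    daily_data.append({
        "day": day,
        "control_visitors": 500,
        "control_conversions": int(500 * 0.050),
        "variant_visitors": 500,
        "variant_conversions": int(500 * variant_rate),
    })

days = [row["day"] for row in daily_data]
lifts = [daily_lift_percent(row) for row in daily_data]

slope = ols_slope(days, lifts)
initial_lift = sum(lifts[:3]) / 3  # assumed window: first three days

print(f"Slope of daily lift: {slope:.4f} percentage points per day")
print(f"Mean lift over first three days: {initial_lift:.1f}%")
```

On this data the fitted slope is about -1.3455 percentage points per day, which matches the trend slope printed above, and the mean lift over the first three days is about 22.7%. A clearly negative slope like this is the signature of a novelty effect; a slope near zero corresponds to the stable case.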


## Diagnostic Checklist

Before trusting your A/B test results, verify:

1. **No SRM**: `check_sample_ratio()` reports a valid split
2. **Adequate power**: `check_health()` reports a healthy status, with no failed Statistical Power check
3. **No novelty or primacy effect**: `detect_novelty_effect()` reports a stable trend
4. **Sufficient duration**: the test has run for at least 1-2 full business cycles (for example, whole weeks)

If any check fails, investigate the root cause before making decisions.
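
The checklist can be turned into a small gate function, sketched below. The attribute names `is_valid`, `overall_status`, and `effect_detected` come from the cells above; the function name, the `"healthy"` status label, the `days_running` argument, and the 14-day default are assumptions. The sketch checks `overall_status` rather than `can_trust_results`, because in the run above `can_trust_results` was `True` even though the report flagged a critical power issue. Stand-in objects replace real `pyexpstats` results so the example runs on its own.

```python
from types import SimpleNamespace


def diagnostic_failures(srm, health, novelty,
                        days_running: int, min_days: int = 14) -> list:
    """Return a list of failed checklist items (empty means all passed).

    Expects result objects shaped like those returned in the cells above.
    The "healthy" label and the 14-day default are assumptions.
    """
    failures = []
    if not srm.is_valid:
        failures.append("sample ratio mismatch")
    if health.overall_status != "healthy":
        failures.append(f"health status is {health.overall_status}")
    if novelty.effect_detected:
        failures.append(f"{novelty.effect_type} effect detected")
    if days_running < min_days:
        failures.append(f"ran {days_running} days, need {min_days}")
    return failures


# Stand-ins mimicking the results shown earlier in this notebook.
srm = SimpleNamespace(is_valid=True)
health = SimpleNamespace(overall_status="unhealthy")
novelty = SimpleNamespace(effect_detected=False, effect_type="stable")

problems = diagnostic_failures(srm, health, novelty, days_running=21)
print(problems or "All checks passed")
```

With real results, pass the objects returned by `check_sample_ratio()`, `check_health()`, and `detect_novelty_effect()` in place of the stand-ins.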