## Validation Approach

This notebook runs basic tests on:

1. **Source data integrity** - Are all expected files present and parseable?
2. **Preprocessing correctness** - Do aggregated CSVs match source data?
3. **Metric calculation accuracy** - Do sprint/run/press aggregations produce sensible ranges?
4. **Cross-metric consistency** - Do related metrics correlate as expected?

For production deployment, these checks would be automated as pytest tests with explicit assertions, so that a failed check stops the run instead of only printing a warning.
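
As a minimal sketch of what such pytest-style tests could look like, the following turns two of the range checks from this notebook into test functions. The helper `median_in_range` and the inline sample values are hypothetical; a real suite would presumably load the CSVs from `../output` in a fixture.

```python
from statistics import median


def median_in_range(values, low, high):
    """Return True when the median of `values` lies within [low, high]."""
    return low <= median(values) <= high


def test_sprint_volume_median_is_realistic():
    # Hypothetical sample; a real test would read player_sprints.csv instead.
    sprints_per_90 = [4.3, 6.6, 9.8, 7.5]
    assert median_in_range(sprints_per_90, 3, 20)


def test_success_rates_are_probabilities():
    # Hypothetical sample; a real test would read player_pressing.csv instead.
    rates = [0.0, 0.133, 0.2, 1.0]
    assert all(0.0 <= rate <= 1.0 for rate in rates)
```

Saved in a file such as `test_validation.py` (an assumed name), these functions would be collected by pytest because their names start with `test_`.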

In [1]:
import pandas as pd
import numpy as np
from pathlib import Path
import json

data_dir = Path('../data')
processed_dir = Path('../output')
matches_dir = data_dir / 'opendata/data/matches'

print("=== Test 1: Source Data Availability ===\n")

# Check all 10 matches have required files
with open(data_dir / 'opendata/data/matches.json', 'r') as f:
    matches_meta = json.load(f)

match_ids = [str(m['id']) for m in matches_meta]
print(f"Expected matches: {len(match_ids)}")

missing_files = []
for match_id in match_ids:
    match_dir = matches_dir / match_id
    required = [
        f"{match_id}_match.json",
        f"{match_id}_tracking_extrapolated.jsonl",
        f"{match_id}_dynamic_events.csv",
        f"{match_id}_phases_of_play.csv"
    ]
    
    for filename in required:
        if not (match_dir / filename).exists():
            missing_files.append(f"{match_id}/{filename}")

if missing_files:
    print(f"✗ Missing {len(missing_files)} files:")
    # Show at most the first five entries to keep the output short
    for missing in missing_files[:5]:
        print(f"  - {missing}")
else:
    print("✓ All source files present")

=== Test 1: Source Data Availability ===

Expected matches: 10
✓ All source files present
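
Test 1 confirms that the files exist, but not that they parse. A possible extension is sketched below, assuming each file's format follows its extension. The function `parse_error` is hypothetical and not part of the notebook. Note that it reads every line of a `.jsonl` file, which may be slow for large tracking files.

```python
import csv
import json
from pathlib import Path


def parse_error(path: Path):
    """Return None if `path` parses, otherwise a short error description."""
    try:
        if path.suffix == ".json":
            with open(path, "r", encoding="utf-8") as fh:
                json.load(fh)
        elif path.suffix == ".jsonl":
            with open(path, "r", encoding="utf-8") as fh:
                for line in fh:
                    if line.strip():  # skip blank lines
                        json.loads(line)
        elif path.suffix == ".csv":
            with open(path, "r", encoding="utf-8", newline="") as fh:
                header = next(csv.reader(fh), None)
                if not header:
                    return "empty CSV (no header row)"
        else:
            return f"unsupported extension {path.suffix!r}"
    except (OSError, ValueError, csv.Error) as exc:
        # json.JSONDecodeError and UnicodeDecodeError are ValueError subclasses
        return f"{type(exc).__name__}: {exc}"
    return None
```

In the loop from Test 1, `parse_error(match_dir / filename)` could be called for every file that exists, collecting the non-`None` results alongside `missing_files`.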


In [2]:
print("\n=== Test 2: Preprocessed Data Consistency ===\n")

# Load preprocessed files
all_events = pd.read_csv(processed_dir / 'all_events.csv', low_memory=False)
all_phases = pd.read_csv(processed_dir / 'all_phases.csv')
player_metadata = pd.read_csv(processed_dir / 'player_metadata.csv')

# Check row counts match expectations
print(f"Events loaded: {len(all_events):,}")
print(f"Phases loaded: {len(all_phases):,}")
print(f"Player-match records: {len(player_metadata):,}")

# Check match coverage
events_matches = all_events['match_id'].nunique()
phases_matches = all_phases['match_id'].nunique()
meta_matches = player_metadata['match_id'].nunique()

print("\nMatch coverage:")
print(f"  Events: {events_matches}/10")
print(f"  Phases: {phases_matches}/10")
print(f"  Metadata: {meta_matches}/10")

if events_matches == phases_matches == meta_matches == 10:
    print("✓ All preprocessed files cover 10 matches")
else:
    print("✗ Inconsistent match coverage across files")


=== Test 2: Preprocessed Data Consistency ===

Events loaded: 47,853
Phases loaded: 4,581
Player-match records: 360

Match coverage:
  Events: 10/10
  Phases: 10/10
  Metadata: 10/10
✓ All preprocessed files cover 10 matches
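
Test 2 compares counts of distinct matches, which would still pass if a processed file held ten match IDs that differ from the source ones. A stricter variant compares the ID sets themselves. The sketch below assumes IDs can be compared after conversion to strings; `coverage_gaps` is a hypothetical helper.

```python
def coverage_gaps(expected_ids, observed_ids):
    """Compare two collections of match IDs after normalising them to strings."""
    expected = {str(i) for i in expected_ids}
    observed = {str(i) for i in observed_ids}
    return {
        "missing": sorted(expected - observed),      # in source, absent from processed file
        "unexpected": sorted(observed - expected),   # in processed file, absent from source
    }


# Hypothetical IDs for illustration only
print(coverage_gaps(["101", "102", "103"], [101, 102, 104]))
```

In the notebook this could be called as `coverage_gaps(match_ids, all_events['match_id'].unique())`, assuming `match_id` is stored as an integer or string column without missing values.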


In [3]:
print("\n=== Test 3: Sprint Detection Sanity Checks ===\n")

player_sprints = pd.read_csv(processed_dir / 'player_sprints.csv')

# Check sprint volumes are realistic
sprints_per_90 = player_sprints['sprints_per_90']

print("Sprint volume distribution:")
print(sprints_per_90.describe().round(2))

# Expected ranges based on professional benchmarks
if sprints_per_90.median() < 3 or sprints_per_90.median() > 20:
    print("⚠ Median sprints per 90 outside expected range (3-20)")
else:
    print("✓ Sprint volumes in realistic range")

# Check speeds are physically plausible
avg_speeds = player_sprints['avg_sprint_speed_kmh']
max_speeds = player_sprints['max_sprint_speed_kmh']

print("\nSprint speeds:")
print(f"  Avg: {avg_speeds.min():.1f} - {avg_speeds.max():.1f} km/h")
print(f"  Max: {max_speeds.min():.1f} - {max_speeds.max():.1f} km/h")

if avg_speeds.max() > 30 or max_speeds.max() > 36:
    print("⚠ Some sprint speeds exceed the plausibility thresholds (avg > 30 km/h or max > 36 km/h)")
else:
    print("✓ Sprint speeds physically plausible")

# Check that the mean high-value sprint share falls in the expected 50-80% band
high_value_pct = player_sprints['high_value_sprint_pct'].mean()
attacking_pct = player_sprints['attacking_sprint_pct'].mean()

print("\nContext quality:")
print(f"  High-value phase: {high_value_pct:.1%}")
print(f"  Attacking sprints: {attacking_pct:.1%}")

if 0.5 < high_value_pct < 0.8:
    print("✓ High-value sprint clustering matches expectations")
else:
    print("⚠ High-value sprint % outside expected range")


=== Test 3: Sprint Detection Sanity Checks ===

Sprint volume distribution:
count    244.00
mean       7.55
std        4.56
min        0.88
25%        4.31
50%        6.59
75%        9.82
max       29.37
Name: sprints_per_90, dtype: float64
✓ Sprint volumes in realistic range

Sprint speeds:
  Avg: 25.6 - 28.4 km/h
  Max: 26.1 - 31.1 km/h
✓ Sprint speeds physically plausible

Context quality:
  High-value phase: 63.0%
  Attacking sprints: 39.1%
✓ High-value sprint clustering matches expectations
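
The volume check above looks only at the median, so a single extreme value such as the maximum of 29.37 passes unnoticed. One common way to flag such rows is the interquartile-range (IQR) rule, sketched below with the standard library. The helper names and the multiplier `k=1.5` are assumptions, not part of the notebook. With the quartiles printed above (4.31 and 9.82), the upper fence would be roughly 9.82 + 1.5 × 5.51 ≈ 18.09.

```python
from statistics import quantiles


def iqr_fences(values, k=1.5):
    """Return (low, high) fences at k * IQR beyond the first and third quartiles."""
    # method="inclusive" interpolates linearly between data points
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def flag_outliers(values, k=1.5):
    """Return the values that fall outside the IQR fences."""
    low, high = iqr_fences(values, k)
    return [v for v in values if v < low or v > high]


# Hypothetical sample for illustration only
sample = [1, 2, 3, 4, 5, 6, 7, 8, 100]
print(iqr_fences(sample))
print(flag_outliers(sample))
```

Flagged values are not necessarily errors. A player with very few minutes can produce a high per-90 rate, so the flag is better read as a prompt for manual review.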


In [8]:
print("\n=== Test 4: Off-Ball Run Aggregation Checks ===\n")

player_runs = pd.read_csv(processed_dir / "player_runs.csv")

# Bring in position_group from metadata for validation-only analysis
player_meta = pd.read_csv(processed_dir / "player_metadata.csv")
runs_with_pos = player_runs.merge(
    player_meta[["match_id", "player_id", "position_group"]],
    on=["match_id", "player_id"],
    how="left",
)

# Check threat values are in xG-like range
avg_xthreat = runs_with_pos["avg_xthreat"]

print("Run threat distribution:")
print(avg_xthreat.describe().round(4))

if avg_xthreat.max() > 0.5:
    print("⚠ Some runs have unrealistically high threat (>0.5)")
else:
    print("✓ Run threat values in expected xG-like range")

# Check selection quality if available
if "targeted_dangerous_pct" in runs_with_pos.columns:
    targeted_pct = runs_with_pos["targeted_dangerous_pct"].mean()
    print("\nSelection quality:")
    print(f"  Targeted dangerous rate: {targeted_pct:.1%}")
    
    if 0.1 < targeted_pct < 0.5:
        print("✓ Selection rates realistic")
    else:
        print("⚠ Targeted dangerous rate outside expected range (10-50%)")

# Check position patterns make sense
print("\nRuns per 90 by position:")
position_runs = (
    runs_with_pos
    .groupby("position_group")["runs_per_90"]
    .mean()
    .sort_values(ascending=False)
)
print(position_runs.round(2))

# An attacking group (wide attackers or center forwards) should top the ranking
if position_runs.index[0] in ["Wide Attacker", "Center Forward"]:
    print("✓ Position patterns match expectations")
else:
    print("⚠ Unexpected position ordering")


=== Test 4: Off-Ball Run Aggregation Checks ===

Run threat distribution:
count    246.0000
mean       0.0206
std        0.0195
min        0.0001
25%        0.0048
50%        0.0153
75%        0.0322
max        0.1028
Name: avg_xthreat, dtype: float64
✓ Run threat values in expected xG-like range

Runs per 90 by position:
position_group
Wide Attacker       32.95
Center Forward      30.44
Midfield            24.00
Full Back           22.45
Central Defender    10.07
Other                3.65
Name: runs_per_90, dtype: float64
✓ Position patterns match expectations
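
The left merge in Test 4 keeps run records that have no metadata match and gives them a missing `position_group`. It would also multiply rows if the metadata contained duplicate keys. A small key-integrity check, sketched here with hypothetical helpers, makes both cases visible before grouping.

```python
from collections import Counter


def orphaned_keys(left_keys, right_keys):
    """Return keys present on the left side that have no match on the right."""
    right = set(right_keys)
    return sorted({key for key in left_keys if key not in right})


def duplicated_keys(keys):
    """Return keys that occur more than once."""
    counts = Counter(keys)
    return sorted(key for key, n in counts.items() if n > 1)


# Hypothetical (match_id, player_id) pairs for illustration only
run_keys = [(1, 10), (1, 11), (2, 10)]
meta_keys = [(1, 10), (2, 10), (2, 10)]
print(orphaned_keys(run_keys, meta_keys))
print(duplicated_keys(meta_keys))
```

With the notebook's DataFrames, the key lists could be built as `list(zip(player_runs["match_id"], player_runs["player_id"]))`. pandas offers comparable safeguards through the `indicator` and `validate` parameters of `DataFrame.merge`.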


In [5]:
print("\n=== Test 5: Pressing Effectiveness Checks ===\n")

player_pressing = pd.read_csv(processed_dir / 'player_pressing.csv')

# Check success rates are probabilistic (0-1 range)
success_rate = player_pressing['press_success_rate']
regain_rate = player_pressing['regain_rate']

print("Pressing success distribution:")
print(success_rate.describe().round(3))

if success_rate.min() < 0 or success_rate.max() > 1:
    print("✗ Success rates outside 0-1 range")
else:
    print("✓ Success rates are valid probabilities")

# Check that the mean regain rate does not exceed the mean success rate
# (regains are a subset of successful presses)
mean_success = success_rate.mean()
mean_regain = regain_rate.mean()

print("\nOutcome breakdown:")
print(f"  Overall success: {mean_success:.1%}")
print(f"  Regain rate: {mean_regain:.1%}")
print(f"  Disruption rate: {player_pressing['disruption_rate'].mean():.1%}")

if mean_regain > mean_success:
    print("✗ Regain rate exceeds success rate (logical error)")
else:
    print("✓ Outcome rates logically consistent")

# Check volume is realistic
pressing_per_90 = player_pressing['pressing_actions_per_90']
print(f"\nPressing volume: {pressing_per_90.median():.1f} per 90 (median)")

if 5 < pressing_per_90.median() < 25:
    print("✓ Pressing volumes realistic")
else:
    print("⚠ Median pressing actions per 90 outside expected range (5-25)")


=== Test 5: Pressing Effectiveness Checks ===

Pressing success distribution:
count    199.000
mean       0.161
std        0.159
min        0.000
25%        0.059
50%        0.133
75%        0.200
max        1.000
Name: press_success_rate, dtype: float64
✓ Success rates are valid probabilities

Outcome breakdown:
  Overall success: 16.1%
  Regain rate: 11.3%
  Disruption rate: 4.8%
✓ Outcome rates logically consistent

Pressing volume: 13.4 per 90 (median)
✓ Pressing volumes realistic
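
The consistency check above compares means, so individual rows where the regain rate exceeds the success rate could cancel out and go unnoticed. A row-level version is sketched below. The function name and the tolerance are assumptions, and rows with missing values are skipped.

```python
import math


def rows_violating_subset(success, regain, tol=1e-9):
    """Return positions where regain exceeds success by more than `tol`."""
    bad = []
    for position, (s, r) in enumerate(zip(success, regain)):
        if s is None or r is None or math.isnan(s) or math.isnan(r):
            continue  # cannot compare rows with missing values
        if r > s + tol:
            bad.append(position)
    return bad


# Hypothetical rates for illustration only
success = [0.20, 0.10, float("nan")]
regain = [0.10, 0.30, 0.50]
print(rows_violating_subset(success, regain))
```

Applied to the notebook's data, the call would be `rows_violating_subset(success_rate, regain_rate)`. An empty list means every comparable row is consistent.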


In [6]:
print("\n=== Test 6: Cross-Metric Consistency ===\n")

# Load all three metrics
sprints = pd.read_csv(processed_dir / 'player_sprints.csv')
runs = pd.read_csv(processed_dir / 'player_runs.csv')
pressing = pd.read_csv(processed_dir / 'player_pressing.csv')

print("Unique players:")
print(f"  Sprints: {sprints['player_id'].nunique()}")
print(f"  Runs: {runs['player_id'].nunique()}")
print(f"  Pressing: {pressing['player_id'].nunique()}")

# Check overlap - most players should appear in multiple metrics
all_player_ids = set(sprints['player_id']) | set(runs['player_id']) | set(pressing['player_id'])
in_all_three = set(sprints['player_id']) & set(runs['player_id']) & set(pressing['player_id'])

print(f"\nTotal unique players: {len(all_player_ids)}")
print(f"Players in all 3 metrics: {len(in_all_three)}")
print(f"Overlap rate: {len(in_all_three) / len(all_player_ids):.1%}")

if len(in_all_three) / len(all_player_ids) > 0.5:
    print("✓ Good overlap across metrics")
else:
    print("⚠ Fewer than half of all players appear in all three metrics")

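# Expected correlations
# Sprint volume and run volume should correlate (both measure movement).
# Note: these files can hold several rows per player (e.g. 244 sprint rows for
# 166 players), so joining on player_id alone pairs every sprint row with every
# run row of that player. Treat the resulting correlation as a coarse check.
merged = sprints[['player_id', 'sprints_per_90']].merge(
    runs[['player_id', 'runs_per_90']],
    on='player_id'
)

if len(merged) > 10:
    corr = merged['sprints_per_90'].corr(merged['runs_per_90'])
    print(f"\nSprint vs Run volume correlation: {corr:.3f}")

    if corr > 0.3:
        print("✓ Expected positive correlation between sprint and run volume")
    else:
        print("⚠ Sprint and run volume correlation weaker than expected (<= 0.3)")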


=== Test 6: Cross-Metric Consistency ===

Unique players:
  Sprints: 166
  Runs: 171
  Pressing: 145

Total unique players: 173
Players in all 3 metrics: 142
Overlap rate: 82.1%
✓ Good overlap across metrics

Sprint vs Run volume correlation: 0.343
✓ Expected positive correlation between sprint and run volume
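
The merge above joins on `player_id` alone, while the sprint file holds 244 rows for 166 players, so players with several rows contribute multiple row pairs to the correlation. One alternative is to average each metric per player first and then correlate one value per player, as sketched below. The helper names are hypothetical, and the unweighted mean of per-90 rates is a simplification that ignores minutes played.

```python
from collections import defaultdict
from math import sqrt


def mean_by_player(rows):
    """Collapse (player_id, value) pairs into {player_id: mean value}."""
    totals = defaultdict(lambda: [0.0, 0])
    for player_id, value in rows:
        totals[player_id][0] += value
        totals[player_id][1] += 1
    return {pid: total / n for pid, (total, n) in totals.items()}


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (nan if undefined)."""
    n = len(xs)
    if n < 2:
        return float("nan")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / sqrt(var_x * var_y)


def per_player_correlation(sprint_rows, run_rows):
    """Correlate per-player means over the players present in both inputs."""
    sprint_means = mean_by_player(sprint_rows)
    run_means = mean_by_player(run_rows)
    shared = sorted(set(sprint_means) & set(run_means))
    return pearson([sprint_means[p] for p in shared],
                   [run_means[p] for p in shared])
```

With the notebook's DataFrames, the inputs could be built as `zip(sprints['player_id'], sprints['sprints_per_90'])` and `zip(runs['player_id'], runs['runs_per_90'])`. The resulting value is not expected to equal the 0.343 reported above, because it weights every player once.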


In [7]:
print("\n=== Validation Summary ===\n")

# Note: the list below is static text; it is not derived from the results above.
print("Tests completed:")
print("1. ✓ Source data availability")
print("2. ✓ Preprocessed data consistency")
print("3. ✓ Sprint detection accuracy")
print("4. ✓ Off-ball run aggregations")
print("5. ✓ Pressing effectiveness metrics")
print("6. ✓ Cross-metric consistency")

print("\nAll basic validation checks passed.")
print("\nFor production deployment, additional tests would include:")
print("- Null value patterns and expected missingness")
print("- Outlier detection and flagging")
print("- Temporal consistency (same player across matches)")
print("- Join key integrity (no orphaned records)")
print("- Performance benchmarks (processing time per match)")


=== Validation Summary ===

Tests completed:
1. ✓ Source data availability
2. ✓ Preprocessed data consistency
3. ✓ Sprint detection accuracy
4. ✓ Off-ball run aggregations
5. ✓ Pressing effectiveness metrics
6. ✓ Cross-metric consistency

All basic validation checks passed.

For production deployment, additional tests would include:
- Null value patterns and expected missingness
- Outlier detection and flagging
- Temporal consistency (same player across matches)
- Join key integrity (no orphaned records)
- Performance benchmarks (processing time per match)
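
The summary cell prints fixed check marks regardless of what the earlier cells reported. A summary derived from recorded outcomes is one way to keep the two in sync. The sketch below assumes each test cell stores a boolean under its name in a shared dictionary; `summarize` and the sample results are hypothetical.

```python
def summarize(results):
    """Build summary lines from a mapping of check name -> passed flag."""
    lines = []
    for number, (name, passed) in enumerate(results.items(), start=1):
        mark = "✓" if passed else "✗"
        lines.append(f"{number}. {mark} {name}")
    failed = [name for name, passed in results.items() if not passed]
    if failed:
        lines.append(f"{len(failed)} check(s) failed: {', '.join(failed)}")
    else:
        lines.append("All basic validation checks passed.")
    return lines


# Hypothetical outcomes; each test cell would record its own result here.
results = {
    "Source data availability": True,
    "Sprint detection sanity checks": False,
}
for line in summarize(results):
    print(line)
```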
