# Exploratory Data Analysis

Quick check of what's in the SkillCorner data before building the pipeline. The brief asks for 3-8 descriptive metrics across events and tracking, so I'm covering:

- **Event/phase metrics**: player counts, event distributions, phase durations, spatial patterns
- **Tracking metrics**: distance, speed, movement profiles

Goal is to validate the data quality and confirm the feeds are consistent before moving to analytical metrics.

In [1]:
import sys
sys.path.append('..')

import pandas as pd
import numpy as np
from pathlib import Path

from src.loaders import load_all_matches

pd.set_option('display.max_columns', 50)
pd.set_option('display.max_rows', 100)

print("✓ Imports successful")

✓ Imports successful


In [2]:
# Load all matches (takes ~20-30 seconds)
print("Loading all matches...")
all_data = load_all_matches()
print(f"Loaded {len(all_data)} matches")

INFO - Loading match 1/10: 2017461


Loading all matches...


INFO - Loading match 2/10: 2015213
INFO - Loading match 3/10: 2013725
INFO - Loading match 4/10: 2011166
INFO - Loading match 5/10: 2006229
INFO - Loading match 6/10: 1996435
INFO - Loading match 7/10: 1953632
INFO - Loading match 8/10: 1925299
INFO - Loading match 9/10: 1899585
INFO - Loading match 10/10: 1886347
INFO - Loaded 10/10 matches


Loaded 10 matches


## Metric 1: Player Counts

Quick sanity check: how many players are tracked per match. Players are the main aggregation unit, so if this is wrong, everything downstream is wrong.

Expected ~30-36 (starters + subs).
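
A minimal sketch of how that expectation could be turned into an automatic check instead of being eyeballed. The helper name `flag_unexpected_player_counts` and the toy frame are assumptions for illustration; only the 30-36 bounds come from the note above.

```python
import pandas as pd

def flag_unexpected_player_counts(player_df, low=30, high=36):
    """Return the rows whose total_players falls outside [low, high] (inclusive)."""
    out_of_range = ~player_df["total_players"].between(low, high)
    return player_df[out_of_range]

# Toy data for illustration only (not taken from the SkillCorner feed)
toy = pd.DataFrame({
    "match_id": ["a", "b", "c"],
    "total_players": [36, 22, 31],
})
flagged = flag_unexpected_player_counts(toy)
print(flagged)
```

An empty result would mean every match is inside the expected squad-size range.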

In [3]:
# Count unique players per match from metadata
player_counts = []
for match_id, data in all_data.items():
    metadata = data['metadata']
    n_players = len(metadata['players'])
    
    player_counts.append({
        'match_id': match_id,
        'home_team': metadata['home_team']['name'],
        'away_team': metadata['away_team']['name'],
        'total_players': n_players
    })

player_df = pd.DataFrame(player_counts)

print("Player counts per match:")
display(player_df)
print(f"\nAverage: {player_df['total_players'].mean():.1f} players per match")
print(f"Range: {player_df['total_players'].min()}-{player_df['total_players'].max()} players")
print("\nThis includes starters + substitutes, which explains 30-36 range")

Player counts per match:


Unnamed: 0,match_id,home_team,away_team,total_players
0,2017461,Melbourne Victory Football Club,Auckland FC,36
1,2015213,Western United,Auckland FC,36
2,2013725,Western United,Sydney Football Club,36
3,2011166,Wellington Phoenix FC,Melbourne Victory Football Club,36
4,2006229,Melbourne City FC,Macarthur FC,36
5,1996435,Sydney Football Club,Adelaide United Football Club,36
6,1953632,Central Coast Mariners Football Club,Melbourne City FC,36
7,1925299,Brisbane Roar FC,Perth Glory Football Club,36
8,1899585,Auckland FC,Wellington Phoenix FC,36
9,1886347,Auckland FC,Newcastle United Jets FC,36



Average: 36.0 players per match
Range: 36-36 players

This includes starters + substitutes, which explains 30-36 range


## Metric 2: Event Distribution

Combine all events and check the event type breakdown. This tells me how dense the feed is and which behaviours SkillCorner's models detect most often (passing options vs actual possessions vs off-ball runs).

Also checks that subtype fields are populated correctly.

In [4]:
# Combine all events across matches
all_events = pd.concat([data['events'] for data in all_data.values()], ignore_index=True)

print(f"Total events across all matches: {len(all_events):,}")
print(f"\nEvent type distribution:")
event_counts = all_events['event_type'].value_counts()
print(event_counts)
print(f"\nAverage events per match: {len(all_events)/len(all_data):.0f}")

Total events across all matches: 47,853

Event type distribution:
event_type
passing_option        24374
player_possession      9566
on_ball_engagement     8911
off_ball_run           5002
Name: count, dtype: int64

Average events per match: 4785


In [5]:
# Event subtype breakdown for the two types that have subtypes
print("Off-ball run subtypes:")
obr_subtypes = all_events[all_events['event_type'] == 'off_ball_run']['event_subtype'].value_counts()
print(obr_subtypes)

print("\nOn-ball engagement subtypes:")
obe_subtypes = all_events[all_events['event_type'] == 'on_ball_engagement']['event_subtype'].value_counts()
print(obe_subtypes)

Off-ball run subtypes:
event_subtype
run_ahead_of_the_ball    1378
support                   751
coming_short              701
dropping_off              631
cross_receiver            423
behind                    363
pulling_wide              344
overlap                   153
pulling_half_space        148
underlap                  110
Name: count, dtype: int64

On-ball engagement subtypes:
event_subtype
pressure          2231
pressing          2018
recovery_press    1938
other             1848
counter_press      876
Name: count, dtype: int64


## Metric 3: Phase Characteristics

Check possession phases (build-up, create, finish, transitions, etc.) to understand tactical context:
- Phase counts and type distribution
- How long phases last on average

Each phase has both an in-possession and out-of-possession view (same phase from both teams' perspectives), so I'll focus on in-possession to avoid double-counting.

In [6]:
# Combine all phases across matches
all_phases = pd.concat([data['phases'] for data in all_data.values()], ignore_index=True)

print(f"Total phases across all matches: {len(all_phases):,}")
print(f"Average phases per match: {len(all_phases)/len(all_data):.0f}")

# Phase type distribution (in possession)
print("\nIn-possession phase types:")
phase_types = all_phases['team_in_possession_phase_type'].value_counts()
print(phase_types)

print("\nOut-of-possession phase types:")
oop_types = all_phases['team_out_of_possession_phase_type'].value_counts()
print(oop_types)

print("\nNote: Both perspectives of the same phase (in-poss and out-of-poss).")
print("Using in-possession to avoid double-counting.")

Total phases across all matches: 4,581
Average phases per match: 458

In-possession phase types:
team_in_possession_phase_type
create         1459
chaotic        1087
finish          735
build_up        618
direct          373
set_play        174
transition       78
quick_break      57
Name: count, dtype: int64

Out-of-possession phase types:
team_out_of_possession_phase_type
medium_block             1459
chaotic                  1087
low_block                 735
high_block                618
defending_direct          373
defending_set_play        174
defending_transition       78
defending_quick_break      57
Name: count, dtype: int64

Note: Both perspectives of the same phase (in-poss and out-of-poss).
Using in-possession to avoid double-counting.


In [7]:
# Phase duration analysis
print("Phase duration statistics (seconds):")
print(all_phases['duration'].describe())

# Duration by phase type
print("\nAverage duration by in-possession phase type:")
duration_by_type = (
    all_phases
    .groupby('team_in_possession_phase_type')['duration']
    .agg(['mean', 'count'])
    .sort_values('mean', ascending=False)
)

duration_by_type['mean'] = duration_by_type['mean'].round(2)

display(duration_by_type)

Phase duration statistics (seconds):
count    4581.000000
mean        6.813862
std         6.711442
min         0.100000
25%         2.500000
50%         4.700000
75%         8.700000
max        59.300000
Name: duration, dtype: float64

Average duration by in-possession phase type:


team_in_possession_phase_type,mean,count
transition,11.33,78
build_up,8.7,618
finish,8.54,735
create,8.33,1459
quick_break,8.19,57
set_play,7.84,174
direct,3.44,373
chaotic,3.13,1087


## Metric 4: Spatial Coverage

Check where events happen on the pitch using thirds and channels. Goal is to confirm spatial fields are populated and there aren't obvious blind spots before using them in tactical metrics.

If thirds/channels aren't available, fall back to simple x-coordinate bins.

In [8]:
# Events carry third and channel labels for both start and end positions.
# Confirm those columns exist and are populated before using them.

# Define the pairs we need
third_cols = ['third_start', 'third_end']
channel_cols = ['channel_start', 'channel_end']

print("Spatial fields in events:")
spatial_cols = [
    col for col in all_events.columns
    if col.lower().startswith(('third_', 'channel_', 'x_', 'y_'))
]
print(spatial_cols)

# Thirds table
if set(third_cols).issubset(all_events.columns):
    third_df = pd.DataFrame({
        'third_start': all_events['third_start'].value_counts(),
        'third_end': all_events['third_end'].value_counts()
    })

    print("\nEvents by third (start and end positions):")
    display(third_df)

# Channels table
if set(channel_cols).issubset(all_events.columns):
    channel_df = pd.DataFrame({
        'channel_start': all_events['channel_start'].value_counts(),
        'channel_end': all_events['channel_end'].value_counts()
    })

    print("\nEvents by channel (start and end positions):")
    display(channel_df)

Spatial fields in events:
['x_start', 'y_start', 'channel_id_start', 'channel_start', 'third_id_start', 'third_start', 'x_end', 'y_end', 'channel_id_end', 'channel_end', 'third_id_end', 'third_end']

Events by third (start and end positions):


third,third_start,third_end
attacking_third,11548,13128
defensive_third,12296,12312
middle_third,24009,22413



Events by channel (start and end positions):


channel,channel_start,channel_end
center,13886,12595
wide_right,8903,9693
wide_left,8794,9588
half_space_left,8192,8001
half_space_right,8078,7976


In [9]:
# Spatial distribution by event type
if 'third_start' in all_events.columns and 'channel_start' in all_events.columns:
    # Create crosstab of event types by location
    print("Event types by third:")
    third_crosstab = pd.crosstab(all_events['event_type'], all_events['third_start'])
    display(third_crosstab)
    
    # Normalize to percentages
    print("\nEvent types by third (% distribution):")
    third_pct = pd.crosstab(all_events['event_type'], all_events['third_start'], normalize='index') * 100
    display(third_pct.round(1))
else:
    print("Using x/y coordinates for spatial analysis...")
    # Simple x-coordinate binning
    all_events['x_bin'] = pd.cut(all_events['x_start'], bins=[-60, -20, 20, 60], labels=['Defensive', 'Middle', 'Attacking'])
    print("\nEvents by pitch region (x-coordinate):")
    print(all_events['x_bin'].value_counts())

Event types by third:


event_type,attacking_third,defensive_third,middle_third
off_ball_run,1337,706,2959
on_ball_engagement,1910,2929,4072
passing_option,6275,5481,12618
player_possession,2026,3180,4360



Event types by third (% distribution):


event_type,attacking_third,defensive_third,middle_third
off_ball_run,26.7,14.1,59.2
on_ball_engagement,21.4,32.9,45.7
passing_option,25.7,22.5,51.8
player_possession,21.2,33.2,45.6


## Metric 5: Frame Coverage

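Tracking runs at 10fps, so frame counts map directly to time. Split into active frames (`period` is set) vs unannotated frames (`period` is null, called `stoppage_frames` in the code). `period` only marks which half a frame belongs to, so "active" includes dead-ball time within a half and is not an effective-playing-time measure. The split estimates:
- Total tracked duration
- Active (in-period) time
- Share of tracked frames that fall inside a period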

Just confirming durations look sensible and there aren't truncated segments.
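
The truncation check can be made explicit by looking for jumps in the frame counter. This is a minimal sketch, assuming the tracking table has one row per frame and an integer `frame` column; `find_frame_gaps` is a hypothetical helper and not part of `src`.

```python
import pandas as pd

def find_frame_gaps(tracking, frame_col="frame"):
    """List frames that are preceded by one or more missing frame numbers."""
    frames = tracking[frame_col].sort_values().reset_index(drop=True)
    # A step of 1 means consecutive frames; anything larger is a gap
    missing_before = frames.diff() - 1
    gaps = pd.DataFrame({"frame": frames, "missing_before": missing_before})
    return gaps[gaps["missing_before"] > 0].reset_index(drop=True)

# Toy data for illustration only: frames 13-19 are missing
toy = pd.DataFrame({"frame": [10, 11, 12, 20, 21]})
gaps = find_frame_gaps(toy)
print(gaps)
```

At 10fps, dividing `missing_before` by 10 gives the length of each gap in seconds.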

In [10]:
# Analyze frame coverage across all matches
frame_stats = []

# Quick check: what values does 'period' take across all matches?
all_period_values = pd.concat([d['tracking']['period'] for d in all_data.values()]).unique()
print("Unique period values across dataset:", all_period_values)

for match_id, data in all_data.items():
    tracking = data['tracking']
    
    total_frames = len(tracking)
    active_frames = tracking['period'].notna().sum()
    stoppage_frames = total_frames - active_frames
    
    # At 10fps, frames / 10 = seconds, / 60 = minutes
    total_minutes = total_frames / 10 / 60
    active_minutes = active_frames / 10 / 60
    
    frame_stats.append({
        'match_id': match_id,
        'total_frames': total_frames,
        'active_frames': active_frames,
        'stoppage_frames': stoppage_frames,
        'active_pct': (active_frames / total_frames * 100),
        'total_minutes': total_minutes,
        'active_minutes': active_minutes
    })

frame_df = pd.DataFrame(frame_stats)

# Round all numeric columns for cleaner display
numeric_cols = frame_df.select_dtypes(include='number').columns
frame_df[numeric_cols] = frame_df[numeric_cols].round(2)

print("\nFrame coverage per match:")
display(frame_df)

print(f"\nAverage active play: {frame_df['active_minutes'].mean():.1f} minutes per match")
print(f"Average active play %: {frame_df['active_pct'].mean():.1f}%")

Unique period values across dataset: [nan  1.  2.]

Frame coverage per match:


Unnamed: 0,match_id,total_frames,active_frames,stoppage_frames,active_pct,total_minutes,active_minutes
0,2017461,71451,58452,12999,81.81,119.08,97.42
1,2015213,72101,59902,12199,83.08,120.17,99.84
2,2013725,70251,56792,13459,80.84,117.08,94.65
3,2011166,71851,59292,12559,82.52,119.75,98.82
4,2006229,59270,59251,19,99.97,98.78,98.75
5,1996435,57621,57602,19,99.97,96.04,96.0
6,1953632,59250,59231,19,99.97,98.75,98.72
7,1925299,61301,61282,19,99.97,102.17,102.14
8,1899585,60530,60511,19,99.97,100.88,100.85
9,1886347,59061,59042,19,99.97,98.44,98.4



Average active play: 98.6 minutes per match
Average active play %: 92.8%


### Why some matches show more “stoppage” frames

Here, “stoppage” simply means frames where `period` isn’t annotated. In six of the ten matches that is only 19 frames, so ordinary dead-ball moments inside a half are evidently still labelled with a period. The larger counts most likely come from unlabelled broadcast footage that SkillCorner still tracks: pre-match, post-match, camera resets, or other non-play segments.

Four matches carry roughly 12–13k of these unannotated frames (about 20–22 minutes at 10fps), which accounts for their longer `total_minutes`. What matters is the **active_minutes** column: all matches land at roughly 95–102 minutes of in-period data, which is what I expect for two halves plus added time. The extra frames don’t affect the downstream metrics because everything is normalised to active minutes anyway.
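
To see where the unannotated frames actually sit, they can be split into leading, interior and trailing blocks. This is a rough sketch, assuming the `period` series is ordered by frame; `locate_unannotated_frames` is a hypothetical helper, not something from `src`.

```python
import pandas as pd

def locate_unannotated_frames(period):
    """Count null-period frames before, between and after the annotated frames."""
    is_null = period.isna().to_numpy()
    annotated_idx = (~is_null).nonzero()[0]
    if len(annotated_idx) == 0:
        # Nothing annotated at all: treat every frame as leading
        return {"leading": int(is_null.sum()), "interior": 0, "trailing": 0}
    first, last = annotated_idx[0], annotated_idx[-1]
    return {
        "leading": int(is_null[:first].sum()),
        "interior": int(is_null[first:last + 1].sum()),
        "trailing": int(is_null[last + 1:].sum()),
    }

# Toy data for illustration only
toy = pd.Series([None, None, 1, 1, None, 2, 2, None], dtype="float64")
counts = locate_unannotated_frames(toy)
print(counts)
```

A large `leading` or `trailing` count would point to pre- or post-match footage, while a large `interior` count would point to an unlabelled block between annotated frames.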

## Metric 6: Temporal Patterns

SkillCorner already tags each event with a match clock (`minute_start`) and a `period` flag, so I use those directly instead of reconstructing time from frames.

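First, I look at the **average number of events per minute for each event type** (player_possession, passing_option, off_ball_run, on_ball_engagement). That gives a quick sense of how dense the event stream is and which behaviours dominate. The grouping is by clock minute only, so counts are pooled across all 10 matches and the figures come out at roughly ten times a single-match rate; the relative ordering of event types is the useful part.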

Then I aggregate **events by period (half) and event type**. That lets me see whether certain behaviours skew towards one half – for example, if the second half has more pressure actions or off-ball runs than the first. The goal here isn’t deep tactical analysis, just a sanity check that the temporal patterns look sensible and that there are no obvious gaps in the event data.
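
For a per-match rate, the match identifier has to be part of the grouping before averaging. This is a minimal sketch, assuming the events frame carries a `match_id` column (not confirmed by the output above); `events_per_minute_per_match` is a hypothetical helper.

```python
import pandas as pd

def events_per_minute_per_match(events, match_col="match_id"):
    """Mean events per clock minute by event type, counted within each match."""
    per_match_minute = (
        events
        .groupby([match_col, "minute", "event_type"])
        .size()
        .unstack(fill_value=0)
    )
    # Each row is one (match, minute) pair, so the mean is a per-match rate.
    # Match-minutes with no events at all are absent and are not counted as zero.
    return per_match_minute.mean()

# Toy data for illustration only: two matches, two minutes each
toy = pd.DataFrame({
    "match_id":   [1, 1, 1, 2, 2, 2],
    "minute":     [0, 0, 1, 0, 0, 1],
    "event_type": ["passing_option", "passing_option", "off_ball_run",
                   "passing_option", "off_ball_run", "off_ball_run"],
})
rates = events_per_minute_per_match(toy)
print(rates)
```

On this toy input, grouping by `minute` alone would give rates twice as large, because both matches would be counted in the same minute bucket.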

In [11]:
# Metric 6A: Events per minute by event type
# Note: grouping by minute alone pools all matches, so the means below are
# cross-match totals per clock minute, not per-match rates.

all_events = all_events.copy()

# Prefer SkillCorner's own match clock if it's there
if 'minute_start' in all_events.columns:
    all_events['minute'] = all_events['minute_start'].astype(int)
else:
    # Fallback: 10fps tracking means 600 frames per minute
    all_events['minute'] = (all_events['frame_start'] / 600).astype(int)

events_by_type_minute = (
    all_events
    .groupby(['minute', 'event_type'])
    .size()
    .unstack(fill_value=0)
).mean().round(1)

# Dataframe for prettier table
events_by_type_minute_df = (
    events_by_type_minute
    .rename("avg_events_per_min")
    .reset_index()
)

print("Average events per minute by event type:")
display(events_by_type_minute_df)

Average events per minute by event type:


Unnamed: 0,event_type,avg_events_per_min
0,off_ball_run,51.6
1,on_ball_engagement,91.9
2,passing_option,251.3
3,player_possession,98.6


In [12]:
# Metric 6B: Events by period (half) and type

if 'period' in all_events.columns:
    events_per_period = (
        all_events
        .dropna(subset=['period'])
        .groupby(['period', 'event_type'])
        .size()
        .unstack(fill_value=0)
    )

    print("\nEvents by period and type (1=1H, 2=2H, 3+=ET):")
    display(events_per_period)


Events by period and type (1=1H, 2=2H, 3+=ET):


period,off_ball_run,on_ball_engagement,passing_option,player_possession
1,2535,4464,12856,5054
2,2467,4447,11518,4512


## Tracking metrics: physical context from season aggregates

To interpret movement metrics, I use SkillCorner's season-level physical aggregates (`aus1league_physicalaggregates_20242025_midfielders.csv`):

- Position groups (role clusters)
- Distance totals (total, high-speed, sprint)
- Sprint/acceleration counts
- Minutes played by possession context

Most fields come in three variants:
- **full_tip** – Team In Possession phases
- **full_otip** – Opponent Team In Possession phases  
- **full_all** – whole match including stoppages

Comparing per-match tracking to season aggregates validates whether movement in the sample match aligns with typical output for that player's role.

### Player metadata preview

Quick check of who's in the sample (name, team, position, minutes) before running distance/speed calcs.

In [13]:
from src.loaders import load_physical_aggregates

# Load the full physical aggregates dataframe – no column subsetting here
physical_context = load_physical_aggregates()

# --- Preview basic metadata for the first 10 players in the aggregates ---

meta_cols = [
    "player_id",
    "player_name",
    "player_short_name",
    "team_name",
    "position_group",
    "minutes_full_tip",
]

# Take first 10 unique players
preview_ids = physical_context["player_id"].dropna().unique()[:10]

player_meta = (
    physical_context[meta_cols]
    .drop_duplicates("player_id")
    .set_index("player_id")
    .loc[preview_ids]
    .reset_index()
)

print("Player metadata preview (first 10 players in aggregates):")
display(player_meta)

Player metadata preview (first 10 players in aggregates):


Unnamed: 0,player_id,player_name,player_short_name,team_name,position_group,minutes_full_tip
0,211,Adam Taggart,A. Taggart,Perth Glory Football Club,Center Forward,22.29
1,218,Adama Traoré,A. Traoré,Melbourne Victory Football Club,Full Back,26.23
2,2759,Dino Arslanagić,D. Arslanagić,Macarthur FC,Central Defender,28.33
3,2858,Douglas Costa de Souza,Douglas Costa,Sydney Football Club,Center Forward,27.43
4,4322,Hiroki Sakai,H. Sakai,Auckland FC,Central Defender,23.47
5,4792,Javi López,Javi López,Adelaide United Football Club,Full Back,26.79
6,5468,Joshua Brillante,J. Brillante,Western Sydney Wanderers FC,Central Defender,23.88
7,5521,Juan Manuel Mata García,Juan Mata,Western Sydney Wanderers FC,Center Forward,23.33
8,5805,Kévin Boli,K. Boli,Macarthur FC,Central Defender,24.3
9,6799,Marco Rojas,M. Rojas,Wellington Phoenix FC,Wide Attacker,24.22


## Metric 7: Distance Covered — Match vs Season

Calculate distance from tracking and compare to season baseline.

For two players per position:
1. Frame-to-frame distances → total metres → km
2. Metres per minute (using actual match minutes)
3. Join season aggregates: position, minutes, total distance, metres/min

Shows whether workload in this match is consistent with season profile or if drops/spikes are explained by subs, role, or tactics.
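
Steps 1 and 2 can be sketched as follows. This is an illustration under stated assumptions, not the actual implementation in `src.eda`: it assumes `x` and `y` are in metres, one row per frame, and it does not handle undetected frames or frame gaps. `distance_and_rate` is a hypothetical name.

```python
import numpy as np
import pandas as pd

def distance_and_rate(player_frames, minutes_played):
    """Total distance (m) and metres per minute from frame-to-frame x/y steps."""
    ordered = player_frames.sort_values("frame")
    # Euclidean step length between consecutive frames
    step_m = np.hypot(ordered["x"].diff(), ordered["y"].diff())
    # The first step is NaN (no previous frame) and is skipped by sum()
    total_m = float(step_m.sum())
    return total_m, total_m / minutes_played

# Toy data for illustration only: two 5 m steps and one stationary frame
toy = pd.DataFrame({
    "frame": [10, 11, 12, 13],
    "x": [0.0, 3.0, 3.0, 6.0],
    "y": [0.0, 4.0, 4.0, 8.0],
})
total_m, m_per_min = distance_and_rate(toy, minutes_played=0.5)
print(total_m, m_per_min)
```

Dividing by minutes actually played rather than match length is what keeps late substitutes comparable with players who were on for the full match.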

In [14]:
from src.eda import explode_player_tracking

sample_match_id = "1886347"
tracking = all_data[sample_match_id]['tracking']

player_tracking = explode_player_tracking(tracking)

print(f"Flattened tracking shape: {player_tracking.shape}")
display(player_tracking.head())

Flattened tracking shape: (956076, 7)


Unnamed: 0,frame,timestamp,period,player_id,x,y,is_detected
0,10,2025-12-08,1.0,51009,-39.63,-0.08,False
1,10,2025-12-08,1.0,176224,-19.21,-9.18,True
2,10,2025-12-08,1.0,51649,-21.83,0.47,True
3,10,2025-12-08,1.0,50983,-1.16,-32.47,True
4,10,2025-12-08,1.0,735578,-18.88,15.73,True


In [15]:
from src.eda import sample_players_by_position

sample_player_ids = sample_players_by_position(
    physical_context=physical_context,
    player_tracking=player_tracking,
    n_per_group=2,
)

print("Sampled player_ids by position group:", sample_player_ids)

Sampled player_ids by position group: [38673, 50951, 31147, 33697, 50983, 51713, 14736, 23418, 43829, 965685]


In [16]:
from src.eda import summarise_match_distance

match_meta = all_data[sample_match_id]["metadata"]

distance_df = summarise_match_distance(
    player_tracking=player_tracking,
    player_ids=sample_player_ids,
    match_meta=match_meta,
)

print("Match distance and minutes played for sampled players:")
display(distance_df.round(2))

Match distance and minutes played for sampled players:


Unnamed: 0,player_id,total_distance_m_match,total_distance_km_match,meters_per_minute_match,minutes_match
0,38673,8459.55,8.46,97.63,86.65
1,50951,8736.16,8.74,112.13,77.91
2,31147,1047.96,1.05,89.19,11.75
3,33697,8894.56,8.89,90.39,98.4
4,50983,10068.66,10.07,102.32,98.4
5,51713,10148.42,10.15,103.13,98.4
6,14736,11701.71,11.7,118.92,98.4
7,23418,12251.16,12.25,124.5,98.4
8,43829,2792.52,2.79,102.59,27.22
9,965685,11185.95,11.19,113.68,98.4


In [17]:
from src.eda import enrich_with_physical

# Season physical columns we want to compare (full_all = full match)
DISTANCE_COLS = [
    "position_group",
    "minutes_full_all",
    "total_distance_full_all",
    "total_metersperminute_full_all",
]

# Enrich with season aggregates
distance_enriched = enrich_with_physical(
    distance_df, physical_context, cols=DISTANCE_COLS
)

# Convert season distance to km for like-for-like comparison
distance_enriched["distance_km_season"] = (
    distance_enriched["total_distance_full_all"] / 1000
)

# Rename columns to a cleaner match/season convention
distance_enriched = distance_enriched.rename(
    columns={
        "total_distance_km_match": "distance_km_match",
        "meters_per_minute_match": "mpm_match",
        "total_metersperminute_full_all": "mpm_season",
        "minutes_match": "minutes_played_match",
        "minutes_full_all": "minutes_played_season",
    }
)

print("Distance summary for sampled players (match vs season):")

cols_to_show = [
    "player_id",
    "position_group",
    "minutes_played_match",
    "minutes_played_season",
    "distance_km_match",
    "distance_km_season",
    "mpm_match",
    "mpm_season",
]

display(distance_enriched[cols_to_show].round(2))

Distance summary for sampled players (match vs season):


Unnamed: 0,player_id,position_group,minutes_played_match,minutes_played_season,distance_km_match,distance_km_season,mpm_match,mpm_season
0,38673,Center Forward,86.65,94.62,8.46,10.46,97.63,110.58
1,50951,Center Forward,77.91,80.82,8.74,9.67,112.13,119.62
2,31147,Central Defender,11.75,99.48,1.05,9.68,89.19,97.33
3,33697,Central Defender,98.4,97.29,8.89,9.86,90.39,101.34
4,50983,Full Back,98.4,92.83,10.07,10.54,102.32,113.57
5,51713,Full Back,98.4,88.58,10.15,9.61,103.13,108.53
6,14736,Midfield,98.4,93.19,11.7,11.72,118.92,125.81
7,23418,Midfield,98.4,87.19,12.25,11.14,124.5,127.77
8,43829,Wide Attacker,27.22,72.26,2.79,8.51,102.59,117.71
9,965685,Wide Attacker,98.4,86.76,11.19,10.92,113.68,125.91


**Note on Tommy Smith (Player ID 31147):**  
Player ID 31147 stood out with a very low total distance for the match, so I investigated.  
Smith entered the Auckland FC vs Newcastle Jets match at **86 minutes** and played 11.75 minutes, so the ~1 km total is correct for a short appearance and fully explains the gap vs his season averages.

## Metric 8: Speed Profile — Match vs Season High-Intensity

Check how fast players move and compare to season high-intensity metrics.

Speed bands based on [SkillCorner's Physical Data Glossary](https://skillcorner.crunch.help/en/glossaries/physical-data-glossary):
- 0–7 km/h: walking
- 7–15 km/h: jogging
- 15–20 km/h: running
- 20–25 km/h: HSR
- 25–40 km/h: sprinting

Per player:
- Distance, avg speed, max speed (cleaned with rolling median + 99th percentile)
- Join season HSR/sprint distance and counts (the in-possession `full_tip` variants) and `psv99` (99th percentile sprint speed)

Placing `psv99` next to match max speed directly compares top-end capability.

In [18]:
from src.eda import get_player_summary, enrich_with_physical

# Extract minutes_played lookup from match metadata
match_meta = all_data[sample_match_id]["metadata"]
minutes_played_lookup = {
    p["id"]: p["playing_time"]["total"]["minutes_played"] 
    for p in match_meta.get("players", [])
    if p.get("playing_time") and p["playing_time"].get("total")
}

speed_rows = []

for pid in sample_player_ids:
    summary = get_player_summary(player_tracking, pid, minutes_played_lookup)

    speed_rows.append({
        "player_id": pid,
        "distance_km": summary["distance_km"],
        "avg_speed_kmh": summary["avg_speed_kmh"],
        "max_speed_kmh": summary["max_speed_kmh"],
    })

speed_df = pd.DataFrame(speed_rows)

# Physical columns relevant for speed checks
SPEED_COLS = [
    "position_group",
    "hsr_distance_full_tip",
    "hsr_count_full_tip",
    "sprint_distance_full_tip",
    "sprint_count_full_tip",
    "psv99",
]

# Enrich with season-level physical aggregates
speed_enriched = enrich_with_physical(speed_df, physical_context, cols=SPEED_COLS)

print("Speed summary for sampled players (match vs season):")

cols_to_show = [
    "player_id",
    "position_group",
    "distance_km",
    "avg_speed_kmh",
    "max_speed_kmh",   # match max speed
    "psv99",           # season max sprint capability
    "hsr_distance_full_tip",
    "hsr_count_full_tip",
    "sprint_distance_full_tip",
    "sprint_count_full_tip",
]

display(speed_enriched[cols_to_show].round(2))

Speed summary for sampled players (match vs season):


Unnamed: 0,player_id,position_group,distance_km,avg_speed_kmh,max_speed_kmh,psv99,hsr_distance_full_tip,hsr_count_full_tip,sprint_distance_full_tip,sprint_count_full_tip
0,38673,Center Forward,8.46,5.86,23.51,27.8,356.5,31.75,97.46,5.39
1,50951,Center Forward,8.74,6.73,24.25,27.65,253.0,21.75,43.5,1.75
2,31147,Central Defender,1.05,5.35,23.73,28.15,56.58,6.17,16.08,0.75
3,33697,Central Defender,8.89,5.42,20.99,28.08,78.15,8.12,23.54,1.15
4,50983,Full Back,10.07,6.14,23.85,29.98,204.33,18.17,55.5,2.83
5,51713,Full Back,10.15,6.19,23.48,28.83,163.5,17.33,50.67,3.33
6,14736,Midfield,11.7,7.14,22.79,26.57,161.1,13.9,18.29,1.0
7,23418,Midfield,12.25,7.47,24.15,27.35,233.09,19.45,54.86,2.86
8,43829,Wide Attacker,2.79,6.16,25.97,28.8,260.0,23.14,84.71,4.14
9,965685,Wide Attacker,11.19,6.82,24.84,29.92,359.73,37.36,196.73,9.27


### Cleaning max sprint speed

Raw frame-level max speeds can be messy. A single bad detection is enough to throw 
a player up into the 30–40 km/h range.

1. **Throw out anything unrealistic**  
   Only keep speeds between 0 and 40 km/h.

2. **Smooth it**  
   Apply a 3-frame rolling median.  
   Single-frame jumps disappear because the median ignores the outlier.

3. **Take the 99th percentile, not the raw max**  
   This matches the logic behind `psv99` and gives a stable estimate of a
   player’s actual top speed in the match.

Once cleaned, the max speeds sit in a believable range (~21–26 km/h) and line up 
much more closely with each player’s season `psv99`, while staying roughly 3–7 km/h below it (26.6–30.0 km/h in this sample).
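
The three cleaning steps can be sketched like this. It is a minimal illustration, not the code behind `get_player_summary`: the centred window and the inclusive 0–40 bounds are assumptions, and `cleaned_top_speed` is a hypothetical name.

```python
import numpy as np
import pandas as pd

def cleaned_top_speed(speeds_kmh, lower=0.0, upper=40.0, window=3, q=0.99):
    """Filter to a realistic range, smooth with a rolling median, take a high quantile."""
    s = pd.Series(speeds_kmh, dtype="float64")
    # 1. Throw out anything unrealistic
    s = s[(s >= lower) & (s <= upper)]
    # 2. Rolling median so single-frame spikes are ignored
    smoothed = s.rolling(window, center=True, min_periods=1).median()
    # 3. Use a high percentile instead of the raw max
    return float(smoothed.quantile(q))

# Toy data for illustration only: steady 20 km/h with two bad detections
toy = np.full(200, 20.0)
toy[50] = 38.0    # single-frame spike inside the allowed range
toy[120] = 75.0   # impossible value, removed by the range filter
top = cleaned_top_speed(toy)
print(toy.max(), top)
```

The raw maximum here is 75 km/h, while the cleaned estimate stays at the steady running speed, which is the behaviour the steps above are after.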

## Metric 9: Speed Zones and Movement Profile

Break movement into the same speed zones as Metric 8:

- 0–7 km/h: walking
- 7–15 km/h: jogging
- 15–20 km/h: running
- 20–25 km/h: HSR
- 25–40 km/h: sprinting

Per player:
1. Calculate frame-to-frame speeds (filter 0–40 km/h)
2. Measure % of distance in each band
3. Join season aggregates: running, HI, HSR, sprint distances

Creates a "movement fingerprint" showing whether match intensity aligns with season profile (e.g. wide attackers sprint more, deep midfielders run more but sprint less).

In [19]:
from src.eda import calculate_speeds, calculate_distances

# Define speed zones in km/h, aligned to running / HSR / sprint buckets
zones = [
    (0, 7, "walking_pct"),
    (7, 15, "jogging_pct"),
    (15, 20, "running_pct_match"),  # matches running_distance
    (20, 25, "hsr_pct_match"),      # matches hsr_distance
    (25, 40, "sprint_pct_match"),   # matches sprint_distance
]

zone_rows = []

In [20]:
for pid in sample_player_ids:
    df_player = player_tracking[player_tracking["player_id"] == pid].copy()
    if df_player.empty:
        continue

    speeds = calculate_speeds(df_player)         # km/h
    dists_m = calculate_distances(df_player)     # metres per frame

    # Keep only realistic, non-null frames
    mask = (
        speeds.notna()
        & dists_m.notna()
        & (speeds > 0)
        & (speeds < 40)
    )
    speeds = speeds[mask]
    dists_m = dists_m[mask]

    if len(speeds) == 0:
        continue

    total_dist_m = dists_m.sum()
    if total_dist_m <= 0:
        continue

    row = {
        "player_id": pid,
        "frames_in_zones": len(speeds),
        "total_distance_m_match": total_dist_m,
    }

    # Share of distance in each speed band
    for lower, upper, col_name in zones:
        zone_mask = (speeds >= lower) & (speeds < upper)
        zone_dist = dists_m[zone_mask].sum()
        row[col_name] = 100 * zone_dist / total_dist_m

    zone_rows.append(row)

# Combine per-player zone summaries (match-level)
zones_df = pd.DataFrame(zone_rows)
display(zones_df.round(1))

Unnamed: 0,player_id,frames_in_zones,total_distance_m_match,walking_pct,jogging_pct,running_pct_match,hsr_pct_match,sprint_pct_match
0,38673,39049,7858.7,29.7,42.0,17.3,9.0,2.1
1,50951,35703,8069.9,25.9,44.4,18.6,8.5,2.6
2,31147,4119,820.9,25.8,56.0,10.1,4.9,3.3
3,33697,43256,8011.0,34.5,50.7,10.4,3.3,1.2
4,50983,43290,9063.9,26.4,50.9,14.2,6.2,2.3
5,51713,43305,9218.2,26.6,49.5,14.8,7.3,1.8
6,14736,43315,10767.6,19.0,53.7,19.9,6.8,0.5
7,23418,43294,11308.6,15.6,49.9,22.0,10.8,1.8
8,43829,10154,2405.2,22.7,47.6,17.3,8.0,4.4
9,965685,43317,10255.8,21.4,48.6,19.3,7.6,3.1


In [21]:
# Physical columns relevant for movement profile (full_all = whole match)
ZONE_COLS = [
    "position_group",
    "total_distance_full_all",
    "running_distance_full_all",
    "hi_distance_full_all",
    "hsr_distance_full_all",
    "sprint_distance_full_all",
]

zones_enriched = enrich_with_physical(zones_df, physical_context, cols=ZONE_COLS)

# Match-level HI% = distance above 20 km/h (HSR + Sprint)
zones_enriched["hi_pct_match"] = (
    zones_enriched["hsr_pct_match"].fillna(0)
    + zones_enriched["sprint_pct_match"].fillna(0)
)

# Season-level distance shares (full_all)
denom_all = zones_enriched["total_distance_full_all"].replace(0, np.nan)

zones_enriched["running_pct_season"] = (
    100 * zones_enriched["running_distance_full_all"] / denom_all
)
zones_enriched["hsr_pct_season"] = (
    100 * zones_enriched["hsr_distance_full_all"] / denom_all
)
zones_enriched["sprint_pct_season"] = (
    100 * zones_enriched["sprint_distance_full_all"] / denom_all
)
zones_enriched["hi_pct_season"] = (
    100 * zones_enriched["hi_distance_full_all"] / denom_all
)

print("Speed zone profile for sampled players (match vs season profile):")

zone_cols_to_show = [
    "player_id",
    "position_group",
    "walking_pct",
    "jogging_pct",
    "running_pct_match",
    "running_pct_season",
    "hsr_pct_match",
    "hsr_pct_season",
    "sprint_pct_match",
    "sprint_pct_season",
    "hi_pct_match",
    "hi_pct_season",
]

display(zones_enriched[zone_cols_to_show].round(1))

Speed zone profile for sampled players (match vs season profile):


Unnamed: 0,player_id,position_group,walking_pct,jogging_pct,running_pct_match,running_pct_season,hsr_pct_match,hsr_pct_season,sprint_pct_match,sprint_pct_season,hi_pct_match,hi_pct_season
0,38673,Center Forward,29.7,42.0,17.3,16.3,9.0,7.1,2.1,1.8,11.0,8.9
1,50951,Center Forward,25.9,44.4,18.6,16.4,8.5,6.3,2.6,1.3,11.1,7.6
2,31147,Central Defender,25.8,56.0,10.1,10.9,4.9,3.8,3.3,1.1,8.2,4.9
3,33697,Central Defender,34.5,50.7,10.4,11.1,3.3,4.0,1.2,1.1,4.5,5.1
4,50983,Full Back,26.4,50.9,14.2,13.3,6.2,5.1,2.3,1.8,8.6,6.9
5,51713,Full Back,26.6,49.5,14.8,13.5,7.3,5.8,1.8,1.7,9.1,7.6
6,14736,Midfield,19.0,53.7,19.9,18.9,6.8,5.3,0.5,0.8,7.4,6.0
7,23418,Midfield,15.6,49.9,22.0,19.8,10.8,7.3,1.8,1.4,12.6,8.7
8,43829,Wide Attacker,22.7,47.6,17.3,16.3,8.0,7.6,4.4,2.3,12.4,9.9
9,965685,Wide Attacker,21.4,48.6,19.3,16.4,7.6,7.0,3.1,2.5,10.7,9.5
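
A quick consistency check on the table above: the five match-level zone shares should add up to roughly 100% per player. The season HI share should also match HSR + sprint, assuming SkillCorner's HI distance is defined that way (the values above are consistent with it, but I haven't confirmed it in the docs). A minimal sketch using three rows copied from the output rather than the live `zones_enriched` frame; the tolerances are my own choice to absorb rounding to one decimal.

```python
import pandas as pd

# Three rows copied from the displayed table (already rounded to 1 decimal)
sample = pd.DataFrame({
    "player_id": [38673, 31147, 23418],
    "walking_pct": [29.7, 25.8, 15.6],
    "jogging_pct": [42.0, 56.0, 49.9],
    "running_pct_match": [17.3, 10.1, 22.0],
    "hsr_pct_match": [9.0, 4.9, 10.8],
    "sprint_pct_match": [2.1, 3.3, 1.8],
    "hsr_pct_season": [7.1, 3.8, 7.3],
    "sprint_pct_season": [1.8, 1.1, 1.4],
    "hi_pct_season": [8.9, 4.9, 8.7],
})

match_cols = [
    "walking_pct",
    "jogging_pct",
    "running_pct_match",
    "hsr_pct_match",
    "sprint_pct_match",
]

# Match-level zone shares should sum to ~100% (tolerance is an assumption)
sample["zone_sum_match"] = sample[match_cols].sum(axis=1).round(1)
sample["zones_ok"] = (sample["zone_sum_match"] - 100).abs() <= 0.3

# Season HI share vs HSR + sprint (assumed definition of HI)
sample["hi_gap_season"] = (
    sample["hsr_pct_season"] + sample["sprint_pct_season"] - sample["hi_pct_season"]
).abs().round(2)
sample["hi_ok"] = sample["hi_gap_season"] <= 0.2

print(sample[["player_id", "zone_sum_match", "zones_ok", "hi_gap_season", "hi_ok"]])
```

In the notebook itself the same check could run directly on `zones_enriched` before rounding, which would allow a much tighter tolerance.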


## Summary: Dataset Characteristics

Let's consolidate the key metrics that describe this dataset.

In [22]:
# Dynamic event/phase aggregates
num_matches = len(all_data)
players_per_match = player_df['total_players'].mean()
events_per_match = len(all_events) / num_matches
phases_per_match = len(all_phases) / num_matches
avg_phase_duration = all_phases['duration'].mean()
active_minutes_avg = frame_df['active_minutes'].mean()

most_common_event = all_events['event_type'].value_counts().idxmax()
most_common_phase = all_phases['team_in_possession_phase_type'].value_counts().idxmax()

# Tracking aggregates (from sampled match)
typical_dist_match = distance_enriched['distance_km_match'].mean()
typical_dist_season = distance_enriched['distance_km_season'].mean()

typical_avg_speed = speed_df['avg_speed_kmh'].mean()
max_sprint_speed = speed_df['max_speed_kmh'].max()

hi_match_mean = zones_enriched['hi_pct_match'].mean()
hi_season_mean = zones_enriched['hi_pct_season'].mean()

summary_metrics = {
    "Metric": [
        "Total matches",
        "Players per match (avg)",
        "Total events",
        "Events per match (avg)",
        "Total phases",
        "Phases per match (avg)",
        "Phase duration (avg)",
        "Active play time per match (avg)",
        "Most common event type",
        "Most common phase type",
        "Typical distance per player - sampled match (km)",
        "Typical distance per player - season baseline (km)",
        "Average speed - sampled players (km/h)",
        "Max sprint speed - sampled players (km/h)",
        "High-intensity distance share - match (sampled players, %)",
        "High-intensity distance share - season (sampled players, %)",
    ],
    "Value": [
        num_matches,
        f"{players_per_match:.1f}",
        f"{len(all_events):,}",
        f"{events_per_match:.0f}",
        f"{len(all_phases):,}",
        f"{phases_per_match:.0f}",
        f"{avg_phase_duration:.1f} seconds",
        f"{active_minutes_avg:.1f} minutes",
        most_common_event,
        most_common_phase,
        f"{typical_dist_match:.2f}",
        f"{typical_dist_season:.2f}",
        f"{typical_avg_speed:.2f}",
        f"{max_sprint_speed:.2f}",
        f"{hi_match_mean:.1f}",
        f"{hi_season_mean:.1f}",
    ],
}

summary_df = pd.DataFrame(summary_metrics)

print("\n" + "="*60)
print("DATASET SUMMARY: KEY METRICS")
print("="*60)
print(summary_df.to_string(index=False))
print("="*60)


DATASET SUMMARY: KEY METRICS
                                                     Metric          Value
                                              Total matches             10
                                    Players per match (avg)           36.0
                                               Total events         47,853
                                     Events per match (avg)           4785
                                               Total phases          4,581
                                     Phases per match (avg)            458
                                       Phase duration (avg)    6.8 seconds
                           Active play time per match (avg)   98.6 minutes
                                     Most common event type passing_option
                                     Most common phase type         create
           Typical distance per player - sampled match (km)           8.53
         Typical distance per player - season baseline (km)          1

In [23]:
from IPython.display import Markdown, display

# Recompute the core aggregates so this cell runs on its own
num_matches = len(all_data)
players_per_match = player_df['total_players'].mean()
events_per_match = len(all_events) / num_matches
phases_per_match = len(all_phases) / num_matches
avg_phase_duration = all_phases['duration'].mean()
active_minutes_avg = frame_df['active_minutes'].mean()
active_pct_avg = frame_df['active_pct'].mean()

most_common_event = all_events['event_type'].value_counts().idxmax()
most_common_phase = all_phases['team_in_possession_phase_type'].value_counts().idxmax()

# Optional: simple spatial sanity check (skipped if the column is missing,
# so the takeaway never renders as "None thirds")
if 'third_start' in all_events.columns:
    thirds_covered = all_events['third_start'].nunique()
    thirds_line = f"- Events spread across {thirds_covered} thirds - no obvious spatial gaps"
else:
    thirds_line = "- Spatial coverage not checked (`third_start` column missing)"

data_quality_md = f"""
## Quick Takeaways

From the {num_matches} matches:

- ~{players_per_match:.1f} players per match (standard for A-League squads)
- ~{events_per_match:,.0f} events/match, mostly `{most_common_event}` events
- ~{phases_per_match:,.0f} possession phases averaging {avg_phase_duration:.1f}s
- ~{active_minutes_avg:.1f} min of active play (~{active_pct_avg:.1f}% of frames)
{thirds_line}

**Data quality looks solid:**
- No major gaps in events, phases, or tracking
- Event/phase distributions match SkillCorner docs
- Tracking-derived distances line up with season aggregates

Good enough to move forward with analytical metrics.
"""

display(Markdown(data_quality_md))


## Quick Takeaways

From the 10 matches:

- ~36.0 players per match (standard for A-League squads)
- ~4,785 events/match, mostly `passing_option` events
- ~458 possession phases averaging 6.8s
- ~98.6 min of active play (~92.8% of frames)
- Events spread across 3 thirds - no obvious spatial gaps

**Data quality looks solid:**
- No major gaps in events, phases, or tracking
- Event/phase distributions match SkillCorner docs
- Tracking-derived distances line up with season aggregates

Good enough to move forward with analytical metrics.


## Summary

Checked nine metrics across events, phases, and tracking:

**Events/phases:**
- Player counts, event types, temporal distribution
- Phase structure (counts, durations, types)
- Spatial patterns (thirds/channels)
- Schema matches SkillCorner docs

**Tracking:**
- Distance/speed calculations validated against SkillCorner's season aggregates
- Movement profiles (speed zones) look realistic per position
- No impossible values or structural issues

Both feeds are clean and usable for the analytical work.
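
To put a number on "look realistic per position", the sampled HI shares can be averaged by position group. A rough sketch using the ten rows from the speed zone table above (values as displayed, rounded to one decimal). With only two players per group this is a plausibility check, not a statistical comparison.

```python
import pandas as pd

# HI shares copied from the speed zone table (match vs season, % of distance)
profile = pd.DataFrame({
    "position_group": [
        "Center Forward", "Center Forward",
        "Central Defender", "Central Defender",
        "Full Back", "Full Back",
        "Midfield", "Midfield",
        "Wide Attacker", "Wide Attacker",
    ],
    "hi_pct_match": [11.0, 11.1, 8.2, 4.5, 8.6, 9.1, 7.4, 12.6, 12.4, 10.7],
    "hi_pct_season": [8.9, 7.6, 4.9, 5.1, 6.9, 7.6, 6.0, 8.7, 9.9, 9.5],
})

# Mean HI share per position group, plus the match-vs-season gap
by_position = profile.groupby("position_group")[["hi_pct_match", "hi_pct_season"]].mean()
by_position["match_minus_season"] = (
    by_position["hi_pct_match"] - by_position["hi_pct_season"]
)

print(by_position.sort_values("hi_pct_season", ascending=False).round(2))
```

On the season baseline, central defenders come out lowest and wide attackers highest, which is the ordering you'd expect. Every group also ran a higher HI share in the sampled match than in its season baseline.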

In [24]:
# Optional: save combined events and phases for use in the metrics notebook
output_dir = Path('../data/processed')
output_dir.mkdir(parents=True, exist_ok=True)

# Save combined events and phases for easier access
all_events.to_csv(output_dir / 'all_events.csv', index=False)
all_phases.to_csv(output_dir / 'all_phases.csv', index=False)

print(f"Saved combined events and phases to {output_dir}")
print("\nReady for metric calculation")

Saved combined events and phases to ../data/processed

Ready for metric calculation
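
Before the metrics notebook relies on these files, it may be worth confirming that they read back with the same shape and columns. A self-contained sketch of that round-trip check: the small `events` frame is a stand-in for `all_events` (the rows are illustrative, not taken from the dataset), and it writes to a temporary directory instead of `../data/processed`.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Stand-in for all_events; rows are illustrative placeholders
events = pd.DataFrame({
    "match_id": [2017461, 2017461, 2015213],
    "event_type": ["passing_option", "passing_option", "passing_option"],
})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "all_events.csv"
    events.to_csv(path, index=False)   # same call as in the cell above
    reloaded = pd.read_csv(path)       # read back while the temp dir exists

# Row/column counts and column order should survive the round trip
same_shape = reloaded.shape == events.shape
same_columns = list(reloaded.columns) == list(events.columns)

print(f"shape preserved: {same_shape}, columns preserved: {same_columns}")
```

CSV does not store dtypes, so datetime or categorical columns would come back as plain strings. If that matters downstream, a format such as Parquet may be a better fit.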
