# Data Loading

This notebook loads the SkillCorner A-League tracking data. The goal is to understand the file structure before jumping into analysis.

Files per match:
- `{id}_match.json` - metadata, lineups, pitch dimensions
- `{id}_tracking_extrapolated.jsonl` - 10fps tracking data
- `{id}_dynamic_events.csv` - pre-computed events
- `{id}_phases_of_play.csv` - game phases

In [1]:
import sys
sys.path.append('..')

import pandas as pd
import json
from pathlib import Path

from src.loaders import (
    load_match_metadata,
    load_tracking_data,
    load_dynamic_events,
    load_phases,
    get_all_match_ids,
    load_all_matches
)

pd.set_option('display.max_columns', 50)
pd.set_option('display.max_rows', 100)

print("✓ Imports successful")

✓ Imports successful


## Explore One Match First

Let's load match 1886347 to see what we're working with.

In [None]:
match_id = "1886347"
metadata = load_match_metadata(match_id)

# Basic match info
print(f"Match: {metadata['home_team']['name']} vs {metadata['away_team']['name']}")
print(f"Date: {metadata['date_time']}")
print(f"Score: {metadata['home_team_score']}-{metadata['away_team_score']}")
print(f"Pitch: {metadata['pitch_length']}m x {metadata['pitch_width']}m")
print(f"Players: {len(metadata['players'])}")

In [None]:
# Check player structure
print("Sample player object:")
print(json.dumps(metadata['players'][0], indent=2))

### Tracking Data

This is the big one - 10fps position data for players and ball.

In [None]:
tracking = load_tracking_data(match_id)

print(f"Shape: {tracking.shape}")
print(f"Memory: {tracking.memory_usage(deep=True).sum() / 1024 / 1024:.1f} MB")
print(f"\nColumns: {list(tracking.columns)}")

In [None]:
# Look at first few frames
tracking.head()

In [None]:
# The nested structures - ball_data, possession, player_data are stored as objects
# Will need to flatten these for analysis
print("Sample ball_data:")
print(tracking[tracking['ball_data'].notna()].iloc[0]['ball_data'])

print("\nSample possession:")
print(tracking[tracking['possession'].notna()].iloc[0]['possession'])

print("\nSample player_data (first player):")
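# Grab the first frame that has a non-empty player list (avoids iterrows overhead)
first_players = next(
    (p for p in tracking['player_data'] if isinstance(p, list) and len(p) > 0),
    None,
)
print(first_players[0] if first_players else "No player_data found")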

**Note**: The tracking data has nested lists/dicts. For player-level metrics, I'll need to explode player_data into individual rows. Each frame has ~22 players tracked.
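
A minimal sketch of the explode step described in the note above, run on toy frames rather than the real files. The keys inside each player dict (`player_id`, `x`, `y`, `is_detected`) and the `frame` column are assumptions for illustration; check the sample output from the cell above for the actual names. `explode_players` is a hypothetical helper, not part of `src.loaders`.

```python
import pandas as pd

# Toy frames mimicking the nested layout. The keys inside player_data
# are assumed for illustration, not confirmed from the SkillCorner files.
frames = pd.DataFrame({
    "frame": [10, 11],
    "player_data": [
        [{"player_id": 1, "x": 0.5, "y": -3.0, "is_detected": True},
         {"player_id": 2, "x": 12.0, "y": 4.5, "is_detected": False}],
        [{"player_id": 1, "x": 0.7, "y": -3.1, "is_detected": True}],
    ],
})

def explode_players(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per (frame, player) from a nested player_data column."""
    # Keep only frames that carry a non-empty list of players
    has_players = df["player_data"].apply(
        lambda v: isinstance(v, list) and len(v) > 0
    )
    long = df.loc[has_players, ["frame", "player_data"]].explode("player_data")
    long = long.reset_index(drop=True)
    # Expand each player dict into its own columns
    players = pd.json_normalize(long["player_data"].tolist())
    return pd.concat([long[["frame"]], players], axis=1)

player_rows = explode_players(frames)
print(player_rows)
```

With ~22 players per frame, the long table will have roughly 22x the row count of the frame table, so it may be worth exploding only the columns needed for a given metric.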

### Dynamic Events

In [None]:
events = load_dynamic_events(match_id)

print(f"Shape: {events.shape}")
print(f"Memory: {events.memory_usage(deep=True).sum() / 1024 / 1024:.1f} MB")
print(f"\n{events.shape[1]} columns is a lot. Let's see the key ones:")

In [None]:
# Show key columns
key_cols = ['event_id', 'event_type', 'event_subtype', 'player_id', 'player_name', 
            'frame_start', 'frame_end', 'time_start', 'time_end']
events[key_cols].head(10)

In [None]:
# What event types do we have?
print("Event type distribution:")
print(events['event_type'].value_counts())

In [None]:
# Lots of sparse columns (expected - different events have different fields)
null_pct = (events.isnull().sum() / len(events) * 100)
print(f"Columns with >90% nulls: {(null_pct > 90).sum()}")
print("\nThis is normal - most columns only apply to specific event types")
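
To see which columns belong to which event type, one option is to group by `event_type` and list the columns that hold at least one non-null value in each group. This is a rough sketch on a toy table; `pass_distance` and `run_speed` are made-up column names, and `columns_used_by_type` is a hypothetical helper.

```python
import pandas as pd

# Toy events table. Only event_type comes from the real data;
# the other column names are hypothetical.
toy_events = pd.DataFrame({
    "event_type": ["pass", "pass", "run", "run"],
    "pass_distance": [12.0, 30.5, None, None],
    "run_speed": [None, None, 6.1, 7.4],
})

def columns_used_by_type(df: pd.DataFrame, type_col: str = "event_type") -> dict:
    """Map each event type to the columns with at least one non-null value."""
    used = {}
    for event_type, group in df.groupby(type_col):
        non_null = group.drop(columns=[type_col]).notna().any()
        used[event_type] = non_null[non_null].index.tolist()
    return used

print(columns_used_by_type(toy_events))
```

On the real `events` frame this would be called as `columns_used_by_type(events)`, which gives a per-type column list to select from instead of carrying every column around.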

### Phases of Play

In [None]:
phases = load_phases(match_id)

print(f"Shape: {phases.shape}")
phases.head()

In [None]:
# Phase types
print(phases['team_in_possession_phase_type'].value_counts())

## Load All 10 Matches

In [None]:
match_ids = get_all_match_ids()
print(f"Found {len(match_ids)} matches: {match_ids}")

In [None]:
# This takes ~1-2 mins
all_data = load_all_matches()
print(f"\n✓ Loaded {len(all_data)} matches")

In [None]:
# Quick summary
for match_id, data in all_data.items():
    meta = data['metadata']
    print(f"{match_id}: {meta['home_team']['name'][:15]:15s} vs {meta['away_team']['name'][:15]:15s} | "
          f"Frames: {len(data['tracking']):6,} | Events: {len(data['events']):4,}")

## Quick Data Quality Checks

In [None]:
# Player counts per match (expect ~30-36 with subs)
player_counts = {mid: len(d['metadata']['players']) for mid, d in all_data.items()}
print("Player counts per match:")
print(pd.Series(player_counts).describe())

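# Flag the specific matches that fall short of a full 22-player lineup
short_matches = [mid for mid, n in player_counts.items() if n < 22]
if short_matches:
    print(f"\n⚠ Warning: matches with <22 players: {short_matches}")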

In [None]:
# Tracking frame counts (expect ~54k for 90min match at 10fps)
frame_counts = {mid: len(d['tracking']) for mid, d in all_data.items()}
print("\nTracking frame counts:")
print(pd.Series(frame_counts).describe())

In [None]:
# Total memory usage
total_mb = sum(
    d['tracking'].memory_usage(deep=True).sum() + 
    d['events'].memory_usage(deep=True).sum() + 
    d['phases'].memory_usage(deep=True).sum()
    for d in all_data.values()
) / 1024 / 1024

print(f"\nTotal memory: {total_mb:.1f} MB (~{total_mb/len(all_data):.1f} MB per match)")
print("This is manageable in-memory for 10 matches")

## Key Observations

**Data structure:**
- All 10 matches loaded successfully
- Tracking data is ~85-90 MB per match; the nested structures will need flattening
- Events CSV has 212 columns but most are sparse (event-type specific)
- Coordinate system uses pitch center as origin, not normalized

**Next steps for EDA:**
- Need to explode tracking player_data for player-level analysis
- Can extract off-ball runs, passing options directly from events
- For sprints, need to calculate speeds from tracking (frame-to-frame distances)
- Should normalize coordinates using pitch_length/pitch_width from metadata
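
A rough sketch of the last two steps, assuming the tracking data has already been exploded into one row per player per frame with positions in metres. The column names (`frame`, `player_id`, `x`, `y`) are assumptions, and mapping the centre-origin coordinates to a 0-1 range is only one possible reading of "normalize". Both helpers are hypothetical.

```python
import pandas as pd

FPS = 10  # tracking rate listed for the tracking files

# Toy long-format positions in metres; column names are assumptions.
positions = pd.DataFrame({
    "frame": [0, 1, 2, 0, 1, 2],
    "player_id": [7, 7, 7, 9, 9, 9],
    "x": [0.0, 0.3, 0.6, 10.0, 10.0, 10.0],
    "y": [0.0, 0.4, 0.8, 5.0, 5.2, 5.4],
})

def add_speed(df: pd.DataFrame, fps: int = FPS) -> pd.DataFrame:
    """Add speed in m/s from frame-to-frame displacement per player."""
    out = df.sort_values(["player_id", "frame"]).copy()
    grouped = out.groupby("player_id")
    dx = grouped["x"].diff()
    dy = grouped["y"].diff()
    # Dividing by the frame gap handles dropped frames between samples
    dt = grouped["frame"].diff() / fps
    # The first frame of each player has no predecessor, so speed is NaN
    out["speed_mps"] = (dx**2 + dy**2) ** 0.5 / dt
    return out

def normalize_xy(df: pd.DataFrame, pitch_length: float, pitch_width: float) -> pd.DataFrame:
    """Map centre-origin metres to a 0-1 range on each axis."""
    out = df.copy()
    out["x_norm"] = out["x"] / pitch_length + 0.5
    out["y_norm"] = out["y"] / pitch_width + 0.5
    return out

with_speed = add_speed(positions)
print(with_speed)
```

Raw frame-to-frame speeds at 10fps tend to be noisy, so some smoothing (for example a short rolling mean) is probably needed before applying a sprint threshold.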

**Things to watch:**
- Some tracking frames have a null period (stoppages) - filter these out
- Player positions are not always detected; some are extrapolated - check is_detected flags
- Events have lots of metrics we won't use - focus on core columns
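
A small sketch of the first two checks on toy data. It assumes the frame table has a `period` column that is null during stoppages and that each player dict carries an `is_detected` key; both should be confirmed against the real columns before relying on this. `detected_only` is a hypothetical helper.

```python
import pandas as pd

# Toy frame table. The period column and the player dict keys are
# assumptions for illustration.
toy_tracking = pd.DataFrame({
    "frame": [0, 1, 2],
    "period": [None, 1, 1],
    "player_data": [
        [],
        [{"player_id": 1, "is_detected": True},
         {"player_id": 2, "is_detected": False}],
        [{"player_id": 1, "is_detected": True}],
    ],
})

# Drop frames with no period assigned (treated here as stoppages)
live = toy_tracking[toy_tracking["period"].notna()]

def detected_only(players: list) -> list:
    """Keep players whose position was detected rather than extrapolated."""
    return [p for p in players if p.get("is_detected")]

live = live.assign(detected_players=live["player_data"].apply(detected_only))
print(live[["frame", "detected_players"]])
```

Whether to drop extrapolated positions or just flag them depends on the metric: dropping them leaves gaps that affect frame-to-frame speed calculations.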