In [1]:
%pip install statsbombpy

Collecting statsbombpy
  Downloading statsbombpy-1.16.0-py3-none-any.whl.metadata (63 kB)
Collecting requests-cache (from statsbombpy)
  Downloading requests_cache-1.3.0-py3-none-any.whl.metadata (9.2 kB)
Collecting inflect (from statsbombpy)
  Downloading inflect-7.5.0-py3-none-any.whl.metadata (24 kB)
Collecting typeguard>=4.0.1 (from inflect->statsbombpy)
  Downloading typeguard-4.5.1-py3-none-any.whl.metadata (3.8 kB)
Collecting typing_extensions>=4.14.0 (from typeguard>=4.0.1->inflect->statsbombpy)
  Downloading typing_extensions-4.15.0-py3-none-any.whl.metadata (3.3 kB)
Collecting cattrs>=22.2 (from requests-cache->statsbombpy)
  Downloading cattrs-26.1.0-py3-none-any.whl.metadata (8.5 kB)
Collecting url-normalize>=2.0 (from requests-cache->statsbombpy)
  Downloading url_normalize-2.2.1-py3-none-any.whl.metadata (5.6 kB)
Collecting attrs>=21.2 (from requests-cache->statsbombpy)
  Downloading attrs-25.4.0-py3-none-any.whl.metadata (10 kB)
Downloading statsbombpy-1.16.0-py3-none-an

In [None]:
# statsbombpy.sb is the main interface to the StatsBomb Open Data API.
# sb.competitions() fetches a catalog of all available competition/season combinations
# from the StatsBomb public GitHub repository. No credentials are required for open data;
# a NoAuthWarning is emitted to indicate that access is limited to the free dataset.
# The returned DataFrame includes: competition_id, season_id, country_name,
# competition_name, competition_gender, and availability timestamps.
from statsbombpy import sb

sb.competitions()

In [None]:
# sb.matches() retrieves the full match list for a specific competition and season.
# competition_id=43 is the FIFA World Cup; season_id=3 maps to the 2018 tournament.
# Returns a DataFrame with one row per match containing: match_id, match_date, kick_off,
# home_team, away_team, home_score, away_score, stadium, referee, match_week,
# competition_stage, home/away managers, and data version metadata.
# match_id is the key used to load the event stream in sb.events().
sb.matches(competition_id=43, season_id=3)

In [None]:
# sb.events() fetches the complete event stream for a single match.
# Each row represents one discrete event: mostly on-ball actions (pass, shot, dribble,
# carry, foul, goalkeeper action) plus off-ball and administrative events such as
# pressure, substitution, Starting XI, and Half Start.
# The event stream is returned as a wide-format DataFrame: every possible attribute
# across all event types becomes a column. Attributes irrelevant to a given event type
# are stored as NaN (e.g., shot_statsbomb_xg is NaN for a pass event).
# match_id=7585 is the Colombia vs England Round of 16 match, 2018 FIFA World Cup.
events = sb.events(match_id=7585)

In [None]:
# Inspect all available columns in the events DataFrame.
# StatsBomb's wide-format model exposes every possible attribute as a column.
# Columns are grouped by event type prefix: shot_*, pass_*, dribble_*, duel_*,
# goalkeeper_*, foul_committed_*, foul_won_*, block_*, interception_*, carry_*.
# Core columns present on every event: id, index, period, timestamp, minute, second,
# type, team, player, location, play_pattern, possession, possession_team.
# The 'tactics' column carries formation and lineup data for Starting XI events only.
events.columns

In [None]:
# Reduce the wide DataFrame to only the columns relevant for spatial and tactical analysis.
# This significantly reduces memory usage and simplifies downstream processing.
# Selected columns:
#   'team'              — the team of the player who performed the event (see 'possession_team' for the team in possession)
#   'tactics'           — formation and lineup JSON; non-null only on Starting XI / Tactical Shift events
#   'player'            — name of the player who performed the action
#   'type'              — event type label (Pass, Shot, Dribble, Pressure, etc.)
#   'location'          — [x, y] origin coordinates in the StatsBomb 120x80 pitch system
#   'minute'            — match minute of the event
#   'pass_end_location' — [x, y] destination of the pass; NaN for all non-pass events
events = events[['team', 'tactics', 'player', 'type', 'location', 'minute', 'pass_end_location']]

In [None]:
# Display the first 100 rows of the filtered event stream to verify structure.
# Key observations:
# - Rows 0-1 are Starting XI events; 'tactics' is populated with a dict containing
#   the team's formation (e.g. 433, 352) and the full starting lineup with player roles.
# - Rows 2-4 are Half Start events marking kick-off periods; no spatial data.
# - Pass events carry location as [x, y] and pass_end_location as [x, y] in the
#   StatsBomb coordinate system (origin at top-left, pitch is 120 x 80 yards).
# - 'tactics' is NaN for all events except Starting XI and Tactical Shift.
events.head(100)

In [41]:
events

Unnamed: 0,team,tactics,player,type,location,minute,pass_end_location
0,Colombia,"{'formation': 433, 'lineup': [{'player': {'id'...",,Starting XI,,0,
1,England,"{'formation': 352, 'lineup': [{'player': {'id'...",,Starting XI,,0,
2,England,,,Half Start,,0,
3,Colombia,,,Half Start,,0,
4,Colombia,,,Half Start,,45,
...,...,...,...,...,...,...,...
4014,England,,Kyle Walker,Substitution,,112,
4015,Colombia,,Santiago Arias Naranjo,Substitution,,115,
4016,Colombia,"{'formation': 442, 'lineup': [{'player': {'id'...",,Tactical Shift,,61,
4017,Colombia,,,Camera off,,76,


## Summary: StatsBomb Open Data Exploration

### What This Notebook Does

This notebook introduces the StatsBomb Open Data dataset via the `statsbombpy` library. It covers the three-level discovery hierarchy — competitions, matches, events — and demonstrates how to load the full event stream for a single match and reduce its wide-format structure to the columns relevant for further analysis.

StatsBomb Open Data provides free, high-resolution event-level data for selected competitions, making it a common entry point for academic and independent football data science work.
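
The step from the match list to a single event stream hinges on finding the right `match_id`. The sketch below shows one way to look it up by team names. The helper `find_match_id` is hypothetical (not part of `statsbombpy`), and the small DataFrame is a made-up stand-in for the output of `sb.matches()`; only the `match_id`, `home_team`, and `away_team` columns are assumed, and the home/away order shown is illustrative.

```python
import pandas as pd


def find_match_id(matches: pd.DataFrame, team_a: str, team_b: str) -> int:
    """Return the match_id of the single fixture between two teams, in either order."""
    home, away = matches["home_team"], matches["away_team"]
    mask = ((home == team_a) & (away == team_b)) | ((home == team_b) & (away == team_a))
    hits = matches.loc[mask, "match_id"]
    if len(hits) != 1:
        raise ValueError(f"expected exactly one match, found {len(hits)}")
    return int(hits.iloc[0])


# Made-up stand-in for sb.matches(competition_id=43, season_id=3);
# the second row and the home/away assignment are purely illustrative.
matches = pd.DataFrame({
    "match_id": [7585, 1],
    "home_team": ["Colombia", "Team A"],
    "away_team": ["England", "Team B"],
})

match_id = find_match_id(matches, "England", "Colombia")
print(match_id)
```

With a real match list, the returned value can be passed straight to `sb.events(match_id=match_id)`.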

### Key Concepts

- **StatsBomb event model**: Every discrete event is recorded as one row: mostly on-ball actions (pass, shot, carry, dribble, etc.) plus off-ball and administrative events such as pressure, substitution, and Starting XI. The model is exhaustive: over 80 columns capture attributes across all event types, most of which are NaN for any given row.
- **Coordinate system**: StatsBomb uses a 120 x 80 yard pitch model. Origin `(0, 0)` is the top-left corner, so `y` increases toward the bottom of the pitch, and coordinates are oriented so that the team performing the event attacks from left to right. All `location` and `pass_end_location` values are in this system.
- **Wide format**: The events DataFrame is intentionally sparse. Downstream analysis typically starts by filtering rows to a specific event type (e.g. `type == 'Pass'`) and then selecting only the relevant columns.
- **`tactics` column**: Populated only for `Starting XI` and `Tactical Shift` events. It contains a nested dict with the team's formation code and the full lineup, including each player's listed position — the primary source for formation data.
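
As a rough illustration of the `tactics` structure, the sketch below flattens the nested dicts into one row per listed player. The helper name `lineup_table` is hypothetical, the sample event is made up, and the inner key layout (`player` → `name`, `position` → `name`, `jersey_number`) is an assumption about the nested dict that should be checked against a real `Starting XI` row.

```python
import pandas as pd


def lineup_table(events: pd.DataFrame) -> pd.DataFrame:
    """Flatten every non-null tactics dict into one row per listed player."""
    rows = []
    with_tactics = events[events["tactics"].notna()]
    for _, ev in with_tactics.iterrows():
        tactics = ev["tactics"]
        for entry in tactics["lineup"]:
            rows.append({
                "team": ev["team"],
                "event_type": ev["type"],
                "minute": ev["minute"],
                "formation": tactics["formation"],
                # Assumed key layout of each lineup entry.
                "player": entry["player"]["name"],
                "position": entry["position"]["name"],
                "jersey_number": entry.get("jersey_number"),
            })
    return pd.DataFrame(rows)


# Made-up sample: one Starting XI event with two listed players, one pass.
events = pd.DataFrame({
    "team": ["Colombia", "Colombia"],
    "type": ["Starting XI", "Pass"],
    "minute": [0, 1],
    "tactics": [
        {
            "formation": 433,
            "lineup": [
                {"player": {"id": 1, "name": "Player A"},
                 "position": {"id": 1, "name": "Goalkeeper"},
                 "jersey_number": 1},
                {"player": {"id": 2, "name": "Player B"},
                 "position": {"id": 2, "name": "Right Back"},
                 "jersey_number": 4},
            ],
        },
        None,
    ],
})

lineups = lineup_table(events)
print(lineups)
```

Keeping `event_type` and `minute` in the output makes it possible to compare the starting shape with any later `Tactical Shift`.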

### Data Available

| Function | Output |
|---|---|
| `sb.competitions()` | 75 competition/season entries in the open data catalog |
| `sb.matches(competition_id, season_id)` | Full match list with metadata: teams, score, stadium, referee, managers |
| `sb.events(match_id)` | Complete event stream — 4,019 events for Colombia vs England (WC 2018 R16) |
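
Once `sb.events(match_id)` has returned the event stream, a typical first step is to narrow it to one event type and discard the columns that are empty for that type. The sketch below does this for passes and splits the `[x, y]` lists into numeric columns. The helper name `passes_with_xy` is hypothetical, and the sample rows are made up rather than taken from the match.

```python
import numpy as np
import pandas as pd


def passes_with_xy(events: pd.DataFrame) -> pd.DataFrame:
    """Keep pass rows, drop all-empty columns, and split coordinates into x/y columns."""
    passes = events[events["type"] == "Pass"].dropna(axis=1, how="all").copy()
    # Each cell holds a two-element list, so stacking gives an (n, 2) array.
    start = np.array(passes["location"].tolist(), dtype=float)
    end = np.array(passes["pass_end_location"].tolist(), dtype=float)
    passes["x"], passes["y"] = start[:, 0], start[:, 1]
    passes["end_x"], passes["end_y"] = end[:, 0], end[:, 1]
    return passes.drop(columns=["location", "pass_end_location"])


# Made-up sample with the same column names as the filtered notebook DataFrame.
events = pd.DataFrame({
    "type": ["Starting XI", "Pass", "Pass"],
    "team": ["Colombia", "England", "Colombia"],
    "tactics": [{"formation": 433}, None, None],
    "location": [None, [60.0, 40.0], [35.2, 10.0]],
    "pass_end_location": [None, [70.0, 45.0], [55.5, 12.0]],
})

passes = passes_with_xy(events)
print(passes)
```

Because `tactics` is empty for every pass row, `dropna(axis=1, how="all")` removes it automatically; the same pattern should prune most of the unrelated attribute columns in the full wide DataFrame.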

### Ideas to Extract More Value

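- **Sequence reconstruction**: The `possession` column numbers each possession sequence, and the `related_events` column links each event to the IDs of directly related events (for example, a pass and its ball receipt, or a shot and the goalkeeper's reaction). Both are available in the full unfiltered DataFrame; grouping by `possession` and following `related_events` links enables pre-shot sequence analysis and possession chain modeling.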
- **Tactics column parsing**: Extract the `tactics` column for `Starting XI` events and normalize the nested lineup dicts to get each player's position, enabling formation-level analysis and role classification.
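- **Pressing intensity (PPDA)**: PPDA (Passes Per Defensive Action) is a widely used metric of pressing aggressiveness: the opponent's passes in their build-up zone divided by the pressing team's defensive actions (conventionally tackles, interceptions, and fouls) in that same zone, with lower values indicating a more aggressive press. Counting `type == 'Pressure'` events in the same zone gives a complementary, StatsBomb-specific measure of pressing volume.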
- **Player heatmaps**: Using the `location` field across all events for a single player produces an activity heatmap that reveals positional tendencies and zone of influence within the team's shape.
- **Multi-match player profiles**: Concatenate the `sb.events()` output for all matches in a competition to build per-player season aggregates: total xG generated and conceded, pass accuracy, progressive carries, defensive actions per 90.
- **Under-pressure passing accuracy**: The `under_pressure` column flags events executed while being pressured. Comparing pass accuracy under pressure vs not under pressure is a simple first proxy for a player's composure on the ball, provided each group contains enough passes to compare.
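
The under-pressure comparison from the last idea can be sketched as follows. This is a minimal example on made-up rows, not a result from the match. It assumes two conventions that should be verified against the full unfiltered DataFrame: a completed pass has a missing `pass_outcome`, and `under_pressure` is `True` when flagged and missing otherwise. The helper name `completion_by_pressure` is hypothetical.

```python
import pandas as pd


def completion_by_pressure(events: pd.DataFrame) -> pd.Series:
    """Pass completion rate, split by whether the pass was made under pressure."""
    passes = events[events["type"] == "Pass"]
    # Assumed convention: a missing pass_outcome means the pass was completed.
    completed = passes["pass_outcome"].isna()
    # Assumed convention: under_pressure is True when flagged, missing otherwise.
    pressured = passes["under_pressure"].eq(True)
    return pd.Series({
        "under_pressure": completed[pressured].mean(),
        "no_pressure": completed[~pressured].mean(),
    })


# Made-up sample: five passes and one shot (the shot is ignored).
events = pd.DataFrame({
    "type": ["Pass", "Pass", "Pass", "Pass", "Pass", "Shot"],
    "pass_outcome": [None, "Incomplete", None, None, "Out", None],
    "under_pressure": [True, True, None, None, None, True],
})

rates = completion_by_pressure(events)
print(rates)
```

Adding a `player` filter or a group-by on `player` before computing the rates would turn this into a per-player comparison.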