# Data Extraction and Transformation

#### Parse raw `StatsBomb` data and store it in a Pandas DataFrame

In [6]:
import requests
import pandas as pd
from tqdm import tqdm

In [2]:
base_url = "https://raw.githubusercontent.com/statsbomb/open-data/master/data/"
comp_url = base_url + "matches/{}/{}.json"
match_url = base_url + "events/{}.json"

In [9]:
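def parse_data(competition_id, season_id):
    """Download all matches of one competition season and return their shots as a DataFrame."""
    # Load the list of matches for the requested competition and season.
    response = requests.get(url=comp_url.format(competition_id, season_id))
    response.raise_for_status()  # fail early on HTTP errors instead of parsing an error page
    matches = response.json()
    match_ids = [m["match_id"] for m in matches]

    all_events = []

    for match_id in tqdm(match_ids):
        # Load the raw event list for a single match.
        response = requests.get(url=match_url.format(match_id))
        response.raise_for_status()
        events = response.json()

        # Keep only shot events, then flatten the fields of interest.
        shots = [x for x in events if x["type"]["name"] == "Shot"]
        for s in shots:
            attributes = {
                "match_id": match_id,
                "team": s["possession_team"]["name"],
                "player": s["player"]["name"],
                "x": s["location"][0],
                "y": s["location"][1],
                "outcome": s["shot"]["outcome"]["name"],
            }
            all_events.append(attributes)

    return pd.DataFrame(all_events)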

The `parse_data` function handles the full Extract & Transform process.

The function proceeds as follows:
1. Load the list of matches for the given competition and season into the `matches` list.
2. Extract the match IDs into a separate list with a list comprehension over `matches`.
3. Iterate over the match IDs and load each match's raw data into the `events` list.
4. Extract the shots into a separate list, using a list comprehension as a filter on `events`.
5. Iterate over the shots and store each shot's individual features in the `attributes` dictionary.
6. Append each shot's `attributes` to the `all_events` list.
7. Return a Pandas DataFrame built from the `all_events` list.
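
Steps 4 to 6 can be tried without any network access. The sketch below is only an illustration and not part of the original notebook: `extract_shots` is a hypothetical helper that mirrors the inner loop of `parse_data`, and `sample_events` is made-up data that merely imitates the nested fields the function reads.

```python
def extract_shots(events, match_id):
    """Return one flat dict per shot event found in a match's event list."""
    # Step 4: filter the events down to shots.
    shots = [e for e in events if e["type"]["name"] == "Shot"]
    # Steps 5 and 6: flatten each shot into a row and collect the rows.
    return [
        {
            "match_id": match_id,
            "team": s["possession_team"]["name"],
            "player": s["player"]["name"],
            "x": s["location"][0],
            "y": s["location"][1],
            "outcome": s["shot"]["outcome"]["name"],
        }
        for s in shots
    ]


# Hand-written sample events (hypothetical, not real StatsBomb data).
sample_events = [
    {
        "type": {"name": "Pass"},
        "possession_team": {"name": "Team A"},
        "player": {"name": "Player One"},
        "location": [60.0, 40.0],
    },
    {
        "type": {"name": "Shot"},
        "possession_team": {"name": "Team A"},
        "player": {"name": "Player Two"},
        "location": [108.0, 38.0],
        "shot": {"outcome": {"name": "Saved"}},
    },
]

rows = extract_shots(sample_events, match_id=1)
print(rows)
```

Passing `rows` to `pd.DataFrame` would give a frame with the same columns that `parse_data` returns.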

---

In [4]:
competition_id = 43
season_id = 3

In [10]:
df = parse_data(competition_id,season_id)

100%|██████████████████████████████████████████████████████████████████████████████████| 64/64 [01:11<00:00,  1.12s/it]


In [12]:
df.head()

| | match_id | team | player | x | y | outcome |
|---|---|---|---|---|---|---|
| 0 | 7581 | Denmark | Mathias Jattah-Njie Jørgensen | 115.0 | 34.0 | Goal |
| 1 | 7581 | Croatia | Mario Mandžukić | 112.0 | 36.0 | Goal |
| 2 | 7581 | Croatia | Ivan Perišić | 101.0 | 55.0 | Blocked |
| 3 | 7581 | Croatia | Ivan Perišić | 103.0 | 24.0 | Blocked |
| 4 | 7581 | Denmark | Christian Dannemann Eriksen | 96.0 | 37.0 | Blocked |
