# Data Extraction & Transformation

##### Parsing raw StatsBomb data and storing it in a Pandas DataFrame

---

In [1]:
import requests
import pandas as pd
from tqdm import tqdm

- `requests` is a great library for executing HTTP requests
- `pandas` is a data analysis and manipulation package
- `tqdm` is a clean progress bar library

---

In [2]:
base_url = "https://raw.githubusercontent.com/statsbomb/open-data/master/data/"
comp_url = base_url + "matches/{}/{}.json"
match_url = base_url + "events/{}.json"

These URLs point to where the raw StatsBomb data lives. The `{}` placeholders are replaced with IDs at request time using `.format()`.
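As a quick illustration of the substitution, the templates can be filled in with the World Cup IDs used later in this notebook (`43` and `3`) and a match ID taken from the sample output (`7562`). The variable names `world_cup_matches` and `single_match` are introduced here only for the example.

```python
base_url = "https://raw.githubusercontent.com/statsbomb/open-data/master/data/"
comp_url = base_url + "matches/{}/{}.json"
match_url = base_url + "events/{}.json"

# Positional placeholders are filled left to right:
# first the competition ID, then the season ID.
world_cup_matches = comp_url.format(43, 3)
single_match = match_url.format(7562)

print(world_cup_matches)
print(single_match)
```

The first URL ends in `matches/43/3.json` and the second in `events/7562.json`.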

---

In [3]:
def parse_data(competition_id, season_id):
    """Download every match of a competition season and return its shots as a DataFrame."""
    # Fetch the list of matches and keep only their IDs.
    matches = requests.get(url=comp_url.format(competition_id, season_id)).json()
    match_ids = [m["match_id"] for m in matches]

    all_events = []
    for match_id in tqdm(match_ids):
        # Fetch the raw event stream for this match.
        events = requests.get(url=match_url.format(match_id)).json()

        # Keep only shot events, then flatten each one into a row.
        shots = [x for x in events if x["type"]["name"] == "Shot"]
        for s in shots:
            attributes = {
                "match_id": match_id,
                "team": s["possession_team"]["name"],
                "player": s["player"]["name"],
                "x": s["location"][0],
                "y": s["location"][1],
                "outcome": s["shot"]["outcome"]["name"],
            }
            all_events.append(attributes)

    return pd.DataFrame(all_events)

The `parse_data` function handles the full Extract & Transform process.

The function works in seven steps:
1. Load the list of matches for the competition and season into `matches`.
2. Extract the match IDs into a separate list using a list comprehension over `matches`.
3. Iterate over the match IDs and load each match's raw event data into `events`.
4. Filter `events` down to shots using a list comprehension.
5. Iterate over the shots and collect each shot's features in an `attributes` dictionary.
6. Append each `attributes` dictionary to the `all_events` list.
7. Return a Pandas DataFrame built from `all_events`.
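
To see steps 4 to 6 in isolation, here is a minimal sketch that runs the same filter and flattening on two hand-written events instead of a download. The sample events are hypothetical and heavily trimmed (real StatsBomb events carry many more fields), and `extract_shots` is a helper name introduced only for this example.

```python
# Hypothetical, trimmed-down events: one pass and one shot.
sample_events = [
    {"type": {"name": "Pass"}, "possession_team": {"name": "Peru"},
     "player": {"name": "Player A"}, "location": [60.0, 40.0]},
    {"type": {"name": "Shot"}, "possession_team": {"name": "Peru"},
     "player": {"name": "Player B"}, "location": [104.0, 53.0],
     "shot": {"outcome": {"name": "Goal"}}},
]


def extract_shots(events, match_id):
    """Return one flat dictionary per shot event."""
    # Step 4: keep only events whose type is "Shot".
    shots = [e for e in events if e["type"]["name"] == "Shot"]
    # Steps 5 and 6: flatten each shot into a row and collect the rows.
    return [
        {
            "match_id": match_id,
            "team": s["possession_team"]["name"],
            "player": s["player"]["name"],
            "x": s["location"][0],
            "y": s["location"][1],
            "outcome": s["shot"]["outcome"]["name"],
        }
        for s in shots
    ]


rows = extract_shots(sample_events, match_id=0)
print(rows)
```

Only the shot survives the filter, so `rows` holds a single dictionary. Passing `rows` to `pd.DataFrame` would give a one-row table with the same columns as the full result.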

---

In [4]:
competition_id = 43
season_id = 3

- `competition_id = 43` - StatsBomb's Competition ID for the World Cup
- `season_id = 3` - StatsBomb's Season ID for the 2018 Season
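
These IDs do not have to be memorized. The open-data repository also publishes a `competitions.json` index alongside the `matches` and `events` folders. The sketch below assumes each entry in that index carries `competition_id`, `season_id`, `competition_name`, and `season_name` fields, and it searches a hand-written sample rather than a live download. `find_ids` is a hypothetical helper, and the names in the sample are illustrative.

```python
# Hand-written sample mirroring the assumed shape of competitions.json entries.
sample_competitions = [
    {"competition_id": 43, "season_id": 3,
     "competition_name": "FIFA World Cup", "season_name": "2018"},
    {"competition_id": 0, "season_id": 0,
     "competition_name": "Example League", "season_name": "2000/2001"},
]


def find_ids(competitions, competition_name, season_name):
    """Return (competition_id, season_id) for the first matching entry, else None."""
    for entry in competitions:
        if (entry["competition_name"] == competition_name
                and entry["season_name"] == season_name):
            return entry["competition_id"], entry["season_id"]
    return None


ids = find_ids(sample_competitions, "FIFA World Cup", "2018")
print(ids)
```

With the real index, the list would come from `requests.get(base_url + "competitions.json").json()` instead of the sample.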

In [5]:
df = parse_data(competition_id, season_id)

100%|██████████| 64/64 [00:13<00:00,  4.86it/s]


The `parse_data` function is executed, and its output is stored in the variable `df`.

The progress bar is produced by `tqdm`; it advances once for each of the 64 matches.

---

In [6]:
df.head(10)

| | match_id | team | player | x | y | outcome |
|---|---|---|---|---|---|---|
| 0 | 7562 | Australia | Mile Jedinak | 97.0 | 53.0 | Off T |
| 1 | 7562 | Australia | Tom Rogić | 95.0 | 46.0 | Blocked |
| 2 | 7562 | Peru | André Martín Carrillo Díaz | 104.0 | 53.0 | Goal |
| 3 | 7562 | Australia | Mathew Leckie | 112.0 | 42.0 | Wayward |
| 4 | 7562 | Peru | José Paolo Guerrero González | 109.0 | 37.0 | Saved |
| 5 | 7562 | Australia | Tom Rogić | 105.0 | 40.0 | Saved |
| 6 | 7562 | Peru | Víctor Yoshimar Yotún Flores | 83.0 | 33.0 | Off T |
| 7 | 7562 | Australia | Trent Sainsbury | 116.0 | 43.0 | Off T |
| 8 | 7562 | Australia | Trent Sainsbury | 115.0 | 34.0 | Off T |
| 9 | 7562 | Peru | José Paolo Guerrero González | 111.0 | 36.0 | Goal |


The `.head(10)` method on a DataFrame object shows you the first 10 records in the DataFrame.

There are roughly 1,700 shots in this DataFrame, one row for every shot attempted at the 2018 Men's World Cup.
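
A few quick checks help confirm that the extraction worked. The sketch below rebuilds the first five rows of the output as a small stand-in DataFrame named `sample` (a name introduced here so the example runs without a download) and counts shots and goals. The same calls should work on the full `df`.

```python
import pandas as pd

# First five rows of the output, typed in by hand for illustration.
sample = pd.DataFrame(
    [
        (7562, "Australia", "Mile Jedinak", 97.0, 53.0, "Off T"),
        (7562, "Australia", "Tom Rogić", 95.0, 46.0, "Blocked"),
        (7562, "Peru", "André Martín Carrillo Díaz", 104.0, 53.0, "Goal"),
        (7562, "Australia", "Mathew Leckie", 112.0, 42.0, "Wayward"),
        (7562, "Peru", "José Paolo Guerrero González", 109.0, 37.0, "Saved"),
    ],
    columns=["match_id", "team", "player", "x", "y", "outcome"],
)

# Total number of shots is simply the number of rows.
total_shots = len(sample)

# Shots per team: group the rows by team and count them.
shots_per_team = sample.groupby("team").size()

# Goals: count the rows whose outcome is exactly "Goal".
goals = int((sample["outcome"] == "Goal").sum())

print(total_shots, goals)
print(shots_per_team)
```

On the full data, `len(df)` gives the exact shot count behind the "roughly 1,700" figure.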

---

Devin Pleuler 2020