# Data Extraction & Transformation

##### Parsing raw StatsBomb data and storing it in a Pandas DataFrame

---

In [None]:
import requests
import pandas as pd
from tqdm import tqdm

- `requests` is a great library for executing HTTP requests
- `pandas` is a data analysis and manipulation package
- `tqdm` is a clean progress bar library

---

In [None]:
base_url = "https://raw.githubusercontent.com/statsbomb/open-data/master/data/"
comp_url = base_url + "matches/{}/{}.json"
match_url = base_url + "events/{}.json"

These URLs point to where the raw StatsBomb data lives on GitHub. The `{}` placeholders are replaced with IDs at request time using `.format()`.
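As a quick illustration of the substitution, the snippet below fills the placeholders with the World Cup IDs used later in this notebook and with match ID `7578`, which appears in the output further down. It is a minimal sketch that only builds strings; no request is sent. The names `example_comp_url` and `example_match_url` are introduced here for illustration.

```python
base_url = "https://raw.githubusercontent.com/statsbomb/open-data/master/data/"
comp_url = base_url + "matches/{}/{}.json"
match_url = base_url + "events/{}.json"

# Positional placeholders are filled left to right.
example_comp_url = comp_url.format(43, 3)      # competition_id, season_id
example_match_url = match_url.format(7578)     # match_id

print(example_comp_url)
print(example_match_url)
```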

---

In [None]:
def parse_data(competition_id, season_id):
    """Download every match of one competition season and return its shots as a DataFrame."""
    # Load the list of matches and collect their IDs
    matches = requests.get(url=comp_url.format(competition_id, season_id)).json()
    match_ids = [m['match_id'] for m in matches]

    all_events = []
    for match_id in tqdm(match_ids):
        # Load every event recorded for this match
        events = requests.get(url=match_url.format(match_id)).json()

        # Keep only the shot events
        shots = [x for x in events if x['type']['name'] == "Shot"]
        for s in shots:
            # Flatten the nested shot record into one row
            attributes = {
                "match_id": match_id,
                "team": s["possession_team"]["name"],
                "player": s['player']['name'],
                "x": s['location'][0],
                "y": s['location'][1],
                "outcome": s['shot']['outcome']['name'],
            }
            all_events.append(attributes)

    return pd.DataFrame(all_events)

The `parse_data` function handles the full Extract & Transform process.

The function works in the following steps:
1. Load the list of matches into the `matches` list.
2. Extract the match IDs into a separate list using a list comprehension on `matches`.
3. Iterate over the match IDs and load each match's raw data into the `events` list.
4. Extract the shots into a separate list using a list comprehension as a filter on `events`.
5. Iterate over the shots and store each shot's individual features in the `attributes` dictionary.
6. Append each shot's `attributes` to the `all_events` list.
7. Return a Pandas DataFrame built from the `all_events` list.
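The filtering and flattening in steps 4 to 6 can be tried without any network access. The sketch below is not part of the original notebook: `extract_shots` is a hypothetical helper that repeats the inner loop of `parse_data`, and `sample_events` is hand-written data in the same nested shape as the raw events, not real StatsBomb records.

```python
def extract_shots(events, match_id):
    """Return one flat dict per shot event found in `events`."""
    # Step 4: keep only the events whose type is "Shot"
    shots = [e for e in events if e["type"]["name"] == "Shot"]

    rows = []
    for s in shots:
        # Step 5: pull the nested fields out into a flat record
        rows.append({
            "match_id": match_id,
            "team": s["possession_team"]["name"],
            "player": s["player"]["name"],
            "x": s["location"][0],
            "y": s["location"][1],
            "outcome": s["shot"]["outcome"]["name"],
        })  # Step 6: collect the record
    return rows


# Hand-written sample (illustrative only): one pass and one shot
sample_events = [
    {
        "type": {"name": "Pass"},
        "possession_team": {"name": "Team A"},
        "player": {"name": "Player 1"},
        "location": [60.0, 40.0],
    },
    {
        "type": {"name": "Shot"},
        "possession_team": {"name": "Team A"},
        "player": {"name": "Player 2"},
        "location": [105.0, 38.0],
        "shot": {"outcome": {"name": "Saved"}},
    },
]

rows = extract_shots(sample_events, match_id=1)
print(rows)
```

Passing `rows` to `pd.DataFrame` would give a one-row table with the same columns that `parse_data` returns.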

---

In [None]:
competition_id = 43
season_id = 3

- `competition_id = 43` - StatsBomb's Competition ID for the World Cup
- `season_id = 3` - StatsBomb's Season ID for the 2018 Season
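These IDs are listed in the repository's `competitions.json` file, which sits in the same `data/` folder as `matches/` and `events/`. The sketch below shows one way to look them up by name. `find_season` is a hypothetical helper, and `sample_competitions` is a hand-written stand-in that uses the file's field names, so nothing is downloaded; the second entry is made up for contrast.

```python
def find_season(competitions, competition_name, season_name):
    """Return (competition_id, season_id) for the first matching entry, or None."""
    for c in competitions:
        if (c["competition_name"] == competition_name
                and c["season_name"] == season_name):
            return c["competition_id"], c["season_id"]
    return None


# Stand-in for: requests.get(base_url + "competitions.json").json()
sample_competitions = [
    {"competition_id": 43, "season_id": 3,
     "competition_name": "FIFA World Cup", "season_name": "2018"},
    # Made-up entry, not a real competition
    {"competition_id": 999, "season_id": 1,
     "competition_name": "Example League", "season_name": "2020/2021"},
]

ids = find_season(sample_competitions, "FIFA World Cup", "2018")
print(ids)
```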

In [None]:
df = parse_data(competition_id, season_id)

100%|██████████| 64/64 [00:24<00:00,  2.60it/s]


The `parse_data` function is executed, and its output is stored in the variable `df`.

The progress bar is produced by `tqdm`; it shows that all 64 matches were fetched in about 24 seconds.

---

In [None]:
df.head(10)

| | match_id | team | player | x | y | outcome |
|---|---|---|---|---|---|---|
| 0 | 7578 | Uruguay | Edinson Roberto Cavani Gómez | 97.0 | 32.0 | Saved |
| 1 | 7578 | Egypt | Mahmoud Ibrahim Hassan | 108.0 | 51.0 | Saved |
| 2 | 7578 | Uruguay | Luis Alberto Suárez Díaz | 109.0 | 55.0 | Off T |
| 3 | 7578 | Uruguay | Edinson Roberto Cavani Gómez | 102.0 | 23.0 | Blocked |
| 4 | 7578 | Uruguay | José Martín Cáceres Silva | 114.0 | 48.0 | Wayward |
| 5 | 7578 | Uruguay | Luis Alberto Suárez Díaz | 116.0 | 35.0 | Off T |
| 6 | 7578 | Egypt | Marwan Mohsen | 100.0 | 51.0 | Saved |
| 7 | 7578 | Uruguay | Matías Vecino Falero | 83.0 | 53.0 | Off T |
| 8 | 7578 | Uruguay | Luis Alberto Suárez Díaz | 88.0 | 38.0 | Blocked |
| 9 | 7578 | Egypt | Abdalla Mahmoud El Said Bekhit | 105.0 | 48.0 | Wayward |


The `.head(10)` method on a DataFrame object shows you the first 10 records in the DataFrame.

The DataFrame holds roughly 1,700 shots, one row for every shot attempted at the 2018 Men's World Cup.
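A count like this can be checked directly on the DataFrame. The sketch below uses `sample_df`, a stand-in built from the first three rows of the output above, so it runs without downloading anything; the same three calls would work on the full `df`.

```python
import pandas as pd

# First three rows of the output above, used as a small stand-in for `df`
sample_df = pd.DataFrame([
    {"match_id": 7578, "team": "Uruguay", "player": "Edinson Roberto Cavani Gómez",
     "x": 97.0, "y": 32.0, "outcome": "Saved"},
    {"match_id": 7578, "team": "Egypt", "player": "Mahmoud Ibrahim Hassan",
     "x": 108.0, "y": 51.0, "outcome": "Saved"},
    {"match_id": 7578, "team": "Uruguay", "player": "Luis Alberto Suárez Díaz",
     "x": 109.0, "y": 55.0, "outcome": "Off T"},
])

n_shots = len(sample_df)                           # one row per shot
by_outcome = sample_df["outcome"].value_counts()   # shots per outcome
by_team = sample_df.groupby("team").size()         # shots per team

print(n_shots)
print(by_outcome)
print(by_team)
```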