<a id='top'></a>

# Data Parsing of StatsBomb Data
##### Notebook to parse JSON data from the [StatsBomb Open Data GitHub repository](https://github.com/statsbomb/open-data) to create one unified Events DataFrame.

### By [Edd Webster](https://www.twitter.com/eddwebster)
Notebook first written: 10/11/2020<br>
Notebook last updated: 15/02/2021

![title](../../img/logos/stats-bomb-logo.png)

Click [here](#section5) to jump straight to the Export DataFrame section and skip the [Project Brief](#section2), [Data Sources](#section3), and [Data Engineering](#section4) sections. Or click [here](#section6) to jump straight to the Summary.

___


## <a id='import_libraries'>Introduction</a>
This notebook parses publicly available [StatsBomb](https://statsbomb.com/) Event data, using [pandas](http://pandas.pydata.org/) for data manipulation through DataFrames.

For more information about this notebook and the author, I'm available through all the following channels:
*    [eddwebster.com](https://www.eddwebster.com/);
*    edd.j.webster@gmail.com;
*    [@eddwebster](https://www.twitter.com/eddwebster);
*    [linkedin.com/in/eddwebster](https://www.linkedin.com/in/eddwebster/);
*    [github/eddwebster](https://github.com/eddwebster/);
*    [public.tableau.com/profile/edd.webster](https://public.tableau.com/profile/edd.webster);
*    [kaggle.com/eddwebster](https://www.kaggle.com/eddwebster); and
*    [hackerrank.com/eddwebster](https://www.hackerrank.com/eddwebster).

![title](../../img/edd_webster/fifa21eddwebsterbanner.png)

The accompanying GitHub repository for this notebook can be found [here](https://github.com/eddwebster/football_analytics) and a static version of this notebook can be found [here](https://nbviewer.jupyter.org/github/eddwebster/football_analytics/blob/master/notebooks/2_data_parsing/StatsBomb%20Parsing%20and%20Data%20Engineering.ipynb).

___

## <a id='notebook_contents'>Notebook Contents</a>
1.    [Notebook Dependencies](#section1)<br>
2.    [Project Brief](#section2)<br>
3.    [Data Sources](#section3)<br>
      1.    [Introduction](#section3.1)<br>
      2.    [Download the Data](#section3.2)<br>
      3.    [Read in the Datasets](#section3.3)<br>
      4.    [Join the Datasets](#section3.4)<br>
      5.    [Initial Data Handling](#section3.5)<br>
4.    [Data Engineering](#section4)<br>
      1.    [Assign Raw DataFrame to Engineered DataFrame](#section4.1)<br>
      2.    [Sort the DataFrame](#section4.2)<br>
      3.    [Create Sort the DataFrame](#section4.3)<br>
      4.    [Subset DataFrame](#section4.4)<br>
5.    [Export DataFrame](#section5)<br>
6.    [Summary](#section6)<br>
7.    [Next Steps](#section7)<br>
8.    [Bibliography](#section8)<br>

___

<a id='section1'></a>

## <a id='#section1'>1. Notebook Dependencies</a>

This notebook was written using [Python 3](https://docs.python.org/3.7/) and requires the following libraries:
*    [`Jupyter notebooks`](https://jupyter.org/) for this notebook environment with which this project is presented;
*    [`NumPy`](http://www.numpy.org/) for multidimensional array computing;
*    [`pandas`](http://pandas.pydata.org/) for data analysis and manipulation; and
*    `tqdm` for a clean progress bar.

All packages used for this notebook except for BeautifulSoup can be obtained by downloading and installing the [Conda](https://anaconda.org/anaconda/conda) distribution, available on all platforms (Windows, Linux and Mac OSX). Step-by-step guides on how to install Anaconda can be found for Windows [here](https://medium.com/@GalarnykMichael/install-python-on-windows-anaconda-c63c7c3d1444) and Mac [here](https://medium.com/@GalarnykMichael/install-python-on-mac-anaconda-ccd9f2014072), as well as in the Anaconda documentation itself [here](https://docs.anaconda.com/anaconda/install/).

### Import Libraries and Modules

In [1]:
%load_ext autoreload
%autoreload 2

# Python ≥3.5 (ideally)
import platform
import sys, getopt
assert sys.version_info >= (3, 5)
import csv

# Import Dependencies
%matplotlib inline

# Math Operations
import numpy as np
from math import pi

# Datetime
import datetime
from datetime import date
import time

# Data Preprocessing
import pandas as pd    # version 1.0.3
import os    #  used to read the csv filenames
import re
import random
from io import BytesIO
from pathlib import Path

# Reading directories
import glob

# Working with JSON
import json
import codecs
from pandas import json_normalize    # top-level function since pandas 1.0; pandas.io.json.json_normalize is deprecated

# Football Libraries
from FCPython import createPitch

# Data Visualisation
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('seaborn-whitegrid')
import missingno as msno    # visually display missing data

# Progress Bar
from tqdm import tqdm    # a clean progress bar library

# Display in Jupyter
from IPython.display import Image, Video, YouTubeVideo
from IPython.core.display import HTML

# Ignore Warnings
import warnings
warnings.filterwarnings(action="ignore", message="^internal gelsd")

print('Setup Complete')

Setup Complete


In [2]:
# Python / module versions used here for reference
print('Python: {}'.format(platform.python_version()))
print('NumPy: {}'.format(np.__version__))
print('pandas: {}'.format(pd.__version__))
print('matplotlib: {}'.format(mpl.__version__))
print('Seaborn: {}'.format(sns.__version__))

Python: 3.7.6
NumPy: 1.18.0
pandas: 1.2.0
matplotlib: 3.3.2
Seaborn: 0.11.1


### Defined Variables

In [3]:
# Define today's date
today = datetime.datetime.now().strftime('%d%m%Y')

### Defined Filepaths

In [4]:
# Set up initial paths to subfolders
base_dir = os.path.join('..', '..', )
data_dir = os.path.join(base_dir, 'data')
data_dir_sb = os.path.join(base_dir, 'data', 'sb')
scripts_dir = os.path.join(base_dir, 'scripts')
scripts_dir_sb = os.path.join(base_dir, 'scripts', 'sb')
data_dir_understat = os.path.join(base_dir, 'data', 'understat')
img_dir = os.path.join(base_dir, 'img')
fig_dir = os.path.join(base_dir, 'img', 'fig')
video_dir = os.path.join(base_dir, 'video')

### Custom Functions

In [5]:
# Define custom function to read JSON files as UTF-8 text, so that special characters (e.g. accents in the names of players and teams) are preserved
def read_json_file(filename):
    with open(filename, 'rb') as json_file:
        # Decode as UTF-8; 'unicode_escape' would garble multi-byte characters such as 'è'
        return json_file.read().decode('utf-8')
    
# Define custom function to flatten pandas DataFrames with nested JSON columns. Source: https://stackoverflow.com/questions/39899005/how-to-flatten-a-pandas-dataframe-with-some-columns-as-json
def flatten_nested_json_df(df):

    df = df.reset_index()

    print(f"original shape: {df.shape}")
    print(f"original columns: {df.columns}")


    # search for columns to explode/flatten
    s = (df.applymap(type) == list).all()
    list_columns = s[s].index.tolist()

    s = (df.applymap(type) == dict).all()
    dict_columns = s[s].index.tolist()

    print(f"lists: {list_columns}, dicts: {dict_columns}")
    while len(list_columns) > 0 or len(dict_columns) > 0:
        new_columns = []

        for col in dict_columns:
            print(f"flattening: {col}")
            # explode dictionaries horizontally, adding new columns
            horiz_exploded = pd.json_normalize(df[col]).add_prefix(f'{col}.')
            horiz_exploded.index = df.index
            df = pd.concat([df, horiz_exploded], axis=1).drop(columns=[col])
            new_columns.extend(horiz_exploded.columns) # inplace

        for col in list_columns:
            print(f"exploding: {col}")
            # explode lists vertically, adding new rows
            df = df.drop(columns=[col]).join(df[col].explode().to_frame())
            new_columns.append(col)

        # check if there are still dict or list fields to flatten
        s = (df[new_columns].applymap(type) == list).all()
        list_columns = s[s].index.tolist()

        s = (df[new_columns].applymap(type) == dict).all()
        dict_columns = s[s].index.tolist()

        print(f"lists: {list_columns}, dicts: {dict_columns}")

    print(f"final shape: {df.shape}")
    print(f"final columns: {df.columns}")
    return df
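
The two operations that `flatten_nested_json_df` repeats can be seen in isolation on a toy example. The sketch below uses made-up data (the `managers` values are hypothetical, not taken from the StatsBomb files): a list column is exploded into extra rows, and the resulting dicts are then spread into prefixed columns.

```python
import pandas as pd

# Hypothetical toy data: one list-of-dicts column, as in the raw matches data
df_toy = pd.DataFrame({
    'match_id': [1, 2],
    'managers': [
        [{'id': 10, 'name': 'A'}, {'id': 11, 'name': 'B'}],
        [{'id': 12, 'name': 'C'}],
    ],
})

# Lists are exploded vertically: one row per list element
df_exploded = df_toy.explode('managers').reset_index(drop=True)

# Dicts are flattened horizontally: one column per key, prefixed with the source column name
df_dicts = pd.json_normalize(df_exploded['managers'].tolist()).add_prefix('managers.')

# Replace the nested column with its flattened columns
df_toy_flat = pd.concat([df_exploded.drop(columns=['managers']), df_dicts], axis=1)
print(df_toy_flat)
```

Because exploding duplicates the other columns for every list element, the row count of the flattened DataFrame can be larger than that of the input.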

### Notebook Settings

In [6]:
pd.set_option('display.max_columns', None)

---

<a id='section2'></a>

## <a id='#section2'>2. Project Brief</a>
This Jupyter notebook explores how to parse publicly available Event data from [StatsBomb](https://statsbomb.com/) using [pandas](http://pandas.pydata.org/) for data manipulation through DataFrames.


The combined event data produced in this notebook is exported to CSV. This data can be further analysed in Python, joined to other datasets, or explored using Tableau, Power BI, or Microsoft Excel.


**Notebook Conventions**:<br>
*    Variables that refer to a `DataFrame` object are prefixed with `df_`.
*    Variables that refer to a collection of `DataFrame` objects (e.g., a list, a set or a dict) are prefixed with `dfs_`.

---

## <a id='#section3'>3. Data Sources</a>

### <a id='#section3.1'>3.1. Introduction</a>

#### <a id='#section3.1.1'>3.1.1. About StatsBomb</a>
[StatsBomb](https://statsbomb.com/) are a football analytics and data company.

![title](../../img/logos/stats-bomb-logo.png)

Before any analysis, the data needs to be imported as a DataFrame in the Data Sources section ([Section 3](#section3)) and cleaned in the Data Engineering section ([Section 4](#section4)).

We'll be using the [pandas](http://pandas.pydata.org/) library to import our data to this workbook as a DataFrame.

#### <a id='#section3.1.2'>3.1.2. About the StatsBomb publicly available data</a>
The complete data set contains:
- 7 competitions;
- 879 matches;
- 3,161,917 events; and
- z players.

The datasets we will be using are:
- competitions;
- matches;
- events;
- lineups; and
- tactics.


### <a id='#section3.2'>3.2. Download the Data</a>
This section downloads the StatsBomb datasets if not already present in the data folder.

The data is available at the following: https://github.com/statsbomb/open-data

In [7]:
# WRITE CODE HERE TO DOWNLOAD DATA DIRECT FROM GITHUB, IF NOT ALREADY IN SB DIRECTORY
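
The placeholder cell above could be filled in along the following lines. This is only a sketch, not the author's implementation: the helper name `download_if_missing` and the `SB_RAW_URL` constant are illustrative, and the base URL assumes the open-data repository still serves its files from the `master` branch under `data/`. Note that the repository layout (e.g. `data/competitions.json`) differs from the local `data/sb/<dataset>/raw/json/` layout used in this notebook, so the destination folder is passed explicitly.

```python
from pathlib import Path
from urllib.request import urlopen

# Assumption: raw files are served from the master branch of the open-data repository
SB_RAW_URL = 'https://raw.githubusercontent.com/statsbomb/open-data/master/data'


def download_if_missing(relative_path, dest_dir, base_url=SB_RAW_URL):
    """Download one file into dest_dir unless a local copy already exists.

    Returns True if the file was downloaded, False if it was already present.
    """
    dest = Path(dest_dir) / Path(relative_path).name
    if dest.exists():
        return False
    dest.parent.mkdir(parents=True, exist_ok=True)
    with urlopen(f'{base_url}/{relative_path}') as response:
        dest.write_bytes(response.read())
    return True
```

For example, `download_if_missing('competitions.json', os.path.join(data_dir_sb, 'competitions', 'raw', 'json'))` would fetch the competitions file only on the first run.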

### <a id='#section3.3'>3.3. Read in Data</a>
The following cells read the `JSON` files into `DataFrame` objects, with some basic Data Engineering to flatten the data and select only the columns of interest, so that the notebook does not exhaust the memory of a standard laptop.
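
To illustrate why selecting columns helps with memory, the sketch below compares the footprint of a small, made-up DataFrame before and after column selection. The data and the choice of columns are hypothetical; only `DataFrame.memory_usage` is the real pandas call being demonstrated.

```python
import pandas as pd

# Hypothetical data standing in for a flattened matches DataFrame
df_demo = pd.DataFrame({
    'match_id': range(1000),
    'home_score': [1] * 1000,
    'metadata_data_version': ['1.1.0'] * 1000,    # a column not needed downstream
})

# Keep only the columns of interest
cols_demo = ['match_id', 'home_score']
df_demo_select = df_demo[cols_demo].copy()

# deep=True also counts the memory held by Python string objects
bytes_before = df_demo.memory_usage(deep=True).sum()
bytes_after = df_demo_select.memory_usage(deep=True).sum()
print(f'before: {bytes_before} bytes, after: {bytes_after} bytes')
```

String (`object`) columns tend to dominate the footprint, so dropping unused text columns is usually where most of the saving comes from.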

#### <a id='#section3.3.1.'>3.3.1. Competitions</a>

##### Data dictionary

##### Read in data

In [8]:
# Show files in directory
print(glob.glob(os.path.join(data_dir_sb, 'competitions', 'raw', 'json', '*')))

In [9]:
# Read in exported CSV file if exists, if not, read in JSON file
if not os.path.exists(os.path.join(data_dir_sb, 'competitions', 'raw', 'csv', 'competitions.csv')):
    json_competitions = read_json_file(os.path.join(data_dir_sb, 'competitions', 'raw', 'json', 'competitions.json'))
    df_competitions_flat = pd.read_json(json_competitions)
else:
    df_competitions_flat = pd.read_csv(os.path.join(data_dir_sb, 'competitions', 'raw', 'csv', 'competitions.csv'))    
    
# Display DataFrame
df_competitions_flat

Unnamed: 0,competition_id,season_id,country_name,competition_name,competition_gender,season_name,match_updated,match_available
0,16,4,Europe,Champions League,male,2018/2019,2020-07-29T05:00,2020-07-29T05:00
1,16,1,Europe,Champions League,male,2017/2018,2020-07-29T05:00,2020-07-29T05:00
2,16,2,Europe,Champions League,male,2016/2017,2020-08-26T12:33:15.869622,2020-07-29T05:00
3,16,27,Europe,Champions League,male,2015/2016,2020-08-26T12:33:15.869622,2020-07-29T05:00
4,16,26,Europe,Champions League,male,2014/2015,2020-08-26T12:33:15.869622,2020-07-29T05:00
5,16,25,Europe,Champions League,male,2013/2014,2020-08-26T12:33:15.869622,2020-07-29T05:00
6,16,24,Europe,Champions League,male,2012/2013,2020-08-26T12:33:15.869622,2020-07-29T05:00
7,16,23,Europe,Champions League,male,2011/2012,2020-08-26T12:33:15.869622,2020-07-29T05:00
8,16,22,Europe,Champions League,male,2010/2011,2020-07-29T05:00,2020-07-29T05:00
9,16,21,Europe,Champions League,male,2009/2010,2020-07-29T05:00,2020-07-29T05:00


In [10]:
df_competitions_flat.shape

(37, 8)

##### Streamline the DataFrame

In [11]:
# Display DataFrame columns
df_competitions_flat.columns

Index(['competition_id', 'season_id', 'country_name', 'competition_name',
       'competition_gender', 'season_name', 'match_updated',
       'match_available'],
      dtype='object')

In [12]:
# Select columns of interest
cols_competitions = ['competition_id',
                     'season_id',
                     'country_name',
                     'competition_name',
                     'competition_gender',
                     'season_name'
                    ]
                     
# Create more concise DataFrame using only columns of interest
df_competitions_flat_select = df_competitions_flat[cols_competitions]

In [13]:
df_competitions_flat_select.shape

(37, 6)

##### Export DataFrame

In [14]:
# Export DataFrame as a CSV file
if not os.path.exists(os.path.join(data_dir_sb, 'competitions', 'raw', 'csv', 'competitions.csv')):
    df_competitions_flat.to_csv(os.path.join(data_dir_sb, 'competitions', 'raw', 'csv', 'competitions.csv'), index=None, header=True)
else:
    pass

In [15]:
# Export DataFrame as a CSV file
if not os.path.exists(os.path.join(data_dir_sb, 'competitions', 'raw', 'csv', 'competitions_select.csv')):
    df_competitions_flat_select.to_csv(os.path.join(data_dir_sb, 'competitions', 'raw', 'csv', 'competitions_select.csv'), index=None, header=True)
else:
    pass
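
The "export only if the file is not already there" pattern recurs for every dataset in this notebook, so it could be wrapped in a small helper. The function name `export_if_missing` below is an illustrative assumption, not part of the original notebook.

```python
from pathlib import Path

import pandas as pd


def export_if_missing(df, csv_path):
    """Write df to csv_path unless the file already exists.

    Returns True if the file was written, False if it was left untouched.
    """
    csv_path = Path(csv_path)
    if csv_path.exists():
        return False
    # Create the target folder on first use
    csv_path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(csv_path, index=False, header=True)
    return True
```

With this helper, each export cell would reduce to a single call such as `export_if_missing(df_competitions_flat_select, os.path.join(data_dir_sb, 'competitions', 'raw', 'csv', 'competitions_select.csv'))`.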

#### <a id='#section3.3.2.'>3.3.2. Matches</a>

##### Data dictionary

##### Define competitions
The following cell lists the competitions to be included in the dataset. The dataset covers seven competitions: five club competitions and two international tournaments.

In [16]:
# Show files in directory
print(glob.glob(data_dir_sb + '/matches/raw/json/*'))

['../../data/sb/matches/raw/json/11', '../../data/sb/matches/raw/json/16', '../../data/sb/matches/raw/json/72', '../../data/sb/matches/raw/json/43', '../../data/sb/matches/raw/json/37', '../../data/sb/matches/raw/json/49', '../../data/sb/matches/raw/json/2']


In [17]:
# Define a list to select only the competitions of interest.
# For this notebook, all the available competitions are used

# Define list of competitions
lst_competitions = [2,     # Premier League
                    11,    # La Liga
                    16,    # Champions League
                    37,    # FA Women's Super League
                    43,    # FIFA World Cup
                    49,    # NWSL
                    72,    # Women's World Cup
                   ]

# Alternatively, take all competition IDs from the competitions DataFrame
#lst_competitions = df_competitions_flat['competition_id'].unique().tolist()

# Display list of competitions
lst_competitions

[2, 11, 16, 37, 43, 49, 72]

In [18]:
len(lst_competitions)

7

##### Read in JSON files

In [19]:
# Show files in directory
print(glob.glob(data_dir_sb + '/matches/raw/json/' + '/*json'))

[]
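
The empty list above is consistent with the match files sitting one level down, in one folder per competition ID, as the earlier directory listing suggests. A minimal sketch for listing them with a nested pattern follows; the helper name `list_match_files` is illustrative only.

```python
from pathlib import Path


def list_match_files(matches_json_dir):
    """Return a dict mapping each competition folder name to its sorted JSON file names."""
    files_by_competition = {}
    # '*/*.json' matches JSON files exactly one folder below matches_json_dir
    for path in sorted(Path(matches_json_dir).glob('*/*.json')):
        files_by_competition.setdefault(path.parent.name, []).append(path.name)
    return files_by_competition
```

Calling `list_match_files(os.path.join(data_dir_sb, 'matches', 'raw', 'json'))` would then show which files will be read for each competition in the next cell.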


In [20]:
# Read in exported CSV file if exists, if not, read in JSON file
if not os.path.exists(os.path.join(data_dir_sb, 'matches', 'raw', 'csv', 'matches.csv')):
    # Loop through the competition folders for the selected competition(s)
    # Each folder holds one JSON file per season, and each file lists that season's matches
    # The files are called {season_id}.json
    # Read each JSON file as a pandas DataFrame
    # Append the DataFrames to a list
    # Finally, concatenate all the separate DataFrames into one DataFrame

    ## Create empty list for DataFrames
    dfs_matches = []

    ## Loop through the competition files for the selected competition(s) and append DataFrame to dfs_matches list
    for competition_id in lst_competitions:
        filepath_competition = data_dir_sb + '/matches/raw/json/' + str(competition_id)
        filepath_matches = (glob.glob(filepath_competition + '/*.json'))
        for filepath_match in filepath_matches:
            df_match = pd.read_json(filepath_match)
            dfs_matches.append(df_match)

    ## Concatenate DataFrames to one DataFrame
    df_matches = pd.concat(dfs_matches)
    
    # Flatten the nested columns
    df_matches_flat = flatten_nested_json_df(df_matches)
    
    ## Rename columns
    df_matches_flat.columns = df_matches_flat.columns.str.replace('.', '_', regex=False)
    
else:    
    df_matches_flat = pd.read_csv(os.path.join(data_dir_sb, 'matches', 'raw', 'csv', 'matches.csv'))
    
    
# Display DataFrame
df_matches_flat.head()

Unnamed: 0,index,match_id,match_date,kick_off,home_score,away_score,match_status,last_updated,match_week,referee,stadium,competition_competition_id,competition_country_name,competition_competition_name,season_season_id,season_season_name,home_team_home_team_id,home_team_home_team_name,home_team_home_team_gender,home_team_home_team_group,home_team_managers,home_team_country_id,home_team_country_name,away_team_away_team_id,away_team_away_team_name,away_team_away_team_gender,away_team_away_team_group,away_team_managers,away_team_country_id,away_team_country_name,metadata_data_version,metadata_shot_fidelity_version,metadata_xy_fidelity_version,competition_stage_id,competition_stage_name
0,0,3749257,2004-05-15,16:00:00.000,2,1,available,2020-08-30T08:12:14.579037,38,"{'id': 1279, 'name': 'None'}",,2,England,Premier League,44,2003/2004,1,Arsenal,male,,"[{'id': 577, 'name': 'Arsène Wenger', 'nicknam...",68,England,22,Leicester City,male,,"[{'id': 2974, 'name': 'Micky Adams', 'nickname...",68,England,1.1.0,2.0,2.0,1,Regular Season
1,1,3749246,2004-03-28,17:05:00.000,1,1,available,2020-07-29T05:00,30,"{'id': 1279, 'name': 'None'}",,2,England,Premier League,44,2003/2004,1,Arsenal,male,,,68,England,39,Manchester United,male,,,68,England,1.1.0,2.0,2.0,1,Regular Season
2,2,3749153,2004-01-10,16:00:00.000,4,1,available,2020-08-30T08:12:14.579037,21,"{'id': 1279, 'name': 'None'}",,2,England,Premier League,44,2003/2004,1,Arsenal,male,,"[{'id': 577, 'name': 'Arsène Wenger', 'nicknam...",68,England,47,Middlesbrough,male,,"[{'id': 40, 'name': 'Steve McClaren', 'nicknam...",68,England,1.1.0,2.0,2.0,1,Regular Season
3,3,3749642,2004-02-28,16:00:00.000,2,1,available,2020-07-29T05:00,27,"{'id': 1279, 'name': 'None'}",,2,England,Premier League,44,2003/2004,1,Arsenal,male,,,68,England,75,Charlton Athletic,male,,,68,England,1.1.0,2.0,2.0,1,Regular Season
4,4,3749358,2003-08-24,17:05:00.000,0,4,available,2020-07-29T05:00,2,"{'id': 1279, 'name': 'None'}",,2,England,Premier League,44,2003/2004,47,Middlesbrough,male,,,68,England,1,Arsenal,male,,,68,England,1.1.0,2.0,2.0,1,Regular Season


In [21]:
df_matches_flat.shape

(879, 35)

##### Streamline the DataFrame

In [22]:
# Display DataFrame columns
df_matches_flat.columns

Index(['index', 'match_id', 'match_date', 'kick_off', 'home_score',
       'away_score', 'match_status', 'last_updated', 'match_week', 'referee',
       'stadium', 'competition_competition_id', 'competition_country_name',
       'competition_competition_name', 'season_season_id',
       'season_season_name', 'home_team_home_team_id',
       'home_team_home_team_name', 'home_team_home_team_gender',
       'home_team_home_team_group', 'home_team_managers',
       'home_team_country_id', 'home_team_country_name',
       'away_team_away_team_id', 'away_team_away_team_name',
       'away_team_away_team_gender', 'away_team_away_team_group',
       'away_team_managers', 'away_team_country_id', 'away_team_country_name',
       'metadata_data_version', 'metadata_shot_fidelity_version',
       'metadata_xy_fidelity_version', 'competition_stage_id',
       'competition_stage_name'],
      dtype='object')

In [23]:
# Select columns of interest
cols_matches= ['index',
               'match_id',
               'match_date',
               'kick_off',
               'home_score',
               'away_score',
               'match_status',
               'last_updated',
               'match_week',
               'referee',
               'stadium',
               'competition_competition_id',
               'competition_country_name',
               'competition_competition_name',
               'season_season_id',
               'season_season_name',
               'home_team_home_team_id',
               'home_team_home_team_name',
               'home_team_home_team_group',
               'home_team_managers',
               'home_team_country_name',
               'away_team_away_team_name',
               'away_team_managers',
               'away_team_country_name',
               'competition_stage_name'
              ]

# Create more concise DataFrame using only columns of interest
df_matches_flat_select = df_matches_flat[cols_matches]

In [24]:
df_matches_flat_select.shape

(879, 25)

##### Convert `match_id` column to list
The list of match IDs is used as the reference for which Events, Lineups, and Tactics files to parse; later cells iterate over it.

In [25]:
# Convert the match_id column to a list
lst_matches = df_matches_flat_select['match_id'].tolist()

# Display list of matches
lst_matches

[3749257,
 3749246,
 3749153,
 3749642,
 3749358,
 3749346,
 3749253,
 3749079,
 3749465,
 3749133,
 3749528,
 3749233,
 3749462,
 3749552,
 3749296,
 3749454,
 3749276,
 3749068,
 3749310,
 3749493,
 3749434,
 3749192,
 3749196,
 3749522,
 3749448,
 3749403,
 3749360,
 3749453,
 3749278,
 3749526,
 3749052,
 3749603,
 3749431,
 69225,
 69212,
 69235,
 69232,
 69216,
 69209,
 69231,
 69217,
 69273,
 69223,
 69222,
 69195,
 69251,
 69185,
 69142,
 69139,
 69189,
 69171,
 69249,
 69215,
 69138,
 69147,
 69149,
 69177,
 69207,
 69228,
 69183,
 69279,
 69285,
 69230,
 69211,
 69144,
 69151,
 69169,
 68360,
 69186,
 69180,
 69143,
 69181,
 68365,
 69178,
 68364,
 69170,
 68359,
 68356,
 69158,
 69187,
 68363,
 69166,
 68366,
 69148,
 69145,
 69184,
 69173,
 69146,
 69182,
 68358,
 68361,
 69141,
 68314,
 68313,
 68316,
 68315,
 69153,
 68352,
 68353,
 69243,
 69241,
 69257,
 69253,
 69244,
 69239,
 69277,
 69229,
 69219,
 69218,
 69250,
 69242,
 69256,
 69298,
 69221,
 69259,
 69224,
 69210

In [26]:
len(lst_matches)

879
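
Since the events loop opens one file per match ID, a missing file would stop it partway through. A quick check beforehand can list any IDs without a matching file. This is a sketch under the assumption that event files are named `{match_id}.json`; the helper name `find_missing_event_files` is illustrative.

```python
from pathlib import Path


def find_missing_event_files(match_ids, events_json_dir):
    """Return the match IDs that have no '{match_id}.json' file in events_json_dir."""
    events_json_dir = Path(events_json_dir)
    return [match_id for match_id in match_ids
            if not (events_json_dir / f'{match_id}.json').is_file()]
```

An empty result from `find_missing_event_files(lst_matches, os.path.join(data_dir_sb, 'events', 'raw', 'json'))` means every match in the list has an events file to parse.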

##### Export DataFrame

In [27]:
# Export DataFrame as a CSV file
if not os.path.exists(os.path.join(data_dir_sb, 'matches', 'raw', 'csv', 'matches.csv')):
    df_matches_flat.to_csv(os.path.join(data_dir_sb, 'matches', 'raw', 'csv', 'matches.csv'), index=None, header=True)
else:
    pass

In [28]:
# Export DataFrame as a CSV file
if not os.path.exists(os.path.join(data_dir_sb, 'matches', 'raw', 'csv', 'matches_select.csv')):
    df_matches_flat_select.to_csv(os.path.join(data_dir_sb, 'matches', 'raw', 'csv', 'matches_select.csv'), index=None, header=True)
else:
    pass

#### <a id='#section3.3.3.'>3.3.3. Events</a>

##### Data Dictionary

The [StatsBomb](https://statsbomb.com/) dataset has one hundred and fifty features (columns) with the following definitions and data types:

| Feature     | Data type    |
|------|-----|
| `id`    | `object`
| `index`    | `object`
| `period`    | `object`
| `timestamp`    | `object`
| `minute`    | `object`
| `second`    | `object`
| `possession`    | `object`
| `duration`    | `object`
| `type.id`    | `object`
| `type.name`    | `object`
| `possession_team.id`    | `object`
| `possession_team.name`    | `object`
| `play_pattern.id`    | `object`
| `play_pattern.name`    | `object`
| `team.id`    | `object`
| `team.name`    | `object`
| `tactics.formation`    | `object`
| `tactics.lineup`    | `object`
| `related_events`    | `object`
| `location`    | `object`
| `player.id`    | `object`
| `player.name`    | `object`
| `position.id`    | `object`
| `position.name`    | `object`
| `pass.recipient.id`    | `object`
| `pass.recipient.name`    | `object`
| `pass.length`    | `object`
| `pass.angle`    | `object`
| `pass.height.id`    | `object`
| `pass.height.name`    | `object`
| `pass.end_location`    | `object`
| `pass.type.id`    | `object`
| `pass.type.name`    | `object`
| `pass.body_part.id`    | `object`
| `pass.body_part.name`    | `object`
| `carry.end_location`    | `object`
| `under_pressure`    | `object`
| `duel.type.id`    | `object`
| `duel.type.name`    | `object`
| `out`    | `object`
| `miscontrol.aerial_won`    | `object`
| `pass.outcome.id`    | `object`
| `pass.outcome.name`    | `object`
| `ball_receipt.outcome.id`    | `object`
| `ball_receipt.outcome.name`    | `object`
| `pass.aerial_won`    | `object`
| `counterpress`    | `object`
| `off_camera`    | `object`
| `dribble.outcome.id`    | `object`
| `dribble.outcome.name`    | `object`
| `dribble.overrun`    | `object`
| `ball_recovery.offensive`    | `object`
| `shot.statsbomb_xg`    | `object`
| `shot.end_location`    | `object`
| `shot.outcome.id`    | `object`
| `shot.outcome.name`    | `object`
| `shot.type.id`    | `object`
| `shot.type.name`    | `object`
| `shot.body_part.id`    | `object`
| `shot.body_part.name`    | `object`
| `shot.technique.id`    | `object`
| `shot.technique.name`    | `object`
| `shot.freeze_frame`    | `object`
| `goalkeeper.end_location`    | `object`
| `goalkeeper.type.id`    | `object`
| `goalkeeper.type.name`    | `object`
| `goalkeeper.position.id`    | `object`
| `goalkeeper.position.name`    | `object`
| `pass.straight`    | `object`
| `pass.technique.id`    | `object`
| `pass.technique.name`    | `object`
| `clearance.head`    | `object`
| `clearance.body_part.id`    | `object`
| `clearance.body_part.name`    | `object`
| `pass.switch`    | `object`
| `duel.outcome.id`    | `object`
| `duel.outcome.name`    | `object`
| `foul_committed.advantage`    | `object`
| `foul_won.advantage`    | `object`
| `pass.cross`    | `object`
| `pass.assisted_shot_id`    | `object`
| `pass.shot_assist`    | `object`
| `shot.one_on_one`    | `object`
| `shot.key_pass_id`    | `object`
| `goalkeeper.body_part.id`    | `object`
| `goalkeeper.body_part.name`    | `object`
| `goalkeeper.technique.id`    | `object`
| `goalkeeper.technique.name`    | `object`
| `goalkeeper.outcome.id`    | `object`
| `goalkeeper.outcome.name`    | `object`
| `clearance.aerial_won`    | `object`
| `foul_committed.card.id`    | `object`
| `foul_committed.card.name`    | `object`
| `foul_won.defensive`    | `object`
| `clearance.right_foot`    | `object`
| `shot.first_time`    | `object`
| `pass.through_ball`    | `object`
| `interception.outcome.id`    | `object`
| `interception.outcome.name`    | `object`
| `clearance.left_foot`    | `object`
| `ball_recovery.recovery_failure`    | `object`
| `shot.aerial_won`    | `object`
| `pass.goal_assist`    | `object`
| `pass.cut_back`    | `object`
| `pass.deflected`    | `object`
| `clearance.other`    | `object`
| `pass.outswinging`    | `object`
| `substitution.outcome.id`    | `object`
| `substitution.outcome.name`    | `object`
| `substitution.replacement.id`    | `object`
| `substitution.replacement.name`    | `object`
| `block.deflection`    | `object`
| `block.offensive`    | `object`
| `injury_stoppage.in_chain`    | `object`

For a full list of definitions, see the official documentation [[link](https://statsbomb.com/stat-definitions/)].

##### Read in JSON files

In [29]:
# Show files in directory
print(glob.glob(data_dir_sb + '/events/raw/json/' + '/*json'))

['../../data/sb/events/raw/json/2275050.json', '../../data/sb/events/raw/json/19795.json', '../../data/sb/events/raw/json/7298.json', '../../data/sb/events/raw/json/265958.json', '../../data/sb/events/raw/json/69182.json', '../../data/sb/events/raw/json/18242.json', '../../data/sb/events/raw/json/69301.json', '../../data/sb/events/raw/json/303696.json', '../../data/sb/events/raw/json/69244.json', '../../data/sb/events/raw/json/2275142.json', '../../data/sb/events/raw/json/266620.json', '../../data/sb/events/raw/json/7559.json', '../../data/sb/events/raw/json/69213.json', '../../data/sb/events/raw/json/2275154.json', '../../data/sb/events/raw/json/69340.json', '../../data/sb/events/raw/json/69205.json', '../../data/sb/events/raw/json/19804.json', '../../data/sb/events/raw/json/8655.json', '../../data/sb/events/raw/json/266724.json', '../../data/sb/events/raw/json/19783.json', '../../data/sb/events/raw/json/22980.json', '../../data/sb/events/raw/json/2275103.json', '../../data/sb/events/

In [30]:
# Read in exported CSV file if exists, if not, read in JSON file
if not os.path.exists(os.path.join(data_dir_sb, 'events', 'raw', 'csv', 'events.csv')):
    # Loop through the match IDs for the selected match(es)
    # Take the separate JSON files, each holding the events of one selected match
    # The file is called {match_id}.json
    # Load each JSON file and flatten it to a pandas DataFrame with json_normalize
    # Add the match_id as a column
    # Append the DataFrames to a list
    # Finally, concatenate all the separate DataFrames into one DataFrame

    ## Create empty list for DataFrames
    dfs_events = []

    ## Loop through event files for the selected matches and append DataFrame to dfs_events list
    for match_id in lst_matches:
        with open(data_dir_sb + '/events/raw/json/' + str(match_id) + '.json', encoding='utf-8') as f:
            event = json.load(f)
            df_event_flat = json_normalize(event)
            df_event_flat['match_id'] = match_id
            dfs_events.append(df_event_flat)    

    ## Concatenate DataFrames to one DataFrame
    df_events = pd.concat(dfs_events)
    
    # Flatten the nested columns
    df_events_flat = flatten_nested_json_df(df_events)
    
    ## Rename columns
    df_events_flat.columns = df_events_flat.columns.str.replace('.', '_', regex=False)
    
else:    
    df_events_flat = pd.read_csv(os.path.join(data_dir_sb, 'events', 'raw', 'csv', 'events.csv'))
    
    
# Display DataFrame
df_events_flat.head()


Unnamed: 0,level_0,id,index,period,timestamp,minute,second,possession,duration,type_id,type_name,possession_team_id,possession_team_name,play_pattern_id,play_pattern_name,team_id,team_name,tactics_formation,tactics_lineup,related_events,location,player_id,player_name,position_id,position_name,pass_recipient_id,pass_recipient_name,pass_length,pass_angle,pass_height_id,pass_height_name,pass_end_location,pass_type_id,pass_type_name,pass_body_part_id,pass_body_part_name,carry_end_location,pass_outcome_id,pass_outcome_name,under_pressure,clearance_head,clearance_body_part_id,clearance_body_part_name,counterpress,duel_outcome_id,duel_outcome_name,duel_type_id,duel_type_name,ball_receipt_outcome_id,ball_receipt_outcome_name,out,clearance_left_foot,pass_switch,off_camera,clearance_aerial_won,dribble_outcome_id,dribble_outcome_name,pass_cross,pass_assisted_shot_id,pass_shot_assist,shot_statsbomb_xg,shot_end_location,shot_key_pass_id,shot_body_part_id,shot_body_part_name,shot_technique_id,shot_technique_name,shot_outcome_id,shot_outcome_name,shot_type_id,shot_type_name,shot_freeze_frame,goalkeeper_end_location,goalkeeper_type_id,goalkeeper_type_name,goalkeeper_position_id,goalkeeper_position_name,ball_recovery_recovery_failure,foul_committed_advantage,foul_won_advantage,dribble_overrun,clearance_right_foot,interception_outcome_id,interception_outcome_name,foul_won_defensive,pass_aerial_won,pass_deflected,pass_inswinging,pass_technique_id,pass_technique_name,goalkeeper_body_part_id,goalkeeper_body_part_name,goalkeeper_technique_id,goalkeeper_technique_name,goalkeeper_outcome_id,goalkeeper_outcome_name,pass_outswinging,pass_goal_assist,shot_one_on_one,miscontrol_aerial_won,shot_deflected,block_deflection,shot_first_time,block_offensive,pass_through_ball,foul_committed_card_id,foul_committed_card_name,foul_committed_penalty,foul_won_penalty,dribble_nutmeg,pass_miscommunication,pass_no_touch,foul_committed_offensive,goalkeeper_lost_out,pass_straight,substitution_outcome_id,substitution_outcome_name,substitution_replacement_id,substitution_replacement_name,match_id,goalkeeper_punched_out,shot_aerial_won,pass_cut_back,goalkeeper_success_in_play,50_50_outcome_id,50_50_outcome_name,foul_committed_type_id,foul_committed_type_name,ball_recovery_offensive,shot_saved_off_target,goalkeeper_shot_saved_off_target,shot_open_goal,dribble_no_touch,bad_behaviour_card_id,bad_behaviour_card_name,half_start_late_video_start,block_save_block,shot_follows_dribble,clearance_other,goalkeeper_shot_saved_to_post,shot_redirect,injury_stoppage_in_chain,shot_saved_to_post,goalkeeper_success_out,goalkeeper_lost_in_play,half_end_early_video_end,player_off_permanent,goalkeeper_saved_to_post,pass_backheel,shot_kick_off,goalkeeper_penalty_saved_to_post
0,0,41e0ff39-da7c-451a-8f08-82d3a9b369f2,1,1,00:00:00.000,0,0,1,0.0,35,Starting XI,1,Arsenal,1,Regular Play,1,Arsenal,442.0,"[{'player': {'id': 20015, 'name': 'Jens Lehman...",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,3749257,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,1,d8c32d32-494b-4ae1-bb0c-d2f738952e3c,2,1,00:00:00.000,0,0,1,0.0,35,Starting XI,1,Arsenal,1,Regular Play,22,Leicester City,442.0,"[{'player': {'id': 40236, 'name': 'Ian Walker'...",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,3749257,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2,2,6e678cba-67c3-4e9a-acca-78ab69b7d68b,3,1,00:00:00.000,0,0,1,0.0,18,Half Start,1,Arsenal,1,Regular Play,1,Arsenal,,,['b31e69b0-a75e-4721-b023-c06094ddcfa0'],,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,3749257,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
3,3,b31e69b0-a75e-4721-b023-c06094ddcfa0,4,1,00:00:00.000,0,0,1,0.0,18,Half Start,1,Arsenal,1,Regular Play,22,Leicester City,,,['6e678cba-67c3-4e9a-acca-78ab69b7d68b'],,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,3749257,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
4,4,0613063a-1cd4-4a18-83a7-9722be2d9f40,5,1,00:00:01.036,0,1,2,0.238292,30,Pass,22,Leicester City,9,From Kick Off,22,Leicester City,,,['500a6fd9-61c7-4b61-bb3c-8e6605c24084'],"[61.0, 40.1]",40240.0,Paul Dickov,24.0,Left Center Forward,40242.0,Marcus Bent,1.104536,-1.661456,1.0,Ground Pass,"[60.9, 39.0]",65.0,Kick Off,38.0,Left Foot,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,3749257,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


In [31]:
df_events_flat.shape

(3161917, 151)

##### Streamline the DataFrame

In [32]:
# Display the data type of every column, without truncating the output
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    print(df_events_flat.dtypes)

level_0                               int64
id                                   object
index                                 int64
period                                int64
timestamp                            object
minute                                int64
second                                int64
possession                            int64
duration                            float64
type_id                               int64
type_name                            object
possession_team_id                    int64
possession_team_name                 object
play_pattern_id                       int64
play_pattern_name                    object
team_id                               int64
team_name                            object
tactics_formation                   float64
tactics_lineup                       object
related_events                       object
location                             object
player_id                           float64
player_name                     
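Several of the `object` columns listed above hold structured values. As a minimal sketch (the two-row sample is made up to mirror the `timestamp` and `location` values shown in the preview, and the new column names are hypothetical), `timestamp` can be converted to a timedelta, and a `location` string read back from CSV can be parsed into coordinates:

```python
import ast

import pandas as pd

# Hypothetical sample mirroring the 'timestamp' and 'location' values seen above
df_sample = pd.DataFrame({
    'timestamp': ['00:00:01.036', '00:22:30.391'],
    'location': ['[61.0, 40.1]', None],
})

# Convert the clock string to a timedelta, then to elapsed seconds
df_sample['timestamp_td'] = pd.to_timedelta(df_sample['timestamp'])
df_sample['timestamp_seconds'] = df_sample['timestamp_td'].dt.total_seconds()


def parse_location(value):
    """Return [x, y] parsed from a string such as '[61.0, 40.1]', or None if missing."""
    if isinstance(value, str):
        return ast.literal_eval(value)
    return None


# After a CSV round trip, list-like columns are plain strings; parse them safely
df_sample['location_xy'] = df_sample['location'].apply(parse_location)
```

`ast.literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for this kind of parsing.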

In [33]:
"""
# Select columns of interest
cols_events = [
              ]

# Create more concise DataFrame using only columns of interest
df_events_flat_select = df_events_flat[cols_events]
"""

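The placeholder above leaves `cols_events` empty. One possible way to shortlist columns, sketched here on a made-up miniature DataFrame (the 50% threshold and the `*_demo` names are assumptions, not the author's choice), is to rank columns by their share of missing values:

```python
import numpy as np
import pandas as pd

# Made-up miniature events table
df_demo = pd.DataFrame({
    'id': ['a', 'b', 'c', 'd'],
    'type_name': ['Starting XI', 'Half Start', 'Pass', 'Pass'],
    'pass_length': [np.nan, np.nan, 1.10, 38.55],
    'shot_statsbomb_xg': [np.nan, np.nan, np.nan, np.nan],
})

# Share of missing values per column, from most to least complete
missing_share = df_demo.isna().mean().sort_values()

# Keep columns whose missing share does not exceed the (assumed) threshold
max_missing = 0.5
cols_demo = missing_share[missing_share <= max_missing].index.tolist()
df_demo_select = df_demo[cols_demo]
```

Event-specific columns such as `pass_*` or `shot_*` are sparse by design, so a threshold like this is best treated as a first look rather than a rule for dropping columns.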

##### Export DataFrame

In [34]:
# Export DataFrame as a CSV file, unless it has already been exported
path_events_csv = os.path.join(data_dir_sb, 'events', 'raw', 'csv', 'events.csv')
if not os.path.exists(path_events_csv):
    df_events_flat.to_csv(path_events_csv, index=False, header=True)
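The export-if-missing pattern recurs for each dataset in this notebook. A small sketch of how it could be wrapped in a helper (the name `export_csv_if_missing` and the temporary directory are illustrative assumptions, not part of the original notebook):

```python
import os
import tempfile

import pandas as pd


def export_csv_if_missing(df, path):
    """Write df to path unless the file already exists; return True if written."""
    if os.path.exists(path):
        return False
    # Create the target folder if needed, then write without the index column
    os.makedirs(os.path.dirname(path), exist_ok=True)
    df.to_csv(path, index=False, header=True)
    return True


# Demonstration in a throwaway directory
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'events', 'raw', 'csv', 'events.csv')
    df_demo = pd.DataFrame({'id': ['a', 'b'], 'match_id': [3749257, 3749257]})
    first_write = export_csv_if_missing(df_demo, demo_path)
    second_write = export_csv_if_missing(df_demo, demo_path)
    df_roundtrip = pd.read_csv(demo_path)
```

Because an existing file is never overwritten, a stale CSV has to be deleted by hand before re-running the notebook on updated data.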

In [35]:
"""
# Export DataFrame as a CSV file
if not os.path.exists(os.path.join(data_dir_sb, 'events', 'raw', 'csv', 'events_select.csv')):
    df_events_flat_select.to_csv(os.path.join(data_dir_sb, 'events', 'raw', 'csv', 'events_select.csv'), index=None, header=True)
else:
    pass
"""

##### View all formations

In [36]:
# Collect the formation value of every event in a list
lst_formation = df_events_flat['tactics_formation'].tolist()

# Display list of formations
lst_formation

[442.0,
 442.0,
 nan,
 ...
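Most events carry no formation, so the full list is dominated by `nan`. A compact alternative, sketched on a made-up column (`433.0` is an illustrative value that does not appear in the output above), is to count the distinct non-null formations:

```python
import numpy as np
import pandas as pd

# Made-up miniature column; only 442.0 appears in the notebook output
df_demo = pd.DataFrame({'tactics_formation': [442.0, 442.0, np.nan, np.nan, 433.0, np.nan]})

# Drop missing values, cast to int for readability and count each formation
formation_counts = (
    df_demo['tactics_formation']
    .dropna()
    .astype(int)
    .value_counts()
)

# Sorted list of the distinct formations
lst_formation_unique = sorted(formation_counts.index.tolist())
```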


#### <a id='#section3.3.4.'>3.3.4. Lineups</a>

##### Data Dictionary

##### Read in JSON files

In [37]:
# Show files in directory
print(glob.glob(os.path.join(data_dir_sb, 'lineups', 'raw', 'json', '*.json')))

['../../data/sb/lineups/raw/json/2275050.json', '../../data/sb/lineups/raw/json/19795.json', '../../data/sb/lineups/raw/json/7298.json', '../../data/sb/lineups/raw/json/265958.json', '../../data/sb/lineups/raw/json/69182.json', '../../data/sb/lineups/raw/json/18242.json', '../../data/sb/lineups/raw/json/69301.json', '../../data/sb/lineups/raw/json/303696.json', '../../data/sb/lineups/raw/json/69244.json', '../../data/sb/lineups/raw/json/2275142.json', '../../data/sb/lineups/raw/json/266620.json', '../../data/sb/lineups/raw/json/7559.json', '../../data/sb/lineups/raw/json/69213.json', '../../data/sb/lineups/raw/json/2275154.json', '../../data/sb/lineups/raw/json/69340.json', '../../data/sb/lineups/raw/json/69205.json', '../../data/sb/lineups/raw/json/19804.json', '../../data/sb/lineups/raw/json/8655.json', '../../data/sb/lineups/raw/json/266724.json', '../../data/sb/lineups/raw/json/19783.json', '../../data/sb/lineups/raw/json/22980.json', '../../data/sb/lineups/raw/json/2275103.json', 
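Since each file is named after its match, the listing can also be turned into a list of match IDs. A minimal sketch using `pathlib` on a throwaway directory (the file names and the variable `lst_match_ids_on_disk` are illustrative assumptions):

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp_dir:
    # Build a small stand-in for the lineups JSON folder
    json_dir = Path(tmp_dir) / 'lineups' / 'raw' / 'json'
    json_dir.mkdir(parents=True)
    for name in ['7298.json', '19795.json', 'notes.txt']:
        (json_dir / name).write_text('[]')

    # Keep only the .json files and read the match ID from each file name
    lineup_files = sorted(json_dir.glob('*.json'))
    lst_match_ids_on_disk = sorted(int(path.stem) for path in lineup_files)
```

Comparing such a list with `lst_matches` before looping would show which matches have no lineup file on disk.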

In [38]:
# Read in exported CSV file if exists, if not, read in JSON file
if not os.path.exists(os.path.join(data_dir_sb, 'lineups', 'raw', 'csv', 'lineups.csv')):
    # Loop through the lineup files for the selected matches.
    # Each match has its own JSON file, named {match_id}.json.
    # Read each JSON file, normalise it into a pandas DataFrame and tag it with its match_id
    # Append the DataFrames to a list
    # Finally, concatenate all the separate DataFrames into one DataFrame

    ## Create empty list for DataFrames
    dfs_lineups = []

    ## Loop through lineup files for the selected matches and append DataFrame to dfs_lineups list
    for match_id in lst_matches:
        with open(os.path.join(data_dir_sb, 'lineups', 'raw', 'json', str(match_id) + '.json')) as f:
            lineup = json.load(f)
        df_lineup = json_normalize(lineup)
        df_lineup['match_id'] = match_id
        dfs_lineups.append(df_lineup)

    ## Concatenate DataFrames to one DataFrame
    df_lineups = pd.concat(dfs_lineups)

    ## Flatten the nested columns
    df_lineups_flat = flatten_nested_json_df(df_lineups)

    ## Rename columns, replacing the literal '.' separator with '_'
    df_lineups_flat.columns = df_lineups_flat.columns.str.replace('.', '_', regex=False)

else:
    df_lineups_flat = pd.read_csv(os.path.join(data_dir_sb, 'lineups', 'raw', 'csv', 'lineups.csv'))


# Display DataFrame
df_lineups_flat.head()

Unnamed: 0,index,team_id,team_name,match_id,lineup_player_id,lineup_player_name,lineup_player_nickname,lineup_jersey_number,lineup_country_id,lineup_country_name
0,0,1,Arsenal,3749257,12529,Ashley Cole,,0,68,England
1,0,1,Arsenal,3749257,15042,Dennis Bergkamp,,0,160,Netherlands
2,0,1,Arsenal,3749257,15515,Patrick Vieira,,4,78,France
3,0,1,Arsenal,3749257,15516,Thierry Henry,,0,78,France
4,0,1,Arsenal,3749257,15637,"Sulzeer Jeremiah ""Sol"" Campbell",Sol Campbell,23,68,England
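The flattened column names above suggest that each lineup file is a list of team objects, each with a nested `lineup` list of players. That structure is an assumption here and is not confirmed by this notebook. Under that assumption, `pd.json_normalize` could produce the same column layout in a single call; the payload below is a shortened, partly made-up sample:

```python
import pandas as pd

# Hypothetical, heavily shortened lineup payload for one match
lineup = [
    {
        'team_id': 1,
        'team_name': 'Arsenal',
        'lineup': [
            {'player_id': 15515, 'player_name': 'Patrick Vieira', 'player_nickname': None,
             'jersey_number': 4, 'country': {'id': 78, 'name': 'France'}},
            {'player_id': 15637, 'player_name': 'Sulzeer Jeremiah "Sol" Campbell',
             'player_nickname': 'Sol Campbell', 'jersey_number': 23,
             'country': {'id': 68, 'name': 'England'}},
        ],
    },
    {
        'team_id': 22,
        'team_name': 'Leicester City',
        'lineup': [
            # Jersey number and country are illustrative values
            {'player_id': 40236, 'player_name': 'Ian Walker', 'player_nickname': None,
             'jersey_number': 1, 'country': {'id': 68, 'name': 'England'}},
        ],
    },
]

# One row per player; team fields are repeated as metadata on every row
df_lineup_demo = pd.json_normalize(
    lineup,
    record_path='lineup',
    meta=['team_id', 'team_name'],
    record_prefix='lineup_',
    sep='_',
)
df_lineup_demo['match_id'] = 3749257
```

This is an alternative to the notebook's `flatten_nested_json_df` helper, not a description of how that helper works.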


In [39]:
df_lineups_flat.shape

(26287, 10)

##### Streamline the DataFrame

In [40]:
df_lineups_flat.columns

Index(['index', 'team_id', 'team_name', 'match_id', 'lineup_player_id',
       'lineup_player_name', 'lineup_player_nickname', 'lineup_jersey_number',
       'lineup_country_id', 'lineup_country_name'],
      dtype='object')

In [41]:
"""
# Select columns of interest
cols_lineups = [
               ]

# Create more concise DataFrame using only columns of interest
df_lineups_flat_select = df_lineups_flat[cols_lineups]
"""


##### Export DataFrame

In [42]:
# Export DataFrame as a CSV file, unless it has already been exported
path_lineups_csv = os.path.join(data_dir_sb, 'lineups', 'raw', 'csv', 'lineups.csv')
if not os.path.exists(path_lineups_csv):
    df_lineups_flat.to_csv(path_lineups_csv, index=False, header=True)

In [43]:
"""
# Export DataFrame as a CSV file
if not os.path.exists(os.path.join(data_dir_sb, 'lineups', 'raw', 'csv', 'lineups_select.csv')):
    df_lineups_flat_select.to_csv(os.path.join(data_dir_sb, 'lineups', 'raw', 'csv', 'lineups_select.csv'), index=None, header=True)
else:
    pass
"""

### <a id='#section3.4'>3.4. Join Datasets</a>
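Next, we join the `Matches` and `Competitions` DataFrames to the `Events` DataFrame. The `Events` data is the base DataFrame: `Matches` is joined to it via `match_id`, and `Competitions` is then joined via the competition and season identifiers (`competition_competition_id` to `competition_id` and `season_season_id` to `season_id`).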

In [44]:
# Read in exported CSV file if exists, if not, merge the individual DataFrames
if not os.path.exists(os.path.join(data_dir_sb, 'combined', 'raw', 'csv', 'combined.csv')):

    # Join the Matches DataFrame to the Events DataFrame (inner join on match_id)
    df_events_matches = pd.merge(df_events_flat, df_matches_flat, on='match_id', how='inner')

    # Join the Competitions DataFrame to the Events-Matches DataFrame (inner join on competition and season)
    df_events_matches_competitions = pd.merge(df_events_matches, df_competitions_flat, left_on=['competition_competition_id', 'season_season_id'], right_on=['competition_id', 'season_id'], how='inner')

else:
    df_events_matches_competitions = pd.read_csv(os.path.join(data_dir_sb, 'combined', 'raw', 'csv', 'combined.csv'))


# Display DataFrame
df_events_matches_competitions.head()


Unnamed: 0,level_0,id,index_x,period,timestamp,minute,second,possession,duration,type_id,type_name,possession_team_id,possession_team_name,play_pattern_id,play_pattern_name,team_id,team_name,tactics_formation,tactics_lineup,related_events,location,player_id,player_name,position_id,position_name,pass_recipient_id,pass_recipient_name,pass_length,pass_angle,pass_height_id,pass_height_name,pass_end_location,pass_type_id,pass_type_name,pass_body_part_id,pass_body_part_name,carry_end_location,pass_outcome_id,pass_outcome_name,under_pressure,clearance_head,clearance_body_part_id,clearance_body_part_name,counterpress,duel_outcome_id,duel_outcome_name,duel_type_id,duel_type_name,ball_receipt_outcome_id,ball_receipt_outcome_name,out,clearance_left_foot,pass_switch,off_camera,clearance_aerial_won,dribble_outcome_id,dribble_outcome_name,pass_cross,pass_assisted_shot_id,pass_shot_assist,shot_statsbomb_xg,shot_end_location,shot_key_pass_id,shot_body_part_id,shot_body_part_name,shot_technique_id,shot_technique_name,shot_outcome_id,shot_outcome_name,shot_type_id,shot_type_name,shot_freeze_frame,goalkeeper_end_location,goalkeeper_type_id,goalkeeper_type_name,goalkeeper_position_id,goalkeeper_position_name,ball_recovery_recovery_failure,foul_committed_advantage,foul_won_advantage,dribble_overrun,clearance_right_foot,interception_outcome_id,interception_outcome_name,foul_won_defensive,pass_aerial_won,pass_deflected,pass_inswinging,pass_technique_id,pass_technique_name,goalkeeper_body_part_id,goalkeeper_body_part_name,goalkeeper_technique_id,goalkeeper_technique_name,goalkeeper_outcome_id,goalkeeper_outcome_name,pass_outswinging,pass_goal_assist,shot_one_on_one,miscontrol_aerial_won,shot_deflected,block_deflection,shot_first_time,block_offensive,pass_through_ball,foul_committed_card_id,foul_committed_card_name,foul_committed_penalty,foul_won_penalty,dribble_nutmeg,pass_miscommunication,pass_no_touch,foul_committed_offensive,goalkeeper_lost_out,pass_straight,substitution_outcome_id,subs
titution_outcome_name,substitution_replacement_id,substitution_replacement_name,match_id,goalkeeper_punched_out,shot_aerial_won,pass_cut_back,goalkeeper_success_in_play,50_50_outcome_id,50_50_outcome_name,foul_committed_type_id,foul_committed_type_name,ball_recovery_offensive,shot_saved_off_target,goalkeeper_shot_saved_off_target,shot_open_goal,dribble_no_touch,bad_behaviour_card_id,bad_behaviour_card_name,half_start_late_video_start,block_save_block,shot_follows_dribble,clearance_other,goalkeeper_shot_saved_to_post,shot_redirect,injury_stoppage_in_chain,shot_saved_to_post,goalkeeper_success_out,goalkeeper_lost_in_play,half_end_early_video_end,player_off_permanent,goalkeeper_saved_to_post,pass_backheel,shot_kick_off,goalkeeper_penalty_saved_to_post,index_y,match_date,kick_off,home_score,away_score,match_status,last_updated,match_week,referee,stadium,competition_competition_id,competition_country_name,competition_competition_name,season_season_id,season_season_name,home_team_home_team_id,home_team_home_team_name,home_team_home_team_gender,home_team_home_team_group,home_team_managers,home_team_country_id,home_team_country_name,away_team_away_team_id,away_team_away_team_name,away_team_away_team_gender,away_team_away_team_group,away_team_managers,away_team_country_id,away_team_country_name,metadata_data_version,metadata_shot_fidelity_version,metadata_xy_fidelity_version,competition_stage_id,competition_stage_name,competition_id,season_id,country_name,competition_name,competition_gender,season_name,match_updated,match_available
0,0,41e0ff39-da7c-451a-8f08-82d3a9b369f2,1,1,00:00:00.000,0,0,1,0.0,35,Starting XI,1,Arsenal,1,Regular Play,1,Arsenal,442.0,"[{'player': {'id': 20015, 'name': 'Jens Lehman...",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,3749257,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0,2004-05-15,16:00:00.000,2,1,available,2020-08-30T08:12:14.579037,38,"{'id': 1279, 'name': 'None'}",,2,England,Premier League,44,2003/2004,1,Arsenal,male,,"[{'id': 577, 'name': 'Arsène Wenger', 'nicknam...",68,England,22,Leicester City,male,,"[{'id': 2974, 'name': 'Micky Adams', 'nickname...",68,England,1.1.0,2.0,2.0,1,Regular Season,2,44,England,Premier League,male,2003/2004,2020-08-31T20:40:28.969635,2020-08-31T20:40:28.969635
1,1,d8c32d32-494b-4ae1-bb0c-d2f738952e3c,2,1,00:00:00.000,0,0,1,0.0,35,Starting XI,1,Arsenal,1,Regular Play,22,Leicester City,442.0,"[{'player': {'id': 40236, 'name': 'Ian Walker'...",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,3749257,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0,2004-05-15,16:00:00.000,2,1,available,2020-08-30T08:12:14.579037,38,"{'id': 1279, 'name': 'None'}",,2,England,Premier League,44,2003/2004,1,Arsenal,male,,"[{'id': 577, 'name': 'Arsène Wenger', 'nicknam...",68,England,22,Leicester City,male,,"[{'id': 2974, 'name': 'Micky Adams', 'nickname...",68,England,1.1.0,2.0,2.0,1,Regular Season,2,44,England,Premier League,male,2003/2004,2020-08-31T20:40:28.969635,2020-08-31T20:40:28.969635
2,2,6e678cba-67c3-4e9a-acca-78ab69b7d68b,3,1,00:00:00.000,0,0,1,0.0,18,Half Start,1,Arsenal,1,Regular Play,1,Arsenal,,,['b31e69b0-a75e-4721-b023-c06094ddcfa0'],,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,3749257,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0,2004-05-15,16:00:00.000,2,1,available,2020-08-30T08:12:14.579037,38,"{'id': 1279, 'name': 'None'}",,2,England,Premier League,44,2003/2004,1,Arsenal,male,,"[{'id': 577, 'name': 'Arsène Wenger', 'nicknam...",68,England,22,Leicester City,male,,"[{'id': 2974, 'name': 'Micky Adams', 'nickname...",68,England,1.1.0,2.0,2.0,1,Regular Season,2,44,England,Premier League,male,2003/2004,2020-08-31T20:40:28.969635,2020-08-31T20:40:28.969635
3,3,b31e69b0-a75e-4721-b023-c06094ddcfa0,4,1,00:00:00.000,0,0,1,0.0,18,Half Start,1,Arsenal,1,Regular Play,22,Leicester City,,,['6e678cba-67c3-4e9a-acca-78ab69b7d68b'],,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,3749257,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0,2004-05-15,16:00:00.000,2,1,available,2020-08-30T08:12:14.579037,38,"{'id': 1279, 'name': 'None'}",,2,England,Premier League,44,2003/2004,1,Arsenal,male,,"[{'id': 577, 'name': 'Arsène Wenger', 'nicknam...",68,England,22,Leicester City,male,,"[{'id': 2974, 'name': 'Micky Adams', 'nickname...",68,England,1.1.0,2.0,2.0,1,Regular Season,2,44,England,Premier League,male,2003/2004,2020-08-31T20:40:28.969635,2020-08-31T20:40:28.969635
4,4,0613063a-1cd4-4a18-83a7-9722be2d9f40,5,1,00:00:01.036,0,1,2,0.238292,30,Pass,22,Leicester City,9,From Kick Off,22,Leicester City,,,['500a6fd9-61c7-4b61-bb3c-8e6605c24084'],"[61.0, 40.1]",40240.0,Paul Dickov,24.0,Left Center Forward,40242.0,Marcus Bent,1.104536,-1.661456,1.0,Ground Pass,"[60.9, 39.0]",65.0,Kick Off,38.0,Left Foot,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,3749257,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0,2004-05-15,16:00:00.000,2,1,available,2020-08-30T08:12:14.579037,38,"{'id': 1279, 'name': 'None'}",,2,England,Premier League,44,2003/2004,1,Arsenal,male,,"[{'id': 577, 'name': 'Arsène Wenger', 'nicknam...",68,England,22,Leicester City,male,,"[{'id': 2974, 'name': 'Micky Adams', 'nickname...",68,England,1.1.0,2.0,2.0,1,Regular Season,2,44,England,Premier League,male,2003/2004,2020-08-31T20:40:28.969635,2020-08-31T20:40:28.969635


In [45]:
df_events_matches_competitions.shape

(3158158, 193)
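The merged DataFrame has 3,158,158 rows, fewer than the 3,161,917 rows of `df_events_flat`. `pd.merge` performs an inner join by default, so rows without a matching key on the other side are dropped; whether that fully explains the difference here is not shown in the notebook. A sketch on made-up miniature frames of how a left join with `indicator=True` and `validate` could surface such rows:

```python
import pandas as pd

# Made-up miniature tables; match 999 has no row in the matches table
df_events_demo = pd.DataFrame({
    'id': ['e1', 'e2', 'e3'],
    'match_id': [3749257, 3749257, 999],
})
df_matches_demo = pd.DataFrame({
    'match_id': [3749257, 69284],
    'match_date': ['2004-05-15', '2019-07-03'],
})

# Left join keeps every event; validate raises if match_id is duplicated on the right
df_check = pd.merge(
    df_events_demo,
    df_matches_demo,
    on='match_id',
    how='left',
    validate='many_to_one',
    indicator=True,
)

# Events that found no match row
df_unmatched = df_check[df_check['_merge'] == 'left_only']
```

Inspecting `df_unmatched['match_id'].unique()` on the real data would show which matches are missing from the right-hand table.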

##### Join Lineups Data to Events-Match-Competition Data

In [46]:
# Join the Lineups DataFrame to the Events-Match-Competition DataFrame
#df_sb_merge = pd.merge(df_sb_merge, df_lineups, left_on=['competition.competition_id', 'season.season_id'], right_on=['competition_id', 'season_id'])

In [47]:
#df_sb_merge.shape
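The commented-out merge above joins on competition and season keys, which the Lineups columns shown earlier (`match_id`, `team_id`, `lineup_player_id`, ...) do not include. A possible alternative, sketched on made-up miniature frames and not the author's method, is to join lineups to events per player, team and match:

```python
import numpy as np
import pandas as pd

# Made-up miniature tables
df_events_demo = pd.DataFrame({
    'id': ['e1', 'e2'],
    'match_id': [3749257, 3749257],
    'team_id': [1, 22],
    'player_id': [np.nan, 40240.0],   # team-level events carry no player
})
df_lineups_demo = pd.DataFrame({
    'match_id': [3749257],
    'team_id': [22],
    'lineup_player_id': [40240],
    'lineup_jersey_number': [9],      # illustrative value
})

# Align key dtypes: player_id is float64 in the events data because of missing values
df_lineups_demo['lineup_player_id'] = df_lineups_demo['lineup_player_id'].astype('float64')

# Left join keeps events without a player; validate guards against duplicate lineup rows
df_joined_demo = pd.merge(
    df_events_demo,
    df_lineups_demo,
    left_on=['match_id', 'team_id', 'player_id'],
    right_on=['match_id', 'team_id', 'lineup_player_id'],
    how='left',
    validate='many_to_one',
)
```

A left join is used so that events without a player, such as `Starting XI` or `Half Start`, are kept with empty lineup fields.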

### <a id='#section3.5'>3.5. Initial Data Handling</a>
Let's assess the quality of the dataset by looking at the first and last rows, using the pandas [head()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html) and [tail()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.tail.html) methods.

In [49]:
# Display the first 5 rows of the raw DataFrame, df_events_matches_competitions
df_events_matches_competitions.head()

Unnamed: 0,level_0,id,index_x,period,timestamp,minute,second,possession,duration,type_id,type_name,possession_team_id,possession_team_name,play_pattern_id,play_pattern_name,team_id,team_name,tactics_formation,tactics_lineup,related_events,location,player_id,player_name,position_id,position_name,pass_recipient_id,pass_recipient_name,pass_length,pass_angle,pass_height_id,pass_height_name,pass_end_location,pass_type_id,pass_type_name,pass_body_part_id,pass_body_part_name,carry_end_location,pass_outcome_id,pass_outcome_name,under_pressure,clearance_head,clearance_body_part_id,clearance_body_part_name,counterpress,duel_outcome_id,duel_outcome_name,duel_type_id,duel_type_name,ball_receipt_outcome_id,ball_receipt_outcome_name,out,clearance_left_foot,pass_switch,off_camera,clearance_aerial_won,dribble_outcome_id,dribble_outcome_name,pass_cross,pass_assisted_shot_id,pass_shot_assist,shot_statsbomb_xg,shot_end_location,shot_key_pass_id,shot_body_part_id,shot_body_part_name,shot_technique_id,shot_technique_name,shot_outcome_id,shot_outcome_name,shot_type_id,shot_type_name,shot_freeze_frame,goalkeeper_end_location,goalkeeper_type_id,goalkeeper_type_name,goalkeeper_position_id,goalkeeper_position_name,ball_recovery_recovery_failure,foul_committed_advantage,foul_won_advantage,dribble_overrun,clearance_right_foot,interception_outcome_id,interception_outcome_name,foul_won_defensive,pass_aerial_won,pass_deflected,pass_inswinging,pass_technique_id,pass_technique_name,goalkeeper_body_part_id,goalkeeper_body_part_name,goalkeeper_technique_id,goalkeeper_technique_name,goalkeeper_outcome_id,goalkeeper_outcome_name,pass_outswinging,pass_goal_assist,shot_one_on_one,miscontrol_aerial_won,shot_deflected,block_deflection,shot_first_time,block_offensive,pass_through_ball,foul_committed_card_id,foul_committed_card_name,foul_committed_penalty,foul_won_penalty,dribble_nutmeg,pass_miscommunication,pass_no_touch,foul_committed_offensive,goalkeeper_lost_out,pass_straight,substitution_outcome_id,subs
titution_outcome_name,substitution_replacement_id,substitution_replacement_name,match_id,goalkeeper_punched_out,shot_aerial_won,pass_cut_back,goalkeeper_success_in_play,50_50_outcome_id,50_50_outcome_name,foul_committed_type_id,foul_committed_type_name,ball_recovery_offensive,shot_saved_off_target,goalkeeper_shot_saved_off_target,shot_open_goal,dribble_no_touch,bad_behaviour_card_id,bad_behaviour_card_name,half_start_late_video_start,block_save_block,shot_follows_dribble,clearance_other,goalkeeper_shot_saved_to_post,shot_redirect,injury_stoppage_in_chain,shot_saved_to_post,goalkeeper_success_out,goalkeeper_lost_in_play,half_end_early_video_end,player_off_permanent,goalkeeper_saved_to_post,pass_backheel,shot_kick_off,goalkeeper_penalty_saved_to_post,index_y,match_date,kick_off,home_score,away_score,match_status,last_updated,match_week,referee,stadium,competition_competition_id,competition_country_name,competition_competition_name,season_season_id,season_season_name,home_team_home_team_id,home_team_home_team_name,home_team_home_team_gender,home_team_home_team_group,home_team_managers,home_team_country_id,home_team_country_name,away_team_away_team_id,away_team_away_team_name,away_team_away_team_gender,away_team_away_team_group,away_team_managers,away_team_country_id,away_team_country_name,metadata_data_version,metadata_shot_fidelity_version,metadata_xy_fidelity_version,competition_stage_id,competition_stage_name,competition_id,season_id,country_name,competition_name,competition_gender,season_name,match_updated,match_available
0,0,41e0ff39-da7c-451a-8f08-82d3a9b369f2,1,1,00:00:00.000,0,0,1,0.0,35,Starting XI,1,Arsenal,1,Regular Play,1,Arsenal,442.0,"[{'player': {'id': 20015, 'name': 'Jens Lehman...",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,3749257,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0,2004-05-15,16:00:00.000,2,1,available,2020-08-30T08:12:14.579037,38,"{'id': 1279, 'name': 'None'}",,2,England,Premier League,44,2003/2004,1,Arsenal,male,,"[{'id': 577, 'name': 'Arsène Wenger', 'nicknam...",68,England,22,Leicester City,male,,"[{'id': 2974, 'name': 'Micky Adams', 'nickname...",68,England,1.1.0,2.0,2.0,1,Regular Season,2,44,England,Premier League,male,2003/2004,2020-08-31T20:40:28.969635,2020-08-31T20:40:28.969635
1,1,d8c32d32-494b-4ae1-bb0c-d2f738952e3c,2,1,00:00:00.000,0,0,1,0.0,35,Starting XI,1,Arsenal,1,Regular Play,22,Leicester City,442.0,"[{'player': {'id': 40236, 'name': 'Ian Walker'...",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,3749257,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0,2004-05-15,16:00:00.000,2,1,available,2020-08-30T08:12:14.579037,38,"{'id': 1279, 'name': 'None'}",,2,England,Premier League,44,2003/2004,1,Arsenal,male,,"[{'id': 577, 'name': 'Arsène Wenger', 'nicknam...",68,England,22,Leicester City,male,,"[{'id': 2974, 'name': 'Micky Adams', 'nickname...",68,England,1.1.0,2.0,2.0,1,Regular Season,2,44,England,Premier League,male,2003/2004,2020-08-31T20:40:28.969635,2020-08-31T20:40:28.969635
2,2,6e678cba-67c3-4e9a-acca-78ab69b7d68b,3,1,00:00:00.000,0,0,1,0.0,18,Half Start,1,Arsenal,1,Regular Play,1,Arsenal,,,['b31e69b0-a75e-4721-b023-c06094ddcfa0'],,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,3749257,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0,2004-05-15,16:00:00.000,2,1,available,2020-08-30T08:12:14.579037,38,"{'id': 1279, 'name': 'None'}",,2,England,Premier League,44,2003/2004,1,Arsenal,male,,"[{'id': 577, 'name': 'Arsène Wenger', 'nicknam...",68,England,22,Leicester City,male,,"[{'id': 2974, 'name': 'Micky Adams', 'nickname...",68,England,1.1.0,2.0,2.0,1,Regular Season,2,44,England,Premier League,male,2003/2004,2020-08-31T20:40:28.969635,2020-08-31T20:40:28.969635
3,3,b31e69b0-a75e-4721-b023-c06094ddcfa0,4,1,00:00:00.000,0,0,1,0.0,18,Half Start,1,Arsenal,1,Regular Play,22,Leicester City,,,['6e678cba-67c3-4e9a-acca-78ab69b7d68b'],,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,3749257,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0,2004-05-15,16:00:00.000,2,1,available,2020-08-30T08:12:14.579037,38,"{'id': 1279, 'name': 'None'}",,2,England,Premier League,44,2003/2004,1,Arsenal,male,,"[{'id': 577, 'name': 'Arsène Wenger', 'nicknam...",68,England,22,Leicester City,male,,"[{'id': 2974, 'name': 'Micky Adams', 'nickname...",68,England,1.1.0,2.0,2.0,1,Regular Season,2,44,England,Premier League,male,2003/2004,2020-08-31T20:40:28.969635,2020-08-31T20:40:28.969635
4,4,0613063a-1cd4-4a18-83a7-9722be2d9f40,5,1,00:00:01.036,0,1,2,0.238292,30,Pass,22,Leicester City,9,From Kick Off,22,Leicester City,,,['500a6fd9-61c7-4b61-bb3c-8e6605c24084'],"[61.0, 40.1]",40240.0,Paul Dickov,24.0,Left Center Forward,40242.0,Marcus Bent,1.104536,-1.661456,1.0,Ground Pass,"[60.9, 39.0]",65.0,Kick Off,38.0,Left Foot,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,3749257,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0,2004-05-15,16:00:00.000,2,1,available,2020-08-30T08:12:14.579037,38,"{'id': 1279, 'name': 'None'}",,2,England,Premier League,44,2003/2004,1,Arsenal,male,,"[{'id': 577, 'name': 'Arsène Wenger', 'nicknam...",68,England,22,Leicester City,male,,"[{'id': 2974, 'name': 'Micky Adams', 'nickname...",68,England,1.1.0,2.0,2.0,1,Regular Season,2,44,England,Premier League,male,2003/2004,2020-08-31T20:40:28.969635,2020-08-31T20:40:28.969635


In [50]:
# Display the last 5 rows of the raw DataFrame, df_events_matches_competitions
df_events_matches_competitions.tail()

Unnamed: 0,level_0,id,index_x,period,timestamp,minute,second,possession,duration,type_id,type_name,possession_team_id,possession_team_name,play_pattern_id,play_pattern_name,team_id,team_name,tactics_formation,tactics_lineup,related_events,location,player_id,player_name,position_id,position_name,pass_recipient_id,pass_recipient_name,pass_length,pass_angle,pass_height_id,pass_height_name,pass_end_location,pass_type_id,pass_type_name,pass_body_part_id,pass_body_part_name,carry_end_location,pass_outcome_id,pass_outcome_name,under_pressure,clearance_head,clearance_body_part_id,clearance_body_part_name,counterpress,duel_outcome_id,duel_outcome_name,duel_type_id,duel_type_name,ball_receipt_outcome_id,ball_receipt_outcome_name,out,clearance_left_foot,pass_switch,off_camera,clearance_aerial_won,dribble_outcome_id,dribble_outcome_name,pass_cross,pass_assisted_shot_id,pass_shot_assist,shot_statsbomb_xg,shot_end_location,shot_key_pass_id,shot_body_part_id,shot_body_part_name,shot_technique_id,shot_technique_name,shot_outcome_id,shot_outcome_name,shot_type_id,shot_type_name,shot_freeze_frame,goalkeeper_end_location,goalkeeper_type_id,goalkeeper_type_name,goalkeeper_position_id,goalkeeper_position_name,ball_recovery_recovery_failure,foul_committed_advantage,foul_won_advantage,dribble_overrun,clearance_right_foot,interception_outcome_id,interception_outcome_name,foul_won_defensive,pass_aerial_won,pass_deflected,pass_inswinging,pass_technique_id,pass_technique_name,goalkeeper_body_part_id,goalkeeper_body_part_name,goalkeeper_technique_id,goalkeeper_technique_name,goalkeeper_outcome_id,goalkeeper_outcome_name,pass_outswinging,pass_goal_assist,shot_one_on_one,miscontrol_aerial_won,shot_deflected,block_deflection,shot_first_time,block_offensive,pass_through_ball,foul_committed_card_id,foul_committed_card_name,foul_committed_penalty,foul_won_penalty,dribble_nutmeg,pass_miscommunication,pass_no_touch,foul_committed_offensive,goalkeeper_lost_out,pass_straight,substitution_outcome_id,subs
titution_outcome_name,substitution_replacement_id,substitution_replacement_name,match_id,goalkeeper_punched_out,shot_aerial_won,pass_cut_back,goalkeeper_success_in_play,50_50_outcome_id,50_50_outcome_name,foul_committed_type_id,foul_committed_type_name,ball_recovery_offensive,shot_saved_off_target,goalkeeper_shot_saved_off_target,shot_open_goal,dribble_no_touch,bad_behaviour_card_id,bad_behaviour_card_name,half_start_late_video_start,block_save_block,shot_follows_dribble,clearance_other,goalkeeper_shot_saved_to_post,shot_redirect,injury_stoppage_in_chain,shot_saved_to_post,goalkeeper_success_out,goalkeeper_lost_in_play,half_end_early_video_end,player_off_permanent,goalkeeper_saved_to_post,pass_backheel,shot_kick_off,goalkeeper_penalty_saved_to_post,index_y,match_date,kick_off,home_score,away_score,match_status,last_updated,match_week,referee,stadium,competition_competition_id,competition_country_name,competition_competition_name,season_season_id,season_season_name,home_team_home_team_id,home_team_home_team_name,home_team_home_team_gender,home_team_home_team_group,home_team_managers,home_team_country_id,home_team_country_name,away_team_away_team_id,away_team_away_team_name,away_team_away_team_gender,away_team_away_team_group,away_team_managers,away_team_country_id,away_team_country_name,metadata_data_version,metadata_shot_fidelity_version,metadata_xy_fidelity_version,competition_stage_id,competition_stage_name,competition_id,season_id,country_name,competition_name,competition_gender,season_name,match_updated,match_available
3158153,4237,3821ac0d-f832-4bbd-a9c3-f00dc45bfd2a,4238,4,00:22:30.391,127,30,268,0.04,43,Carry,858,Sweden Women's,4,From Throw In,858,Sweden Women's,,,"['4d7320fa-dde2-4191-9172-05745c4e25d2', '633c...","[106.0, 3.8]",10222.0,Jonna Andersson,6.0,Left Back,,,,,,,,,,,,"[106.0, 3.8]",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,69284,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,51,2019-07-03,21:00:00.000,1,0,available,2020-07-29T05:00,6,"{'id': 1627, 'name': 'M. Beaudoin'}","{'id': 193, 'name': 'Groupama Stadium', 'count...",72,International,Women's World Cup,30,2019,851,Netherlands Women's,female,,"[{'id': 45, 'name': 'Sarina Glotzbach-Wiegman'...",160,Netherlands,858,Sweden Women's,female,,"[{'id': 3016, 'name': 'Peter Gerhardsson', 'ni...",220,Sweden,1.1.0,2.0,2.0,15,Semi-finals,72,30,International,Women's World Cup,female,2019,2020-07-29T05:00,2020-07-29T05:00
3158154,4238,633cac33-a3e7-4e4e-9bc8-10c8bf67c5ca,4239,4,00:22:30.431,127,30,268,2.176919,30,Pass,858,Sweden Women's,4,From Throw In,858,Sweden Women's,,,['2a96c784-b880-4d0e-a66e-5b4f765cf924'],"[106.0, 3.8]",10222.0,Jonna Andersson,6.0,Left Back,,,38.552303,1.435502,3.0,High Pass,"[111.2, 42.0]",,,38.0,Left Foot,,9.0,Incomplete,,,,,,,,,,,,,,,,,,,True,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,69284,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,51,2019-07-03,21:00:00.000,1,0,available,2020-07-29T05:00,6,"{'id': 1627, 'name': 'M. Beaudoin'}","{'id': 193, 'name': 'Groupama Stadium', 'count...",72,International,Women's World Cup,30,2019,851,Netherlands Women's,female,,"[{'id': 45, 'name': 'Sarina Glotzbach-Wiegman'...",160,Netherlands,858,Sweden Women's,female,,"[{'id': 3016, 'name': 'Peter Gerhardsson', 'ni...",220,Sweden,1.1.0,2.0,2.0,15,Semi-finals,72,30,International,Women's World Cup,female,2019,2020-07-29T05:00,2020-07-29T05:00
3158155,4239,2a96c784-b880-4d0e-a66e-5b4f765cf924,4240,4,00:22:32.608,127,32,269,0.0,23,Goal Keeper,851,Netherlands Women's,8,From Keeper,851,Netherlands Women's,,,['633cac33-a3e7-4e4e-9bc8-10c8bf67c5ca'],"[9.8, 39.0]",10646.0,Sari van Veenendaal,1.0,Goalkeeper,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,25.0,Collected,,,,,,,,,,,,,,,,,,,,15.0,Success,,,,,,,,,,,,,,,,,,,,,,,,69284,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,51,2019-07-03,21:00:00.000,1,0,available,2020-07-29T05:00,6,"{'id': 1627, 'name': 'M. Beaudoin'}","{'id': 193, 'name': 'Groupama Stadium', 'count...",72,International,Women's World Cup,30,2019,851,Netherlands Women's,female,,"[{'id': 45, 'name': 'Sarina Glotzbach-Wiegman'...",160,Netherlands,858,Sweden Women's,female,,"[{'id': 3016, 'name': 'Peter Gerhardsson', 'ni...",220,Sweden,1.1.0,2.0,2.0,15,Semi-finals,72,30,International,Women's World Cup,female,2019,2020-07-29T05:00,2020-07-29T05:00
3158156,4240,dc2ac9d4-03bb-4e2f-b462-ddf16187934c,4241,4,00:22:38.347,127,38,269,0.0,34,Half End,851,Netherlands Women's,8,From Keeper,858,Sweden Women's,,,['9125ce39-0492-406a-8186-0332f1669dc7'],,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,69284,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,51,2019-07-03,21:00:00.000,1,0,available,2020-07-29T05:00,6,"{'id': 1627, 'name': 'M. Beaudoin'}","{'id': 193, 'name': 'Groupama Stadium', 'count...",72,International,Women's World Cup,30,2019,851,Netherlands Women's,female,,"[{'id': 45, 'name': 'Sarina Glotzbach-Wiegman'...",160,Netherlands,858,Sweden Women's,female,,"[{'id': 3016, 'name': 'Peter Gerhardsson', 'ni...",220,Sweden,1.1.0,2.0,2.0,15,Semi-finals,72,30,International,Women's World Cup,female,2019,2020-07-29T05:00,2020-07-29T05:00
3158157,4241,9125ce39-0492-406a-8186-0332f1669dc7,4242,4,00:22:38.347,127,38,269,0.0,34,Half End,851,Netherlands Women's,8,From Keeper,851,Netherlands Women's,,,['dc2ac9d4-03bb-4e2f-b462-ddf16187934c'],,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,69284,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,51,2019-07-03,21:00:00.000,1,0,available,2020-07-29T05:00,6,"{'id': 1627, 'name': 'M. Beaudoin'}","{'id': 193, 'name': 'Groupama Stadium', 'count...",72,International,Women's World Cup,30,2019,851,Netherlands Women's,female,,"[{'id': 45, 'name': 'Sarina Glotzbach-Wiegman'...",160,Netherlands,858,Sweden Women's,female,,"[{'id': 3016, 'name': 'Peter Gerhardsson', 'ni...",220,Sweden,1.1.0,2.0,2.0,15,Semi-finals,72,30,International,Women's World Cup,female,2019,2020-07-29T05:00,2020-07-29T05:00


In [51]:
# Print the shape of the raw DataFrame, df_events_matches_competitions
print(df_events_matches_competitions.shape)

(3158158, 193)


In [52]:
# Print the column names of the raw DataFrame, df_events_matches_competitions
print(df_events_matches_competitions.columns)

Index(['level_0', 'id', 'index_x', 'period', 'timestamp', 'minute', 'second',
       'possession', 'duration', 'type_id',
       ...
       'competition_stage_id', 'competition_stage_name', 'competition_id',
       'season_id', 'country_name', 'competition_name', 'competition_gender',
       'season_name', 'match_updated', 'match_available'],
      dtype='object', length=193)


The joined dataset has 193 features (columns). Full details of these attributes can be found in the [Data Dictionary](#section3.3.1).

In [53]:
# Data types of the features of the raw DataFrame, df_events_matches_competitions
df_events_matches_competitions.dtypes

level_0                int64
id                    object
index_x                int64
period                 int64
timestamp             object
                       ...  
competition_name      object
competition_gender    object
season_name           object
match_updated         object
match_available       object
Length: 193, dtype: object

Full details of these attributes and their data types can be found in the [Data Dictionary](#section3.3.1).

In [54]:
# Info for the raw DataFrame, df_events_matches_competitions
df_events_matches_competitions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3158158 entries, 0 to 3158157
Columns: 193 entries, level_0 to match_available
dtypes: float64(36), int64(24), object(133)
memory usage: 4.5+ GB


In [55]:
# Description of the raw DataFrame, df_events_matches_competitions, showing summary statistics for each numerical column in the DataFrame
df_events_matches_competitions.describe()

Unnamed: 0,level_0,index_x,period,minute,second,possession,duration,type_id,possession_team_id,play_pattern_id,team_id,tactics_formation,player_id,position_id,pass_recipient_id,pass_length,pass_angle,pass_height_id,pass_type_id,pass_body_part_id,pass_outcome_id,clearance_body_part_id,duel_outcome_id,duel_type_id,ball_receipt_outcome_id,dribble_outcome_id,shot_statsbomb_xg,shot_body_part_id,shot_technique_id,shot_outcome_id,shot_type_id,goalkeeper_type_id,goalkeeper_position_id,interception_outcome_id,pass_technique_id,goalkeeper_body_part_id,goalkeeper_technique_id,goalkeeper_outcome_id,foul_committed_card_id,substitution_outcome_id,substitution_replacement_id,match_id,50_50_outcome_id,foul_committed_type_id,bad_behaviour_card_id,index_y,home_score,away_score,match_week,competition_competition_id,season_season_id,home_team_home_team_id,home_team_country_id,away_team_away_team_id,away_team_country_id,metadata_shot_fidelity_version,metadata_xy_fidelity_version,competition_stage_id,competition_id,season_id
count,3158158.0,3158158.0,3158158.0,3158158.0,3158158.0,3158158.0,2313994.0,3158158.0,3158158.0,3158158.0,3158158.0,3301.0,3141590.0,3141590.0,809438.0,871308.0,871308.0,871308.0,171945.0,819407.0,185449.0,30932.0,34423.0,55407.0,123579.0,36494.0,22357.0,22357.0,22357.0,22357.0,22357.0,26473.0,22337.0,16815.0,10304.0,6448.0,8660.0,12634.0,2799.0,4842.0,4847.0,3158158.0,1366.0,1575.0,628.0,3158158.0,3158158.0,3158158.0,3158158.0,3158158.0,3158158.0,3158158.0,3158158.0,3158158.0,3158158.0,2494494.0,2154338.0,3158158.0,3158158.0,3158158.0
mean,1820.498,1821.498,1.500939,44.86634,29.30872,96.64262,1.273816,32.3692,493.0352,2.823449,497.6638,4814.252348,12630.05,11.24376,12222.931576,21.058518,0.013494,1.555189,65.1301,39.882497,19.264547,38.253136,11.894199,10.621275,9.0,8.381789,0.117009,38.998166,92.90607,98.163394,85.52838,31.258868,43.852532,10.267023,106.536685,36.551179,45.561085,40.609387,6.931404,102.945477,14064.650505,439718.7,2.036603,23.274286,6.93949,23.97784,1.808959,1.43636,14.91716,23.19544,22.05346,512.7428,167.1042,510.0286,166.6108,2.0,2.0,2.563879,23.19544,22.05346
std,1076.392,1076.392,0.5160235,27.14136,17.38708,57.67026,2.148291,12.13322,364.0865,2.198834,365.5942,11433.567582,8832.031,7.189082,8679.003819,14.275584,1.542211,0.815164,1.91996,4.434959,24.014366,2.643092,4.937,0.485074,0.0,0.485832,0.157306,1.999809,0.830046,2.06486,5.919048,6.652252,0.502779,5.258971,1.732251,2.272881,0.496283,20.157284,0.328984,0.22707,9347.389479,924383.0,1.179858,1.250034,0.318741,21.30848,1.694212,1.428223,10.84382,18.09714,15.53319,368.382,69.43479,371.7773,70.05726,0.0,0.0,5.538984,18.09714,15.53319
min,0.0,1.0,1.0,0.0,0.0,1.0,-739.8828,2.0,1.0,1.0,1.0,55.0,2941.0,0.0,2941.0,0.0,-3.138562,1.0,61.0,37.0,9.0,37.0,4.0,10.0,9.0,8.0,0.0,37.0,89.0,96.0,61.0,25.0,42.0,4.0,104.0,35.0,45.0,1.0,5.0,102.0,2948.0,7430.0,1.0,19.0,5.0,0.0,0.0,0.0,1.0,2.0,1.0,1.0,11.0,1.0,11.0,2.0,2.0,1.0,2.0,1.0
25%,899.0,900.0,1.0,21.0,14.0,47.0,0.365368,30.0,217.0,1.0,217.0,433.0,5216.0,5.0,5216.0,11.313708,-1.162647,1.0,63.0,38.0,9.0,37.0,4.0,10.0,9.0,8.0,0.026083,38.0,93.0,97.0,87.0,30.0,44.0,4.0,105.0,35.0,45.0,15.0,7.0,103.0,5528.5,19757.0,1.0,23.0,7.0,9.0,1.0,0.0,5.0,11.0,4.0,217.0,68.0,217.0,68.0,2.0,2.0,1.0,11.0,4.0
50%,1798.0,1799.0,1.0,45.0,29.0,95.0,1.059719,38.0,217.0,2.0,217.0,442.0,7179.0,11.0,6829.0,17.088007,0.0,1.0,66.0,40.0,9.0,37.0,14.0,11.0,9.0,8.0,0.053939,40.0,93.0,98.0,87.0,32.0,44.0,13.0,108.0,35.0,46.0,52.0,7.0,103.0,10650.0,69234.0,1.0,24.0,7.0,19.0,1.0,1.0,13.0,11.0,24.0,220.0,214.0,220.0,214.0,2.0,2.0,1.0,11.0,24.0
75%,2700.0,2701.0,2.0,68.0,44.0,143.0,1.737,42.0,852.0,4.0,857.0,4231.0,19767.0,17.0,19419.0,26.41969,1.195864,2.0,66.0,40.0,9.0,40.0,16.0,11.0,9.0,9.0,0.132065,40.0,93.0,100.0,87.0,32.0,44.0,16.0,108.0,39.0,46.0,55.0,7.0,103.0,23791.0,266669.0,3.0,24.0,7.0,30.0,3.0,2.0,23.0,37.0,39.0,865.0,214.0,901.0,214.0,2.0,2.0,1.0,37.0,39.0
max,5025.0,5026.0,5.0,128.0,59.0,302.0,1471.906,43.0,1475.0,9.0,1475.0,312112.0,41125.0,25.0,41125.0,119.19941,3.141593,3.0,67.0,106.0,77.0,70.0,17.0,11.0,9.0,9.0,0.93174,70.0,95.0,116.0,88.0,114.0,44.0,17.0,108.0,41.0,46.0,117.0,7.0,103.0,41125.0,3752619.0,4.0,24.0,7.0,106.0,13.0,8.0,38.0,72.0,44.0,1475.0,255.0,1475.0,242.0,2.0,2.0,33.0,72.0,44.0
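Many of the numerical columns summarised above are identifiers (`*_id`), for which a mean or standard deviation carries no meaning, and the `duration` column has a negative minimum that may be worth inspecting. The following is a minimal sketch of how the summary could be restricted to measure-like columns and how suspicious rows could be isolated. It runs on a small toy DataFrame (`df_toy`, with illustrative values only) rather than the full `df_events_matches_competitions`, and the rule "drop columns ending in `_id`" is an assumption, not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for df_events_matches_competitions (illustrative values only)
df_toy = pd.DataFrame({
    'type_id': [43, 30, 23, 34],
    'duration': [0.04, 2.176919, 0.0, -1.5],
    'pass_length': [None, 38.552303, None, None],
    'match_id': [69284, 69284, 69284, 69284],
})

# Keep numerical columns that do not look like identifiers (assumed naming rule)
numeric_cols = df_toy.select_dtypes(include='number').columns
measure_cols = [c for c in numeric_cols if not (c == 'id' or c.endswith('_id'))]

# Summary statistics for the remaining columns only
summary = df_toy[measure_cols].describe()
print(summary)

# Isolate rows with a negative duration for manual inspection
negative_duration = df_toy[df_toy['duration'] < 0]
print(negative_duration)
```

The same two steps should apply unchanged to the full DataFrame by replacing `df_toy` with `df_events_matches_competitions`.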


In [None]:
# Plot visualisation of the missing values for each feature of the raw DataFrame, df_events_matches_competitions
msno.matrix(df_events_matches_competitions, figsize = (30, 7))

In [None]:
# Counts of missing values
null_value_stats = df_events_matches_competitions.isnull().sum(axis=0)
null_value_stats[null_value_stats != 0]

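Not every feature is complete: the `count` row of the summary statistics above shows that many columns have far fewer than 3,158,158 non-null values (for example, `duration` has 2,313,994 and `pass_length` has 871,308). This is expected, because event-specific attributes such as the `pass_*` and `shot_*` columns are only populated for events of the relevant type.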
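To put a number on the missingness of each feature, the null counts can be expressed as a percentage of all rows. The sketch below is a minimal example on a toy DataFrame (`df_toy`, with illustrative values only); `missing_share` is a name introduced here for illustration.

```python
import pandas as pd

# Toy stand-in for df_events_matches_competitions (illustrative values only)
df_toy = pd.DataFrame({
    'type_name': ['Carry', 'Pass', 'Goal Keeper', 'Half End'],
    'player_name': ['Jonna Andersson', 'Jonna Andersson', 'Sari van Veenendaal', None],
    'pass_length': [None, 38.552303, None, None],
})

# Percentage of missing values per column, most incomplete first
missing_share = df_toy.isnull().mean().mul(100).sort_values(ascending=False)

# Keep only the columns that actually have missing values
missing_share = missing_share[missing_share > 0]
print(missing_share)
```

On the full DataFrame, columns tied to rare event types would be expected to sit near the top of this list.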

---

## <a id='#section5'>5. Export Data</a>
Export the data, ready for data engineering in the subsequent notebooks.

In [None]:
# Export DataFrame as a CSV file, skipping the export if the file already exists
csv_path = os.path.join(data_dir_sb, 'combined', 'raw', 'csv', 'combined.csv')
if not os.path.exists(csv_path):
    df_events_matches_competitions.to_csv(csv_path, index=False, header=True)
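One consequence of exporting to CSV is that nested values, such as the `referee` dictionaries and `location` lists seen in the rows above, are written as plain strings and come back as strings when the file is read again. The following is a minimal sketch of how such columns could be restored in a later notebook, assuming the strings are valid Python literals. It uses an in-memory buffer and a toy DataFrame instead of the real file, and `parse_literal` is a hypothetical helper, not part of the original notebook.

```python
import ast
import io

import pandas as pd

# Toy stand-in with one dict column and one list column (illustrative values only)
df_toy = pd.DataFrame({
    'match_id': [69284],
    'referee': [{'id': 1627, 'name': 'M. Beaudoin'}],
    'location': [[106.0, 3.8]],
})

# Round-trip through CSV using an in-memory buffer instead of a file on disk
buffer = io.StringIO()
df_toy.to_csv(buffer, index=False)
buffer.seek(0)
df_back = pd.read_csv(buffer)

# After reading, the nested columns are plain strings
raw_type = type(df_back.loc[0, 'referee'])


def parse_literal(value):
    """Convert a stringified Python literal back to an object; pass other values through."""
    if isinstance(value, str):
        return ast.literal_eval(value)
    return value


# Restore only the columns known to hold nested values
for col in ['referee', 'location']:
    df_back[col] = df_back[col].apply(parse_literal)

print(df_back.loc[0, 'referee']['name'])
```

`ast.literal_eval` only evaluates literals, so it is a safer choice than `eval` here, but it raises an error on strings that are not valid literals, which is why the helper is applied to selected columns only.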

## <a id='#section6'>6. Summary</a>
This notebook parses publicly available [StatsBomb](https://statsbomb.com/) Event data into one unified Events DataFrame, using [pandas](http://pandas.pydata.org/) for data manipulation through DataFrames.

---

## <a id='#section7'>7. Next Steps</a>
The next step is to take the parsed dataset created in this notebook and engineer new features from it, which is carried out in the following [Data Engineering](https://nbviewer.jupyter.org/github/eddwebster/football_analytics/blob/master/notebooks/3_data_engineering/StatsBomb%20Data%20Engineering.ipynb) notebook. The engineered data is then ready for use in projects including Expected Goals (xG) models and Tableau visualisations.

## <a id='#section8'>8. References</a>

#### Data
*    [StatsBomb](https://statsbomb.com/) data
*    [StatsBomb](https://github.com/statsbomb/open-data/tree/master/data) open data GitHub repository

---

***Visit my website [EddWebster.com](https://www.eddwebster.com) or my [GitHub Repository](https://github.com/eddwebster) for more projects. If you'd like to get in contact, my Twitter handle is [@eddwebster](http://www.twitter.com/eddwebster) and my email is: edd.j.webster@gmail.com.***

[Back to the top](#top)