# Tutorial: Download data, value actions and rate players

This tutorial demonstrates how to value on-the-ball actions of football players with the open-source [VAEP framework](https://github.com/ML-KULeuven/socceraction) using the publicly available [Wyscout match event dataset](https://figshare.com/collections/Soccer_match_event_dataset/4415000). The Wyscout dataset includes data for the 2017/2018 English Premier League, the 2017/2018 Spanish Primera División, the 2017/2018 German 1. Bundesliga, the 2017/2018 Italian Serie A, the 2017/2018 French Ligue 1, the 2018 FIFA World Cup, and the UEFA Euro 2016. Covering 1,941 matches, 3,251,294 events and 4,299 players, the dataset is large enough to train machine-learning models and obtain robust ratings for the players.

This tutorial demonstrates the following four steps:
1. Download the [Wyscout dataset](https://figshare.com/collections/Soccer_match_event_dataset/4415000) and preprocess the relevant data.
2. Value game states by training predictive machine learning models.
   - Compute descriptive features for each game state.
   - Obtain labels for each game state (i.e., *Goal scored within next ten actions? Goal conceded within next ten actions?*).
3. Value on-the-ball actions by using the trained predictive machine learning models.
4. Rate players by aggregating the values of their on-the-ball actions.

This notebook is compatible with `socceraction` version `0.2.0`.

**Conventions:**
* Variables that refer to a `DataFrame` object are prefixed with `df_`.
* Variables that refer to a collection of `DataFrame` objects (e.g., a list, a set or a dict) are prefixed with `dfs_`.

**References:**
* Tom Decroos, Lotte Bransen, Jan Van Haaren, and Jesse Davis. "[Actions Speak Louder than Goals: Valuing Player Actions in Soccer.](https://arxiv.org/abs/1802.07127)" In *Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining*, pp. 1851-1861. 2019.
* Luca Pappalardo, Paolo Cintia, Alessio Rossi, Emanuele Massucco, Paolo Ferragina, Dino Pedreschi, and Fosca Giannotti. "[A Public Data Set of Spatio-Temporal Match Events in Soccer Competitions.](https://www.nature.com/articles/s41597-019-0247-7)" *Scientific Data 6*, no. 1 (2019): 1-15.

**Optional:** If you run this notebook on Google Colab, then uncomment the code in the following cell and execute the cell.

In [1]:
# !pip install tables==3.6.1
# !pip install socceraction==0.2.0

**Optional:** If you run this notebook on Google Colab and wish to store all data in a Google Drive folder, then uncomment the code in the following cell and execute the cell.

In [2]:
# from google.colab import drive
# drive.mount('/content/gdrive')
# %mkdir -p '/content/gdrive/My Drive/Friends of Tracking/'
# %cd '/content/gdrive/My Drive/Friends of Tracking/'

In [3]:
import warnings
from io import BytesIO, StringIO
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlopen, urlretrieve
from zipfile import ZipFile, is_zipfile

import pandas as pd
import socceraction.vaep.features as features
import socceraction.vaep.labels as labels
from sklearn.metrics import brier_score_loss, roc_auc_score
from socceraction.spadl.wyscout import convert_to_spadl
from socceraction.vaep.formula import value
from tqdm.notebook import tqdm
from xgboost import XGBClassifier

In [4]:
warnings.filterwarnings('ignore', category=pd.io.pytables.PerformanceWarning)

# Download and preprocess the data

This section downloads the Wyscout dataset, collects the required information about the match events, and converts the match events into the SPADL representation.

1. Download the Wyscout dataset;
2. Construct an HDF5 file named `wyscout.h5` that contains the relevant information from the dataset;
3. Convert the `wyscout.h5` file into a `spadl.h5` file that contains the same information in the SPADL representation.

**Note:** The `socceraction` library offers off-the-shelf functionality to convert a collection of Wyscout JSON files into the SPADL representation. However, the JSON files in the publicly available dataset are not directly compatible with the `socceraction` functionality. Therefore, we need to perform a few additional steps to transform the Wyscout data into the SPADL representation.

## Download the Wyscout dataset

The `data_files` `dict` lists the four data files in the Wyscout dataset that are required to run the VAEP framework.
* `events` (73.74 MB): match events for the matches in the dataset;
* `matches` (629.98 kB): overview of the matches in the dataset;
* `players` (1.66 MB): information on the players in the dataset;
* `teams` (26.76 kB): information on the teams in the dataset.

In [5]:
data_files = {
    'events': 'https://ndownloader.figshare.com/files/14464685',  # ZIP file containing one JSON file for each competition
    'matches': 'https://ndownloader.figshare.com/files/14464622',  # ZIP file containing one JSON file for each competition
    'players': 'https://ndownloader.figshare.com/files/15073721',  # JSON file
    'teams': 'https://ndownloader.figshare.com/files/15073697'  # JSON file
}

The following cell loops through the `data_files` `dict`, downloads each listed data file, and stores each downloaded data file to the local file system.

If the downloaded data file is a ZIP archive, the included JSON files are extracted from the ZIP archive and stored to the local file system.

**Note:** If you do not understand what the code below does exactly, then do not worry too much. ;-)

In [6]:
for url in tqdm(data_files.values()):
    url_s3 = urlopen(url).geturl()
    path = Path(urlparse(url_s3).path)
    file_name = path.name
    data_dir = Path('data')
    data_dir.mkdir(exist_ok=True)
    file_local, _ = urlretrieve(url_s3, str(data_dir / file_name))
    if is_zipfile(file_local):
        with ZipFile(file_local) as zip_file:
            zip_file.extractall(data_dir)

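
The ZIP handling in the download loop can be tried out in isolation. The following minimal sketch builds a small archive and a plain JSON file in a temporary directory and applies the same `is_zipfile` / `extractall` logic to both. The file names and contents are made up for illustration and are not the actual Wyscout files.

```python
import tempfile
from pathlib import Path
from zipfile import ZipFile, is_zipfile

with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp)

    # Hypothetical stand-ins for a downloaded ZIP archive and a plain JSON file
    archive = data_dir / 'matches.zip'
    with ZipFile(archive, 'w') as zip_file:
        zip_file.writestr('matches_England.json', '[]')
        zip_file.writestr('matches_Spain.json', '[]')
    plain = data_dir / 'teams.json'
    plain.write_text('[]')

    # Same decision as in the download loop: only ZIP archives are extracted
    results = {}
    for path in (archive, plain):
        if is_zipfile(path):
            with ZipFile(path) as zip_file:
                zip_file.extractall(data_dir)
            results[path.name] = 'extracted'
        else:
            results[path.name] = 'kept as-is'

    extracted = sorted(p.name for p in data_dir.glob('*.json'))

print(results)
print(extracted)
```

Checking the file content with `is_zipfile`, rather than relying on the file extension, keeps the loop independent of how the download server names the files.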

## Preprocess the Wyscout data

The `read_json_file` function reads and returns the content of a given JSON file. The function handles the encoding of special characters (e.g., accents in names of players and teams) that the `pd.read_json` function cannot handle properly.

In [7]:
def read_json_file(filename):
    with open(filename, 'rb') as json_file:
        # Decode escape sequences such as \u00e9 into the corresponding characters
        return json_file.read().decode('unicode_escape')
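
To see what the `unicode_escape` decoding step does, consider the following minimal sketch. The byte string is a made-up one-record example in the style of the teams file, not an excerpt from the dataset.

```python
import json

# Hypothetical raw bytes in which the accented character is stored as the escape \u00e9
raw = b'{"name": "Deportivo Alav\\u00e9s", "city": "Vitoria-Gasteiz"}'

# Decoding with 'unicode_escape' turns the escape sequence into the character itself
text = raw.decode('unicode_escape')
team = json.loads(text)

print(text)
print(team['name'])
```

Note that `unicode_escape` interprets any raw non-ASCII bytes as Latin-1, so this approach assumes that the file stores special characters as escape sequences rather than as raw UTF-8 bytes.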

### Teams

The following cells read the `teams.json` file into a `DataFrame` object and store that object in the `wyscout.h5` HDF5 file under the key `teams`.

In [8]:
json_teams = read_json_file('data/teams.json')
df_teams = pd.read_json(StringIO(json_teams))

In [9]:
df_teams.head(10)

Unnamed: 0,city,name,wyId,officialName,area,type
0,Newcastle upon Tyne,Newcastle United,1613,Newcastle United FC,"{'name': 'England', 'id': '0', 'alpha3code': '...",club
1,Vigo,Celta de Vigo,692,Real Club Celta de Vigo,"{'name': 'Spain', 'id': '724', 'alpha3code': '...",club
2,Barcelona,Espanyol,691,Reial Club Deportiu Espanyol,"{'name': 'Spain', 'id': '724', 'alpha3code': '...",club
3,Vitoria-Gasteiz,Deportivo Alavés,696,Deportivo Alavés,"{'name': 'Spain', 'id': '724', 'alpha3code': '...",club
4,Valencia,Levante,695,Levante UD,"{'name': 'Spain', 'id': '724', 'alpha3code': '...",club
5,Troyes,Troyes,3795,Espérance Sportive Troyes Aube Champagne,"{'name': 'France', 'id': '250', 'alpha3code': ...",club
6,Getafe (Madrid),Getafe,698,Getafe Club de Fútbol,"{'name': 'Spain', 'id': '724', 'alpha3code': '...",club
7,Mönchengladbach,Borussia M'gladbach,2454,Borussia VfL Mönchengladbach,"{'name': 'Germany', 'id': '276', 'alpha3code':...",club
8,"Huddersfield, West Yorkshire",Huddersfield Town,1673,Huddersfield Town FC,"{'name': 'England', 'id': '0', 'alpha3code': '...",club
9,Bilbao,Athletic Club,678,Athletic Club Bilbao,"{'name': 'Spain', 'id': '724', 'alpha3code': '...",club


In [10]:
df_teams.to_hdf('wyscout.h5', key='teams', mode='w')

### Players

The following cells read the `players.json` file into a `DataFrame` object and store that object in the `wyscout.h5` HDF5 file under the key `players`.

In [11]:
json_players = read_json_file('data/players.json')
df_players = pd.read_json(StringIO(json_players))

In [12]:
df_players.head(10)
df_players.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3603 entries, 0 to 3602
Data columns (total 14 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   passportArea           3603 non-null   object
 1   weight                 3603 non-null   int64 
 2   firstName              3603 non-null   object
 3   middleName             3603 non-null   object
 4   lastName               3603 non-null   object
 5   currentTeamId          3512 non-null   object
 6   birthDate              3603 non-null   object
 7   height                 3603 non-null   int64 
 8   role                   3603 non-null   object
 9   birthArea              3603 non-null   object
 10  wyId                   3603 non-null   int64 
 11  foot                   3603 non-null   object
 12  shortName              3603 non-null   object
 13  currentNationalTeamId  3603 non-null   object
dtypes: int64(3), object(11)
memory usage: 394.2+ KB


In [13]:
df_players.to_hdf('wyscout.h5', key='players', mode='a')

### Matches

The following cell lists the competitions to be included in the dataset. All seven competitions are included by default; comment out any competition that you want to exclude from your dataset.

In [14]:
competitions = [
    'England',
    'France',
    'Germany',
    'Italy',
    'Spain',
    'European Championship',
    'World Cup'
]

The following cells read the `matches_<competition>.json` files for the selected competitions into a single `DataFrame` object and store that object in the `wyscout.h5` HDF5 file under the key `matches`.

In [15]:
dfs_matches = []
for competition in competitions:
    competition_name = competition.replace(' ', '_')
    file_matches = f'data/matches_{competition_name}.json'
    json_matches = read_json_file(file_matches)
    df_matches = pd.read_json(StringIO(json_matches))
    dfs_matches.append(df_matches)
df_matches = pd.concat(dfs_matches)

In [16]:
df_matches.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1941 entries, 0 to 63
Data columns (total 15 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   status         1941 non-null   object
 1   roundId        1941 non-null   int64 
 2   gameweek       1941 non-null   int64 
 3   teamsData      1941 non-null   object
 4   seasonId       1941 non-null   int64 
 5   dateutc        1941 non-null   object
 6   winner         1941 non-null   int64 
 7   venue          1941 non-null   object
 8   wyId           1941 non-null   int64 
 9   label          1941 non-null   object
 10  date           1941 non-null   object
 11  referees       1941 non-null   object
 12  duration       1941 non-null   object
 13  competitionId  1941 non-null   int64 
 14  groupName      115 non-null    object
dtypes: int64(6), object(9)
memory usage: 242.6+ KB


In [17]:
df_matches.to_hdf('wyscout.h5', key='matches', mode='a')

### Events

The following cell reads the `events_<competition>.json` file for each selected competition into a `DataFrame` object, splits the events per match, and stores the events of each match in the `wyscout.h5` HDF5 file under the key `events/match_<match-id>`.

In [18]:
for competition in competitions:
    competition_name = competition.replace(' ', '_')
    file_events = f'data/events_{competition_name}.json'
    json_events = read_json_file(file_events)
    df_events = pd.read_json(StringIO(json_events))
    # Ensure coordinate columns are numeric floats to avoid dtype warnings and
    # incompatible-dtype assignments in downstream code
    for _col in ['start_x','start_y','end_x','end_y']:
        if _col in df_events.columns:
            df_events[_col] = pd.to_numeric(df_events[_col], errors='coerce').astype(float)
    df_events_matches = df_events.groupby('matchId', as_index=False)
    for match_id, df_events_match in df_events_matches:
        df_events_match.to_hdf('wyscout.h5', key=f'events/match_{match_id}', mode='a')
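
The per-match split relies on the fact that iterating over a pandas `groupby` object yields one `(key, DataFrame)` pair per group. The following minimal sketch shows this on a toy events table; the rows are made up, and the sketch only collects the HDF5 keys that would be written instead of writing any file.

```python
import pandas as pd

# Toy events table with two matches (made-up rows)
df_events = pd.DataFrame({
    'matchId': [2500089, 2500089, 2500090],
    'eventName': ['Pass', 'Shot', 'Pass'],
})

# Each iteration yields the match identifier and the events of that match
keys = {}
for match_id, df_events_match in df_events.groupby('matchId'):
    keys[f'events/match_{match_id}'] = len(df_events_match)

print(keys)
```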

## Convert the Wyscout data to the SPADL representation

The following cell calls the `convert_to_spadl` function from the `socceraction` library to convert the `wyscout.h5` HDF5 file into the `spadl.h5` HDF5 file.

In [19]:
convert_to_spadl('wyscout.h5', 'spadl.h5')

...Inserting actiontypes
...Inserting bodyparts
...Inserting results
...Converting games
...Converting players
...Converting teams
...Generating player_games


100%|██████████| 1941/1941 [01:29<00:00, 21.76game/s]


...Converting events to actions



# Value game states

This section generates features and labels for the game states, trains a predictive machine learning model for each label, and values the game states by applying the trained machine learning models.

1. Generate the features to describe the game states;
2. Generate the labels that capture the value of the game states;
3. Compose a dataset by selecting a set of features and the labels of the game states;
4. Train predictive machine learning models using the dataset;
5. Value the game states using the trained predictive machine learning models.

**Note:** The code in this section is based on the [2-compute-features-and-labels.ipynb](https://github.com/ML-KULeuven/socceraction/blob/master/public-notebooks/2-compute-features-and-labels.ipynb) and [3-estimate-scoring-and-conceding-probabilities.ipynb](https://github.com/ML-KULeuven/socceraction/blob/master/public-notebooks/3-estimate-scoring-and-conceding-probabilities.ipynb) notebooks in the `socceraction` repository.

In [22]:
df_games = pd.read_hdf('spadl.h5', key='games')
df_actiontypes = pd.read_hdf('spadl.h5', key='actiontypes')
df_bodyparts = pd.read_hdf('spadl.h5', key='bodyparts')
df_results = pd.read_hdf('spadl.h5', key='results')

In [23]:
nb_prev_actions = 3

## Generate game state features

The following cell lists a number of *feature generators* from the `features` module in the `socceraction` library. Each function expects either a `DataFrame` object containing actions (i.e., individual actions) or a list of `DataFrame` objects containing consecutive actions (i.e., game states), and returns the corresponding *feature* for the individual action or game state.

In [24]:
functions_features = [
    features.actiontype_onehot,
    features.bodypart_onehot,
    features.result_onehot,
    features.goalscore,
    features.startlocation,
    features.endlocation,
    features.movement,
    features.space_delta,
    features.startpolar,
    features.endpolar,
    features.team,
    features.time_delta
]

The following cell generates game states from consecutive actions in each game and computes the features for each game state.

1. Obtain the actions for the game (i.e., `df_actions`) by looping through the games;
2. Construct game states of a given length from the actions (i.e., `dfs_gamestates`);
3. Compute the features for the constructed game states (i.e., `df_features`) by looping through the list of *feature generators*.

In [25]:
for _, game in tqdm(df_games.iterrows(), total=len(df_games)):
    game_id = game['game_id']
    df_actions = pd.read_hdf('spadl.h5', key=f'actions/game_{game_id}')
    df_actions = (df_actions
        .merge(df_actiontypes, how='left')
        .merge(df_results, how='left')
        .merge(df_bodyparts, how='left')
        .reset_index(drop=True)
    )
    
    dfs_gamestates = features.gamestates(df_actions, nb_prev_actions=nb_prev_actions)
    dfs_gamestates = features.play_left_to_right(dfs_gamestates, game['home_team_id'])
    
    df_features = pd.concat([function(dfs_gamestates) for function in functions_features], axis=1)
    df_features.to_hdf('features.h5', key=f'game_{game_id}')

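
A game state combines an action with the actions that preceded it. The following minimal sketch illustrates the idea with a hypothetical helper named `toy_gamestates`; it is a simplified stand-in for `features.gamestates`, not the library's implementation. It assumes that the first actions of a game, which have no predecessors, reuse the first action. This assumption agrees with the feature table shown later in this tutorial, where the `_a0`, `_a1` and `_a2` columns of the first row are identical.

```python
import numpy as np
import pandas as pd


def toy_gamestates(df_actions, nb_prev_actions=3):
    """Return one DataFrame per offset: the action itself, the previous one, and so on."""
    positions = np.arange(len(df_actions))
    states = []
    for i in range(nb_prev_actions):
        # Row k of state i holds action k - i; early rows fall back to the first action
        idx = np.maximum(positions - i, 0)
        states.append(df_actions.iloc[idx].reset_index(drop=True))
    return states


# Made-up sequence of four actions
df_actions = pd.DataFrame({'type_name': ['pass', 'pass', 'cross', 'shot']})
a0, a1, a2 = toy_gamestates(df_actions)

print(a0['type_name'].tolist())
print(a1['type_name'].tolist())
print(a2['type_name'].tolist())
```

Each feature generator can then compute its feature on every offset, which is why the resulting column names carry an `_a0`, `_a1` or `_a2` suffix.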

## Generate game state labels

The following cell lists a number of *label generators* from the `labels` module in the `socceraction` library. Each function expects a `DataFrame` object containing the actions of a game and returns the corresponding *label* for each action.

In [26]:
functions_labels = [
    labels.scores,
    labels.concedes
]

The following cell computes the labels for each action:

1. Obtain the actions for the game (i.e., `df_actions`) by looping through the games;
2. Compute the labels for the actions (i.e., `df_labels`) by looping through the list of *label generators*.

In [30]:
for _, game in tqdm(df_games.iterrows(), total=len(df_games)):
    game_id = game['game_id']
    df_actions = pd.read_hdf('spadl.h5', key=f'actions/game_{game_id}')
    df_actions = (df_actions
        .merge(df_actiontypes, how='left')
        .merge(df_results, how='left')
        .merge(df_bodyparts, how='left')
        .reset_index(drop=True)
    )
    
    df_labels = pd.concat([function(df_actions) for function in functions_labels], axis=1)
    df_labels.to_hdf('labels.h5', key=f'game_{game_id}')

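
The `scores` label answers the question *Goal scored within next ten actions?* The following minimal sketch shows one way to compute such a label with pandas. The helper `toy_scores` and the `is_goal` column are hypothetical, and the sketch ignores own goals, so it should be read as an illustration of the idea rather than as the code behind `labels.scores`.

```python
import pandas as pd


def toy_scores(df_actions, nr_actions=10):
    """Label an action True if its team scores with this action or one of the next ones."""
    res = df_actions['is_goal'].copy()
    for i in range(1, nr_actions):
        # Look i actions ahead; actions beyond the end of the game count as "no goal"
        future_goal = df_actions['is_goal'].shift(-i, fill_value=False)
        future_team = df_actions['team_id'].shift(-i)
        res = res | (future_goal & (future_team == df_actions['team_id']))
    return res


# Made-up sequence: team 1 scores with the fourth action
df_actions = pd.DataFrame({
    'team_id': [1, 1, 2, 1],
    'is_goal': [False, False, False, True],
})
labels_toy = toy_scores(df_actions, nr_actions=3)

print(labels_toy.tolist())
```

With a window of three actions, the first action is too far from the goal to be labeled, the second action is labeled because its team scores two actions later, and the action of the opposing team is not labeled.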

## Generate dataset

The following cell generates a list of names for the features to be included in the dataset.

In [31]:
columns_features = features.feature_column_names(functions_features, nb_prev_actions=nb_prev_actions)

The following cell obtains the relevant features for each game and stores them in the `df_features` `DataFrame` object.

In [32]:
dfs_features = []
for _, game in tqdm(df_games.iterrows(), total=len(df_games)):
    game_id = game['game_id']
    df_features = pd.read_hdf('features.h5', key=f'game_{game_id}')
    dfs_features.append(df_features[columns_features])
df_features = pd.concat(dfs_features).reset_index(drop=True)


In [33]:
df_features.head(10)

Unnamed: 0,type_pass_a0,type_cross_a0,type_throw_in_a0,type_freekick_crossed_a0,type_freekick_short_a0,type_corner_crossed_a0,type_corner_short_a0,type_take_on_a0,type_foul_a0,type_tackle_a0,...,end_dist_to_goal_a0,end_angle_to_goal_a0,end_dist_to_goal_a1,end_angle_to_goal_a1,end_dist_to_goal_a2,end_angle_to_goal_a2,team_1,team_2,time_delta_1,time_delta_2
0,True,False,False,False,False,False,False,False,False,False,...,63.091679,0.053916,63.091679,0.053916,63.091679,0.053916,True,True,0.0,0.0
1,True,False,False,False,False,False,False,False,False,False,...,68.328929,0.355773,63.091679,0.053916,63.091679,0.053916,True,True,1.997756,1.997756
2,True,False,False,False,False,False,False,False,False,False,...,73.715416,0.185556,68.328929,0.355773,63.091679,0.053916,True,True,0.771744,2.7695
3,True,False,False,False,False,False,False,False,False,False,...,38.707772,0.396818,73.715416,0.185556,68.328929,0.355773,True,True,2.174464,2.946208
4,True,False,False,False,False,False,False,False,False,False,...,37.425928,0.620467,38.707772,0.396818,73.715416,0.185556,True,True,3.907382,6.081846
5,True,False,False,False,False,False,False,False,False,False,...,30.729141,1.258205,37.425928,0.620467,38.707772,0.396818,True,True,3.75873,7.666112
6,False,True,False,False,False,False,False,False,False,False,...,8.4,0.0,30.729141,1.258205,37.425928,0.620467,True,True,2.210584,5.969314
7,False,False,False,False,False,False,False,False,False,False,...,106.094842,0.24603,96.6,0.0,99.923872,0.296969,False,False,1.756122,3.966706
8,False,True,False,False,False,False,False,False,False,False,...,25.925192,1.489705,25.925192,1.489705,8.4,0.0,False,True,2.095783,3.851905
9,False,False,False,False,False,False,False,False,False,False,...,89.647658,0.388999,106.094842,0.24603,106.094842,0.24603,False,True,3.034782,5.130565


The following cell lists the names of the labels to be included in the dataset.

In [34]:
columns_labels = [
    'scores',
    'concedes'
]

The following cell obtains the relevant labels for each game and stores them in the `df_labels` `DataFrame` object.

In [35]:
dfs_labels = []
for _, game in tqdm(df_games.iterrows(), total=len(df_games)):
    game_id = game['game_id']
    df_labels = pd.read_hdf('labels.h5', key=f'game_{game_id}')
    dfs_labels.append(df_labels[columns_labels])
df_labels = pd.concat(dfs_labels).reset_index(drop=True)


In [40]:
df_labels.head(10)
int(df_labels['concedes'].sum())  # number of game states labeled as conceding


12425

## Train classifiers

The following cell trains an XGBoost classifier for each label using the computed features. For each label:
1. Construct an XGBoost classifier with default hyperparameters;
2. Train the classifier using the computed features and the label;
3. Store the trained classifier in the `models` `dict`.

In [41]:
%%time
models = {}
for column_labels in columns_labels:
    # `use_label_encoder` is not passed: the XGBoost release used here ignores it
    # and only emits a "Parameters ... are not used" warning
    model = XGBClassifier(eval_metric='logloss')
    model.fit(df_features, df_labels[column_labels])
    models[column_labels] = model

CPU times: total: 10min 17s
Wall time: 12.1 s
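
This tutorial fits the classifiers on all game states and later estimates probabilities for the same game states, so the imported `brier_score_loss` and `roc_auc_score` functions are never used. If you want an estimate of how well the models generalize, one option is to hold out entire games. The following minimal sketch shows the idea on synthetic data. It uses scikit-learn's `LogisticRegression` as a lightweight stand-in for `XGBClassifier`, and the features, labels and game identifiers are randomly generated, so the printed scores say nothing about the actual VAEP models.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss, roc_auc_score
from sklearn.model_selection import GroupShuffleSplit

# Synthetic stand-ins for df_features, a label column and the game of each game state
rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 5))
game_ids = rng.integers(0, 40, size=n)
y = (X[:, 0] + rng.normal(size=n) > 0.8).astype(int)

# Split by game so that no game contributes to both the training and the test set
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=game_ids))

model = LogisticRegression().fit(X[train_idx], y[train_idx])
p_test = model.predict_proba(X[test_idx])[:, 1]

brier = brier_score_loss(y[test_idx], p_test)
auc = roc_auc_score(y[test_idx], p_test)
print(f'Brier score: {brier:.4f}, ROC AUC: {auc:.4f}')
```

Splitting by game rather than by row matters because consecutive game states share most of their features, so a row-wise split would leak information from the training set into the test set.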


In [46]:
print("shape:", df_features.shape)

shape: (2465156, 142)


## Estimate probabilities

The following cell estimates the probability of each label for the game states using the trained XGBoost classifiers. For each label:
1. Retrieve the model for the label;
2. Estimate the probabilities of the labels being `False` and `True` given the computed features;
3. Keep the probabilities for the `True` label;
4. Store the probabilities as a `Series` object in the `dfs_predictions` `dict`.

In [47]:
dfs_predictions = {}
for column_labels in columns_labels:
    model = models[column_labels]
    probabilities = model.predict_proba(df_features)
    predictions = probabilities[:, 1]
    dfs_predictions[column_labels] = pd.Series(predictions)
df_predictions = pd.concat(dfs_predictions, axis=1)

In [48]:
df_predictions.head(10)

Unnamed: 0,scores,concedes
0,0.003204,0.000747
1,0.003529,0.001474
2,0.003163,0.007955
3,0.015963,0.002828
4,0.01746,0.002133
5,0.024792,0.00142
6,0.017933,0.002242
7,0.001266,0.006793
8,0.020624,0.001258
9,0.000297,0.011874


The following cell obtains the `game_id` for each action in order to store the predictions per game.

In [49]:
dfs_game_ids = []
for _, game in tqdm(df_games.iterrows(), total=len(df_games)):
    game_id = game['game_id']
    df_actions = pd.read_hdf('spadl.h5', key=f'actions/game_{game_id}')
    dfs_game_ids.append(df_actions['game_id'])
df_game_ids = pd.concat(dfs_game_ids, axis=0).astype('int').reset_index(drop=True)


The following cell concatenates the `DataFrame` objects with predictions and `game_id`s for each action into a single `DataFrame` object.

In [50]:
df_predictions = pd.concat([df_predictions, df_game_ids], axis=1)
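
Concatenating along `axis=1` aligns the objects on their index, which is why the `game_id` values are re-indexed with `reset_index(drop=True)` before this step. The following minimal sketch, using made-up values, shows what happens when the indexes do not match.

```python
import pandas as pd

pred = pd.DataFrame({'scores': [0.1, 0.2, 0.3]})

# Hypothetical game identifiers whose index does not start at zero
ids = pd.Series([10, 10, 11], index=[5, 6, 7], name='game_id')

# Without resetting the index, pandas takes the union of both indexes
misaligned = pd.concat([pred, ids], axis=1)

# After resetting the index, the rows line up one to one
aligned = pd.concat([pred, ids.reset_index(drop=True)], axis=1)

print(misaligned.shape, aligned.shape)
```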

In [51]:
df_predictions.head(10)

Unnamed: 0,scores,concedes,game_id
0,0.003204,0.000747,2500089
1,0.003529,0.001474,2500089
2,0.003163,0.007955,2500089
3,0.015963,0.002828,2500089
4,0.01746,0.002133,2500089
5,0.024792,0.00142,2500089
6,0.017933,0.002242,2500089
7,0.001266,0.006793,2500089
8,0.020624,0.001258,2500089
9,0.000297,0.011874,2500089


The following cell groups the predictions per game based on their `game_id`.

In [55]:
df_predictions_per_game = df_predictions.groupby('game_id')
df_predictions_per_game.head(10)

Unnamed: 0,scores,concedes,game_id
0,0.003204,0.000747,2500089
1,0.003529,0.001474,2500089
2,0.003163,0.007955,2500089
3,0.015963,0.002828,2500089
4,0.017460,0.002133,2500089
...,...,...,...
2463941,0.021156,0.002605,2057954
2463942,0.022413,0.001311,2057954
2463943,0.007173,0.002747,2057954
2463944,0.017114,0.002835,2057954


The following cell stores the predictions in the `predictions.h5` HDF5 file per game.

In [56]:
for game_id, df_predictions_game in tqdm(df_predictions_per_game):
    # Use a separate loop variable so that `df_predictions` is not overwritten
    df_predictions_game = df_predictions_game.reset_index(drop=True)
    df_predictions_game[columns_labels].to_hdf('predictions.h5', key=f'game_{game_id}')


# Value on-the-ball actions

**Note:** The code in this section is based on the [4-compute-vaep-values.ipynb](https://github.com/ML-KULeuven/socceraction/blob/master/public-notebooks/4-compute-vaep-values.ipynb) notebook in the `socceraction` repository.
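
The VAEP value of an action is the change it causes in the estimated probabilities: the offensive value is the increase in the probability of scoring, the defensive value is the decrease in the probability of conceding, and the VAEP value is their sum. The following minimal sketch implements this idea with a hypothetical helper named `toy_vaep`. It assumes that, when possession changes, the previous state is viewed from the acting team's perspective by swapping the two probabilities. The `value` function from the library appears to handle additional special cases (for example around goals), which this sketch leaves out.

```python
import pandas as pd


def toy_vaep(df_actions, p_scores, p_concedes):
    """Simplified VAEP: change in scoring and conceding probability per action."""
    same_team = df_actions['team_id'] == df_actions['team_id'].shift(1)

    # Probabilities of the previous game state from the acting team's point of view
    prev_scores = p_scores.shift(1).where(same_team, p_concedes.shift(1))
    prev_concedes = p_concedes.shift(1).where(same_team, p_scores.shift(1))

    # The first action has no predecessor: compare it with itself, giving a value of 0
    prev_scores = prev_scores.fillna(p_scores)
    prev_concedes = prev_concedes.fillna(p_concedes)

    df_out = pd.DataFrame({
        'offensive_value': p_scores - prev_scores,
        'defensive_value': -(p_concedes - prev_concedes),
    })
    df_out['vaep_value'] = df_out['offensive_value'] + df_out['defensive_value']
    return df_out


# Made-up probabilities for three actions; the third action is by the other team
df_actions = pd.DataFrame({'team_id': [1, 1, 2]})
p_scores = pd.Series([0.01, 0.03, 0.02])
p_concedes = pd.Series([0.02, 0.01, 0.05])
df_toy_values = toy_vaep(df_actions, p_scores, p_concedes)

print(df_toy_values.round(3))
```

In this toy example the second action raises the scoring probability by 0.02 and lowers the conceding probability by 0.01, giving a value of 0.03, whereas the third action is worth -0.01 to the team that performed it.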

In [57]:
df_players = pd.read_hdf('spadl.h5', key='players')
df_teams = pd.read_hdf('spadl.h5', key='teams')

In [58]:
dfs_values = []
for _, game in tqdm(df_games.iterrows(), total=len(df_games)):
    game_id = game['game_id']
    df_actions = pd.read_hdf('spadl.h5', key=f'actions/game_{game_id}')
    df_actions = (df_actions
        .merge(df_actiontypes, how='left')
        .merge(df_results, how='left')
        .merge(df_bodyparts, how='left')
        .merge(df_players, how='left')
        .merge(df_teams, how='left')
        .reset_index(drop=True)
    )
    
    df_predictions = pd.read_hdf('predictions.h5', key=f'game_{game_id}')
    df_values = value(df_actions, df_predictions['scores'], df_predictions['concedes'])
    
    df_all = pd.concat([df_actions, df_predictions, df_values], axis=1)
    dfs_values.append(df_all)


In [59]:
df_values = (pd.concat(dfs_values)
    .sort_values(['game_id', 'period_id', 'time_seconds'])
    .reset_index(drop=True)
)

In [60]:
df_values[
    ['short_name', 'scores', 'concedes', 'offensive_value', 'defensive_value', 'vaep_value']
].head(10)

Unnamed: 0,short_name,scores,concedes,offensive_value,defensive_value,vaep_value
0,O. Giroud,0.00289,0.001279,0.0,-0.0,0.0
1,A. Griezmann,0.003458,0.004951,0.000567,-0.003672,-0.003104
2,N. Kanté,0.004629,0.002421,0.001171,0.00253,0.003701
3,L. Koscielny,0.003637,0.008212,-0.000991,-0.005791,-0.006782
4,P. Evra,0.007062,0.001403,0.007062,-0.001403,0.005659
5,C. Săpunaru,0.000983,0.012464,-0.00042,-0.005402,-0.005823
6,C. Săpunaru,0.000599,0.01169,-0.000385,0.000775,0.00039
7,B. Matuidi,0.040535,0.003046,0.028845,-0.002447,0.026398
8,C. Tătărușanu,0.002531,0.111815,-0.000515,-0.071281,-0.071796
9,C. Tătărușanu,0.006613,0.008523,0.004082,0.103292,0.107375


# Rate players

**Note:** The code in this section is based on the [5-top-players.ipynb](https://github.com/ML-KULeuven/socceraction/blob/master/public-notebooks/5-top-players.ipynb) notebook in the `socceraction` repository.

## Rate according to total VAEP value

In [61]:
df_ranking = (df_values[['player_id', 'team_name', 'short_name', 'vaep_value']]
    .groupby(['player_id', 'team_name', 'short_name'])
    .agg(vaep_count=('vaep_value', 'count'), vaep_sum=('vaep_value', 'sum'))
    .sort_values('vaep_sum', ascending=False)
    .reset_index()
)
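
The ranking uses pandas' named aggregation, where each keyword argument of `agg` defines an output column as a `(column, function)` pair. The following minimal sketch applies the same pattern to a made-up table with two players.

```python
import pandas as pd

# Made-up action values for two hypothetical players
df_toy = pd.DataFrame({
    'player_id': [1, 1, 2, 2, 2],
    'short_name': ['A', 'A', 'B', 'B', 'B'],
    'vaep_value': [0.10, -0.02, 0.03, 0.01, 0.02],
})

df_toy_ranking = (df_toy
    .groupby(['player_id', 'short_name'])
    .agg(vaep_count=('vaep_value', 'count'), vaep_sum=('vaep_value', 'sum'))
    .sort_values('vaep_sum', ascending=False)
    .reset_index()
)

print(df_toy_ranking)
```

Player A ranks first with a total of 0.08 from two actions, ahead of player B with 0.06 from three actions, which illustrates that the total rewards both the quality and the number of actions.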

In [62]:
df_ranking.head(10)

Unnamed: 0,player_id,team_name,short_name,vaep_count,vaep_sum
0,3359.0,FC Barcelona,L. Messi,2753,36.439465
1,120353.0,Liverpool FC,Mohamed Salah,1568,25.749468
2,40810.0,Paris Saint-Germain FC,Neymar,1981,20.591738
3,3840.0,Real Club Celta de Vigo,Iago Aspas,1771,19.552942
4,26150.0,Leicester City FC,R. Mahrez,2022,18.736223
5,38021.0,Manchester City FC,K. De Bruyne,3528,18.731499
6,28115.0,Olympique Lyonnais,N. Fekir,1754,17.967518
7,8717.0,Tottenham Hotspur FC,H. Kane,1153,17.892778
8,3682.0,Club Atlético de Madrid,A. Griezmann,1394,17.73525
9,118.0,Olympique Lyonnais,M. Depay,1837,17.594404


## Rate according to total VAEP value per 90 minutes

In [63]:
df_player_games = pd.read_hdf('spadl.h5', 'player_games')
df_player_games = df_player_games[df_player_games['game_id'].isin(df_games['game_id'])]

In [64]:
df_minutes_played = (df_player_games[['player_id', 'minutes_played']]
    .groupby('player_id')
    .sum()
    .reset_index()
)

In [65]:
df_minutes_played.head(10)

Unnamed: 0,player_id,minutes_played
0,0,21.965164
1,12,186.155078
2,33,92.903192
3,36,2267.922277
4,38,382.741398
5,45,313.762114
6,48,4281.479539
7,54,3724.307824
8,56,267.266792
9,66,435.00425


In [66]:
df_ranking_p90 = df_ranking.merge(df_minutes_played)
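# Keep players with more than 360 minutes (four full matches); copy to avoid a SettingWithCopyWarning
df_ranking_p90 = df_ranking_p90[df_ranking_p90['minutes_played'] > 360].copy()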
df_ranking_p90['vaep_rating'] = df_ranking_p90['vaep_sum'] * 90 / df_ranking_p90['minutes_played']
df_ranking_p90 = df_ranking_p90.sort_values('vaep_rating', ascending=False)
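
The per-90 rating divides the total VAEP value by the number of 90-minute units played, that is, `vaep_sum * 90 / minutes_played`. The following minimal sketch works through the arithmetic on made-up totals for three hypothetical players, using the same minimum of 360 minutes.

```python
import pandas as pd

# Made-up totals for three hypothetical players
df_toy = pd.DataFrame({
    'short_name': ['A', 'B', 'C'],
    'vaep_sum': [9.0, 2.0, 1.0],
    'minutes_played': [2700.0, 450.0, 90.0],
})

min_minutes = 360  # players at or below this threshold are dropped
df_toy_p90 = df_toy[df_toy['minutes_played'] > min_minutes].copy()
df_toy_p90['vaep_rating'] = df_toy_p90['vaep_sum'] * 90 / df_toy_p90['minutes_played']
df_toy_p90 = df_toy_p90.sort_values('vaep_rating', ascending=False)

print(df_toy_p90)
```

Player B (2.0 in 450 minutes, rating 0.4) now ranks above player A (9.0 in 2700 minutes, rating 0.3), while player C is excluded because a single match is too small a sample. Note that in this tutorial the minutes are summed per `player_id` whereas `df_ranking` is grouped per player and team, so a player who appears for more than one team in the data has each of his rows divided by his total minutes.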

In [67]:
df_ranking_p90.head(10)

Unnamed: 0,player_id,team_name,short_name,vaep_count,vaep_sum,minutes_played,vaep_rating
0,3359.0,FC Barcelona,L. Messi,2753,36.439465,3486.545778,0.940631
2,40810.0,Paris Saint-Germain FC,Neymar,1981,20.591738,2334.354641,0.793905
1,120353.0,Liverpool FC,Mohamed Salah,1568,25.749468,3186.109293,0.727361
46,134397.0,TSG 1899 Hoffenheim,S. Gnabry,836,11.511055,1534.906306,0.674957
389,406682.0,Rasen Ballsport Leipzig,A. Lookman,345,5.590625,771.08291,0.652532
971,326523.0,Real Madrid Club de Fútbol,Dani Ceballos,446,3.014174,420.702252,0.644816
10,89186.0,Juventus FC,P. Dybala,1782,17.193958,2477.563907,0.624588
483,25601.0,Benevento Calcio,C. Diabaté,177,4.920228,713.032583,0.621038
6,28115.0,Olympique Lyonnais,N. Fekir,1754,17.967518,2671.557291,0.605294
9,118.0,Olympique Lyonnais,M. Depay,1837,17.594404,2634.701474,0.601015


In [68]:
df_ranking_p90.to_csv('ranking.csv', index=False)