# Football matches result prediction

Let's try to predict the result of *Serie A* matches (i.e. home win, away win or draw) with an RNN.

## Introduction

- The dataset was created by scraping *Serie A* match data from the 2005-06 season through the 2020-21 season
- Cup matches (*Champions League*, *Europa League*, *Coppa Italia*) played over the course of each season were not taken into account

In [1399]:
import re
from collections import defaultdict

import pandas as pd
import torch
import torch.nn as nn
from sklearn.preprocessing import LabelBinarizer
from torch import optim
from torch.utils.data import Dataset

from MatchResult import MatchResult

In [1400]:
match_cols = ['season', 'round'] + \
             ['date', 'time', 'referee', 'home_team', 'away_team', 'home_team_score', 'away_team_score'] + \
             ['home_team_coach'] + \
             ['home_player_' + str(i) for i in range(1, 12)] + \
             ['home_substitute_' + str(i) for i in range(1, 8)] + \
             ['away_team_coach'] + \
             ['away_player_' + str(i) for i in range(1, 12)] + \
             ['away_substitute_' + str(i) for i in range(1, 8)]

In [1401]:
raw_data = pd.read_csv('raw.csv')
raw_data.head()

Unnamed: 0,season,round,date,time,referee,home_team,away_team,home_team_score,away_team_score,home_team_coach,...,away_player_9,away_player_10,away_player_11,away_substitute_1,away_substitute_2,away_substitute_3,away_substitute_4,away_substitute_5,away_substitute_6,away_substitute_7
0,2005-06,1,28/08/2005,15:00,MASSIMO DE,ASCOLI,MILAN,1,1,Massimo Silva,...,Kaka,Andriy Shevchenko,Alberto Gilardino,Marek Jankulovski,Clarence Seedorf,Zeljko Kalac,Gennaro Gattuso,Manuel Rui Costa,Johann Vogel,Dario Simic
1,2005-06,1,27/08/2005,20:30,GIANLUCA PAPARESTA,FIORENTINA,SAMPDORIA,2,1,Cesare Prandelli,...,Lamberto Zauli,Francesco Flachi,Emiliano Bonazzoli,Marco Pisano,Vitaliy Kutuzov,Marco Borriello,Luca Castellazzi,Marco Zamboni,Simone Pavan,Gionata Mingozzi
2,2005-06,1,28/08/2005,15:00,TIZIANO PIERI,PARMA,PALERMO,1,1,Mario Beretta,...,Massimo Bonanni,Andrea Caracciolo,Stephen Makinwa,Nicola Santoni,Franco Brienza,Massimo Mutarelli,Giuseppe Biava,Michele Ferri,Mariano Gonzalez,Simone Pepe
3,2005-06,1,28/08/2005,15:00,PAOLO TAGLIAVENTO,INTER,TREVISO,3,0,Roberto Mancini,...,Reginaldo,Luigi Beghetto,Pinga,Roberto Chiappara,Dino Fava,Jehad Muntasser,Adriano Zancope,Francesco Parravicini,Anderson,Alberto Giuliatto
4,2005-06,1,27/08/2005,18:00,GIANLUCA ROCCHI,LIVORNO,LECCE,2,1,Roberto Donadoni,...,Alex Pinardi,Aleksei Eremenko,Graziano Pelle,Alfonso Camorani,Jaime Valdes,Giuseppe Cozzolino,Francesco Benussi,Marco Pecorari,Giuseppe Abruzzese,Davide Giorgino


## Data visualization

Let's inspect our data a little bit more

In [1402]:
# todo

## Dataset pre-processing
Now let's clean our raw data and add some additional features.

In [1403]:
df = pd.DataFrame(raw_data)
df = df[:200]

In [1404]:
# convert date str to datetime; the raw dates are day-first (e.g. 28/08/2005), so state the format explicitly
df['date'] = pd.to_datetime(df['date'], format='%d/%m/%Y')
# sort by date column
df = df.sort_values(by='date')
df = df.reset_index(drop=True)
# round values to int
df['round'] = df['round'].astype(int)

In [1405]:
# utility methods
def get_team_and_historical_index_from_match_team_id(match_team_id: str) -> tuple[str, str]:
    # raw strings keep the regex escapes from being read as string escapes
    match_team_name = re.findall(r"\s+", match_team_id)[0]
    match_team_index = re.findall(r"\d+", match_team_id)[0]
    return match_team_name, match_team_index


def get_match_by_team_and_round(df: pd.DataFrame, team: str, round: int) -> pd.DataFrame:
    # element-wise comparisons; each one is parenthesized because & and | bind tighter than ==
    played_by_team = (df['home_team'] == team) | (df['away_team'] == team)
    return df[played_by_team & (df['round'] == round)]


def get_last_n_matches_played_by_team_before_round(df: pd.DataFrame, team: str, round: int, n: int) -> pd.DataFrame:
    # collect the matches of the n previous rounds, most recent first
    matches = [get_match_by_team_and_round(df, team, round - i) for i in range(1, n + 1) if round - i > 0]
    # stack the matches as rows; return an empty frame when there is no history yet
    return pd.concat(matches, axis=0) if matches else pd.DataFrame(columns=df.columns)

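The lookups above rely on pandas boolean masks. `&` and `|` bind more tightly than `==`, so each comparison needs its own parentheses, and `Series.equals` returns a single bool for the whole Series rather than an element-wise mask. A minimal sketch on a made-up three-round table (the fixtures and the helper `last_n_matches` are illustrative only, and the sketch ignores the `season` column):

```python
import pandas as pd

# Toy fixture list; the pairings are invented for illustration only.
toy = pd.DataFrame({
    'round': [1, 1, 2, 2, 3, 3],
    'home_team': ['MILAN', 'INTER', 'ROMA', 'LECCE', 'MILAN', 'ROMA'],
    'away_team': ['ROMA', 'LECCE', 'INTER', 'MILAN', 'INTER', 'LECCE'],
})


def last_n_matches(df: pd.DataFrame, team: str, before_round: int, n: int) -> pd.DataFrame:
    """Rows played by `team` in the n rounds immediately before `before_round`."""
    played = (df['home_team'] == team) | (df['away_team'] == team)
    in_window = (df['round'] < before_round) & (df['round'] >= before_round - n)
    # most recent round first
    return df[played & in_window].sort_values('round', ascending=False)


history = last_n_matches(toy, 'MILAN', before_round=3, n=5)
print(history)
```

For round 1 the same call returns an empty frame, which is the case the training code has to handle for the first matches of a season.
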
### Additional features

#### Result column
Our model will try to predict match results, i.e. **home win**, **away win** or **draw**, so we need a result column to be used as our target.

In [1406]:
def get_match_result_from_score(home_team_score: int, away_team_score: int) -> MatchResult:
    if home_team_score == away_team_score:
        return MatchResult.draw
    if home_team_score > away_team_score:
        return MatchResult.home
    return MatchResult.away


def add_target_column(df: pd.DataFrame) -> pd.DataFrame:
    results = {'result': []}
    for index, row in df.iterrows():
        results['result'] += [get_match_result_from_score(row['home_team_score'], row['away_team_score']).name]
    df.insert(loc=df.columns.get_loc('home_team_score'), column='result', value=results['result'])
    return df

In [1407]:
# add target column
add_target_column(df)

Unnamed: 0,season,round,date,time,referee,home_team,away_team,result,home_team_score,away_team_score,...,away_player_9,away_player_10,away_player_11,away_substitute_1,away_substitute_2,away_substitute_3,away_substitute_4,away_substitute_5,away_substitute_6,away_substitute_7
0,2005-06,1,2005-08-27,20:30,GIANLUCA PAPARESTA,FIORENTINA,SAMPDORIA,home,2,1,...,Lamberto Zauli,Francesco Flachi,Emiliano Bonazzoli,Marco Pisano,Vitaliy Kutuzov,Marco Borriello,Luca Castellazzi,Marco Zamboni,Simone Pavan,Gionata Mingozzi
1,2005-06,1,2005-08-27,18:00,GIANLUCA ROCCHI,LIVORNO,LECCE,home,2,1,...,Alex Pinardi,Aleksei Eremenko,Graziano Pelle,Alfonso Camorani,Jaime Valdes,Giuseppe Cozzolino,Francesco Benussi,Marco Pecorari,Giuseppe Abruzzese,Davide Giorgino
2,2005-06,1,2005-08-28,15:00,MASSIMO DE,ASCOLI,MILAN,draw,1,1,...,Kaka,Andriy Shevchenko,Alberto Gilardino,Marek Jankulovski,Clarence Seedorf,Zeljko Kalac,Gennaro Gattuso,Manuel Rui Costa,Johann Vogel,Dario Simic
3,2005-06,1,2005-08-28,15:00,TIZIANO PIERI,PARMA,PALERMO,draw,1,1,...,Massimo Bonanni,Andrea Caracciolo,Stephen Makinwa,Nicola Santoni,Franco Brienza,Massimo Mutarelli,Giuseppe Biava,Michele Ferri,Mariano Gonzalez,Simone Pepe
4,2005-06,1,2005-08-28,15:00,PAOLO TAGLIAVENTO,INTER,TREVISO,home,3,0,...,Reginaldo,Luigi Beghetto,Pinga,Roberto Chiappara,Dino Fava,Jehad Muntasser,Adriano Zancope,Francesco Parravicini,Anderson,Alberto Giuliatto
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
195,2005-06,20,2006-01-18,20:30,PAOLO TAGLIAVENTO,ROMA,REGGINA,home,3,1,...,Francesco Modesto,Francesco Cozza,Luca Vigiani,Maurizio Lauro,Nicola Amoruso,Simone Missiroli,Ivan Pelizzoli,Davide Biondini,Filippo Carobbio,Simone Cavalli
196,2005-06,20,2006-01-18,20:30,GIANLUCA ROCCHI,MILAN,ASCOLI,home,1,0,...,Cristiano Del Grosso,Sasa Bjelanovic,Fabio Quagliarella,Massimo Paci,Michele Fini,Pasquale Foggia,Carlo Zotti,Riccardo Corallo,Davide Oresti,Marco Ferrante
197,2005-06,20,2006-01-18,20:30,PASQUALE RODOMONTI,LECCE,LIVORNO,draw,0,0,...,Francesco Coco,Ibrahima Bakayoko,Cristiano Lucarelli,Marc Pfertzel,Cesar Prates,Raffaele Palladino,Paolo Acerbis,Stefano Fanucci,Giuseppe Colucci,Paulinho
198,2005-06,20,2006-01-18,20:30,ANDREA ROMEO,CAGLIARI,ROBUR SIENA,home,1,0,...,Cristian Molinaro,Erjon Bogdani,Enrico Chiesa,Rej Volpato,Paolo Negro,Nicola Legrottaglie,Marco Fortin,Francesco Colonnese,Roberto Nanni,

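`MatchResult` is imported from a local module that is not shown in this notebook. The code above only needs members named `home`, `draw` and `away` that expose a `.name` attribute, so a minimal stand-in could look like the sketch below (the `Enum` base class and the member values are assumptions):

```python
from enum import Enum


class MatchResult(Enum):
    """Hypothetical stand-in for the project's MatchResult module."""
    home = 0
    draw = 1
    away = 2


def get_match_result_from_score(home_team_score: int, away_team_score: int) -> MatchResult:
    if home_team_score == away_team_score:
        return MatchResult.draw
    if home_team_score > away_team_score:
        return MatchResult.home
    return MatchResult.away


# Scores taken from the first rows of the table above
print(get_match_result_from_score(2, 1).name)  # FIORENTINA 2-1 SAMPDORIA
print(get_match_result_from_score(1, 1).name)  # ASCOLI 1-1 MILAN
```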

### Rest days features
The number of rest days between two matches affects how well a team recovers, so it is a candidate feature.

In [1408]:
def count_days_between_dates(date1: pd.Series, date2: pd.Series) -> pd.Series:
    # element-wise difference in whole days between two datetime columns
    return (date1 - date2).dt.days

In [1409]:
# for i in range(5):
#     for home_or_away in HomeOrAway:
#         if i == 0:
#             df[f'{home_or_away.name}_team_rest_days'] = count_days_between_dates(df['date'], df[f'{home_or_away.name}_team_history_{i+1}_date'])
#         else:
#             df[f'{home_or_away.name}_team_history_{i}_rest_days'] = count_days_between_dates(df[f'{home_or_away.name}_team_history_{i}_date'], df[f'{home_or_away.name}_team_history_{i+1}_date'])

# todo: cannot count rest days for historical 5th games because we still miss the data about the 6th historical match

In [1410]:
# delete columns referring to the historical 6th matches
# df = df.loc[:, ~df.columns.str.contains('history_6')]

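One way to get rest days without precomputed `history_*` columns is to reshape the matches into one row per team appearance and take the difference between consecutive dates within each team. A minimal sketch on invented fixtures, assuming `date` is already a datetime column (the frame `toy` and the column `rest_days` are illustrative names):

```python
import pandas as pd

# Invented fixtures; only the column names follow the notebook.
toy = pd.DataFrame({
    'date': pd.to_datetime(['2005-08-27', '2005-08-28', '2005-09-10', '2005-09-11']),
    'home_team': ['MILAN', 'INTER', 'ROMA', 'LECCE'],
    'away_team': ['ROMA', 'LECCE', 'INTER', 'MILAN'],
})

# One row per (match, team): melt the home and away columns into a single 'team' column.
appearances = toy.reset_index().melt(
    id_vars=['index', 'date'], value_vars=['home_team', 'away_team'],
    var_name='side', value_name='team')
appearances = appearances.sort_values(['team', 'date'])

# Days since the same team's previous match; NaN for a team's first match.
appearances['rest_days'] = appearances.groupby('team')['date'].diff().dt.days

print(appearances[['team', 'date', 'rest_days']])
```

The `index` column keeps the link back to the original match row, so the values can be merged back as home and away rest-day features.
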
#### Datetime features
Add **year**, **month**, **day** and **hour** features derived from the **date** and **time** columns.

In [1411]:
def get_exploded_datetime_values(df: pd.DataFrame) -> dict:
    data = {'year': [], 'month': [], 'day': [], 'hour': []}
    df['time'] = pd.to_datetime(df['time'], format="%H:%M")
    data['year'] += df['date'].map(lambda val: val.year).tolist()
    data['month'] += df['date'].map(lambda val: val.month).tolist()
    data['day'] += df['date'].map(lambda val: val.day).tolist()
    data['hour'] += df['time'].map(lambda val: val.hour).tolist()
    return data


def insert_exploded_datetime_values(df, exploded):
    df.insert(loc=df.columns.get_loc('time'), column='year', value=exploded['year'])
    df.insert(loc=df.columns.get_loc('time'), column='month', value=exploded['month'])
    df.insert(loc=df.columns.get_loc('time'), column='day', value=exploded['day'])
    df.insert(loc=df.columns.get_loc('time'), column='hour', value=exploded['hour'])
    return df


def explode_datetime_values(df: pd.DataFrame) -> pd.DataFrame:
    exploded = get_exploded_datetime_values(df)
    return insert_exploded_datetime_values(df, exploded)


def get_column_names_containing_str(df: pd.DataFrame, substring: str) -> list[str]:
    return df.loc[:,df.columns.str.contains(substring)].columns.values.tolist()

In [1412]:
# explode datetime values
df = explode_datetime_values(df)
# drop date columns
date_cols = get_column_names_containing_str(df, 'date')
df.drop(date_cols, axis=1, inplace=True)
df.drop(['time'], axis=1, inplace=True)

The result of the pre-processing looks like this:

In [1413]:
df.head()

Unnamed: 0,season,round,year,month,day,hour,referee,home_team,away_team,result,...,away_player_9,away_player_10,away_player_11,away_substitute_1,away_substitute_2,away_substitute_3,away_substitute_4,away_substitute_5,away_substitute_6,away_substitute_7
0,2005-06,1,2005,8,27,20,GIANLUCA PAPARESTA,FIORENTINA,SAMPDORIA,home,...,Lamberto Zauli,Francesco Flachi,Emiliano Bonazzoli,Marco Pisano,Vitaliy Kutuzov,Marco Borriello,Luca Castellazzi,Marco Zamboni,Simone Pavan,Gionata Mingozzi
1,2005-06,1,2005,8,27,18,GIANLUCA ROCCHI,LIVORNO,LECCE,home,...,Alex Pinardi,Aleksei Eremenko,Graziano Pelle,Alfonso Camorani,Jaime Valdes,Giuseppe Cozzolino,Francesco Benussi,Marco Pecorari,Giuseppe Abruzzese,Davide Giorgino
2,2005-06,1,2005,8,28,15,MASSIMO DE,ASCOLI,MILAN,draw,...,Kaka,Andriy Shevchenko,Alberto Gilardino,Marek Jankulovski,Clarence Seedorf,Zeljko Kalac,Gennaro Gattuso,Manuel Rui Costa,Johann Vogel,Dario Simic
3,2005-06,1,2005,8,28,15,TIZIANO PIERI,PARMA,PALERMO,draw,...,Massimo Bonanni,Andrea Caracciolo,Stephen Makinwa,Nicola Santoni,Franco Brienza,Massimo Mutarelli,Giuseppe Biava,Michele Ferri,Mariano Gonzalez,Simone Pepe
4,2005-06,1,2005,8,28,15,PAOLO TAGLIAVENTO,INTER,TREVISO,home,...,Reginaldo,Luigi Beghetto,Pinga,Roberto Chiappara,Dino Fava,Jehad Muntasser,Adriano Zancope,Francesco Parravicini,Anderson,Alberto Giuliatto


In [1414]:
# index=False avoids writing the row index as an extra unnamed column
df.to_csv("processed.csv", index=False)

### Data encoding
We need to encode the data before feeding it to the network. Here we define encoders that return PyTorch tensors.

#### Seasons and Rounds

In [1448]:
class SeasonRoundEncoder(object):
    """Encode the season and round columns of the given pandas DataFrame sample"""
    def __init__(self, season_dict_map: dict):
        self.mapping = season_dict_map

    def __call__(self, sample: pd.DataFrame) -> torch.Tensor:
        season_encoding = torch.tensor([[el] for el in sample['season'].map(self.mapping).tolist()], dtype=torch.int32)
        round_encoding = torch.tensor([[el] for el in sample['round'].tolist()], dtype=torch.int32)
        return torch.cat([season_encoding, round_encoding], 1)

In [1449]:
season2index = {f'20{i + 5:02d}-{i + 6:02d}': i for i in range(16)}
season_round_encoder = SeasonRoundEncoder(season2index)

In [1450]:
# TEST seasons and rounds encoding
tensor = season_round_encoder(df.iloc[0:2])
seasons_rounds_expected_num_of_feats = 2
if tensor.shape[1] == seasons_rounds_expected_num_of_feats:
    print('SEASONS and ROUNDS encoding OK')
else:
    print(f'num of features: {tensor.shape[1]}')
    print(f'expected num of features: {seasons_rounds_expected_num_of_feats}')
    raise Exception('SEASONS and ROUNDS encoding NOT OK! :(')

SEASONS and ROUNDS encoding OK


#### Datetime values

In [1451]:
class DatetimeEncoder(object):
    """Encode the year, month, day and hour columns of the given pandas DataFrame sample"""
    def __init__(self):
        pass

    def __call__(self, sample: pd.DataFrame) -> torch.Tensor:
        year_encoding = torch.tensor([[el] for el in sample['year'].tolist()], dtype=torch.int32)
        month_encoding = torch.tensor([[el] for el in sample['month'].tolist()], dtype=torch.int32)
        day_encoding = torch.tensor([[el] for el in sample['day'].tolist()], dtype=torch.int32)
        hour_encoding = torch.tensor([[el] for el in sample['hour'].tolist()], dtype=torch.int32)
        return torch.cat([year_encoding, month_encoding, day_encoding, hour_encoding], 1)

In [1452]:
datetime_encoder = DatetimeEncoder()

In [1455]:
# TEST datetime values encoding
tensor = datetime_encoder(df.iloc[0:2])
datetime_expected_num_of_feats = 4
if tensor.shape[1] == datetime_expected_num_of_feats:
    print('DATETIME encoding OK')
else:
    print(f'num of features: {tensor.shape[1]}')
    print(f'expected num of features: {datetime_expected_num_of_feats}')
    raise Exception('DATETIME encoding NOT OK! :(')

tensor([[2005,    8,   27,   20],
        [2005,    8,   27,   18]], dtype=torch.int32)
DATETIME encoding OK


#### Results
One-hot encoding

In [None]:
class ResultEncoder(object):
    """Encode the result column of the given pandas DataFrame sample"""
    def __init__(self, dict_map: dict):
        self.mapping = dict_map

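    def __call__(self, sample: pd.DataFrame) -> torch.Tensor:
        # map() yields a Series of lists; convert it to a nested list before building the tensor
        return torch.tensor(sample['result'].map(self.mapping).tolist(), dtype=torch.int32)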

In [None]:
result2onehot = {'home': [1, 0, 0], 'draw': [0, 1, 0], 'away': [0, 0, 1]}
result_encoder = ResultEncoder(result2onehot)

In [None]:
# TEST results encoding
tensor = result_encoder(df.iloc[[0]])
results_expected_num_of_feats = len(df['result'].unique())
if tensor.shape[1] == results_expected_num_of_feats:
    print('RESULT encoding OK')
else:
    print(f'num of features: {tensor.shape[1]}')
    print(f'expected num of features: {results_expected_num_of_feats}')
    raise Exception('RESULT encoding NOT OK! :(')

#### Referees
One-hot encoding

In [None]:
class RefereeEncoder(object):
    """Encode the referee column of the given pandas DataFrame sample"""
    def __init__(self, lb: LabelBinarizer):
        self.lb = lb

    def __call__(self, sample: pd.DataFrame) -> torch.Tensor:
        return torch.tensor(self.lb.transform(sample['referee']))

In [None]:
lb = LabelBinarizer()
fitted_lb = lb.fit(df['referee'].tolist())
referee_encoder = RefereeEncoder(fitted_lb)

In [None]:
# TEST referees encoding
tensor = referee_encoder(df.iloc[[0]])
referees_expected_num_of_feats = len(df['referee'].unique())
if tensor.shape[1] == referees_expected_num_of_feats:
    print('REFEREE encoding OK')
else:
    print(f'num of features: {tensor.shape[1]}')
    print(f'expected num of features: {referees_expected_num_of_feats}')
    raise Exception('REFEREE encoding NOT OK! :(')

#### Teams
One-hot encoding

In [None]:
class TeamsEncoder(object):
    """Encode the home_team and away_team columns of the given pandas DataFrame sample"""
    def __init__(self, lb: LabelBinarizer):
        self.lb = lb

    def __call__(self, sample: pd.DataFrame) -> torch.Tensor:
        home_encoding = torch.tensor(self.lb.transform(sample['home_team']))
        away_encoding = torch.tensor(self.lb.transform(sample['away_team']))
        return torch.cat([home_encoding, away_encoding], 1)

In [None]:
lb = LabelBinarizer()
# every team has played as home team at least once
fitted_lb = lb.fit(df['home_team'].tolist())
teams_encoder = TeamsEncoder(fitted_lb)

In [None]:
# TEST teams encoding
tensor = teams_encoder(df.iloc[[0]])
teams_expected_num_of_feats = len(df['home_team'].unique()) * 2
if tensor.shape[1] == teams_expected_num_of_feats:
    print('TEAMS encoding OK')
else:
    print(f'num of features: {tensor.shape[1]}')
    print(f'expected num of features: {teams_expected_num_of_feats}')
    raise Exception('TEAMS encoding NOT OK! :(')

#### Coaches
One-hot encoding

In [None]:
class CoachesEncoder(object):
    """Encode the home_team_coach and away_team_coach columns of the given pandas DataFrame sample"""
    def __init__(self, lb: LabelBinarizer):
        self.lb = lb

    def __call__(self, sample: pd.DataFrame) -> torch.Tensor:
        home_coach_encoding = torch.tensor(self.lb.transform(sample['home_team_coach'].tolist()))
        away_coach_encoding = torch.tensor(self.lb.transform(sample['away_team_coach'].tolist()))
        return torch.cat([home_coach_encoding, away_coach_encoding], 1)

In [None]:
lb = LabelBinarizer()
# every team has played as home team at least once, so home_team_coach already contains all the coaches
fitted_lb = lb.fit(df['home_team_coach'].tolist())
coaches_encoder = CoachesEncoder(fitted_lb)

In [None]:
# TEST coaches encoding
tensor = coaches_encoder(df.iloc[[0]])
coaches_expected_num_of_feats = len(df['home_team_coach'].unique()) * 2
if tensor.shape[1] == coaches_expected_num_of_feats:
    print('COACH encoding OK')
else:
    print(f'num of features: {tensor.shape[1]}')
    print(f'expected num of features: {coaches_expected_num_of_feats}')
    raise Exception('COACH encoding NOT OK! :(')

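Fitting only on `home_team_coach` works as long as every coach appears at least once as the home coach in the fitted slice. A coach who was only ever listed on the away side would be missing from the fitted classes. A minimal sketch that fits the binarizer on both columns instead (the names reuse coaches from the table above, but the pairings in this toy frame are invented):

```python
import pandas as pd
from sklearn.preprocessing import LabelBinarizer

toy = pd.DataFrame({
    'home_team_coach': ['Massimo Silva', 'Cesare Prandelli', 'Massimo Silva'],
    'away_team_coach': ['Roberto Mancini', 'Massimo Silva', 'Mario Beretta'],
})

# Fit on the union of both columns so away-only coaches get a class too.
all_coaches = pd.concat([toy['home_team_coach'], toy['away_team_coach']])
lb = LabelBinarizer().fit(all_coaches)

home = lb.transform(toy['home_team_coach'])
away = lb.transform(toy['away_team_coach'])
print(lb.classes_)
print(home.shape, away.shape)
```

With this change the expected number of coach features becomes `len(lb.classes_) * 2` rather than twice the number of unique home coaches.
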
#### Players
One-hot encoding. We treat all players equally, whether they are in the starting lineup or among the substitutes.

In [None]:
class PlayersEncoder(object):
    """Encode the home and away team lineup and substitute players of the given pandas DataFrame sample"""
    def __init__(self, lb: LabelBinarizer):
        self.lb = lb

    def __call__(self, sample: pd.DataFrame) -> torch.Tensor:
        result = []
        for i in range(1, 12):
            result += [torch.tensor(self.lb.transform(sample[f'home_player_{i}']))]
        for i in range(1, 8):
            result += [torch.tensor(self.lb.transform(sample[f'home_substitute_{i}']))]
        for i in range(1, 12):
            result += [torch.tensor(self.lb.transform(sample[f'away_player_{i}']))]
        for i in range(1, 8):
            result += [torch.tensor(self.lb.transform(sample[f'away_substitute_{i}']))]
        return torch.cat(result, 1)

In [None]:
def flatten_list(list_of_lists: list[list[str]]) -> list[str]:
    return [item for sublist in list_of_lists for item in sublist]


def encode_fit_players(source_df: pd.DataFrame) -> LabelBinarizer:
    lb = LabelBinarizer()
    player_cols = get_column_names_containing_str(source_df, 'home_player')
    player_cols += get_column_names_containing_str(source_df, 'home_substitute')
    all_players_unflattened = source_df.loc[:, player_cols].values.tolist()
    all_players_flattened = flatten_list(all_players_unflattened)
    lb.fit(all_players_flattened)
    return lb

In [None]:
lb = LabelBinarizer()
fitted_lb = encode_fit_players(df)
players_encoder = PlayersEncoder(fitted_lb)

In [None]:
# TEST players encoding
player_cols = get_column_names_containing_str(df, 'home_player')
player_cols += get_column_names_containing_str(df, 'home_substitute')
tensor = players_encoder(df.iloc[[0]])
all_unique_player_names = pd.concat([df[player_cols[i]] for i in range(len(player_cols))], axis=0).unique()
players_expected_num_of_feats = len(all_unique_player_names) * (11 + 7) * 2
if tensor.shape[1] == players_expected_num_of_feats:
    print('PLAYER encoding OK')
else:
    print(f'num of features: {tensor.shape[1]}')
    print(f'expected num of features: {players_expected_num_of_feats}')
    raise Exception('PLAYER encoding NOT OK! :(')

In [None]:
class Encode(object):
    """Encode the given pandas DataFrame sample and return a pytorch Tensor"""
    def __init__(self, season_round_enc: SeasonRoundEncoder, datetime_enc: DatetimeEncoder, result_enc: ResultEncoder,
                 referee_enc: RefereeEncoder, teams_enc: TeamsEncoder, coaches_enc: CoachesEncoder, players_enc: PlayersEncoder,
                 keep_scores: bool):
        self.season_round_encoder = season_round_enc
        self.datetime_encoder = datetime_enc
        self.result_encoder = result_enc
        self.referee_encoder = referee_enc
        self.teams_encoder = teams_enc
        self.coaches_encoder = coaches_enc
        self.players_encoder = players_enc
        self.keep_scores = keep_scores

    def __call__(self, sample: pd.DataFrame) -> torch.Tensor:
        encoded = torch.cat([
            self.season_round_encoder(sample),
            self.datetime_encoder(sample),
            self.result_encoder(sample),
            self.referee_encoder(sample),
            self.teams_encoder(sample),
            self.coaches_encoder(sample),
            self.players_encoder(sample)
        ], 1)
        if self.keep_scores:
            # shape (n_rows, 2), so it can be concatenated along dim 1 for any number of rows
            scores = torch.tensor(sample[['home_team_score', 'away_team_score']].to_numpy(), dtype=torch.int32)
            return torch.cat([encoded, scores], 1)
        return encoded

In [None]:
# print train tensor example
test_sample = df.iloc[[0]]
test_encoder = Encode(season_round_encoder, datetime_encoder, result_encoder, referee_encoder, teams_encoder, coaches_encoder, players_encoder, True)
test_encoded_sample = test_encoder(test_sample)
total_num_of_features = seasons_rounds_expected_num_of_feats \
                        + datetime_expected_num_of_feats \
                        + results_expected_num_of_feats \
                        + referees_expected_num_of_feats \
                        + teams_expected_num_of_feats \
                        + coaches_expected_num_of_feats \
                        + players_expected_num_of_feats \
                        + 2 # we kept the scores
if test_encoded_sample.shape[1] == total_num_of_features:
    print("encoding OK")
else:
    print(f'num of features: {test_encoded_sample.shape[1]}')
    print(f'expected num of features: {total_num_of_features}')
    raise Exception("encoding NOT OK")



In [None]:
print(f'Total number of encoded features: {test_encoded_sample.size(1)}')

In [None]:
del df
del lb

### Data normalization

In [None]:
# todo

### Dataset construction

We need to define a torch `Dataset` and a torch `DataLoader` that will be used during training.

In [None]:
class SerieAFootballMatchesDataset(Dataset):
    def __init__(self, csv_file, transform=None, target_transform=None):
        self.dataframe = pd.read_csv(csv_file)
        self.transform = transform
        self.target_transform = target_transform

    def __len__(self) -> int:
        return len(self.dataframe)

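    def __getitem__(self, idx):
        row = self.dataframe.iloc[[idx]]
        # extract scalars: the lookup helpers expect a team name and a round number, not one-row Series
        home_team = row['home_team'].iloc[0]
        away_team = row['away_team'].iloc[0]
        match_round = int(row['round'].iloc[0])
        last_5_games_home = get_last_n_matches_played_by_team_before_round(
            self.dataframe, home_team, match_round, 5)
        last_5_games_away = get_last_n_matches_played_by_team_before_round(
            self.dataframe, away_team, match_round, 5)
        # todo: encode and return the two histories together with the sample
        sample = self.transform(row) if self.transform else row
        if self.target_transform:
            # compute the label from the raw row, not from the already encoded sample
            label = self.target_transform(row)
            return sample, label
        return sample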

In [None]:
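# Encode has no default arguments: pass the encoders defined above
encoder = Encode(season_round_encoder, datetime_encoder, result_encoder, referee_encoder,
                 teams_encoder, coaches_encoder, players_encoder, keep_scores=True)
dataset = SerieAFootballMatchesDataset(csv_file='processed.csv', transform=encoder)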

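The class above covers the `Dataset` half; batching is handled by `torch.utils.data.DataLoader`. A minimal sketch with a stand-in dataset of random tensors (the class `ToyMatches`, the feature width of 10 and the batch size of 4 are assumptions, not values from this notebook):

```python
import torch
from torch.utils.data import DataLoader, Dataset


class ToyMatches(Dataset):
    """Stand-in dataset: each item is (features, class index)."""
    def __init__(self, n_samples: int = 8, n_features: int = 10):
        self.x = torch.randn(n_samples, n_features)
        self.y = torch.randint(0, 3, (n_samples,))  # 0=home, 1=draw, 2=away

    def __len__(self) -> int:
        return len(self.x)

    def __getitem__(self, idx):
        return self.x[idx], self.y[idx]


# shuffle=False keeps the matches in their stored (chronological) order
loader = DataLoader(ToyMatches(), batch_size=4, shuffle=False)
for features, labels in loader:
    print(features.shape, labels.shape)
```

The default collate function stacks tensors, so a `__getitem__` that returns a pandas DataFrame (as happens when no `transform` is set) would need a custom `collate_fn`.
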
## Training

### RNN

In [None]:
class RNN(nn.Module):
    def __init__(self, input_size, hidden_size):
        super(RNN, self).__init__()
        self.hidden_size = hidden_size
        self.linear = nn.Linear(input_size + hidden_size, hidden_size)
        self.tanh = nn.Tanh()

    def forward(self, input, hidden):
        combined = torch.cat((input, hidden), 1)
        pre_hidden = self.linear(combined)
        hidden = self.tanh(pre_hidden)
        return hidden

    def init_hidden(self):
        return torch.zeros(1, self.hidden_size)

In [None]:
class NeuralNetwork(nn.Module):
    def __init__(self, input_size):
        super(NeuralNetwork, self).__init__()
        self.input_size = input_size
        self.flatten = nn.Flatten()
        self.layers = nn.Sequential(
            nn.Linear(input_size, 512),
            nn.ReLU(),
            nn.Linear(512, 512),
            nn.ReLU(),
            nn.Linear(512, 3),
            nn.LogSoftmax(dim=1)  # NLLLoss expects log-probabilities, not probabilities
        )

    def forward(self, x):
        x = self.flatten(x)
        output = self.layers(x)
        return output

In [None]:
def get_encoded_home_team_historical_performance(df: pd.DataFrame, team: str, round: int) -> torch.Tensor:
    historical_performance_df = get_last_n_matches_played_by_team_before_round(df, team, round, 5)
    # reuse the encoder objects defined in the "Data encoding" section
    return torch.cat([
        season_round_encoder(historical_performance_df),
        datetime_encoder(historical_performance_df),
        result_encoder(historical_performance_df),
        referee_encoder(historical_performance_df),
        teams_encoder(historical_performance_df),
        coaches_encoder(historical_performance_df),
        players_encoder(historical_performance_df)
    ], 1)

In [None]:
def train_epoch(x_historical_home: torch.Tensor, x_historical_away: torch.Tensor, x_current: torch.Tensor, y: torch.Tensor,
                rnn_home: RNN, rnn_away: RNN, mlp: NeuralNetwork,
                rnn_home_optimizer: optim.Optimizer, rnn_away_optimizer: optim.Optimizer, mlp_optimizer: optim.Optimizer,
                loss_fn):
    # init: reset the gradients of all three models
    rnn_home_optimizer.zero_grad()
    rnn_away_optimizer.zero_grad()
    mlp_optimizer.zero_grad()
    # rnn_home forward
    rnn_home_hidden = rnn_home.init_hidden()
    for history_index in range(x_historical_home.size(0)):
        rnn_home_hidden = rnn_home(x_historical_home[history_index], rnn_home_hidden)
    # rnn_away forward
    rnn_away_hidden = rnn_away.init_hidden()
    for history_index in range(x_historical_away.size(0)):
        rnn_away_hidden = rnn_away(x_historical_away[history_index], rnn_away_hidden)
    # mlp forward
    x_train = torch.cat((x_current, rnn_home_hidden, rnn_away_hidden), 1)
    y_hat = mlp(x_train)
    # backward: the loss takes the prediction first, then the target
    loss = loss_fn(y_hat, y)
    loss.backward()
    rnn_home_optimizer.step()
    rnn_away_optimizer.step()
    mlp_optimizer.step()
    # with the default 'mean' reduction the loss is already averaged over the batch
    return loss.item()

In [None]:
hidden_size = 128
rnn_home = RNN(input_size=total_num_of_features, hidden_size=hidden_size)
rnn_away = RNN(input_size=total_num_of_features, hidden_size=hidden_size)
mlp = NeuralNetwork(hidden_size * 2 + total_num_of_features)

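The custom `RNN` class is a single Elman-style recurrent step, so the same data flow can be sketched with PyTorch's built-in `nn.RNNCell`, swapped in here for brevity. Each team's last matches are folded into one hidden state, and the two states are concatenated with the current match features before the MLP. The sizes and the helper `summarize` below are placeholders, not values from this notebook:

```python
import torch
import torch.nn as nn

n_features, hidden_size, history_len = 12, 8, 5  # placeholder sizes

rnn_home = nn.RNNCell(n_features, hidden_size)  # tanh non-linearity by default
rnn_away = nn.RNNCell(n_features, hidden_size)

# One encoded row per past match; random stand-ins for the real encodings.
x_historical_home = torch.randn(history_len, 1, n_features)
x_historical_away = torch.randn(history_len, 1, n_features)
x_current = torch.randn(1, n_features)


def summarize(cell: nn.RNNCell, history: torch.Tensor) -> torch.Tensor:
    """Fold a (history_len, batch, n_features) sequence into one hidden state."""
    hidden = torch.zeros(history.size(1), cell.hidden_size)
    for step in history:
        hidden = cell(step, hidden)
    return hidden


mlp_input = torch.cat(
    (x_current, summarize(rnn_home, x_historical_home), summarize(rnn_away, x_historical_away)), 1)
print(mlp_input.shape)  # 12 + 8 + 8 = 28 columns
```

This is why the MLP input size is `hidden_size * 2` plus the number of encoded features of the current match.
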
In [None]:
learning_rate = 0.01
rnn_home_optimizer = optim.SGD(rnn_home.parameters(), lr=learning_rate)
rnn_away_optimizer = optim.SGD(rnn_away.parameters(), lr=learning_rate)
mlp_optimizer = optim.SGD(mlp.parameters(), lr=learning_rate)
loss_fn = nn.NLLLoss()

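`nn.NLLLoss` expects log-probabilities as input and integer class indices as targets, so the one-hot rows produced by `ResultEncoder` have to be converted with `argmax` first. A minimal sketch with made-up logits, which also checks that `LogSoftmax` followed by `NLLLoss` matches `nn.CrossEntropyLoss` applied to the raw logits:

```python
import torch
import torch.nn as nn

# Made-up scores for (home, draw, away) on two matches.
logits = torch.tensor([[2.0, 0.5, -1.0],
                       [0.1, 0.2, 0.3]])
one_hot = torch.tensor([[1, 0, 0],    # home
                        [0, 0, 1]])   # away

# Class indices: position of the 1 in each one-hot row.
targets = one_hot.argmax(dim=1)

log_probs = nn.LogSoftmax(dim=1)(logits)
nll = nn.NLLLoss()(log_probs, targets)
ce = nn.CrossEntropyLoss()(logits, targets)
print(targets, nll.item(), ce.item())
```

Either pairing works; what matters is that the last layer of the network and the loss agree on whether the outputs are log-probabilities or raw logits.
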
In [None]:
import random
import time


def train_model(training_samples, rnn_home: RNN, rnn_away: RNN, mlp: NeuralNetwork,
                rnn_home_optimizer: optim.Optimizer, rnn_away_optimizer: optim.Optimizer, mlp_optimizer: optim.Optimizer,
                loss_fn,
                n_iters, print_every=1000, plot_every=100):
    # training_samples (assumed layout): sequence of
    # (x_historical_home, x_historical_away, x_current, y) tuples of tensors
    start = time.time()
    plot_losses = []
    print_loss_total = 0  # Reset every print_every
    plot_loss_total = 0  # Reset every plot_every

    for iteration in range(1, n_iters + 1):
        # draw one training sample at random
        x_historical_home, x_historical_away, x_current, y = random.choice(training_samples)
        loss = train_epoch(x_historical_home, x_historical_away, x_current, y,
                           rnn_home, rnn_away, mlp,
                           rnn_home_optimizer, rnn_away_optimizer, mlp_optimizer, loss_fn)
        print_loss_total += loss
        plot_loss_total += loss
        if iteration % print_every == 0:
            print_loss_avg = print_loss_total / print_every
            print_loss_total = 0
            # elapsed seconds, iteration, progress percentage, average loss
            print('%ds (%d %d%%) %.4f' % (time.time() - start,
                                          iteration, iteration / n_iters * 100, print_loss_avg))
        if iteration % plot_every == 0:
            plot_loss_avg = plot_loss_total / plot_every
            plot_losses.append(plot_loss_avg)
            plot_loss_total = 0
    # return the averaged losses so they can be plotted by the caller
    return plot_losses

## Missing data
- We don't have data about new players who join _Serie A_ over the course of the seasons. The model has to learn from zero context how much they contribute to the outcome of the matches. If we were to consider multiple leagues, we could keep track of player transfers and maintain their history.
- We don't have data about cup matches played during the seasons, such as the _Champions League_, _Europa League_ and _Coppa Italia_. These are prestigious and usually very competitive competitions, so teams put a lot of effort into them and may therefore perform worse in the league.
- We don't have any player performance metrics, such as who scored, who provided the assist, red or yellow cards, goalkeeper saves, etc., so the model may struggle to learn which players are important for their team.