# Football match results prediction

The aim of this project is to develop and compare simple deep learning models that predict the outcome of *Serie A* football matches, i.e. **home win**, **away win** or **draw**.
In total, three models were designed:
- A **baseline**
- A **hybrid** model
- An **RNN**

Raw data was processed differently depending on the target model. Additional experiments reduced the number of features, and thus the complexity of the models, in an effort to improve performance. In the end, five trained models were compared.

The following sections will first describe the raw data, then the models and finally the results.

## The raw data
The raw dataset was gathered from the official [Serie A archive](https://www.legaseriea.it/it/serie-a/archivio), covering the seasons from 2005-06 to 2021-22. In total, 6460 matches were collected, each described by the following features.

In [1]:
import pandas as pd

df = pd.read_csv('raw.csv')
df.shape

(6460, 95)

In [2]:
df.columns

Index(['season', 'round', 'date', 'time', 'referee', 'home_team', 'away_team',
       'home_score', 'away_score', 'home_gk_saves', 'away_gk_saves',
       'home_penalties', 'away_penalties', 'home_shots', 'away_shots',
       'home_shots_on_target', 'away_shots_on_target', 'home_shots_off_target',
       'away_shots_off_target', 'home_shots_on_target_from_penalty_area',
       'away_shots_on_target_from_penalty_area', 'home_fouls', 'away_fouls',
       'home_woodwork_hits', 'away_woodwork_hits', 'home_goal_chances',
       'away_goal_chances', 'home_assists', 'away_assists', 'home_offsides',
       'away_offsides', 'home_corner_kicks', 'away_corner_kicks',
       'home_yel_cards', 'away_yel_cards', 'home_red_cards', 'away_red_cards',
       'home_crosses', 'away_crosses', 'home_long_throws', 'away_long_throws',
       'home_attacks_from_center', 'away_attacks_from_center',
       'home_attacks_from_right', 'away_attacks_from_right',
       'home_attacks_from_left', 'away_attacks_from_l

In [3]:
df.head()

Unnamed: 0,season,round,date,time,referee,home_team,away_team,home_score,away_score,home_gk_saves,...,away_substitute3,away_substitute4,away_substitute5,away_substitute6,away_substitute7,away_substitute8,away_substitute9,away_substitute10,away_substitute11,away_substitute12
0,2005-06,1,28/08/2005,20:30,MATTEO SIMONE,JUVENTUS,CHIEVOVERONA,1,0,0,...,Amauri,John Mensah,Filippo Antonelli,Victor Obinna,Giovanni Marchese,-,-,-,-,-
1,2005-06,1,28/08/2005,15:00,ROBERTO ROSETTI,REGGINA,ROMA,0,3,0,...,Shabani Nonda,Pietro Pipolo,Cesare Bovo,Houssine Kharja,Antonio Cassano,-,-,-,-,-
2,2005-06,1,28/08/2005,15:00,ANDREA DE,UDINESE,EMPOLI,1,0,4,...,Daniele Balli,Davide Moro,Paolo Zanetti,Andrea Raggi,Francesco Pratali,-,-,-,-,-
3,2005-06,1,28/08/2005,15:00,PAOLO DONDARINI,LAZIO,MESSINA,1,0,7,...,Ivica Iliev,Marco Storari,Filippo Cristante,Luca Fusco,Atsushi Yanagisawa,-,-,-,-,-
4,2005-06,1,28/08/2005,15:00,PAOLO TAGLIAVENTO,INTER,TREVISO,3,0,3,...,Jehad Muntasser,Adriano Zancope,Francesco Parravicini,Anderson,Alberto Giuliatto,-,-,-,-,-


From the dataframe head we can see that, for each match, we have data about:
- The report (season, round, date, time, referee, teams and scores)
- The statistics (e.g. penalties, shots, shots on target, shots off target, fouls)
- The lineups (coaches, players and substitutes)

Note that we could distinguish between:
- Pre-match data (season, round, date, time, referee, teams and lineups)
- Post-match data (scores and statistics)
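
The pre-match/post-match distinction above can be sketched as a simple column split. This is a minimal illustration rather than the project's code: the dataframe is a one-row stand-in, the shot counts are made-up values, and the `POST_MATCH` list is an assumed, abbreviated version of the real scores-plus-statistics column set.

```python
import pandas as pd

# One-row stand-in for the raw dataframe; only a handful of columns are shown.
raw = pd.DataFrame({
    "season": ["2005-06"],
    "round": [1],
    "home_team": ["JUVENTUS"],
    "away_team": ["CHIEVOVERONA"],
    "home_score": [1],
    "away_score": [0],
    "home_shots": [12],  # made-up value
    "away_shots": [7],   # made-up value
})

# Assumption: post-match columns are listed explicitly (scores plus statistics).
POST_MATCH = ["home_score", "away_score", "home_shots", "away_shots"]

post_match_cols = [c for c in raw.columns if c in POST_MATCH]
pre_match_cols = [c for c in raw.columns if c not in POST_MATCH]

print("pre-match: ", pre_match_cols)
print("post-match:", post_match_cols)
```

Keeping the split explicit makes it harder to leak post-match information into a model that is supposed to see only pre-match data.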

As mentioned above, the raw data was processed differently depending on the target model, so let's look at each model in turn.

## The models
### The baseline model
The baseline model is a simple Multi-Layer Perceptron (MLP) built to predict the outcome of a game from its pre-match data alone. Scores and statistics features were therefore excluded from the final dataset, as they are unknown before the match.

<br>
<div style="text-align: center;">
    <img
        style="display: block;
               margin-left: auto;
               margin-right: auto;
               width: 50%;"
        src="baseline/images/baseline-architecture.png"
        alt="Baseline Architecture">
    </img>
</div>
<br>

#### Processing of the raw dataset
##### Phase 1: data fixing
Several matches with errors or inconsistent data were detected and fixed in the raw dataset.

*Example #1*: in several matches played by Lazio during the 2007-08 season, the goalkeeper Marco Ballotta is missing from the lineup, resulting in a data shift and `NULL` values in the column `away_substitute12`.

*Example #2*: in round 37 of the 2005-06 season, MESSINA-EMPOLI was suspended at 89’ with the score at 1-2. The match was later awarded to EMPOLI by forfeit, i.e. 0-3 for EMPOLI. Since the game was about to end when it was suspended, the on-pitch score was kept in the dataset.

##### Phase 2: data manipulation
This phase includes a few important steps:
- Substitution of `date` and `time` features with `year`, `month`, `day` and `hour`;
- Creation of the target column named `result` based on `home_score` and `away_score`. This column contains the categorical values `home`, `away` or `draw`, which the model will try to predict;
- Dropping of `home_score`, `away_score` and all the statistics features;
- Type casting.
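
The manipulation steps above could look roughly like the following pandas sketch. It is an illustration under assumptions, not the project's implementation: the date and time formats are inferred from the dataframe head shown earlier, `home_shots` stands in for the full set of statistics columns, and its values are made up.

```python
import pandas as pd

raw = pd.DataFrame({
    "date": ["28/08/2005", "28/08/2005"],
    "time": ["20:30", "15:00"],
    "home_score": [1, 0],
    "away_score": [0, 3],
    "home_shots": [12, 5],  # made-up values standing in for all statistics
})

def manipulate(df):
    out = df.copy()
    # Replace date and time with numeric year/month/day/hour features.
    when = pd.to_datetime(out["date"] + " " + out["time"], format="%d/%m/%Y %H:%M")
    out["year"] = when.dt.year
    out["month"] = when.dt.month
    out["day"] = when.dt.day
    out["hour"] = when.dt.hour
    # Derive the target from the final score.
    out["result"] = "draw"
    out.loc[out["home_score"] > out["away_score"], "result"] = "home"
    out.loc[out["home_score"] < out["away_score"], "result"] = "away"
    # Drop everything that is unknown before kick-off, plus the replaced columns.
    return out.drop(columns=["date", "time", "home_score", "away_score", "home_shots"])

processed = manipulate(raw)
print(processed)
```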

##### Phase 3: data encoding
This phase includes the following steps:
 - Label encoding of `season` and `year` features
 - One-hot encoding of all categorical features
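
As a rough sketch of the two encoding steps, assuming plain pandas is enough: label encoding maps each distinct value to an integer, and `pd.get_dummies` expands each categorical column into one indicator column per value. The third row is an invented fixture added only to give the example a second season.

```python
import pandas as pd

df = pd.DataFrame({
    "season": ["2005-06", "2005-06", "2006-07"],
    "year": [2005, 2005, 2006],
    "home_team": ["JUVENTUS", "REGGINA", "INTER"],
    "away_team": ["CHIEVOVERONA", "ROMA", "ROMA"],
})

# Label encoding: sorted order, so earlier seasons and years get lower codes.
for col in ["season", "year"]:
    codes = {value: i for i, value in enumerate(sorted(df[col].unique()))}
    df[col] = df[col].map(codes)

# One-hot encoding: one 0/1 column per distinct category value.
encoded = pd.get_dummies(df, columns=["home_team", "away_team"], dtype=int)
print(encoded.columns.tolist())
```

With thousands of distinct players, coaches and referees, this expansion is what drives the feature count into the thousands.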

In [None]:
baseline_df = pd.read_csv('baseline/data_baseline.csv')
print(f'Final dataset shape: {baseline_df.shape}')

### The hybrid model
The hybrid model consists of an RNN chained with an MLP. The goal is still to predict the outcome of a game from its pre-match data. The architecture rests on the assumption that the outcome of a match between two teams depends significantly on their current form, where the form of a team can be viewed as its recent sequence of results against other teams. If the model designed as a baseline (i.e. the MLP) is given extra context about the form of the two teams facing each other, it should hopefully improve on the baseline. Encoding the form of both teams is the job of the RNN.

In more detail, the network is fed with the data of the match whose outcome we want to predict, plus the sequence of the last 5 games played by each team. From the two historical sequences, the 2-layer RNN should encode the form of both teams. The result is then combined with the pre-match data and fed to the MLP, which outputs a result.

Note that, since a historical sequence is made up of already finished matches, the data that is fed to the RNN includes all the available features, while the MLP just looks at the pre-match data and the form of both teams.
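
The data flow described above can be sketched as a NumPy forward pass. This is only an illustration under several assumptions that the text does not confirm: all layer sizes are invented, the same RNN weights are shared by both teams, the last hidden state is used as the form encoding, and the activations are tanh (RNN) and ReLU (MLP). Weights are random, so the output probabilities are meaningless.

```python
import numpy as np

rng = np.random.default_rng(0)

# All sizes are illustrative assumptions, not the project's real dimensions.
SEQ_LEN, HIST_FEATS, PRE_FEATS, HIDDEN, MLP_UNITS, CLASSES = 5, 8, 6, 4, 16, 3

def init(shape):
    return rng.normal(scale=0.1, size=shape)

# Two stacked simple-RNN layers: (input weights, recurrent weights, bias) each.
rnn_layers = [
    (init((HIST_FEATS, HIDDEN)), init((HIDDEN, HIDDEN)), np.zeros(HIDDEN)),
    (init((HIDDEN, HIDDEN)), init((HIDDEN, HIDDEN)), np.zeros(HIDDEN)),
]
# MLP head: sees pre-match features plus the two form encodings.
W1, b1 = init((PRE_FEATS + 2 * HIDDEN, MLP_UNITS)), np.zeros(MLP_UNITS)
W2, b2 = init((MLP_UNITS, CLASSES)), np.zeros(CLASSES)

def encode_form(sequence):
    """Run a (SEQ_LEN, HIST_FEATS) history through the stacked RNN."""
    inputs = sequence
    for W_in, W_rec, b in rnn_layers:
        h = np.zeros(HIDDEN)
        outputs = []
        for x_t in inputs:
            h = np.tanh(x_t @ W_in + h @ W_rec + b)
            outputs.append(h)
        inputs = np.stack(outputs)  # feeds the next layer
    return inputs[-1]  # last hidden state of the top layer

def predict(pre_match, home_history, away_history):
    form = np.concatenate([encode_form(home_history), encode_form(away_history)])
    hidden = np.maximum(0.0, np.concatenate([pre_match, form]) @ W1 + b1)
    logits = hidden @ W2 + b2
    exp = np.exp(logits - logits.max())  # numerically stable softmax
    return exp / exp.sum()

probs = predict(
    rng.normal(size=PRE_FEATS),
    rng.normal(size=(SEQ_LEN, HIST_FEATS)),
    rng.normal(size=(SEQ_LEN, HIST_FEATS)),
)
print(probs)  # three class probabilities summing to 1
```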

<br>
<div style="text-align: center;">
    <img
        style="display: block;
               margin-left: auto;
               margin-right: auto;
               width: 50%;"
        src="hybrid/images/hybrid-architecture.png"
        alt="Hybrid Architecture">
    </img>
</div>
<br>

#### Processing of the raw dataset
##### Phase 1: data fixing
See data fixing from baseline model.

##### Phase 2: data manipulation
This phase includes the following steps:
- Substitution of `date` and `time` features with `year`, `month`, `day` and `hour`;
- Creation of the target column `result`;
- Creation of historic features including the data of the last 5 matches for both the home and away team;
- Conversion from wide to long format resulting in sequences of 6 matches, 5 of which are historic;
- Type casting;

This final dataset in long format will be converted to a nested array prior to training, so that, given an observation, we not only have the data of the match whose result will be predicted but also its historic sequence.
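
One possible way to build such a nested array is sketched below. The exact layout used in the project is not shown here, so everything in this example is an assumption: matches are taken to be in chronological order, histories are looked up per team, matches without enough history are skipped, and `N_HISTORY` is set to 2 instead of 5 to keep the toy data small. Team names and the single `feature` column are placeholders.

```python
import numpy as np
import pandas as pd

N_HISTORY = 2  # the project uses 5; 2 keeps the toy example small

# Toy, chronologically ordered matches with one made-up numeric feature.
matches = pd.DataFrame({
    "home_team": ["A", "C", "B", "A", "B"],
    "away_team": ["B", "A", "C", "C", "A"],
    "feature":   [1.0, 2.0, 3.0, 4.0, 5.0],
})

def last_matches(df, team, before_idx, n):
    """Row labels of the last n matches played by `team` before `before_idx`."""
    past = df.iloc[:before_idx]
    played = past[(past["home_team"] == team) | (past["away_team"] == team)]
    return played.index[-n:].tolist()

samples, target_rows = [], []
for idx, row in matches.iterrows():
    home_hist = last_matches(matches, row["home_team"], idx, N_HISTORY)
    away_hist = last_matches(matches, row["away_team"], idx, N_HISTORY)
    if len(home_hist) < N_HISTORY or len(away_hist) < N_HISTORY:
        continue  # not enough history yet for one of the teams
    # Home-team history first, then away-team history.
    samples.append(matches.loc[home_hist + away_hist, ["feature"]].to_numpy())
    target_rows.append(idx)

nested = np.stack(samples)  # (n_samples, 2 * N_HISTORY, n_features)
print(nested.shape, target_rows)
```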

##### Phase 3: data encoding
See data encoding from baseline model.

In [None]:
hybrid_df = pd.read_csv('hybrid/data_hybrid.csv')
print(f'Final dataset shape: {hybrid_df.shape}')

### The RNN model
For the last model, a simpler approach was taken than in the hybrid architecture. The MLP part of the network was removed entirely, and the RNN was changed so that it directly outputs an outcome prediction for the current match, given only the historical sequence of the last 5 games played by each team (i.e. no pre-match data is included).


<br>
<div style="text-align: center;">
    <img
        style="display: block;
               margin-left: auto;
               margin-right: auto;
               width: 50%;"
        src="rnn/images/rnn-architecture.png"
        alt="RNN Architecture">
    </img>
</div>
<br>

#### Processing of the raw dataset
##### Phase 1: data fixing
See data fixing from baseline model.

##### Phase 2: data manipulation
This phase follows the same steps as the hybrid model's data manipulation, with one difference: the non-historical features were discarded before the wide-to-long conversion, so the resulting sequences contain 5 matches.

##### Phase 3: data encoding
See data encoding from baseline model.

In [None]:
rnn_df = pd.read_csv('rnn/data_rnn.csv')
print(f'Final dataset shape: {rnn_df.shape}')

## Training

The 3 models were trained with the following configuration and hyperparameters:
- Typical 80/20/20 ratio to split the dataset into train/validation/test;
- SGD as optimization method;
- Early stopping with a maximum of 200 epochs to avoid overfitting;
- A learning rate of 0.001 and a batch size of 32 (16 for RNN model A, as listed in the Results section).
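
Early stopping with an epoch cap can be sketched as follows. Only the 200-epoch limit comes from the configuration above; the patience-based stopping rule, the `patience` value and the simulated loss curve are assumptions made for illustration, and the list of losses stands in for actually training and validating the model each epoch.

```python
def train_with_early_stopping(val_losses, max_epochs=200, patience=3):
    """Return (best_epoch, best_loss) for a sequence of validation losses.

    `val_losses` stands in for "train one epoch, then evaluate on the
    validation set"; `patience` is a hypothetical setting.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses[:max_epochs]):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # validation loss stopped improving
    return best_epoch, best_loss

# Simulated validation curve: improves, then degrades for three epochs.
simulated = [1.10, 1.02, 0.97, 0.95, 0.96, 0.98, 0.99, 0.94]
best_epoch, best_loss = train_with_early_stopping(simulated)
print(best_epoch, best_loss)
```

In this simulated run training stops before the final, lower value is ever reached, which is the usual trade-off of a short patience window.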

Moreover, as mentioned at the beginning, feature selection experiments were run in an effort to improve the performance of the hybrid and RNN models. The high dimensionality of the dataset is mostly due to the one-hot encoding of players, coaches, referees and teams, so two additional models were trained without those features. The underlying assumption is that knowing, for example, the names of the two teams adds little to the prediction if the network already knows that the home team is in much better form than the away team. There are counterarguments: some players are game changers, so knowing that they will play matters, and a team that has just been assigned a new coach usually does not perform well in the following matches. The features were discarded anyway, to check whether they mostly add noise.
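
Dropping the identity features could be done by matching column names, as in this sketch. The substring markers are an assumed naming convention: `referee`, `home_team`, `away_team` and `away_substitute12` appear in the raw data shown earlier, while `home_coach`, `away_coach` and `home_player1` are hypothetical names used only for illustration.

```python
import pandas as pd

# Empty frame with a few representative column names.
df = pd.DataFrame(columns=[
    "season", "round", "referee", "home_team", "away_team",
    "home_shots", "away_shots",
    "home_coach", "away_coach", "home_player1",  # hypothetical names
    "away_substitute12",
])

# Assumption: identity columns can be recognised by these substrings.
IDENTITY_MARKERS = ("referee", "team", "coach", "player", "substitute")

identity_cols = [c for c in df.columns if any(m in c for m in IDENTITY_MARKERS)]
reduced = df.drop(columns=identity_cols)
print(reduced.columns.tolist())
```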

The following table sums up all the trained models:

| Model    | Training features                       | # of features |
|----------|-----------------------------------------|---------------|
| Baseline | all                                     | 8440          |
| Hybrid A | all                                     | 9009          |
| RNN A    | all                                     | 8826          |
| Hybrid B | no players, coaches, referees and teams | 95            |
| RNN B    | no players, coaches, referees and teams | 95            |

## Results

Ten training runs were performed with each model. The graphs below show the results, with a different color for each training run.

### Baseline

Hyperparameters:
- Learning rate: 0.001
- Batch size: 32
- Early stopping with 200 epochs limit

<img src="baseline/images/results/results.png">

### Hybrid model A

Hyperparameters:
- Learning rate: 0.001
- Batch size: 32
- Early stopping with 200 epochs limit

<br>
<img src="hybrid/images/results/results_A.png">

### Hybrid model B

Hyperparameters:
- Learning rate: 0.001
- Batch size: 32
- Early stopping with 200 epochs limit

<img src="hybrid/images/results/results_B.png">

### RNN model A

Hyperparameters:
- Learning rate: 0.001
- Batch size: 16
- Early stopping with 200 epochs limit

<br>
<img src="rnn/images/results/results_A.png">

### RNN model B

Hyperparameters:
- Learning rate: 0.001
- Batch size: 32
- Early stopping with 200 epochs limit

<br>
<img src="rnn/images/results/results_B.png">

### Comparison
The following graphs compare the performance of all five models, with a different color for each model.

#### Accuracy on validation set of all models
| Model    | Color                                 |
|----------|---------------------------------------|
| Baseline | <font color=#6d6d6d>Gray</font>       |
| Hybrid A | <font color=#ed0602>Red</font>        |
| RNN A    | <font color=#0fb503>Green</font>      |
| Hybrid B | <font color=#1ed9ff>Light Blue</font> |
| RNN B    | <font color=#ac1eff>Violet</font>     |

<br>
<img src="images/results/val_acc.png">

#### Loss on validation set of all models
| Model    | Color                                 |
|----------|---------------------------------------|
| Baseline | <font color=#6d6d6d>Gray</font>       |
| Hybrid A | <font color=#ed0602>Red</font>        |
| RNN A    | <font color=#0fb503>Green</font>      |
| Hybrid B | <font color=#1ed9ff>Light Blue</font> |
| RNN B    | <font color=#ac1eff>Violet</font>     |

<br>
<img src="images/results/val_loss.png">

#### Loss on training set of all models

| Model    | Color                                 |
|----------|---------------------------------------|
| Baseline | <font color=#6d6d6d>Gray</font>       |
| Hybrid A | <font color=#ed0602>Red</font>        |
| RNN A    | <font color=#0fb503>Green</font>      |
| Hybrid B | <font color=#1ed9ff>Light Blue</font> |
| RNN B    | <font color=#ac1eff>Violet</font>     |

<br>
<img src="images/results/train_loss.png">

#### Accuracy and loss on test set of all models

| Model    | Color                                 |
|----------|---------------------------------------|
| Baseline | <font color=#6d6d6d>Gray</font>       |
| Hybrid A | <font color=#ed0602>Red</font>        |
| RNN A    | <font color=#0fb503>Green</font>      |
| Hybrid B | <font color=#1ed9ff>Light Blue</font> |
| RNN B    | <font color=#ac1eff>Violet</font>     |

<br>
<img src="images/results/test_acc_loss.png">
<br>
<br>

| Model    | Avg. test accuracy | Avg. test loss (C.E.) |
|----------|--------------------|-----------------------|
| Baseline | 67.42 %            | 0.8900                |
| Hybrid A | 71.12 %            | 0.8594                |
| RNN A    | 56.14 %            | 0.9745                |
| Hybrid B | 59.36 %            | 0.9675                |
| RNN B    | 53.35 %            | 1.0034                |
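
A useful reference point when reading the loss column: assuming "C.E." is the usual per-sample cross-entropy with natural logarithms, a model that always predicts one third for each of the three outcomes scores ln(3) ≈ 1.0986. The short sketch below computes that value; the function name and the example probabilities are illustrative.

```python
import math

def cross_entropy(probs, true_index):
    """Cross-entropy of one prediction: minus the log-probability of the true class."""
    return -math.log(probs[true_index])

# A model that knows nothing assigns equal probability to home, away and draw.
uniform = [1 / 3, 1 / 3, 1 / 3]
uniform_loss = cross_entropy(uniform, 0)  # equals ln(3)

# A confident, correct prediction yields a much lower loss.
confident_loss = cross_entropy([0.8, 0.1, 0.1], 0)

print(round(uniform_loss, 4), round(confident_loss, 4))
```

Under that assumption, every average test loss in the table sits below the uniform-guess value, RNN B only narrowly.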

## Conclusion

...

The models leave considerable room for improvement. Some suggestions:
- Deep embeddings could be used for players, coaches, teams and referees. This would let the network learn a more compact representation of those entities, hopefully with a meaningful similarity measure
- Additional RNN layers could also be introduced in the hybrid models
- LSTM or GRU units could be used instead of simple RNN units
- More data, in particular: data from Champions League or other important competitions with mid-week matches, etc.
- More features, in particular: more data about the performance of each player, counting rest days between matches, etc.
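
To illustrate the embedding suggestion, the sketch below shows that an embedding is a lookup table whose rows replace one-hot vectors: multiplying a one-hot vector by the table selects the same row as indexing it. The player names come from the sample rows shown earlier; the embedding size and the random values are placeholders, since in a real model the table would be learned during training.

```python
import numpy as np

rng = np.random.default_rng(0)

players = ["Amauri", "Antonio Cassano", "Daniele Balli"]
player_index = {name: i for i, name in enumerate(players)}

EMBED_DIM = 2  # hypothetical size, far smaller than the one-hot width
embedding = rng.normal(size=(len(players), EMBED_DIM))  # learned in practice

def embed(name):
    """Dense vector for a player: a row lookup in the embedding table."""
    return embedding[player_index[name]]

# The lookup is equivalent to multiplying the one-hot vector by the table.
one_hot = np.zeros(len(players))
one_hot[player_index["Amauri"]] = 1.0
same = np.allclose(one_hot @ embedding, embed("Amauri"))

print(embed("Amauri"), same)
```

With thousands of players, the one-hot input keeps its width but each player is represented downstream by only `EMBED_DIM` numbers.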
