Computational notebooks for this analysis can be found here and here
The challenge
Inspired by my performance in an office World Cup predictor, I decided to take that model and, hopefully, improve on it for Euro 2020.
I wasn't part of a similar workplace competition this time, so I decided to enter UEFA's online prediction competition.
The data
I wanted to make this a supervised learning model. To this end I looked at past (and present) competitions and metrics for the countries involved.
I gathered fixtures/results from FBRef, stadium info from Wikipedia, Elo data from eloratings.net, and population and GDP data from the Penn World Table (as now maintained by the University of Groningen). The past tournaments and Elo ratings went back to 2000. Penn World Table data was taken as of the end of the previous calendar year (e.g. 1999 figures for matches played in 2000).
The method
I focussed on what I felt were a handful of key indicators from previous work: Elo ratings and Home advantage (as used in WC 2018 predictions), and Experience, Population and GDP per capita (as used in Soccernomics by Kuper & Szymanski).
A random 20% of the past tournament matches were held out for testing. This gave us 140 training samples and 35 test samples.
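As a rough illustration of that hold-out split, here is a minimal standard-library sketch. The function name `split_matches`, the seed and the placeholder match IDs are my own assumptions for the example; the notebooks may well have used a library helper instead.

```python
import random

def split_matches(matches, test_fraction=0.2, seed=42):
    """Shuffle a copy of the matches and hold out a fraction for testing."""
    shuffled = list(matches)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# 175 past tournament matches (placeholder IDs standing in for real rows)
matches = list(range(175))
train, test = split_matches(matches)
print(len(train), len(test))  # 140 35
```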
I opted for 2 target variables: Goal difference and Goal total. Goal diff is a metric widely used for predicting results, and also capturing Goal total allows us (in theory) to convert the pair directly into predicted match scores for both teams.
Both targets were then fitted using a selection of 10 regression algorithms:
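The conversion from the two targets to a scoreline is just a pair of simultaneous equations: home - away = goal diff and home + away = goal total. The sketch below is one way it could be done; the function name `to_score` and the choice to round each side independently are assumptions, not necessarily what the notebooks do.

```python
def to_score(goal_diff, goal_total):
    """Convert predicted goal difference and goal total into (home, away) goals.

    Solves home - away = goal_diff and home + away = goal_total,
    then rounds each side to a whole, non-negative number of goals.
    """
    home = (goal_total + goal_diff) / 2
    away = (goal_total - goal_diff) / 2
    return max(0, round(home)), max(0, round(away))

print(to_score(0.8, 2.6))   # (2, 1)
print(to_score(-1.2, 2.0))  # (0, 2)
```

Rounding each side on its own can nudge the implied total or difference away from the raw predictions, so a stricter version might round the goal difference first and then fit the total around it.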
- Dummy (mean) - always predicts the mean of the training set
- Dummy (median) - always predicts the median of the training set
- Linear Regression
- Lasso
- Ridge
- Random Forest
- Gradient Boost
- Support Vector Machine (linear kernel)
- Support Vector Machine (rbf kernel)
- Custom Elo Regressor - approximates my World Cup 2018 model
(All but the EloRegressor had standardised scaling applied to avoid any effects of differently scaled features.)
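For readers unfamiliar with standardised scaling, the idea is to shift each feature to zero mean and unit variance, using statistics learned from the training set only. The sketch below uses the standard library; the helper names and the feature values are hypothetical.

```python
from statistics import mean, pstdev

def fit_scaler(column):
    """Learn the mean and population standard deviation from training values."""
    mu = mean(column)
    sigma = pstdev(column)
    if sigma == 0:
        sigma = 1.0  # constant feature: avoid dividing by zero
    return mu, sigma

def scale(column, mu, sigma):
    """Apply a previously fitted scaling to any column (train or test)."""
    return [(x - mu) / sigma for x in column]

# Hypothetical features on very different scales
elo_diff_train = [-150.0, 20.0, 310.0, -40.0]
gdp_diff_train = [-12000.0, 3500.0, 21000.0, -800.0]

elo_scaled = scale(elo_diff_train, *fit_scaler(elo_diff_train))
gdp_scaled = scale(gdp_diff_train, *fit_scaler(gdp_diff_train))
print(elo_scaled)
print(gdp_scaled)
```

After scaling, both columns sit on a comparable scale, which matters most for the penalised linear models (Lasso, Ridge) and the SVMs in the list above.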
From this, I selected the Elo model for Goal diff and Lasso for Goal total.
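The two dummy regressors give a floor that any real model should beat. As a sketch of how such a comparison could look, assuming mean absolute error on the held-out matches as the yardstick (the metric actually used for selection isn't stated here) and using made-up goal totals:

```python
from statistics import mean, median

def mean_absolute_error(actual, predicted):
    """Average absolute gap between actual and predicted values."""
    return mean(abs(a - p) for a, p in zip(actual, predicted))

# Hypothetical goal totals, for illustration only
train_totals = [2, 3, 1, 4, 2, 0, 3, 2]
test_totals = [1, 3, 2, 5]

# Each dummy model predicts one constant learned from the training set
baselines = {
    "Dummy (mean)": mean(train_totals),
    "Dummy (median)": median(train_totals),
}

scores = {
    name: mean_absolute_error(test_totals, [constant] * len(test_totals))
    for name, constant in baselines.items()
}
for name, mae in scores.items():
    print(f"{name}: MAE = {mae:.3f}")
```

A candidate model would be scored the same way on the same test matches and only kept if it comes in below these baselines.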
The results
The full dataset was assigned predictions, which could then be compared with actual results as they came in.
As part of this, "prediction points" were calculated based on the same criteria used in the World Cup comp. To recap, the original scoring system awarded, per game, 3 points for a correct score, otherwise 2 points for a correct goal difference, otherwise 1 point for a correct result.
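That scoring rule is easy to express in code. The sketch below assumes the tiers don't stack (only the highest one reached counts), which is consistent with the points-per-game figures in the summary table further down; the function name is my own.

```python
def prediction_points(pred, actual):
    """Score one prediction; pred and actual are (home_goals, away_goals) tuples."""
    if pred == actual:
        return 3  # correct score
    pred_diff = pred[0] - pred[1]
    actual_diff = actual[0] - actual[1]
    if pred_diff == actual_diff:
        return 2  # correct goal difference (includes any correctly called draw)

    def sign(d):
        return (d > 0) - (d < 0)

    if sign(pred_diff) == sign(actual_diff):
        return 1  # correct result (right winner)
    return 0

print(prediction_points((2, 1), (2, 1)))  # 3
print(prediction_points((2, 1), (1, 0)))  # 2
print(prediction_points((2, 1), (3, 0)))  # 1
print(prediction_points((2, 1), (0, 1)))  # 0
```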
The predictions for Euro 2020 group matches were then entered into UEFA's Tournament and Match predictors.
From this I could extrapolate the knockout results as follows...
After the Group stage the knockout predictions were updated to the following...
Finally, here is a summary of all the models' predictions vs actual results...
Tournament / dataset | Matches played | Points per game | % correct result | % correct goal diff | % correct score | Goals per game (predicted) | Goals per game (actual) | % games won (predicted) | % games won (actual) |
---|---|---|---|---|---|---|---|---|---|
2000 | 31 | 0.71 | 35% | 19% | 16% | 2.48 | 2.84 | 52% | 87% |
2004 | 31 | 0.74 | 42% | 23% | 10% | 2.61 | 2.74 | 74% | 74% |
2008 | 31 | 0.74 | 35% | 26% | 13% | 2.45 | 2.61 | 45% | 87% |
2012 | 31 | 0.84 | 42% | 26% | 16% | 2.45 | 2.58 | 45% | 84% |
2016 | 51 | 0.86 | 45% | 27% | 14% | 2.69 | 2.31 | 75% | 78% |
2021 | 51 | 1.04 | 55% | 31% | 18% | 2.57 | 2.78 | 67% | 76% |
Training | 140 | 0.81 | 42% | 24% | 14% | 2.55 | 2.64 | 61% | 81% |
Testing | 35 | 0.71 | 34% | 26% | 11% | 2.57 | 2.37 | 56% | 86% |
Live | 51 | 1.04 | 55% | 31% | 18% | 2.57 | 2.78 | 67% | 76% |
Overall | 226 | 0.85 | 44% | 26% | 15% | 2.56 | 2.63 | 62% | 81% |
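The columns in this table hang together in a way that can be checked. Assuming the percentages are nested (a correct score is also a correct goal difference, which is also a correct result) and that only the highest scoring tier counts, points per game can be rebuilt from the three "% correct" columns. A small sketch using a few rows from the table:

```python
def points_per_game(pct_result, pct_goal_diff, pct_score):
    """Rebuild points per game from nested accuracy rates (given as fractions)."""
    return (
        3 * pct_score                        # exact scores
        + 2 * (pct_goal_diff - pct_score)    # right goal diff, wrong score
        + 1 * (pct_result - pct_goal_diff)   # right result, wrong goal diff
    )

# (% correct result, % correct goal diff, % correct score, reported points per game)
rows = {
    "Testing": (0.34, 0.26, 0.11, 0.71),
    "Live": (0.55, 0.31, 0.18, 1.04),
    "Overall": (0.44, 0.26, 0.15, 0.85),
}
for name, (res, gd, score, reported) in rows.items():
    print(f"{name}: rebuilt {points_per_game(res, gd, score):.2f}, reported {reported:.2f}")
```

The expression simplifies to the sum of the three rates, so small mismatches in other rows are down to rounding in the published percentages.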
I was really pleased with how the model performed. In the UEFA match predictor I placed in the top 20% of all competitors with 145 pts (vs 255 for the winner). As I didn't make any first score predictions or use the 2x boosters available, I felt this was pretty reasonable. In the UEFA tournament predictor, I placed in the top 32% with 49 pts, and in the top 9% with 24 pts just for the knockout predictions. As with my World Cup model, it under-predicted the number of goals and wins. But within a much more robust and testable framework there's greater scope to refine this before the next tournament!