# Euro 2020 (2021) Predictions

<!-- Written report for this analysis can be found [here](../reports/boro_01_market_value.md) -->

## 1. Business Understanding

* Determine Business Objectives
* Situation Assessment
* Determine Data Mining Goal
* Produce Project Plan

```
# 1. Predict results of every match at Euro 2020
# 2. Make predictions before each round of competition
# 3. Ideally, at each round, use the predictions to simulate the remainder of the competition
# 4. Check against other predictions and actual results
# 5. Write up process (report/blog)
```

## 2. Data Understanding

* Collect Initial Data
* Describe Data
* Explore Data
* Verify Data Quality

### EURO 2020 fixtures/results
* https://en.wikipedia.org/wiki/UEFA_Euro_2020
* https://www.whoscored.com/Regions/247/Tournaments/124/Seasons/7329/Stages/16297/Show/International-European-Championship-2020
* https://www.uefa.com/uefaeuro-2020/fixtures-results/#/md/33673
* https://fbref.com/en/comps/676/schedule/UEFA-Euro-Scores-and-Fixtures

### Historic results
* https://www.staff.city.ac.uk/r.j.gerrard/football/aifrform.html (1871-2001)
* https://www.kaggle.com/martj42/international-football-results-from-1872-to-2017/data (1872-)
* https://fbref.com/en/comps/676/history/European-Championship-Seasons (2000-)
* https://en.wikipedia.org/wiki/UEFA_Euro_2020_qualifying (qualifying)
* https://fbref.com/en/comps/678/Euro-Qualifying-Stats (qualifying)

### ELO ratings
* https://en.m.wikipedia.org/wiki/World_Football_Elo_Ratings
* https://www.eloratings.net/2021_European_Championship
* http://eloratings.net/2016_European_Championship_start
* https://www.eloratings.net/about
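
The expected result behind these ratings can be sketched in a few lines. This assumes the win-expectancy formula as described on the eloratings.net about page, `We = 1 / (10^(-dr/400) + 1)`, with 100 points added to the rating difference for a team playing at home. The function name, parameter names and the ratings used below are illustrative only and do not come from the data in this notebook.

```python
def elo_expected(rating_1, rating_2, home_advantage=0):
    """Expected result (between 0 and 1) for team 1 against team 2.

    home_advantage is added to team 1's rating; 100 is the value assumed
    here for a team playing at home, 0 for a neutral venue.
    """
    dr = rating_1 - rating_2 + home_advantage
    return 1 / (10 ** (-dr / 400) + 1)


# Made-up ratings, purely for illustration
print(elo_expected(1800, 1800))                      # equal teams, neutral venue
print(elo_expected(1900, 1800))                      # stronger team, neutral venue
print(elo_expected(1800, 1800, home_advantage=100))  # equal teams, team 1 at home
```

The expected result is not a win probability on its own, since it folds draws in as half a win; turning it into W/D/L probabilities needs a further modelling step.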

### Historic trends
* https://blog.annabet.com/soccer-goal-probabilities-poisson-vs-actual-distribution/
* https://en.wikipedia.org/wiki/Poisson_distribution
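
As a rough illustration of the Poisson idea in the links above, the sketch below turns two expected-goals values into win/draw/loss probabilities. It assumes the two teams' goal counts are independent Poisson variables, which is a simplification, and the expected-goals values are invented for the example rather than estimated from the match data.

```python
import math


def poisson_pmf(k, lam):
    """Probability of exactly k goals when the expected number is lam."""
    return lam ** k * math.exp(-lam) / math.factorial(k)


def outcome_probabilities(lam_1, lam_2, max_goals=10):
    """Win/draw/loss probabilities for team 1, assuming independent Poisson goals.

    Scorelines above max_goals per team are ignored, so the three
    probabilities sum to very slightly less than 1.
    """
    win = draw = loss = 0.0
    for goals_1 in range(max_goals + 1):
        for goals_2 in range(max_goals + 1):
            p = poisson_pmf(goals_1, lam_1) * poisson_pmf(goals_2, lam_2)
            if goals_1 > goals_2:
                win += p
            elif goals_1 == goals_2:
                draw += p
            else:
                loss += p
    return win, draw, loss


# Hypothetical expected goals for the two teams
win, draw, loss = outcome_probabilities(1.5, 1.1)
print(f"win={win:.3f} draw={draw:.3f} loss={loss:.3f}")
```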

### GDP
* https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)
* https://en.wikipedia.org/wiki/List_of_countries_by_past_and_projected_GDP_(nominal)
* https://www.rug.nl/ggdc/productivity/pwt/

In [1]:
import pandas as pd
import os

import src.utilities as utilities

In [2]:
match = utilities.get_master("nations_matches")
match.info()

2021-05-10 21:18:42,802 - INFO - Building master filepath for nations_matches
2021-05-10 21:18:42,806 - INFO - Fetching C:\Users\adeacon\Documents\GitHub\the-ball-is-round\data\processed\ftb_nations_matches.txt
2021-05-10 21:18:42,807 - INFO - Building master filepath for nations_matches


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 211 entries, 0 to 210
Data columns (total 25 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Unnamed: 0     211 non-null    int64  
 1   Round          211 non-null    object 
 2   Wk             168 non-null    float64
 3   Day            211 non-null    object 
 4   Date           211 non-null    object 
 5   Time           211 non-null    object 
 6   Team_1         211 non-null    object 
 7   Score          175 non-null    object 
 8   Team_2         211 non-null    object 
 9   Attendance     175 non-null    float64
 10  Venue          211 non-null    object 
 11  Referee        175 non-null    object 
 12  Match Report   211 non-null    object 
 13  Notes          16 non-null     object 
 14  Year           211 non-null    int64  
 15  Team_abbrev_1  211 non-null    object 
 16  Team_abbrev_2  211 non-null    object 
 17  Goals_1        175 non-null    float64
 18  Goals_2   

In [3]:
## checks on gaps in venue matching

# match.groupby("Filename").Venue_city.count() / match.groupby("Filename").Venue.count()

# match[pd.isnull(match.Venue_city)].Venue.unique()

# country_mask = ((~match.Venue_country.isin(match.Team_1.values)) & (~match.Venue_country.isin(match.Team_2.values)))
# match.loc[country_mask, "Venue_country"].unique()

match.fillna("").groupby(["Year", "Venue", "Venue_city", "Venue_country"]).Date.count()

Year  Venue                               Venue_city         Venue_country
2000  GelreDome                           Arnhem             Netherlands      3
      Jan Breydelstadion                  Bruges             Belgium          4
      Johan Cruyff ArenA                  Amsterdam          Netherlands      5
      Philips Stadion                     Eindhoven          Netherlands      3
      Stade Maurice Dufrasne              Liège              Belgium          3
      Stade Roi Baudouin                  Brussels           Belgium          5
      Stade du Pays de Charleroi          Charleroi          Belgium          3
      Stadion Feijenoord                  Rotterdam          Netherlands      5
2004  Estádio Do Algarve                                                      3
      Estádio Do Dragão                                                       5
      Estádio Dom Afonso Henriques                                            2
      Estádio Dr. Magalhães Pessoa          

In [5]:
match.describe(include="all").T

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
Unnamed: 0,211,,,,105.0,61.0546,0.0,52.5,105.0,157.5,210.0
Round,211,5.0,Group stage,168.0,,,,,,,
Wk,168,,,,2.0,0.818938,1.0,1.0,2.0,3.0,3.0
Day,211,7.0,Sun,41.0,,,,,,,
Date,211,110.0,2016-06-22,4.0,,,,,,,
Time,211,16.0,20:45 (19:45),52.0,,,,,,,
Team_1,211,35.0,Portugal,16.0,,,,,,,
Score,175,34.0,0–1,23.0,,,,,,,
Team_2,211,35.0,France,14.0,,,,,,,
Attendance,175,,,,41575.8,13137.2,16002.0,30678.5,39493.0,50000.0,76833.0


In [4]:
summary = utilities.get_master("nations_summaries")
summary.info()

2021-05-10 21:21:17,675 - INFO - Building master filepath for nations_summaries
2021-05-10 21:21:17,676 - INFO - Fetching C:\Users\adeacon\Documents\GitHub\the-ball-is-round\data\processed\ftb_nations_summaries.txt
2021-05-10 21:21:17,677 - INFO - Building master filepath for nations_summaries


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 112 entries, 0 to 111
Data columns (total 24 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Unnamed: 0            112 non-null    int64  
 1   Rank Local            112 non-null    int64  
 2   Rank Global           112 non-null    int64  
 3   Team                  112 non-null    object 
 4   Rating                112 non-null    int64  
 5   Average Rank          112 non-null    int64  
 6   Average Rating        112 non-null    int64  
 7   1 Year Change Rank    112 non-null    int64  
 8   1 Year Change Rating  112 non-null    int64  
 9   Matches Total         112 non-null    int64  
 10  Matches Home          112 non-null    int64  
 11  Matches Away          112 non-null    int64  
 12  Matches Neutral       112 non-null    int64  
 13  Matches Wins          112 non-null    int64  
 14  Matches Losses        112 non-null    int64  
 15  Matches Draws         1

In [7]:
summary.describe(include="all").T

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
Unnamed: 0,112,,,,55.5,32.4756,0.0,27.75,55.5,83.25,111.0
Rank Local,112,,,,10.1964,6.06398,1.0,5.0,10.0,14.25,24.0
Rank Global,112,,,,18.6964,14.7769,1.0,8.0,15.0,26.0,74.0
Team,112,35.0,Sweden,6.0,,,,,,,
Rating,112,,,,1856.0,122.945,1524.0,1771.25,1853.0,1948.25,2127.0
Average Rank,112,,,,22.5268,16.0921,4.0,11.0,19.0,27.25,83.0
Average Rating,112,,,,1769.49,128.753,1390.0,1704.75,1785.5,1875.25,1985.0
1 Year Change Rank,112,,,,1.41071,5.92073,-15.0,-2.0,1.0,4.0,23.0
1 Year Change Rating,112,,,,7.26786,42.6667,-92.0,-24.25,7.5,35.25,127.0
Matches Total,112,,,,638.009,214.417,63.0,537.25,659.5,787.0,1073.0


## 3. Data Preparation

* Select Data
* Clean Data
* Construct Data
* Integrate Data
* Format Data

## 4. Modelling

* Select Modelling Technique
* Generate Test Design
* Build Model
* Assess Model

### Updated WC model
* https://github.com/deacona/the-ball-is-round/blob/master/reports/intl_01_world_cup_2018.md
* https://github.com/deacona/the-ball-is-round/blob/master/notebooks/intl_01_world_cup_2018.ipynb

### "Soccernomics"
* goal diff = (0.6666 * home adv) + (0.5 * relative experience) + (0.1 * relative population) + (0.1 * relative gdp/head) + ...
* e.g. England vs Germany at Euro 96
    * Home = England = 1
    * Exp = 84k v 84k = 0
    * Pop = 57 v 81 = -0.4
    * GDP/h = 1627492 / 57 v 2633828 / 81 = -0.1
    * GD = (0.6666 * 1) + (0.5 * 0) + (0.1 * -0.4) + (0.1 * -0.1) = 0.6
* http://www.soccernomics-agency.com/wordpress/wp-content/uploads/2017/10/soccer-convergence-1.pdf
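
The England vs Germany arithmetic above can be checked with a small function. This is only a sketch of the four terms listed in the notes (the trailing `+ ...` in the formula is left out), and it takes the relative values as given rather than deriving them, since the notes do not say exactly how "relative" is computed. The function and argument names are illustrative.

```python
def soccernomics_goal_diff(home_adv, rel_experience, rel_population, rel_gdp_per_head):
    """Expected goal difference from the four terms listed in the notes."""
    return (
        0.6666 * home_adv
        + 0.5 * rel_experience
        + 0.1 * rel_population
        + 0.1 * rel_gdp_per_head
    )


# England vs Germany at Euro 96, using the relative values from the notes
gd = soccernomics_goal_diff(
    home_adv=1,
    rel_experience=0,
    rel_population=-0.4,
    rel_gdp_per_head=-0.1,
)
print(round(gd, 1))  # 0.6
```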

### Dixon-Coles (and other probability models)
* https://dashee87.github.io/football/python/predicting-football-results-with-statistical-modelling-dixon-coles-and-time-weighting/
* http://www.statsandsnakeoil.com/2018/06/05/modelling-the-world-cup-with-regista/
* http://opisthokonta.net/?cat=48

## 5. Evaluation

* Evaluate Results
* Review Process
* Determine Next Steps

```
# % correct score, goal diff, result, points
# vs historic trends (goals, W/D/L)
```
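
A minimal sketch of the first three measures (correct score, correct goal difference, correct result) is given below. It assumes predictions and results are held as `(goals_1, goals_2)` tuples in matching order; the function names and the sample scores are hypothetical. The "points" measure is left out because the scoring scheme is not defined in these notes.

```python
def result(goals_1, goals_2):
    """Return 'W', 'D' or 'L' from team 1's point of view."""
    if goals_1 > goals_2:
        return "W"
    if goals_1 < goals_2:
        return "L"
    return "D"


def evaluate(predicted, actual):
    """Share of matches with the correct score, goal difference and result."""
    pairs = list(zip(predicted, actual))
    n = len(pairs)
    return {
        "score": sum(p == a for p, a in pairs) / n,
        "goal_diff": sum(p[0] - p[1] == a[0] - a[1] for p, a in pairs) / n,
        "result": sum(result(*p) == result(*a) for p, a in pairs) / n,
    }


# Hypothetical predictions and results, for illustration only
predicted = [(1, 0), (2, 1), (1, 1), (0, 2)]
actual = [(1, 0), (1, 0), (0, 0), (1, 0)]
print(evaluate(predicted, actual))
# {'score': 0.25, 'goal_diff': 0.75, 'result': 0.75}
```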

## 6. Deployment

* Plan Deployment
* Plan Monitoring and Maintenance
* Produce Final Report
* Review Project