# Euro 2020 (2021) Predictions

<!-- Written report for this analysis can be found [here](../reports/boro_01_market_value.md) -->

## 1. Business Understanding

* Determine Business Objectives
* Situation Assessment
* Determine Data Mining Goal
* Produce Project Plan

```
# 1. Predict results of every match at Euro 2020
# 2. Make predictions before each round of competition
# 3. Ideally, at each round, use the predictions to simulate remainder of competition
# 4. Check against other predictions and actual results
# 5. Write up process (report/blog)
```

## 2. Data Understanding

* Collect Initial Data
* Describe Data
* Explore Data
* Verify Data Quality

### EURO 2020 fixtures/results
* https://en.wikipedia.org/wiki/UEFA_Euro_2020
* https://www.whoscored.com/Regions/247/Tournaments/124/Seasons/7329/Stages/16297/Show/International-European-Championship-2020
* https://www.uefa.com/uefaeuro-2020/fixtures-results/#/md/33673
* https://fbref.com/en/comps/676/schedule/UEFA-Euro-Scores-and-Fixtures

### Historic results
* https://www.staff.city.ac.uk/r.j.gerrard/football/aifrform.html (1871-2001)
* https://www.kaggle.com/martj42/international-football-results-from-1872-to-2017/data (1872-)
* https://fbref.com/en/comps/676/history/European-Championship-Seasons (2000-)
* https://en.wikipedia.org/wiki/UEFA_Euro_2020_qualifying (qualifying)
* https://fbref.com/en/comps/678/Euro-Qualifying-Stats (qualifying)

### ELO ratings
* https://en.m.wikipedia.org/wiki/World_Football_Elo_Ratings
* https://www.eloratings.net/2021_European_Championship / https://www.eloratings.net/about

### Historic trends
* https://blog.annabet.com/soccer-goal-probabilities-poisson-vs-actual-distribution/
* https://en.wikipedia.org/wiki/Poisson_distribution

### GDP
* https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)
* https://en.wikipedia.org/wiki/List_of_countries_by_past_and_projected_GDP_(nominal)
* https://www.rug.nl/ggdc/productivity/pwt/

In [1]:
import pandas as pd
import os

In [16]:
# Load every Euro*.csv fixtures/results file and tag each row with its source file
comp_dir = "../data/raw/fbr/competition/"
comp_list = []
for file in os.listdir(comp_dir):
    if not (file.startswith("Euro") and file.endswith(".csv")):
        continue

    df = pd.read_csv(os.path.join(comp_dir, file))
    df["Filename"] = file
    comp_list.append(df)

comp = pd.concat(comp_list)
# Keep only rows that have a Round value
comp.dropna(subset=["Round"], inplace=True)
comp.reset_index(drop=True, inplace=True)
comp.columns = ['Round', 'Wk', 'Day', 'Date', 'Time', 'Team_1', 'Score', 'Team_2',
                'Attendance', 'Venue', 'Referee', 'Match Report', 'Notes', 'Filename']
comp["Year"] = comp.Date.str[:4]
# Team_1 ends with a two-letter country code (e.g. "es") and Team_2 starts with one
comp["Team_abbrev_1"] = comp["Team_1"].str[-2:]
comp["Team_1"] = comp["Team_1"].str[:-3]
comp["Team_abbrev_2"] = comp["Team_2"].str[:2]
comp["Team_2"] = comp["Team_2"].str[3:]
# Scores use an en dash (e.g. "4–0"); fixtures without a score keep NaN goals
comp["Goals_1"] = comp.Score.str.extract(pat="([0-9]{1,2})–[0-9]{1,2}", expand=False)
comp["Goals_2"] = comp.Score.str.extract(pat="[0-9]{1,2}–([0-9]{1,2})", expand=False)
for i in range(1, 3):
    comp["Goals_"+str(i)] = pd.to_numeric(comp["Goals_"+str(i)], errors='coerce')
comp["Goal_diff"] = comp.Goals_1 - comp.Goals_2
comp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 211 entries, 0 to 210
Data columns (total 20 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Round          211 non-null    object 
 1   Wk             168 non-null    float64
 2   Day            211 non-null    object 
 3   Date           211 non-null    object 
 4   Time           211 non-null    object 
 5   Team_1         211 non-null    object 
 6   Score          175 non-null    object 
 7   Team_2         211 non-null    object 
 8   Attendance     175 non-null    float64
 9   Venue          211 non-null    object 
 10  Referee        175 non-null    object 
 11  Match Report   211 non-null    object 
 12  Notes          16 non-null     object 
 13  Filename       211 non-null    object 
 14  Year           211 non-null    object 
 15  Team_abbrev_1  211 non-null    object 
 16  Team_abbrev_2  211 non-null    object 
 17  Goals_1        175 non-null    float64
 18  Goals_2   

In [17]:
comp.loc[:, ["Date", "Year", "Team_1", "Team_2", "Goals_1", "Goals_2", "Goal_diff", "Venue"]].sample(10, random_state=42)

| | Date | Year | Team_1 | Team_2 | Goals_1 | Goals_2 | Goal_diff | Venue |
|---|---|---|---|---|---|---|---|---|
| 30 | 2000-07-02 | 2000 | France | Italy | 2.0 | 1.0 | 1.0 | Stadion Feijenoord |
| 173 | 2016-07-07 | 2016 | Germany | France | 0.0 | 2.0 | -2.0 | Orange Vélodrome |
| 140 | 2016-06-16 | 2016 | Ukraine | Northern Ireland | 0.0 | 2.0 | -2.0 | Groupama Stadium |
| 75 | 2008-06-13 | 2008 | Netherlands | France | 4.0 | 1.0 | 3.0 | Stade de Suisse Wankdorf Bern |
| 60 | 2004-07-01 | 2004 | Greece | Czech Republic | 1.0 | 0.0 | 1.0 | Estádio Do Dragão |
| 208 | 2021-06-23 | 2021 | Slovakia | Spain | NaN | NaN | NaN | Estadio La Cartuja de Sevilla |
| 45 | 2004-06-19 | 2004 | Latvia | Germany | 0.0 | 0.0 | 0.0 | Estádio do Bessa Século XXI |
| 183 | 2021-06-14 | 2021 | Poland | Slovakia | NaN | NaN | NaN | Gazprom Arena |
| 9 | 2000-06-15 | 2000 | Sweden | Turkey | 0.0 | 0.0 | 0.0 | Philips Stadion |
| 100 | 2012-06-11 | 2012 | Ukraine | Sweden | 2.0 | 1.0 | 1.0 | NSK Olimpijs'kyj |


In [18]:
venue = pd.read_csv(os.path.join("../data/raw/wkp/wkp_std/", "wkp_std_nat.csv"), encoding="latin9", sep=",")
venue.columns = ["Venue_country", "Venue_city", "Venue", "Venue_URL"]
venue.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 64 entries, 0 to 63
Data columns (total 4 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Venue_country  64 non-null     object
 1   Venue_city     64 non-null     object
 2   Venue          64 non-null     object
 3   Venue_URL      64 non-null     object
dtypes: object(4)
memory usage: 2.1+ KB


In [40]:
match = pd.merge(comp, venue, on="Venue", how="left")

## workaround for venues that aren't mapping
match.loc[match.Venue == 'Stadion Energa Gdańsk', "Venue_country"] = "Poland"
match.loc[match.Venue == 'Bakı Olimpiya Stadionu', "Venue_country"] = "Azerbaijan"
match.loc[match.Venue == 'Arena Naţională', "Venue_country"] = "Romania"

# Flag a team as playing at home when its name matches the venue's country
for i in range(1, 3):
    match["Home_"+str(i)] = 0
    match.loc[match["Team_"+str(i)] == match.Venue_country, "Home_"+str(i)] = 1

match.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 211 entries, 0 to 210
Data columns (total 25 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Round          211 non-null    object 
 1   Wk             168 non-null    float64
 2   Day            211 non-null    object 
 3   Date           211 non-null    object 
 4   Time           211 non-null    object 
 5   Team_1         211 non-null    object 
 6   Score          175 non-null    object 
 7   Team_2         211 non-null    object 
 8   Attendance     175 non-null    float64
 9   Venue          211 non-null    object 
 10  Referee        175 non-null    object 
 11  Match Report   211 non-null    object 
 12  Notes          16 non-null     object 
 13  Filename       211 non-null    object 
 14  Year           211 non-null    object 
 15  Team_abbrev_1  211 non-null    object 
 16  Team_abbrev_2  211 non-null    object 
 17  Goals_1        175 non-null    float64
 18  Goals_2   

In [41]:
## checks on gaps in venue matching

# match.groupby("Filename").Venue_city.count() / match.groupby("Filename").Venue.count()

# match[pd.isnull(match.Venue_city)].Venue.unique()

# country_mask = ((~match.Venue_country.isin(match.Team_1.values)) & (~match.Venue_country.isin(match.Team_2.values)))
# match.loc[country_mask, "Venue_country"].unique()

match.fillna("").groupby(["Filename", "Venue", "Venue_city", "Venue_country"]).Date.count()

Filename       Venue                              Venue_city         Venue_country
Euro_2000.csv  GelreDome                          Arnhem             Netherlands      3
               Jan Breydelstadion                 Bruges             Belgium          4
               Johan Cruyff ArenA                 Amsterdam          Netherlands      5
               Philips Stadion                    Eindhoven          Netherlands      3
               Stade Maurice Dufrasne             Liège              Belgium          3
               Stade Roi Baudouin                 Brussels           Belgium          5
               Stade du Pays de Charleroi         Charleroi          Belgium          3
               Stadion Feijenoord                 Rotterdam          Netherlands      5
Euro_2004.csv  Estádio Do Algarve                 Faro/Loulé         Portugal         3
               Estádio Do Dragão                  Porto              Portugal         5
               Estádio Dom Afonso Hen

In [46]:
match.sample(5)

| | Round | Wk | Day | Date | Time | Team_1 | Score | Team_2 | Attendance | Venue | ... | Team_abbrev_1 | Team_abbrev_2 | Goals_1 | Goals_2 | Goal_diff | Venue_country | Venue_city | Venue_URL | Home_1 | Home_2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 123 | Final | NaN | Sun | 2012-07-01 | 21:45 (19:45) | Spain | 4–0 | Italy | 63170.0 | NSK Olimpijs'kyj | ... | es | it | 4.0 | 0.0 | 4.0 | Ukraine | Kyiv | https://en.wikipedia.org/wiki/UEFA_Euro_2012#V... | 0 | 0 |
| 72 | Group stage | 2.0 | Thu | 2008-06-12 | 18:00 (17:00) | Croatia | 2–1 | Germany | 30461.0 | Wörthersee Stadion | ... | hr | de | 2.0 | 1.0 | 1.0 | Austria | Klagenfurt | https://en.wikipedia.org/wiki/UEFA_Euro_2008#V... | 0 | 0 |
| 193 | Group stage | 2.0 | Fri | 2021-06-18 | 14:00 | Sweden | NaN | Slovakia | NaN | Gazprom Arena | ... | se | sk | NaN | NaN | NaN | Russia | Saint Petersburg | https://en.wikipedia.org/wiki/UEFA_Euro_2020#V... | 0 | 0 |
| 73 | Group stage | 2.0 | Thu | 2008-06-12 | 20:45 (19:45) | Austria | 1–1 | Poland | 51428.0 | Ernst-Happel-Stadion | ... | at | pl | 1.0 | 1.0 | 0.0 | Austria | Vienna | https://en.wikipedia.org/wiki/UEFA_Euro_2008#V... | 1 | 0 |
| 25 | Quarter-finals | NaN | Sat | 2000-06-24 | 20:45 (19:45) | Italy | 2–0 | Romania | 41000.0 | Stade Roi Baudouin | ... | it | ro | 2.0 | 0.0 | 2.0 | Belgium | Brussels | https://en.wikipedia.org/wiki/UEFA_Euro_2000#V... | 0 | 0 |


In [9]:
# from selenium import webdriver
# from selenium.webdriver.firefox.options import Options

In [10]:
# profile = webdriver.FirefoxProfile()
# options = Options()

# browser = webdriver.Firefox(#firefox_profile=profile,
#                                options=options,
#                                executable_path=r'C:\Users\adeacon\Documents\geckodriver\geckodriver.exe')

## 3. Data Preparation

* Select Data
* Clean Data
* Construct Data
* Integrate Data
* Format Data

## 4. Modelling

* Select Modelling Technique
* Generate Test Design
* Build Model
* Assess Model

### Updated WC model
* https://github.com/deacona/the-ball-is-round/blob/master/reports/intl_01_world_cup_2018.md
* https://github.com/deacona/the-ball-is-round/blob/master/notebooks/intl_01_world_cup_2018.ipynb

### "Soccernomics"
* goal diff = (0.6666 * home adv) + (0.5 * relative experience) + (0.1 * relative population) + (0.1 * relative gdp/head) + ...
* e.g. England vs Germany at Euro 96
    * Home = England = 1
    * Exp = 84k v 84k = 0
    * Pop = 57 v 81 = -0.4
    * GDP/h = 1627492 / 57 v 2633828 / 81 = -0.1
    * GD = (0.6666 * 1) + (0.5 * 0) + (0.1 * -0.4) + (0.1 * -0.1) = 0.6
* http://www.soccernomics-agency.com/wordpress/wp-content/uploads/2017/10/soccer-convergence-1.pdf
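
The England vs Germany example above can be reproduced in a few lines. This is a minimal sketch, assuming each "relative" term is the natural log of team 1's value divided by team 2's; that assumption reproduces the rounded figures -0.4 and -0.1 above, but it is not stated in these notes. The function name `soccernomics_goal_diff` and its parameters are hypothetical.

```python
import math


def soccernomics_goal_diff(home, exp_1, exp_2, pop_1, pop_2, gdp_1, gdp_2):
    """Expected goal difference for team 1, using the four weights listed above."""
    # Assumption: "relative" = natural log of (team 1 value / team 2 value)
    rel_exp = math.log(exp_1 / exp_2)
    rel_pop = math.log(pop_1 / pop_2)
    rel_gdp_head = math.log((gdp_1 / pop_1) / (gdp_2 / pop_2))
    return (0.6666 * home) + (0.5 * rel_exp) + (0.1 * rel_pop) + (0.1 * rel_gdp_head)


# England (team 1, at home) vs Germany at Euro 96, with the inputs from the example
gd = soccernomics_goal_diff(
    home=1,
    exp_1=84, exp_2=84,
    pop_1=57, pop_2=81,
    gdp_1=1627492, gdp_2=2633828,
)
print(round(gd, 1))
```

With these inputs the population and GDP terms only shave a few hundredths off the home advantage, which is why the example rounds to 0.6.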

### Dixon-Coles (and other probability models)
* https://dashee87.github.io/football/python/predicting-football-results-with-statistical-modelling-dixon-coles-and-time-weighting/
* http://www.statsandsnakeoil.com/2018/06/05/modelling-the-world-cup-with-regista/
* http://opisthokonta.net/?cat=48
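
The models linked above start from treating each team's goals as Poisson distributed. The sketch below uses two independent Poisson distributions only; it is not the full Dixon-Coles model, which adds a low-score correction and time weighting. The expected-goals inputs (1.5 and 1.1) are made-up illustrative values and the function names are hypothetical.

```python
import math


def poisson_pmf(k, lam):
    """Probability of exactly k goals when the expected number of goals is lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def outcome_probabilities(lam_1, lam_2, max_goals=10):
    """Win/draw/loss probabilities for team 1 under independent Poisson scoring."""
    win_1 = draw = win_2 = 0.0
    for g1 in range(max_goals + 1):
        for g2 in range(max_goals + 1):
            p = poisson_pmf(g1, lam_1) * poisson_pmf(g2, lam_2)
            if g1 > g2:
                win_1 += p
            elif g1 == g2:
                draw += p
            else:
                win_2 += p
    return win_1, draw, win_2


# Illustrative expected goals only, not fitted to any data
win_1, draw, win_2 = outcome_probabilities(1.5, 1.1)
print(f"Team 1 win: {win_1:.3f}, draw: {draw:.3f}, team 2 win: {win_2:.3f}")
```

Scorelines above `max_goals` are ignored, so the three probabilities sum to very slightly less than 1.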

## 5. Evaluation

* Evaluate Results
* Review Process
* Determine Next Steps

```
# % correct score, goal diff, result, points
# vs historic trends (goals, W/D/L)
```
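
The first three measures above could be computed along these lines. This is a sketch under stated assumptions: predictions and results are held as `(goals_1, goals_2)` tuples, the sample values are made up, and the function names are hypothetical. The "points" measure is left out because these notes do not define a scoring scheme.

```python
def result(goals_1, goals_2):
    """Return 'W', 'D' or 'L' from team 1's point of view."""
    if goals_1 > goals_2:
        return "W"
    if goals_1 < goals_2:
        return "L"
    return "D"


def evaluate(predicted, actual):
    """Percentage of matches with the correct score, goal difference and result."""
    pairs = list(zip(predicted, actual))
    n = len(pairs)
    correct_score = sum(p == a for p, a in pairs)
    correct_diff = sum(p[0] - p[1] == a[0] - a[1] for p, a in pairs)
    correct_result = sum(result(*p) == result(*a) for p, a in pairs)
    return {
        "score_pct": 100 * correct_score / n,
        "goal_diff_pct": 100 * correct_diff / n,
        "result_pct": 100 * correct_result / n,
    }


# Made-up predictions and results, for illustration only
predicted = [(2, 1), (1, 1), (0, 2), (1, 0)]
actual = [(2, 1), (0, 0), (1, 2), (0, 1)]
metrics = evaluate(predicted, actual)
print(metrics)
```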

## 6. Deployment

* Plan Deployment
* Plan Monitoring and Maintenance
* Produce Final Report
* Review Project