# World Cup Model

In this notebook you will find the proccess that I followed to create a prediction model for the 2022 World Cup. The data that I used can be found in [Kaggle](https://www.kaggle.com/datasets/martj42/international-football-results-from-1872-to-2017)

In [88]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn
from scipy.stats import poisson,skellam
import datetime as dt
import warnings
warnings.filterwarnings('ignore')

import statsmodels.api as sm
import statsmodels.formula.api as smf

## 1. Calculate Elo

To train the prediction model we will need the elo score of both teams. To calculate elo we will create some useful functions. If you want to understand how to use these go to [my post about elo in Medium](https://medium.com/mlearning-ai/how-to-calculate-elo-score-for-international-teams-using-python-66c136f01048). The data that we use can be found in [Kaggle](https://www.kaggle.com/datasets/martj42/international-football-results-from-1872-to-2017)

### 1.2 Elo functions

In [89]:
confederation_tournaments=['AFC Asian Cup','African Cup of Nations','UEFA Euro','Copa América','CONCACAF Championship','Oceania Nations Cup']

def k_value(tournament):
    k=5
    if tournament == 'Friendly':
        k=10
    elif tournament == 'FIFA World Cup qualification':
        k=25
    elif tournament in confederation_tournaments:
        k=40
    elif tournament == 'FIFA World Cup':
        k=55
    return k
    
def expected_result(loc,aw):
    dr=loc-aw
    we=(1/(10**(-dr/400)+1))
    return [np.round(we,3),1-np.round(we,3)]

def actual_result(loc,aw):
    if loc<aw:
        wa=1
        wl=0
    elif loc>aw:
        wa=0
        wl=1
    elif loc==aw:
        wa=0.5
        wl=0.5
    return [wl,wa]

def calculate_elo(elo_l,elo_v,local_goals,away_goals,tournament):
    
    k=k_value(tournament)
    wl,wv=actual_result(local_goals,away_goals)
    wel,wev=expected_result(elo_l,elo_v)

    elo_ln=elo_l+k*(wl-wel)
    elo_vn=elo_v+k*(wv-wev)

    return elo_ln,elo_vn

### 1.2 Calculate ELO

In [90]:
matches = pd.read_csv("data/results.csv").sort_values('date')

matches["Elo_h_before"]=np.nan
matches["Elo_a_before"]=np.nan

matches["Elo_h_after"]=np.nan
matches["Elo_a_after"]=np.nan

current_elo={}
for idx,row in matches.iterrows():
    
    local=row['home_team']
    away=row['away_team']
    local_goals=row['home_score']
    away_goals=row['away_score']
    tournament = row['tournament']
    
    # Si el equipo no se le ha calculado el ELO, se le inicializa en 1300
    if local not in current_elo.keys():
        current_elo[local]=1300
    
    if away not in current_elo.keys():
        current_elo[away]=1300
    
    elo_l=current_elo[local]
    elo_v=current_elo[away]
    elo_ln,elo_vn=calculate_elo(elo_l,elo_v,local_goals,away_goals,tournament)

    current_elo[local]=elo_ln
    current_elo[away]=elo_vn
    
    matches.loc[idx,'Elo_h_after']=elo_ln
    matches.loc[idx,'Elo_a_after']=elo_vn 
    matches.loc[idx,'Elo_h_before']=elo_l
    matches.loc[idx,'Elo_a_before']=elo_v

## 2. Build model

### 2.1 Filter dataframe 

In [91]:
# I only train the model with matches after 1990 and important tournaments. I removed the ooutliers, teams that scored more than 6 goals
# These are the tournaments that I will use to train the model
tournaments=['FIFA World Cup qualification', 'UEFA Euro qualification',
       'African Cup of Nations qualification', 'AFC Asian Cup qualification',
       'African Cup of Nations', 'CFU Caribbean Cup qualification',
       'FIFA World Cup',  'UEFA Nations League', 'Gold Cup',
       'Copa América',  'AFF Championship',
       'UEFA Euro', 'African Nations Championship', 'AFC Asian Cup',
       'CONCACAF Nations League','Friendly']
       
matches=matches[(pd.to_datetime(matches['date'])>dt.datetime(1989,12,31))&(matches['tournament'].isin(tournaments))]
matches =matches[['date','home_team','away_team','home_score','away_score','neutral','Elo_a_before','Elo_h_before']]
matches = matches[(matches['home_score']<6)&(matches['away_score']<6)].reset_index(drop=True)

# Created a Dataframe for each team instead of each match
home=matches[["date","home_team","home_score","neutral","Elo_a_before","Elo_h_before"]].rename(columns={'home_team':"Team","home_score":"Goals_for","Elo_a_before":"Elo rival","Elo_h_before":"Elo"})
away=matches[["date","away_team","away_score","Elo_a_before","Elo_h_before"]].rename(columns={'away_team':"Team","away_score":"Goals_for","Elo_a_before":"Elo","Elo_h_before":"Elo rival"}).assign(neutral=0).assign(local=0)

# We create a variable that tell us if there is home advantage for home team
home["local"] = home["neutral"].apply(lambda x: 1 if x==0 else 1)

# Created a new dataframe with all data 
df = pd.concat([home,away],ignore_index=True).sort_values("date").reset_index(drop=True)

# Created a moving average variable and shifted to have the previos data of each game
df["Moving_goals_for"]=df.groupby('Team')['Goals_for'].transform(lambda x: x.rolling(3).mean()).shift()
df["Moving_goals_for"]=df.groupby("Team")["Moving_goals_for"].shift()

# Create variable with elo difference 
df["Elo_difference"] = df["Elo"] - df["Elo rival"]

# Drop na and un used columns
df=df.dropna()
df=df.drop(columns=["Team","Elo","Elo rival","neutral"])

In [92]:
df["Elo_difference"]

32       -94.740
61       118.245
64        -8.345
66      -135.255
71      -115.255
          ...   
45255    153.780
45256     -6.435
45257   -260.525
45258   -121.560
45259    -64.815
Name: Elo_difference, Length: 44515, dtype: float64

### 2.2 Build model

In [93]:
model = smf.glm(formula="Goals_for ~ local + Moving_goals_for + Elo_difference", data=df, 
                        family=sm.families.Poisson()).fit()
model.summary()

0,1,2,3
Dep. Variable:,Goals_for,No. Observations:,44515.0
Model:,GLM,Df Residuals:,44511.0
Model Family:,Poisson,Df Model:,3.0
Link Function:,Log,Scale:,1.0
Method:,IRLS,Log-Likelihood:,-61872.0
Date:,"Tue, 08 Nov 2022",Deviance:,51095.0
Time:,15:39:56,Pearson chi2:,45200.0
No. Iterations:,5,Pseudo R-squ. (CS):,0.1766
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
Intercept,-0.0230,0.010,-2.350,0.019,-0.042,-0.004
local,0.3371,0.009,38.994,0.000,0.320,0.354
Moving_goals_for,-0.0035,0.006,-0.615,0.539,-0.015,0.008
Elo_difference,0.0022,2.6e-05,83.712,0.000,0.002,0.002


In [94]:
# Convert values to dataframe

argentina_elo=1733.03
brazil_elo=1763.62
argentina_data = pd.DataFrame(data={'local':0,'Moving_goals_for':2.0,'Elo_difference':argentina_elo-brazil_elo},index=[1])
brazil_data = pd.DataFrame(data={'local':0,'Moving_goals_for':1.0,'Elo_difference':brazil_elo-argentina_elo},index=[1])

# Get avg goals predicted by model
argentina_avg_goals = model.predict(argentina_data).values[0]
brazil_avg_goals = model.predict(brazil_data).values[0]

# Get probability of score 1 to 5 goals for each team
team_pred = [[poisson.pmf(i, team_avg) for i in range(0, 5)] for team_avg in [argentina_avg_goals, brazil_avg_goals]]

#Calculate joint probability

joint_proba=np.outer(np.array(team_pred[0]), np.array(team_pred[1]))

#Calculate probability for Home, Draw and Away
pd.Series([1-np.sum(np.triu(joint_proba, 1))-np.sum(np.diag(joint_proba)),np.sum(np.diag(joint_proba)),np.sum(np.triu(joint_proba, 1))],index=['Home','Draw','Away'])

Home    0.312985
Draw    0.312402
Away    0.374613
dtype: float64