# World Cup 2022 Prediction

## Introduction

With the World Cup ongoing at the time of writing, it is interesting to try to predict the winner. In this notebook, we use past international match results to train and test a model that predicts the result of each match. The code below fits a multilayer perceptron classifier (scikit-learn's `MLPClassifier`) on the home team's result: win, draw, or loss. The dataset can be found in Past_International_Matches.csv, which we derived from Kaggle ([FIFA World Cup 2022 ⚽️🏆](https://www.kaggle.com/datasets/brenda89/fifa-world-cup-2022)).

We then use this model to simulate the whole competition. This means predicting each game in the group stage to find out who qualifies for the knockout stage, and then following the predefined tournament bracket for this World Cup. Iterating many such simulations gives an estimate of each team's chance of winning this World Cup.

## Import and Set Up

In this section we import the necessary libraries, read the datasets into pandas DataFrames, and derive the columns the model needs.

In [17]:
from mpl_toolkits import mplot3d
import matplotlib.pyplot as plt

In [18]:
import numpy as np
import pandas as pd
import math
from scipy import optimize
from itertools import combinations
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

import sklearn
from sklearn import linear_model
from sklearn import ensemble
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, roc_curve, roc_auc_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LinearRegression


In [116]:
groups = pd.read_csv('groupstage.csv')
data = pd.read_csv('Past_International_Matches_Points.csv')
# groups.set_index(['Country_Name'])


In [121]:
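# Difference between the home and away `tvalue` columns (not among the features used below).
data['tvalue_difference'] = data['tvalue_home'] - data['tvalue_away']
# Encode the home team's result as a class label: 3 = win, 2 = draw, 1 = loss.
data['result'] = (data['home_team_result'] == 'Win') * 3 + (data['home_team_result'] == 'Draw') * 2 + (data['home_team_result'] == 'Lose') * 1

# 1 if the match was played at a neutral venue, else 0. Comparing upper-cased strings
# works whether pandas parsed the column as booleans or left it as text.
data['neutral'] = data['neutral_location'].astype(str).str.upper().eq('TRUE').astype(int)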

## Training 

In [122]:
X = data.loc[:, ['mean_coeff', 'home_team_fifa_rank', 'away_team_fifa_rank', 'importance', 'neutral']]
y = data.loc[:,['result']]
print(X)
print(y)

       mean_coeff  home_team_fifa_rank  away_team_fifa_rank  importance  \
0           1.000                   59                   22         2.5   
1           0.925                    8                   14         1.0   
2           1.000                   35                   94         2.5   
3           0.850                   65                   86         1.0   
4           1.000                   67                    5         2.5   
...           ...                  ...                  ...         ...   
23916       0.990                  180                  153         2.5   
23917       0.990                  192                  135         2.5   
23918       0.925                   28                   60         1.0   
23919       0.850                   23                   35         1.0   
23920       0.850                   29                   32         1.0   

       neutral  
0            0  
1            0  
2            0  
3            0  
4            0

In [126]:
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, train_size = 0.7, shuffle = True)

# logreg = LogisticRegression()
# logreg.fit(x_train, y_train)
# log_pred = logreg.predict(x_test)
# acc_log = round(logreg.score(x_test, y_test) * 100, 2)
# acc_log

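mlp = MLPClassifier()
# y_train is a one-column DataFrame; flatten it to the 1-D array scikit-learn expects.
mlp.fit(x_train, y_train.values.ravel())

predict_test = mlp.predict(x_test)
predict_train = mlp.predict(x_train)

# Number of test matches predicted as a draw (class 2).
print(sum(predict_test == 2))

# Test-set accuracy as a percentage.
acc_mlp = round(mlp.score(x_test, y_test) * 100, 2)
acc_mlp
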


112


56.63
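
An accuracy figure is easier to judge next to a naive baseline. The sketch below is not part of the original notebook: it computes the accuracy of always predicting the most common label. The function name `majority_baseline_accuracy` and the sample labels are hypothetical; in the notebook it could presumably be applied to `y_test['result']`.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy (in percent) of always predicting the most common label."""
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return round(100 * most_common_count / len(labels), 2)

# Hypothetical labels using the notebook's encoding:
# 3 = home win, 2 = draw, 1 = home loss.
example_labels = [3, 3, 3, 3, 3, 2, 2, 1, 1, 1]
print(majority_baseline_accuracy(example_labels))
```

If the classifier's accuracy is close to this baseline, it has learned little beyond "the home side usually wins".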

## Group Stage

In [131]:
groups_use = groups.set_index(['Country_Name'])
groups_use['points'] = 0
for group in set(groups['Group']):
    print('====Group {}===='.format(group))
    # Every pair of teams in the group meets once (round robin).
    for home, away in combinations(groups.query('Group == "{}"'.format(group)).values, 2):
        # home and away are rows of `groups`; element 0 is used as the country name.
        home = home[0]
        away = away[0]
        print("{} vs {} : ".format(home, away))

        # Feature row in the order of X.columns; importance is set to 4 and neutral to 1.
        row = pd.DataFrame(np.array([[np.nan, np.nan, np.nan, 4, 1]]), columns=X.columns)

        home_coeff = groups_use.loc[home, 'conf_coeff']
        away_coeff = groups_use.loc[away, 'conf_coeff']
        home_rank = groups_use.loc[home, 'Rank']
        away_rank = groups_use.loc[away, 'Rank']

        row['mean_coeff'] = (home_coeff + away_coeff) / 2
        row['home_team_fifa_rank'] = home_rank
        row['away_team_fifa_rank'] = away_rank

        # Predicted class: 1 = away win, 2 = draw, 3 = home win.
        prediction_model = mlp.predict(row)[0]
        if prediction_model == 1:
            groups_use.loc[away, 'points'] += 3
            print("{} wins".format(away))

        elif prediction_model == 2:
            groups_use.loc[home, 'points'] += 1
            groups_use.loc[away, 'points'] += 1
            print("The match ends in a draw")

        elif prediction_model == 3:
            groups_use.loc[home, 'points'] += 3
            print("{} wins".format(home))





====Group D====
France vs Denmark : 
France wins
France vs Tunisia : 
France wins
France vs Australia : 
France wins
Denmark vs Tunisia : 
Denmark wins
Denmark vs Australia : 
Denmark wins
Tunisia vs Australia : 
Tunisia wins
====Group F====
Belgium vs Canada : 
Belgium wins
Belgium vs Morocco : 
Belgium wins
Belgium vs Croatia : 
Belgium wins
Canada vs Morocco : 
Morocco wins
Canada vs Croatia : 
Croatia wins
Morocco vs Croatia : 
Croatia wins
====Group G====
Brazil vs Serbia : 
Brazil wins
Brazil vs Switzerland : 
Brazil wins
Brazil vs Cameroon : 
Brazil wins
Serbia vs Switzerland : 
Serbia wins
Serbia vs Cameroon : 
Serbia wins
Switzerland vs Cameroon : 
Switzerland wins
====Group E====
Spain vs Germany : 
Spain wins
Spain vs Japan : 
Spain wins
Spain vs Costa Rica : 
Spain wins
Germany vs Japan : 
Germany wins
Germany vs Costa Rica : 
Germany wins
Japan vs Costa Rica : 
Japan wins
====Group H====
Portugal vs Ghana : 
Portugal wins
Portugal vs Uruguay : 
Portugal wins
Portugal vs Ko

In [132]:
groups_use = groups_use.sort_values(by =['Group', 'points'], ascending = [True, False]) 
print(groups_use)

A = groups_use[groups_use['Group'] == 'A']
B = groups_use[groups_use['Group'] == 'B']
C = groups_use[groups_use['Group'] == 'C']
D = groups_use[groups_use['Group'] == 'D']
E = groups_use[groups_use['Group'] == 'E']
F = groups_use[groups_use['Group'] == 'F']
G = groups_use[groups_use['Group'] == 'G']
H = groups_use[groups_use['Group'] == 'H']

               Group  conf_coeff  Rank  points
Country_Name                                  
Netherlands        A        0.99     8       9
Senegal            A        0.85    18       6
Qatar              A        0.85    50       3
Ecuador            A        1.00    44       0
England            B        0.99     5       9
IR Iran            B        0.85    20       6
USA                B        0.85    16       3
Wales              B        0.99    19       0
Argentina          C        1.00     3       9
Mexico             C        0.85    13       6
Poland             C        0.99    26       3
Saudi Arabia       C        0.85    51       0
France             D        0.99     4       9
Denmark            D        0.99    10       6
Tunisia            D        0.85    30       3
Australia          D        0.85    38       0
Spain              E        0.99     7       9
Germany            E        0.99    11       6
Japan              E        0.85    24       3
Costa Rica   
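
The standings above are sorted by points only, so teams level on points would be ordered arbitrarily. The sketch below shows one way to tally a round robin with an explicit tiebreaker. It is an illustration, not the notebook's code: `group_standings` and `rank_rule` are hypothetical names, `rank_rule` is a stand-in for `mlp.predict`, and breaking ties by FIFA rank is an assumption (the real tournament uses goal difference first, which this model does not predict).

```python
from itertools import combinations

def group_standings(teams, predict):
    """Play a round robin and return (ordered team names, points per team).

    teams:   dict mapping team name -> FIFA rank (lower is better)
    predict: callable(home_rank, away_rank) -> 3 (home win), 2 (draw), 1 (away win)
    """
    points = {name: 0 for name in teams}
    for home, away in combinations(teams, 2):
        outcome = predict(teams[home], teams[away])
        if outcome == 3:
            points[home] += 3
        elif outcome == 1:
            points[away] += 3
        else:
            points[home] += 1
            points[away] += 1
    # Sort by points (descending), then by FIFA rank (ascending) as the assumed tiebreaker.
    order = sorted(teams, key=lambda name: (-points[name], teams[name]))
    return order, points

def rank_rule(home_rank, away_rank):
    """Hypothetical stand-in for the trained model: the better-ranked side wins."""
    if home_rank == away_rank:
        return 2
    return 3 if home_rank < away_rank else 1

# Group D ranks as printed in the table above.
group_d = {"France": 4, "Denmark": 10, "Tunisia": 30, "Australia": 38}
order, points = group_standings(group_d, rank_rule)
print(order[:2], points)
```

The first two names in `order` are the teams that advance to the knockout stage.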

## Knockout stages

In [159]:
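def match_winner(home, away):
    # home and away are one-row DataFrames taken from the group tables.
    # Feature row in the order of X.columns; importance is set to 4 and neutral to 1.
    row = pd.DataFrame(np.array([[np.nan, np.nan, np.nan, 4, 1]]), columns=X.columns)
    row['mean_coeff'] = (home['conf_coeff'].values[0] + away['conf_coeff'].values[0]) / 2
    row['home_team_fifa_rank'] = home['Rank'].values[0]
    row['away_team_fifa_rank'] = away['Rank'].values[0]

    prediction_model = mlp.predict(row)[0]

    if prediction_model == 3:
        return home
    if prediction_model == 1:
        return away

    # A knockout match cannot end in a draw. Assumption: when a draw is predicted,
    # advance the side with the higher predicted win probability.
    proba = dict(zip(mlp.classes_, mlp.predict_proba(row)[0]))
    return home if proba.get(3, 0) > proba.get(1, 0) else away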


In [158]:
a1 = A.iloc[0:1]
a2 = A.iloc[1:2]

b1 = B.iloc[0:1]
b2 = B.iloc[1:2]

c1 = C.iloc[0:1]
c2 = C.iloc[1:2]

d1 = D.iloc[0:1]
d2 = D.iloc[1:2]

e1 = E.iloc[0:1]
e2 = E.iloc[1:2]

f1 = F.iloc[0:1]
f2 = F.iloc[1:2]

g1 = G.iloc[0:1]
g2 = G.iloc[1:2]

h1 = H.iloc[0:1]
h2 = H.iloc[1:2]



### Last 16


In [183]:

l16_1 = match_winner(a1,b2)
print(a1.index.values + " vs " + b2.index.values + ". Winner " + l16_1.index.values)

l16_2 = match_winner(c1,d2)
print(c1.index.values + " vs " + d2.index.values + ". Winner " + l16_2.index.values)

l16_3 = match_winner(e1,f2)
print(e1.index.values + " vs " + f2.index.values + ". Winner " + l16_3.index.values)

l16_4 = match_winner(g1,h2)
print(g1.index.values + " vs " + h2.index.values + ". Winner " + l16_4.index.values)

l16_5 = match_winner(b1,a2)
print(b1.index.values + " vs " + a2.index.values + ". Winner " + l16_5.index.values)

l16_6 = match_winner(d1,c2)
print(d1.index.values + " vs " + c2.index.values + ". Winner " + l16_6.index.values)

l16_7 = match_winner(f1,e2)
print(f1.index.values + " vs " + e2.index.values + ". Winner " + l16_7.index.values)

l16_8 = match_winner(h1,g2)
print(h1.index.values + " vs " + g2.index.values + ". Winner " + l16_8.index.values)

['Netherlands vs IR Iran. Winner Netherlands']
['Argentina vs Denmark. Winner Argentina']
['Spain vs Croatia. Winner Spain']
['Brazil vs Uruguay. Winner Brazil']
['England vs Senegal. Winner England']
['France vs Mexico. Winner France']
['Belgium vs Germany. Winner Belgium']
['Portugal vs Serbia. Winner Portugal']


### Quarter-Finals

In [184]:
qf_1 = match_winner(l16_1, l16_2)
print(l16_1.index.values + " vs " + l16_2.index.values + ". Winner " + qf_1.index.values)

qf_2 = match_winner(l16_3, l16_4)
print(l16_3.index.values + " vs " + l16_4.index.values + ". Winner " + qf_2.index.values)

qf_3 = match_winner(l16_5,l16_6)
print(l16_5.index.values + " vs " + l16_6.index.values + ". Winner " + qf_3.index.values)

qf_4 = match_winner(l16_7, l16_8)
print(l16_7.index.values + " vs " + l16_8.index.values + ". Winner " + qf_4.index.values)


['Netherlands vs Argentina. Winner Argentina']
['Spain vs Brazil. Winner Brazil']
['England vs France. Winner France']
['Belgium vs Portugal. Winner Belgium']


### Semi-Finals

In [186]:
sf_1 = match_winner(qf_1, qf_2)
print(qf_1.index.values + " vs " + qf_2.index.values + ". Winner " + sf_1.index.values)

sf_2 = match_winner(qf_3, qf_4)
print(qf_3.index.values + " vs " + qf_4.index.values + ". Winner " + sf_2.index.values)

['Argentina vs Brazil. Winner Argentina']
['France vs Belgium. Winner Belgium']


### Final

In [187]:
final_winner = match_winner(sf_1, sf_2)
print(sf_1.index.values + " vs " + sf_2.index.values + ". Winner " + final_winner.index.values)

['Argentina vs Belgium. Winner Argentina']
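
The cells above run the bracket once with deterministic predictions, so they yield a single winner rather than a probability. Estimating each team's chance of winning, as the introduction describes, requires sampling match outcomes and repeating the bracket many times. The sketch below shows the idea under loud assumptions: the win probability in `sample_winner` is a made-up function of FIFA rank, not the trained model (with the classifier, `mlp.predict_proba` could supply the probabilities instead), and all function names are hypothetical. It uses a shortened four-team bracket built from teams whose ranks appear in the standings table.

```python
import random
from collections import Counter

def sample_winner(home, away, ranks, rng):
    """Draw a winner at random; the better-ranked side is favoured.

    Assumed probability: each side's weight is the other side's rank,
    so rank 3 vs rank 9 gives the rank-3 team a 9 / 12 chance.
    """
    p_home = ranks[away] / (ranks[home] + ranks[away])
    return home if rng.random() < p_home else away

def play_bracket(teams, ranks, rng):
    """Pair adjacent teams round by round until one remains."""
    while len(teams) > 1:
        teams = [sample_winner(teams[i], teams[i + 1], ranks, rng)
                 for i in range(0, len(teams), 2)]
    return teams[0]

def title_chances(teams, ranks, n_sims=10_000, seed=0):
    """Fraction of simulated brackets won by each team."""
    rng = random.Random(seed)  # fixed seed so repeated runs agree
    wins = Counter(play_bracket(teams, ranks, rng) for _ in range(n_sims))
    return {team: wins[team] / n_sims for team in teams}

# Ranks as printed in the standings table; bracket order follows the quarter-finals.
ranks = {"Netherlands": 8, "Argentina": 3, "England": 5, "France": 4}
bracket = ["Netherlands", "Argentina", "England", "France"]
chances = title_chances(bracket, ranks)
for team, chance in sorted(chances.items(), key=lambda item: -item[1]):
    print("{}: {:.1%}".format(team, chance))
```

The number of teams passed to `play_bracket` must be a power of two so that every round pairs up evenly.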
