# World Cup 2022 Prediction

## Introduction

With the World Cup under way at the time of writing, predicting the winner is an interesting exercise. In this notebook, we use past international match results to train, cross-validate, and test a linear regression model that predicts the expected goals for each match. The dataset is in Past_International_Matches.csv, which we derived from Kaggle ([FIFA World Cup 2022 ⚽️🏆](https://www.kaggle.com/datasets/brenda89/fifa-world-cup-2022)).

We then use this model to simulate the whole competition. This means predicting each game in the group stage to find out who qualifies for the knockout stage, and then following the predefined tournament bracket for this World Cup. We repeat the simulation many times to estimate each team's chance of winning the World Cup.
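
As a rough illustration of the "many simulations" idea, the sketch below plays a small single-elimination bracket repeatedly and reports how often each team wins it. The `STRENGTH` values, the four-team bracket, and the helper names (`play_match`, `simulate_knockout`, `win_probabilities`) are assumptions made for this example; they are not the fitted model or the real tournament bracket.

```python
import random
from collections import Counter

# Hypothetical strengths, for illustration only (not fitted values).
STRENGTH = {"Brazil": 0.9, "Argentina": 0.85, "France": 0.8, "Qatar": 0.3}


def play_match(team_a, team_b, rng):
    """Return the winner, sampled in proportion to the assumed strengths."""
    p_a = STRENGTH[team_a] / (STRENGTH[team_a] + STRENGTH[team_b])
    return team_a if rng.random() < p_a else team_b


def simulate_knockout(bracket, rng):
    """Play a single-elimination bracket; adjacent teams meet each round.

    The bracket length is assumed to be a power of two.
    """
    teams = list(bracket)
    while len(teams) > 1:
        teams = [play_match(teams[i], teams[i + 1], rng)
                 for i in range(0, len(teams), 2)]
    return teams[0]


def win_probabilities(bracket, n_simulations=10_000, seed=0):
    """Estimate each team's chance as its share of simulated titles."""
    rng = random.Random(seed)
    titles = Counter(simulate_knockout(bracket, rng)
                     for _ in range(n_simulations))
    return {team: titles[team] / n_simulations for team in bracket}


bracket = ["Brazil", "Qatar", "Argentina", "France"]
chances = win_probabilities(bracket)
for team, p in sorted(chances.items(), key=lambda kv: -kv[1]):
    print(f"{team}: {p:.1%}")
```

Fixing the seed keeps the estimate reproducible; increasing `n_simulations` reduces the sampling noise in the reported percentages.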

## Import and Set Up

In this section we import the necessary libraries, read the datasets, and store them in pandas DataFrames.

## Data Visualisation

In [1]:
from mpl_toolkits import mplot3d
import matplotlib.pyplot as plt

In [2]:
import numpy as np
import pandas as pd
import math
from scipy import optimize
from itertools import combinations
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

import sklearn
from sklearn import linear_model
from sklearn import ensemble
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, roc_curve, roc_auc_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LinearRegression


In [3]:
groups = pd.read_csv('groupstage.csv')
data = pd.read_csv('Past_International_Matches.csv')
# groups.set_index(['Country_Name'])


In [4]:
data['tvalue_difference'] = data['tvalue_home'] - data['tvalue_away']
data['is_won'] = (data['home_team_result'] == 'Win') * 1
# data['is_draw'] = (data['home_team_result'] == 'Draw') * 1
data['is_lost'] = (data['home_team_result'] == 'Lose') * 1

data['polyn1'] = data['away_team_total_fifa_points'] * data['home_team_total_fifa_points']
data['polyn2'] = np.square(data['mean_coeff'])

## Training 

In [5]:
X = data.loc[:, ['mean_coeff', 'home_team_fifa_rank', 'away_team_fifa_rank', 'importance']]
y = data.loc[:,['is_won']]
print(X)

      mean_coeff  home_team_fifa_rank  away_team_fifa_rank  importance
0          0.925                  114                  158         1.0
1          0.850                  120                  129         1.0
2          0.850                  108                   88         1.0
3          0.850                  101                   98         1.0
4          0.850                   96                  127         1.0
...          ...                  ...                  ...         ...
9666       0.990                  180                  153         2.5
9667       0.990                  192                  135         2.5
9668       0.925                   28                   60         1.0
9669       0.850                   23                   35         1.0
9670       0.850                   29                   32         1.0

[9671 rows x 4 columns]


In [6]:
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, train_size=0.7, shuffle=True)

# logreg = LogisticRegression()
# logreg.fit(x_train, y_train)
# log_pred = logreg.predict(x_test)
# acc_log = round(logreg.score(x_test, y_test) * 100, 2)
# acc_log

mlp = MLPClassifier(solver='sgd')
mlp.fit(x_train, y_train.values.ravel())  # flatten y to 1-D, the shape scikit-learn expects

predict_test = mlp.predict(x_test)
predict_train = mlp.predict(x_train)
predict_test

# acc_mlp = round(mlp.score(x_test, y_test) * 100, 2)
# acc_mlp

array([0, 0, 1, ..., 0, 0, 1])
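
The accuracy check in the cell above is commented out, so the predictions are never scored. A minimal, library-free sketch of such a check follows; `evaluate` and `majority_baseline` are hypothetical helpers, and the labels are made up for illustration. In the notebook, the inputs would be the held-out `is_won` labels and `predict_test`, and scikit-learn's `confusion_matrix` (already imported) reports the same counts.

```python
from collections import Counter


def evaluate(y_true, y_pred):
    """Return accuracy and confusion counts for binary labels (1 = home win)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    counts = Counter(zip(y_true, y_pred))
    tp, tn = counts[(1, 1)], counts[(0, 0)]
    fp, fn = counts[(0, 1)], counts[(1, 0)]
    return {"accuracy": (tp + tn) / len(y_true),
            "tp": tp, "tn": tn, "fp": fp, "fn": fn}


def majority_baseline(y_true):
    """Accuracy of always predicting the most common label."""
    return Counter(y_true).most_common(1)[0][1] / len(y_true)


# Made-up labels for illustration only.
y_true = [1, 0, 0, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 1, 0, 0, 0, 0]

report = evaluate(y_true, y_pred)
print(report)
print("baseline:", majority_baseline(y_true))
```

Comparing the accuracy against the majority-class baseline matters here because home wins are only one of three possible results, so a model that always predicts "no home win" may already look reasonably accurate.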

## Group Stage

In [13]:
groups_use = groups.set_index(['Country_Name'])
groups_use['points'] = 0
for group in set(groups['Group']):
    print('====Group {}===='.format(group))
    for home, away in combinations(groups.query('Group == "{}"'.format(group)).values, 2):
        # `end=''` was previously passed to str.format, where it had no effect
        print("{} vs {} : ".format(home[0], away[0]))
        row = pd.DataFrame(np.array([[np.nan, np.nan, np.nan, np.nan]]), columns=X.columns)
        home = home[0]
        away = away[0]

        home_coeff = groups_use.loc[home, 'conf_coeff']
        away_coeff = groups_use.loc[away, 'conf_coeff']
        home_rank = groups_use.loc[home, 'Rank']
        away_rank = groups_use.loc[away, 'Rank']

        row['mean_coeff'] = (home_coeff + away_coeff)/2
        row['home_team_fifa_rank'] = home_rank
        row['away_team_fifa_rank'] = away_rank
        row['importance'] = 2.5

        # The classifier is binary (home win vs. not), so a draw is never
        # predicted: label 0 is treated as an away win.
        prediction_model = mlp.predict(row)[0]

        if prediction_model == 1:
            groups_use.loc[home, 'points'] += 3
            print("{} wins".format(home))
        else:
            groups_use.loc[away, 'points'] += 3
            print("{} wins".format(away))



====Group A====
Qatar vs Ecuador : 
Ecuador wins
Qatar vs Senegal : 
Senegal wins
Qatar vs Netherlands : 
Netherlands wins
Ecuador vs Senegal : 
Senegal wins
Ecuador vs Netherlands : 
Netherlands wins
Senegal vs Netherlands : 
Netherlands wins
====Group D====
France vs Denmark : 
Denmark wins
France vs Tunisia : 
France wins
France vs Australia : 
France wins
Denmark vs Tunisia : 
Denmark wins
Denmark vs Australia : 
Denmark wins
Tunisia vs Australia : 
Australia wins
====Group G====
Brazil vs Serbia : 
Brazil wins
Brazil vs Switzerland : 
Brazil wins
Brazil vs Cameroon : 
Brazil wins
Serbia vs Switzerland : 
Switzerland wins
Serbia vs Cameroon : 
Serbia wins
Switzerland vs Cameroon : 
Switzerland wins
====Group C====
Argentina vs Saudi Arabia : 
Argentina wins
Argentina vs Mexico : 
Argentina wins
Argentina vs Poland : 
Argentina wins
Saudi Arabia vs Mexico : 
Mexico wins
Saudi Arabia vs Poland : 
Poland wins
Mexico vs Poland : 
Mexico wins
====Group E====
Spain vs Germany : 
Germany 
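
The round-robin loop above, and the top-two cut that follows from the points table, can be sketched in a self-contained form. Here `predict_home_win` is a hypothetical stand-in for the trained classifier (it simply favours the better-ranked side), and `play_group` and `qualifiers` are illustrative helpers rather than functions from the notebook. The ranks are the Group A values from `groupstage.csv`.

```python
from itertools import combinations


def predict_home_win(home_rank, away_rank):
    """Stand-in for the classifier: 1 if the home side is better ranked."""
    return 1 if home_rank < away_rank else 0


def play_group(ranks):
    """Play every pairing once and return points per team (3 for a win)."""
    points = {team: 0 for team in ranks}
    for home, away in combinations(ranks, 2):
        home_wins = predict_home_win(ranks[home], ranks[away]) == 1
        winner = home if home_wins else away
        points[winner] += 3
    return points


def qualifiers(points, n=2):
    """Return the n teams with the most points (ties keep insertion order)."""
    return sorted(points, key=points.get, reverse=True)[:n]


# Team -> FIFA rank (lower is better).
group_a = {"Qatar": 50, "Ecuador": 44, "Senegal": 18, "Netherlands": 8}

points = play_group(group_a)
print(points)
print(qualifiers(points))
```

A real tie-break would also need goal difference and head-to-head results, which a win/loss classifier does not provide; this sketch falls back on insertion order instead.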

In [14]:
groups_use = groups_use.sort_values(by=['Group', 'points'], ascending=[True, False])
print(groups_use)
 

               Group  conf_coeff  Rank  points
Country_Name                                  
Netherlands        A        0.99     8       9
Senegal            A        0.85    18       6
Ecuador            A        1.00    44       3
Qatar              A        0.85    50       0
England            B        0.99     5       9
Wales              B        0.99    19       6
USA                B        0.85    16       3
IR Iran            B        0.85    20       0
Argentina          C        1.00     3       9
Mexico             C        0.85    13       6
Poland             C        0.99    26       3
Saudi Arabia       C        0.85    51       0
Denmark            D        0.99    10       9
France             D        0.99     4       6
Australia          D        0.85    38       3
Tunisia            D        0.85    30       0
Germany            E        0.99    11       9
Spain              E        0.99     7       6
Costa Rica         E        0.85    31       3
Japan        