# Predicting 

Getting into the World Cup mood, I'm going to predict the results of this competition using complex networks (graph theory) and an approach called link prediction. Specifically, we're going to work with the Adamic-Adar measure.

**What is it?**

It is a measure used to compute the closeness of two nodes based on their shared neighbors. But I made some changes to the formula...

- x and y are 2 nodes (2 national teams).
- N(*one_node*) is a function that returns the set of nodes adjacent to *one_node*, i.e. the teams it is connected to through goals-scored edges.
- The part I modified in the formula is the numerator $\text{goals scored}_{uy}$, which gives us a more precise metric when a team scores a lot of goals in a match.

$$\text{adamicAdar}(x,y)=\sum_{u\in N(x)\cap N(y)} \frac{\text{goals scored}_{uy}}{\log(|N(u)|)}$$

In other words, for each team u that conceded a goal to team x and scored a goal against team y, add to the measure the goals scored by team u against team y, divided by the logarithm of the number of teams adjacent to u.

The quantity $\frac{\text{goals scored}_{uy}}{\log(|N(u)|)}$ determines the importance of u in the measure.

We can then also factor in the number of goals scored by team x against team u:

$$\text{adamicAdar}(x,y)=\sum_{u\in N(x)\cap N(y)}\frac{\text{goals scored}_{xu}\cdot\text{goals scored}_{uy}}{\log(|N(u)|)}$$

This will be the final function used for this framework.
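
Just to make the formula concrete, here is a tiny dependency-free sketch on a made-up graph. Everything in it is an assumption for illustration only: the nested dict `goals`, the team names `X`, `U1`, `U2`, `Y`, and the helper `modified_adamic_adar`. $|N(u)|$ is taken to be the number of opponents team u is linked to.

```python
import math

# Hypothetical toy data: goals[a][b] = goals team a has scored against team b.
goals = {
    "X":  {"U1": 3, "U2": 1},
    "U1": {"X": 0, "Y": 2, "U2": 1},
    "U2": {"X": 1, "Y": 4, "U1": 0},
    "Y":  {"U1": 1, "U2": 2},
}

def modified_adamic_adar(goals, x, y):
    """Sum goals(x,u) * goals(u,y) / log(|N(u)|) over shared opponents u."""
    shared = set(goals[x]) & set(goals[y])
    total = 0.0
    for u in shared:
        n_opponents = len(goals[u])
        if n_opponents < 2:
            continue  # log(1) = 0 would cause a division by zero
        total += goals[x][u] * goals[u].get(y, 0) / math.log(n_opponents)
    return total

score_xy = modified_adamic_adar(goals, "X", "Y")
score_yx = modified_adamic_adar(goals, "Y", "X")
print(score_xy, score_yx)
```

Note that the measure is not symmetric: here `score_xy` is $(3\cdot 2 + 1\cdot 4)/\log 3$ while `score_yx` is $(1\cdot 0 + 2\cdot 1)/\log 3$, which is what lets us compare the two directions of a match.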


# 1. Importing necessary libraries

In [1]:
import pandas as pd
import networkx as nx
import math
import numpy as np
import matplotlib.pyplot as plt

# 2. Reading and preparing data

Reading the dataframe that holds all official national team match results from 1872 to 2022!

In [2]:
#Reading dataframe
df =  pd.read_csv("../input/international-football-results-from-1872-to-2017/results.csv")
df['date'] = pd.to_datetime(df['date'])

Filtering the dataframe so that the historical data only contains matches played before the current World Cup edition

In [3]:
#Optionally keep only the matches played between the previous edition and the current one
last_ediction_date = df[ (df['tournament'] == 'FIFA World Cup') & (df['date'].dt.year == 2018) ].date.max()
actual_ediction_date = df[ (df['tournament'] == 'FIFA World Cup') & (df['date'].dt.year == 2022) ].date.min()

#df = df[df['date'] >= last_ediction_date].copy()

Preparing dataframes that we will use to predict the results

In [4]:
#Qatar World Cup matches
df_catar = df[ (df['tournament'] == 'FIFA World Cup') & (df['date'].dt.year == 2022) ].copy()

#Removing the Qatar World Cup matches from the historical dataframe
df = df[df['date'] < actual_ediction_date].copy()

#Selecting the useful columns
df = df[['home_team', 'away_team', 'home_score', 'away_score', 'date']].reset_index(drop=True)
df_catar = df_catar[['home_team', 'away_team', 'home_score', 'away_score', 'date']].reset_index(drop=True)

#Adjusting column types
df['home_score'] = df['home_score'].astype('int64')
df['away_score'] = df['away_score'].astype('int64')

Creating the group relationship

In [5]:
#Creating the group relationship
teams = ['Netherlands','Ecuador','Senegal','Qatar','England','Iran','United States','Wales','Poland','Argentina','Saudi Arabia','Mexico','France','Australia','Denmark','Tunisia','Spain','Japan','Costa Rica','Germany','Belgium','Croatia','Morocco','Canada','Brazil','Switzerland','Cameroon','Serbia','Portugal','South Korea','Uruguay','Ghana']
groups = ['A','A','A','A','B','B','B','B','C','C','C','C','D','D','D','D','E','E','E','E','F','F','F','F','G','G','G','G','H','H','H','H']

df_groups = pd.DataFrame(zip(teams, groups), columns=['team', 'group'])
df_catar = df_catar.merge(df_groups, left_on = 'home_team', right_on = 'team', how='left')
df_catar.head()

Unnamed: 0,home_team,away_team,home_score,away_score,date,team,group
0,Qatar,Ecuador,0.0,2.0,2022-11-20,Qatar,A
1,Senegal,Netherlands,0.0,2.0,2022-11-21,Senegal,A
2,England,Iran,6.0,2.0,2022-11-21,England,B
3,United States,Wales,1.0,1.0,2022-11-21,United States,B
4,Argentina,Saudi Arabia,1.0,2.0,2022-11-22,Argentina,C


# 3. Network building

Here we will transform the pandas dataframe into a network structure. This will allow us to use some interesting functions later on!

In [6]:
G = nx.DiGraph()

for i, rowi in df.iterrows():
    if rowi['home_team'] not in G: G.add_node(rowi['home_team'],label="TEAM")
    if rowi['away_team'] not in G: G.add_node(rowi['away_team'],label="TEAM")
    
    if G.has_edge(rowi['home_team'], rowi['away_team']):
        goals = G.get_edge_data(rowi['home_team'], rowi['away_team'])['weight']
        goals += rowi['home_score']
        G.remove_edge(rowi['home_team'], rowi['away_team'])
        G.add_edge(rowi['home_team'], rowi['away_team'], label="GOAL", weight=goals)
        
    else:
        G.add_edge(rowi['home_team'], rowi['away_team'], label="GOAL", weight=rowi['home_score'])
    
    if G.has_edge(rowi['away_team'], rowi['home_team']):
        goals = G.get_edge_data(rowi['away_team'], rowi['home_team'])['weight']
        goals += rowi['away_score']
        G.remove_edge(rowi['away_team'], rowi['home_team'])
        G.add_edge(rowi['away_team'], rowi['home_team'], label="GOAL", weight=goals)
        
    else:
        G.add_edge(rowi['away_team'], rowi['home_team'], label="GOAL", weight=rowi['away_score'])
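
If the loop above feels dense, the same accumulation can be sketched without networkx. This is only an illustration with made-up results; the `matches` list and the nested `goals` dict are assumptions, not part of the notebook. Each directed entry `goals[a][b]` plays the role of the edge weight from a to b.

```python
from collections import defaultdict

# Hypothetical results: (home_team, away_team, home_score, away_score)
matches = [
    ("Brazil", "Argentina", 2, 1),
    ("Argentina", "Brazil", 0, 0),
    ("Brazil", "Argentina", 1, 3),
]

# goals[a][b] accumulates the goals a has scored against b over all meetings
goals = defaultdict(lambda: defaultdict(int))
for home, away, home_score, away_score in matches:
    goals[home][away] += home_score
    goals[away][home] += away_score

print(dict(goals["Brazil"]))     # goals Brazil scored against each opponent
print(dict(goals["Argentina"]))  # goals Argentina scored against each opponent
```

As in the graph, a goalless match still creates the link between the two teams, just with nothing added to the weight.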

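Let's take a look at the number of teams and pairings mapped! Each edge is a directed pairing (the goals one team scored against another), so the edge count reflects pairs of teams rather than individual matches.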

In [7]:
print('Unique teams in the graph:', G.number_of_nodes())
print('Unique matches in the graph:', G.number_of_edges())

Unique teams in the graph: 316
Unique matches in the graph: 14038


Building the function that returns the Adamic-Adar score (AAS) for each pair of teams. The idea is: if the AAS for Team A is bigger than the AAS for Team B, then Team A is the winner of the match.

We built this network from the goals scored by each team in the past, so we can treat the score as a proxy for the chance that a team scores a new goal in a specific match.

In [8]:
def get_score(Graph, Home_Team, Away_Team):

    # Map every team e2 reachable in two steps (Home_Team -> e -> e2) to the intermediate teams e
    commons_dict = {}
    for e in Graph.neighbors(Home_Team):
        for e2 in Graph.neighbors(e):
            if e2 == Home_Team:
                continue
            commons_dict.setdefault(e2, []).append(e)

    teams=[]
    weight=[]
    for key, values in commons_dict.items():
        w=0.0
        for e in values:
            # goals(Home_Team -> e) * goals(e -> key) / log(number of opponents of e)
            w= w + (Graph[Home_Team][e]['weight']*Graph[e][key]['weight']*1)/math.log(Graph.out_degree(e))
        teams.append(key) 
        weight.append(w)
    
    result = pd.Series(data=np.array(weight),index=teams)
    result.sort_values(inplace=True,ascending=False)
    
    # Teams with no shared opponent are missing from the index
    try:
        return result[Away_Team]
    except KeyError:
        return 0

Testing the beautiful function

In [9]:
print('Score for Brazil against Argentina:', get_score(G, 'Brazil', 'Argentina'))
print('Score for Argentina against Brazil:', get_score(G, 'Argentina', 'Brazil'))

Score for Brazil against Argentina: 19748.88847705135
Score for Argentina against Brazil: 16040.025376623713


The idea is to use the function above to get historical data against common opponents, and then to get the simple probability of a team scoring a goal based on the games played since the last World Cup edition. The function below does this for us! If a pair of teams has no matches (or no goals) in that period, 50% is returned for each side.

In [10]:
def get_actual_score(dataframe, Home_Team, Away_Team):
    # Matches since the last edition with Home_Team playing at home
    mask = ( (dataframe['home_team'] == Home_Team) & (dataframe['away_team'] == Away_Team) & (dataframe['date'] > last_ediction_date ) )
    goals_home = dataframe[mask]['home_score'].sum()
    goals_away = dataframe[mask]['away_score'].sum()
    
    # Same pairing with the sides swapped
    mask = ( (dataframe['home_team'] == Away_Team) & (dataframe['away_team'] == Home_Team) & (dataframe['date'] > last_ediction_date ) )
    goals_home += dataframe[mask]['away_score'].sum()
    goals_away += dataframe[mask]['home_score'].sum()
    
    total = goals_home + goals_away
    if total == 0:
        # No recent goals between the two teams: treat the pair as even
        return 0.5, 0.5 
    else:
        return goals_home/(total), goals_away/(total)

As an example...

In [11]:
get_actual_score(df, 'Brazil', 'Argentina')

(0.6, 0.4)
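
To see where a pair like (0.6, 0.4) can come from, here is a small standalone sketch of the same goal-share idea. The helper `recent_goal_share` and the two results in `recent` are made up for illustration and are not the real Brazil vs Argentina matches.

```python
def recent_goal_share(matches, team_a, team_b):
    """Share of the goals each side scored in their head-to-head matches."""
    goals_a = goals_b = 0
    for home, away, home_score, away_score in matches:
        if (home, away) == (team_a, team_b):
            goals_a += home_score
            goals_b += away_score
        elif (home, away) == (team_b, team_a):
            goals_a += away_score
            goals_b += home_score
    total = goals_a + goals_b
    if total == 0:
        return 0.5, 0.5  # no meetings, or only goalless ones: treat as even
    return goals_a / total, goals_b / total

# Hypothetical head-to-head results since the previous edition
recent = [("Brazil", "Argentina", 2, 1), ("Argentina", "Brazil", 1, 1)]
share = recent_goal_share(recent, "Brazil", "Argentina")
print(share)
```

With 3 goals for one side and 2 for the other, the shares are 3/5 and 2/5.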

Combining both functions we would have...

In [12]:
brazil = get_score(G, 'Brazil', 'Argentina') / (get_score(G, 'Brazil', 'Argentina') + get_score(G, 'Argentina', 'Brazil'))
argentina = get_score(G, 'Argentina', 'Brazil') / (get_score(G, 'Brazil', 'Argentina') + get_score(G, 'Argentina', 'Brazil'))

In [13]:
print('Score for Brazil against Argentina:', np.mean([brazil, get_actual_score(df, 'Brazil', 'Argentina')[0]]))
print('Score for Argentina against Brazil:', np.mean([argentina, get_actual_score(df, 'Brazil', 'Argentina')[1]]))

Score for Brazil against Argentina: 0.5759079048584118
Score for Argentina against Brazil: 0.4240920951415882
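
The arithmetic behind these two numbers can be checked by hand. The sketch below reuses the values printed earlier in the notebook; the helper name `combined_score` is my own and only there for illustration.

```python
def combined_score(graph_for, graph_against, recent_share):
    """Average the normalized graph score with the recent goal share."""
    total = graph_for + graph_against
    graph_share = 0.5 if total == 0 else graph_for / total
    return (graph_share + recent_share) / 2

# Values printed earlier in the notebook for Brazil vs Argentina
brazil_graph = 19748.88847705135
argentina_graph = 16040.025376623713

brazil = combined_score(brazil_graph, argentina_graph, 0.6)
argentina = combined_score(argentina_graph, brazil_graph, 0.4)
print(round(brazil, 4), round(argentina, 4))
```

Because both ingredients are shares that sum to 1 across the two teams, the combined scores also sum to 1.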


# 4. Predicting the Group Stage

Adding all the teams from the current edition as nodes in the network

In [14]:
for i, rowi in df_catar.iterrows():
    if rowi['home_team'] not in G: G.add_node(rowi['home_team'],label="TEAM")
    if rowi['away_team'] not in G: G.add_node(rowi['away_team'],label="TEAM")

Applying this to all the matches of the group stage

In [15]:
home_points = list()
away_points = list()
for i, rowi in df_catar.iterrows():
    actual =  get_actual_score(df, rowi['home_team'], rowi['away_team'])
    
    home_ = get_score(G, rowi['home_team'], rowi['away_team']) 
    away_ = get_score(G, rowi['away_team'], rowi['home_team']) 
    
    if (home_ + away_) == 0:
        home = np.mean([0.5, actual[0]])
        away = np.mean([0.5, actual[1]])
    else:
        home = home_ / (home_ + away_)
        away = away_ / (home_ + away_)
                       
        home = np.mean([home, actual[0]])
        away = np.mean([away, actual[1]])
    
    if home > away:
        home_points.append(3)
        away_points.append(0)
    elif home < away:
        home_points.append(0)
        away_points.append(3)
    else:
        home_points.append(1)
        away_points.append(1)
        
df_catar['home_points'] = home_points
df_catar['away_points'] = away_points

Uh... Sometimes it works, but we already know that some unusual results happened this year.

In [16]:
df_catar.head()

Unnamed: 0,home_team,away_team,home_score,away_score,date,team,group,home_points,away_points
0,Qatar,Ecuador,0.0,2.0,2022-11-20,Qatar,A,0,3
1,Senegal,Netherlands,0.0,2.0,2022-11-21,Senegal,A,0,3
2,England,Iran,6.0,2.0,2022-11-21,England,B,3,0
3,United States,Wales,1.0,1.0,2022-11-21,United States,B,0,3
4,Argentina,Saudi Arabia,1.0,2.0,2022-11-22,Argentina,C,3,0
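
A quick way to put a number on "sometimes it works" is to compare the predicted points with the points implied by the real scoreline. This sketch only uses the five rows shown above; the helper `actual_points` is an assumption added for illustration.

```python
# First five group-stage rows shown above:
# (home_team, away_team, home_score, away_score, home_points, away_points)
rows = [
    ("Qatar", "Ecuador", 0, 2, 0, 3),
    ("Senegal", "Netherlands", 0, 2, 0, 3),
    ("England", "Iran", 6, 2, 3, 0),
    ("United States", "Wales", 1, 1, 0, 3),
    ("Argentina", "Saudi Arabia", 1, 2, 3, 0),
]

def actual_points(home_score, away_score):
    """Points (home, away) implied by the real scoreline."""
    if home_score > away_score:
        return 3, 0
    if home_score < away_score:
        return 0, 3
    return 1, 1

hits = sum(actual_points(hs, as_) == (hp, ap) for _, _, hs, as_, hp, ap in rows)
accuracy = hits / len(rows)
print(f"{hits} of {len(rows)} outcomes predicted correctly ({accuracy:.0%})")
```

On these five rows, the misses are the United States vs Wales draw and Saudi Arabia's win over Argentina.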


Creating a function to get final results for the group stage

In [17]:
def get_group_results(df_results, df_teams, group):
    df_aux = df_teams[df_teams['group']==group].copy()
    points = list()
    for team in df_aux.team:
        h = df_results[df_results['home_team']==team]['home_points'].sum()
        a = df_results[df_results['away_team']==team]['away_points'].sum()
        points.append(a + h)
    df_aux['pts'] = points
    return df_aux.sort_values(by=['pts'], ascending=False).reset_index(drop=True)
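
The same tally can be sketched in plain Python, which also makes the tie situation visible. The helper `group_table` and the four-team group below are hypothetical. Note that no tie-breaker is defined, so teams level on points simply keep whatever order the sort leaves them in, as happens in the Group H prediction where three teams end on 6 points.

```python
def group_table(results, teams):
    """Total points per team, highest first; ties keep the order of `teams`."""
    points = {team: 0 for team in teams}
    for home, away, home_pts, away_pts in results:
        points[home] += home_pts
        points[away] += away_pts
    # sorted() is stable, so equal totals stay in their original order
    return sorted(points.items(), key=lambda item: item[1], reverse=True)

# Hypothetical predicted points for a four-team group
teams = ["A", "B", "C", "D"]
results = [
    ("A", "B", 3, 0), ("C", "D", 3, 0),
    ("A", "C", 0, 3), ("B", "D", 3, 0),
    ("A", "D", 3, 0), ("B", "C", 3, 0),
]
table = group_table(results, teams)
print(table)
```

A real tie-breaker (goal difference, goals scored) would need predicted scorelines, which this framework does not produce.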

This is how it ends for Group A

In [18]:
get_group_results(df_catar, df_groups, 'A')

Unnamed: 0,team,group,pts
0,Netherlands,A,9
1,Senegal,A,6
2,Ecuador,A,3
3,Qatar,A,0


This is how it ends for Group B

In [19]:
get_group_results(df_catar, df_groups, 'B')

Unnamed: 0,team,group,pts
0,England,B,9
1,Wales,B,6
2,United States,B,3
3,Iran,B,0


This is how it ends for Group C

In [20]:
get_group_results(df_catar, df_groups, 'C')

Unnamed: 0,team,group,pts
0,Argentina,C,9
1,Mexico,C,6
2,Poland,C,3
3,Saudi Arabia,C,0


This is how it ends for Group D

In [21]:
get_group_results(df_catar, df_groups, 'D')

Unnamed: 0,team,group,pts
0,Denmark,D,9
1,France,D,6
2,Australia,D,3
3,Tunisia,D,0


This is how it ends for Group E

In [22]:
get_group_results(df_catar, df_groups, 'E')

Unnamed: 0,team,group,pts
0,Spain,E,9
1,Germany,E,6
2,Japan,E,3
3,Costa Rica,E,0


This is how it ends for Group F

In [23]:
get_group_results(df_catar, df_groups, 'F')

Unnamed: 0,team,group,pts
0,Belgium,F,9
1,Croatia,F,6
2,Morocco,F,3
3,Canada,F,0


This is how it ends for Group G

In [24]:
get_group_results(df_catar, df_groups, 'G')

Unnamed: 0,team,group,pts
0,Brazil,G,9
1,Serbia,G,6
2,Switzerland,G,3
3,Cameroon,G,0


This is how it ends for Group H

In [25]:
get_group_results(df_catar, df_groups, 'H')

Unnamed: 0,team,group,pts
0,Portugal,H,6
1,South Korea,H,6
2,Uruguay,H,6
3,Ghana,H,0


# 5. Predicting the knockout stage and the champion!

In [27]:
def get_winner(G_, home_team, away_team):
    actual =  get_actual_score(df, home_team, away_team)
    h = get_score(G_, home_team, away_team)
    a = get_score(G_, away_team, home_team)
    
    if (h + a) == 0:
        home = np.mean([0.5, actual[0]])
        away = np.mean([0.5, actual[1]])
    else:
        # Normalize the graph scores so they are comparable with the recent goal share
        home = h / (h + a)
        away = a / (h + a)

        # Average the normalized graph score with the recent goal share
        home = np.mean([home, actual[0]])
        away = np.mean([away, actual[1]])
    
    if home > away:
        return home_team, home, away
    elif home < away:
        return away_team, home, away
    else:
        return 'draw', 0.5, 0.5

In [28]:
def get_knockout_results(df_results, df_teams):
    first_A = get_group_results(df_results, df_teams, 'A')['team'][0]
    second_A = get_group_results(df_results, df_teams, 'A')['team'][1]
    
    first_B = get_group_results(df_results, df_teams, 'B')['team'][0]
    second_B = get_group_results(df_results, df_teams, 'B')['team'][1]
    
    first_C = get_group_results(df_results, df_teams, 'C')['team'][0]
    second_C = get_group_results(df_results, df_teams, 'C')['team'][1]
    
    first_D = get_group_results(df_results, df_teams, 'D')['team'][0]
    second_D = get_group_results(df_results, df_teams, 'D')['team'][1]
    
    first_E = get_group_results(df_results, df_teams, 'E')['team'][0]
    second_E = get_group_results(df_results, df_teams, 'E')['team'][1]
    
    first_F = get_group_results(df_results, df_teams, 'F')['team'][0]
    second_F = get_group_results(df_results, df_teams, 'F')['team'][1]
    
    first_G = get_group_results(df_results, df_teams, 'G')['team'][0]
    second_G = get_group_results(df_results, df_teams, 'G')['team'][1]
    
    first_H = get_group_results(df_results, df_teams, 'H')['team'][0]
    second_H = get_group_results(df_results, df_teams, 'H')['team'][1]
    
    round_16_1 = get_winner(G, first_A, second_B)[0]
    round_16_2 = get_winner(G, first_C, second_D)[0]
    round_16_3 = get_winner(G, first_D, second_C)[0]
    round_16_4 = get_winner(G, first_B, second_A)[0]
    round_16_5 = get_winner(G, first_E, second_F)[0]
    round_16_6 = get_winner(G, first_G, second_H)[0]
    round_16_7 = get_winner(G, first_F, second_E)[0]
    round_16_8 = get_winner(G, first_H, second_G)[0]
    
    quart_1 = get_winner(G, round_16_5, round_16_6)[0]
    quart_2 = get_winner(G, round_16_1, round_16_2)[0]
    quart_3 = get_winner(G, round_16_7, round_16_8)[0]
    quart_4 = get_winner(G, round_16_3, round_16_4)[0]
    
    semi_1 = get_winner(G, quart_2, quart_1)[0]
    semi_2 = get_winner(G, quart_4, quart_3)[0]
    
    aux = [quart_2, quart_1]
    aux.remove(semi_1)
    aux2 = [quart_4, quart_3]
    aux2.remove(semi_2)
    third = get_winner(G, aux[0], aux2[0])[0]
    
    final = get_winner(G, semi_1, semi_2)[0]
    
    print('---------------------------------------------------------------------------')
    print('\n                        Knockout Stage:')
    print('\n---------------------------------------------------------------------------')
    print('\nMatch 1: ', first_A, 'vs', second_B, '| Winner:', round_16_1)
    print('\nMatch 2: ', first_C, 'vs', second_D, '| Winner:', round_16_2)
    print('\nMatch 3: ', first_D, 'vs', second_C, '| Winner:', round_16_3)
    print('\nMatch 4: ', first_B, 'vs', second_A, '| Winner:', round_16_4)
    print('\nMatch 5: ', first_E, 'vs', second_F, '| Winner:', round_16_5)
    print('\nMatch 6: ', first_G, 'vs', second_H, '| Winner:', round_16_6)
    print('\nMatch 7: ', first_F, 'vs', second_E, '| Winner:', round_16_7)
    print('\nMatch 8: ', first_H, 'vs', second_G, '| Winner:', round_16_8)
    print('\n---------------------------------------------------------------------------')
    print('\n                       Quarter-finals:')
    print('\n---------------------------------------------------------------------------')
    print('\nMatch 1: ', round_16_5, 'vs', round_16_6, '| Winner:', quart_1)
    print('\nMatch 2: ', round_16_1, 'vs', round_16_2, '| Winner:', quart_2)
    print('\nMatch 3: ', round_16_7, 'vs', round_16_8, '| Winner:', quart_3)
    print('\nMatch 4: ', round_16_3, 'vs', round_16_4, '| Winner:', quart_4)
    print('\n---------------------------------------------------------------------------')
    print('\n                        Semi-finals:')
    print('\n---------------------------------------------------------------------------')
    print('\nMatch 1: ', quart_2, 'vs', quart_1, '| Winner:', semi_1)
    print('\nMatch 2: ', quart_4, 'vs', quart_3, '| Winner:', semi_2)
    print('\n---------------------------------------------------------------------------')
    print('\n                    Third place play-off:')
    print('\n---------------------------------------------------------------------------')
    print('\nMatch 1: ', aux[0], 'vs', aux2[0], '| Third-place:', third)
    print('\n---------------------------------------------------------------------------')
    print('\n                            Final:')
    print('\n---------------------------------------------------------------------------')
    print('\nMatch 1: ', semi_1, 'vs', semi_2, '| Champion:', final)
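
The hand-written pairings above could also be expressed as a generic single-elimination loop. The sketch below is only an illustration under a few assumptions: `play_round`, `run_bracket`, the `strength` table and `stronger` are all made up, with `stronger` standing in for `get_winner`. It also assumes the list is already in bracket order, has a power-of-two length, and that no match ends in a draw. The third-place play-off is left out.

```python
def play_round(teams, pick_winner):
    """Pair adjacent teams and return the winners in bracket order."""
    return [pick_winner(teams[i], teams[i + 1]) for i in range(0, len(teams), 2)]

def run_bracket(teams, pick_winner):
    """Play rounds until one team remains; returns (champion, rounds)."""
    rounds = [list(teams)]
    while len(rounds[-1]) > 1:
        rounds.append(play_round(rounds[-1], pick_winner))
    return rounds[-1][0], rounds

# Hypothetical strengths standing in for the notebook's get_winner
strength = {"A1": 8, "B2": 3, "C1": 6, "D2": 7, "E1": 5, "F2": 2, "G1": 9, "H2": 4}

def stronger(x, y):
    return x if strength[x] >= strength[y] else y

champion, rounds = run_bracket(list(strength), stronger)
print(rounds[1])  # winners of the first round
print(champion)
```

Keeping every round in `rounds` makes it easy to print the bracket afterwards instead of naming each match by hand.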

In [29]:
get_knockout_results(df_catar, df_groups)

---------------------------------------------------------------------------

                        Knockout Stage:

---------------------------------------------------------------------------

Match 1:  Netherlands vs Wales | Winner: Netherlands

Match 2:  Argentina vs France | Winner: Argentina

Match 3:  Denmark vs Mexico | Winner: Denmark

Match 4:  England vs Senegal | Winner: England

Match 5:  Spain vs Croatia | Winner: Spain

Match 6:  Brazil vs South Korea | Winner: Brazil

Match 7:  Belgium vs Germany | Winner: Germany

Match 8:  Portugal vs Serbia | Winner: Portugal

---------------------------------------------------------------------------

                       Quarter-finals:

---------------------------------------------------------------------------

Match 1:  Spain vs Brazil | Winner: Brazil

Match 2:  Netherlands vs Argentina | Winner: Argentina

Match 3:  Germany vs Portugal | Winner: Germany

Match 4:  Denmark vs England | Winner: England

-----------------------

# In the end, it seems like Brazil will be the world champion!