# COMP0036: Group Coursework 

### Introduction

How does one "beat the bookie"? Football is governed by a combination of skill and chance, so predicting match outcomes and scores can seem like an intractable task. The development of complex learning-based systems has nevertheless allowed us to explore this problem domain effectively. Historically, many attempts to predict the outcome of football matches have used the number of goals scored by each team as a proxy for that team's eventual success. In contrast, this project explores new model design hypotheses, trained on an enhanced data set of in-game match events, to reflect the complex, multivariate nature of football. Furthermore, we assess our models' performance using customised evaluation metrics and compare it with that of the bookmakers' models.

### Package Import

In [253]:
import os
import sys
import re
from math import ceil
from IPython.core.display import HTML
import random
from collections import defaultdict

#Standard Python libraries for data and visualisation
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import plotly as ply

#Import models
from sklearn.ensemble import RandomForestRegressor
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.model_selection import PredefinedSplit, KFold
from sklearn.metrics import make_scorer
from sklearn.svm import SVC

#Import error metric
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score, f1_score
from sklearn.metrics import accuracy_score

#Import data munging tools
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import PolynomialFeatures

#Display charts in the notebook
%matplotlib inline

### Data Import

### Shots xG model

The data for the shot-based expected goals (xG) model was obtained from https://fbref.com. It consists of the shots made in each match and match highlights, recorded at MAKO (minute after kick-off). 
Shot data was limited to the last 4 seasons, i.e., <b>2017-2021</b>.
The data was scraped from the webpage with a custom-built scraper.
The scraper is available at ... insert here

Currently our data contains the following values:
    <li><b>Minute</b>: minutes after kick-off at which the shot took place; minutes after the first half have 100 added to them.</li> 
    <li><b>Player</b>: player making the shot</li>
    <li><b>Squad</b>: squad the player is from</li>
    <li><b>Against</b>: squad against which shot is made</li>
    <li><b>Outcome</b>: whether a shot is a goal|blocked|saved|etc.</li>
    <li><b>Distance</b>: distance from the goal-post the shot was made</li>
    <li><b>Body Part</b>: body part used to make the shot</li>
    <li><b>Notes</b>: what kind of shot it was e.g. header|volley|etc.</li>
    <li><b>SCA 1 Player</b>: player inducing event leading to shot</li> 
    <li><b>SCA 1 Event</b>: event leading to shot</li>
    <li><b>SCA 2 Player</b>: player inducing event leading to 'SCA 1 Event'</li>
    <li><b>SCA 2 Event</b>: event leading to 'SCA 1 Event'</li>
    <li><b>Timestamp</b>: time and date of the kick-off</li>
    <li><b>Score</b>: score of the squad at the time of the shot being taken</li>
    <li><b>Player Advantage</b>: whether the squad has more players than the other</li>
    <li><b>Threat</b>: threat level of the player making the shot</li>

In [172]:
shot_data = pd.read_csv('data/fantasy-league/shot_data.csv', index_col=0)

In [173]:
shot_data.head()

Unnamed: 0,Timestamp,Score,Player Advantage,Minute,Player,Squad,Against,Outcome,Distance,Body Part,Notes,SCA 1 Player,SCA 1 Event,SCA 2 Player,SCA 2 Event,Threat
0,2016-08-13 12:30:00,0.0,0.0,147,Riyad Mahrez,Leicester City,Hull City,Goal,,,Penalty Kick,Goal,,,,720.0
1,2016-08-13 17:30:00,1.0,0.0,4,Sergio Agüero,Manchester City,Sunderland,Goal,,,Penalty Kick,—,,,,720.0
2,2016-08-15 20:00:00,1.0,0.0,147,Eden Hazard,Chelsea,West Ham United,Goal,,,Penalty Kick,Yellow Card,,,,627.0
3,2016-08-19 20:00:00,2.0,0.0,152,Zlatan Ibrahimović,Manchester United,Southampton,Goal,,,Penalty Kick,Goal,,,,627.0
4,2016-08-20 12:30:00,1.0,0.0,27,Sergio Agüero,Manchester City,Stoke City,Goal,,,Penalty Kick,Yellow Card,,,,


### Non-shots xG model

#### Load Data

For the non-shot-based expected goals (xG) model, we have obtained a dataset from [football-data.co.uk](https://www.football-data.co.uk/) consisting of football match information over the past 10+ years. We have reduced this dataset to data from the last 5 seasons (including the current one), i.e., __2016-2021__. This version contains 1684 samples, each of which consists of 12 features and 2 labels: the full time home team goals (FTHG) and the full time away team goals (FTAG). The columns, a match identifier followed by the 12 features, are:

1. GameID = Unique ID for the match
2. Date = Match Date (dd/mm/yy)
3. HomeTeam = Home Team
4. AwayTeam = Away Team
5. Referee = Match Referee
6. HC = Home Team Corners
7. AC = Away Team Corners
8. HF = Home Team Fouls Committed
9. AF = Away Team Fouls Committed
10. HY = Home Team Yellow Cards
11. AY = Away Team Yellow Cards
12. HR = Home Team Red Cards
13. AR = Away Team Red Cards

In [174]:
DATASET_PATH = os.path.join(os.getcwd(), "data/non-shot-xG/non_shot_data.csv")
complete_data = pd.read_csv(DATASET_PATH)

In [175]:
complete_data.head()

Unnamed: 0,GameID,Date,HomeTeam,AwayTeam,Referee,HC,AC,HF,AF,HY,AY,HR,AR,FTHG,FTAG,FTR
0,1,13/08/2016,Burnley,Swansea,J Moss,7,4,10,14,3,2,0,0,0,1,A
1,2,13/08/2016,Crystal Palace,West Brom,C Pawson,3,6,12,15,2,2,0,0,0,1,A
2,3,13/08/2016,Everton,Tottenham,M Atkinson,5,6,10,14,0,0,0,0,1,1,D
3,4,13/08/2016,Hull,Leicester,M Dean,5,3,8,17,2,2,0,0,2,1,H
4,5,13/08/2016,Man City,Sunderland,R Madley,9,6,11,14,1,2,0,0,2,1,H


### Data Transformation and Exploration

### Shots xG model

We convert 'Timestamp' to datetime objects.

In [176]:
shot_data['Timestamp'] = pd.to_datetime(shot_data['Timestamp'])

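We assume that shot patterns change near the end of a match, as attacks become more aggressive. Since minutes after the first half have 100 added to them, a 'Minute' value above 185 flags a shot taken in the closing minutes.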

In [177]:
shot_data['End_close'] = (shot_data['Minute'] > 185).astype(int)
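The 'Minute' encoding and the `End_close` flag can be illustrated with a small sketch. It assumes that a second-half value is stored as 100 plus the match minute (so 147 would be the 47th minute); this reading is inferred from the column description and the 185 threshold, and the helper names are hypothetical.

```python
def decode_minute(encoded):
    """Return (half, match_minute) for an encoded 'Minute' value.

    Assumption: second-half minutes are stored as 100 + match minute.
    """
    if encoded > 100:
        return 2, encoded - 100
    return 1, encoded


def is_end_close(encoded, threshold=185):
    """Mirror of the End_close flag: 1 if the shot came after the threshold."""
    return int(encoded > threshold)


print(decode_minute(147), is_end_close(147))
print(decode_minute(188), is_end_close(188))
```

Under this assumption, only shots after the 85th minute of the match receive `End_close = 1`.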

Null values in 'Notes' correspond to normal shots, so we fill them with 'normal'.

In [178]:
shot_data['Notes'] = shot_data['Notes'].fillna('normal')

##### Separate Shot data by type
We assume that shots of different types have different probabilities of being a goal from the same distance.

In [179]:
def shot_data_by_type(type_name, df):
    # collect notes including the substring type_name, e.g. 'volley'
    types = [v for v in df['Notes'].unique() if type_name in v.lower()]
    # extract shots of the type specified
    type_df = df[[n in types for n in df['Notes']]]
    # deduce outcome of shot i.e. Goal or not Goal
    type_goals = type_df['Outcome'] == 'Goal'
    # drop unneeded columns
    type_df = type_df.drop(columns=['Squad', 'Against', 'Outcome', 'Player', 
                                    'Body Part', 'SCA 1 Player', 'SCA 1 Event', 
                                   'SCA 2 Player', 'SCA 2 Event'])
    # one-hot encoding for subtypes
    type_df = pd.concat([type_df, pd.get_dummies(type_df['Notes'])], axis=1)
    # add a new column with label
    type_df['Goal'] = type_goals.astype(int)
    
    return type_df
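To see what the substring match and the one-hot encoding do, here is a minimal sketch on a toy frame. The 'Notes' values and the helper `select_type` are made up for illustration and are not taken from the scraped data; the helper uses a vectorised string match instead of the list comprehension above.

```python
import pandas as pd

# Hypothetical shots; real 'Notes' values may differ.
toy = pd.DataFrame({
    'Notes': ['Volley', 'Volley, Deflected', 'Header', 'normal'],
    'Outcome': ['Goal', 'Saved', 'Blocked', 'Goal'],
})


def select_type(type_name, df):
    # keep rows whose note contains the type name (case-insensitive)
    mask = df['Notes'].str.lower().str.contains(type_name, regex=False)
    out = df[mask].copy()
    # binary label: 1 for a goal, 0 otherwise
    out['Goal'] = (out['Outcome'] == 'Goal').astype(int)
    # one indicator column per note subtype
    return pd.concat([out, pd.get_dummies(out['Notes'])], axis=1)


volleys = select_type('volley', toy)
print(volleys)
```

Both volley rows are kept, and each distinct note becomes its own indicator column.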

Types of shots, excluding 'normal', which contains everything but those below.

In [180]:
types = ['volley', 'header', 'free kick', 'overhead', 'back heel', 'penalty kick']
# collect all DataFrames for each type
type_dfs = dict()
# sets to keep records of which types of shots have been collected
all_types = set(shot_data['Notes'].unique())
used = set()

for t in types:
    # using the function, separates the shots by type
    type_dfs[t] = shot_data_by_type(t, shot_data)
    # adds used types to the set 'used'
    used = used.union(set([v for v in shot_data['Notes'].unique() if t in v.lower()]))

The remaining shot types are added to the 'normal' shots.

In [181]:
rest_of_shots = list(all_types.difference(used))
new_names = []

# add normal to each type left for use of shot_data_by_type function
for i in range(len(rest_of_shots)):
    current = rest_of_shots[i]
    new_names.append(current + ' normal' if not 'normal' == current else current)
    
# change notes according to above changes in the dataframe
for o, n in zip(rest_of_shots, new_names):
    # 'normal' does not change
    if o != n:
        originals = shot_data['Notes'] == o
        shot_data.loc[originals, 'Notes'] = n

In [182]:
type_dfs['normal'] = shot_data_by_type('normal', shot_data)

### Non-shots xG model

Now we will take a look at the different features, then sanitise and standardise them for use in other models.

__Note__: Standard team names and referee names along with their respective unique IDs are located in [this](data/standard) directory

#### 1. Dropping columns that are not essential to train our model.

In [183]:
general_training_data = complete_data.drop(['GameID','Date'], axis=1)

#### 2. Encode names of the teams and referees using standardised data

In [184]:
teams_data = pd.read_csv(os.path.join(os.getcwd(), "data/standard/standard.teamnames.csv"))
referee_data = pd.read_csv(os.path.join(os.getcwd(), "data/standard/standard.referee.names.csv"))

In [185]:
# Generating teams mappings 
teamname, teamID = list(teams_data['Standard teamname']), list(teams_data['TeamID'])
teamID_mapping = dict(zip(teamname, teamID))

generate_teamID_mappings = lambda teamnames: [teamID_mapping[teamname] for teamname in teamnames]

In [186]:
# Generating referees mappings 
referee, refereeID = list(referee_data['Standard referee name']), list(referee_data['RefereeID'])
refereeID_mapping = dict(zip(referee, refereeID))

generate_refereeID_mappings = lambda referees: [refereeID_mapping[referee] for referee in referees]
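The list-comprehension lookups above raise a `KeyError` on the first name that is missing from the standard mapping. As a hedged alternative sketch, `Series.map` with a dictionary leaves unmapped names as `NaN`, which makes it easy to list every non-standard name at once. The two IDs below follow the values visible in this notebook's output; 'Swansea City' is a made-up non-standard spelling.

```python
import pandas as pd

# Subset of the mapping; IDs follow the encoded output shown in this notebook.
team_ids = {'Burnley': 10, 'Swansea': 34}

# 'Swansea City' is a hypothetical spelling that is not in the mapping.
names = pd.Series(['Burnley', 'Swansea', 'Swansea City'])

mapped = names.map(team_ids)  # unmapped names become NaN
unmapped = names[mapped.isna()].unique().tolist()

print(mapped.tolist())
print(unmapped)
```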

#### Applying transformations to the (general) training dataset.

In [187]:
# Teams
general_training_data['HomeTeam'] = generate_teamID_mappings(general_training_data['HomeTeam'])
general_training_data['AwayTeam'] = generate_teamID_mappings(general_training_data['AwayTeam'])

# Referees
general_training_data['Referee'] = generate_refereeID_mappings(general_training_data['Referee'])

#### 3. Integrating expected goals (xG) data from shots-based model 

Extracting the results of the shots-based model.

In [188]:
shots_xg_predictions = pd.read_csv(os.path.join(os.getcwd(), 'output/shots_xG_predictions.csv'))

In [189]:
# Add expected goals for the home team
general_training_data['xHG'] = shots_xg_predictions['xG_h']

# Add expected goals for the away team
general_training_data['xAG'] = shots_xg_predictions['xG_a']

In [190]:
general_training_data

Unnamed: 0,HomeTeam,AwayTeam,Referee,HC,AC,HF,AF,HY,AY,HR,AR,FTHG,FTAG,FTR,xHG,xAG
0,10,34,12,7,4,10,14,3,2,0,0,0,1,A,0.000000,0.995993
1,13,37,7,3,6,12,15,2,2,0,0,0,1,A,0.000000,0.995993
2,14,35,18,5,6,10,14,0,0,0,0,1,1,D,0.955742,0.995993
3,18,20,20,5,3,8,17,2,2,0,0,2,1,H,1.955742,0.999136
4,22,33,34,9,6,11,14,1,2,0,0,2,1,H,0.999340,0.995993
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1679,9,40,1,5,8,13,8,2,1,0,0,3,3,D,0.994346,0.996148
1680,37,1,18,3,5,7,4,1,2,0,0,0,4,A,0.000000,3.995993
1681,25,20,33,3,6,10,11,0,2,0,0,1,2,A,0.955742,1.995993
1682,12,22,3,5,3,11,10,3,1,0,0,1,3,A,0.955742,2.995993


For the purposes of the non-shot-based model, we will split the dataset into two: one containing data for predicting the __FTHG__ (Full Time Home Goals) and one containing data for predicting the __FTAG__ (Full Time Away Goals).

#### 4. Split the dataset into two parts: ``home_training_data`` and ``away_training_data``

In [191]:
X_home = home_training_data = general_training_data.drop(['FTAG', 'FTHG', 'AC', 'xAG', 'FTR'], axis=1)

In [192]:
X_home.head()

Unnamed: 0,HomeTeam,AwayTeam,Referee,HC,HF,AF,HY,AY,HR,AR,xHG
0,10,34,12,7,10,14,3,2,0,0,0.0
1,13,37,7,3,12,15,2,2,0,0,0.0
2,14,35,18,5,10,14,0,0,0,0,0.955742
3,18,20,20,5,8,17,2,2,0,0,1.955742
4,22,33,34,9,11,14,1,2,0,0,0.99934


In [193]:
X_away = away_training_data = general_training_data.drop(['FTHG', 'FTAG', 'HC', 'xHG', 'FTR'], axis=1)

In [194]:
X_away.head()

Unnamed: 0,HomeTeam,AwayTeam,Referee,AC,HF,AF,HY,AY,HR,AR,xAG
0,10,34,12,4,10,14,3,2,0,0,0.995993
1,13,37,7,6,12,15,2,2,0,0,0.995993
2,14,35,18,6,10,14,0,0,0,0,0.995993
3,18,20,20,3,8,17,2,2,0,0,0.999136
4,22,33,34,6,11,14,1,2,0,0,0.995993


### Model training and validation

### Shots xG model

##### Model Analysis

In [195]:
def clean_split_analysis(data):
    # drop non-feature columns
    data = data.drop(columns=['Timestamp', 'Notes', 'Minute'])
    # fill null values in threat with mean
    data['Threat'] = data['Threat'].fillna(data['Threat'].mean())
    # separate label columns from data
    goals = data['Goal']
    data = data.drop(columns=['Goal'])
    # fill any null value, only true for penalty kicks' distance
    data = data.fillna(0)
    
    scaler = StandardScaler()
    
    data[['Distance', 'Threat']] = scaler.fit_transform(data[['Distance', 'Threat']])
    
    return train_test_split(data, goals, test_size=0.1, random_state=10)

We iterate over each selected classifier and collect scores for each shot type

In [196]:
classifiers = [GradientBoostingClassifier(max_depth=5, learning_rate=0.1, subsample=0.15),
              GaussianNB(), LogisticRegression(max_iter=1000)]

cls_names = ['xgbc',
            'gaussian',
            'log']

cls_scores = []

for cls in classifiers:
    # variables to collect data
    correct_shots = 0
    goals = 0
    false_goals = 0
    f1_scores = 0
    total_pred_goals = 0
    total_shots = 0
    total_goals = 0

    # training and collecting result from each classifier over each type of shot
    for k, v in type_dfs.items():

        x_train, x_test, y_train, y_test = clean_split_analysis(v)
        cls = cls.fit(x_train, y_train)
        pred = cls.predict(x_test)

        total_shots += len(y_test)
        total_goals += sum(y_test > 0)
        total_pred_goals += sum(pred > 0)

        false_goals += sum(pred[pred > 0] != y_test[pred > 0])
        correct_shots += sum(pred == y_test)
        goals += sum(pred[y_test > 0] == y_test[y_test > 0])
        f1_scores += f1_score(y_test, pred) * len(y_test)

    cls_scores.append([correct_shots / total_shots, goals / total_goals, 
                   false_goals / total_pred_goals, f1_scores / total_shots])
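The four scores collected in the loop can be written out on a single toy prediction vector. This is a minimal sketch with made-up labels and a hypothetical helper name; note that the loop above pools counts over shot types and weights each type's F1 score by its test-set size, whereas the sketch computes everything on one set.

```python
def shot_metrics(y_true, y_pred):
    """Accuracy, recall, false-goal rate and F1 for binary goal labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    # share of actual goals that were predicted as goals
    recall = tp / (tp + fn) if tp + fn else 0.0
    # share of predicted goals that were not goals
    false_goals = fp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0
    return accuracy, recall, false_goals, f1


# Hypothetical labels: 2 goals among 8 shots, one found, one false alarm.
y_true = [1, 0, 0, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0]
print(shot_metrics(y_true, y_pred))
```

The example also shows why accuracy alone is misleading here: it is 0.75 even though only half of the goals are found.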

We put these readings into a DataFrame object

In [197]:
# one row per classifier: its name followed by its four scores
# (DataFrame.append was removed in pandas 2.0, so the frame is built in one step)
classifer_score_df = pd.DataFrame(
    [[n, *s] for n, s in zip(cls_names, cls_scores)],
    columns=['Classifier', 'Accuracy', 'Recall', 'False Goals', 'F1 score'])

In [198]:
fig = px.bar(classifer_score_df, y=['Accuracy', 'Recall', 'False Goals', 'F1 score'], x='Classifier', barmode='group', title='Classifier comparison')
ply.io.write_image(fig, 'model_compare_shot_data.jpeg')
fig.show()

display(HTML('''<ol>
            <li><strong>Accuracy</strong>: correctly predicted shots / total shots</li>
            <li><strong>Recall</strong>: correctly classified goals / actual goals</li>
            <li><strong>False Goals</strong>: incorrectly predicted goals / predicted goals</li>
            <li><strong>F1 score</strong>: F1 score of the classifier</li>
        </ol><br>
        Accuracy is not a reliable metric in this case because the labels are imbalanced.<br>
        We instead rely on F1 scores while keeping an eye on true positives and false positives.<br>
        F1 scores for all three models are similar, so we focus on true positives as a secondary metric, since 
        our aim is to correctly predict goals.
        <br><br>
        <strong>From the data we can conclude that Logistic Regression is the best option for 
        this classification.</strong>'''))

ValueError: Image generation requires the psutil package.

Install using pip:
    $ pip install psutil

Install using conda:
    $ conda install psutil


##### Parameter tuning

In [199]:
def clean_split_validation(data):
    # drop non-feature columns
    data = data.drop(columns=['Timestamp', 'Notes', 'Minute'])
    # fill null values in threat with mean
    data['Threat'] = data['Threat'].fillna(data['Threat'].mean())
    # separate label columns from data
    goals = data['Goal']
    data = data.drop(columns=['Goal'])
    # fill any null value, only true for penalty kicks' distance
    data = data.fillna(0)
    
    scaler = StandardScaler()
    
    data[['Distance', 'Threat']] = scaler.fit_transform(data[['Distance', 'Threat']])
    
    return train_test_split(data, goals, test_size=0.1, random_state=10)

We iterate over each candidate value of max_depth for the gradient boosting classifier and collect scores for each shot type.

In [200]:
classifiers = [GradientBoostingClassifier(max_depth=2, learning_rate=0.1, subsample=0.15),
               GradientBoostingClassifier(max_depth=3, learning_rate=0.1, subsample=0.15),
              GradientBoostingClassifier(max_depth=4, learning_rate=0.1, subsample=0.15),
              GradientBoostingClassifier(max_depth=5, learning_rate=0.1, subsample=0.15),
              GradientBoostingClassifier(max_depth=7, learning_rate=0.1, subsample=0.15),
              GradientBoostingClassifier(max_depth=10, learning_rate=0.1, subsample=0.15)]

cls_scores = []

cls_names = ['2', '3', '4', '5', '7', '10']

for cls in classifiers:
    # variables to collect data
    correct_shots = 0
    goals = 0
    false_goals = 0
    f1_scores = 0
    total_pred_goals = 0
    total_shots = 0
    total_goals = 0

    # training and collecting result from each classifier over each type of shot
    for k, v in type_dfs.items():

        x_train, x_test, y_train, y_test = clean_split_validation(v)
        cls = cls.fit(x_train, y_train)
        pred = cls.predict(x_test)

        total_shots += len(y_test)
        total_goals += sum(y_test > 0)
        total_pred_goals += sum(pred > 0)

        false_goals += sum(pred[pred > 0] != y_test[pred > 0])
        correct_shots += sum(pred == y_test)
        goals += sum(pred[y_test > 0] == y_test[y_test > 0])
        f1_scores += f1_score(y_test, pred) * len(y_test)

    cls_scores.append([correct_shots / total_shots, goals / total_goals, 
                   false_goals / total_pred_goals, f1_scores / total_shots])

We put these readings into a DataFrame object

In [201]:
# one row per max_depth setting: its name followed by its four scores
# (DataFrame.append was removed in pandas 2.0, so the frame is built in one step)
classifer_score_df = pd.DataFrame(
    [[n, *s] for n, s in zip(cls_names, cls_scores)],
    columns=['Classifier', 'Accuracy', 'Recall', 'False Goals', 'F1 score'])

In [202]:
fig = px.bar(classifer_score_df, y=['Accuracy', 'Recall', 'False Goals', 'F1 score'], x='Classifier', barmode='group', title='Classifier comparison')
ply.io.write_image(fig, "model_valid_xgbc_shot_data.jpeg")
fig.show()

display(HTML('''<ol>
            <li><strong>Accuracy</strong>: correctly predicted shots / total shots</li>
            <li><strong>Recall</strong>: correctly classified goals / actual goals</li>
            <li><strong>False Goals</strong>: incorrectly predicted goals / predicted goals</li>
            <li><strong>F1 score</strong>: F1 score of the classifier</li>
        </ol><br>
        Accuracy is not a reliable metric in this case because the labels are imbalanced.<br>
        We instead rely on F1 scores while keeping an eye on true positives and false positives.<br>
        F1 scores for all max depth settings are similar, so we focus on true positives as a secondary metric, since 
        our aim is to correctly predict goals.
        <br><br>
        <strong>From the data we can conclude that a max depth of 4 is the best option.<br>
        Since Logistic Regression performs better than the gradient boosting classifier (XGBC) with this setting, 
        we use Logistic Regression as our model.</strong>'''))

ValueError: Image generation requires the psutil package.

Install using pip:
    $ pip install psutil

Install using conda:
    $ conda install psutil


##### Model training

In [203]:
def get_data_train_ready(data):
    # drop columns which are not selected as features in our model
    data = data.drop(columns=['Timestamp', 'Notes', 'Minute'])
    # fill empty threat values with mean
    data['Threat'] = data['Threat'].fillna(data['Threat'].mean())
    # separate 'Goal' columns, they are considered labels
    goals = data['Goal']
    data = data.drop(columns=['Goal'])
    
    return train_test_split(data, goals, test_size=0.1, random_state=10)

In [204]:
def get_classifier(x_train, x_test, y_train, y_test):
    
    cls = LogisticRegression(max_iter=1000)
    cls.fit(x_train, y_train)
    
    return cls

In [205]:
classifier_by_type = {}

for k, v in type_dfs.items():
    
    model_ready_data = get_data_train_ready(type_dfs[k].fillna(0))
    
    clsf = get_classifier(*model_ready_data)
    
    classifier_by_type[k] = clsf
    

##### Predict

In [206]:
# load shot_data with match_ids
path = os.path.join(os.getcwd(), "output/shot_data.csv")
shot_data_m_ids = pd.read_csv(path)
shot_data_m_ids['MatchID'] = shot_data_m_ids['MatchID'] - 3040
shot_data_m_ids['Timestamp'] = pd.to_datetime(shot_data_m_ids['Timestamp'])
shot_data_m_ids['End_close'] = (shot_data_m_ids['Minute'] > 185).astype(int)
shot_data_m_ids['Notes'] = shot_data_m_ids['Notes'].fillna('normal')

##### Separate Shot data by type
We assume that shots of different types have different probabilities of being a goal from the same distance.

In [207]:
def shot_data_by_type(type_name, df):
    # collect notes including the substring type_name, e.g. 'volley'
    types = [v for v in df['Notes'].unique() if type_name in v.lower()]
    # extract shots of the type specified
    type_df = df[[n in types for n in df['Notes']]]
    type_df = type_df.reset_index(drop=True)
    # deduce outcome of shot i.e. Goal or not Goal
    type_goals = type_df['Outcome'] == 'Goal'
    # drop unneeded columns
    type_df = type_df.drop(columns=['Against', 'Outcome', 'Player', 
                                    'Body Part', 'SCA 1 Player', 'SCA 1 Event', 
                                   'SCA 2 Player', 'SCA 2 Event'])
    # one-hot encoding for subtypes
    type_df = pd.concat([type_df, pd.get_dummies(type_df['Notes'])], axis=1)
    # add a new column with label
    type_df['Goal'] = type_goals.astype(int)
    
    return type_df

Types of shots, excluding 'normal', which contains everything but those below.

In [208]:
types = ['volley', 'header', 'free kick', 'overhead', 'back heel', 'penalty kick']
# collect all DataFrames for each type
type_dfs = dict()
# sets to keep records of which types of shots have been collected
all_types = set(shot_data_m_ids['Notes'].unique())
used = set()

for t in types:
    # using the function, separates the shots by type
    type_dfs[t] = shot_data_by_type(t, shot_data_m_ids)
    # adds used types to the set 'used'
    used = used.union(set([v for v in shot_data_m_ids['Notes'].unique() if t in v.lower()]))

The remaining shot types are added to the 'normal' shots.

In [209]:
all_types.difference(used)

{'Deflected', 'Lob', 'Open goal', 'normal'}

In [210]:
rest_of_shots = list(all_types.difference(used))
new_names = list(range(len(rest_of_shots)))

# add normal to each type left for use of shot_data_by_type function
for i in range(len(rest_of_shots)):
    current = rest_of_shots[i]
    new_names[i] = current + ' normal' if not 'normal' == current else current
    
# change notes according to above changes in the dataframe
for o, n in zip(rest_of_shots, new_names):
    # 'normal' does not change
    if o != n:
        originals = shot_data_m_ids['Notes'] == o
        shot_data_m_ids.loc[originals, 'Notes'] = n


In [211]:
type_dfs['normal'] = shot_data_by_type('normal', shot_data_m_ids)

##### We sum the goal probabilities of each team's shots in a match, per shot type

In [212]:
def calc_expected_goals(key, type_dfs, cls_s, collecting_df):
    '''Calculates the expected goals in a match by a team based on shots made and classifiers trained.'''
    data = type_dfs[key]
    # fill empty threat values
    data['Threat'] = data['Threat'].fillna(data['Threat'].mean())
    # group the data by 'MatchID'
    ids = data.groupby('MatchID').groups
    # get classifier for type of shot
    cls = cls_s[key]
    
    for k, v in ids.items():
        # get shots for the match (label-based, since the groups hold index labels)
        match_data = data.loc[v]
        # split data over Squads
        teams = match_data.groupby('Squad').groups
        
        for t, d in teams.items():
            # team shot data
            team_data = data.loc[d]
            # drop Squad column
            team_data = team_data.drop(columns=['Squad', 'Goal',
                                                'Timestamp', 'Notes', 
                                                'Minute', 'MatchID'])
            # fill empty values with 0
            team_data = team_data.fillna(0)
            prob = cls.predict_proba(team_data)
            # sum over probabilities of all shots
            # (DataFrame.append was removed in pandas 2.0, so we concatenate)
            row = pd.DataFrame([[k, t, prob[:, 1].sum()]])
            collecting_df = pd.concat([collecting_df, row])
    
    return collecting_df
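The core idea of the function above is that a team's expected goals are the sum of the goal probabilities of its shots. A minimal, library-free sketch with made-up probabilities (the team names and values are illustrative only, and the helper name is hypothetical):

```python
from collections import defaultdict

# (match id, squad, predicted probability that the shot is a goal)
# All values are hypothetical.
shots = [
    (4, 'Leicester', 0.5),
    (4, 'Leicester', 0.25),
    (4, 'Hull', 0.125),
    (5, 'Man City', 0.75),
]


def expected_goals(shots):
    """Sum shot probabilities per (match, squad) pair."""
    totals = defaultdict(float)
    for match_id, squad, prob in shots:
        totals[(match_id, squad)] += prob
    return dict(totals)


xg = expected_goals(shots)
print(xg)
```

In the notebook the same sum is taken once per shot type, and the per-type totals are combined afterwards.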

In [213]:
collecting_df = pd.DataFrame()
# collect all predicted goals for each team in each match
for k in type_dfs.keys():
    collecting_df = calc_expected_goals(k, type_dfs, classifier_by_type, collecting_df)
# set column names
collecting_df.columns = ['GameID', 'Squad', 'xG']
# sort on 'GameID'
collecting_df = collecting_df.sort_values('GameID').reset_index(drop=True)

In [214]:
display(collecting_df.head())

display(HTML('''
    Each xG value here is the sum of the goal probabilities of a team's shots of one type in a match.<br>
    Next, we add these values up across shot types to get the
    expected goals of each team in a match.
'''))

Unnamed: 0,GameID,Squad,xG
0,4.0,Leicester,0.999136
1,5.0,Man City,0.99934
2,10.0,Chelsea,0.998984
3,11.0,Man United,0.999224
4,14.0,Stoke,0.994077


##### We sum over shot types to get the expected goals of each team in a match

In [215]:
matches = collecting_df.groupby('GameID').groups

In [216]:
# rows are collected in a list because DataFrame.append was removed in pandas 2.0
match_rows = []

for k, v in matches.items():
    # get match rows
    match = collecting_df.loc[v]
    # get team names
    teams = match['Squad'].unique()
    # remove any null rows
    teams = teams[~pd.isnull(teams)]
    
    team_list = []
    # collect expected goals for each team
    for team in teams:
        team_rows = match[match['Squad'] == team]
        team_list.append([team, team_rows['xG'].sum()])
    
    if len(team_list) == 1:
        # only one team appears for this match, leave the other side empty
        match_rows.append([k, *team_list[0], np.nan, np.nan])
    else:
        match_rows.append([k, *team_list[0], *team_list[1]])
    
match_gx = pd.DataFrame(match_rows, 
                        columns=['GameID', 'Squad_a', 'xG_a', 'Squad_b', 'xG_b'])

In [217]:
match_gx.head()

Unnamed: 0,GameID,Squad_a,xG_a,Squad_b,xG_b
0,4.0,Leicester,0.999136,,
1,5.0,Man City,0.99934,,
2,10.0,Chelsea,0.998984,,
3,11.0,Man United,0.999224,,
4,14.0,Stoke,0.994077,Man City,0.994962


##### We now separate Squads with the same MatchID into Home and Away

In [218]:
non_shot_data = pd.read_csv('data/non-shot-xG/non_shot_data.csv', index_col=0)
non_shot_data['Date'] = pd.to_datetime(non_shot_data['Date'], format='%d/%m/%Y')

In [219]:
# rows are collected in a list because DataFrame.append was removed in pandas 2.0
xg_rows = []

for i, m in match_gx.iterrows():
    game_id = int(m['GameID'])
    # due to various versions of the data sets, 
    # some indices might have changed or 
    # may not be the same anymore
    if game_id not in non_shot_data.index:
        continue
    # get match row from non_shot_data
    match_row = non_shot_data.loc[game_id]
    
    # check if squad_a is HomeTeam
    squad_a_home = m['Squad_a'] == match_row['HomeTeam']
    if not squad_a_home:
        # if squad_a is not the HomeTeam, it must be the AwayTeam
        assert m['Squad_a'] == match_row['AwayTeam']
    
    # assign according to squad_a_home deduced above
    if squad_a_home:
        xG_home, xG_away = m['xG_a'], m['xG_b']
    else:
        xG_away, xG_home = m['xG_a'], m['xG_b']
        
    # add as a row
    xg_rows.append([game_id, xG_home, xG_away])
    
# build the DataFrame with named columns
shot_data_xg = pd.DataFrame(xg_rows, columns=['GameID', 'xG_h', 'xG_a'])

In [220]:
shot_data_xg.index = shot_data_xg['GameID']
shot_data_xg = shot_data_xg.drop(columns=['GameID'])

In [221]:
# add xG_a, xG_h as columns to non_shot_data
all_game_data = pd.concat([non_shot_data, shot_data_xg], axis=1)

In [222]:
# save CSV
path = os.path.join(os.getcwd(), "output/shots_xG_predictions.csv")
all_game_data.to_csv(path)

##### Fill missing xG values
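We fill these using the average difference between FT(A|H)G and xG_(a|h), computed over the matches where xG is available and keeping only differences strictly between the 10th and 90th percentiles.<br>
This average difference is then subtracted from FT(A|H)G to get an estimate of xG_(a|h); negative estimates are set to 0.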

In [223]:
path = os.path.join(os.getcwd(), "output/shots_xG_predictions.csv")
all_game_data = pd.read_csv(path)

In [224]:
home_goals = all_game_data[['FTHG', 'xG_h']].dropna()

away_goals = all_game_data[['FTAG', 'xG_a']].dropna()

In [225]:
home_goals['Difference'] = home_goals['FTHG'] - home_goals['xG_h']
a = home_goals['Difference'] > home_goals['Difference'].quantile(0.1)
b = home_goals['Difference'] < home_goals['Difference'].quantile(0.9)
dif = home_goals[np.logical_and(a, b)]['Difference'].mean()

missing = pd.isnull(all_game_data['xG_h'])

# estimate missing xG as goals minus the average difference, floored at 0
estimate = all_game_data.loc[missing, 'FTHG'] - dif
all_game_data.loc[missing, 'xG_h'] = estimate.clip(lower=0)
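The fill procedure can be wrapped in one reusable function. This is a sketch under the same rules as the cell above (trimmed mean of the goal/xG difference, estimates floored at zero); the function name and the toy numbers are hypothetical.

```python
import pandas as pd


def fill_missing_xg(goals, xg, lower=0.1, upper=0.9):
    """Fill NaN xG values with goals minus a trimmed mean difference."""
    # differences are only defined where xG is available
    diff = (goals - xg).dropna()
    # keep differences strictly inside the chosen quantile band
    keep = (diff > diff.quantile(lower)) & (diff < diff.quantile(upper))
    offset = diff[keep].mean()
    # estimate for every row, floored at zero, used only where xG is missing
    estimate = (goals - offset).clip(lower=0)
    return xg.fillna(estimate)


# Hypothetical data: five matches with xG, two without.
goals = pd.Series([1, 1, 1, 1, 1, 2, 0])
xg = pd.Series([1.0, 0.75, 0.5, 0.25, 0.0, None, None])

filled = fill_missing_xg(goals, xg)
print(filled.tolist())
```

Here the trimmed mean difference is 0.5, so the match with 2 goals is filled with 1.5 and the match with 0 goals is floored at 0. Because the estimate is derived from the full time goals themselves, filled values carry information about the label.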

In [226]:
away_goals['Difference'] = away_goals['FTAG'] - away_goals['xG_a']
a = away_goals['Difference'] > away_goals['Difference'].quantile(0.1)
b = away_goals['Difference'] < away_goals['Difference'].quantile(0.9)
dif = away_goals[np.logical_and(a, b)]['Difference'].mean()

missing = pd.isnull(all_game_data['xG_a'])

# estimate missing xG as goals minus the average difference, floored at 0
estimate = all_game_data.loc[missing, 'FTAG'] - dif
all_game_data.loc[missing, 'xG_a'] = estimate.clip(lower=0)

##### Now we check if any values are missing

In [227]:
pd.isnull(all_game_data[['xG_h', 'xG_a']]).sum().sum()

0

In [228]:
complete_data_with_xG = all_game_data

In [229]:
path = os.path.join(os.getcwd(), "output/shots_xG_predictions.csv")
complete_data_with_xG.to_csv(path)

### Non-shots xG model

To train both models (__FTHG__ and __FTAG__), we will use the [Random Forest Regressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html) along with grid search with cross-validation to estimate hyperparameters. An analysis of why this specific learning method was used is presented in the report. 

### FTHG model

In [230]:
Y_home = general_training_data.FTHG

We now determine the best hyperparameter values for our model from a specified range. We optimise two hyperparameters: ``max_depth``, the maximum depth of each decision tree, and ``n_estimators``, the number of trees in the forest.

In [246]:
# Perform grid search to obtain optimal parameter values
gsc = GridSearchCV(
        estimator=RandomForestRegressor(),
        param_grid={
            'max_depth': range(3,7),
            'n_estimators': (10, 50, 100, 500, 1000),
        },
        cv=5, scoring='neg_mean_squared_error', verbose=0, n_jobs=-1)

grid_result = gsc.fit(X_home, Y_home)
best_params = grid_result.best_params_

display(best_params)

{'max_depth': 5, 'n_estimators': 500}
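
Beyond `best_params_`, it can help to see how close the other parameter combinations came. The sketch below is illustrative only: it uses synthetic data from `make_regression` as a stand-in for `X_home` and `Y_home`, and a much smaller grid, so the numbers it prints say nothing about the football data.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for X_home / Y_home (illustrative only)
X_toy, y_toy = make_regression(n_samples=120, n_features=5, noise=10.0, random_state=0)

search = GridSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_grid={'max_depth': [3, 5], 'n_estimators': [10, 20]},
    cv=3, scoring='neg_mean_squared_error',
)
search.fit(X_toy, y_toy)

# One row per parameter combination, best (least negative) score first
columns = ['param_max_depth', 'param_n_estimators', 'mean_test_score', 'std_test_score']
summary = pd.DataFrame(search.cv_results_)[columns].sort_values('mean_test_score', ascending=False)
print(summary)
print(search.best_params_)
```

If the top rows have nearly identical scores, the smaller forest is usually the cheaper choice.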

In [247]:
# Apply best_params to the model
model_home = RandomForestRegressor(
    max_depth=best_params["max_depth"], n_estimators=best_params["n_estimators"], random_state=False, verbose=False
)

After estimating hyperparameters for the model, we randomly split the data into training (80%) and testing (20%) sets so that both sets are representative of the whole data set.

In [248]:
# Split the data into training and testing sets
X_home_train, X_home_test, Y_home_train, Y_home_test = train_test_split(X_home, Y_home, test_size = 0.20, random_state = 42)

Fitting the model on the training data.

In [249]:
model_home.fit(X_home_train, Y_home_train)

RandomForestRegressor(max_depth=5, n_estimators=500, random_state=False,
                      verbose=False)

We will now apply the trained model to make predictions on the test set (`X_home_test`).

In [250]:
Y_home_pred_test = model_home.predict(X_home_test)

Our model has now been trained to learn the relationships between the features and the targets. To evaluate its performance, we use both a train-test split and K-Fold cross-validation. The error metrics chosen here are the __Mean Squared Error (MSE)__, __Mean Absolute Error (MAE)__, and the __Coefficient of Determination (R2 Score)__.

In [251]:
print('Train-test split result:\n')

print('Mean squared error (MSE): %f'
      % mean_squared_error(Y_home_test, Y_home_pred_test))
print('Mean absolute error (MAE): %f'
      % mean_absolute_error(Y_home_test, Y_home_pred_test))
print('Coefficient of determination (R^2): %f'
      % r2_score(Y_home_test, Y_home_pred_test))

Train-test split result:

Mean squared error (MSE): 0.577235
Mean absolute error (MAE): 0.464697
Coefficient of determination (R^2): 0.645010
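
To make the three metrics concrete, the sketch below computes them by hand with NumPy on a handful of made-up values; these numbers are purely illustrative and are not taken from the match data. The hand-computed values should agree with the scikit-learn functions used above.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Made-up goals and predictions, for illustration only
y_true = np.array([0.0, 1.0, 2.0, 3.0])
y_pred = np.array([0.5, 1.0, 1.5, 2.0])

residuals = y_true - y_pred
mse = np.mean(residuals ** 2)        # average squared residual
mae = np.mean(np.abs(residuals))     # average absolute residual
# R^2 = 1 - (residual sum of squares / total sum of squares)
r2 = 1 - np.sum(residuals ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(mse, mean_squared_error(y_true, y_pred))
print(mae, mean_absolute_error(y_true, y_pred))
print(r2, r2_score(y_true, y_pred))
```

MSE penalises large misses more heavily than MAE, while R2 compares the model against always predicting the mean.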


Now we pass the model to the `cross_val_score()` function, which performs K-Fold cross-validation on the given data and returns one score per fold; we average these scores to assess model performance. Because scikit-learn negates error metrics so that a higher score is always better, we take the absolute value before reporting MSE and MAE. The error metrics chosen here are the same: __Mean Squared Error (MSE)__, __Mean Absolute Error (MAE)__, and the __Coefficient of Determination (R2 Score)__.

In [254]:
model_home_mse_scores = cross_val_score(model_home, X_home, Y_home, cv=5, scoring='neg_mean_squared_error')
model_home_mae_scores = cross_val_score(model_home, X_home, Y_home, cv=5, scoring='neg_mean_absolute_error') 
model_home_r2_scores = cross_val_score(model_home, X_home, Y_home, cv=5, scoring='r2')

print('K-fold cross validation result:\n')

print('Mean squared error (MSE): {n}'.format(n=abs(np.mean(model_home_mse_scores))))
print('Mean absolute error (MAE): {n}'.format(n=abs(np.mean(model_home_mae_scores))))
# R2 can be negative for a poor model, so its mean is reported without abs()
print('Coefficient of determination (R2 Score): {n}'.format(n=np.mean(model_home_r2_scores)))

K-fold cross validation result:

Mean squared error (MSE): 0.6138261764042214
Mean absolute error (MAE): 0.46714036725721025
Coefficient of determination (R2 Score): 0.6391747979551518
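
The sign convention of `cross_val_score()` is easy to miss, so here is a small sketch of it on synthetic data (a stand-in generated with `make_regression`, not the match data). Scorers whose names start with `neg_` return negated errors, one per fold, whereas `r2` is returned as-is and may legitimately be negative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic regression problem, for illustration only
X_toy, y_toy = make_regression(n_samples=100, n_features=4, noise=5.0, random_state=0)
model = RandomForestRegressor(n_estimators=20, max_depth=4, random_state=0)

neg_mse = cross_val_score(model, X_toy, y_toy, cv=5, scoring='neg_mean_squared_error')
r2 = cross_val_score(model, X_toy, y_toy, cv=5, scoring='r2')

print('Per-fold negated MSE:', neg_mse)    # five values, all <= 0
print('Mean MSE:', -neg_mse.mean())        # flip the sign to recover the error
print('Mean R^2:', r2.mean())              # no sign flip needed
```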


### Now repeating the process for the FTAG model

In [255]:
Y_away = general_training_data.FTAG

In [256]:
# Perform grid search to obtain optimal parameter values
gsc = GridSearchCV(
        estimator=RandomForestRegressor(),
        param_grid={
            'max_depth': range(3,7),
            'n_estimators': (10, 50, 100, 500, 1000),
        },
        cv=5, scoring='neg_mean_squared_error', verbose=0, n_jobs=-1)

grid_result = gsc.fit(X_away, Y_away)
best_params = grid_result.best_params_

display(best_params)

{'max_depth': 4, 'n_estimators': 1000}

In [257]:
model_away = RandomForestRegressor(
    max_depth=best_params["max_depth"], n_estimators=best_params["n_estimators"], random_state=False, verbose=False
)

In [258]:
# Split the data into training and testing sets
X_away_train, X_away_test, Y_away_train, Y_away_test = train_test_split(X_away, Y_away, test_size = 0.20, random_state = 42)

In [259]:
model_away.fit(X_away_train, Y_away_train)

RandomForestRegressor(max_depth=4, n_estimators=1000, random_state=False,
                      verbose=False)

In [260]:
Y_away_pred_test = model_away.predict(X_away_test)

In [261]:
print('Train-test split result:\n')

print('Mean squared error (MSE): %f'
      % mean_squared_error(Y_away_test, Y_away_pred_test))
print('Mean absolute error (MAE): %f'
      % mean_absolute_error(Y_away_test, Y_away_pred_test))
print('Coefficient of determination (R^2): %f'
      % r2_score(Y_away_test, Y_away_pred_test))

Train-test split result:

Mean squared error (MSE): 0.461291
Mean absolute error (MAE): 0.445057
Coefficient of determination (R^2): 0.649843


In [262]:
model_away_mse_scores = cross_val_score(model_away, X_away, Y_away, cv=5, scoring='neg_mean_squared_error')
model_away_mae_scores = cross_val_score(model_away, X_away, Y_away, cv=5, scoring='neg_mean_absolute_error') 
model_away_r2_scores = cross_val_score(model_away, X_away, Y_away, cv=5, scoring='r2')

print('K-fold cross validation result:\n')

print('Mean squared error (MSE): {n}'.format(n=abs(np.mean(model_away_mse_scores))))
print('Mean absolute error (MAE): {n}'.format(n=abs(np.mean(model_away_mae_scores))))
# R2 can be negative for a poor model, so its mean is reported without abs()
print('Coefficient of determination (R2 Score): {n}'.format(n=np.mean(model_away_r2_scores)))

K-fold cross validation result:

Mean squared error (MSE): 0.5456203163693599
Mean absolute error (MAE): 0.4427483296393975
Coefficient of determination (R2 Score): 0.6252336122742194


### FTHG Results

In [263]:
home_training_data = general_training_data.copy().drop(['FTAG', 'AC', 'xAG', 'FTR'], axis=1)
home_model_input_data = home_training_data.copy().drop(columns=['FTHG'])

In [264]:
home_training_data.head()

Unnamed: 0,HomeTeam,AwayTeam,Referee,HC,HF,AF,HY,AY,HR,AR,FTHG,xHG
0,10,34,12,7,10,14,3,2,0,0,0,0.0
1,13,37,7,3,12,15,2,2,0,0,0,0.0
2,14,35,18,5,10,14,0,0,0,0,1,0.955742
3,18,20,20,5,8,17,2,2,0,0,2,1.955742
4,22,33,34,9,11,14,1,2,0,0,2,0.99934


In [265]:
home_pred_data = pd.get_dummies(home_model_input_data)
home_r = model_home.predict(home_pred_data)
home_r = pd.DataFrame(home_r)

In [266]:
home_r.columns= ['Predicted FTHG']
home_training_data.reset_index(drop=True, inplace=True)
home_results = pd.concat([home_training_data, home_r], axis=1)

### FTAG Results

In [267]:
away_training_data = general_training_data.copy().drop(['FTHG', 'HC', 'xHG', 'FTR'], axis=1)
away_model_input_data = away_training_data.copy().drop(columns=['FTAG'])

In [268]:
away_pred_data = pd.get_dummies(away_model_input_data)
away_r = model_away.predict(away_pred_data)
away_r = pd.DataFrame(away_r)

In [269]:
away_r.columns= ['Predicted FTAG']
away_training_data.reset_index(drop=True, inplace=True)
away_results = pd.concat([away_training_data, away_r], axis=1)

### Merging results of both models

In [271]:
complete_non_shot_predictions = general_training_data.copy()

# Add predicted FTHG
complete_non_shot_predictions['Predicted FTHG'] = home_results['Predicted FTHG']

# Add predicted FTAG
complete_non_shot_predictions['Predicted FTAG'] = away_results['Predicted FTAG']

complete_non_shot_predictions.head()

Unnamed: 0,HomeTeam,AwayTeam,Referee,HC,AC,HF,AF,HY,AY,HR,AR,FTHG,FTAG,FTR,xHG,xAG,Predicted FTHG,Predicted FTAG
0,10,34,12,7,4,10,14,3,2,0,0,0,1,A,0.0,0.995993,0.0,1.119157
1,13,37,7,3,6,12,15,2,2,0,0,0,1,A,0.0,0.995993,0.0,1.096418
2,14,35,18,5,6,10,14,0,0,0,0,1,1,D,0.955742,0.995993,0.989988,1.094541
3,18,20,20,5,3,8,17,2,2,0,0,2,1,H,1.955742,0.999136,2.031948,1.196834
4,22,33,34,9,6,11,14,1,2,0,0,2,1,H,0.99934,0.995993,2.25471,1.10086


In [272]:
path = os.path.join(os.getcwd(), "output/non_shot_predictions.csv")
complete_non_shot_predictions.to_csv(path, index=False)

### ELO Rating classifier

##### Offensive and defensive ELO ratings to predict number of goals

We keep track of two ratings for every team: an offensive rating ($R_O$) and a defensive rating ($R_D$).

We can then predict the number of goals a team will score by taking the difference of their offensive rating and the opponent's defensive rating.

The number of goals scored against them can be calculated by considering it from the opponent's perspective.

$E[\text{team}] = R_O[\text{team}] - R_D[\text{opponent}]$

We can update a team's offensive rating by adding the difference between the actual number of goals and the expected goals, multiplied by the learning rate $k$.
We can update a team's defensive rating by adding the difference between the expected goals scored against them and the actual number of goals scored against them, multiplied by the same learning rate $k$.

$R_O[\text{team}] = R_O[\text{team}] + k(G[\text{team}] - E[\text{team}])$

$R_D[\text{team}] = R_D[\text{team}] + k(E[\text{opponent}] - G[\text{opponent}])$

We start every team with a rating of 0. The order of the training data makes a difference to the model, so the training data should be in chronological order to account for teams changing over time.
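
As a worked example of the update rules above, the sketch below applies a single match to two teams that both start at 0. The team names `'A'` and `'B'`, the 2-1 scoreline, and the helper names are hypothetical; the learning rate of 0.05 is assumed for illustration.

```python
from collections import defaultdict

k = 0.05                          # learning rate
offence = defaultdict(float)      # R_O, every team starts at 0
defence = defaultdict(float)      # R_D, every team starts at 0


def expected_goals(team, opponent):
    """E[team] = R_O[team] - R_D[opponent]"""
    return offence[team] - defence[opponent]


def update(home, away, home_goals, away_goals):
    # Compute both expectations before changing any rating
    e_home = expected_goals(home, away)
    e_away = expected_goals(away, home)
    offence[home] += k * (home_goals - e_home)
    offence[away] += k * (away_goals - e_away)
    defence[home] += k * (e_away - away_goals)
    defence[away] += k * (e_home - home_goals)


update('A', 'B', 2, 1)            # hypothetical match: A beats B 2-1
print(offence['A'], defence['A'])
print(offence['B'], defence['B'])
print(expected_goals('A', 'B'), expected_goals('B', 'A'))
```

After this one match, A's offensive rating rises and B's defensive rating falls, so a rematch would predict slightly more goals for A than for B. A lower defensive rating means a weaker defence, because it is subtracted from the opponent's offensive rating.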

We take the output of the Elo rating predictor and use it to predict the final result. We do this in two ways: with a simple piecewise function that uses an optimised draw size, and with an SVC. The results are compared and the SVC is chosen because it is more accurate.


In [273]:
class GoalElo:
    def __init__(self, initial_rating=0, learning_rate=0.05, draw_size=0.5):
        self.offensive_ratings = defaultdict(lambda: initial_rating)
        self.defensive_ratings = defaultdict(lambda: initial_rating)
        self.match_count = defaultdict(lambda: 0)
        self.learning_rate = learning_rate
        self.draw_size = draw_size

    def predict(self, team, opponent):
        ''' Predicts the number of goals team will score against opponent. '''
        return self.offensive_ratings[team] - self.defensive_ratings[opponent]

    def predict_result(self, team, opponent):
        ''' Predicts the result of a match. 1 if team wins, 0 if opponent wins and 0.5 if it is a draw. '''
        goals_scored = self.predict(team, opponent)
        goals_conceded = self.predict(opponent, team)
        return self.classify_result(goals_scored, goals_conceded)

    def classify_result(self, goals_scored, goals_conceded):
        ''' Piecewise function to predict result from number of goals '''
        goal_difference = goals_scored - goals_conceded
        # result = round(1 / (1 + 10**(-goal_difference)))
        result = 1 if goal_difference > 0 else 0
        if abs(goal_difference) < self.draw_size:
            result = 0.5
        return result


    def predict_data(self, df):
        ''' Takes a data frame of home and away teams and predicts the number of goals and the result of each match. '''
        out = df.copy()
        for i, row in out.iterrows():
            out.at[i, 'EHG'] = self.predict(row['HomeTeam'], row['AwayTeam'])
            out.at[i, 'EAG'] = self.predict(row['AwayTeam'], row['HomeTeam'])
            out.at[i, 'ER'] = decode_result(self.predict_result(row['HomeTeam'], row['AwayTeam']))
        return out

    def update_match(self, home, away, home_actual_goals, away_actual_goals):
        ''' Updates the offensive and defensive ratings of both teams in a match. '''
        home_expected_goals = self.predict(home, away)
        away_expected_goals = self.predict(away, home)
        self.offensive_ratings[home] += self.learning_rate * (home_actual_goals - home_expected_goals)
        self.offensive_ratings[away] += self.learning_rate * (away_actual_goals - away_expected_goals)
        self.defensive_ratings[home] += self.learning_rate * (away_expected_goals - away_actual_goals)
        self.defensive_ratings[away] += self.learning_rate * (home_expected_goals - home_actual_goals)
        self.match_count[home] += 1
        self.match_count[away] += 1

    def ratings_dataframe(self):
        ''' Creates an easy to read dataframe of the ratings '''
        df = pd.DataFrame(self.offensive_ratings.items(), columns=['Team', 'Offensive Rating'])
        df['Defensive Rating'] = df['Team'].map(self.defensive_ratings)
        df['Matches'] = df['Team'].map(self.match_count)
        df = df.sort_values('Offensive Rating', ascending=False)
        return df

    def fit(self, df):
        ''' Takes a data frame of matches with columns HomeTeam, AwayTeam, Predicted FTHG, Predicted FTAG and updates teams ratings using the data in order. Falls back to FTHG and FTAG if the predicted columns are absent. '''
        for i, row in df.iterrows():
            if 'Predicted FTHG' in row:
                self.update_match(row['HomeTeam'], row['AwayTeam'], row['Predicted FTHG'], row['Predicted FTAG'])
            else:
                self.update_match(row['HomeTeam'], row['AwayTeam'], row['FTHG'], row['FTAG'])


In [274]:
def decode_result(result):
    if result == 1:
        return 'H'
    elif result == 0:
        return 'A'
    return 'D'

In [275]:
match_data = pd.read_csv(os.path.join(os.getcwd(), "output/non_shot_predictions.csv"))

In [276]:
training, test = train_test_split(match_data, test_size=0.05, shuffle=False)

In [277]:
# Train elo ratings
goal_elo = GoalElo()
goal_elo.fit(training)


In [278]:
# Predict number of goals for use in training result classifier
goal_predicted_data = goal_elo.predict_data(training)

X = np.array([goal_predicted_data['EHG'].to_numpy(), goal_predicted_data['EAG'].to_numpy()]).T
y = goal_predicted_data['FTR'].to_numpy()
# y = [decode_result(r) for r in goal_predicted_data['FTR'].to_numpy()]

# Train classifier to predict result from number of goals
goal_result_classifier = SVC(gamma='auto')
goal_result_classifier.fit(X, y)

SVC(gamma='auto')

In [307]:
# Predict number of goals scored using goal_elo
goal_prediction = goal_elo.predict_data(test)
X_test = np.array([goal_prediction['EHG'].to_numpy(), goal_prediction['EAG'].to_numpy()]).T
y_test = goal_prediction['FTR'].to_numpy()
# Predict result using SVC and elo predicted number of goals
y_pred = goal_result_classifier.predict(X_test)
# Predict result using piecewise function and elo predicted number of goals
y_pred2 = goal_prediction['ER'].to_numpy()

# Measure accuracy
print("Accuracy: {n}".format(n=accuracy_score(y_test, y_pred)))
print()

# Measure f1 score
print("F1 Score: {n}".format(n=f1_score(y_test, y_pred, average='weighted')))

Accuracy: 0.5058823529411764

F1 Score: 0.40906501547987617
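
The piecewise classifier depends on a draw size, which the text describes as optimised. The notebook does not show that step, so the following is only a sketch of one way it could be done: try a few candidate values and keep the one with the highest accuracy. The function `classify`, the candidate list, and all numbers are made up for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score


def classify(home_goals, away_goals, draw_size):
    """Piecewise rule: draw if the predicted goal difference is within draw_size."""
    diff = np.asarray(home_goals) - np.asarray(away_goals)
    return np.where(np.abs(diff) < draw_size, 'D', np.where(diff > 0, 'H', 'A'))


# Made-up predicted goals and true results, for illustration only
ehg = np.array([1.8, 1.2, 0.9, 1.5, 1.1, 0.7])
eag = np.array([0.9, 1.1, 1.6, 1.3, 1.2, 1.5])
ftr = np.array(['H', 'D', 'A', 'D', 'D', 'A'])

candidates = [0.0, 0.15, 0.3, 0.6, 1.0]
scores = {d: accuracy_score(ftr, classify(ehg, eag, d)) for d in candidates}
best_draw_size = max(scores, key=scores.get)   # first candidate with the top score
print(scores)
print(best_draw_size)
```

In practice the draw size would be tuned on the training matches only, so that the held-out matches still give an unbiased comparison against the SVC.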


### Final predictions on Test Set

First, importing the test dataset.

In [294]:
# Import data to predict
final_test_data = pd.read_csv(os.path.join(os.getcwd(), 'data/epl-test.csv'))
final_test_data

Unnamed: 0,Date,HomeTeam,AwayTeam
0,16 Jan 21,Arsenal,Newcastle
1,16 Jan 21,Aston Villa,Everton
2,16 Jan 21,Fulham,Chelsea
3,16 Jan 21,Leeds,Brighton
4,16 Jan 21,Leicester,Southampton
5,16 Jan 21,Liverpool,Man United
6,16 Jan 21,Man City,Crystal Palace
7,16 Jan 21,Sheffield United,Tottenham
8,16 Jan 21,West Ham,Burnley
9,16 Jan 21,Wolves,West Brom


Using ELO Ratings to predict the home and away team goals.

In [295]:
for i, r in final_test_data.iterrows():
    final_test_data.at[i, 'HomeGoals'] = goal_elo.predict(teamname_mapping[r['HomeTeam']], teamname_mapping[r['AwayTeam']])
    final_test_data.at[i, 'AwayGoals'] = goal_elo.predict(teamname_mapping[r['AwayTeam']], teamname_mapping[r['HomeTeam']])

In [296]:
final_test_data

Unnamed: 0,Date,HomeTeam,AwayTeam,HomeGoals,AwayGoals
0,16 Jan 21,Arsenal,Newcastle,1.623105,0.957264
1,16 Jan 21,Aston Villa,Everton,1.544919,1.488274
2,16 Jan 21,Fulham,Chelsea,0.984107,2.322189
3,16 Jan 21,Leeds,Brighton,1.070574,0.741452
4,16 Jan 21,Leicester,Southampton,1.441821,1.42889
5,16 Jan 21,Liverpool,Man United,1.77088,1.573276
6,16 Jan 21,Man City,Crystal Palace,2.257173,0.72945
7,16 Jan 21,Sheffield United,Tottenham,0.839343,1.564277
8,16 Jan 21,West Ham,Burnley,1.496072,0.91116
9,16 Jan 21,Wolves,West Brom,1.373062,0.645789


Running the match outcome classifier on the test dataset to generate final predictions.

In [297]:
goal_predictions = np.array([final_test_data['HomeGoals'].to_numpy(), final_test_data['AwayGoals'].to_numpy()]).T

In [298]:
final_test_data['FTR'] = goal_result_classifier.predict(goal_predictions)

In [299]:
final_test_data

Unnamed: 0,Date,HomeTeam,AwayTeam,HomeGoals,AwayGoals,FTR
0,16 Jan 21,Arsenal,Newcastle,1.623105,0.957264,H
1,16 Jan 21,Aston Villa,Everton,1.544919,1.488274,H
2,16 Jan 21,Fulham,Chelsea,0.984107,2.322189,A
3,16 Jan 21,Leeds,Brighton,1.070574,0.741452,H
4,16 Jan 21,Leicester,Southampton,1.441821,1.42889,H
5,16 Jan 21,Liverpool,Man United,1.77088,1.573276,H
6,16 Jan 21,Man City,Crystal Palace,2.257173,0.72945,H
7,16 Jan 21,Sheffield United,Tottenham,0.839343,1.564277,A
8,16 Jan 21,West Ham,Burnley,1.496072,0.91116,H
9,16 Jan 21,Wolves,West Brom,1.373062,0.645789,H


Writing results to a [CSV file](./output/final_predictions.csv).

In [302]:
final_test_data.to_csv(os.path.join(os.getcwd(), 'output/final_predictions.csv'))