# COMP0036: Group Coursework 

## Beat the Bookie

`<insert Introduction here>`

### Package Import

In [115]:
import os
import sys
import re
from math import ceil
from collections import defaultdict
from IPython.display import HTML, display
import random

#Standard Python libraries for data and visualisation
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px

#Import models
from sklearn.ensemble import RandomForestRegressor
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import PredefinedSplit, KFold
from sklearn.metrics import make_scorer

#Import error metrics
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score, f1_score, accuracy_score

#Import data munging tools
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import PolynomialFeatures

#Display charts in the notebook
%matplotlib inline

### Data Import
``<insert a brief description of our datasets here>``

### Shots xG model

The shot data for the shot-based expected goals (xG) model was obtained from https://fbref.com. It consists of the shots taken in each match and match highlights, recorded by MAKO (minutes after kick-off). 
Shot data was limited to the last 4 seasons, i.e., <b>2017-2021</b>.
The data was scraped from the website with a custom-built scraper.
The scraper is available at ... insert here

Currently our data contains the following values:
    <li><b>Minute</b>: minutes after kick-off at which the shot took place; minutes after the first half have 100 added to them.</li> 
    <li><b>Player</b>: player making the shot</li>
    <li><b>Squad</b>: squad the player is from</li>
    <li><b>Against</b>: squad against which shot is made</li>
    <li><b>Outcome</b>: whether a shot is a goal|blocked|saved|etc.</li>
    <li><b>Distance</b>: distance from the goal-post the shot was made</li>
    <li><b>Body Part</b>: body part used to make the shot</li>
    <li><b>Notes</b>: what kind of shot it was e.g. header|volley|etc.</li>
    <li><b>SCA 1 Player</b>: player inducing event leading to shot</li> 
    <li><b>SCA 1 Event</b>: event leading to shot</li>
    <li><b>SCA 2 Player</b>: player inducing event leading to 'SCA 1 Event'</li>
    <li><b>SCA 2 Event</b>: event leading to 'SCA 1 Event'</li>
    <li><b>Timestamp</b>: time and date of the kick-off</li>
    <li><b>Score</b>: score of the squad at the time of the shot being taken</li>
    <li><b>Player Advantage</b>: whether the squad has more players than the other</li>
    <li><b>Threat</b>: threat level of the player making the shot</li>

In [45]:
shot_data = pd.read_csv('data/fantasy-league/shot_data.csv', index_col=0)

In [47]:
shot_data.head()

Unnamed: 0,Timestamp,Score,Player Advantage,Minute,Player,Squad,Against,Outcome,Distance,Body Part,Notes,SCA 1 Player,SCA 1 Event,SCA 2 Player,SCA 2 Event,Threat
0,2016-08-13 12:30:00,0.0,0.0,147,Riyad Mahrez,Leicester City,Hull City,Goal,,,Penalty Kick,Goal,,,,720.0
1,2016-08-13 17:30:00,1.0,0.0,4,Sergio Agüero,Manchester City,Sunderland,Goal,,,Penalty Kick,—,,,,720.0
2,2016-08-15 20:00:00,1.0,0.0,147,Eden Hazard,Chelsea,West Ham United,Goal,,,Penalty Kick,Yellow Card,,,,627.0
3,2016-08-19 20:00:00,2.0,0.0,152,Zlatan Ibrahimović,Manchester United,Southampton,Goal,,,Penalty Kick,Goal,,,,627.0
4,2016-08-20 12:30:00,1.0,0.0,27,Sergio Agüero,Manchester City,Stoke City,Goal,,,Penalty Kick,Yellow Card,,,,


### Non-shots xG model

#### Load Data

For the non-shot-based expected goals (xG) model, we obtained a dataset from [football-data.co.uk](https://www.football-data.co.uk/) consisting of football match information over the past 10+ years. We have reduced this dataset to data from the last 5 seasons (including the current one), i.e., __2016-2021__. This version contains 1684 samples, each of which consists of the 13 features listed below and 2 labels: the full time home team goals (FTHG) and the full time away team goals (FTAG). The features are:

1. GameID = Unique ID for the match
2. Date = Match Date (dd/mm/yy)
3. HomeTeam = Home Team
4. AwayTeam = Away Team
5. Referee = Match Referee
6. HC = Home Team Corners
7. AC = Away Team Corners
8. HF = Home Team Fouls Committed
9. AF = Away Team Fouls Committed
10. HY = Home Team Yellow Cards
11. AY = Away Team Yellow Cards
12. HR = Home Team Red Cards
13. AR = Away Team Red Cards

In [2]:
DATASET_PATH = os.path.join(os.getcwd(), "data/non-shot-xG/non_shot_data.csv")
complete_data = pd.read_csv(DATASET_PATH)

In [3]:
complete_data.head()

Unnamed: 0,GameID,Date,HomeTeam,AwayTeam,Referee,HC,AC,HF,AF,HY,AY,HR,AR,FTHG,FTAG,FTR
0,1,13/08/2016,Burnley,Swansea,J Moss,7,4,10,14,3,2,0,0,0,1,A
1,2,13/08/2016,Crystal Palace,West Brom,C Pawson,3,6,12,15,2,2,0,0,0,1,A
2,3,13/08/2016,Everton,Tottenham,M Atkinson,5,6,10,14,0,0,0,0,1,1,D
3,4,13/08/2016,Hull,Leicester,M Dean,5,3,8,17,2,2,0,0,2,1,H
4,5,13/08/2016,Man City,Sunderland,R Madley,9,6,11,14,1,2,0,0,2,1,H


### Data Transformation and Exploration
``<insert a brief description of our preparation and standardisation process here>``

### Shots xG model

Convert 'Timestamp' to datetime objects.

In [49]:
shot_data['Timestamp'] = pd.to_datetime(shot_data['Timestamp'])

We assume that shot patterns change near the end of the match, as attacks become more aggressive, so we flag shots with an encoded 'Minute' above 185.

In [50]:
shot_data['End_close'] = (shot_data['Minute'] > 185).astype(int)
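
The threshold above depends on how 'Minute' is encoded. As a minimal sketch, assuming (per the column description) that minutes after the first half are stored with 100 added, the encoding can be unpacked as follows. The helper names `decode_minute` and `is_end_close` are hypothetical and not part of the notebook.

```python
def decode_minute(encoded):
    """Split an encoded minute into (half, match_minute).

    Assumption: minutes after the first half are stored with 100 added,
    so 147 is read as minute 47 of the match, in the second half.
    """
    if encoded >= 100:
        return 2, encoded - 100
    return 1, encoded


def is_end_close(encoded, threshold=185):
    """Mirror of the End_close flag: 1 if the encoded minute exceeds the threshold."""
    return int(encoded > threshold)


print(decode_minute(147))
print(decode_minute(4))
print(is_end_close(188), is_end_close(147))
```

Under this reading, a threshold of 185 corresponds to shots taken after the 85th minute.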

Null values in 'Notes' correspond to normal shots, so we fill them with 'normal'.

In [51]:
shot_data['Notes'] = shot_data['Notes'].fillna('normal')

##### Separate Shot data by type
We assume that shots of different types have different probabilities of resulting in a goal from the same distance.

In [52]:
def shot_data_by_type(type_name, df):
    # collect notes including the substring type_name, e.g. 'volley'
    types = [v for v in df['Notes'].unique() if type_name in v.lower()]
    # extract shots of the type specified
    type_df = df[df['Notes'].isin(types)]
    type_df = type_df.reset_index(drop=True)
    # deduce outcome of shot i.e. Goal or not Goal
    type_goals = type_df['Outcome'] == 'Goal'
    # drop unneeded columns
    type_df = type_df.drop(columns=['Squad', 'Against', 'Outcome', 'Player', 
                                    'Body Part', 'SCA 1 Player', 'SCA 1 Event', 
                                   'SCA 2 Player', 'SCA 2 Event'])
    # one-hot encoding for subtypes
    type_df = pd.concat([type_df, pd.get_dummies(type_df['Notes'])], axis=1)
    # add a new column with label
    type_df['Goal'] = type_goals.astype(int)
    
    return type_df
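
To illustrate what the function does, here is a small self-contained sketch on toy data. The rows below are hypothetical and only the 'Notes' and 'Outcome' columns are included; it shows the case-insensitive substring match, the one-hot encoding of subtypes, and the binary label.

```python
import pandas as pd

# Hypothetical toy data, not taken from the real shot dataset
toy = pd.DataFrame({
    'Notes': ['Volley', 'Header', 'Volley, Deflected', 'normal'],
    'Outcome': ['Goal', 'Saved', 'Blocked', 'Goal'],
})

# Case-insensitive substring match on the note text
mask = toy['Notes'].str.lower().str.contains('volley', regex=False)
volleys = toy[mask].reset_index(drop=True)

# One indicator column per distinct subtype, plus a binary label
encoded = pd.concat([volleys, pd.get_dummies(volleys['Notes'])], axis=1)
encoded['Goal'] = (volleys['Outcome'] == 'Goal').astype(int)

print(encoded)
```

Each distinct note that contains the substring becomes its own indicator column, so 'Volley' and 'Volley, Deflected' are kept apart as subtypes.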

Types of shots, excluding 'normal', which contains everything not listed below.

In [53]:
types = ['volley', 'header', 'free kick', 'overhead', 'back heel', 'penalty kick']
# collect all DataFrames for each type
type_dfs = dict()
# sets to keep records of which types of shots have been collected
all_types = set(shot_data['Notes'].unique())
used = set()

for t in types:
    # using the function, separates the shots by type
    type_dfs[t] = shot_data_by_type(t, shot_data)
    # adds used types to the set 'used'
    used = used.union(set([v for v in shot_data['Notes'].unique() if t in v.lower()]))

Any remaining note values are grouped with the 'normal' shots.

In [54]:
rest_of_shots = list(all_types.difference(used))
new_names = []

# add normal to each type left for use of shot_data_by_type function
for i in range(len(rest_of_shots)):
    current = rest_of_shots[i]
    new_names.append(current if current == 'normal' else current + ' normal')
    
# change notes according to above changes in the dataframe
for o, n in zip(rest_of_shots, new_names):
    # 'normal' does not change
    if o != n:
        originals = shot_data['Notes'] == o
        shot_data.loc[originals, 'Notes'] = n

In [55]:
type_dfs['normal'] = shot_data_by_type('normal', shot_data)

### Non-shots xG model

Next, we sanitise and standardise the features for use in other models.

__Note__: Standard team names and referee names along with their respective unique IDs are located in [this](data/standard) directory

#### 1. Dropping columns that are not essential to train our model.

In [4]:
general_training_data = complete_data.drop(['GameID','Date'], axis=1)

#### 2. Encode names of the teams and referees using standardised data

In [5]:
teams_data = pd.read_csv(os.path.join(os.getcwd(), "data/standard/standard.teamnames.csv"))
referee_data = pd.read_csv(os.path.join(os.getcwd(), "data/standard/standard.referee.names.csv"))

In [6]:
# Generating teams mappings 
teamname, teamID = list(teams_data['Standard teamname']), list(teams_data['TeamID'])
teamID_mapping = dict(zip(teamname, teamID))

generate_teamID_mappings = lambda teamnames: [teamID_mapping[teamname] for teamname in teamnames]

In [7]:
# Generating referees mappings 
referee, refereeID = list(referee_data['Standard referee name']), list(referee_data['RefereeID'])
refereeID_mapping = dict(zip(referee, refereeID))

generate_refereeID_mappings = lambda referees: [refereeID_mapping[referee] for referee in referees]

#### Applying transformations to the (general) training dataset.

In [8]:
# Teams
general_training_data['HomeTeam'] = generate_teamID_mappings(general_training_data['HomeTeam'])
general_training_data['AwayTeam'] = generate_teamID_mappings(general_training_data['AwayTeam'])

# Referees
general_training_data['Referee'] = generate_refereeID_mappings(general_training_data['Referee'])
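
A plain dictionary lookup raises a bare `KeyError` on the first name that is missing from the standard tables. As a hedged alternative sketch (not the notebook's method), `Series.map` can be combined with an explicit check so that every unmapped name is reported at once. The mapping and the helper `encode_column` below are hypothetical.

```python
import pandas as pd

# Hypothetical name -> ID table standing in for the standard team names file
team_ids = {'Burnley': 10, 'Swansea': 34}

matches = pd.DataFrame({'HomeTeam': ['Burnley', 'Swansea'],
                        'AwayTeam': ['Swansea', 'Burnly']})  # deliberate typo


def encode_column(names, mapping):
    """Map names to IDs, raising a readable error that lists unknown names."""
    encoded = names.map(mapping)
    unknown = names[encoded.isna()].unique()
    if len(unknown) > 0:
        raise KeyError(f"Unmapped names: {list(unknown)}")
    return encoded.astype(int)


home_ids = encode_column(matches['HomeTeam'], team_ids)
print(home_ids.tolist())

try:
    encode_column(matches['AwayTeam'], team_ids)
except KeyError as err:
    print(err)
```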

#### 3. Integrating expected goals (xG) data from shots-based model 

Extracting the results of the shots-based model.

In [9]:
shots_xg_predictions = pd.read_csv(os.path.join(os.getcwd(), 'output/shots_xG_predictions.csv'))

In [10]:
# Add expected goals for the home team
general_training_data['xHG'] = shots_xg_predictions['xG_h']

# Add expected goals for the away team
general_training_data['xAG'] = shots_xg_predictions['xG_a']

In [11]:
general_training_data

Unnamed: 0,HomeTeam,AwayTeam,Referee,HC,AC,HF,AF,HY,AY,HR,AR,FTHG,FTAG,FTR,xHG,xAG
0,10,34,12,7,4,10,14,3,2,0,0,0,1,A,0.000000,1.008553
1,13,37,7,3,6,12,15,2,2,0,0,0,1,A,0.000000,1.008553
2,14,35,18,5,6,10,14,0,0,0,0,1,1,D,0.950882,1.008553
3,18,20,20,5,3,8,17,2,2,0,0,2,1,H,1.950882,1.000000
4,22,33,34,9,6,11,14,1,2,0,0,2,1,H,1.000000,1.008553
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1679,9,40,1,5,8,13,8,2,1,0,0,3,3,D,1.000000,1.000000
1680,37,1,18,3,5,7,4,1,2,0,0,0,4,A,0.000000,4.008553
1681,25,20,33,3,6,10,11,0,2,0,0,1,2,A,0.950882,2.008553
1682,12,22,3,5,3,11,10,3,1,0,0,1,3,A,0.950882,3.008553


For the purposes of the non-shot-based model, we will split the dataset into two: one containing data for predicting the __FTHG__ (Full Time Home Goals) and one containing data for predicting the __FTAG__ (Full Time Away Goals).

#### 4. Split the dataset into two parts: ``home_training_data`` and ``away_training_data``

In [12]:
X_home = home_training_data = general_training_data.drop(['FTAG', 'FTHG', 'AC', 'xAG', 'FTR'], axis=1)

In [13]:
X_home.head()

Unnamed: 0,HomeTeam,AwayTeam,Referee,HC,HF,AF,HY,AY,HR,AR,xHG
0,10,34,12,7,10,14,3,2,0,0,0.0
1,13,37,7,3,12,15,2,2,0,0,0.0
2,14,35,18,5,10,14,0,0,0,0,0.950882
3,18,20,20,5,8,17,2,2,0,0,1.950882
4,22,33,34,9,11,14,1,2,0,0,1.0


In [14]:
X_away = away_training_data = general_training_data.drop(['FTHG', 'FTAG', 'HC', 'xHG', 'FTR'], axis=1)

In [15]:
X_away.head()

Unnamed: 0,HomeTeam,AwayTeam,Referee,AC,HF,AF,HY,AY,HR,AR,xAG
0,10,34,12,4,10,14,3,2,0,0,1.008553
1,13,37,7,6,12,15,2,2,0,0,1.008553
2,14,35,18,6,10,14,0,0,0,0,1.008553
3,18,20,20,3,8,17,2,2,0,0,1.0
4,22,33,34,6,11,14,1,2,0,0,1.008553


### Methodology Overview
``Pipeline overview``

### Model training and validation

### Shots xG model

##### Model Analysis

In [103]:
def clean_split_analysis(data):
    # drop non-feature columns
    data = data.drop(columns=['Timestamp', 'Notes', 'Minute'])
    # fill null values in threat with mean
    data['Threat'] = data['Threat'].fillna(data['Threat'].mean())
    # separate label columns from data
    goals = data['Goal']
    data = data.drop(columns=['Goal'])
    # fill any null value, only true for penalty kicks' distance
    data = data.fillna(0)
    
    scaler = StandardScaler()
    
    data[['Distance', 'Threat']] = scaler.fit_transform(data[['Distance', 'Threat']])
    
    return train_test_split(data, goals, test_size=0.1, random_state=10)
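
The function above fits the scaler on all rows before splitting, so the test rows influence the scaling statistics. A possible variation, sketched here on hypothetical toy data, splits first and fits the scaler on the training rows only. The helper name `split_then_scale` is an assumption and this is not the notebook's own pipeline.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def split_then_scale(features, labels, columns, test_size=0.1, random_state=10):
    """Split first, then fit the scaler on the training rows only."""
    x_train, x_test, y_train, y_test = train_test_split(
        features, labels, test_size=test_size, random_state=random_state)
    x_train, x_test = x_train.copy(), x_test.copy()
    scaler = StandardScaler()
    x_train[columns] = scaler.fit_transform(x_train[columns])
    # Reuse the training statistics for the held-out rows
    x_test[columns] = scaler.transform(x_test[columns])
    return x_train, x_test, y_train, y_test


# Hypothetical toy data
toy = pd.DataFrame({
    'Distance': [5.0, 12.0, 20.0, 8.0, 30.0, 16.0, 11.0, 25.0, 3.0, 18.0],
    'Threat': [100.0, 50.0, 10.0, 80.0, 5.0, 40.0, 60.0, 20.0, 120.0, 30.0],
})
labels = pd.Series([1, 0, 0, 1, 0, 0, 0, 0, 1, 0])

x_train, x_test, y_train, y_test = split_then_scale(toy, labels, ['Distance', 'Threat'])
print(x_train['Distance'].mean())  # approximately 0 on the training rows
```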

We iterate over each selected classifier and collect scores for each shot type

In [104]:
classifiers = [GradientBoostingClassifier(max_depth=5, learning_rate=0.1, subsample=0.15),
              GaussianNB(), LogisticRegression(max_iter=1000)]

cls_names = ['xgbc',
            'gaussian',
            'log']

cls_scores = []

for cls in classifiers:
    # variables to collect data
    correct_shots = 0
    goals = 0
    false_goals = 0
    f1_scores = 0
    total_pred_goals = 0
    total_shots = 0
    total_goals = 0

    # training and collecting result from each classifier over each type of shot
    for k, v in type_dfs.items():

        x_train, x_test, y_train, y_test = clean_split_analysis(v)
        cls = cls.fit(x_train, y_train)
        pred = cls.predict(x_test)

        total_shots += len(y_test)
        total_goals += sum(y_test > 0)
        total_pred_goals += sum(pred > 0)

        false_goals += sum(pred[pred > 0] != y_test[pred > 0])
        correct_shots += sum(pred == y_test)
        goals += sum(pred[y_test > 0] == y_test[y_test > 0])
        f1_scores += f1_score(y_test, pred) * len(y_test)

    cls_scores.append([correct_shots / total_shots, goals / total_goals, 
                   false_goals / total_pred_goals, f1_scores / total_shots])

We put these readings into a DataFrame object

In [105]:
# build the score table in one step (DataFrame.append was removed in pandas 2.0)
score_columns = ['Classifier', 'Accuracy', 'True positives', 'False Positives', 'F1 score']
classifer_score_df = pd.DataFrame([[n, *s] for n, s in zip(cls_names, cls_scores)],
                                  columns=score_columns)

In [106]:
fig = px.bar(classifer_score_df, y=['Accuracy', 'True positives', 'False Positives', 'F1 score'], x='Classifier', barmode='group', title='Classifier comparison')
fig.show()

display(HTML('''<ol>
            <li><strong>Accuracy</strong>: correctly predicted shots / total shots</li>
            <li><strong>True positives</strong>: correctly classified goals / actual goals</li>
            <li><strong>False positives</strong>: incorrectly classified goals / predicted goals</li>
            <li><strong>F1 score</strong>: F1 score of the classifier</li>
        </ol><br>
        Accuracy is not a reliable metric in this case due to uneven label numbers.<br>
        We instead rely on F1 scores while keeping an eye on True positives and False positives.<br>
        F1 scores for all three models are similar, so we focus on True positives as a secondary metric since 
        our aim is to correctly predict goals.
        <br><br>
        <strong>From the data we can conclude that a GradientBoostingClassifier is the best option for 
        this classification.</strong>'''))
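
As a small worked example of the four quantities computed in the loop above, the sketch below derives them from a confusion matrix on hypothetical labels. The 'False positives' ratio follows the loop's definition, `false_goals / total_pred_goals`, i.e. wrongly predicted goals divided by all predicted goals.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score

# Hypothetical labels: 1 = goal, 0 = no goal
y_true = np.array([1, 0, 0, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 0, 0, 1, 0])

# For binary labels the flattened matrix is ordered tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / len(y_true)        # correct predictions / all shots
true_positive_rate = tp / (tp + fn)       # correctly predicted goals / actual goals
false_positive_share = fp / (tp + fp)     # wrongly predicted goals / predicted goals
f1 = f1_score(y_true, y_pred)

print(accuracy, true_positive_rate, false_positive_share, f1)
```

Here the accuracy is 0.75 even though one of three goals is missed, which shows why accuracy alone can look flattering when goals are rare.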

##### Parameter tuning

In [113]:
def clean_split_validation(data):
    # drop non-feature columns
    data = data.drop(columns=['Timestamp', 'Notes', 'Minute'])
    # fill null values in threat with mean
    data['Threat'] = data['Threat'].fillna(data['Threat'].mean())
    # separate label columns from data
    goals = data['Goal']
    data = data.drop(columns=['Goal'])
    # fill any null value, only true for penalty kicks' distance
    data = data.fillna(0)
    
    scaler = StandardScaler()
    
    data[['Distance', 'Threat']] = scaler.fit_transform(data[['Distance', 'Threat']])
    
    return train_test_split(data, goals, test_size=0.1, random_state=10)

We iterate over GradientBoostingClassifier candidates with different `max_depth` values and collect scores for each shot type.

In [117]:
classifiers = [GradientBoostingClassifier(max_depth=2, learning_rate=0.1, subsample=0.15),
              GradientBoostingClassifier(max_depth=4, learning_rate=0.1, subsample=0.15),
              GradientBoostingClassifier(max_depth=5, learning_rate=0.1, subsample=0.15),
              GradientBoostingClassifier(max_depth=7, learning_rate=0.1, subsample=0.15),
              GradientBoostingClassifier(max_depth=10, learning_rate=0.1, subsample=0.15)]

cls_scores = []

for cls in classifiers:
    # variables to collect data
    correct_shots = 0
    goals = 0
    false_goals = 0
    f1_scores = 0
    total_pred_goals = 0
    total_shots = 0
    total_goals = 0

    # training and collecting result from each classifier over each type of shot
    for k, v in type_dfs.items():

        x_train, x_test, y_train, y_test = clean_split_validation(v)
        cls = cls.fit(x_train, y_train)
        pred = cls.predict(x_test)

        total_shots += len(y_test)
        total_goals += sum(y_test > 0)
        total_pred_goals += sum(pred > 0)

        false_goals += sum(pred[pred > 0] != y_test[pred > 0])
        correct_shots += sum(pred == y_test)
        goals += sum(pred[y_test > 0] == y_test[y_test > 0])
        f1_scores += f1_score(y_test, pred) * len(y_test)

    cls_scores.append([correct_shots / total_shots, goals / total_goals, 
                   false_goals / total_pred_goals, f1_scores / total_shots])
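
The manual loop above compares depths on a single train/test split. An alternative sketch, assuming cross-validated F1 is an acceptable selection criterion, uses `GridSearchCV` (already imported at the top of the notebook). The data here is synthetic and stands in for one shot type; the grid, `n_estimators` and fold count are illustrative choices, not tuned values.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced stand-in for one shot type (not the real data)
X, y = make_classification(n_samples=300, n_features=6, weights=[0.8, 0.2],
                           random_state=10)

search = GridSearchCV(
    GradientBoostingClassifier(learning_rate=0.1, subsample=0.15,
                               n_estimators=50, random_state=10),
    param_grid={'max_depth': [2, 4, 5]},
    scoring='f1',   # F1 of the positive (goal) class
    cv=3,
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

Cross-validation averages the score over several splits, which makes the comparison less sensitive to one particular `random_state`.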

We put these readings into a DataFrame object

In [87]:
# label each candidate by its max_depth, in the same order as `classifiers`
depth_names = ['max_depth=2', 'max_depth=4', 'max_depth=5', 'max_depth=7', 'max_depth=10']
score_columns = ['Classifier', 'Accuracy', 'True positives', 'False Positives', 'F1 score']
# build the score table in one step (DataFrame.append was removed in pandas 2.0)
classifer_score_df = pd.DataFrame([[n, *s] for n, s in zip(depth_names, cls_scores)],
                                  columns=score_columns)

In [94]:
fig = px.bar(classifer_score_df, y=['Accuracy', 'True positives', 'False Positives', 'F1 score'], x='Classifier', barmode='group', title='Classifier comparison')
fig.show()

display(HTML('''<ol>
            <li><strong>Accuracy</strong>: correctly predicted shots / total shots</li>
            <li><strong>True positives</strong>: correctly classified goals / actual goals</li>
            <li><strong>False positives</strong>: incorrectly classified goals / predicted goals</li>
            <li><strong>F1 score</strong>: F1 score of the classifier</li>
        </ol><br>
        Accuracy is not a reliable metric in this case due to uneven label numbers.<br>
        We instead rely on F1 scores while keeping an eye on True positives and False positives.<br>
        F1 scores for all three models are similar, we focus on True positives as a secondary metric since 
        our aim is to correctly predict goals.
        <br><br>
        <strong>From the data we can conclude that a GradientBoostingClassifier is the best option for 
        this classification.</strong>'''))

##### Model training

In [95]:
def get_data_train_ready(data):
    # drop columns which are not selected as features in our model
    data = data.drop(columns=['Timestamp', 'Notes', 'Minute'])
    # fill empty threat values with mean
    data['Threat'] = data['Threat'].fillna(data['Threat'].mean())
    # separate 'Goal' columns, they are considered labels
    goals = data['Goal']
    data = data.drop(columns=['Goal'])
    
    return train_test_split(data, goals, test_size=0.1, random_state=10)

In [101]:
def get_classifier(x_train, x_test, y_train, y_test):
    
    cls = GradientBoostingClassifier(max_depth=4, learning_rate=0.1, subsample=0.15)
    cls.fit(x_train, y_train)
    
    return cls

In [102]:
classifier_by_type = {}

for k, v in type_dfs.items():
    
    model_ready_data = get_data_train_ready(type_dfs[k].fillna(0))
    
    clsf = get_classifier(*model_ready_data)
    
    classifier_by_type[k] = clsf
    

### Non-shots xG model

### FTHG model

In [16]:
Y_home = general_training_data.FTHG

In [17]:
X_home_train, X_home_test, Y_home_train, Y_home_test = train_test_split(X_home, Y_home, test_size=0.2)

In [18]:
model_home = RandomForestRegressor(n_estimators = 100)
model_home.fit(X_home_train, Y_home_train)

RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                      max_depth=None, max_features='auto', max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      n_estimators=100, n_jobs=None, oob_score=False,
                      random_state=None, verbose=0, warm_start=False)

### FTAG model

In [19]:
Y_away = general_training_data.FTAG

In [20]:
X_away_train, X_away_test, Y_away_train, Y_away_test = train_test_split(X_away, Y_away, test_size=0.2)

In [21]:
model_away = RandomForestRegressor(n_estimators = 100)
model_away.fit(X_away_train, Y_away_train)

RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                      max_depth=None, max_features='auto', max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      n_estimators=100, n_jobs=None, oob_score=False,
                      random_state=None, verbose=0, warm_start=False)

### FTHG Results

In [22]:
home_training_data = general_training_data.copy().drop(['FTAG', 'AC', 'xAG', 'FTR'], axis=1)
home_model_input_data = home_training_data.copy().drop(columns=['FTHG'])

In [23]:
home_training_data

Unnamed: 0,HomeTeam,AwayTeam,Referee,HC,HF,AF,HY,AY,HR,AR,FTHG,xHG
0,10,34,12,7,10,14,3,2,0,0,0,0.000000
1,13,37,7,3,12,15,2,2,0,0,0,0.000000
2,14,35,18,5,10,14,0,0,0,0,1,0.950882
3,18,20,20,5,8,17,2,2,0,0,2,1.950882
4,22,33,34,9,11,14,1,2,0,0,2,1.000000
...,...,...,...,...,...,...,...,...,...,...,...,...
1679,9,40,1,5,13,8,2,1,0,0,3,1.000000
1680,37,1,18,3,7,4,1,2,0,0,0,0.000000
1681,25,20,33,3,10,11,0,2,0,0,1,0.950882
1682,12,22,3,5,11,10,3,1,0,0,1,0.950882


In [24]:
home_pred_data = pd.get_dummies(home_model_input_data)
home_r = model_home.predict(home_pred_data)
home_r = pd.DataFrame(home_r)

In [25]:
home_r.columns= ['Predicted FTHG']
home_training_data.reset_index(drop=True, inplace=True)
home_results = pd.concat([home_training_data, home_r], axis=1)
home_results["Deviation in FTHG"] = abs(home_results["Predicted FTHG"] - home_results["FTHG"])
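
The per-match deviation can be summarised in a single number with `mean_absolute_error`, which is imported at the top of the notebook. This is a minimal sketch on hypothetical values. Note that a score computed on rows the model was trained on will tend to look better than one computed on held-out rows only.

```python
import pandas as pd
from sklearn.metrics import mean_absolute_error

# Hypothetical actual and predicted goals
results = pd.DataFrame({'FTHG': [0, 1, 2, 3],
                        'Predicted FTHG': [0.0, 1.0, 1.96, 2.66]})

# Per-match absolute deviation, as in the cell above
deviation = (results['Predicted FTHG'] - results['FTHG']).abs()

# The mean of those deviations is the mean absolute error
mae = mean_absolute_error(results['FTHG'], results['Predicted FTHG'])

print(round(mae, 3))
```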

### FTAG Results

In [26]:
away_training_data = general_training_data.copy().drop(['FTHG', 'HC', 'xHG', 'FTR'], axis=1)
away_model_input_data = away_training_data.copy().drop(columns=['FTAG'])

In [27]:
away_pred_data = pd.get_dummies(away_model_input_data)
away_r = model_away.predict(away_pred_data)
away_r = pd.DataFrame(away_r)

In [28]:
away_r.columns= ['Predicted FTAG']
away_training_data.reset_index(drop=True, inplace=True)
away_results = pd.concat([away_training_data, away_r], axis=1)
away_results["Deviation in FTAG"] = abs(away_results["Predicted FTAG"] - away_results["FTAG"])

### Merging results of both models

In [29]:
complete_non_shot_predictions = general_training_data.copy()

# Add predicted FTHG
complete_non_shot_predictions['Predicted FTHG'] = home_results['Predicted FTHG']

# Add predicted FTAG
complete_non_shot_predictions['Predicted FTAG'] = away_results['Predicted FTAG']

complete_non_shot_predictions

Unnamed: 0,HomeTeam,AwayTeam,Referee,HC,AC,HF,AF,HY,AY,HR,AR,FTHG,FTAG,FTR,xHG,xAG,Predicted FTHG,Predicted FTAG
0,10,34,12,7,4,10,14,3,2,0,0,0,1,A,0.000000,1.008553,0.00,0.99
1,13,37,7,3,6,12,15,2,2,0,0,0,1,A,0.000000,1.008553,0.00,0.99
2,14,35,18,5,6,10,14,0,0,0,0,1,1,D,0.950882,1.008553,1.00,1.00
3,18,20,20,5,3,8,17,2,2,0,0,2,1,H,1.950882,1.000000,1.96,1.45
4,22,33,34,9,6,11,14,1,2,0,0,2,1,H,1.000000,1.008553,2.29,1.00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1679,9,40,1,5,8,13,8,2,1,0,0,3,3,D,1.000000,1.000000,2.66,3.00
1680,37,1,18,3,5,7,4,1,2,0,0,0,4,A,0.000000,4.008553,0.00,3.94
1681,25,20,33,3,6,10,11,0,2,0,0,1,2,A,0.950882,2.008553,1.00,1.99
1682,12,22,3,5,3,11,10,3,1,0,0,1,3,A,0.950882,3.008553,1.00,3.00


In [30]:
path = os.path.join(os.getcwd(), "output/non_shot_predictions.csv")
complete_non_shot_predictions.to_csv(path, index=False)

### ELO Rating classifier

#### Offensive and defensive ELO ratings to predict number of goals

We keep track of two ratings for every team: an offensive rating ($R_O$) and a defensive rating ($R_D$).

We can then predict the number of goals a team will score by taking the difference of their offensive rating and the opponent's defensive rating.

The number of goals scored against them can be calculated by considering it from the opponent's perspective.

$E[\text{team}] = R_O[\text{team}] - R_D[\text{opponent}]$

We can update a team's offensive rating by adding the difference between the actual number of goals and the expected goals multiplied by the learning rate.
We can update a team's defensive rating by adding the difference between the expected goals scored against them and the actual number of goals scored against them multiplied by the learning rate.

$R_O[\text{team}] = R_O[\text{team}] + k(G[\text{team}] - E[\text{team}])$

$R_D[\text{team}] = R_D[\text{team}] + k(E[\text{opponent}] - G[\text{opponent}])$

We start every team with a rating of 0. The order of the training data affects the model, so the training data should be in chronological order to account for teams changing over time.
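
As a worked example of the update rules, the sketch below traces a single match between two hypothetical teams 'A' and 'B', with both ratings starting at 0 and an assumed learning rate of $k = 0.05$. It is a stripped-down illustration of the equations, not the full class used later.

```python
from collections import defaultdict

k = 0.05                          # learning rate (assumed value for the example)
offensive = defaultdict(float)    # R_O, every team starts at 0
defensive = defaultdict(float)    # R_D, every team starts at 0


def expected_goals(team, opponent):
    """E[team] = R_O[team] - R_D[opponent]."""
    return offensive[team] - defensive[opponent]


def update(home, away, home_goals, away_goals):
    # Compute both expectations before changing any rating
    e_home = expected_goals(home, away)
    e_away = expected_goals(away, home)
    offensive[home] += k * (home_goals - e_home)
    offensive[away] += k * (away_goals - e_away)
    defensive[home] += k * (e_away - away_goals)
    defensive[away] += k * (e_home - home_goals)


update('A', 'B', 2, 1)            # hypothetical result: A beats B 2-1

print(expected_goals('A', 'B'))   # approximately 0.2
print(expected_goals('B', 'A'))   # approximately 0.1
```

After one match, A's offensive rating rises to about 0.1 and B's defensive rating falls to about -0.1, so the next prediction for A against B is about 0.2 goals.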

We take the output of the Elo rating predictor and use it to predict the final result. Two approaches are compared: a simple piecewise function with an optimised draw size, and an SVC. The SVC is chosen because it is more accurate.


In [31]:
class GoalElo:
    def __init__(self, initial_rating=0, learning_rate=0.05, draw_size=0.5):
        self.offensive_ratings = defaultdict(lambda: initial_rating)
        self.defensive_ratings = defaultdict(lambda: initial_rating)
        self.match_count = defaultdict(lambda: 0)
        self.learning_rate = learning_rate
        self.draw_size = draw_size

    def predict(self, team, opponent):
        ''' Predicts the number of goals team will score against opponent. '''
        return self.offensive_ratings[team] - self.defensive_ratings[opponent]

    def predict_result(self, team, opponent):
        ''' Predicts the result of a match. 1 if team wins, 0 if opponent wins and 0.5 if it is a draw. '''
        goals_scored = self.predict(team, opponent)
        goals_conceded = self.predict(opponent, team)
        return self.classify_result(goals_scored, goals_conceded)

    def classify_result(self, goals_scored, goals_conceded):
        ''' Piecewise function to predict result from number of goals '''
        goal_difference = goals_scored - goals_conceded
        # result = round(1 / (1 + 10**(-goal_difference)))
        result = 1 if goal_difference > 0 else 0
        if abs(goal_difference) < self.draw_size:
            result = 0.5
        return result


    def predict_data(self, df):
        ''' Takes a data frame of home and away teams and predicts the number of goals and the result of each match. '''
        out = df.copy()
        for i, row in out.iterrows():
            out.at[i, 'EHG'] = self.predict(row['HomeTeam'], row['AwayTeam'])
            out.at[i, 'EAG'] = self.predict(row['AwayTeam'], row['HomeTeam'])
            out.at[i, 'ER'] = decode_result(self.predict_result(row['HomeTeam'], row['AwayTeam']))
        return out

    def update_match(self, home, away, home_actual_goals, away_actual_goals):
        ''' Updates the offensive and defensive ratings of both teams in a match. '''
        home_expected_goals = self.predict(home, away)
        away_expected_goals = self.predict(away, home)
        self.offensive_ratings[home] += self.learning_rate * (home_actual_goals - home_expected_goals)
        self.offensive_ratings[away] += self.learning_rate * (away_actual_goals - away_expected_goals)
        self.defensive_ratings[home] += self.learning_rate * (away_expected_goals - away_actual_goals)
        self.defensive_ratings[away] += self.learning_rate * (home_expected_goals - home_actual_goals)
        self.match_count[home] += 1
        self.match_count[away] += 1

    def ratings_dataframe(self):
        ''' Creates an easy to read dataframe of the ratings '''
        df = pd.DataFrame(self.offensive_ratings.items(), columns=['Team', 'Offensive Rating'])
        df['Defensive Rating'] = df['Team'].map(self.defensive_ratings)
        df['Matches'] = df['Team'].map(self.match_count)
        df = df.sort_values('Offensive Rating', ascending=False)
        return df

    def fit(self, df):
        ''' Takes a data frame of matches with columns HomeTeam, AwayTeam, Predicted FTHG, Predicted FTAG and updates teams ratings using the data in order. '''
        for i, row in df.iterrows():
            if 'Predicted FTHG' in row:
                self.update_match(row['HomeTeam'], row['AwayTeam'], row['Predicted FTHG'], row['Predicted FTAG'])
            else:
                self.update_match(row['HomeTeam'], row['AwayTeam'], row['FTHG'], row['FTAG'])


In [32]:
def decode_result(result):
    if result == 1:
        return 'H'
    elif result == 0:
        return 'A'
    return 'D'
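
Combining `classify_result` and `decode_result`, the piecewise rule can be written as one standalone function. This is a simplified sketch with the same draw band logic and the default `draw_size` of 0.5. The input values are hypothetical predicted goals.

```python
def classify_result(home_goals, away_goals, draw_size=0.5):
    """Return 'H', 'A' or 'D' from predicted goals using a draw band."""
    goal_difference = home_goals - away_goals
    # Differences smaller than the band are treated as a draw
    if abs(goal_difference) < draw_size:
        return 'D'
    return 'H' if goal_difference > 0 else 'A'


print(classify_result(1.57, 0.93))  # H
print(classify_result(1.43, 1.52))  # D
print(classify_result(0.91, 2.52))  # A
```

A larger `draw_size` predicts more draws, so the parameter trades off draw recall against home and away wins.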

In [33]:
match_data = pd.read_csv(os.path.join(os.getcwd(), "output/non_shot_predictions.csv"))

In [34]:
training, test = train_test_split(match_data, test_size=0.05, shuffle=False)
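
With `shuffle=False` the split keeps the row order, so it is chronological only if the rows are already sorted by date. As a sketch on hypothetical toy data, assuming a 'Date' column in dd/mm/yyyy format as in the raw match data, the rows can be sorted explicitly before splitting:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy matches in dd/mm/yyyy format, deliberately out of order
toy = pd.DataFrame({
    'Date': ['20/08/2016', '13/08/2016', '27/08/2016', '15/08/2016'],
    'HomeTeam': ['A', 'B', 'C', 'D'],
})

# Parse with an explicit format so day and month are not swapped
toy['Date'] = pd.to_datetime(toy['Date'], format='%d/%m/%Y')
toy = toy.sort_values('Date').reset_index(drop=True)

# shuffle=False keeps row order, so the last rows form the test set
train, test = train_test_split(toy, test_size=0.25, shuffle=False)

print(test['HomeTeam'].tolist())  # ['C']
```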

In [35]:
# Train elo ratings
goal_elo = GoalElo()
goal_elo.fit(training)


In [36]:
# Predict number of goals for use in training result classifier
goal_predicted_data = goal_elo.predict_data(training)

X = np.array([goal_predicted_data['EHG'].to_numpy(), goal_predicted_data['EAG'].to_numpy()]).T
y = goal_predicted_data['FTR'].to_numpy()
# y = [decode_result(r) for r in goal_predicted_data['FTR'].to_numpy()]

# Train classifier to predict result from number of goals
goal_result_classifier = SVC(gamma='auto')
goal_result_classifier.fit(X, y)

SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)

In [37]:
# Predict number of goals scored using goal_elo
goal_prediction = goal_elo.predict_data(test)
X_test = np.array([goal_prediction['EHG'].to_numpy(), goal_prediction['EAG'].to_numpy()]).T
y_test = goal_prediction['FTR'].to_numpy()
# Predict result using SVC and elo predicted number of goals
y_pred = goal_result_classifier.predict(X_test)
# Predict result using piecewise function and elo predicted number of goals
y_pred2 = goal_prediction['ER'].to_numpy()

# Measure accuracy
print("Accuracy:")
print("DS: ", accuracy_score(y_test, y_pred2))
print("SVC: ", accuracy_score(y_test, y_pred))

# Measure f1 score
print("F1 Score:")
print("DS: ", f1_score(y_test, y_pred2, average='weighted'))
print("SVC: ", f1_score(y_test, y_pred, average='weighted'))

Accuracy:
DS:  0.43529411764705883
SVC:  0.4470588235294118
F1 Score:
DS:  0.44414539400665926
SVC:  0.35406162464985996


### Results

``<insert model tests here>``

### Final predictions

In [38]:
# Map team names to team IDs (name -> ID, the same direction as teamID_mapping)
teamname, teamID = list(teams_data['Standard teamname']), list(teams_data['TeamID'])
teamname_mapping = dict(zip(teamname, teamID))

In [39]:
# Import data to predict
final_test_data = pd.read_csv(os.path.join(os.getcwd(), 'data/epl-test.csv'))
final_test_data

Unnamed: 0,Date,HomeTeam,AwayTeam
0,16 Jan 21,Arsenal,Newcastle
1,16 Jan 21,Aston Villa,Everton
2,16 Jan 21,Fulham,Chelsea
3,16 Jan 21,Leeds,Brighton
4,16 Jan 21,Leicester,Southampton
5,16 Jan 21,Liverpool,Man United
6,16 Jan 21,Man City,Crystal Palace
7,16 Jan 21,Sheffield United,Tottenham
8,16 Jan 21,West Ham,Burnley
9,16 Jan 21,Wolves,West Brom


In [40]:
# Use elo to predict goals
for i, r in final_test_data.iterrows():
    final_test_data.at[i, 'HomeGoals'] = goal_elo.predict(teamname_mapping[r['HomeTeam']], teamname_mapping[r['AwayTeam']])
    final_test_data.at[i, 'AwayGoals'] = goal_elo.predict(teamname_mapping[r['AwayTeam']], teamname_mapping[r['HomeTeam']])

In [41]:
final_test_data

Unnamed: 0,Date,HomeTeam,AwayTeam,HomeGoals,AwayGoals
0,16 Jan 21,Arsenal,Newcastle,1.568074,0.92679
1,16 Jan 21,Aston Villa,Everton,1.4279,1.519389
2,16 Jan 21,Fulham,Chelsea,0.910163,2.518164
3,16 Jan 21,Leeds,Brighton,1.134359,0.863986
4,16 Jan 21,Leicester,Southampton,1.648038,1.410989
5,16 Jan 21,Liverpool,Man United,2.03801,1.6463
6,16 Jan 21,Man City,Crystal Palace,2.390692,0.743088
7,16 Jan 21,Sheffield United,Tottenham,0.817999,1.65452
8,16 Jan 21,West Ham,Burnley,1.526704,0.86411
9,16 Jan 21,Wolves,West Brom,1.309354,0.50499


In [42]:
goal_predictions = np.array([final_test_data['HomeGoals'].to_numpy(), final_test_data['AwayGoals'].to_numpy()]).T

In [43]:
final_test_data['FTR'] = goal_result_classifier.predict(goal_predictions)

In [44]:
final_test_data

Unnamed: 0,Date,HomeTeam,AwayTeam,HomeGoals,AwayGoals,FTR
0,16 Jan 21,Arsenal,Newcastle,1.568074,0.92679,H
1,16 Jan 21,Aston Villa,Everton,1.4279,1.519389,H
2,16 Jan 21,Fulham,Chelsea,0.910163,2.518164,A
3,16 Jan 21,Leeds,Brighton,1.134359,0.863986,H
4,16 Jan 21,Leicester,Southampton,1.648038,1.410989,H
5,16 Jan 21,Liverpool,Man United,2.03801,1.6463,H
6,16 Jan 21,Man City,Crystal Palace,2.390692,0.743088,H
7,16 Jan 21,Sheffield United,Tottenham,0.817999,1.65452,A
8,16 Jan 21,West Ham,Burnley,1.526704,0.86411,H
9,16 Jan 21,Wolves,West Brom,1.309354,0.50499,H
