# Predicting the Outcome of Football (Soccer) Games

## Background

Each year, billions of dollars enter betting markets around the world, with the lion’s share wagered on football matches. Punters look for any information that can give them an edge against the ‘house’; the source can even be an octopus ‘predicting’ the results of World Cup games (remember Paul the Octopus?). A machine learning model that predicts the outcome of football games more accurately than the industry average could prove to be a lucrative source of income.

Unsurprisingly, making accurate projections of football games and trying to beat the odds is also an extremely difficult problem. As we will see later in this proposal, the probabilities of the three outcomes (home win, draw, away win) are fairly close to each other, which complicates accurate prediction. In addition, ‘shocking’ results, where a weaker squad upsets expectations by beating a much stronger team, occur regularly and can even persist over an entire campaign, as in the 2015-16 English Premier League, when Leicester City won the championship.


## Problem Statement

In this project, I will attempt to build several models that predict the outcome of football games in the English Premier League.

Ulmer and Fernandez (2014) show in their report that the balanced distribution of outcomes leads to high entropy. For instance, in the 2010-11 season, with 35.5% home wins, 29% draws and 35.5% away wins, the entropy value was 0.72, close to the value of 1 that indicates pure randomness. This level of randomness makes the problem particularly difficult for classification models.

## Data and Input

I’ll be using two main sources of data: football-data.co.uk, which provides historical results for matches in the European leagues since the early 21st century, and clubelo.com, which calculates Elo ratings for European clubs.


In [4]:
import pandas as pd

df = pd.read_csv("https://www.football-data.co.uk/mmz4281/1819/E0.csv")
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 380 entries, 0 to 379
Data columns (total 62 columns):
Div         380 non-null object
Date        380 non-null object
HomeTeam    380 non-null object
AwayTeam    380 non-null object
FTHG        380 non-null int64
FTAG        380 non-null int64
FTR         380 non-null object
HTHG        380 non-null int64
HTAG        380 non-null int64
HTR         380 non-null object
Referee     380 non-null object
HS          380 non-null int64
AS          380 non-null int64
HST         380 non-null int64
AST         380 non-null int64
HF          380 non-null int64
AF          380 non-null int64
HC          380 non-null int64
AC          380 non-null int64
HY          380 non-null int64
AY          380 non-null int64
HR          380 non-null int64
AR          380 non-null int64
B365H       380 non-null float64
B365D       380 non-null float64
B365A       380 non-null float64
BWH         380 non-null float64
BWD         380 non-null float64
BWA       

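| | Div | Date | HomeTeam | AwayTeam | FTHG | FTAG | FTR | HTHG | HTAG | HTR | ... | BbAv<2.5 | BbAH | BbAHh | BbMxAHH | BbAvAHH | BbMxAHA | BbAvAHA | PSCH | PSCD | PSCA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | E0 | 10/08/2018 | Man United | Leicester | 2 | 1 | H | 1 | 0 | H | ... | 1.79 | 17 | -0.75 | 1.75 | 1.7 | 2.29 | 2.21 | 1.55 | 4.07 | 7.69 |
| 1 | E0 | 11/08/2018 | Bournemouth | Cardiff | 2 | 0 | H | 1 | 0 | H | ... | 1.83 | 20 | -0.75 | 2.2 | 2.13 | 1.8 | 1.75 | 1.88 | 3.61 | 4.7 |
| 2 | E0 | 11/08/2018 | Fulham | Crystal Palace | 0 | 2 | A | 0 | 1 | A | ... | 1.87 | 22 | -0.25 | 2.18 | 2.11 | 1.81 | 1.77 | 2.62 | 3.38 | 2.9 |
| 3 | E0 | 11/08/2018 | Huddersfield | Chelsea | 0 | 3 | A | 0 | 2 | A | ... | 1.84 | 23 | 1.0 | 1.84 | 1.8 | 2.13 | 2.06 | 7.24 | 3.95 | 1.58 |
| 4 | E0 | 11/08/2018 | Newcastle | Tottenham | 1 | 2 | A | 1 | 2 | A | ... | 1.81 | 20 | 0.25 | 2.2 | 2.12 | 1.8 | 1.76 | 4.74 | 3.53 | 1.89 |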


Above is a sample of the data from the football-data.co.uk database for the 2018-19 EPL season. Column descriptions can be found at https://www.football-data.co.uk/notes.txt. This structure allows us to extract features such as the day of the match, winning streaks, scoring and defending performance, and home and away performance.

In [13]:
df_elo = pd.read_csv("http://api.clubelo.com/ManUnited")
df_elo.info()
df_elo.tail(5)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5576 entries, 0 to 5575
Data columns (total 7 columns):
Rank       5576 non-null object
Club       5576 non-null object
Country    5576 non-null object
Level      5576 non-null int64
Elo        5576 non-null float64
From       5576 non-null object
To         5576 non-null object
dtypes: float64(1), int64(1), object(5)
memory usage: 305.1+ KB


| | Rank | Club | Country | Level | Elo | From | To |
|---|---|---|---|---|---|---|---|
| 5571 | 15 | Man United | ENG | 1 | 1802.880493 | 2019-12-06 | 2019-12-07 |
| 5572 | 15 | Man United | ENG | 1 | 1816.841675 | 2019-12-08 | 2019-12-12 |
| 5573 | 15 | Man United | ENG | 1 | 1816.841675 | 2019-12-13 | 2019-12-15 |
| 5574 | 15 | Man United | ENG | 1 | 1816.841675 | 2019-12-16 | 2019-12-21 |
| 5575 | 15 | Man United | ENG | 1 | 1816.841675 | 2019-12-22 | 2019-12-31 |


Above is the data taken from clubelo.com for Manchester United. The Elo rating system, developed by Arpad Elo, is commonly used to calculate the relative skill levels of chess players, but it can be applied to any game between two players with a win-draw-lose outcome structure. Because it takes the strength of the opponents into account when estimating the strength of a team, it can be used to compare the competing teams directly, which would be a very useful feature when predicting match results.

We can merge this data into the football-data set by matching each row on date and team. This gives us an estimate of each team's strength at the time of every match.
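One possible way to do this merge is an as-of join with `pandas.merge_asof`, which picks, for each match, the latest rating whose `From` date is on or before the match date. The sketch below is only an illustration under stated assumptions: the helper `attach_elo` and the output columns `HomeElo` and `AwayElo` are hypothetical names, the rows are made up, and it assumes that team names are spelled identically in both sources and that `Date` has already been parsed to datetimes (the football-data file stores it as day/month/year text).

```python
import pandas as pd

def attach_elo(matches, elo, team_col, out_col):
    """Add the latest Elo rating dated on or before each match for the team in team_col."""
    ratings = (
        elo.rename(columns={"Club": team_col, "From": "Date", "Elo": out_col})
        [[team_col, "Date", out_col]]
        .sort_values("Date")  # merge_asof requires both frames sorted on the key
    )
    return pd.merge_asof(
        matches.sort_values("Date"),
        ratings,
        on="Date",
        by=team_col,           # only match ratings of the same team
        direction="backward",  # use the most recent rating, never a future one
    )

# Illustrative rows only: not real fixtures or ratings.
matches = pd.DataFrame({
    "Date": pd.to_datetime(["2018-08-10", "2018-08-18"]),
    "HomeTeam": ["Man United", "Leicester"],
    "AwayTeam": ["Leicester", "Man United"],
})
elo = pd.DataFrame({
    "Club": ["Man United", "Man United", "Leicester"],
    "From": pd.to_datetime(["2018-08-01", "2018-08-11", "2018-08-01"]),
    "Elo": [1850.0, 1855.0, 1700.0],
})

with_home = attach_elo(matches, elo, "HomeTeam", "HomeElo")
with_both = attach_elo(with_home, elo, "AwayTeam", "AwayElo")
print(with_both)
```

In practice the per-club Elo histories would first be concatenated into one table, and any club names that differ between the two sources would need a mapping before the join.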

In [18]:
import geocoder
man_united = geocoder.osm("manchester united stadium")
print(man_united.lat, man_united.lng)

53.46329545 -2.2900377499127


A third source of data is OpenStreetMap, accessed through the geocoder package. We can retrieve the coordinates of each team's stadium and then calculate the distance between stadiums to estimate the travel distance for the away team. The same coordinates can help detect teams from the same city, which might indicate a 'derby' or a game with special significance.
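As a rough sketch of the distance calculation, the great-circle distance between two coordinate pairs can be computed with the haversine formula. The function name `haversine_km`, the Earth radius constant, and the second coordinate pair are assumptions for illustration; only the first pair comes from the geocoder output above. The result is a straight-line estimate and not an actual travel route.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# First pair: coordinates returned by geocoder above.
# Second pair: a hypothetical away-team stadium, for illustration only.
home = (53.46329545, -2.2900377499127)
away = (51.5, -0.12)

distance = haversine_km(*home, *away)
log_distance = math.log1p(distance)  # log1p keeps a zero distance well defined
print(round(distance, 1), round(log_distance, 3))
```

A small distance threshold on this value could serve as a first approximation for flagging same-city games, although the right threshold is a judgment call.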

## Solution Statement

As each game has 3 possible outcomes (home win, draw, away win), we need to employ multinomial classification algorithms to predict the results.

Tax and Joustra (2015) have an excellent paper in which they build multinomial classification models for the same problem on Dutch football games in the Eredivisie. Their choice of models includes Naive Bayes, a multilayer perceptron and Random Forest. I will follow their steps, deviating only in adding another ensemble model, as I believe Extreme Gradient Boosting (XGBoost) can be a better choice than Random Forest. Principal Component Analysis (PCA) will also be employed for dimensionality reduction before the Naive Bayes and perceptron models. This is not required for XGBoost, as tree-based boosting selects features implicitly when choosing splits.

Due to the temporal nature of the data, standard cross-validation with random splits is not an appropriate technique for validating the models, since it would let information from later matches leak into training. Instead, I will use data from the 2015 and 2016 seasons to validate the models and the 2017 and 2018 seasons to test them, while all seasons with complete data prior to 2015 will be used to train the models.
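A minimal sketch of this season-based split follows. It assumes a hypothetical integer `Season` column (for example, the year in which a season starts) has been added to the consolidated dataframe; that column and the helper name `split_by_season` are not part of the source data.

```python
import pandas as pd

def split_by_season(df, season_col="Season"):
    """Split matches chronologically into train, validation and test sets."""
    train = df[df[season_col] < 2015]
    val = df[df[season_col].isin([2015, 2016])]
    test = df[df[season_col].isin([2017, 2018])]
    return train, val, test

# Toy frame with one row per season, for illustration only.
toy = pd.DataFrame({"Season": [2012, 2013, 2014, 2015, 2016, 2017, 2018]})
train, val, test = split_by_season(toy)
print(len(train), len(val), len(test))
```

Because the split is driven purely by season, no match in the validation or test sets can precede a match in the training set.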

## Benchmark Model

Ulmer and Fernandez (2014) try to solve the same problem using data from the seasons between 2002 and 2011 to train the model and the 2012 and 2013 seasons to test it. Their evaluation metric is the error rate on the test dataset, which is the percentage of incorrectly predicted outcomes. The lowest error rate they could achieve was 48%, which means an accuracy score of 52%. This was achieved by a linear Stochastic Gradient Descent (SGD) model that did not predict draws. The second-best model, a Random Forest that did predict draws, had an error rate of 50%. My aim in this work is to obtain an error rate lower than 48% (or an accuracy score higher than 52%). Both the linear SGD and Random Forest models will be employed as benchmarks.

Accuracy (or its complement, the error rate) was the evaluation metric in both studies that this project follows (Ulmer and Fernandez 2014, Tax and Joustra 2015). As the classes are relatively balanced and the aim of the project is to predict outcomes correctly to increase betting profits, accuracy is a useful and easily interpreted evaluation metric for this work.
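To make the relationship between the two metrics concrete, here is a small sketch that computes the error rate as the share of mismatched predictions and the accuracy as its complement. The helper `error_rate` and the label lists are made up for illustration.

```python
import numpy as np

def error_rate(y_true, y_pred):
    """Fraction of predictions that differ from the true outcome."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(y_true != y_pred))

# Made-up outcomes: H = home win, D = draw, A = away win.
y_true = ["H", "D", "A", "H", "A"]
y_pred = ["H", "H", "A", "A", "A"]

err = error_rate(y_true, y_pred)  # 2 of the 5 predictions are wrong
acc = 1 - err
print(err, acc)
```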

## Project Design

The first step in this project will be to consolidate all data from the three main data sources. This will be done using the pandas and geocoder libraries. Once we have a dataframe with our full dataset, I'll proceed to feature engineering.

New features will include overall and home winning streaks for each team, performance in the last game as well as in the past X games (X to be defined later), goals scored and conceded in past games, and the log of the estimated distance between the home and away teams' stadiums. The distance can be estimated using the haversine formula (https://stackoverflow.com/questions/4913349/haversine-formula-in-python-bearing-and-distance-between-two-gps-points). NumPy will be a useful library during feature engineering.
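As one illustration of a "past X games" feature, the sketch below reshapes matches into one row per team per game and sums the points earned over the previous `window` games. The helper `team_form`, the `Form` column, and the 3/1/0 points scheme are assumptions made for this example; the match rows are invented. Shifting by one game before the rolling sum keeps the current result out of its own feature.

```python
import pandas as pd

def team_form(matches, window=3):
    """Return one row per team per match with points from the previous `window` games."""
    home = pd.DataFrame({
        "Date": matches["Date"],
        "Team": matches["HomeTeam"],
        "Points": matches["FTR"].map({"H": 3, "D": 1, "A": 0}),
    })
    away = pd.DataFrame({
        "Date": matches["Date"],
        "Team": matches["AwayTeam"],
        "Points": matches["FTR"].map({"A": 3, "D": 1, "H": 0}),
    })
    long = pd.concat([home, away], ignore_index=True).sort_values(["Team", "Date"])
    # shift(1) excludes the current game; the first game of a team has no history (NaN).
    long["Form"] = long.groupby("Team")["Points"].transform(
        lambda s: s.shift(1).rolling(window, min_periods=1).sum()
    )
    return long

# Invented results between two placeholder teams.
matches = pd.DataFrame({
    "Date": pd.to_datetime(["2018-08-01", "2018-08-08", "2018-08-15", "2018-08-22"]),
    "HomeTeam": ["A", "B", "A", "B"],
    "AwayTeam": ["B", "A", "B", "A"],
    "FTR": ["H", "D", "A", "A"],
})

form = team_form(matches, window=2)
print(form)
```

The same pattern extends to goals scored and conceded by swapping the `Points` column for `FTHG` and `FTAG`.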

Once all features are created, I will build a pipeline that scales the data and, if required, performs dimensionality reduction with PCA, before moving on to parameter tuning. The PCA settings can also be tuned within the pipeline. The scaling and PCA methods will be imported from sklearn.
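A pipeline of this kind might look like the following sketch, which chains scaling, PCA and a Gaussian Naive Bayes classifier and tunes the number of principal components against a held-out validation set. The random data, the candidate component counts and the step names are placeholders chosen for illustration, not results or settings from this project.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Random placeholder data standing in for the engineered features.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 10))
y_train = rng.choice(["H", "D", "A"], size=200)
X_val = rng.normal(size=(60, 10))
y_val = rng.choice(["H", "D", "A"], size=60)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA()),
    ("clf", GaussianNB()),
])

# Tune the PCA step by validation accuracy instead of shuffled cross-validation.
scores = {}
for n_components in (2, 5, 8):
    pipe.set_params(pca__n_components=n_components)
    pipe.fit(X_train, y_train)
    scores[n_components] = pipe.score(X_val, y_val)

best_n = max(scores, key=scores.get)
print(scores, best_n)
```

Keeping the scaler and PCA inside the pipeline means they are fitted on the training seasons only, so no statistics from the validation seasons leak into the transformation.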

The following models will be used:

- Linear SGD (sklearn) - Benchmark
- Random Forests (sklearn) - Benchmark
- Naive Bayes (sklearn)
- Multilayer Perceptron (Keras)
- Extreme Gradient Boosting (xgboost)

As mentioned earlier, data from the 2015 and 2016 seasons will form the validation set, the 2017 and 2018 seasons will form the test set, and seasons prior to 2015 will be used to train these models. Accuracy will be the main evaluation metric.

## References

Tax, Niek, and Yme Joustra. "Predicting the Dutch football competition using public data: A machine learning approach." Transactions on knowledge and data engineering 10.10 (2015): 1-13.

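Ulmer, Ben, and Matt Fernandez. "Predicting Soccer Results in the English Premier League." CS229 project report, Stanford University (2014). http://cs229.stanford.edu/proj2014/Ben%20Ulmer,%20Matt%20Fernandez,%20Predicting%20Soccer%20Results%20in%20the%20English%20Premier%20League.pdf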
