# Forecasting English Premier League Results

In the previous notebook, we saw how to fit a simple model to account for the variation in game results. Our simple model explained the variation of results as a function of the ratio of TM values for the two teams, and home advantage. We found that this model came close to approximating the reliability of bookmakers in predicting results. 

Now we want to test the reliability of our model as a *forecasting* model. From a statistical point of view the difference is this. With the previous model we assessed predictions "within sample", meaning that the same data used to estimate the regression model was also used to test its accuracy. Now we want to test accuracy "out-of-sample": we estimate the model on one dataset and then test how well it predicts results in another.

Forecasting events that have not yet happened is by definition out-of-sample forecasting. We are going to use the Premier League data for the season 2019/20. Due to Covid-19, the Premier League season was suspended on March 13, 2020, leaving 92 of the 380 games unplayed. At the time of writing (April 8, 2020), it is not clear whether it will ever be possible to complete the season, although the League still holds out some hope that it might be.

We will take the following steps:

1. Use data on the first 198 games played (all games played in 2019) to estimate our regression model.
2. Use the coefficients from this model together with the TM values for the clubs to forecast the outcomes of the 90 games played in 2020.
3. Compare this out-of-sample forecasting performance to the betting odds and also to the weekly forecasts of the popular data analytics website, Nate Silver's FiveThirtyEight.

Note that because TM values are typically available in advance of any game being played, you can apply the model described here to real-time game forecasting, not just for the Premier League, but for any league covered by TransferMarkt.

In [1]:
# This allows us to show the full screen width

from IPython.display import display, HTML

display(HTML(data="""
<style>
    div#notebook-container    { width: 95%; }
    div#menubar-container     { width: 65%; }
    div#maintoolbar-container { width: 99%; }
</style>
"""))

In [2]:
# import the packages we need

import pandas as pd
import numpy as np

In [3]:
# load the data for the 2019/20 season. This includes 380 scheduled games, of which only 288 had been played before the suspension of the season on March 13, 2020

season19_20 = pd.read_excel("../../Data/Week 3/EPL19-20.xlsx")
season19_20

Unnamed: 0,date,Home team,away team,notplayed,month,day,year,FTHG,FTAG,FTR,B365H,B365D,B365A,TMhome,TMaway,538hpr,538dpr,538apr
0,2019-10-08 00:00:00,AFC Bournemouth,Sheffield United,0,8.0,10.0,2019.0,1.0,1.0,D,1.95,3.60,3.60,281.70,62.33,0.55,0.21,0.24
1,2019-10-08 00:00:00,Burnley,Southampton,0,8.0,10.0,2019.0,3.0,0.0,H,2.62,3.20,2.75,180.68,209.70,0.45,0.30,0.26
2,2019-10-08 00:00:00,Crystal Palace,Everton,0,8.0,10.0,2019.0,0.0,0.0,D,3.00,3.25,2.37,207.50,457.20,0.36,0.38,0.26
3,2019-10-08 00:00:00,Tottenham Hotspur,Aston Villa,0,8.0,10.0,2019.0,3.0,1.0,H,1.30,5.25,10.00,881.55,140.40,0.73,0.09,0.18
4,2019-10-08 00:00:00,Watford,Brighton and Hove Albion,0,8.0,10.0,2019.0,0.0,3.0,A,1.90,3.40,4.00,214.52,180.99,0.51,0.24,0.25
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
375,2020-05-17 00:00:00,Leicester City,Manchester United,1,,,,,,,,,,343.13,644.63,,,
376,2020-05-17 00:00:00,Manchester City,Norwich City,1,,,,,,,,,,1140.00,81.54,,,
377,2020-05-17 00:00:00,Newcastle United,Liverpool,1,,,,,,,,,,225.97,959.18,,,
378,2020-05-17 00:00:00,Southampton,Sheffield United,1,,,,,,,,,,209.70,62.33,,,


In [4]:
season19_20.describe()

Unnamed: 0,notplayed,month,day,year,FTHG,FTAG,B365H,B365D,B365A,TMhome,TMaway,538hpr,538dpr,538apr
count,380.0,288.0,288.0,288.0,288.0,288.0,288.0,288.0,288.0,380.0,380.0,288.0,288.0,288.0
mean,0.242105,7.614583,16.027778,2019.3125,1.506944,1.215278,2.867257,4.319618,4.77441,402.7455,402.7455,0.464549,0.268993,0.266771
std,0.428922,4.193166,9.400085,0.464319,1.19229,1.207775,2.277826,1.514852,4.165434,303.061162,303.061162,0.190143,0.134825,0.13191
min,0.0,1.0,1.0,2019.0,0.0,0.0,1.07,3.1,1.14,62.33,62.33,0.07,0.01,0.03
25%,0.0,2.0,8.0,2019.0,1.0,0.0,1.6075,3.4,2.3,200.8725,200.8725,0.3475,0.2,0.2
50%,0.0,9.0,17.5,2019.0,1.0,1.0,2.2,3.75,3.35,279.34,279.34,0.45,0.25,0.25
75%,0.0,11.0,24.0,2020.0,2.0,2.0,3.1,4.75,5.75,588.9425,588.9425,0.5825,0.29,0.29
max,1.0,12.0,31.0,2020.0,8.0,9.0,15.0,13.0,26.0,1140.0,1140.0,0.92,0.79,0.75


We now create a variable 'B365res' which is the predicted result judged by the outcome with the highest probability as implied by the B365 betting odds. The odds here are decimal odds, which means that the lowest value is the result with the highest probability based on the odds. 

Note that we identify the outcome using inequalities, e.g. if the decimal odds of a home win are smaller than the odds of a draw, and the home-win odds are also smaller than the odds of an away win, then we say that the predicted outcome is a home win. But what if the odds of a home win were smaller than the odds of a draw, but equal to the odds of an away win? Our code gives no rule for resolving that tie, so our formula would return an empty string for that row. In our data there are no examples of this, and in general it is rare, but if we came across it we would have to decide what to do. If there were many cases we would have to create a separate category of outcome for assessing the reliability of the odds. If there were a small number of cases we could randomly resolve the tie in favour of one outcome or the other. This problem should never arise with our regression model, since the probabilities are estimated to many decimal places and the chance of two outcomes having identical probabilities is almost zero.

In [5]:
season19_20['B365res']= np.where((season19_20['B365H']<season19_20['B365D']) & (season19_20['B365H']<season19_20['B365A']),'H',\
                            np.where((season19_20['B365D']<season19_20['B365H']) & (season19_20['B365D']<season19_20['B365A']),'D',\
                                    np.where((season19_20['B365A']<season19_20['B365H']) & (season19_20['B365A']<season19_20['B365D']),'A',"")))
pd.set_option('display.max_rows', 400)
season19_20

Unnamed: 0,date,Home team,away team,notplayed,month,day,year,FTHG,FTAG,FTR,B365H,B365D,B365A,TMhome,TMaway,538hpr,538dpr,538apr,B365res
0,2019-10-08 00:00:00,AFC Bournemouth,Sheffield United,0,8.0,10.0,2019.0,1.0,1.0,D,1.95,3.6,3.6,281.7,62.33,0.55,0.21,0.24,H
1,2019-10-08 00:00:00,Burnley,Southampton,0,8.0,10.0,2019.0,3.0,0.0,H,2.62,3.2,2.75,180.68,209.7,0.45,0.3,0.26,H
2,2019-10-08 00:00:00,Crystal Palace,Everton,0,8.0,10.0,2019.0,0.0,0.0,D,3.0,3.25,2.37,207.5,457.2,0.36,0.38,0.26,A
3,2019-10-08 00:00:00,Tottenham Hotspur,Aston Villa,0,8.0,10.0,2019.0,3.0,1.0,H,1.3,5.25,10.0,881.55,140.4,0.73,0.09,0.18,H
4,2019-10-08 00:00:00,Watford,Brighton and Hove Albion,0,8.0,10.0,2019.0,0.0,3.0,A,1.9,3.4,4.0,214.52,180.99,0.51,0.24,0.25,H
5,2019-10-08 00:00:00,West Ham United,Manchester City,0,8.0,10.0,2019.0,0.0,5.0,A,12.0,6.5,1.22,299.03,1140.0,0.1,0.74,0.16,A
6,2019-11-08 00:00:00,Leicester City,Wolverhampton Wanderers,0,8.0,11.0,2019.0,0.0,0.0,D,2.2,3.2,3.4,343.13,276.98,0.47,0.26,0.27,H
7,2019-11-08 00:00:00,Manchester United,Chelsea,0,8.0,11.0,2019.0,4.0,0.0,H,2.1,3.3,3.5,644.63,697.5,0.37,0.38,0.25,H
8,2019-11-08 00:00:00,Newcastle United,Arsenal,0,8.0,11.0,2019.0,0.0,1.0,A,4.5,3.75,1.72,225.97,570.38,0.31,0.45,0.25,A
9,17/08/2019,Arsenal,Burnley,0,8.0,17.0,2019.0,2.0,1.0,H,1.3,5.5,10.0,570.38,180.68,0.64,0.16,0.21,H
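The tie case discussed above can be made explicit. The following is a minimal sketch, not part of the original analysis: the small `odds` table and the `favourite` helper are hypothetical names introduced for illustration. It labels a row `'tie'` whenever the lowest decimal odds are shared by more than one outcome, instead of silently returning an empty string.

```python
import pandas as pd

# Hypothetical odds for three games; the third has a tie for the lowest odds.
odds = pd.DataFrame({
    "B365H": [1.95, 3.00, 2.60],
    "B365D": [3.60, 3.25, 3.40],
    "B365A": [3.60, 2.37, 2.60],
})

def favourite(row):
    """Return 'H', 'D' or 'A' for a unique favourite, or 'tie' otherwise."""
    lowest = row.min()
    # The last character of each column name is the outcome label.
    winners = [col[-1] for col in row.index if row[col] == lowest]
    return winners[0] if len(winners) == 1 else "tie"

odds["B365res"] = odds.apply(favourite, axis=1)
print(odds)
```

Counting the `'tie'` rows afterwards (for example with `value_counts()`) would show whether ties are rare enough to resolve at random or common enough to need their own category.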


One way to check the reliability of the betting odds is to use pd.crosstab() to show counts of the actual results against the predicted results. We show this below.

What jumps out at you from this crosstab is that the bookmaker odds *never* predict a draw! Later on, you will find this is true of our model, and also true of the forecasting model published by Nate Silver's FiveThirtyEight. But draws are very common in soccer - in any league they typically account for around a quarter of all results! We will discuss this finding again when we look at the forecasts in more detail below. 

In [6]:
pd.crosstab(season19_20['FTR'], season19_20['B365res'],dropna= True)

B365res,A,H
FTR,,
A,49,38
D,20,52
H,26,103


## Estimating our regression model

Though not all of the games for the season have been played, we have TM values for all the teams. These are the values published just before the season started. TransferMarkt updates its TM values for the Premier League several times a year, and it might be possible to improve the accuracy of forecasts by always choosing the most recent TM values, but for the purposes of showing how to apply the forecast model, the values from the beginning of the season will be good enough.

The TM values for the home team and away team are loaded into the df, so we only need to create the log of the ratio of home team TM value to away team TM value.

In [7]:
# create the log of TM ratios 
season19_20['lhTMratio'] = np.log(season19_20['TMhome']/season19_20['TMaway'])
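As a quick check on what this variable measures, the sketch below reproduces the value for one fixture by hand, using the TM values shown in the data for AFC Bournemouth (home) and Sheffield United (away). The variable names are illustrative only. It also shows why the log is convenient: swapping home and away simply flips the sign, and two equally valued teams give a value of zero.

```python
import numpy as np

# TM values taken from the first row of the data shown earlier
tm_home, tm_away = 281.70, 62.33

ratio = tm_home / tm_away
log_ratio = np.log(ratio)              # natural log, as in the notebook
reverse = np.log(tm_away / tm_home)    # same fixture with the venues swapped

print(f"ratio = {ratio:.3f}, log ratio = {log_ratio:.4f}, reversed = {reverse:.4f}")
```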

Now we create a numerical value for the actual outcome of the game (FTR) where H = 2, D = 1 and A = 0.

In [8]:
# create a value = 2 if homewin, 1 if draw, 0 if awaywin
season19_20['winvalue'] = np.where(season19_20['FTR'] == 'H', 2 ,(np.where(season19_20['FTR'] == 'D', 1, 0)))
season19_20

Unnamed: 0,date,Home team,away team,notplayed,month,day,year,FTHG,FTAG,FTR,...,B365D,B365A,TMhome,TMaway,538hpr,538dpr,538apr,B365res,lhTMratio,winvalue
0,2019-10-08 00:00:00,AFC Bournemouth,Sheffield United,0,8.0,10.0,2019.0,1.0,1.0,D,...,3.6,3.6,281.7,62.33,0.55,0.21,0.24,H,1.5084,1
1,2019-10-08 00:00:00,Burnley,Southampton,0,8.0,10.0,2019.0,3.0,0.0,H,...,3.2,2.75,180.68,209.7,0.45,0.3,0.26,H,-0.14895,2
2,2019-10-08 00:00:00,Crystal Palace,Everton,0,8.0,10.0,2019.0,0.0,0.0,D,...,3.25,2.37,207.5,457.2,0.36,0.38,0.26,A,-0.78999,1
3,2019-10-08 00:00:00,Tottenham Hotspur,Aston Villa,0,8.0,10.0,2019.0,3.0,1.0,H,...,5.25,10.0,881.55,140.4,0.73,0.09,0.18,H,1.837186,2
4,2019-10-08 00:00:00,Watford,Brighton and Hove Albion,0,8.0,10.0,2019.0,0.0,3.0,A,...,3.4,4.0,214.52,180.99,0.51,0.24,0.25,H,0.169961,0
5,2019-10-08 00:00:00,West Ham United,Manchester City,0,8.0,10.0,2019.0,0.0,5.0,A,...,6.5,1.22,299.03,1140.0,0.1,0.74,0.16,A,-1.33824,0
6,2019-11-08 00:00:00,Leicester City,Wolverhampton Wanderers,0,8.0,11.0,2019.0,0.0,0.0,D,...,3.2,3.4,343.13,276.98,0.47,0.26,0.27,H,0.214164,1
7,2019-11-08 00:00:00,Manchester United,Chelsea,0,8.0,11.0,2019.0,4.0,0.0,H,...,3.3,3.5,644.63,697.5,0.37,0.38,0.25,H,-0.078826,2
8,2019-11-08 00:00:00,Newcastle United,Arsenal,0,8.0,11.0,2019.0,0.0,1.0,A,...,3.75,1.72,225.97,570.38,0.31,0.45,0.25,A,-0.925901,0
9,17/08/2019,Arsenal,Burnley,0,8.0,17.0,2019.0,2.0,1.0,H,...,5.5,10.0,570.38,180.68,0.64,0.16,0.21,H,1.149575,2


We are going to use the games from the season played in the calendar year of 2019 to estimate our regression model (198 games), and then use the estimated coefficients from that model to forecast the games played in 2020 (90 games). The split here is arbitrary - we could have chosen the balance differently. We should expect the regression model to improve as we increase the number of games included, but also to deteriorate as the time elapsed between the games used for estimation and the games forecast increases (since the earlier information becomes out of date). Viewed as a forecasting model, however, the point is that from the end of December 2019 it would have been possible to use the model to forecast the outcome of games played in 2020, *before* the games had actually been played.

We first define our subset of games for estimation:

In [9]:
season19 = season19_20[:198]
season19

Unnamed: 0,date,Home team,away team,notplayed,month,day,year,FTHG,FTAG,FTR,...,B365D,B365A,TMhome,TMaway,538hpr,538dpr,538apr,B365res,lhTMratio,winvalue
0,2019-10-08 00:00:00,AFC Bournemouth,Sheffield United,0,8.0,10.0,2019.0,1.0,1.0,D,...,3.6,3.6,281.7,62.33,0.55,0.21,0.24,H,1.5084,1
1,2019-10-08 00:00:00,Burnley,Southampton,0,8.0,10.0,2019.0,3.0,0.0,H,...,3.2,2.75,180.68,209.7,0.45,0.3,0.26,H,-0.14895,2
2,2019-10-08 00:00:00,Crystal Palace,Everton,0,8.0,10.0,2019.0,0.0,0.0,D,...,3.25,2.37,207.5,457.2,0.36,0.38,0.26,A,-0.78999,1
3,2019-10-08 00:00:00,Tottenham Hotspur,Aston Villa,0,8.0,10.0,2019.0,3.0,1.0,H,...,5.25,10.0,881.55,140.4,0.73,0.09,0.18,H,1.837186,2
4,2019-10-08 00:00:00,Watford,Brighton and Hove Albion,0,8.0,10.0,2019.0,0.0,3.0,A,...,3.4,4.0,214.52,180.99,0.51,0.24,0.25,H,0.169961,0
5,2019-10-08 00:00:00,West Ham United,Manchester City,0,8.0,10.0,2019.0,0.0,5.0,A,...,6.5,1.22,299.03,1140.0,0.1,0.74,0.16,A,-1.33824,0
6,2019-11-08 00:00:00,Leicester City,Wolverhampton Wanderers,0,8.0,11.0,2019.0,0.0,0.0,D,...,3.2,3.4,343.13,276.98,0.47,0.26,0.27,H,0.214164,1
7,2019-11-08 00:00:00,Manchester United,Chelsea,0,8.0,11.0,2019.0,4.0,0.0,H,...,3.3,3.5,644.63,697.5,0.37,0.38,0.25,H,-0.078826,2
8,2019-11-08 00:00:00,Newcastle United,Arsenal,0,8.0,11.0,2019.0,0.0,1.0,A,...,3.75,1.72,225.97,570.38,0.31,0.45,0.25,A,-0.925901,0
9,17/08/2019,Arsenal,Burnley,0,8.0,17.0,2019.0,2.0,1.0,H,...,5.5,10.0,570.38,180.68,0.64,0.16,0.21,H,1.149575,2
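Positional slicing relies on the rows being sorted by date. An alternative, sketched below on a toy frame, is to split on the `year` and `notplayed` columns that the data already contains. The `games`, `train` and `test` names are assumptions made for this illustration, and the toy values are not taken from the real file.

```python
import numpy as np
import pandas as pd

# Toy stand-in for season19_20 with the same 'year' and 'notplayed' columns.
games = pd.DataFrame({
    "year": [2019.0, 2019.0, 2020.0, 2020.0, np.nan],
    "notplayed": [0, 0, 0, 0, 1],
    "FTR": ["H", "D", "A", "H", np.nan],
})

played = games[games["notplayed"] == 0]
train = played[played["year"] == 2019].copy()   # estimation sample
test = played[played["year"] == 2020].copy()    # out-of-sample games to forecast

print(len(train), len(test))
```

On the real data this should select the same 198 and 90 games as the slices, assuming the file is ordered chronologically; the advantage is that the split no longer depends on row positions.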


Once again we are going to use an ordered logistic regression to estimate the results. First, we import the package for running the ordered logit model.

In [10]:
from bevel.linear_ordinal_regression import OrderedLogit
ol = OrderedLogit()

Now we run the ordered logit regression of game outcome (winvalue) on the log TM ratio (lhTMratio):

In [11]:
ol.fit(season19['lhTMratio'], season19['winvalue'])
ol.print_summary()

n=198
                  beta  se(beta)      p  lower 0.95  upper 0.95     
attribute names                                                     
column_1        0.5553    0.1298 0.0000      0.3009      0.8096  ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Somers' D = 0.252


We can see that the coefficient of lhTMratio is 0.555 and the standard error is 0.1298, giving a test statistic of about 4.3, so the coefficient is statistically significant at the 0.1% level (three stars in the summary). The higher the ratio, the better the expected outcome from the home team's point of view.
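The significance claim can be checked by hand from the two numbers in the summary. The sketch below assumes a standard normal approximation for the test statistic, which is an assumption made here rather than something stated in the package output, and it recomputes the 95% interval for comparison with the `lower 0.95` and `upper 0.95` columns.

```python
from math import erfc, sqrt

beta, se = 0.5553, 0.1298   # values from the summary table above

z = beta / se
# Two-sided p-value under a standard normal approximation (assumed here)
p_value = erfc(abs(z) / sqrt(2))
# 95% confidence interval: estimate plus or minus 1.96 standard errors
lower, upper = beta - 1.96 * se, beta + 1.96 * se

print(f"z = {z:.2f}, p = {p_value:.6f}, 95% CI = ({lower:.4f}, {upper:.4f})")
```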

Recall that in our model for the seasons 2010/11 to 2018/19 we found that the coefficient of lhTMratio was 0.528 and the standard error was 0.0323. The coefficient estimate is quite similar, though the earlier estimate was considerably more precise (its standard error is about a quarter the size), which can be attributed to the larger sample size. Indeed, we could use those estimates to forecast all of the 2019/20 games. In practice, the ideal procedure would be to use coefficients from the previous season to forecast results at the beginning of a new season, and then switch to estimates based on the current season once enough games have been played (e.g. 100 or more). Of course, in each case, the most recent TM values should be used for forecasting.

We want to convert our estimates into probabilities, and to do this we need the estimates of the intercepts as well. The coefficient for lhTMratio is stored as ol.coef_[0], the threshold between A and D is stored as ol.coef_[1], and the threshold between D and H is stored as ol.coef_[2]. We can print these out with appropriate names:

In [12]:
#%% To get the coefficients and the intercepts
print(f'beta = {ol.coef_[0]:.4f}')
print(f'interceptAD = {ol.coef_[1]:.4f}')
print(f'interceptDH = {ol.coef_[2]:.4f}')

beta = 0.5553
interceptAD = -0.7678
interceptDH = 0.3216


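To generate the forecast probabilities we need to manipulate the coefficients. The logit regression equation has the form log(p/(1-p)) = a + bX. By rearranging this equation we can obtain p = 1/(1 + exp(-(a + bX))). In the ordered logit, p is a cumulative probability (the chance of an outcome at or below a given threshold), and the linear term is the threshold minus bX, which is the form used in the code for this step.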

For each game, we know X (lhTMratio) and now we know b (0.555). Since this is an ordered logit with three possible outcomes, the probability of the worst outcome A (viewed by the home team) depends on intercept_AD (ol.coef_[1]), the probability of the middle outcome D depends on intercept_AD and intercept_DH (ol.coef_[1] and ol.coef_[2]), while the probability of the best outcome depends on intercept_DH.

If we calculate the probability of A first, using ol.coef_[0] and ol.coef_[1], then when calculating the probability of D we can use the fact that we already have the probability of A, and also when calculating the probability of H we can use the fact that we already have the probability of D.
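Written out step by step, the rearrangement of the logit equation is as follows. The symbols $c_{AD}$ and $c_{DH}$ are shorthand introduced here for the two thresholds (ol.coef_[1] and ol.coef_[2]), and $b$ for the slope (ol.coef_[0]); the "threshold minus $bX$" form matches the prediction code used in this notebook.

$$
\log\frac{p}{1-p} = a + bX
\;\Longrightarrow\;
\frac{p}{1-p} = e^{a+bX}
\;\Longrightarrow\;
p = \frac{e^{a+bX}}{1+e^{a+bX}} = \frac{1}{1+e^{-(a+bX)}}
$$

$$
\begin{aligned}
P(A) &= \frac{1}{1+e^{-(c_{AD} - bX)}} \\
P(D) &= \frac{1}{1+e^{-(c_{DH} - bX)}} - P(A) \\
P(H) &= 1 - P(A) - P(D)
\end{aligned}
$$

Because $c_{AD} < c_{DH}$, the first term in $P(D)$ is always larger than $P(A)$, so all three probabilities are positive and sum to one.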

Thus, we now create the predicted values of the H, A and D probabilities from our model. We can create a prediction for every game in the season, so we now apply the formulas to the season19_20 df, not the season19 df that we used to generate the regression estimates:

In [13]:
# Predicted probabilities

season19_20['predA'] = 1/(1+np.exp(-(ol.coef_[1]-ol.coef_[0]*season19_20['lhTMratio'])))
season19_20['predD'] = 1/(1+np.exp(-(ol.coef_[2]-ol.coef_[0]*season19_20['lhTMratio']))) - season19_20['predA']
season19_20['predH'] = 1 - season19_20['predA'] - season19_20['predD']

pd.set_option('display.max_columns', 50)
season19_20

Unnamed: 0,date,Home team,away team,notplayed,month,day,year,FTHG,FTAG,FTR,B365H,B365D,B365A,TMhome,TMaway,538hpr,538dpr,538apr,B365res,lhTMratio,winvalue,predA,predD,predH
0,2019-10-08 00:00:00,AFC Bournemouth,Sheffield United,0,8.0,10.0,2019.0,1.0,1.0,D,1.95,3.6,3.6,281.7,62.33,0.55,0.21,0.24,H,1.5084,1,0.167224,0.206556,0.626221
1,2019-10-08 00:00:00,Burnley,Southampton,0,8.0,10.0,2019.0,3.0,0.0,H,2.62,3.2,2.75,180.68,209.7,0.45,0.3,0.26,H,-0.14895,2,0.335116,0.264595,0.400289
2,2019-10-08 00:00:00,Crystal Palace,Everton,0,8.0,10.0,2019.0,0.0,0.0,D,3.0,3.25,2.37,207.5,457.2,0.36,0.38,0.26,A,-0.78999,1,0.418441,0.26296,0.318599
3,2019-10-08 00:00:00,Tottenham Hotspur,Aston Villa,0,8.0,10.0,2019.0,3.0,1.0,H,1.3,5.25,10.0,881.55,140.4,0.73,0.09,0.18,H,1.837186,2,0.143318,0.188803,0.667879
4,2019-10-08 00:00:00,Watford,Brighton and Hove Albion,0,8.0,10.0,2019.0,0.0,3.0,A,1.9,3.4,4.0,214.52,180.99,0.51,0.24,0.25,H,0.169961,0,0.296876,0.259675,0.443449
5,2019-10-08 00:00:00,West Ham United,Manchester City,0,8.0,10.0,2019.0,0.0,5.0,A,12.0,6.5,1.22,299.03,1140.0,0.1,0.74,0.16,A,-1.33824,0,0.493815,0.249764,0.256421
6,2019-11-08 00:00:00,Leicester City,Wolverhampton Wanderers,0,8.0,11.0,2019.0,0.0,0.0,D,2.2,3.2,3.4,343.13,276.98,0.47,0.26,0.27,H,0.214164,1,0.291778,0.258707,0.449515
7,2019-11-08 00:00:00,Manchester United,Chelsea,0,8.0,11.0,2019.0,4.0,0.0,H,2.1,3.3,3.5,644.63,697.5,0.37,0.38,0.25,H,-0.078826,2,0.326497,0.263831,0.409672
8,2019-11-08 00:00:00,Newcastle United,Arsenal,0,8.0,11.0,2019.0,0.0,1.0,A,4.5,3.75,1.72,225.97,570.38,0.31,0.45,0.25,A,-0.925901,0,0.436911,0.260645,0.302444
9,17/08/2019,Arsenal,Burnley,0,8.0,17.0,2019.0,2.0,1.0,H,1.3,5.5,10.0,570.38,180.68,0.64,0.16,0.21,H,1.149575,2,0.196837,0.224622,0.578541
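The same calculation can be wrapped in a small function and checked against the table above. This is a sketch only: `ordered_logit_probs` is a hypothetical helper, and it uses the rounded coefficients printed earlier, so the results agree with the table to about three decimal places rather than exactly.

```python
import numpy as np

def ordered_logit_probs(x, beta, c_ad, c_dh):
    """Return (P(A), P(D), P(H)) for a log TM ratio x."""
    cum_a = 1 / (1 + np.exp(-(c_ad - beta * x)))   # P(away win)
    cum_d = 1 / (1 + np.exp(-(c_dh - beta * x)))   # P(away win or draw)
    return cum_a, cum_d - cum_a, 1 - cum_d

# Rounded estimates printed earlier in the notebook
beta, c_ad, c_dh = 0.5553, -0.7678, 0.3216

# AFC Bournemouth vs Sheffield United: lhTMratio = 1.5084
p_a, p_d, p_h = ordered_logit_probs(1.5084, beta, c_ad, c_dh)
print(f"A = {p_a:.3f}, D = {p_d:.3f}, H = {p_h:.3f}")

# Two equally valued teams (log ratio of zero)
even_a, even_d, even_h = ordered_logit_probs(0.0, beta, c_ad, c_dh)
print(f"A = {even_a:.3f}, D = {even_d:.3f}, H = {even_h:.3f}")
```

The second call is a convenient way to read off the home advantage implied by the thresholds: with equal TM values the home win probability is roughly 0.42 against roughly 0.32 for an away win.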


To identify the most likely outcome, we first find the largest of the three values predA, predD and predH, which we call Maxprob. We then create the model's prediction (logitpred) as the outcome whose probability equals Maxprob, and a variable (logittrue) equal to 1 if that prediction matches the actual result and 0 otherwise.

In [14]:
# Result prediction

season19_20['Maxprob'] =season19_20[['predA','predD','predH']].max(axis=1)
season19_20['logitpred']=np.where(season19_20['Maxprob']==season19_20['predA'],'A',\
                               np.where(season19_20['Maxprob']==season19_20['predD'],'D','H'))
season19_20['logittrue']= np.where(season19_20['logitpred'] == season19_20['FTR'],1,0)
season19_20

Unnamed: 0,date,Home team,away team,notplayed,month,day,year,FTHG,FTAG,FTR,B365H,B365D,B365A,TMhome,TMaway,538hpr,538dpr,538apr,B365res,lhTMratio,winvalue,predA,predD,predH,Maxprob,logitpred,logittrue
0,2019-10-08 00:00:00,AFC Bournemouth,Sheffield United,0,8.0,10.0,2019.0,1.0,1.0,D,1.95,3.6,3.6,281.7,62.33,0.55,0.21,0.24,H,1.5084,1,0.167224,0.206556,0.626221,0.626221,H,0
1,2019-10-08 00:00:00,Burnley,Southampton,0,8.0,10.0,2019.0,3.0,0.0,H,2.62,3.2,2.75,180.68,209.7,0.45,0.3,0.26,H,-0.14895,2,0.335116,0.264595,0.400289,0.400289,H,1
2,2019-10-08 00:00:00,Crystal Palace,Everton,0,8.0,10.0,2019.0,0.0,0.0,D,3.0,3.25,2.37,207.5,457.2,0.36,0.38,0.26,A,-0.78999,1,0.418441,0.26296,0.318599,0.418441,A,0
3,2019-10-08 00:00:00,Tottenham Hotspur,Aston Villa,0,8.0,10.0,2019.0,3.0,1.0,H,1.3,5.25,10.0,881.55,140.4,0.73,0.09,0.18,H,1.837186,2,0.143318,0.188803,0.667879,0.667879,H,1
4,2019-10-08 00:00:00,Watford,Brighton and Hove Albion,0,8.0,10.0,2019.0,0.0,3.0,A,1.9,3.4,4.0,214.52,180.99,0.51,0.24,0.25,H,0.169961,0,0.296876,0.259675,0.443449,0.443449,H,0
5,2019-10-08 00:00:00,West Ham United,Manchester City,0,8.0,10.0,2019.0,0.0,5.0,A,12.0,6.5,1.22,299.03,1140.0,0.1,0.74,0.16,A,-1.33824,0,0.493815,0.249764,0.256421,0.493815,A,1
6,2019-11-08 00:00:00,Leicester City,Wolverhampton Wanderers,0,8.0,11.0,2019.0,0.0,0.0,D,2.2,3.2,3.4,343.13,276.98,0.47,0.26,0.27,H,0.214164,1,0.291778,0.258707,0.449515,0.449515,H,0
7,2019-11-08 00:00:00,Manchester United,Chelsea,0,8.0,11.0,2019.0,4.0,0.0,H,2.1,3.3,3.5,644.63,697.5,0.37,0.38,0.25,H,-0.078826,2,0.326497,0.263831,0.409672,0.409672,H,1
8,2019-11-08 00:00:00,Newcastle United,Arsenal,0,8.0,11.0,2019.0,0.0,1.0,A,4.5,3.75,1.72,225.97,570.38,0.31,0.45,0.25,A,-0.925901,0,0.436911,0.260645,0.302444,0.436911,A,1
9,17/08/2019,Arsenal,Burnley,0,8.0,17.0,2019.0,2.0,1.0,H,1.3,5.5,10.0,570.38,180.68,0.64,0.16,0.21,H,1.149575,2,0.196837,0.224622,0.578541,0.578541,H,1
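An alternative way to pick the most likely outcome, sketched below, is pandas' `idxmax`, which returns the name of the column holding the row-wise maximum. The small `probs` frame copies three rows of probabilities from the table above; the approach is offered as an option rather than as the method used in this notebook.

```python
import pandas as pd

# Three rows of predicted probabilities copied from the table above
probs = pd.DataFrame({
    "predA": [0.167224, 0.418441, 0.335116],
    "predD": [0.206556, 0.262960, 0.264595],
    "predH": [0.626221, 0.318599, 0.400289],
})

# idxmax gives the column name of the largest value in each row, e.g. 'predH';
# the last character of that name is the outcome label.
probs["logitpred"] = probs[["predA", "predD", "predH"]].idxmax(axis=1).str[-1]
print(probs)
```

If two columns were exactly equal, `idxmax` would return the first of them, which matches the A, D, H priority of the nested np.where version.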


## Evaluating our forecasts

Having generated our forecasts, we want to compare their reliability with that of the betting odds and the forecasts of Nate Silver's 538. We first define a df consisting only of games played in 2020 (excluding the games from the calendar year 2019 and also the games which have not been played). These are rows 198 to 287 in the data (noting that the first row of data is labeled row 0 in Python). To select rows 198 to 287 you write [198:288], because the end of a slice is not included in the subset:

In [15]:
trunc19_20 = season19_20[198:288].copy()
trunc19_20.describe()

Unnamed: 0,notplayed,month,day,year,FTHG,FTAG,B365H,B365D,B365A,TMhome,TMaway,538hpr,538dpr,538apr,lhTMratio,winvalue,predA,predD,predH,Maxprob,logittrue
count,90.0,90.0,90.0,90.0,90.0,90.0,90.0,90.0,90.0,90.0,90.0,90.0,90.0,90.0,90.0,90.0,90.0,90.0,90.0,90.0,90.0
mean,0.0,1.744444,13.111111,2020.0,1.566667,0.988889,2.804444,4.314667,4.75589,403.882889,405.895,0.466,0.229778,0.304667,0.00973,1.244444,0.329074,0.243114,0.427812,0.495478,0.466667
std,0.0,0.966415,9.046005,0.0,1.218306,1.075722,2.054971,1.497823,3.844079,295.648269,312.390011,0.189741,0.047593,0.171111,1.118824,0.825317,0.128325,0.027735,0.140547,0.094113,0.501683
min,0.0,1.0,1.0,2020.0,0.0,0.0,1.1,3.1,1.16,62.33,62.33,0.1,0.09,0.03,-2.906341,0.0,0.092307,0.139811,0.126155,0.369142,0.0
25%,0.0,1.0,7.0,2020.0,1.0,0.0,1.61,3.425,2.39,207.5,187.6175,0.35,0.21,0.1625,-0.744887,1.0,0.232444,0.23492,0.324065,0.421253,0.0
50%,0.0,2.0,12.0,2020.0,1.0,1.0,2.1,3.8,3.5,281.7,279.34,0.46,0.25,0.29,0.024577,1.0,0.314014,0.253575,0.423634,0.476793,0.0
75%,0.0,2.0,21.0,2020.0,2.0,1.75,2.975,4.475,5.75,626.0675,626.0675,0.6025,0.26,0.3975,0.768672,2.0,0.412362,0.262633,0.526293,0.547577,1.0
max,0.0,8.0,29.0,2020.0,4.0,6.0,13.0,11.0,19.0,1140.0,1140.0,0.88,0.3,0.75,2.733636,2.0,0.699727,0.265804,0.767882,0.767882,1.0


As we did with our within-sample forecasts, we will compare both the success rate in predicting the outcome, and the Brier Score.

First the success rate of the forecasts:

In [16]:
# Model success rate
trunc19_20['logittrue'].mean()

0.4666666666666667

A success rate of 46.7% is slightly below the 50% success rate we found with the within-sample predictions, but our main interest is the contrast with the betting odds. We need to identify successes and failures for the bookmaker odds:

In [17]:
trunc19_20['B365true']= np.where(trunc19_20['B365res'] == trunc19_20['FTR'],1,0)
trunc19_20

Unnamed: 0,date,Home team,away team,notplayed,month,day,year,FTHG,FTAG,FTR,B365H,B365D,B365A,TMhome,TMaway,538hpr,538dpr,538apr,B365res,lhTMratio,winvalue,predA,predD,predH,Maxprob,logitpred,logittrue,B365true
198,2020-01-01 00:00:00,Arsenal,Manchester United,0,1.0,1.0,2020.0,2.0,0.0,H,2.55,3.6,2.62,570.38,644.63,0.35,0.26,0.4,H,-0.122374,2,0.331836,0.264327,0.403837,0.403837,H,1,1
199,2020-01-01 00:00:00,Brighton and Hove Albion,Chelsea,0,1.0,1.0,2020.0,1.0,1.0,D,3.6,3.6,1.95,180.99,697.5,0.22,0.23,0.55,A,-1.349061,1,0.495317,0.249406,0.255277,0.495317,A,0,0
200,2020-01-01 00:00:00,Burnley,Aston Villa,0,1.0,1.0,2020.0,1.0,2.0,A,1.75,3.8,4.33,180.68,140.4,0.5,0.25,0.25,H,0.252232,0,0.287429,0.25782,0.454751,0.454751,H,0,0
201,2020-01-01 00:00:00,Manchester City,Everton,0,1.0,1.0,2020.0,2.0,1.0,H,1.25,6.5,10.0,1140.0,457.2,0.79,0.14,0.07,H,0.913663,2,0.218371,0.235315,0.546314,0.546314,H,1,1
202,2020-01-01 00:00:00,Newcastle United,Leicester City,0,1.0,1.0,2020.0,0.0,3.0,A,5.0,3.8,1.66,225.97,343.13,0.24,0.25,0.51,A,-0.417707,0,0.369142,0.265804,0.365054,0.369142,A,1,1
203,2020-01-01 00:00:00,Norwich City,Crystal Palace,0,1.0,1.0,2020.0,1.0,1.0,D,2.5,3.4,2.75,81.54,207.5,0.37,0.28,0.35,H,-0.934038,1,0.438023,0.260485,0.301491,0.438023,A,0,0
204,2020-01-01 00:00:00,Southampton,Tottenham Hotspur,0,1.0,1.0,2020.0,1.0,0.0,H,3.3,3.5,2.1,209.7,881.55,0.27,0.23,0.49,A,-1.436004,2,0.507386,0.246406,0.246208,0.507386,A,0,0
205,2020-01-01 00:00:00,Watford,Wolverhampton Wanderers,0,1.0,1.0,2020.0,2.0,1.0,H,3.0,3.4,2.3,214.52,276.98,0.37,0.28,0.35,A,-0.255542,2,0.34843,0.265402,0.386168,0.386168,H,1,0
206,2020-01-01 00:00:00,West Ham United,AFC Bournemouth,0,1.0,1.0,2020.0,4.0,0.0,H,1.9,3.75,3.8,299.03,281.7,0.44,0.26,0.31,H,0.059701,2,0.309813,0.261792,0.428396,0.428396,H,1,1
207,2020-02-01 00:00:00,Liverpool,Sheffield United,0,1.0,2.0,2020.0,2.0,0.0,H,1.2,6.5,13.0,959.18,62.33,0.83,0.13,0.04,H,2.733636,2,0.092307,0.139811,0.767882,0.767882,H,1,1


Now we can calculate the bookmaker success rate:

In [18]:
trunc19_20['B365true'].mean()

0.4888888888888889

So the bookmaker forecasts are slightly better than our model forecasts - the difference is two games out of 90 correctly forecast. 

Now let's compare the Brier Scores. First, we define a variable for each possible outcome, with the value 1 if this was the actual outcome and zero otherwise:

In [19]:
# Outcome value for calculating Brier Score

trunc19_20['Houtcome']= np.where(trunc19_20['FTR']=='H',1,0)
trunc19_20['Doutcome']= np.where(trunc19_20['FTR']=='D',1,0)
trunc19_20['Aoutcome']= np.where(trunc19_20['FTR']=='A',1,0)
trunc19_20

Unnamed: 0,date,Home team,away team,notplayed,month,day,year,FTHG,FTAG,FTR,B365H,B365D,B365A,TMhome,TMaway,538hpr,538dpr,538apr,B365res,lhTMratio,winvalue,predA,predD,predH,Maxprob,logitpred,logittrue,B365true,Houtcome,Doutcome,Aoutcome
198,2020-01-01 00:00:00,Arsenal,Manchester United,0,1.0,1.0,2020.0,2.0,0.0,H,2.55,3.6,2.62,570.38,644.63,0.35,0.26,0.4,H,-0.122374,2,0.331836,0.264327,0.403837,0.403837,H,1,1,1,0,0
199,2020-01-01 00:00:00,Brighton and Hove Albion,Chelsea,0,1.0,1.0,2020.0,1.0,1.0,D,3.6,3.6,1.95,180.99,697.5,0.22,0.23,0.55,A,-1.349061,1,0.495317,0.249406,0.255277,0.495317,A,0,0,0,1,0
200,2020-01-01 00:00:00,Burnley,Aston Villa,0,1.0,1.0,2020.0,1.0,2.0,A,1.75,3.8,4.33,180.68,140.4,0.5,0.25,0.25,H,0.252232,0,0.287429,0.25782,0.454751,0.454751,H,0,0,0,0,1
201,2020-01-01 00:00:00,Manchester City,Everton,0,1.0,1.0,2020.0,2.0,1.0,H,1.25,6.5,10.0,1140.0,457.2,0.79,0.14,0.07,H,0.913663,2,0.218371,0.235315,0.546314,0.546314,H,1,1,1,0,0
202,2020-01-01 00:00:00,Newcastle United,Leicester City,0,1.0,1.0,2020.0,0.0,3.0,A,5.0,3.8,1.66,225.97,343.13,0.24,0.25,0.51,A,-0.417707,0,0.369142,0.265804,0.365054,0.369142,A,1,1,0,0,1
203,2020-01-01 00:00:00,Norwich City,Crystal Palace,0,1.0,1.0,2020.0,1.0,1.0,D,2.5,3.4,2.75,81.54,207.5,0.37,0.28,0.35,H,-0.934038,1,0.438023,0.260485,0.301491,0.438023,A,0,0,0,1,0
204,2020-01-01 00:00:00,Southampton,Tottenham Hotspur,0,1.0,1.0,2020.0,1.0,0.0,H,3.3,3.5,2.1,209.7,881.55,0.27,0.23,0.49,A,-1.436004,2,0.507386,0.246406,0.246208,0.507386,A,0,0,1,0,0
205,2020-01-01 00:00:00,Watford,Wolverhampton Wanderers,0,1.0,1.0,2020.0,2.0,1.0,H,3.0,3.4,2.3,214.52,276.98,0.37,0.28,0.35,A,-0.255542,2,0.34843,0.265402,0.386168,0.386168,H,1,0,1,0,0
206,2020-01-01 00:00:00,West Ham United,AFC Bournemouth,0,1.0,1.0,2020.0,4.0,0.0,H,1.9,3.75,3.8,299.03,281.7,0.44,0.26,0.31,H,0.059701,2,0.309813,0.261792,0.428396,0.428396,H,1,1,1,0,0
207,2020-02-01 00:00:00,Liverpool,Sheffield United,0,1.0,2.0,2020.0,2.0,0.0,H,1.2,6.5,13.0,959.18,62.33,0.83,0.13,0.04,H,2.733636,2,0.092307,0.139811,0.767882,0.767882,H,1,1,1,0,0


Now we derive the bookmaker probabilities from the betting odds. As a first approximation, the outcome probability equals 1/(decimal odds). However, if you make this calculation and sum the three possibilities, the total is greater than one. The excess is called the 'overround', or the 'vig', and represents the bookmaker's margin. To calculate the implied probability from the betting odds you have to divide each value by the sum of the three, so that your final probabilities add up to 1 (100%).

We calculate these probabilities below:

In [20]:
trunc19_20['B365HPr']= 1/(trunc19_20['B365H'])/(1/(trunc19_20['B365H'])+ 1/(trunc19_20['B365D'])+ 1/(trunc19_20['B365A']))
trunc19_20['B365DPr']= 1/(trunc19_20['B365D'])/(1/(trunc19_20['B365H'])+ 1/(trunc19_20['B365D'])+ 1/(trunc19_20['B365A']))
trunc19_20['B365APr']= 1/(trunc19_20['B365A'])/(1/(trunc19_20['B365H'])+ 1/(trunc19_20['B365D'])+ 1/(trunc19_20['B365A']))
trunc19_20

Unnamed: 0,date,Home team,away team,notplayed,month,day,year,FTHG,FTAG,FTR,B365H,B365D,B365A,TMhome,TMaway,538hpr,538dpr,538apr,B365res,lhTMratio,winvalue,predA,predD,predH,Maxprob,logitpred,logittrue,B365true,Houtcome,Doutcome,Aoutcome,B365HPr,B365DPr,B365APr
198,2020-01-01 00:00:00,Arsenal,Manchester United,0,1.0,1.0,2020.0,2.0,0.0,H,2.55,3.6,2.62,570.38,644.63,0.35,0.26,0.4,H,-0.122374,2,0.331836,0.264327,0.403837,0.403837,H,1,1,1,0,0,0.37291,0.264144,0.362946
199,2020-01-01 00:00:00,Brighton and Hove Albion,Chelsea,0,1.0,1.0,2020.0,1.0,1.0,D,3.6,3.6,1.95,180.99,697.5,0.22,0.23,0.55,A,-1.349061,1,0.495317,0.249406,0.255277,0.495317,A,0,0,0,1,0,0.26,0.26,0.48
200,2020-01-01 00:00:00,Burnley,Aston Villa,0,1.0,1.0,2020.0,1.0,2.0,A,1.75,3.8,4.33,180.68,140.4,0.5,0.25,0.25,H,0.252232,0,0.287429,0.25782,0.454751,0.454751,H,0,0,0,0,1,0.536284,0.246973,0.216743
201,2020-01-01 00:00:00,Manchester City,Everton,0,1.0,1.0,2020.0,2.0,1.0,H,1.25,6.5,10.0,1140.0,457.2,0.79,0.14,0.07,H,0.913663,2,0.218371,0.235315,0.546314,0.546314,H,1,1,1,0,0,0.759124,0.145985,0.094891
202,2020-01-01 00:00:00,Newcastle United,Leicester City,0,1.0,1.0,2020.0,0.0,3.0,A,5.0,3.8,1.66,225.97,343.13,0.24,0.25,0.51,A,-0.417707,0,0.369142,0.265804,0.365054,0.369142,A,1,1,0,0,1,0.187693,0.246965,0.565342
203,2020-01-01 00:00:00,Norwich City,Crystal Palace,0,1.0,1.0,2020.0,1.0,1.0,D,2.5,3.4,2.75,81.54,207.5,0.37,0.28,0.35,H,-0.934038,1,0.438023,0.260485,0.301491,0.438023,A,0,0,0,1,0,0.37816,0.278059,0.343782
204,2020-01-01 00:00:00,Southampton,Tottenham Hotspur,0,1.0,1.0,2020.0,1.0,0.0,H,3.3,3.5,2.1,209.7,881.55,0.27,0.23,0.49,A,-1.436004,2,0.507386,0.246406,0.246208,0.507386,A,0,0,1,0,0,0.284553,0.268293,0.447154
205,2020-01-01 00:00:00,Watford,Wolverhampton Wanderers,0,1.0,1.0,2020.0,2.0,1.0,H,3.0,3.4,2.3,214.52,276.98,0.37,0.28,0.35,A,-0.255542,2,0.34843,0.265402,0.386168,0.386168,H,1,0,1,0,0,0.313804,0.276886,0.40931
206,2020-01-01 00:00:00,West Ham United,AFC Bournemouth,0,1.0,1.0,2020.0,4.0,0.0,H,1.9,3.75,3.8,299.03,281.7,0.44,0.26,0.31,H,0.059701,2,0.309813,0.261792,0.428396,0.428396,H,1,1,1,0,0,0.498339,0.252492,0.249169
207,2020-02-01 00:00:00,Liverpool,Sheffield United,0,1.0,2.0,2020.0,2.0,0.0,H,1.2,6.5,13.0,959.18,62.33,0.83,0.13,0.04,H,2.733636,2,0.092307,0.139811,0.767882,0.767882,H,1,1,1,0,0,0.783133,0.144578,0.072289
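To make the overround concrete, the sketch below works through a single game, Arsenal vs Manchester United (row 198), using the odds shown above. The variable names are illustrative. Dividing by the sum is the simple proportional adjustment used in this notebook; other ways of removing the margin exist and could give slightly different probabilities.

```python
# Decimal odds for Arsenal vs Manchester United (row 198 above)
odds = {"H": 2.55, "D": 3.60, "A": 2.62}

raw = {k: 1 / v for k, v in odds.items()}     # naive probabilities, sum exceeds 1
booksum = sum(raw.values())
overround = booksum - 1                        # the bookmaker's margin
implied = {k: v / booksum for k, v in raw.items()}

print(f"sum of raw probabilities = {booksum:.4f}, overround = {overround:.4f}")
print({k: round(v, 4) for k, v in implied.items()})
```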


Now we can calculate and compare Brier Scores:

In [21]:
# Model Brier score (lower is better)

Brierlogit = ((trunc19_20['predH'] - trunc19_20['Houtcome'])**2 +(trunc19_20['predD'] - trunc19_20['Doutcome'])**2 +\
             (trunc19_20['predA'] - trunc19_20['Aoutcome'])**2).sum()/89
Brierlogit

0.6199325184748382

In [22]:
# Bookie Brier score (lower is better)

BrierB365 = ((trunc19_20['B365HPr'] - trunc19_20['Houtcome'])**2 +(trunc19_20['B365DPr'] - trunc19_20['Doutcome'])**2 +\
             (trunc19_20['B365APr'] - trunc19_20['Aoutcome'])**2).sum()/89
BrierB365

0.5872181372691697

Once again, we find that the bookmaker Brier Score is slightly lower (better) than the Brier Score of our model, but the gap is even smaller than we found using the within-sample model.
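
The two cells above repeat the same formula with different column names. A minimal sketch of a reusable helper follows; the function name `brier_score` and the toy DataFrame are assumptions made for illustration and are not part of the original notebook. The helper averages over the rows actually passed in, so the divisor does not need to be typed by hand (the cells above divide by 89, while the crosstabs further down count 90 games).

```python
import pandas as pd

def brier_score(df, prob_cols, outcome_cols):
    """Mean multi-category Brier score: average over games of the summed squared errors."""
    sq_err = sum((df[p] - df[o]) ** 2 for p, o in zip(prob_cols, outcome_cols))
    return sq_err.mean()

# Toy data (not the real sample): two games, the first a home win, the second a draw
toy = pd.DataFrame({
    "predH": [0.5, 0.25], "predD": [0.25, 0.5], "predA": [0.25, 0.25],
    "Houtcome": [1, 0], "Doutcome": [0, 1], "Aoutcome": [0, 0],
})

toy_brier = brier_score(
    toy,
    ["predH", "predD", "predA"],
    ["Houtcome", "Doutcome", "Aoutcome"],
)
print(toy_brier)
```

With the notebook's data, the same call would take `trunc19_20` and either the `pred*` or the `B365*Pr` columns.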

A useful comparison is to generate the crosstabs against the actual result of the bookmaker forecasts and the model forecasts:

In [23]:
pd.crosstab(trunc19_20['FTR'], trunc19_20['B365res'],dropna= True)

FTR \ B365res,A,H
A,11,11
D,7,17
H,11,33


In [24]:
pd.crosstab(trunc19_20['FTR'], trunc19_20['logitpred'],dropna= True)

FTR \ logitpred,A,H
A,9,13
D,10,14
H,11,33


First, note that there were 24 draws out of the 90 games played, and neither bookmaker nor model forecast any draws as the most likely outcome. It seems therefore pointless to compare the middle row of each crosstab. 

Looking at the first row of each crosstab, the bookmaker (B365res) correctly identified 50% of the away wins (11 of the 22), while our model fared less well, getting only 9 of these 22 cases right (about 41%).

Looking at the third row of each crosstab, the bookmaker (B365res) correctly identified 75% of the home wins (33 of the 44), and our model had exactly the same success rate.

So we can see that both bookmaker and model are good at forecasting home wins, so-so when it comes to away wins, and useless for predicting draws.
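
The row-wise success rates quoted above can be read directly from a crosstab normalised by row. The sketch below rebuilds the bookmaker crosstab from the counts shown above instead of using the real `trunc19_20` data, so the `counts` list and the `games` frame are illustrative stand-ins. It passes `normalize="index"` so that each row sums to one.

```python
import pandas as pd

# Counts copied from the bookmaker crosstab above: (actual result, forecast, number of games)
counts = [("A", "A", 11), ("A", "H", 11),
          ("D", "A", 7), ("D", "H", 17),
          ("H", "A", 11), ("H", "H", 33)]

# Expand the counts into one row per game
rows = [(actual, forecast) for actual, forecast, n in counts for _ in range(n)]
games = pd.DataFrame(rows, columns=["FTR", "B365res"])

# Share of each actual result (row) that received each forecast (column)
shares = pd.crosstab(games["FTR"], games["B365res"], normalize="index")
print(shares.round(2))

away_hit_rate = shares.loc["A", "A"]  # 11 of 22 away wins forecast correctly
home_hit_rate = shares.loc["H", "H"]  # 33 of 44 home wins forecast correctly
```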

**FiveThirtyEight** is a website devoted to statistical analysis, mostly of political and sports data. It was founded by Nate Silver, who built his reputation on accurate forecasting of US election results based on published opinion poll data (538 is the number of electoral college votes in the US Presidential election; the winner is the candidate who obtains a majority of these votes, not of the popular vote). He also wrote a hugely successful book, The Signal and the Noise, which explores practical applications of statistical prediction, and why humans are notoriously bad at interpreting data to make forecasts (the subtitle of the book is "Why So Many Predictions Fail - but Some Don't").

The FiveThirtyEight "Club Soccer Predictions" started in 2017 and now generates game-by-game predictions for 36 leagues, in the same format that we have followed here: the probability of a home win, a draw, and an away win. These probabilities are typically published a week or two ahead of the games being played. The website contains an explanation of how these predictions are generated: https://fivethirtyeight.com/methodology/how-our-club-soccer-predictions-work/. 

The FiveThirtyEight model is a good deal more complicated than our very simple model: the references in the article cite harmonic means, Massey's method, the Monte Carlo method, the Poisson process and ranked probability scores. Like our model, it also relies on TransferMarkt data as a measure of team quality. However, the exact details of the model are not published in a way that would allow anyone to replicate its results. The replicability problem is widespread in the world of statistical analysis, and often makes it difficult to judge the reliability of a model. Here, at least, we have the published forecasts, so we can test the reliability of the model's results.

So let's examine how the FiveThirtyEight model compares with the bookmaker and our simple model. We already have the FiveThirtyEight outcome probabilities in our DataFrame, so we just need to generate the most likely outcome and the Brier Score from the data.

In [25]:
# Most likely outcome

trunc19_20['538res']= np.where((trunc19_20['538hpr']>trunc19_20['538dpr']) & (trunc19_20['538hpr']>trunc19_20['538apr']),'H',\
                            np.where((trunc19_20['538dpr']>trunc19_20['538hpr']) & (trunc19_20['538dpr']>trunc19_20['538apr']),'D',\
                                    np.where((trunc19_20['538apr']>trunc19_20['538hpr']) & (trunc19_20['538apr']>trunc19_20['538dpr']),'A',"")))
trunc19_20

Unnamed: 0,date,Home team,away team,notplayed,month,day,year,FTHG,FTAG,FTR,B365H,B365D,B365A,TMhome,TMaway,538hpr,538dpr,538apr,B365res,lhTMratio,winvalue,predA,predD,predH,Maxprob,logitpred,logittrue,B365true,Houtcome,Doutcome,Aoutcome,B365HPr,B365DPr,B365APr,538res
198,2020-01-01 00:00:00,Arsenal,Manchester United,0,1.0,1.0,2020.0,2.0,0.0,H,2.55,3.6,2.62,570.38,644.63,0.35,0.26,0.4,H,-0.122374,2,0.331836,0.264327,0.403837,0.403837,H,1,1,1,0,0,0.37291,0.264144,0.362946,A
199,2020-01-01 00:00:00,Brighton and Hove Albion,Chelsea,0,1.0,1.0,2020.0,1.0,1.0,D,3.6,3.6,1.95,180.99,697.5,0.22,0.23,0.55,A,-1.349061,1,0.495317,0.249406,0.255277,0.495317,A,0,0,0,1,0,0.26,0.26,0.48,A
200,2020-01-01 00:00:00,Burnley,Aston Villa,0,1.0,1.0,2020.0,1.0,2.0,A,1.75,3.8,4.33,180.68,140.4,0.5,0.25,0.25,H,0.252232,0,0.287429,0.25782,0.454751,0.454751,H,0,0,0,0,1,0.536284,0.246973,0.216743,H
201,2020-01-01 00:00:00,Manchester City,Everton,0,1.0,1.0,2020.0,2.0,1.0,H,1.25,6.5,10.0,1140.0,457.2,0.79,0.14,0.07,H,0.913663,2,0.218371,0.235315,0.546314,0.546314,H,1,1,1,0,0,0.759124,0.145985,0.094891,H
202,2020-01-01 00:00:00,Newcastle United,Leicester City,0,1.0,1.0,2020.0,0.0,3.0,A,5.0,3.8,1.66,225.97,343.13,0.24,0.25,0.51,A,-0.417707,0,0.369142,0.265804,0.365054,0.369142,A,1,1,0,0,1,0.187693,0.246965,0.565342,A
203,2020-01-01 00:00:00,Norwich City,Crystal Palace,0,1.0,1.0,2020.0,1.0,1.0,D,2.5,3.4,2.75,81.54,207.5,0.37,0.28,0.35,H,-0.934038,1,0.438023,0.260485,0.301491,0.438023,A,0,0,0,1,0,0.37816,0.278059,0.343782,H
204,2020-01-01 00:00:00,Southampton,Tottenham Hotspur,0,1.0,1.0,2020.0,1.0,0.0,H,3.3,3.5,2.1,209.7,881.55,0.27,0.23,0.49,A,-1.436004,2,0.507386,0.246406,0.246208,0.507386,A,0,0,1,0,0,0.284553,0.268293,0.447154,A
205,2020-01-01 00:00:00,Watford,Wolverhampton Wanderers,0,1.0,1.0,2020.0,2.0,1.0,H,3.0,3.4,2.3,214.52,276.98,0.37,0.28,0.35,A,-0.255542,2,0.34843,0.265402,0.386168,0.386168,H,1,0,1,0,0,0.313804,0.276886,0.40931,H
206,2020-01-01 00:00:00,West Ham United,AFC Bournemouth,0,1.0,1.0,2020.0,4.0,0.0,H,1.9,3.75,3.8,299.03,281.7,0.44,0.26,0.31,H,0.059701,2,0.309813,0.261792,0.428396,0.428396,H,1,1,1,0,0,0.498339,0.252492,0.249169,H
207,2020-02-01 00:00:00,Liverpool,Sheffield United,0,1.0,2.0,2020.0,2.0,0.0,H,1.2,6.5,13.0,959.18,62.33,0.83,0.13,0.04,H,2.733636,2,0.092307,0.139811,0.767882,0.767882,H,1,1,1,0,0,0.783133,0.144578,0.072289,H
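
The nested `np.where` call above can also be written with `idxmax`, which returns the name of the column holding the largest value in each row. The following is a sketch on three rows of probabilities copied from the output above; the `labels` mapping and the `toy` frame are illustrative names, not part of the original notebook. One difference to be aware of: when two probabilities tie for the maximum, `idxmax` returns the first of the tied columns, whereas the `np.where` version returns an empty string.

```python
import pandas as pd

# FiveThirtyEight probabilities for the first three games in the table above
toy = pd.DataFrame({
    "538hpr": [0.35, 0.22, 0.50],
    "538dpr": [0.26, 0.23, 0.25],
    "538apr": [0.40, 0.55, 0.25],
})

# Map each probability column to its outcome label
labels = {"538hpr": "H", "538dpr": "D", "538apr": "A"}

# idxmax(axis=1) gives the column with the largest probability in each row
toy["538res"] = toy[list(labels)].idxmax(axis=1).map(labels)
print(toy)
```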


In [26]:
# now identify the forecast successes of FiveThirtyEight

trunc19_20['538true']= np.where(trunc19_20['538res'] == trunc19_20['FTR'],1,0)
trunc19_20

Unnamed: 0,date,Home team,away team,notplayed,month,day,year,FTHG,FTAG,FTR,B365H,B365D,B365A,TMhome,TMaway,538hpr,538dpr,538apr,B365res,lhTMratio,winvalue,predA,predD,predH,Maxprob,logitpred,logittrue,B365true,Houtcome,Doutcome,Aoutcome,B365HPr,B365DPr,B365APr,538res,538true
198,2020-01-01 00:00:00,Arsenal,Manchester United,0,1.0,1.0,2020.0,2.0,0.0,H,2.55,3.6,2.62,570.38,644.63,0.35,0.26,0.4,H,-0.122374,2,0.331836,0.264327,0.403837,0.403837,H,1,1,1,0,0,0.37291,0.264144,0.362946,A,0
199,2020-01-01 00:00:00,Brighton and Hove Albion,Chelsea,0,1.0,1.0,2020.0,1.0,1.0,D,3.6,3.6,1.95,180.99,697.5,0.22,0.23,0.55,A,-1.349061,1,0.495317,0.249406,0.255277,0.495317,A,0,0,0,1,0,0.26,0.26,0.48,A,0
200,2020-01-01 00:00:00,Burnley,Aston Villa,0,1.0,1.0,2020.0,1.0,2.0,A,1.75,3.8,4.33,180.68,140.4,0.5,0.25,0.25,H,0.252232,0,0.287429,0.25782,0.454751,0.454751,H,0,0,0,0,1,0.536284,0.246973,0.216743,H,0
201,2020-01-01 00:00:00,Manchester City,Everton,0,1.0,1.0,2020.0,2.0,1.0,H,1.25,6.5,10.0,1140.0,457.2,0.79,0.14,0.07,H,0.913663,2,0.218371,0.235315,0.546314,0.546314,H,1,1,1,0,0,0.759124,0.145985,0.094891,H,1
202,2020-01-01 00:00:00,Newcastle United,Leicester City,0,1.0,1.0,2020.0,0.0,3.0,A,5.0,3.8,1.66,225.97,343.13,0.24,0.25,0.51,A,-0.417707,0,0.369142,0.265804,0.365054,0.369142,A,1,1,0,0,1,0.187693,0.246965,0.565342,A,1
203,2020-01-01 00:00:00,Norwich City,Crystal Palace,0,1.0,1.0,2020.0,1.0,1.0,D,2.5,3.4,2.75,81.54,207.5,0.37,0.28,0.35,H,-0.934038,1,0.438023,0.260485,0.301491,0.438023,A,0,0,0,1,0,0.37816,0.278059,0.343782,H,0
204,2020-01-01 00:00:00,Southampton,Tottenham Hotspur,0,1.0,1.0,2020.0,1.0,0.0,H,3.3,3.5,2.1,209.7,881.55,0.27,0.23,0.49,A,-1.436004,2,0.507386,0.246406,0.246208,0.507386,A,0,0,1,0,0,0.284553,0.268293,0.447154,A,0
205,2020-01-01 00:00:00,Watford,Wolverhampton Wanderers,0,1.0,1.0,2020.0,2.0,1.0,H,3.0,3.4,2.3,214.52,276.98,0.37,0.28,0.35,A,-0.255542,2,0.34843,0.265402,0.386168,0.386168,H,1,0,1,0,0,0.313804,0.276886,0.40931,H,1
206,2020-01-01 00:00:00,West Ham United,AFC Bournemouth,0,1.0,1.0,2020.0,4.0,0.0,H,1.9,3.75,3.8,299.03,281.7,0.44,0.26,0.31,H,0.059701,2,0.309813,0.261792,0.428396,0.428396,H,1,1,1,0,0,0.498339,0.252492,0.249169,H,1
207,2020-02-01 00:00:00,Liverpool,Sheffield United,0,1.0,2.0,2020.0,2.0,0.0,H,1.2,6.5,13.0,959.18,62.33,0.83,0.13,0.04,H,2.733636,2,0.092307,0.139811,0.767882,0.767882,H,1,1,1,0,0,0.783133,0.144578,0.072289,H,1


In [27]:
# 538 model success rate

trunc19_20['538true'].mean()

0.4888888888888889

So FiveThirtyEight had a forecast success rate (44 of the 90 games) that was identical to the bookmaker's and slightly better than that of our much simpler model. Now, let's calculate the FiveThirtyEight Brier Score:

In [28]:
# 538 Brier score (lower is better)

Brier538 = ((trunc19_20['538hpr'] - trunc19_20['Houtcome'])**2 +(trunc19_20['538dpr'] - trunc19_20['Doutcome'])**2 +\
             (trunc19_20['538apr'] - trunc19_20['Aoutcome'])**2).sum()/89
Brier538

0.5926

Recall that the Brier Score for the bookmaker was 0.587 and for our simple model it was 0.620. Thus, the FiveThirtyEight Brier Score was marginally worse than the bookmaker's and slightly better than that of our simple model.

Now let's compare all of the crosstabs:

In [29]:
pd.crosstab(trunc19_20['FTR'], trunc19_20['538res'],margins = True,dropna= True)

FTR \ 538res,A,H,All
A,9,13,22
D,9,15,24
H,9,35,44
All,27,63,90


In [30]:
pd.crosstab(trunc19_20['FTR'], trunc19_20['B365res'],margins = True,dropna= True)

FTR \ B365res,A,H,All
A,11,11,22
D,7,17,24
H,11,33,44
All,29,61,90


In [31]:
pd.crosstab(trunc19_20['FTR'], trunc19_20['logitpred'],margins = True,dropna= True)

FTR \ logitpred,A,H,All
A,9,13,22
D,10,14,24
H,11,33,44
All,30,60,90


Like the bookmaker and our model, FiveThirtyEight never gives the highest probability to a draw.

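In terms of away wins, the FiveThirtyEight model's performance is the same as that of our model, getting fewer than half of the results correct (9 out of 22). However, when it comes to home wins, FiveThirtyEight performed slightly better than the bookmaker, correctly identifying 35 of the 44 home wins against the bookmaker's 33. It also forecast a home win more often (63 of the 90 games, against the bookmaker's 61).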

What is striking, of course, is that all three sets of predictions look rather similar. Indeed, we can confirm this by counting the number of highest probability forecasts that were the same:

In [32]:
# Share of highest probability forecasts that were the same for the bookmaker and our model

same_B365_logit = np.where(trunc19_20['B365res'] == trunc19_20['logitpred'],1,0).sum()/90
same_B365_logit

0.8555555555555555

In [33]:
# Share of highest probability forecasts that were the same for the bookmaker and the FiveThirtyEight model

same_B365_538 = np.where(trunc19_20['B365res'] == trunc19_20['538res'],1,0).sum()/90
same_B365_538

0.8888888888888888

Overall, the three sources produce very similar forecasts.
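
The two agreement figures above compare the bookmaker with each model in turn. A sketch of the same calculation for every pair of sources follows, using only the ten games displayed in the table output earlier and not the full sample; the `toy` frame and the `agreement` dictionary are illustrative names.

```python
from itertools import combinations

import pandas as pd

# Highest-probability forecasts for the ten games displayed in the table above
toy = pd.DataFrame({
    "B365res":   list("HAHHAHAAHH"),
    "logitpred": list("HAHHAAAHHH"),
    "538res":    list("AAHHAHAHHH"),
})

# Share of games on which each pair of sources picks the same outcome
agreement = {
    (a, b): (toy[a] == toy[b]).mean()
    for a, b in combinations(toy.columns, 2)
}

for pair, share in agreement.items():
    print(pair, round(share, 2))
```

Applied to `trunc19_20`, the same comparison would also cover the pair not computed above, our model against FiveThirtyEight.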

## Self Test

The similarities can be confirmed by looking at the correlation between the Brier scores for each set of forecasts. Produce a correlation matrix for the Brier Scores.

In [None]:
#Your Code Here

## Conclusions

In this notebook, we have moved from within-sample modeling to out-of-sample modeling which can be used to generate forecasts. We found that our simple model based only on the TM value ratio and home advantage performed very close to the odds of the bookmakers, and to a much more complex model produced by FiveThirtyEight.

If you didn't want to go to all the trouble of generating the statistical models, an even simpler rule of thumb would work well: the team most likely to win is the team with the higher TM value. The bigger the gap, the more likely it is that the higher TM value team wins. This can be offset to some degree by home advantage; the exact extent could be judged by trial and error. Such a method would also have the advantage of giving some guidance on draws, which are more likely when the TM values are close. 
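
The rule of thumb can be sketched in a few lines. The function below is only an illustration: the name `rule_of_thumb`, the home-advantage multiplier `home_boost` and the draw band `draw_band` are hypothetical parameters with placeholder values, which would need to be tuned by the trial and error described above.

```python
import math

def rule_of_thumb(tm_home, tm_away, home_boost=1.3, draw_band=0.15):
    """Pick H, D or A from TM values alone.

    home_boost and draw_band are placeholder values, not estimates from the data.
    """
    # Log ratio of TM values after scaling up the home side to reflect home advantage
    log_ratio = math.log(home_boost * tm_home / tm_away)
    if abs(log_ratio) <= draw_band:
        return "D"  # values are close once home advantage is allowed for
    return "H" if log_ratio > 0 else "A"

# TM values taken from rows of the table shown earlier
city_everton = rule_of_thumb(1140.0, 457.2)      # Manchester City v Everton
brighton_chelsea = rule_of_thumb(180.99, 697.5)  # Brighton and Hove Albion v Chelsea
arsenal_united = rule_of_thumb(570.38, 644.63)   # Arsenal v Manchester United
print(city_everton, brighton_chelsea, arsenal_united)
```

Working with the log of the ratio keeps the rule symmetric: a team worth twice as much as its opponent is treated the same whether it plays at home or away, apart from the home-advantage adjustment.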

Why do the bookmakers and the models fail to predict draws? One answer is that the models tend to favour outcomes that occur more frequently, and although there are many draws in soccer, a draw is much less frequent than a home win, and usually less frequent than an away win. One might imagine that a model such as FiveThirtyEight's, which actually forecasts goals scored by each team rather than winning and losing, would be more likely to predict draws, but as we have seen, it isn't.

The simple model described here will extend quite easily to most other football leagues, since the key to prediction is the variation in TM values, which is quite large in most leagues. This reflects inequalities in the purchasing power of teams. The model can therefore also be extended to leagues in other sports where inequalities in spending power are great; it works fairly well for Major League Baseball, for example. However, it will not work well for the NFL, since its hard salary cap ensures limited variation in spending between teams, and therefore limited ability to predict based on this difference.

Our model also generates forecasts for the games not played (to date). In the next session we will use our model forecasts to estimate how many points each team would have won from the unplayed games, and then produce a league table based on these outcomes, which can be compared to the league table at the time league play was suspended.

To do this we need to save the season19_20 DataFrame, which contains our forecasts for the full season:

In [34]:
season19_20.to_excel("../../Data/Week 3/forecasts19_20.xlsx")