# AFL Match Outcome Prediction

The purpose of the current set of notebooks is to explore the historical data and modelling relevant to predicting the outcomes of given AFL matches. In particular, if we know an upcoming match is between team A and team B, then we wish to estimate the probabilities that team A wins (and team B loses), team B wins (and team A loses), or the match is a draw.

## Background

### Australian Rules Football

Australian Rules Football is a game played between two opposing teams, say team A and team B. Each match takes place over four quarters, with each quarter having approximately 30 minutes duration. During each quarter, team A attempts to kick or hand-pass the football towards their scoring end of the oval, and team B attempts to send the ball to the opposite end of the oval. The scoring ends swap after each quarter.

The scoring area consists of four (almost) equally spaced posts, forming the "goal" area between the inner two posts (the goal posts) and two "behind" areas between the outer two posts (the behind posts) and the inner two posts. If the football is kicked between the goal posts (without touching either post), then the scoring team scores a goal worth 6 points.
If the ball is kicked between a behind post and a goal post (without touching), then the scoring team scores a behind worth 1 point. If the ball touches or hits a goal post, then this scores as a behind. If the ball touches or hits a behind post, then this results in an "out-of-bounds" free (penalty kick-out) for the opposing team, with no change in score. If the ball is touched by hand (by either team) before going in for a goal, then this is also regarded as a behind. There are further rules regarding whether team B 'scoring' in team A's area results in a behind or a free for team A.

At the end of the fourth quarter, i.e. at the end of the match, if team A and team B have the same point scores, then the match is a draw (worth 2 match points to each team). Otherwise, the team with the most points wins (gaining 4 match points), and the opposing team loses (gaining 0 match points). We shall sometimes refer to a non-drawn game as having an *outright* or *definite* result.

Each week during the football season, known as a "round", pairs of teams (within the same league, obviously) play matches. At the end of each round, all teams are ranked on the league ladder by match points. Ties are broken based on the accumulated number of points scored for (i.e. by) and against each team. At the end of the so-called "minor" rounds, the teams in the top-half of the ladder go into the "finals" rounds. Ultimately, only two teams oppose each other in the grand final match of the season, with the winning team becoming the season champions.
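
As a rough illustration of the scoring and ladder rules described above, the following sketch converts goals and behinds into points, awards match points, and ranks the teams. The function names and the toy results are hypothetical. The tie-break used here (the ratio of points for to points against) is only one assumed reading of "points scored for and against".

```python
from collections import defaultdict

GOAL_POINTS = 6    # a goal is worth 6 points
BEHIND_POINTS = 1  # a behind is worth 1 point


def score(goals, behinds):
    """Total points from a tally of goals and behinds."""
    return GOAL_POINTS * goals + BEHIND_POINTS * behinds


def match_points(points_for, points_against):
    """Match points earned: 4 for a win, 2 for a draw, 0 for a loss."""
    if points_for > points_against:
        return 4
    if points_for == points_against:
        return 2
    return 0


def ladder(matches):
    """Rank teams by match points, then by for/against ratio (assumed tie-break).

    `matches` is an iterable of (team_a, score_a, team_b, score_b) tuples.
    """
    table = defaultdict(lambda: {"match_points": 0, "for": 0, "against": 0})
    for team_a, score_a, team_b, score_b in matches:
        for team, pf, pa in ((team_a, score_a, score_b), (team_b, score_b, score_a)):
            row = table[team]
            row["match_points"] += match_points(pf, pa)
            row["for"] += pf
            row["against"] += pa

    def sort_key(item):
        _, row = item
        ratio = row["for"] / row["against"] if row["against"] else float("inf")
        return (-row["match_points"], -ratio)

    return [team for team, _ in sorted(table.items(), key=sort_key)]


# Toy round with made-up tallies of (goals, behinds).
toy_matches = [
    ("A", score(12, 10), "B", score(11, 9)),  # 82 v 75: A wins
    ("C", score(10, 10), "D", score(11, 4)),  # 70 v 70: a draw
]
toy_ladder = ladder(toy_matches)
```

In this toy round, team A heads the ladder with 4 match points, teams C and D share 2 match points each, and team B is last with 0.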

### AFL Teams

The Australian Football League (AFL) was officially formed in 1990, largely as a nationalised rebranding of the Victorian Football League (VFL).

Between the seasons of 1996 and 1997, there were many changes including the formation of new teams, the merger of existing teams, and the renaming of existing teams. Hence, for convenience, we shall typically consider AFL data from 1997 onwards. In particular, we shall focus on the current 18 teams, listed below.

<table>
    <tr><th>Team name</th><th>Comments</th></tr>
    <tr><td>Adelaide Crows</td><td>Formed in 1991</td></tr>
    <tr><td>Brisbane Lions</td><td>Formed in 1997 by merger of Brisbane Bears with Fitzroy</td></tr>
    <tr><td>Carlton Blues</td><td></td></tr>
    <tr><td>Collingwood Magpies</td><td></td></tr>
    <tr><td>Essendon Bombers</td><td></td></tr>
    <tr><td>Fremantle Dockers</td><td>Formed in 1995</td></tr>
    <tr><td>Geelong Cats</td><td></td></tr>
    <tr><td>Gold Coast Suns</td><td>Formed in 2011</td></tr>
    <tr><td>Greater Western Sydney (GWS) Giants</td><td>Formed in 2012</td></tr>
    <tr><td>Hawthorn Hawks</td><td></td></tr>
    <tr><td>Melbourne Demons</td><td></td></tr>
    <tr><td>North Melbourne Kangaroos</td><td>Named Kangaroos from 1999-2007</td></tr>
    <tr><td>Port (Adelaide) Power</td><td>Formed in 1997</td></tr>
    <tr><td>Richmond Tigers</td><td></td></tr>
    <tr><td>St Kilda Saints</td><td></td></tr>
    <tr><td>Sydney Swans</td><td>Formed in 1982 from South Melbourne</td></tr>
    <tr><td>West Coast Eagles</td><td>Formed in 1987</td></tr>
    <tr><td>Western Bulldogs</td><td>Renamed in 1997 from Footscray</td></tr>
</table>

## Match Data

The data files used in this project were formed by saving the entire match data (over all seasons) separately for each team, found as web pages on [AFL Tables](https://afltables.com/afl/afl_index.html "afltables.com").

Note that these data files contain both AFL matches and VFL matches (prior to the official formation of the AFL in 1990). We ignore the VFL data.

Also note that the AFL data contain more team names than our list of the current 18 teams (above). In particular, the data file for North Melbourne contains matches for the team listed both under North Melbourne and under Kangaroos. For simplicity, teams that have been renamed over time (i.e. Kangaroos and Footscray) will be mapped to their current names.

However, teams that resulted from the merger of previous teams remain
problematic. For example, the Brisbane Lions first appeared in 1997,
formed from the merger of Fitzroy and the Brisbane Bears. We cannot simply rename Fitzroy and the Brisbane Bears as the Brisbane Lions, 
because then prior to 1997 the Brisbane Lions would appear in two different places in the league rankings, and would apparently play matches against themselves. To avoid this awkwardness, we simply ignore matches prior to 1997 for convenience.
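
The two clean-up steps just described (mapping renamed teams to their current names, and ignoring matches prior to 1997) might be sketched as follows. The record layout, the key names, and the exact team-name strings are assumptions made for illustration, and need not match the actual data files.

```python
# Hypothetical mapping from historical names (as they might appear in the
# data files) to current team names; the exact strings are assumptions.
RENAMED_TEAMS = {
    "Kangaroos": "North Melbourne",
    "Footscray": "Western Bulldogs",
}
FIRST_SEASON = 1997  # matches before this season are ignored


def normalise_team(name):
    """Map a renamed team to its current name; leave other names unchanged."""
    return RENAMED_TEAMS.get(name, name)


def clean_matches(matches):
    """Drop pre-1997 matches and normalise the team names.

    `matches` is an iterable of dicts with the (assumed) keys
    'season', 'for_team' and 'against_team'.
    """
    cleaned = []
    for match in matches:
        if match["season"] < FIRST_SEASON:
            continue
        cleaned.append({
            **match,
            "for_team": normalise_team(match["for_team"]),
            "against_team": normalise_team(match["against_team"]),
        })
    return cleaned


raw = [
    {"season": 1996, "for_team": "Fitzroy", "against_team": "Footscray"},
    {"season": 1999, "for_team": "Kangaroos", "against_team": "Carlton"},
]
cleaned = clean_matches(raw)
```

Here the 1996 match is dropped, and the 1999 match is retained with the 'for' team renamed to North Melbourne.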

The remaining difficulty is that, despite the simplifications above, the number of teams competing varies from season to season.
For instance, the Gold Coast Suns were formed in 2011, and the Greater Western Sydney Giants were formed in 2012. Hence, the
current 18 teams only appear together from 2012 onwards.
Since we also use pre-2012 data, we cannot avoid this issue, and must allow for the set of league
teams in each season wherever this affects our analyses.

## Issues and Observations

We briefly discuss various issues that will affect our analyses, in particular our assignments of probabilities to the various match outcomes (i.e. a win, draw, or loss).

### The effect of draws

Unlike many other sporting games, Australian Rules Football may validly result in a draw between the two teams competing in any given match. However, since draws historically account for fewer than 1% of match outcomes, they are clearly very difficult to predict. 

In fact, a draw is somewhat anomalous in that it could have been avoided if the circumstances of an actual drawn match had been slightly different. One can imagine, for example, a situation in which a ball that bounced awkwardly into the goal had instead bounced the other way (resulting in no score), or in which a ball that hit the goal post (thereby scoring only a behind) had instead sailed between the posts (scoring a goal). Alternative realities aside, a draw, i.e. a zero difference between the scores of the two teams at the end of a match, is essentially an unstable equilibrium.

Thus, although we need to account for draws, how do we deal with them in the data? One approach is to regard a draw as half a win and half a loss for each team. Indeed, whereas an outright result gives the winning team 4 match points (with 0 match points given to the losing team), a draw gives both teams 2 match points (i.e. half each). Hence, we could
regard a draw between team A and team B as both a win for team A and a win for team B, but weighted at 0.5 each (so that the total number of matches is not altered in the data).
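
A minimal sketch of this weighting scheme is given below, assuming that outcomes are stored as the strings `win`, `draw` and `loss` from the point of view of one team. The helper names are hypothetical.

```python
# Weight of each outcome when a draw is counted as half a win.
WIN_WEIGHT = {"win": 1.0, "draw": 0.5, "loss": 0.0}
# The same match as seen by the opposing team.
FLIP = {"win": "loss", "draw": "draw", "loss": "win"}


def weighted_wins(outcomes):
    """Sum of win weights, with each draw contributing 0.5."""
    return sum(WIN_WEIGHT[outcome] for outcome in outcomes)


def win_rate(outcomes):
    """Fraction of matches 'won' under the half-win convention."""
    return weighted_wins(outcomes) / len(outcomes) if outcomes else None


history = ["win", "loss", "draw", "win"]       # made-up results for team A
opponent_view = [FLIP[o] for o in history]     # the same matches for team B
total = weighted_wins(history) + weighted_wins(opponent_view)
```

The weighted wins of the two sides (2.5 and 1.5 in this toy example) sum to the number of matches played, so the total is not altered by the presence of the draw.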

Another approach is to simply discard draws, which will have little effect since they account for fewer than 1% of the data. However, we do not recommend discarding data within any given season (although we will discard entire seasons). For instance, removing draws from the dataset would make it impossible to correctly compute the number of match points for each team in the league ranking. Additionally, neglecting draws would lose relevant information, namely that a draw between teams A and B indicates that those teams were
of roughly equal strength (at least for that match).

Consequently, we preserve drawn matches in the historical data, and accept the loss of accuracy in being unable to predict them. In fact, in some modelling approaches we could ignore draws as a possible outcome, and simply predict the probability of a win for team A against team B.

### Causality and independence of observations 

It is apparent that the outcome of a past match in year $X$ might affect the predicted outcome of a future match in year $Y\ge X$. It is less apparent, but still true, that the outcome of the future match, once it becomes known, might affect the *prediction* of the outcome of the past match. However, causality prevents the future match from actually affecting the outcome of the past match. Thus, *outcomes* have a time-directionality and obey causality, but *predictions* do not have to do so.

To put this more clearly, suppose we predict the probability that team A wins a match at some time $T=t$. Subsequently, we might observe the outcome of a later match at time $T>t$, and use this knowledge to re-predict a different value for the probability of team A winning the match at time $T=t$.
This new knowledge obviously cannot change the actual outcome of that match.

Given that predictions need not obey causality, we are free to predict the outcome of a match at time $T=t$ using all of the observations we have available at times $T<t$ and $T>t$.
This is what we might typically call *retrospective* prediction,
i.e. prediction after observing all of the data.
However, to really be useful in practice, we desire
*prospective* modelling, where we restrict ourselves to predictions at time $T=t$ using only past observations for times $T<t$.

Thus, we might usefully adopt the Markov assumption that future matches (for $T>t$) are conditionally independent of
past matches ($T<t$) given the present ($T=t$). In practice, this means that to predict the outcome of a match at time $T=t$, we first summarise all of the available information for $T<t$, and use just this summary information.

To be precise, suppose that we are, for example, counting events over time. Let
$c^{(t)}$ represent the outcome of a match at time $t$, and let $\phi^{(t)}$ represent the historical features (e.g. the counts) prior to the match. Then clearly these counts are *not* independent over time, since
$\phi^{(t)}=h(\phi^{(t-1)},c^{(t-1)})$ for some deterministic function $h$.
However, if we **only** predict the outcome of a future match, e.g. $c^{(t)}$, based on the current features,
e.g. $\phi^{(t)}$, then this prediction **is** independent of all past counts, since the current count already encapsulates the past counts.
In other words, we have
\begin{eqnarray}
P\left(c^{(t)}\mid\phi^{(t)},c^{(t-1)},\phi^{(t-1)}\right)
& = &
P\left(c^{(t)}\mid c^{(t-1)},\phi^{(t-1)}\right)
~\doteq~
P\left(c^{(t)}\mid\phi^{(t)}\right)\,,
\end{eqnarray}
since our model (by definition) depends only on the
current features $\phi^{(t)}=h(\phi^{(t-1)},c^{(t-1)})$.
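
The recursion $\phi^{(t)}=h(\phi^{(t-1)},c^{(t-1)})$ can be sketched as follows, taking the features to be running counts of past outcomes for a single team. This choice of features, and the names `update` and `prospective_cases`, are assumptions made purely for illustration.

```python
OUTCOMES = ("win", "draw", "loss")


def update(features, outcome):
    """The function h: fold the previous outcome into the previous features."""
    counts = dict(features)
    counts[outcome] += 1
    return counts


def prospective_cases(outcomes):
    """Pair each outcome c(t) with features phi(t) built only from earlier outcomes."""
    features = {name: 0 for name in OUTCOMES}  # phi(1): nothing observed yet
    cases = []
    for outcome in outcomes:
        cases.append((dict(features), outcome))  # phi(t) excludes c(t) itself
        features = update(features, outcome)
    return cases


cases = prospective_cases(["win", "loss", "win"])
```

Each case $(\phi^{(t)},c^{(t)})$ produced this way is prospective, since the features for the third match count only the first two outcomes.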

Hence, we see that (prediction of) the future is conditionally independent of the past
given the present, which satisfies the Markov assumption. Thus, 
letting $\mathbf{C}=[c^{(1)},c^{(2)},\ldots]$ be the vector of match outcomes, and
letting $\Phi=[\phi^{(1)},\phi^{(2)},\ldots]$ be the matrix of match features,
we take the pairs $(c^{(t)},\phi^{(t)})$ and $(c^{(t-1)},\phi^{(t-1)})$ to be independent in the sense that
\begin{eqnarray}
P\left(\mathbf{C}\mid\phi^{(1)}\right) & = &
P\left(c^{(1)}\mid\phi^{(1)}\right)
\,P\left(c^{(2)}\mid c^{(1)},\phi^{(1)}\right)
\,P\left(c^{(3)}\mid c^{(2)},c^{(1)},\phi^{(1)}\right)
\cdots
\\& \doteq &
P\left(c^{(1)}\mid\phi^{(1)}\right)\,
P\left(c^{(2)}\mid\phi^{(2)}\right)\,
P\left(c^{(3)}\mid\phi^{(3)}\right)\cdots
\\& \doteq &
P(\mathbf{C}\mid\Phi)\,.
\end{eqnarray}

### Temporal heterogeneity

How much credence should we give to data observed in the past, compared to data observed in (or close to) the present? Should every season be treated equally, or should we discount (i.e. down-weight) or even neglect seasons long in the past?

We know, in particular, that the players listed to play in each team vary from season to season, e.g. due to older players retiring, or under-performing players being dropped and younger (but possibly more inexperienced) players being drafted in their place. Even on a match-by-match basis within a given season, some team players will be selected to play, and others will not (due to injury, inexperience, availability, etc.). Thus, should team A in season $X$ be treated as being the same team A in season $Y$? If not, could we even model the differences?

Empirically, it turns out that treating all past seasons equally seems to lead to a degradation in the accuracy of predicting the current season, compared to, say, using just the results of the previous season. As discussed above, we might expect that the player list of last year's team (say, for some team A) is similar to the current year's list. However, the player list in the year before that will be less similar, and the dissimilarity will increase the further one looks into the past.
Consequently, it should not be surprising that the match results for past seasons might be misleading for the current season.

However, there is a counter-argument that must be considered. If we assume that there exists some time-independent, underlying ability of a team, then there might be a reversion-to-the-mean effect that is revealed by averaging over all past seasons. In other words, a team might have good seasons and bad seasons, but on average will have an 'average' season.

The empirical result that seasons long past should be discounted or discarded, however, suggests that there is no time-independent effect, and the fact that a team's composition varies over time suggests we shouldn't expect one.

### Data scarcity

We now have quite a bit of AFL data for years 1997 to the present (or nearly to the present, depending upon how often we update the match data). However, when we start delving into the data at the level of team A versus team B, we begin to run into the problem of data scarcity. This problem is exacerbated when we try to model rare events. For example, drawn matches definitely do occur, but not very often. Thus, given all historical matches between two particular teams A and B, we might never have observed a draw, but we must not rule out the possibility of a draw in the future.

The problem of data scarcity becomes even worse if we attempt to model effects that vary over time. Consequently, 
it is more
tractable to ignore time-varying effects, such as changes in composition and ability of teams, and to assume that all model parameters remain constant over time, i.e. are temporally homogeneous. Note that this does not prevent us from modelling temporal sequences, but merely asserts that all temporal sequences are generated by the same fixed process with time-invariant dependencies.
However, given the warning of the previous 
[section](#Temporal-heterogeneity "Section: Temporal heterogeneity") 
about the doubtful existence of time-invariant properties, it would probably be prudent to restrict explicit temporal modelling to just the current
season under consideration, rather than seeking a 
solution over the long-term.

Further note that since there are more historical matches (in any given season) near the end of the season compared to the start of the season, our early predictions will typically need to be tempered with prior estimates to offset intra-seasonal data scarcity.
It is (potentially) reasonable to use data from the previous season to form these prior estimates.
However, empirically it is found that we should restrict our models to using prior probabilities 
(see the [section](#Backoff-and-smoothing "Section: Backoff and smoothing") on smoothing) rather than prior counts, since using prior counts will swamp the current counts observed early in the season.
In particular, using prior counts exacerbates an effect noted in the previous 
[section](#Temporal-heterogeneity "Section: Temporal heterogeneity"), namely that teams may have good seasons and bad seasons. Hence, a good season followed by an average or bad season will give overly-optimistic priors,
and a bad season followed by an average or good season will give overly-pessimistic priors.

### Model validation

In order to fairly test our ideas, we should partition the observed data into *training* data, that can be used to estimate the parameters of our various models, and *testing* data, on which we can test the accuracies of the models' predictions. These two data sets must be independent of each other.

The partitioning of temporal data is not straightforward. Ideally, we would like to segment the data into contiguous blocks, and assign some blocks to the training set and some to the testing set, preferably at random. Militating against this strategy is the problem of meaningfully extracting historical quantities of interest, particularly if these quantities are based on temporal relationships.

To put it more concretely, suppose we built a system that took in data from year $X$ to year $Y$, and used it to predict the outcomes of matches in year $Y+1$. That is, we segregated the data between years $X$ to $Y$ and year $Y+1$ into the training and testing sets, respectively.
Now, once the football season is over for year $Y+1$, may we add that year's data to the system, in order to better predict year $Y+2$? If we pursue this course, then we have the problem that testing data ultimately become training data.

Alternatively, suppose we wanted to predict the outcomes of the finals rounds using the known outcomes of the minor rounds, for each season. Then we might simply assign some entire seasons to the training data and some to the testing data. However, if season $Y+1$ ends up in the training set, and season $Y$ ends up in the testing set, then are we allowed to use the temporal sequence $Y\rightarrow Y+1$ in our modelling?

To answer these questions, we return to our 
[earlier](#Causality-and-independence-of-observations "Section: Causality and independence of observations") 
assumption of temporal independence of observations.
We showed that if we extract all historical match features, i.e. 
$\Phi=[\phi^{(1)},\phi^{(2)},\ldots]$,
then each case $(c^{(t)},\phi^{(t)})$, i.e. match outcome and match features at time $t$, is independent
of the other cases. Consequently, the individual cases may be arbitrarily partitioned into training and testing sets.

Conceptually, it no longer matters whether we compute the features in advance of
partitioning the data, or instead partition the data first and then compute the features
on the fly. We are permitted to use cases in the *testing* data as historical context
for computing features of the *training* data, and vice versa.

## Basics of Modelling

Here we discuss some of the common assumptions and interpretations
of our various predictive models with respect to the historical data.

### Probabilistic classification and accuracy

Each match has a result or outcome in the
set $\mathcal{C}=\{\mathtt{win},\mathtt{draw},\mathtt{loss}\}$
of outcomes (or classes). Thus, we take each model to be in the form
of a *probabilistic classifier*, giving a probability estimate for
each outcome in $\mathcal{C}$. For convenience, these probabilities
will always be represented in the order $\mathbf{p}\doteq[p_\mathtt{win},p_\mathtt{draw},p_\mathtt{loss}]$.
Note that since a win for one team is a loss for the opposing team, we 
(arbitrarily but consistently) interpret each match prediction and outcome with respect to
the 'for' team versus the 'against' team.

The accuracy of a probabilistic model can be measured by how
well the estimated probabilities of each match outcome agree with the known outcomes.
Thus, for a given match, we let $\hat{\mathbf{p}}$ denote the estimated probabilities of outcomes 
$\mathcal{C}$, and let $\bar{\mathbf{p}}$ denote the true probabilities,
i.e. $\bar{p}_c=\delta(c=c^{(t)})$ for known outcome $c^{(t)}$,
where $\delta(X)=1$ (or $0$) if proposition $X$ is true
(or false, respectively).

Some typical measures of predictive accuracy or inaccuracy thus include:

* *absolute error*: $\sum_{c\in\mathcal{C}}\left|\hat{p}_c-\bar{p}_c\right|$;
* *square error*: $\sum_{c\in\mathcal{C}}(\hat{p}_c-\bar{p}_c)^2$;
* *zero-one accuracy*: $\sum_{c\in\mathcal{C}}\bar{p}_c
\delta(\hat{p}_c=\hat{p}_\mathtt{max})$;
* *cross-entropy*: $-\sum_{c\in\mathcal{C}}\bar{p}_c\log\hat{p}_c$;

where $\hat{p}_\mathtt{max}\doteq\mathtt{max}(\hat{\mathbf{p}})$.
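
The four measures listed above can be sketched for a single match as follows. The function names are hypothetical, and the cross-entropy is computed here with a leading minus sign (an assumed convention), so that smaller values indicate better predictions.

```python
import math

CLASSES = ("win", "draw", "loss")


def one_hot(outcome):
    """True probabilities: 1 for the known outcome, 0 otherwise."""
    return [1.0 if c == outcome else 0.0 for c in CLASSES]


def absolute_error(p_hat, outcome):
    return sum(abs(p - q) for p, q in zip(p_hat, one_hot(outcome)))


def square_error(p_hat, outcome):
    return sum((p - q) ** 2 for p, q in zip(p_hat, one_hot(outcome)))


def zero_one_accuracy(p_hat, outcome):
    """1 if the known outcome has the maximum estimated probability, else 0."""
    return 1.0 if p_hat[CLASSES.index(outcome)] == max(p_hat) else 0.0


def cross_entropy(p_hat, outcome):
    """Negative log of the probability given to the known outcome.

    Assumes that probability is strictly positive.
    """
    return -math.log(p_hat[CLASSES.index(outcome)])


p_hat = [0.6, 0.1, 0.3]  # made-up estimate, ordered [win, draw, loss]
errors = {
    "absolute": absolute_error(p_hat, "win"),
    "square": square_error(p_hat, "win"),
    "zero_one": zero_one_accuracy(p_hat, "win"),
    "cross_entropy": cross_entropy(p_hat, "win"),
}
```

In practice, these per-match values would be averaged over all matches in the testing data.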

### Indifference to team ordering

Next,
we note that each match specifies a 'for' team (say, team A) and an 'against' team (say, team B). These labels are purely arbitrary, and independent of other match characteristics, such as
which team is to play at home and which to play away. For simplicity,
each model will always predict the match outcome with respect to the 'for' team. Consequently, a label of $\mathtt{win}$ represents
a win for the 'for' team and a loss for the 'against' team, and
conversely a label of $\mathtt{loss}$ represents a loss for the 'for' team and a win for the 'against' team.

Where it becomes necessary to distinguish the results by team, we shall append a team subscript to the outcome, such that
$\mathtt{win}_A=\mathtt{loss}_B$, $\mathtt{loss}_A=\mathtt{win}_B$
and $\mathtt{draw}_A=\mathtt{draw}_B$.

Due to the fact that the 'for' and 'against' labels are arbitrary,
each model must be indifferent to the order of the match teams,
in the following sense. Let model $\mathcal{M}$ be defined such that
\begin{eqnarray}
\mathcal{M}(A,B) & = & [P(\mathtt{win}_A),P(\mathtt{draw}_A),P(\mathtt{loss}_A)]
\\&=&
[P(\mathtt{loss}_B),P(\mathtt{draw}_B),P(\mathtt{win}_B)]\,.
\end{eqnarray}
Then it follows that we must have
\begin{eqnarray}
\mathcal{M}(B,A) & = & [P(\mathtt{win}_B),P(\mathtt{draw}_B),P(\mathtt{loss}_B)]
\\&=&
[P(\mathtt{loss}_A),P(\mathtt{draw}_A),P(\mathtt{win}_A)]\,.
\end{eqnarray}
More concisely, if function $f$ estimates the
probability of a win, such that
\begin{eqnarray}
P(\mathtt{win}_A) & = & P(\mathtt{loss}_B) = f(A,B)\,,
\end{eqnarray}
and function $g$ estimates the probability of a draw, such that
\begin{eqnarray}
P(\mathtt{draw}_A) & = & P(\mathtt{draw}_B) = g(A,B) = g(B,A)\,,
\end{eqnarray}
then
\begin{eqnarray}
\mathcal{M}(A,B) & = & [f(A,B),\;g(A,B),\;f(B,A)]\,.
\end{eqnarray}
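
As a quick check of this construction, the sketch below builds $\mathcal{M}$ from a pair of functions $f$ and $g$, and verifies that swapping the teams merely swaps the win and loss entries. The particular $f$ and $g$ used here (based on made-up team strengths and a constant draw probability) are placeholders, and not the models developed in these notebooks.

```python
def make_model(f, g):
    """Build M(A, B) = [f(A, B), g(A, B), f(B, A)] from win and draw estimators."""
    def model(team_a, team_b):
        return [f(team_a, team_b), g(team_a, team_b), f(team_b, team_a)]
    return model


def is_order_indifferent(model, team_a, team_b, tol=1e-12):
    """Check that M(B, A) is M(A, B) with the win and loss entries swapped."""
    forward = model(team_a, team_b)
    backward = model(team_b, team_a)
    return all(abs(x - y) <= tol for x, y in zip(forward, reversed(backward)))


STRENGTH = {"A": 3.0, "B": 1.0}  # made-up team strengths
P_DRAW = 0.01                    # made-up constant draw probability


def f(team_a, team_b):
    """Placeholder win probability, proportional to relative strength."""
    share = STRENGTH[team_a] / (STRENGTH[team_a] + STRENGTH[team_b])
    return (1.0 - P_DRAW) * share


def g(team_a, team_b):
    """Placeholder draw probability, symmetric in the two teams."""
    return P_DRAW


model = make_model(f, g)
prediction = model("A", "B")
```

Note that the entries of `prediction` sum to one only because this particular $f$ satisfies $f(A,B)+f(B,A)=1-g(A,B)$.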

### Marginal models

One type of model that offers simplicity is the *marginal model*, given by
\begin{eqnarray}
\mathcal{M}(A,*) & = & 
[P(\mathtt{win}_A),P(\mathtt{draw}_A),P(\mathtt{loss}_A)]\,,
\end{eqnarray}
which estimates the respective probabilities of a win, draw or loss
for team A against *any* arbitrary opponent. 
Similarly, the converse marginal model is
\begin{eqnarray}
\mathcal{M}(*,B) & = & 
[P(\mathtt{loss}_B),P(\mathtt{draw}_B),P(\mathtt{win}_B)]\,,
\end{eqnarray}
which estimates the respective probabilities of a win, draw or loss
for any arbitrary opponent *against* team B. 

A simple, additive model that obeys our restriction of team indifference is thus
\begin{eqnarray}
\mathcal{M}_\mathtt{add}(A,B) & \doteq & \frac{\mathcal{M}(A,*)+\mathcal{M}(*,B)}{2}\,.
\end{eqnarray}
Alternatively, a simple, multiplicative model that incorporates
a prior is given by
\begin{eqnarray}
\mathcal{M}_\mathtt{mult}(A,B) & \doteq & 
\mathcal{M}(A,*)\otimes \mathcal{M}(*,B)\oslash \mathcal{M}(*,*)
\\
& \doteq &
\frac{\left[
\frac{P(\mathtt{win}_A)\,P(\mathtt{loss}_B)}{P(\mathtt{win})},
\frac{P(\mathtt{draw}_A)\,P(\mathtt{draw}_B)}{P(\mathtt{draw})},
\frac{P(\mathtt{loss}_A)\,P(\mathtt{win}_B)}{P(\mathtt{loss})}
\right]}
{
\frac{P(\mathtt{win}_A)\,P(\mathtt{loss}_B)}{P(\mathtt{win})}+
\frac{P(\mathtt{draw}_A)\,P(\mathtt{draw}_B)}{P(\mathtt{draw})}+
\frac{P(\mathtt{loss}_A)\,P(\mathtt{win}_B)}{P(\mathtt{loss})}
}
\,,
\end{eqnarray}
where $\otimes$ denotes element-wise vector multiplication with renormalisation,
and $\oslash$ denotes element-wise vector division with renormalisation.

A typical prior model $\mathcal{M}_\mathtt{prior}(A,B)=\mathcal{M}(*,*)$ obeys
$P(\mathtt{win})=P(\mathtt{loss})=\frac{1}{2}[1-P(\mathtt{draw})]$, since there are no *a priori* reasons to choose team A over team B.
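
The additive and multiplicative combinations of the marginal models can be sketched as below. The marginal probabilities are made-up numbers, and all vectors are ordered as [win, draw, loss] with respect to the 'for' team A.

```python
def normalise(vector):
    """Rescale a vector of non-negative values to sum to one."""
    total = sum(vector)
    return [v / total for v in vector]


def additive(m_a, m_b):
    """M_add(A, B): element-wise average of M(A, *) and M(*, B)."""
    return [(x + y) / 2.0 for x, y in zip(m_a, m_b)]


def multiplicative(m_a, m_b, prior):
    """M_mult(A, B): element-wise product divided by the prior, renormalised."""
    return normalise([x * y / z for x, y, z in zip(m_a, m_b, prior)])


p_draw = 0.01
prior = [(1 - p_draw) / 2, p_draw, (1 - p_draw) / 2]  # M(*, *)
m_a = [0.60, 0.01, 0.39]  # made-up M(A, *): win, draw, loss for team A
m_b = [0.55, 0.01, 0.44]  # made-up M(*, B): loss, draw, win for team B

p_add = additive(m_a, m_b)
p_mult = multiplicative(m_a, m_b, prior)
```

When both marginal models favour team A, the multiplicative model reinforces that agreement, whereas the additive model merely averages the two estimates. If $\mathcal{M}(*,B)$ equals the prior, then the multiplicative model reduces to $\mathcal{M}(A,*)$.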


### Combined models

Another useful modelling technique is to combine a collection 
$(\mathcal{M}_1,\mathcal{M}_2,\ldots,\mathcal{M}_M)$ of sub-models, where each sub-model utilises a different subset of the match features. 

An additive combination of these sub-models takes the form
\begin{eqnarray}
\mathcal{M}_\mathtt{add}(A,B) & \doteq &
\sum_{k=1}^{M}w_k\,\mathcal{M}_k(A,B)\,,
\end{eqnarray}
where $w_k\ge 0$ and $\sum_{k=1}^{M}w_k=1$.

The sub-model weights may either be chosen in advance, or estimated from the training data via iterative posterior updates (see [Appendix A](A_additive_weights.ipynb "Appendix A: Additively Weighted Models")), namely
\begin{eqnarray}
w_k & \leftarrow & \frac{1}{N}\sum_{d=1}^{N}
\frac{w_{k}\,\mathcal{M}_k(A^{(d)},B^{(d)})[c^{(d)}]}
{\mathcal{M}_\mathtt{add}(A^{(d)},B^{(d)})[c^{(d)}]}
\,,
\end{eqnarray}
where
$A^{(d)}$ and $B^{(d)}$ denote the 'for' and 'against' teams, respectively, for the $d$-th match, 
and $\mathcal{M}(A^{(d)},B^{(d)})[c^{(d)}]$ denotes the probability that model $\mathcal{M}$ assigns to the true outcome $c^{(d)}$ of the match.
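
A minimal sketch of this iterative update is given below, assuming that the sub-model predictions for each training match have already been computed. The data layout and the toy predictions are hypothetical.

```python
def mixture(weights, predictions):
    """Additive combination sum_k w_k * M_k for a single match."""
    n_classes = len(predictions[0])
    return [
        sum(w * p[c] for w, p in zip(weights, predictions))
        for c in range(n_classes)
    ]


def update_weights(weights, data):
    """One iteration of the posterior weight update.

    `data` is a list of (predictions, outcome_index) pairs, where
    `predictions[k]` is the probability vector from sub-model k and
    `outcome_index` locates the true outcome within that vector.
    """
    n = len(data)
    new_weights = [0.0] * len(weights)
    for predictions, outcome in data:
        combined = mixture(weights, predictions)[outcome]
        for k, (w, p) in enumerate(zip(weights, predictions)):
            new_weights[k] += w * p[outcome] / combined / n
    return new_weights


# Two made-up sub-models evaluated on two matches; vectors are [win, draw, loss].
data = [
    ([[0.7, 0.01, 0.29], [0.4, 0.01, 0.59]], 0),  # true outcome: win
    ([[0.2, 0.01, 0.79], [0.5, 0.01, 0.49]], 2),  # true outcome: loss
]
first_step = update_weights([0.5, 0.5], data)

weights = [0.5, 0.5]
for _ in range(20):
    weights = update_weights(weights, data)
```

Each update preserves the constraint that the weights sum to one, and shifts weight towards the sub-models that assign higher probabilities to the observed outcomes.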

The multiplicative combination of sub-models has some stricter conditions, namely that the feature sets for each sub-model are non-overlapping and uncorrelated. Under these conditions, the (unweighted) multiplicative model is given by
\begin{eqnarray}
\mathcal{M}_\mathtt{mult}(A,B) & \doteq & 
\mathcal{M}_\mathtt{prior}(A,B)\,\otimes_{k=1}^{M}\,
\left[\mathcal{M}_k(A,B)\oslash\mathcal{M}_\mathtt{prior}(A,B)\right]\,.
\end{eqnarray}

This combined model is only approximate if sub-models share common features, or if features are correlated across different sub-models. In order to help reduce the effects of dependence or correlation, we could instead use the weighted form
\begin{eqnarray}
\mathcal{M}_\mathtt{mult}(A,B) & \doteq & 
\mathcal{M}_\mathtt{prior}(A,B)\,\otimes_{k=1}^{M}\,
\left[\mathcal{M}_k(A,B)\oslash\mathcal{M}_\mathtt{prior}(A,B)\right]^{w_k}\,.
\end{eqnarray}
However, unlike for the additive model, there are no obvious constraints on the weights.
Instead, upon taking logarithms, we see that the weighted multiplicative model is equivalent to an additive logistic classifier with features of the form
$\psi=\left[\log\left\{\mathcal{M}_k(A,B)\oslash\mathcal{M}_\mathtt{prior}(A,B)
\right\}\right]_{k=1}^{M}$. Thus, we may instead train a logistic classifier using arbitrary weight regularisation, e.g. $L_2$ and/or $L_1$ norms.
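
The equivalence can be seen in the following sketch, which evaluates the weighted multiplicative model in log space. The weights here are simply supplied by hand (an assumption for illustration), and no logistic classifier is actually trained.

```python
import math


def log_ratio_features(sub_models, prior):
    """psi_k = log(M_k / M_prior), element-wise, for each sub-model k.

    The renormalisation inside the element-wise division is omitted, since it
    only adds a constant that cancels in the final normalisation.
    """
    return [[math.log(p / q) for p, q in zip(m, prior)] for m in sub_models]


def weighted_multiplicative(sub_models, prior, weights):
    """Normalised exponential of log(prior) + sum_k w_k * psi_k."""
    psi = log_ratio_features(sub_models, prior)
    logits = [
        math.log(q) + sum(w * row[c] for w, row in zip(weights, psi))
        for c, q in enumerate(prior)
    ]
    peak = max(logits)  # subtract the maximum for numerical stability
    unnormalised = [math.exp(v - peak) for v in logits]
    total = sum(unnormalised)
    return [u / total for u in unnormalised]


prior = [0.495, 0.01, 0.495]
sub_models = [[0.60, 0.02, 0.38], [0.55, 0.01, 0.44]]  # made-up predictions
combined = weighted_multiplicative(sub_models, prior, [0.7, 0.5])
```

With a single sub-model and unit weight, the combination recovers that sub-model, and with zero weights it recovers the prior.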


### Backoff and smoothing

When attempting to predict the outcome of a match between two teams, say team A and team B, it is useful to consider any previous matches played by team A against team B.
However, as noted in an earlier 
[section](#Data-scarcity "Section: Data scarcity") 
on the scarcity of data, one problem with this approach is that there may be few or even no such matches played prior to the match in question. For instance, we might be considering the first season one of the teams has
ever played, or we might be considering the first round in a given season and ignoring
past seasons.

As an example, suppose we count the previous games played by team A against B, and wish to compute the model
\begin{eqnarray}
\mathcal{M}(A,B) & \doteq &
\frac{[c(\mathtt{win}_{A,B}),c(\mathtt{draw}_{A,B}),c(\mathtt{loss}_{A,B})]}
{c(\mathtt{win}_{A,B})+c(\mathtt{draw}_{A,B})+c(\mathtt{loss}_{A,B})}
\,,
\end{eqnarray}
where $c(X_{A,B})$ counts the number of matches played between teams A and B with a result of $X$ for team A.
What do we do if all of the counts are zero, i.e. team A has not previously played
against team B (at least within the historical context under consideration)?

One solution is to *back-off* from unobserved quantities,
e.g. counting matches played by both teams together, to observed quantities,
e.g. counting matches played by each team separately.
This is the basis of the marginal models described in an earlier 
[section](#Marginal-models "Section: Marginal models"), 
where if we cannot reliably compute $\mathcal{M}(A,B)$ directly, then we could instead back-off
to 
\begin{eqnarray}
\mathcal{M}_\mathtt{backoff}(A,B) & \doteq &
\frac{1}{2}\mathcal{M}(A,*)+\frac{1}{2}\mathcal{M}(*,B)
\,,
\end{eqnarray}
where
\begin{eqnarray}
\mathcal{M}(A,*) & \doteq &
\frac{[c(\mathtt{win}_{A}),c(\mathtt{draw}_{A}),c(\mathtt{loss}_{A})]}
{c(\mathtt{win}_{A})+c(\mathtt{draw}_{A})+c(\mathtt{loss}_{A})}
\,,
\end{eqnarray}
and
\begin{eqnarray}
\mathcal{M}(*,B) & \doteq &
\frac{[c(\mathtt{loss}_{B}),c(\mathtt{draw}_{B}),c(\mathtt{win}_{B})]}
{c(\mathtt{loss}_{B})+c(\mathtt{draw}_{B})+c(\mathtt{win}_{B})}
\,,
\end{eqnarray}
such that $c(X_T)$ counts the number of matches played by team $T$ (against any opponent) with the outcome of $X$ for team $T$.

However, now what if team B has (within the historical context) not played *any* games?
In this case, we could back-off from $\mathcal{M}(*,B)$ to the prior model
$\mathcal{M}(*,*)=[p_\mathtt{win},p_\mathtt{draw},p_\mathtt{loss}]$.
Alternatively, another solution is to *smooth* the counts, for example
\begin{eqnarray}
\mathcal{M}_\mathtt{smooth}(*,B) & \doteq &
\frac{[c(\mathtt{loss}_{B})+p_\mathtt{win},
c(\mathtt{draw}_{B})+p_\mathtt{draw},c(\mathtt{win}_{B})+p_\mathtt{loss}]}
{c(\mathtt{loss}_{B})+c(\mathtt{draw}_{B})+c(\mathtt{win}_{B})+1}
\,.
\end{eqnarray}
Note that in this case, if all of the counts are zero, then the smoothed model reduces
to a backoff model. Hence, in general, a (variable) combination of back-off and smoothing may be applied to each model, depending upon circumstances. 

The additional benefit of smoothing here is that if the counts are not all zero but are too few to directly produce reliable estimates, then the smoothed counts should prove more reliable. For example, if team B has played previous matches but has never drawn a match, then smoothing with the prior will allow for the (downgraded) possibility of a draw in the future.
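
The smoothed estimate might be sketched as follows, with the prior probabilities and the counts being made-up values. Both vectors are ordered as [win, draw, loss] from the point of view of the 'for' team, so that for $\mathcal{M}(*,B)$ the counts would be supplied as [losses, draws, wins] of team B.

```python
def smoothed_model(counts, prior):
    """Add the prior (with a total weight of one match) to the counts."""
    total = sum(counts) + 1.0  # the prior probabilities sum to one
    return [(c + p) / total for c, p in zip(counts, prior)]


prior = [0.495, 0.01, 0.495]                     # made-up M(*, *)
no_history = smoothed_model([0, 0, 0], prior)    # falls back to the prior
some_history = smoothed_model([3, 0, 1], prior)  # a draw remains possible
```

With no observed matches the estimate equals the prior, and with three wins and one loss the draw probability is small (0.002 here) but non-zero.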

## Feature Extraction

Earlier, we discussed the problem of 
[data scarcity](#Data-scarcity "Section: Data scarcity") 
and the need for 
[marginal models](#Marginal-models "Section: Marginal models") 
that predict the outcomes of matches between a specified team and any other, arbitrary team.
Useful types of information for such models include *environmental* features and *historical* features.

### Environmental features

Environmental features indicate expected information about the match conditions, such as 
weather (e.g. cloudy or sunny, windy or calm, rainy or dry),
light conditions (e.g. day or night match), and ground conditions (e.g. physical dimensions, soft or hard turf), et cetera.

These features affect team performance during the match. For instance, empirical observations suggest that some teams seem to be able to play better in the rain than other teams, giving them an advantage. However, it is not clear how strong these effects might be, and further analysis is required. Additionally, weather prediction is uncertain, and some degree of model sophistication would be required to allow for this
uncertainty. Typically, it is easier to use simple models and ignore the environmental effects.

One strong environmental feature, however, is the so-called *home-ground advantage*.
It has been noted in the literature for many different sporting games that there is a distinct effect whereby the team playing on their home ground (the 'home' team) has 
an advantage against the opposing side (the 'away' team, who must travel to the match ground). It is still unclear as to whether this advantage is physical or psychological. 

Physical effects might include better familiarity of the 'home' team with the peculiarities of their oval (e.g. dimensions, surface hardness, boggy patches, etc.).
Alternatively, the 'away' team might suffer from fatigue caused by having to travel (possibly interstate) to another ground.

Psychologically, it could simply be a matter of confidence (again, familiarity with
the ground), or pride for the 'home' team playing in front of the local fans, and thus
a determination not to lose face.

Regardless of the cause, AFL matches also demonstrate the home-ground advantage, such that the home team wins about 55% of its matches (averaged across seasons and teams). Consequently, we must allow for this effect in our predictive modelling, and also in the analysis of the accuracy of such models.
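As a minimal sketch of how the home-ground effect might be measured, the following computes the fraction of home wins, draws and away wins from a list of match records. The record layout `(home_team, away_team, home_points, away_points)`, the function name and the toy scores are all assumptions made for illustration, not data from this project.

```python
# Hypothetical match records: (home_team, away_team, home_points, away_points).
matches = [
    ("A", "B", 90, 80),
    ("B", "C", 70, 70),
    ("C", "A", 60, 100),
    ("A", "C", 85, 60),
]

def home_outcome_rates(matches):
    """Return the fraction of matches won by the home team, drawn, or won by the away team."""
    n = len(matches)
    home_wins = sum(1 for _, _, hp, ap in matches if hp > ap)
    draws = sum(1 for _, _, hp, ap in matches if hp == ap)
    away_wins = n - home_wins - draws
    return {
        "home_win": home_wins / n,
        "draw": draws / n,
        "away_win": away_wins / n,
    }

rates = home_outcome_rates(matches)
```

The `home_win` rate also serves as a naive baseline: a model that always predicts the home team should be about this accurate, so any predictive model ought to be compared against it.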

### Historical features

Historical features involve information we may usefully extract from past matches.
However, we cautioned 
[earlier](#Temporal-heterogeneity "Section: Temporal heterogeneity") 
that matches long in the past might not be relevant, due largely to the fact that teams change in composition over time. Consequently, we expect temporal variability of teams' relative strengths in offense and defense. This makes such effects difficult to model, especially due to the problem of 
[data scarcity](#Data-scarcity "Section: Data scarcity").

In practice, previous analysis of historical match results (not shown here) suggests that, when predicting the outcome of a given match in a given season, the previous matches within that season are most relevant. Additionally, the results of matches from the preceding season are relevant as prior information for the purposes of model
backoff and smoothing.

Useful historical statistics include match scores, match outcomes, and league rankings.
For marginal models, the score statistics count the total number of points scored by a team against all other teams (the 'for' score), and the number of points scored against that team by all other teams (the 'against' score). The 'for' score indirectly measures the offensive strength of the team, and the 'against' score measures the team's defensive strength (or lack thereof).
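The marginal 'for' and 'against' scores can be accumulated in a single pass over the match records. The sketch below assumes a hypothetical record layout of `(home_team, away_team, home_points, away_points)`; the function name and the toy scores are illustrative only.

```python
from collections import defaultdict

# Hypothetical match records: (home_team, away_team, home_points, away_points).
matches = [
    ("A", "B", 90, 80),
    ("B", "C", 70, 70),
    ("C", "A", 60, 100),
]

def score_totals(matches):
    """Return {team: (points_for, points_against)}, summed over all opponents."""
    points_for = defaultdict(int)
    points_against = defaultdict(int)
    for home, away, home_pts, away_pts in matches:
        # Each match contributes to both teams' marginal totals.
        points_for[home] += home_pts
        points_against[home] += away_pts
        points_for[away] += away_pts
        points_against[away] += home_pts
    return {team: (points_for[team], points_against[team]) for team in points_for}

totals = score_totals(matches)
```

Here team A ends with 190 points 'for' and 140 'against', regardless of which opponents those points were scored against.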

Similarly, the outcome statistics count the number of wins, draws and losses each team has had against all other opponents. Again, the wins indirectly measure offensive strength, and the losses measure defensive strength. Draws indicate that the two opposing teams were about equally matched on the day (subject to the effect of free kicks awarded by umpires). However, wins and losses do not indicate how close these matches were, and thus the outcome statistics are less fine-grained than the score
statistics.
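The outcome statistics can be tallied in the same way as the scores. This sketch again assumes hypothetical records of the form `(home_team, away_team, home_points, away_points)`; the function name and toy scores are illustrative.

```python
from collections import Counter, defaultdict

# Hypothetical match records: (home_team, away_team, home_points, away_points).
matches = [
    ("A", "B", 90, 80),
    ("B", "C", 70, 70),
    ("C", "A", 60, 100),
]

def outcome_counts(matches):
    """Return {team: Counter of 'win', 'draw' and 'loss'} over all opponents."""
    counts = defaultdict(Counter)
    for home, away, home_pts, away_pts in matches:
        if home_pts > away_pts:
            counts[home]["win"] += 1
            counts[away]["loss"] += 1
        elif home_pts < away_pts:
            counts[home]["loss"] += 1
            counts[away]["win"] += 1
        else:
            counts[home]["draw"] += 1
            counts[away]["draw"] += 1
    return counts

outcomes = outcome_counts(matches)
```

Note that the 10-point win and the 40-point win by team A are counted identically, which is exactly the loss of detail relative to the score statistics.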

Finally, the most coarse-grained of the statistics are the league rankings. Effectively, a higher ranking (i.e. smaller rank index) indicates a stronger team, and a lower ranking
(i.e. larger rank index) indicates a weaker team. Note that the rankings are primarily
computed from the match points (4 points for a win, 2 points for a draw, and 0 points for a loss).
Ties in the number of match points are settled by awarding the higher ranking to the
team with the higher score *percentage* (which, for the AFL, is 100% times the
number of score points 'for' a team divided by the number of points 'against').
Thus, the rankings incorporate both score and outcome statistics.
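The ranking rule described here (match points first, then percentage as the tie-breaker) can be sketched as follows. The record layout `(home_team, away_team, home_points, away_points)` and the function name are assumptions for illustration, and any further tie-breaking rules the league may apply are not modelled.

```python
# Hypothetical match records: (home_team, away_team, home_points, away_points).
matches = [
    ("A", "B", 90, 80),
    ("B", "C", 100, 50),
    ("C", "A", 70, 60),
]

def ladder(matches):
    """Return teams ordered from highest to lowest ranking."""
    stats = {}
    for home, away, home_pts, away_pts in matches:
        for team, pts_for, pts_against in ((home, home_pts, away_pts),
                                           (away, away_pts, home_pts)):
            s = stats.setdefault(team, {"match_points": 0, "for": 0, "against": 0})
            s["for"] += pts_for
            s["against"] += pts_against
            if pts_for > pts_against:
                s["match_points"] += 4   # win
            elif pts_for == pts_against:
                s["match_points"] += 2   # draw
            # a loss earns 0 match points

    def sort_key(team):
        s = stats[team]
        # Percentage: 100 times points 'for' divided by points 'against'.
        pct = 100.0 * s["for"] / s["against"] if s["against"] else float("inf")
        return (-s["match_points"], -pct)

    return sorted(stats, key=sort_key)

ranking = ladder(matches)
```

In this toy example every team has one win and 4 match points, so the order is decided entirely by percentage: B (about 128.6%), then A (100%), then C (75%).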

Note that an 'upset' win against a higher-ranked opponent should perhaps be given more weight than an 'expected' win against a lower-ranked opponent.

### Graph features

A special case of the historical features arises because each match may be considered as an edge between the two opposing teams, with all the teams forming vertices in a graph.
The outcome of a match determines the direction of the edge, where a draw is either undirected or bidirectional, depending upon what analytics are to be extracted.

Thus, for a *directed* graph, we may compute both *centrality* and *prestige* features for each team.
Essentially, centrality measures the effect of out-edges from a vertex, whereas
prestige measures the effect of in-edges. Consequently, the normalised eigenvector scores, and the related PageRank scores, actually measure prestige rather than centrality. For an undirected graph, the prestige scores are exactly equal to the centrality scores.
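The distinction between the two kinds of feature is easiest to see with the simplest possible measures, namely weighted in-degree and out-degree. The sketch below assumes that each edge points from the loser to the winner and is weighted by the winning margin; both the direction convention and the weighting are assumptions made for illustration. Degree-based scores are a stand-in here for the eigenvector-based scores mentioned above, not an implementation of them.

```python
from collections import defaultdict

# Hypothetical directed edges: (loser, winner, winning_margin).
edges = [
    ("B", "A", 10),
    ("C", "B", 20),
    ("C", "A", 40),
]

def degree_scores(edges):
    """Return (prestige, centrality) as weighted in-degree and out-degree per team."""
    prestige = defaultdict(int)    # total weight of in-edges
    centrality = defaultdict(int)  # total weight of out-edges
    for src, dst, weight in edges:
        centrality[src] += weight
        prestige[dst] += weight
        # Touch the opposite entries so every team appears in both mappings.
        prestige[src] += 0
        centrality[dst] += 0
    return dict(prestige), dict(centrality)

prestige, centrality = degree_scores(edges)
```

Under this convention a high prestige score reflects large wins, while a high centrality score reflects large losses. If each edge were made undirected (counted in both directions), the two scores would coincide.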

The weight of each edge in a match graph is typically obtained from one of the types of historical statistics. Furthermore, multiple edges between the same two teams (in the same direction) may be amalgamated into a single edge by combining the edge weights,
usually by either summing the edge weights, or taking their mean.
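A minimal sketch of this amalgamation step follows, assuming edges are given as hypothetical `(source, target, weight)` triples; the function name and the `how` parameter are illustrative.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical edges: (source, target, weight), possibly repeated.
edges = [
    ("B", "A", 10),
    ("B", "A", 20),
    ("A", "B", 5),
]

def amalgamate(edges, how="sum"):
    """Combine parallel edges in the same direction into one weighted edge."""
    if how not in ("sum", "mean"):
        raise ValueError("how must be 'sum' or 'mean'")
    grouped = defaultdict(list)
    for src, dst, weight in edges:
        # Direction matters: (B, A) and (A, B) remain separate edges.
        grouped[(src, dst)].append(weight)
    combine = sum if how == "sum" else mean
    return {pair: combine(weights) for pair, weights in grouped.items()}

summed = amalgamate(edges, how="sum")
averaged = amalgamate(edges, how="mean")
```

Summing rewards teams that have met more often, whereas the mean removes the dependence on the number of meetings; which is preferable depends on the statistic used as the edge weight.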

However, caution should be used in both the application and interpretation of graph (or vertex) statistics. For example, standard PageRank adds implicit teleportation edges between all vertices, but in a sporting context a team may **not** oppose itself.
Also, normalised eigenvector scores measure the effect of in-edges but not out-edges, and thus measure the *gain* in prestige of each team, e.g. due to wins, but not the
*loss* of prestige, e.g. due to losses (see [Appendix B](B_graph_analytics.ipynb "Appendix B: Graph Analytics") for more details).