# Feature Extraction

The purpose of this notebook is to extract a variety of machine learning features from the historical AFL match data, to be used in predicting the outcomes of future matches.

As noted in the [introduction](1_introduction.ipynb#Feature-Extraction "Introduction: Feature Extraction"), for reasons of data scarcity we shall typically consider marginal models, i.e. models predicting the match outcome (win, draw or loss) of a given team versus any arbitrary opponent.

Prior to a given match, one set of useful historical statistics includes the number of games previously won, drawn and lost by each team. Also of importance are the numbers of points scored by and against each team. Similarly, the league ranking of each team might be predictive of future outcomes. In general, any such summary statistics might be useful for match prediction.

## Load the data

In [1]:
import sys
import os

sys.path.append(os.path.join("..", "python"))

In [2]:
import match_tools

In [3]:
from datetime import datetime

import numpy as np
import pandas as pd

In [4]:
df_matches = pd.read_csv(os.path.join("..", "data", "matches.csv"))

In [5]:
date_fn = lambda s: datetime.strptime(s, match_tools.DATETIME_FORMAT)
df_matches['timestamp'] = df_matches.datetime.apply(date_fn)
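
The value of `match_tools.DATETIME_FORMAT` is not shown in this notebook. As a minimal sketch, the format string below is an assumption inferred from the `datetime` strings in the match data (e.g. `Sat 31-Mar-1990 2:10 PM`); the name `ASSUMED_DATETIME_FORMAT` is hypothetical. Note that `%a`, `%b` and `%p` are locale-dependent, so this presumes an English locale.

```python
from datetime import datetime

# Assumed format, inferred from strings such as 'Sat 31-Mar-1990 2:10 PM';
# the real value lives in match_tools.DATETIME_FORMAT and may differ.
ASSUMED_DATETIME_FORMAT = "%a %d-%b-%Y %I:%M %p"

parsed = datetime.strptime("Sat 31-Mar-1990 2:10 PM", ASSUMED_DATETIME_FORMAT)
print(parsed)
```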

In [6]:
df_matches

Unnamed: 0,season,round,datetime,venue,for_team,for_is_home,for_goals1,for_behinds1,for_goals2,for_behinds2,...,against_behinds3,against_goals4,against_behinds4,against_total_score,against_match_points,against_is_win,against_is_draw,against_is_loss,edge_type,timestamp
0,1990,R1,Sat 31-Mar-1990 2:10 PM,M.C.G.,Melbourne,False,6,2,4,1,...,4,3,4,89,0,False,False,True,defeated,1990-03-31 14:10:00
1,1990,R1,Sat 31-Mar-1990 2:10 PM,Waverley Park,Geelong,True,5,3,2,3,...,7,10,6,192,4,True,False,False,lost-to,1990-03-31 14:10:00
2,1990,R1,Sat 31-Mar-1990 2:10 PM,Princes Park,Carlton,True,6,5,4,4,...,3,6,5,104,4,True,False,False,lost-to,1990-03-31 14:10:00
3,1990,R1,Sat 31-Mar-1990 2:10 PM,Windy Hill,Essendon,True,7,4,6,7,...,3,2,4,60,0,False,False,True,defeated,1990-03-31 14:10:00
4,1990,R1,Sat 31-Mar-1990 7:40 PM,Carrara,Brisbane Bears,True,4,3,3,2,...,3,3,2,74,0,False,False,True,defeated,1990-03-31 19:40:00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6165,2022,R18,Sat 16-Jul-2022 5:30 PM,Perth Stadium,Fremantle,True,3,3,3,2,...,4,5,4,82,4,True,False,False,lost-to,2022-07-16 17:30:00
6166,2022,R18,Sat 16-Jul-2022 7:25 PM,M.C.G.,Carlton,True,4,1,1,0,...,2,2,5,85,4,True,False,False,lost-to,2022-07-16 19:25:00
6167,2022,R18,Sun 17-Jul-2022 1:10 PM,M.C.G.,Hawthorn,True,2,3,7,0,...,2,3,0,77,0,False,False,True,defeated,2022-07-17 13:10:00
6168,2022,R18,Sun 17-Jul-2022 2:50 PM,Traeger Park,Melbourne,True,0,4,5,3,...,3,3,3,69,0,False,False,True,defeated,2022-07-17 14:50:00


In [7]:
all_teams = sorted(set(df_matches.for_team) | set(df_matches.against_team))
print(all_teams)

['Adelaide', 'Brisbane Bears', 'Brisbane Lions', 'Carlton', 'Collingwood', 'Essendon', 'Fitzroy', 'Footscray', 'Fremantle', 'Geelong', 'Gold Coast', 'Greater Western Sydney', 'Hawthorn', 'Kangaroos', 'Melbourne', 'North Melbourne', 'Port Adelaide', 'Richmond', 'St Kilda', 'Sydney', 'West Coast', 'Western Bulldogs']


## End-of-season statistics

As noted in the [introduction](1_introduction.ipynb#Backoff-and-smoothing "Introduction: Backoff and smoothing"), for reasons of data scarcity we 
typically need to modify our marginal models, using backoff and smoothing of the counts of various events. For example, when predicting the outcomes of matches in a given season, it might be useful to use the overall team results from previous seasons as prior information.

In principle, we will allow for the aggregation of statistics over any subset of matches, even matches over multiple seasons. However, if we select matches across different seasons then we must allow for the possibility of seasonal variations in the number of teams and the team names. In essence, we will treat multi-season data as if all teams played against each other in the same 'composite' season.

For convenience, we shall here consider only matches within each season. However, we shall aggregate over all matches, including both the minor rounds and the finals rounds (with team eliminations).

### Initialise data structures

In order to summarise seasonal data, we need to know the season and each team in the league for that season.
Since the number of teams varies over time, we should also keep track of the number of teams.
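
As a rough illustration of this bookkeeping, the following sketch builds a per-season table of teams together with the team count. The helper name `init_team_table` is hypothetical and is not the `match_tools.init_team_features` implementation, whose internals are not shown here; only the column names `season`, `for_team` and `against_team` are taken from the match data, and the toy rows are not real results.

```python
import pandas as pd


def init_team_table(df_season_matches):
    """
    Hypothetical sketch: lists each team seen in one season of matches,
    together with the number of teams in that season.
    """
    seasons = df_season_matches.season.unique()
    assert len(seasons) == 1, "expected matches from a single season"
    teams = sorted(
        set(df_season_matches.for_team) | set(df_season_matches.against_team)
    )
    # Scalar values are broadcast against the list of team names
    return pd.DataFrame({
        "season": seasons[0],
        "team": teams,
        "teams": len(teams),
    })


# Toy data (not real results): three matches among three teams
df_toy_season = pd.DataFrame({
    "season": [1990, 1990, 1990],
    "for_team": ["Carlton", "Geelong", "Carlton"],
    "against_team": ["Geelong", "Essendon", "Essendon"],
})
df_toy_teams = init_team_table(df_toy_season)
print(df_toy_teams)
```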

In [8]:
def get_season_matches(df_matches, season):
    """
    Obtains a single season of matches.
    """
    return df_matches[df_matches.season == season]

In [9]:
all_seasons = sorted(set(df_matches.season))

In [10]:
d_features = dict()
for season in all_seasons:
    df_season_matches = get_season_matches(df_matches, season)
    d_features[season] = match_tools.init_team_features(df_season_matches)

In [11]:
d_features[all_seasons[0]]

Unnamed: 0,season,team,teams
0,1990,Brisbane Bears,14
1,1990,Carlton,14
2,1990,Collingwood,14
3,1990,Essendon,14
4,1990,Fitzroy,14
5,1990,Footscray,14
6,1990,Geelong,14
7,1990,Hawthorn,14
8,1990,Melbourne,14
9,1990,North Melbourne,14


### Wins, draws and losses

The relative strength of a team's offense can be indirectly measured by the number of wins, and the relative weakness of its defense can be measured by the number of losses. If we count a draw as a half-win and a half-loss, then we can also compute the proportion of adjusted wins, which estimates the probability of the team
winning against an arbitrary opponent.
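
The adjusted proportion described above can be sketched as follows. The function name `adjusted_wins_ratio` is hypothetical (the actual computation is inside `match_tools.add_wins_features`); the worked figures are Collingwood's 1990 record as tabulated in this section.

```python
def adjusted_wins_ratio(wins, draws, losses):
    """
    Proportion of adjusted wins, counting each draw as half a win
    and half a loss.
    """
    games = wins + draws + losses
    if games == 0:
        raise ValueError("no games played")
    return (wins + 0.5 * draws) / games


# Collingwood in 1990: 19 wins, 1 draw and 6 losses from 26 games
collingwood_ratio = adjusted_wins_ratio(19, 1, 6)
print(collingwood_ratio)  # 0.75
```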

In [12]:
for season in all_seasons:
    df_season_matches = get_season_matches(df_matches, season)
    match_tools.add_wins_features(d_features[season], df_season_matches)

In [13]:
d_features[all_seasons[0]]

Unnamed: 0,season,team,teams,games,wins,draws,losses,wins_ratio
0,1990,Brisbane Bears,14,22,4,0,18,0.181818
1,1990,Carlton,14,22,11,0,11,0.5
2,1990,Collingwood,14,26,19,1,6,0.75
3,1990,Essendon,14,25,18,0,7,0.72
4,1990,Fitzroy,14,22,7,0,15,0.318182
5,1990,Footscray,14,22,12,0,10,0.545455
6,1990,Geelong,14,22,8,0,14,0.363636
7,1990,Hawthorn,14,23,14,0,9,0.608696
8,1990,Melbourne,14,24,17,0,7,0.708333
9,1990,North Melbourne,14,22,12,0,10,0.545455


### Points for and against 

The relative strength of a team's offense can also be indirectly measured by the number of points scored by the team, and the relative weakness of its defense can be measured by the number of points scored against it.
Hence, the proportion of all points in a team's matches (points for plus points against) that were scored by the team provides a subjective estimate of the probability of the team winning against an arbitrary opponent.
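
As a small worked example, the ratio can be sketched as points for divided by the sum of points for and against. The function name `points_ratio` is hypothetical; the formula is an assumption that agrees with the `points_ratio` values tabulated in this section, for instance the Brisbane Bears' 1990 totals of 1733 points for and 2426 against.

```python
def points_ratio(points_for, points_against):
    """
    Share of all points in a team's matches that the team itself scored.
    """
    total = points_for + points_against
    if total == 0:
        raise ValueError("no points scored by either side")
    return points_for / total


# Brisbane Bears in 1990: 1733 points for, 2426 points against
bears_ratio = points_ratio(1733, 2426)
print(round(bears_ratio, 6))
```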

In [14]:
for season in all_seasons:
    df_season_matches = get_season_matches(df_matches, season)
    match_tools.add_points_features(d_features[season], df_season_matches)

In [15]:
d_features[all_seasons[0]]

Unnamed: 0,season,team,teams,games,wins,draws,losses,wins_ratio,points_for,points_against,points_ratio
0,1990,Brisbane Bears,14,22,4,0,18,0.181818,1733,2426,0.416687
1,1990,Carlton,14,22,11,0,11,0.5,2277,2187,0.510081
2,1990,Collingwood,14,26,19,1,6,0.75,2798,2077,0.573949
3,1990,Essendon,14,25,18,0,7,0.72,2742,2079,0.568762
4,1990,Fitzroy,14,22,7,0,15,0.318182,1874,2389,0.439597
5,1990,Footscray,14,22,12,0,10,0.545455,2016,2031,0.498147
6,1990,Geelong,14,22,8,0,14,0.363636,2248,2398,0.483857
7,1990,Hawthorn,14,23,14,0,9,0.608696,2478,2075,0.544257
8,1990,Melbourne,14,24,17,0,7,0.708333,2512,2260,0.526404
9,1990,North Melbourne,14,22,12,0,10,0.545455,2519,2210,0.532671


### Scores for and against

For the purposes of modelling, it might also be useful to know the breakdown of the points scored for and against each team. Recall that a goal is worth 6 points and a behind is worth 1 point.
In addition, note that each goal and each behind represents a scoring shot, and thus goal accuracy might be measured as the proportion of scoring shots being goals. Since there are two behind areas each of about equal size to the goal area, one might expect random scoring to produce an accuracy of around 33%.
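
The scoring arithmetic above can be sketched as below. The function names are hypothetical, and the `scoring_share` formula for `goals_ratio` and `behinds_ratio` is an assumption that happens to agree with the values tabulated in this section; the figures used are the Brisbane Bears' 1990 totals.

```python
def total_points(goals, behinds):
    """A goal is worth 6 points and a behind is worth 1 point."""
    return 6 * goals + behinds


def goal_accuracy(goals, behinds):
    """Proportion of scoring shots (goals plus behinds) that were goals."""
    shots = goals + behinds
    if shots == 0:
        raise ValueError("no scoring shots")
    return goals / shots


def scoring_share(count_for, count_against):
    """Assumed form of goals_ratio and behinds_ratio: for / (for + against)."""
    return count_for / (count_for + count_against)


# Brisbane Bears in 1990: 247 goals and 251 behinds for,
# 348 goals and 338 behinds against
bears_points = total_points(247, 251)
bears_accuracy = goal_accuracy(247, 251)
bears_goals_ratio = scoring_share(247, 348)
bears_behinds_ratio = scoring_share(251, 338)
print(bears_points, round(bears_accuracy, 6))
```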

In [16]:
prev_columns = list(next(iter(d_features.values())).columns)
for season in all_seasons:
    df_season_matches = get_season_matches(df_matches, season)
    match_tools.add_scores_features(d_features[season], df_season_matches)
    # Quick sanity check
    d = d_features[season]
    assert all(d.points_for == 6 * d.goals_for + d.behinds_for)
    assert all(d.points_against == 6 * d.goals_against + d.behinds_against)
cur_columns = list(next(iter(d_features.values())).columns)

In [17]:
num_extra_columns = len(cur_columns) - len(prev_columns)
columns = ['season', 'team'] + cur_columns[-num_extra_columns:]
d_features[all_seasons[0]][columns]

Unnamed: 0,season,team,goals_for,behinds_for,accuracy_for,goals_against,behinds_against,accuracy_against,goals_ratio,behinds_ratio
0,1990,Brisbane Bears,247,251,0.495984,348,338,0.507289,0.415126,0.426146
1,1990,Carlton,335,267,0.556478,312,315,0.497608,0.517774,0.458763
2,1990,Collingwood,404,374,0.51928,294,313,0.484349,0.578797,0.544396
3,1990,Essendon,396,366,0.519685,297,297,0.5,0.571429,0.552036
4,1990,Fitzroy,264,290,0.476534,345,319,0.519578,0.433498,0.47619
5,1990,Footscray,287,294,0.493976,290,291,0.499139,0.4974,0.502564
6,1990,Geelong,320,328,0.493827,349,304,0.534456,0.478326,0.518987
7,1990,Hawthorn,362,306,0.541916,299,281,0.515517,0.547655,0.521295
8,1990,Melbourne,369,298,0.553223,326,304,0.51746,0.530935,0.495017
9,1990,North Melbourne,365,329,0.525937,316,314,0.501587,0.535977,0.511664


In [18]:
columns = ['season', 'team', 'wins', 'draws', 'losses', 'wins_ratio',
            'points_ratio', 'goals_ratio', 'behinds_ratio']
d_features[all_seasons[0]][columns]

Unnamed: 0,season,team,wins,draws,losses,wins_ratio,points_ratio,goals_ratio,behinds_ratio
0,1990,Brisbane Bears,4,0,18,0.181818,0.416687,0.415126,0.426146
1,1990,Carlton,11,0,11,0.5,0.510081,0.517774,0.458763
2,1990,Collingwood,19,1,6,0.75,0.573949,0.578797,0.544396
3,1990,Essendon,18,0,7,0.72,0.568762,0.571429,0.552036
4,1990,Fitzroy,7,0,15,0.318182,0.439597,0.433498,0.47619
5,1990,Footscray,12,0,10,0.545455,0.498147,0.4974,0.502564
6,1990,Geelong,8,0,14,0.363636,0.483857,0.478326,0.518987
7,1990,Hawthorn,14,0,9,0.608696,0.544257,0.547655,0.521295
8,1990,Melbourne,17,0,7,0.708333,0.526404,0.530935,0.495017
9,1990,North Melbourne,12,0,10,0.545455,0.532671,0.535977,0.511664


### League ranking

The league rankings give a measure of the relative strength of each team, with a higher rank (i.e. smaller rank index) indicating a stronger team, and a lower rank (i.e. larger rank index) indicating a weaker team.

In practice, the league rankings are computed round by round only during the minor rounds. At the end of the minor rounds, the top-half ranked teams enter into the finals rounds, with team match-ups dictated by ranking. 
However, since we have chosen to consider all matches, from both the minor rounds and the finals rounds, we shall instead compute an overall ranking at the end of each season.

Also note that, since the number of teams varies per season, so too does the maximum rank. Thus, we should compute an adjusted rank that is comparable across different seasons. For convenience, we map the top-ranking team to a score of 1.0 and the bottom-ranking team to a score of 0.0.
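
One simple way to realise such a mapping is linear interpolation between the top and bottom ranks. The sketch below is an assumption rather than the `match_tools.add_rank_features` implementation, and the name `rank_score` is used here only for illustration; it does, however, agree with the `rank_score` values tabulated in this section (e.g. rank 8 of 14 teams).

```python
def rank_score(rank, num_teams):
    """
    Maps rank 1 to a score of 1.0 and rank num_teams to a score of 0.0,
    interpolating linearly in between (assumed form).
    """
    if num_teams < 2:
        raise ValueError("need at least two teams")
    if not 1 <= rank <= num_teams:
        raise ValueError("rank out of range")
    return (num_teams - rank) / (num_teams - 1)


# 1990 season with 14 teams
top_score = rank_score(1, 14)
bottom_score = rank_score(14, 14)
mid_score = rank_score(8, 14)
print(top_score, bottom_score, round(mid_score, 6))
```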

In [19]:
prev_columns = list(next(iter(d_features.values())).columns)
for season in all_seasons:
    match_tools.add_rank_features(d_features[season])
cur_columns = list(next(iter(d_features.values())).columns)

In [20]:
num_extra_columns = len(cur_columns) - len(prev_columns)
columns = ['season', 'team', 'teams', 'wins_ratio'] + cur_columns[-num_extra_columns:]
d_features[all_seasons[0]][columns]

Unnamed: 0,season,team,teams,wins_ratio,rank,rank_score
0,1990,Brisbane Bears,14,0.181818,14,0.0
1,1990,Carlton,14,0.5,8,0.461538
2,1990,Collingwood,14,0.75,1,1.0
3,1990,Essendon,14,0.72,2,0.923077
4,1990,Fitzroy,14,0.318182,12,0.153846
5,1990,Footscray,14,0.545455,7,0.538462
6,1990,Geelong,14,0.363636,10,0.307692
7,1990,Hawthorn,14,0.608696,5,0.692308
8,1990,Melbourne,14,0.708333,4,0.769231
9,1990,North Melbourne,14,0.545455,6,0.615385


### Prestige scores

We now take each match in a given season as an edge in a graph with the teams as vertices. From this graph we may compute various analytics, including vertex scores.
Here we compute the flow prestige scores 
(see [Appendix B](B_graph_analytics.ipynb#Steady-state-flow-scores "Appendix B: Steady-state flow scores")).

Note, however, that the prestige scores depend upon the number of teams, and hence are not directly comparable
across seasons. We therefore also compute renormalised prestige scores 
(see 
[Appendix B](B_graph_analytics.ipynb#Probabilitistic-modelling "Appendix B: Probabilitistic modelling"))
to estimate the probability of each team winning against an arbitrary opponent.

In [21]:
prev_columns = list(next(iter(d_features.values())).columns)
for season in all_seasons:
    df_season_matches = get_season_matches(df_matches, season)
    match_tools.add_prestige_features(d_features[season], df_season_matches)
cur_columns = list(next(iter(d_features.values())).columns)

In [22]:
num_extra_columns = len(cur_columns) - len(prev_columns)
columns = ['season', 'team', 'rank', 'wins_ratio'] + cur_columns[-num_extra_columns:]
d_features[all_seasons[0]][columns]

Unnamed: 0,season,team,rank,wins_ratio,wins_prestige,adj_wins_prestige,points_prestige,adj_points_prestige
0,1990,Brisbane Bears,14,0.181818,0.003774,0.046936,0.04903,0.401289
1,1990,Carlton,8,0.5,0.055508,0.433112,0.073333,0.507091
2,1990,Collingwood,1,0.75,0.232418,0.797419,0.096696,0.581871
3,1990,Essendon,2,0.72,0.172837,0.73092,0.09617,0.580403
4,1990,Fitzroy,12,0.318182,0.019412,0.204675,0.057674,0.443097
5,1990,Footscray,7,0.545455,0.053453,0.423341,0.069013,0.490749
6,1990,Geelong,10,0.363636,0.035746,0.325203,0.066317,0.480076
7,1990,Hawthorn,5,0.608696,0.078157,0.524303,0.081022,0.53405
8,1990,Melbourne,4,0.708333,0.119063,0.63729,0.075421,0.514669
9,1990,North Melbourne,6,0.545455,0.056194,0.436305,0.080525,0.532384


We observe that the prestige scores mostly align (inversely) with the rank index. 
However, some teams have higher prestige than their rank would suggest. This is because the prestige score
not only accounts for close wins or losses, but also awards more prestige to a weak team that is competitive against a stronger team, and conversely penalises a strong team for poor performance.
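
The renormalisation itself is defined in Appendix B and is not reproduced here. As a sketch only, the adjusted values tabulated above are numerically consistent with comparing a team's prestige score against the average score of the other teams, assuming the scores sum to one across the league. The function name `adjusted_prestige` is hypothetical, and the formula is inferred from the table rather than taken from the appendix.

```python
def adjusted_prestige(prestige, num_teams):
    """
    Assumed renormalisation: p / (p + q), where q = (1 - p) / (n - 1)
    is the average score of the other n - 1 teams.
    """
    if num_teams < 2:
        raise ValueError("need at least two teams")
    others_mean = (1.0 - prestige) / (num_teams - 1)
    return prestige / (prestige + others_mean)


# Collingwood in 1990: wins_prestige of 0.232418 in a 14-team league
collingwood_adj = adjusted_prestige(0.232418, 14)
# A team holding exactly an equal share of prestige maps to 0.5
average_adj = adjusted_prestige(1.0 / 14, 14)
print(round(collingwood_adj, 4), round(average_adj, 4))
```

Under this reading, a score of 0.5 corresponds to an average team regardless of the number of teams, which is what makes the adjusted values comparable across seasons.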

### Save the seasonal features

In [23]:
def combine_season_features(d_features):
    """
    Concatenates all per-season features into a combined DataFrame.
    """
    return pd.concat(
        [d_features[season] for season in sorted(d_features.keys())],
        ignore_index=True
    )

In [24]:
df_seasonal = combine_season_features(d_features)
df_seasonal.to_csv(os.path.join("..", "data", "end_season_features.csv"), index=False)

## Within-season statistics

In the [previous](#End-of-season-statistics "Section: End-of-season statistics")
section, we computed aggregated features for each team from all matches within a season.
In contrast, in this section we shall compute aggregated features from all matches in the same season
that are *prior* to each match of interest.

### Compute match features

For every match, we attempt to compute the features from within-season historical data prior to that match.
If there are no prior matches, then no match features are computed.

In [25]:
def get_prior_matches(df_season_matches, timestamp):
    """
    Obtains all matches in the current season prior to
    the given date-time.
    """
    return df_season_matches[df_season_matches.timestamp < timestamp]
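
The strict inequality matters here: matches that start at the same time as the match of interest are not counted as prior matches. The toy example below (made-up teams and fixtures, with the function repeated so the snippet is self-contained) illustrates this behaviour.

```python
from datetime import datetime

import pandas as pd


def get_prior_matches(df_season_matches, timestamp):
    """Same filter as above, repeated so this example runs on its own."""
    return df_season_matches[df_season_matches.timestamp < timestamp]


# Toy fixtures (not real data): two simultaneous matches, then a later one
df_toy_matches = pd.DataFrame({
    "for_team": ["A", "C", "A"],
    "against_team": ["B", "D", "C"],
    "timestamp": [
        datetime(1990, 3, 31, 14, 10),
        datetime(1990, 3, 31, 14, 10),
        datetime(1990, 4, 7, 14, 10),
    ],
})

# The opening matches have no prior matches, not even each other
prior_to_first = get_prior_matches(df_toy_matches, df_toy_matches.timestamp[0])
# The later match sees both opening matches
prior_to_last = get_prior_matches(df_toy_matches, df_toy_matches.timestamp[2])
print(len(prior_to_first), len(prior_to_last))
```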

In [26]:
rows = []
columns = None
for season in all_seasons:
    df_season_matches = get_season_matches(df_matches, season)
    for match in df_season_matches.itertuples():
        df_prior_matches = get_prior_matches(df_season_matches, match.timestamp)
        if len(df_prior_matches) == 0:
            # First match in season - no features
            continue
        match_id = (match.datetime, match.round, match.venue)
        df_features = match_tools.compute_features(df_prior_matches)
        if columns is None:
            columns = ['datetime', 'round', 'venue'] + list(df_features.columns)
        for team in [match.for_team, match.against_team]:
            df_team = df_features[df_features.team == team]
            for features in df_team.itertuples(index=False):
                # Prefix each team's features with the match identifier
                rows.append(match_id + tuple(features))
# Build the DataFrame once, rather than growing it row by row
df_intra_features = pd.DataFrame(rows, columns=columns)

In [27]:
df_intra_features

Unnamed: 0,datetime,round,venue,season,team,teams,games,wins,draws,losses,...,behinds_against,accuracy_against,goals_ratio,behinds_ratio,rank,rank_score,wins_prestige,adj_wins_prestige,points_prestige,adj_points_prestige
0,Fri 06-Apr-1990 7:40 PM,R2,M.C.G.,1990,North Melbourne,14,1,0,0,1,...,10,0.6875,0.352941,0.62963,9,0.384615,0.0,0.0,0.0,0.0
1,Fri 06-Apr-1990 7:40 PM,R2,M.C.G.,1990,Richmond,14,1,0,0,1,...,19,0.5,0.344828,0.424242,11,0.230769,0.0,0.0,0.357488,0.878539
2,Sat 07-Apr-1990 2:10 PM,R2,Moorabbin Oval,1990,St Kilda,14,1,1,0,0,...,15,0.423077,0.666667,0.444444,5,0.692308,0.0,0.0,0.0,0.0
3,Sat 07-Apr-1990 2:10 PM,R2,Moorabbin Oval,1990,West Coast,14,1,1,0,0,...,6,0.571429,0.636364,0.727273,3,0.846154,0.0,0.0,0.0,0.0
4,Sat 07-Apr-1990 2:10 PM,R2,M.C.G.,1990,Geelong,14,1,0,0,1,...,24,0.538462,0.282051,0.314286,14,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11790,Sun 17-Jul-2022 1:10 PM,R18,M.C.G.,2022,West Coast,18,16,2,0,14,...,190,0.583333,0.355932,0.396825,17,0.058824,0.008529,0.127583,0.031821,0.358452
11791,Sun 17-Jul-2022 2:50 PM,R18,Traeger Park,2022,Melbourne,18,16,12,0,4,...,156,0.485149,0.573913,0.558074,2,0.941176,0.097317,0.646986,0.070612,0.563625
11792,Sun 17-Jul-2022 2:50 PM,R18,Traeger Park,2022,Port Adelaide,18,16,8,0,8,...,175,0.477612,0.52381,0.502841,12,0.352941,0.036242,0.38998,0.057722,0.510135
11793,Sun 17-Jul-2022 4:40 PM,R18,Docklands,2022,Essendon,18,16,5,0,11,...,183,0.545906,0.445844,0.490251,16,0.117647,0.018756,0.245248,0.046731,0.454557


### Save the data

In [28]:
df_intra_features.to_csv(os.path.join("..", "data", "within_season_features.csv"), index=False)