# Data Extraction

This notebook parses team data-files containing results of past Australian Football League (AFL) matches, and creates a single data file representing a combined, temporal graph of all matches across all (selected) seasons.

See the [introduction](1_introduction.ipynb#Background "Introduction: Background") for further background information about the AFL and Australian Rules football.

## Parse the Matches

The data-files were created by saving, separately for each team, the web page on [AFL Tables](https://afltables.com/afl/afl_index.html "afltables.com") that lists all of that team's matches over all seasons. The file extension does not matter, e.g. `.htm` or `.html`.

However, the filenames **must** match the team names, since the data-file for each team records only the opposing team names. In the case of teams that have changed name over time (excluding those that have merged with other teams), we name the data-file after the modern team name, and manually remap this to the older team name(s) when appropriate.

Also note that these data-files contain matches for both the AFL and its predecessor, the VFL (Victorian Football League). We extract only the AFL data from 1990 onwards.

In [1]:
import sys
import os

sys.path.append(os.path.join("..", "python"))

In [2]:
import pandas as pd

import match_parser

In [3]:
team_files = match_parser.get_team_files(os.path.join("..", "matches"))

In [4]:
matches = {}
for team_file in team_files:
    print("Parsing:", team_file)
    team_name = match_parser.parse_team_name(team_file)
    team_matches = match_parser.parse_team_seasons(team_file, min_season=1990)
    matches[team_name] = team_matches
print(f"Parsed {len(matches)} teams.")

Parsing: ..\matches\Adelaide.html
Parsing: ..\matches\Brisbane Bears.htm
Parsing: ..\matches\Brisbane Lions.htm
Parsing: ..\matches\Carlton.htm
Parsing: ..\matches\Collingwood.htm
Parsing: ..\matches\Essendon.htm
Parsing: ..\matches\Fitzroy.htm
Parsing: ..\matches\Fremantle.htm
Parsing: ..\matches\Geelong.htm
Parsing: ..\matches\Gold Coast.htm
Parsing: ..\matches\Greater Western Sydney.htm
Parsing: ..\matches\Hawthorn.htm
Parsing: ..\matches\Melbourne.htm
Parsing: ..\matches\North Melbourne.htm
Parsing: ..\matches\Port Adelaide.html
Parsing: ..\matches\Richmond.html
Parsing: ..\matches\St Kilda.html
Parsing: ..\matches\Sydney.html
Parsing: ..\matches\West Coast.html
Parsing: ..\matches\Western Bulldogs.html
Parsed 20 teams.


## Construct the Graph

For convenience, we may consider each team as a vertex in a graph, and each match as an
edge between vertices. In order that each match is represented exactly once, we arbitrarily designate one of the teams to be the *'for'* team, and the opposing team to be
the *'against'* team. Hence, each edge is directed from the 'for' team to the 'against' team, and the match outcome (i.e. win, draw or loss) is specified with respect to the
'for' team.

### Team naming

We have to deal with the issue of teams changing names over time, as listed in the
[introduction](1_introduction.ipynb#AFL-Teams "Introduction: AFL Teams").
The major problem here is due to the way the data are recorded. In particular, we nominally have a data-file of matches for each team, but that team's name does not explicitly appear in the data-file, only the names of the opposing teams. In addition, some teams have no data-file of their own, in particular the Kangaroos (whose matches appear implicitly in the North Melbourne data-file) and
Footscray (whose matches appear implicitly in the Western Bulldogs data-file).
Thus, since neither of these teams has an explicit data-file, we have to infer the correct team name from the available filename and the season.
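
The season-dependent choice of team name described above can be sketched as a simple lookup. This is only an illustration, not the logic actually used by `match_parser`: the table `RENAMES` and the function `historical_name` are hypothetical, and the season boundaries are the ones asserted in the sanity checks of this notebook.

```python
# Hypothetical sketch: map a modern (file) team name to the name used in a season.
# Each entry: modern name -> list of (first_season, last_season, historical_name).
RENAMES = {
    "North Melbourne": [(1999, 2007, "Kangaroos")],
    # Lower bound 0 simply means "any season up to 1996" in this sketch.
    "Western Bulldogs": [(0, 1996, "Footscray")],
}


def historical_name(modern_name, season):
    """Return the name the team played under in the given season."""
    for first, last, old_name in RENAMES.get(modern_name, []):
        if first <= season <= last:
            return old_name
    return modern_name


print(historical_name("North Melbourne", 2003))   # Kangaroos
print(historical_name("North Melbourne", 2010))   # North Melbourne
print(historical_name("Western Bulldogs", 1995))  # Footscray
```

Teams without an entry in the table, such as Carlton, are returned unchanged.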

In [5]:
all_teams = set(matches.keys())
for team_matches in matches.values():
    for df_matches in team_matches.values():
        all_teams |= set(df_matches.Opponent)
all_teams = sorted(list(all_teams))
print(f"Found {len(all_teams)} teams")
print(all_teams)

Found 22 teams
['Adelaide', 'Brisbane Bears', 'Brisbane Lions', 'Carlton', 'Collingwood', 'Essendon', 'Fitzroy', 'Footscray', 'Fremantle', 'Geelong', 'Gold Coast', 'Greater Western Sydney', 'Hawthorn', 'Kangaroos', 'Melbourne', 'North Melbourne', 'Port Adelaide', 'Richmond', 'St Kilda', 'Sydney', 'West Coast', 'Western Bulldogs']


As expected, observe that North Melbourne and Kangaroos both appear in the raw data, despite being the same team.
Similarly, Footscray is the older name of the Western Bulldogs. Thus, we could map the older names to the newer names, or the newer names to the older names (adjusted for the right season).

Note, however, that Fitzroy merged with the Brisbane Bears to form the Brisbane Lions, who first played in 1997.
If we applied the new team name to seasons before 1997, then matches between Fitzroy and the Brisbane Bears would appear as the Brisbane Lions playing against the Brisbane Lions.
Therefore, for now we keep the historically accurate names. In subsequent analyses we could choose to neglect seasons before 1997, and could then transform to the new team names.

### Team ordering

Since there is one match data-file per team, each match appears
twice, i.e. once in the data-file of each of the two opposing teams.
In order to prevent edge duplication in our match graph, we first stipulate a canonical ordering of the teams, i.e. $T_1\prec T_2\prec T_3\cdots$. Then, for each match
with some team A versus some team B, if $A\prec B$ then we designate A as the 'for' team and B as the 'against' team. Conversely, if $A\succ B$, then we designate
A as the 'against' team and B as the 'for' team.

Note that although the canonical team ordering is arbitrary, the resulting 'for' and 'against' designations are deterministic and consistent.
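
As a minimal sketch of this rule, the example below assumes that the canonical ordering is plain alphabetical order of the team names; the ordering actually used by `match_parser` is not shown in this notebook, and the helper `designate` is hypothetical.

```python
def designate(team_a, team_b):
    """Return (for_team, against_team) under an alphabetical canonical ordering."""
    if team_a == team_b:
        raise ValueError("a team cannot play itself")
    return (team_a, team_b) if team_a < team_b else (team_b, team_a)


# The same match, as seen from each team's data-file, maps to one directed edge.
edge_from_carlton_file = designate("Carlton", "Richmond")
edge_from_richmond_file = designate("Richmond", "Carlton")
assert edge_from_carlton_file == edge_from_richmond_file == ("Carlton", "Richmond")

# Collecting edges in a set therefore removes the duplicate. Real matches would
# also need the season, round or date in the key, since teams meet repeatedly.
edges = {edge_from_carlton_file, edge_from_richmond_file}
print(edges)  # {('Carlton', 'Richmond')}
```

When a match is read from the data-file of the team that ends up as the 'against' team, the recorded outcome has to be inverted so that it is expressed with respect to the 'for' team.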

### Edge and node attributes

For each match, we know environmental information such as the season, the match round within the season, the venue (i.e. the oval), and the date/time of each match.

For each team in the match, we also know if the venue is the team's home ground, and we know the numbers of goals and behinds scored in each quarter. We label the four quarters of a match with an integer suffix from 1 to 4, e.g. `for_goals1` and `for_behinds1` for the first quarter.

Note that, for the time being, there is no vertex information defined, other than the team name.

### Process the matches

In [6]:
df_edges = match_parser.extract_match_data(matches, use_old_names=True)

In [7]:
print(df_edges.columns)

Index(['season', 'round', 'datetime', 'venue', 'for_team', 'for_is_home',
       'for_goals1', 'for_behinds1', 'for_goals2', 'for_behinds2',
       'for_goals3', 'for_behinds3', 'for_goals4', 'for_behinds4',
       'for_total_score', 'for_match_points', 'for_is_win', 'for_is_draw',
       'for_is_loss', 'against_team', 'against_is_home', 'against_goals1',
       'against_behinds1', 'against_goals2', 'against_behinds2',
       'against_goals3', 'against_behinds3', 'against_goals4',
       'against_behinds4', 'against_total_score', 'against_match_points',
       'against_is_win', 'against_is_draw', 'against_is_loss', 'edge_type'],
      dtype='object')


In [8]:
all_teams = set(df_edges.for_team) | set(df_edges.against_team)
all_teams = sorted(list(all_teams))
print(f"Found {len(all_teams)} teams")
print(all_teams)

Found 22 teams
['Adelaide', 'Brisbane Bears', 'Brisbane Lions', 'Carlton', 'Collingwood', 'Essendon', 'Fitzroy', 'Footscray', 'Fremantle', 'Geelong', 'Gold Coast', 'Greater Western Sydney', 'Hawthorn', 'Kangaroos', 'Melbourne', 'North Melbourne', 'Port Adelaide', 'Richmond', 'St Kilda', 'Sydney', 'West Coast', 'Western Bulldogs']


Observe that the old team names have been retained. We have 18 modern teams, plus two renamings, plus two older teams that merged.

In [9]:
df_edges

Unnamed: 0,season,round,datetime,venue,for_team,for_is_home,for_goals1,for_behinds1,for_goals2,for_behinds2,...,against_goals3,against_behinds3,against_goals4,against_behinds4,against_total_score,against_match_points,against_is_win,against_is_draw,against_is_loss,edge_type
0,2022,R1,Sun 20-Mar-2022 3:40 PM,Adelaide Oval,Adelaide,True,2,0,3,2,...,0,4,3,4,83,4,True,False,False,lost-to
1,2022,R2,Sat 26-Mar-2022 1:45 PM,M.C.G.,Adelaide,False,1,5,2,1,...,7,2,1,3,100,4,True,False,False,lost-to
2,2022,R3,Fri 01-Apr-2022 7:50 PM,Adelaide Oval,Adelaide,True,2,1,5,2,...,2,5,2,2,92,0,False,False,True,defeated
3,2022,R4,Sun 10-Apr-2022 1:10 PM,Docklands,Adelaide,False,4,2,4,2,...,3,5,3,2,103,4,True,False,False,lost-to
4,2022,R5,Sat 16-Apr-2022 4:05 PM,Adelaide Oval,Adelaide,True,4,1,5,3,...,4,7,2,0,82,0,False,False,True,defeated
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6165,1990,R13,Sat 30-Jun-1990 2:10 PM,Waverley Park,Footscray,False,1,0,4,4,...,2,6,3,5,73,0,False,False,True,defeated
6166,1990,R14,Sat 07-Jul-1990 2:10 PM,Moorabbin Oval,Footscray,False,3,5,3,0,...,2,4,3,0,78,0,False,False,True,defeated
6167,1990,R15,Sat 14-Jul-1990 2:10 PM,Western Oval,Footscray,True,2,4,5,2,...,4,6,2,4,96,0,False,False,True,defeated
6168,1990,R16,Fri 20-Jul-1990 7:40 PM,W.A.C.A.,Footscray,False,0,4,3,3,...,4,3,5,4,109,4,True,False,False,lost-to


### Order the matches

Note that the matches have been extracted in an arbitrary order.
For convenience, we reorder the matches from the earliest to the latest.
Note that the date-times appear to be recorded as local times. It is not yet clear whether these have been defined
centrally (e.g. Melbourne/Sydney time), or vary by state.
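
For reference, the date-time strings displayed in this notebook, e.g. `Sat 31-Mar-1990 2:10 PM`, can be parsed with a `strptime` pattern like the one below. The pattern is an assumption inferred from the displayed values; the actual value of `match_parser.DATETIME_FORMAT` is not shown here. The sketch also assumes an English locale for the day and month abbreviations.

```python
from datetime import datetime

# Assumed pattern, inferred from the displayed strings (not taken from match_parser).
ASSUMED_FORMAT = "%a %d-%b-%Y %I:%M %p"

samples = [
    "Sun 17-Jul-2022 1:10 PM",
    "Sat 31-Mar-1990 7:40 PM",
    "Sat 31-Mar-1990 2:10 PM",
]

# Sorting the parsed values orders the matches chronologically, which
# sorting the raw strings would not do.
parsed = sorted(datetime.strptime(s, ASSUMED_FORMAT) for s in samples)
for dt in parsed:
    print(dt.isoformat())
```

The parsed values are naive, i.e. they carry no time-zone information, so matches played at the same wall-clock time in different states compare as simultaneous.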

In [10]:
from datetime import datetime

In [11]:
date_fn = lambda s: datetime.strptime(s, match_parser.DATETIME_FORMAT)

df_edges['_datetime'] = df_edges['datetime'].apply(date_fn)
df_edges.sort_values('_datetime', ascending=True, inplace=True)
df_edges.drop(columns='_datetime', inplace=True)

In [12]:
df_edges

Unnamed: 0,season,round,datetime,venue,for_team,for_is_home,for_goals1,for_behinds1,for_goals2,for_behinds2,...,against_goals3,against_behinds3,against_goals4,against_behinds4,against_total_score,against_match_points,against_is_win,against_is_draw,against_is_loss,edge_type
5164,1990,R1,Sat 31-Mar-1990 2:10 PM,M.C.G.,Melbourne,False,6,2,4,1,...,4,4,3,4,89,0,False,False,True,defeated
4195,1990,R1,Sat 31-Mar-1990 2:10 PM,Waverley Park,Geelong,True,5,3,2,3,...,9,7,10,6,192,4,True,False,False,lost-to
2055,1990,R1,Sat 31-Mar-1990 2:10 PM,Princes Park,Carlton,True,6,5,4,4,...,6,3,6,5,104,4,True,False,False,lost-to
3191,1990,R1,Sat 31-Mar-1990 2:10 PM,Windy Hill,Essendon,True,7,4,6,7,...,1,3,2,4,60,0,False,False,True,defeated
852,1990,R1,Sat 31-Mar-1990 7:40 PM,Carrara,Brisbane Bears,True,4,3,3,2,...,1,3,3,2,74,0,False,False,True,defeated
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3319,2022,R18,Sat 16-Jul-2022 5:30 PM,Perth Stadium,Fremantle,True,3,3,3,2,...,3,4,5,4,82,4,True,False,False,lost-to
1442,2022,R18,Sat 16-Jul-2022 7:25 PM,M.C.G.,Carlton,True,4,1,1,0,...,3,2,2,5,85,4,True,False,False,lost-to
4498,2022,R18,Sun 17-Jul-2022 1:10 PM,M.C.G.,Hawthorn,True,2,3,7,0,...,2,2,3,0,77,0,False,False,True,defeated
4871,2022,R18,Sun 17-Jul-2022 2:50 PM,Traeger Park,Melbourne,True,0,4,5,3,...,2,3,3,3,69,0,False,False,True,defeated


## Perform Sanity Checking

### Check team names

Since we have decided to use the historically accurate team names, we should check that this aim has
been achieved.

In [13]:
def get_team_seasons(df, team):
    return df.loc[
        (df.for_team == team) | (df.against_team == team),
        'season'
    ]

In [14]:
_seasons = get_team_seasons(df_edges, 'Kangaroos')
assert all((_seasons >= 1999) & (_seasons <= 2007))
_seasons = get_team_seasons(df_edges, 'North Melbourne')
assert all((_seasons < 1999) | (_seasons > 2007))

_seasons = get_team_seasons(df_edges, 'Western Bulldogs')
assert all(_seasons >= 1997)
_seasons = get_team_seasons(df_edges, 'Footscray')
assert all(_seasons < 1997)

_seasons = get_team_seasons(df_edges, 'Brisbane Lions')
assert all(_seasons >= 1997)
_seasons = get_team_seasons(df_edges, 'Fitzroy')
assert all(_seasons < 1997)
_seasons = get_team_seasons(df_edges, 'Brisbane Bears')
assert all(_seasons < 1997)

### Check goals, behinds and points

Each goal is worth 6 points, and each 'behind' is worth 1 point.

In [15]:
for_goals = (
    df_edges.for_goals1 + df_edges.for_goals2 
    + df_edges.for_goals3 + df_edges.for_goals4
)
for_behinds = (
    df_edges.for_behinds1 + df_edges.for_behinds2 
    + df_edges.for_behinds3 + df_edges.for_behinds4
)
for_scores = 6 * for_goals + for_behinds

In [16]:
ind = for_scores == df_edges.for_total_score
assert sum(ind) == len(ind)

In [17]:
against_goals = (
    df_edges.against_goals1 + df_edges.against_goals2 
    + df_edges.against_goals3 + df_edges.against_goals4
)
against_behinds = (
    df_edges.against_behinds1 + df_edges.against_behinds2 
    + df_edges.against_behinds3 + df_edges.against_behinds4
)
against_scores = 6 * against_goals + against_behinds

In [18]:
ind = against_scores == df_edges.against_total_score
assert sum(ind) == len(ind)

### Check matches versus wins, draws and losses

In [19]:
for team in all_teams:
    df = df_edges.loc[df_edges.for_team == team]
    assert len(df) == sum(df.for_is_win) + sum(df.for_is_draw) + sum(df.for_is_loss)
    df = df_edges.loc[df_edges.against_team == team]
    assert len(df) == sum(df.against_is_win) + sum(df.against_is_draw) + sum(df.against_is_loss)
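
The totals checked above would still balance if one row had two outcome flags set and another row had none. A stricter, row-wise variant might look like the following sketch, which uses a small toy frame with the same flag column names as `df_edges`; the values are made up for illustration.

```python
import pandas as pd

# Toy frame with the same flag columns as df_edges (illustrative values only).
df_toy = pd.DataFrame({
    "for_is_win":      [True, False, False],
    "for_is_draw":     [False, True, False],
    "for_is_loss":     [False, False, True],
    "against_is_win":  [False, False, True],
    "against_is_draw": [False, True, False],
    "against_is_loss": [True, False, False],
})

for side in ("for", "against"):
    flags = df_toy[[f"{side}_is_win", f"{side}_is_draw", f"{side}_is_loss"]]
    # Exactly one outcome flag must be set in every row.
    assert (flags.sum(axis=1) == 1).all()

# Outcomes must mirror each other across the two ends of an edge.
assert (df_toy.for_is_win == df_toy.against_is_loss).all()
assert (df_toy.for_is_draw == df_toy.against_is_draw).all()
assert (df_toy.for_is_loss == df_toy.against_is_win).all()
print("flags are consistent")
```

The same assertions could be applied to `df_edges` directly by replacing `df_toy`.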

### Check known venues

Occasionally a team changes its home ground to another oval. More frequently, ovals change names due to sponsorship deals. In order to assess any statistics relating to the
match ground, we require a geographical understanding of the names and locations of the various ovals.

In [20]:
df_venues = pd.read_csv(os.path.join("..", "data", "venues.csv"))

In [21]:
df_venues

Unnamed: 0,venue,from,to,latitude,longitude,ground,suburb,state,aliases
0,Adelaide Oval,2011,,-34.9156,138.5961,Adelaide Oval,Adelaide,SA,
1,Bellerive Oval,2012,2019.0,-42.8773,147.3735,Bellerive Oval,Bellerive,TAS,Blundstone Arena
2,Blacktown,2012,2012.0,-33.769444,150.859167,Blacktown International Sportspark Oval,Rooty Hill,NSW,Blacktown ISP Oval; Blacktown ISP
3,Bruce Stadium,1995,1995.0,-35.25,149.102778,Canberra Stadium,Bruce,ACT,GIO Stadium Canberra;GIO Stadium;Bruce Stadium...
4,Carrara,1987,,-28.0063,153.3669,Carrara Stadium,Gold Coast,QLD,
5,Cazaly's Stadium,2011,,-16.9358,145.7492,Cazaly's Stadium,Westcourt,QLD,
6,Docklands,2000,,-37.8165,144.9474,Docklands Stadium,Melbourne,VIC,Marvel Stadium;Etihad Stadium;Telstra Dome;Col...
7,Eureka Stadium,2017,2019.0,-37.53841,143.84803,Eureka Stadium,Wendouree,VIC,Mars Stadium;Northern Oval #1;AUSTAR Arena
8,Football Park,1991,2013.0,-34.88,138.495556,Football Park,West Lakes,SA,AAMI Stadium
9,Gabba,1981,,-27.4859,153.0381,Brisbane Cricket Ground,Brisbane,QLD,


In [22]:
df = pd.merge(df_edges, df_venues, on='venue', how='left')

In [23]:
assert not any(df.latitude.isna())
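
If the assertion above were to fail, it would not say which venues are unknown. One way to list them is the `indicator` option of `pd.merge`, sketched here on toy data; the frames `edges` and `venues` and the venue name `Unknown Oval` are hypothetical, and the coordinates are for illustration only.

```python
import pandas as pd

# Toy stand-ins for df_edges and df_venues (illustrative values only).
edges = pd.DataFrame({"venue": ["M.C.G.", "Gabba", "Unknown Oval", "Gabba"]})
venues = pd.DataFrame({
    "venue": ["M.C.G.", "Gabba"],
    "latitude": [-37.82, -27.4859],
})

# indicator=True adds a '_merge' column: 'both', 'left_only' or 'right_only'.
merged = pd.merge(edges, venues, on="venue", how="left", indicator=True)
missing = sorted(merged.loc[merged["_merge"] == "left_only", "venue"].unique())
print(missing)  # ['Unknown Oval']
```

An empty `missing` list corresponds to the assertion above passing.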

## Save the Graph

In [24]:
df_edges.to_csv(os.path.join("..", "data", "matches.csv"), index=False)