# AFL Model - Part 1 - Data Cleaning

These tutorials will walk you through how to construct your own basic AFL model, using publicly available data. The output will be odds for each team to win, which will be shown on [The Hub](https://www.betfair.com.au/hub/tools/models/afl-prediction-model/).

In this notebook we will walk you through the basics of cleaning this dataset and how we have done it. If you want to get straight to feature creation or modelling, feel free to jump ahead!

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import re

We will first explore the DataFrames, and then create functions to wrangle them and clean them into more consistent sets of data.

In [723]:
# Read/clean each DataFrame
match_results = pd.read_csv("data/afl_match_results.csv")
odds = pd.read_csv("data/afl_odds.csv")

# Read the historical player stats DataFrames together and append them
player_stats = pd.read_csv("data/player_stats_2010.csv")
player_stats_2018 = pd.read_csv("data/player_stats_2018.csv")
col_order = player_stats.columns
player_stats = pd.concat([player_stats, player_stats_2018])[col_order]

Have a look at the structure of the DataFrames. Notice that in the odds DataFrame each game is split across two rows (one per team), whilst in match_results each game is on a single row. We will get around this by splitting each match_results game onto two rows, which will make it easier to apply our feature transformation functions later on. For the player_stats DataFrame, we will aggregate the individual player stats so that each team in each game has its own row.

In [724]:
match_results.head(3)

Unnamed: 0,Game,Date,Round,Home.Team,Home.Goals,Home.Behinds,Home.Points,Away.Team,Away.Goals,Away.Behinds,Away.Points,Venue,Margin,Season,Round.Type,Round.Number
0,1,1897-05-08,R1,Fitzroy,6,13,49,Carlton,2,4,16,Brunswick St,33,1897,Regular,1
1,2,1897-05-08,R1,Collingwood,5,11,41,St Kilda,2,4,16,Victoria Park,25,1897,Regular,1
2,3,1897-05-08,R1,Geelong,3,6,24,Essendon,7,5,47,Corio Oval,-23,1897,Regular,1


In [725]:
odds.tail(3)

Unnamed: 0,trunc,event_name,path,selection_name,odds
4009,2018-06-23,Match Odds,AFL/Western Bulldogs v North Melbourne,Western Bulldogs,4.1127
4010,2018-06-24,Match Odds,AFL/Collingwood v Carlton,Carlton,14.7519
4011,2018-06-24,Match Odds,AFL/Collingwood v Carlton,Collingwood,1.0734


In [726]:
player_stats.tail(3)

Unnamed: 0,Date,Season,Round,Venue,Player,Team,Opposition,Status,GA,Match_id,...,FA,AF,SC,CCL,SCL,SI,MG,TO,ITC,T5
5541,2018-07-01,2018,Round 15,Optus Stadium,Darcy Gardiner,Brisbane,Fremantle,Away,0,9639,...,2,59,56,0.0,0.0,1.0,189.0,1.0,3.0,0.0
5542,2018-07-01,2018,Round 15,Optus Stadium,Daniel McStay,Brisbane,Fremantle,Away,1,9639,...,2,71,84,0.0,0.0,7.0,184.0,1.0,1.0,1.0
5543,2018-07-01,2018,Round 15,Optus Stadium,Oscar McInerney,Brisbane,Fremantle,Away,0,9639,...,1,63,62,0.0,1.0,3.0,96.0,1.0,1.0,0.0


First, we will write functions to make the odds data look a bit nicer, with home and away team columns and a date column. To do this we will use a [regular expression](https://docs.python.org/3/howto/regex.html) to extract the team names from the path column, as well as the to_datetime function from pandas. We will also replace all the inconsistent team names with consistent ones.
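To see what the extraction does before applying it to the full dataset, here is a minimal sketch on two toy rows. The `toy_odds` DataFrame and the named groups in `pattern` are illustrative assumptions, not part of the original notebook; the path strings follow the format shown in the odds output above.

```python
import pandas as pd

# Toy rows mimicking the `path` column of the odds data
toy_odds = pd.DataFrame({
    "path": [
        "AFL/Collingwood v Carlton",
        "AFL/Western Bulldogs v North Melbourne",
    ]
})

# Named groups become the column names of the DataFrame returned by str.extract
pattern = r"(?P<home_team>[\w\s]+) v (?P<away_team>[\w\s]+)"
teams = toy_odds["path"].str.extract(pattern, expand=True)

# Strip any stray whitespace around each team name
teams = teams.apply(lambda col: col.str.strip())
print(teams)
```

Because `[\w\s]` does not match `/`, the `AFL/` prefix is left out of the home team name without any extra handling.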

In [727]:
def odds_wrangling(df):
    # Create a date column
    df['Date'] = pd.to_datetime(df['trunc']).dt.date
    
    # Grab the home and away teams from the path column using a regex
    teams = df['path'].str.extract(r'([\w\s]+) v ([\w\s]+)', expand=True)
    df['home_team'] = teams[0].str.strip()
    df['away_team'] = teams[1].str.strip()
    
    # Drop unneeded columns
    df = df.drop(columns=['path', 'trunc', 'event_name'])
    
    # Rename column
    df = df.rename(columns={'selection_name': 'Team'})
    return df

def clean_odds(df):
    # Clean team names to be consistent across DataFrames
    df = df.replace(
    {
        'Adelaide Crows': 'Adelaide',
        'Brisbane Lions': 'Brisbane',
        'Carlton Blues': 'Carlton',
        'Collingwood Magpies': 'Collingwood',
        'Essendon Bombers': 'Essendon',
        'Fremantle Dockers': 'Fremantle',
        'GWS Giants': 'GWS',
        'Geelong Cats': 'Geelong',
        'Gold Coast Suns': 'Gold Coast',
        'Greater Western Sydney': 'GWS',
        'Greater Western Sydney Giants': 'GWS',
        'Hawthorn Hawks': 'Hawthorn',
        'Melbourne Demons': 'Melbourne', 
        'North Melbourne Kangaroos': 'North Melbourne',
        'Port Adelaide Magpies': 'Port Adelaide',
        'Port Adelaide Power': 'Port Adelaide', 
        'P Adelaide': 'Port Adelaide',
        'Richmond Tigers': 'Richmond',
        'St Kilda Saints': 'St Kilda', 
        'Sydney Swans': 'Sydney',
        'West Coast Eagles': 'West Coast',
        'Wetsern Bulldogs': 'Western Bulldogs',
        'Western Bullbogs': 'Western Bulldogs'
    }
    )
    return df

In [728]:
# Apply the wrangling and cleaning
odds = odds_wrangling(odds)
odds = clean_odds(odds)
odds.tail()

Unnamed: 0,Team,odds,Date,home_team,away_team
4007,Hawthorn,1.041,2018-06-23,Hawthorn,Gold Coast
4008,North Melbourne,1.3182,2018-06-23,Western Bulldogs,North Melbourne
4009,Western Bulldogs,4.1127,2018-06-23,Western Bulldogs,North Melbourne
4010,Carlton,14.7519,2018-06-24,Collingwood,Carlton
4011,Collingwood,1.0734,2018-06-24,Collingwood,Carlton


We now have a DataFrame that looks nice and easy to join with our other DataFrames. Next, we'll fix up the match_results DataFrame by splitting each game onto two rows and making its team names consistent with the odds data.
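The reshaping idea can be sketched on a single game first. This is a cut-down illustration, assuming only a handful of the real columns; the `wide`, `home`, `away` and `long` names are hypothetical, and the values are taken from the first row of match_results shown earlier.

```python
import pandas as pd

# One match in the wide, one-row-per-game layout
wide = pd.DataFrame({
    "Game": [1],
    "Home.Team": ["Fitzroy"],
    "Home.Points": [49],
    "Away.Team": ["Carlton"],
    "Away.Points": [16],
    "Margin": [33],
})

# The game from the home team's point of view
home = pd.DataFrame({
    "Game": wide["Game"],
    "Team": wide["Home.Team"],
    "Points": wide["Home.Points"],
    "Margin": wide["Margin"],
    "Home?": 1,
})

# The same game from the away team's point of view, with the margin's sign flipped
away = pd.DataFrame({
    "Game": wide["Game"],
    "Team": wide["Away.Team"],
    "Points": wide["Away.Points"],
    "Margin": -wide["Margin"],
    "Home?": 0,
})

# Stack the two views and keep both rows of a game together, home team first
long = (
    pd.concat([home, away])
    .sort_values(by=["Game", "Home?"], ascending=[True, False])
    .reset_index(drop=True)
)
print(long)
```

A quick sanity check on the result is that the two margins for any game should sum to zero.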

In [729]:
def match_results_wrangling(df):
    # Create DataFrame which includes all the home teams' statistics, as well as the stats for the away team (Opposition)
    df_home = pd.DataFrame(
        {
            'Game': df['Game'],
            'Date': df['Date'],
            'Round': df['Round.Number'],
            'Team': df['Home.Team'],
            'Goals': df['Home.Goals'],
            'Behinds': df['Home.Behinds'],
            'Points': df['Home.Points'],
            'Margin': df['Margin'],
            'Venue': df['Venue'],
            'Home?': 1,
            'Opposition': df['Away.Team'],
            'Opposition Goals': df['Away.Goals'],
            'Opposition Behinds': df['Away.Behinds'],
            'Opposition Points': df['Away.Points']
    })
    # Create DataFrame which includes all the away teams' statistics, as well as the stats for the home team (Opposition)
    df_away = pd.DataFrame(
        {
            'Game': df['Game'],
            'Date': df['Date'],
            'Round': df['Round.Number'],
            'Team': df['Away.Team'],
            'Goals': df['Away.Goals'],
            'Behinds': df['Away.Behinds'],
            'Points': df['Away.Points'],
            'Margin': - df['Margin'],
            'Venue': df['Venue'],
            'Home?': 0,
            'Opposition': df['Home.Team'],
            'Opposition Goals': df['Home.Goals'],
            'Opposition Behinds': df['Home.Behinds'],
            'Opposition Points': df['Home.Points']
    })
    
    # Concatenate the DataFrames, then sort by the Game ID so that we have the same game on consecutive rows
    df = pd.concat([df_home, df_away]).sort_values(by='Game').reset_index(drop=True)
    
    # Convert the Date column to Python date objects
    df['Date'] = pd.to_datetime(df['Date']).dt.date
    return df

# Define a function which cleans the match_results DataFrame
def clean_match_results(df):
    # Clean team names to be consistent across DataFrames
    df = df.replace(
    {
        'Brisbane Lions': 'Brisbane',
        'Footscray': 'Western Bulldogs'
    }
    )  
    return df

In [730]:
match_results = match_results_wrangling(match_results)
match_results = clean_match_results(match_results)
match_results.head()

Unnamed: 0,Behinds,Date,Game,Goals,Home?,Margin,Opposition,Opposition Behinds,Opposition Goals,Opposition Points,Points,Round,Team,Venue
0,13,1897-05-08,1,6,1,33,Carlton,4,2,16,49,1,Fitzroy,Brunswick St
1,4,1897-05-08,1,2,0,-33,Fitzroy,13,6,49,16,1,Carlton,Brunswick St
2,11,1897-05-08,2,5,1,25,St Kilda,4,2,16,41,1,Collingwood,Victoria Park
3,4,1897-05-08,2,2,0,-25,Collingwood,11,5,41,16,1,St Kilda,Victoria Park
4,6,1897-05-08,3,3,1,-23,Essendon,5,7,47,24,1,Geelong,Corio Oval


Now we have both the odds DataFrame and match_results DataFrame ready for feature creation! Finally, we will aggregate the player_stats DataFrame so that it holds team totals for each game rather than individual player stats. This DataFrame contains regular stats, such as disposals and marks, as well as advanced stats, such as Tackles Inside 50 and Metres Gained. However, the advanced stats are only available from 2015 onwards, so we will not be using them in this tutorial, as there isn't enough data since 2015 to train our models.

Let's now aggregate the player_stats DataFrame.
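As a rough illustration of the aggregation, the sketch below sums two stats per team per game on toy data. The column names follow the player_stats DataFrame, but the players and values are made up, and `toy_players` and `toy_team` are hypothetical names.

```python
import pandas as pd

# Toy player-level rows; values are invented for illustration
toy_players = pd.DataFrame({
    "Date": ["2018-07-01"] * 4,
    "Team": ["Brisbane", "Brisbane", "Fremantle", "Fremantle"],
    "Player": ["Player A", "Player B", "Player C", "Player D"],
    "GA": [0, 1, 2, 0],
    "MG": [189.0, 184.0, 250.0, 310.0],
})

# Sum the numeric stats for each team in each game;
# numeric_only=True leaves out text columns such as Player
toy_team = toy_players.groupby(["Date", "Team"], as_index=False).sum(numeric_only=True)
print(toy_team)
```

Passing `as_index=False` keeps Date and Team as ordinary columns, which is convenient when the result is later merged on those columns.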

In [731]:
def player_stats_wrangling(df):
    # Aggregate the numeric stats; numeric_only=True leaves out text columns such as Player and Venue
    agg_stats = df.groupby(by=['Date', 'Season', 'Round', 'Team', 'Opposition', 'Status'], as_index=False).sum(numeric_only=True)

    # Drop irrelevant columns such as Disposal Efficiency and Time On Ground which are meaningless when aggregated
    agg_stats = agg_stats.drop(columns=['DE', 'TOG', 'Match_id'])
    
    # Convert the Date column to Python date objects
    agg_stats['Date'] = pd.to_datetime(agg_stats['Date']).dt.date
    return agg_stats

In [732]:
agg_stats = player_stats_wrangling(player_stats)

We now have three fully prepared DataFrames, which are almost ready to be analysed and used to build a model! Let's have a look at them and then merge them together into our final DataFrame.
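Two details of an inner merge on Team and Date are worth seeing on toy data first: rows whose team names do not match are silently dropped, and columns that share a name in both DataFrames come back with `_x` and `_y` suffixes. The sketch below assumes made-up Round, Margin and GA values, and all `toy_*` names are hypothetical.

```python
import datetime as dt

import pandas as pd

game_day = dt.date(2018, 6, 24)

# 'Brisbane Lions' is deliberately left uncleaned to show what happens to it
toy_odds = pd.DataFrame({
    "Team": ["Carlton", "Collingwood", "Brisbane Lions"],
    "Date": [game_day] * 3,
    "odds": [14.7519, 1.0734, 2.5],
})
toy_results = pd.DataFrame({
    "Team": ["Carlton", "Collingwood", "Brisbane"],
    "Date": [game_day] * 3,
    "Round": [14, 14, 14],
    "Margin": [-20, 20, 5],
})
toy_stats = pd.DataFrame({
    "Team": ["Carlton", "Collingwood", "Brisbane"],
    "Date": [game_day] * 3,
    "Round": ["Round 14"] * 3,
    "GA": [8, 10, 9],
})

# Chain two inner merges on the shared keys
merged = (
    toy_odds
    .merge(toy_results, how="inner", on=["Team", "Date"])
    .merge(toy_stats, how="inner", on=["Team", "Date"])
)
print(merged.columns.tolist())
print(merged)
```

The 'Brisbane Lions' row disappears because no row in the other DataFrames carries that exact name, which is why the team names were made consistent beforehand. The two Round columns come back as `Round_x` and `Round_y`.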

In [733]:
odds.tail(3)

Unnamed: 0,Team,odds,Date,home_team,away_team
4009,Western Bulldogs,4.1127,2018-06-23,Western Bulldogs,North Melbourne
4010,Carlton,14.7519,2018-06-24,Collingwood,Carlton
4011,Collingwood,1.0734,2018-06-24,Collingwood,Carlton


In [734]:
match_results.tail(3)

Unnamed: 0,Behinds,Date,Game,Goals,Home?,Margin,Opposition,Opposition Behinds,Opposition Goals,Opposition Points,Points,Round,Team,Venue
30649,11,2018-07-01,15325,19,1,17,North Melbourne,12,16,108,125,15,Essendon,Docklands
30650,10,2018-07-01,15326,9,1,-55,Brisbane,11,18,119,64,15,Fremantle,Perth Stadium
30651,11,2018-07-01,15326,18,0,55,Fremantle,10,9,64,119,15,Brisbane,Perth Stadium


In [735]:
agg_stats.tail(3)

Unnamed: 0,Date,Season,Round,Team,Opposition,Status,GA,CP,UP,ED,...,FA,AF,SC,CCL,SCL,SI,MG,TO,ITC,T5
3479,2018-07-01,2018,Round 15,Melbourne,St Kilda,Home,12,143,220,258,...,23,1540,1579,20.0,17.0,101.0,6114.0,71.0,69.0,13.0
3480,2018-07-01,2018,Round 15,North Melbourne,Essendon,Away,11,137,221,268,...,11,1611,1587,18.0,15.0,120.0,5507.0,71.0,72.0,9.0
3481,2018-07-01,2018,Round 15,St Kilda,Melbourne,Away,11,137,255,284,...,28,1519,1716,13.0,19.0,129.0,5794.0,68.0,73.0,6.0


In [736]:
# Create a function which merges the DataFrames
def merge_dfs(odds_df, match_results_df, agg_stats_df):
    # Before we merge the DataFrames, let's filter out games that aren't played between teams in our agg_stats_df
    teams = agg_stats_df['Team'].unique()
    odds_df = odds_df[(odds_df['home_team'].isin(teams)) & (odds_df['away_team'].isin(teams))]
    
    # Merge the odds DataFrame with match_results
    df = pd.merge(odds_df, match_results_df, how='inner', on=['Team', 'Date'])
    
    # Merge that df with agg_stats
    df = pd.merge(df, agg_stats_df, how='inner', on=['Team', 'Date'])
    
    # Sort by Game so that both rows of each game sit together, with the away team (Home? == 0) first
    df = df.sort_values(by=['Game', 'Home?']).reset_index(drop=True)
    
    # Drop duplicate columns and rename these
    df = df.drop(columns=['Round_y', 'Opposition_y']).rename(columns={'Opposition_x': 'Opposition', 'Round_x': 'Round'})
    return df

In [737]:
afl_data = merge_dfs(odds, match_results, agg_stats)

In [739]:
afl_data.tail(3)

Unnamed: 0,Team,odds,Date,home_team,away_team,Behinds,Game,Goals,Home?,Margin,...,FA,AF,SC,CCL,SCL,SI,MG,TO,ITC,T5
3083,Western Bulldogs,4.1127,2018-06-23,Western Bulldogs,North Melbourne,9,15316,11,1,-2,...,23,1528,1700,9.0,18.0,92.0,6015.0,82.0,85.0,20.0
3084,Carlton,14.7519,2018-06-24,Collingwood,Carlton,5,15317,9,0,-20,...,30,1483,1536,13.0,14.0,71.0,4970.0,70.0,61.0,8.0
3085,Collingwood,1.0734,2018-06-24,Collingwood,Carlton,13,15317,11,1,20,...,19,1609,1764,9.0,22.0,94.0,5487.0,61.0,71.0,8.0


Great! We now have a clean-looking dataset with each row representing one team in a game. Let's now eliminate the outliers from the dataset. We know that Essendon had a doping scandal which resulted in a large part of their playing list being suspended for the 2016 season, so let's remove all of their 2016 games. To do this we will build a filter that matches any 2016 game involving Essendon, either as the team or as the opposition, and then invert it with ~.
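The mask-and-invert pattern is easier to follow when the two conditions are named separately. This is a small sketch on made-up rows; `toy`, `involves_essendon`, `is_2016` and `kept` are hypothetical names used only for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "Team": ["Essendon", "Carlton", "Essendon", "Geelong"],
    "Opposition": ["Carlton", "Essendon", "Geelong", "Carlton"],
    "Season": [2016, 2016, 2017, 2016],
})

# True for any row where Essendon is either the team or the opposition
involves_essendon = (toy["Team"] == "Essendon") | (toy["Opposition"] == "Essendon")
is_2016 = toy["Season"] == 2016

# ~ flips the boolean mask, so we keep everything except Essendon's 2016 games
kept = toy[~(involves_essendon & is_2016)].reset_index(drop=True)
print(kept)
```

Only the 2017 Essendon game and the 2016 game between Geelong and Carlton remain. The parentheses around each comparison matter, because `&`, `|` and `~` bind more tightly than `==` in Python.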

In [740]:
# Define a function which eliminates outliers
def outlier_eliminator(df):
    # Eliminate Essendon 2016 games
    essendon_filter_criteria = ~(((df['Team'] == 'Essendon') & (df['Season'] == 2016)) | ((df['Opposition'] == 'Essendon') & (df['Season'] == 2016)))
    df = df[essendon_filter_criteria]
    
    # Reset index
    df = df.reset_index(drop=True)
    return df

In [741]:
afl_data = outlier_eliminator(afl_data)

Our data is now fully ready to be explored and for features to be created, which we will walk you through in our next tutorial, [AFL Feature Creation Tutorial](02. afl_feature_creation_tutorial.ipynb).