# Introduction
This tutorial walks you through building a basic AFL model from publicly available data. The output will be the odds for each team to win. The aim is not to teach you how to code, but to let someone with very little programming experience understand exactly what is being done at each step.

For tutorials on Python and data science, I highly recommend checking out [Datacamp](https://www.datacamp.com/) or [Dataquest](https://www.dataquest.io/). Both offer some free introductory material that covers the basics of Python and data science fundamentals.

The tutorial will be relatively short. For the full process of a typical data science project, feel free to read any of the predictive model walkthroughs on [Betfair's GitHub](https://github.com/betfair-datascientists/predictive-models).

## Read in The Data
First, we will read in the data and explore the columns within the dataset. I have condensed the columns down to some key statistics for simplicity, but full AFL datasets are available using the [fitzRoy](https://github.com/jimmyday12/fitzRoy) R package or elsewhere online.

I have collated the data, and we will only use a subset of it for this presentation. Specifically, we'll use each team's median supercoach score and Elo rating as features in the model.

In [23]:
# Import libraries
import pandas as pd
import h2o
from h2o.automl import H2OAutoML

In [24]:
# Read in the data
afl = pd.read_csv('afl_modelling.csv')

In [26]:
# Look at the last 9 rows
afl.tail(9)

| Unnamed: 0 | date | game_id | season | round | venue | team_1 | team_2 | home_team_1 | margin | winner | supercoach_1 | supercoach_2 | elo_1 | elo_2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1363 | 2019-03-21 | 1364 | 2019 | 1 | M.C.G. | Carlton | Richmond | 1 | -33 | 0 | 76.0 | 74.5 | 1192.232026 | 1661.503449 |
| 1364 | 2019-03-22 | 1365 | 2019 | 1 | M.C.G. | Collingwood | Geelong | 1 | -7 | 0 | 78.0 | 87.0 | 1527.182906 | 1544.24487 |
| 1365 | 2019-03-23 | 1366 | 2019 | 1 | Adelaide Oval | Adelaide | Hawthorn | 1 | -32 | 0 | 72.0 | 72.0 | 1583.123031 | 1594.130262 |
| 1366 | 2019-03-23 | 1367 | 2019 | 1 | Gabba | Brisbane | West Coast | 1 | 44 | 1 | 85.0 | 64.5 | 1272.898111 | 1550.133198 |
| 1367 | 2019-03-23 | 1368 | 2019 | 1 | Docklands | Sydney | Western Bulldogs | 0 | -17 | 0 | 84.0 | 68.5 | 1635.928361 | 1467.294962 |
| 1368 | 2019-03-23 | 1369 | 2019 | 1 | M.C.G. | Port Adelaide Power | Melbourne | 0 | 26 | 1 | 74.5 | 64.5 | 1504.777529 | 1566.884537 |
| 1369 | 2019-03-24 | 1370 | 2019 | 1 | Sydney Showground | Essendon | GWS | 0 | -72 | 0 | 55.5 | 84.0 | 1485.319903 | 1605.717735 |
| 1370 | 2019-03-24 | 1371 | 2019 | 1 | Docklands | St Kilda | Gold Coast | 1 | 1 | 1 | 84.5 | 79.5 | 1378.118006 | 1202.430331 |
| 1371 | 2019-03-24 | 1372 | 2019 | 1 | Perth Stadium | North Melbourne | Fremantle | 0 | -82 | 0 | 72.0 | 74.5 | 1425.514864 | 1457.565917 |


As we can see, we have the median supercoach scores for each team, as well as the margin, result and each team's Elo score before the game. Let's take a quick look at the correlation between median supercoach scores and Elo scores with margin, so we can see if these look like they could be good predictors of margin/winning.

In [28]:
# Check the correlations of our columns to margin
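# Restrict to numeric columns first; recent pandas versions raise an error
# if corr() is called on a frame that still contains text columns
afl.select_dtypes('number').corr().margin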

game_id        -0.006955
season         -0.011825
round           0.036840
home_team_1     0.139195
margin          1.000000
winner          0.792105
supercoach_1    0.749591
supercoach_2   -0.743541
elo_1           0.394782
elo_2          -0.384503
Name: margin, dtype: float64

Supercoach scores are strongly correlated with margin (roughly ±0.75), and Elo is moderately correlated with it (roughly ±0.39). Both could be quality predictors of margin/winning. However, we need to ensure that we only use statistics that we have access to before the game. Currently we're using supercoach scores from after the game, which leaks the result. Instead we need to use average supercoach scores from <i>previous games</i> to predict the <i>current game</i>.

## Feature Creation
A "feature" is a stat in the data which we believe will help our model predict the outcome. For example, we can use Elo as a feature because it lets the model learn that teams with a higher Elo tend to beat teams with a lower Elo.
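To make the Elo intuition concrete, here is a minimal sketch of the conventional Elo expected-score formula. It is an illustration only: the tutorial does not say how the Elo ratings in the dataset were calculated, so the function name `elo_win_probability` and the scale of 400 (the usual chess convention) are assumptions, and the sketch ignores any home-ground adjustment.

```python
def elo_win_probability(elo_1, elo_2, scale=400.0):
    """Expected win probability for team 1 under the conventional Elo formula.

    `scale=400` is the usual chess convention and is an assumption here;
    the ratings in the tutorial's dataset may use different constants.
    """
    return 1.0 / (1.0 + 10 ** ((elo_2 - elo_1) / scale))


# Pre-game ratings for Carlton and Richmond, taken from the table above
carlton, richmond = 1192.232026, 1661.503449
print(elo_win_probability(carlton, richmond))   # lower-rated side: well below 0.5
print(elo_win_probability(richmond, carlton))   # higher-rated side: well above 0.5
```

Two equally rated teams come out at 0.5 each, and the two teams' probabilities always sum to 1.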

We're only going to use two features in this model: Elo for each team, and average supercoach scores over the past 10 games.

Let's calculate the average supercoach score for each team for their past 10 matches. To do this, we need to reshape the data from having two teams on each row to only one team on each row (similar to going from a wide dataframe to a long dataframe). We can then find each team's average supercoach scores for the past 10 matches, and then join these stats back to our original dataframe.

### Convert DataFrame from Wide to Long
We need to reshape the dataframe by appending team 2's data to team 1's data. This is because if we use the current shape to calculate moving averages for team 1, we won't include the games where that team appears as team 2. The image below depicts how we will be reshaping the data.

![Wide to Long](wide_to_long.png "Wide to Long")

In [11]:
# Split the dataframe from wide to long

# Get team 1's stats in a dataframe

# Get team 2's stats in a dataframe

# Append these dataframes together to get a long dataframe

In [12]:
# View the last 5 rows of the long df

## Calculating Moving Averages for Supercoach Scores
We now have a dataframe with one team on each row. Let's calculate the average supercoach score for each team over the past 10 games. This piece of code groups the data by team and then calculates a "rolling" average (an average over a fixed number of games) for the last 10 games. It then shifts the result by one game so that the current game is not included in its own average.

In [13]:
# Create supercoach average for the last 10 games played. Call it supercoach_average_last_10
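A minimal sketch of the group, roll and shift steps follows, under stated assumptions. The data is made up (teams `A` and `B`), the long-format column names `team` and `supercoach` are hypothetical, and `min_periods=1` is a choice made here so that teams with fewer than 10 previous games still get an average; the original notebook may have required a full window.

```python
import pandas as pd

# Made-up long-format data: one row per team per game, already in date order
long_df = pd.DataFrame({
    'game_id': [1, 1, 2, 2, 3, 3],
    'team': ['A', 'B', 'A', 'B', 'A', 'B'],
    'supercoach': [70.0, 80.0, 74.0, 78.0, 90.0, 60.0],
})

WINDOW = 10  # number of previous games to average over

# Within each team: rolling mean over up to WINDOW games, then shift by one row
# so each game only sees scores from games played before it
long_df['supercoach_average_last_10'] = (
    long_df.groupby('team')['supercoach']
    .transform(lambda s: s.rolling(WINDOW, min_periods=1).mean().shift(1))
)

print(long_df)
```

Each team's first game has no history, so its average is missing (`NaN`). Those rows would need to be dropped or filled before modelling.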

Now let's look at each team's most recent 10-game supercoach average.

In [14]:
# Look at each team's last average 10 supercoach scores and order it from highest to lowest
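As a sketch of this step, the snippet below takes each team's most recent value of `supercoach_average_last_10` and sorts the result from highest to lowest. The data and team labels are made up, and the column names are the hypothetical long-format names assumed for illustration.

```python
import pandas as pd

# Made-up long-format data with the rolling average already calculated
long_df = pd.DataFrame({
    'team': ['A', 'B', 'A', 'B'],
    'supercoach_average_last_10': [70.0, 80.0, 72.0, 79.0],
})

# Take each team's latest average, then order from highest to lowest
latest_average = (
    long_df.groupby('team')['supercoach_average_last_10']
    .last()
    .sort_values(ascending=False)
)

print(latest_average)
```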

As we can see, the top sides generally have a higher supercoach average, whereas bottom sides like Gold Coast have a lower one. Essendon are at the top, as they are naturally a high-scoring supercoach team and also had a strong last 10 games of the 2018 season. Richmond are next, which is expected as they were favourites going into the finals. Our data doesn't include the 2018 finals. If it did, we could expect Collingwood and West Coast to be higher up the supercoach average ladder.

## Join Features To Original DataFrame
Let's now merge these average stats back to our original dataframe and use them to create a simple machine learning model. We will then predict the first round of 2019 to see what our model would have predicted.

In [15]:
# Put dataframe back into individual rows

In [16]:
# View last 5 rows
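One way to get back to a single row per game is to merge the long dataframe onto the original twice, once for each team. The sketch below assumes hypothetical names throughout: the teams `A` to `D` and their averages are made up, and the `_1` and `_2` suffixes on the new feature columns are a naming choice made here to match the existing `elo_1` and `elo_2` pattern.

```python
import pandas as pd

# Made-up wide data: one game per row
afl = pd.DataFrame({
    'game_id': [1, 2],
    'team_1': ['A', 'C'],
    'team_2': ['B', 'D'],
})

# Made-up long data: one team per row, with its pre-game rolling average
long_df = pd.DataFrame({
    'game_id': [1, 1, 2, 2],
    'team': ['A', 'B', 'C', 'D'],
    'supercoach_average_last_10': [68.0, 81.0, 77.5, 79.0],
})

features = long_df[['game_id', 'team', 'supercoach_average_last_10']]

# Merge once for team 1 and once for team 2, renaming so the columns don't clash
afl_features = (
    afl
    .merge(features.rename(columns={
               'team': 'team_1',
               'supercoach_average_last_10': 'supercoach_average_last_10_1'}),
           on=['game_id', 'team_1'], how='left')
    .merge(features.rename(columns={
               'team': 'team_2',
               'supercoach_average_last_10': 'supercoach_average_last_10_2'}),
           on=['game_id', 'team_2'], how='left')
)

# View last 5 rows
print(afl_features.tail())
```

Using `how='left'` keeps every game from the original dataframe even if a team has no average yet.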

## Modelling With H2O
Now that we have our features (Elo for each team and average supercoach score), let's create a model. To simplify things, we're going to use H2O's AutoML function, which automatically creates many models, tunes each one and returns the model that scores best on a chosen metric. For example, if you want the model that most often picks the winner, you could use accuracy as your metric. First, let's create a training set, which is the data that the model will be "trained" on. We will then use the model to predict the first round of 2019.

In [17]:
# Initialise h2o session

# Create a training set - all the data up to 2018
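The split itself can be done in plain pandas before anything is handed to H2O. The sketch below uses a tiny made-up frame with the dataset's `season`, `round` and `winner` columns; the variable names `train` and `test` are assumptions.

```python
import pandas as pd

# Made-up rows using the same season/round/winner columns as the dataset
afl_features = pd.DataFrame({
    'season': [2018, 2018, 2019, 2019],
    'round': [22, 23, 1, 2],
    'winner': [1, 0, 0, 1],
})

# Training set: all the data up to and including 2018
train = afl_features[afl_features['season'] <= 2018]

# Test set: only the first round of 2019
test = afl_features[(afl_features['season'] == 2019) & (afl_features['round'] == 1)]

print(len(train), len(test))
```

Splitting by date rather than at random matters here: the model should only ever be trained on games played before the ones it predicts.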

In [18]:
# Create an AutoML object

# Train the model

In [19]:
# Predict the first round

In [20]:
# View the round 1 predictions

In [21]:
# Add the round 1 predictions to the test dataframe

In [22]:
# View the predictions
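The modelling steps above could look roughly like the sketch below. It is an outline under stated assumptions rather than the notebook's original code: the helper names, the `max_runtime_secs` and `seed` settings, and the feature column names in the usage note are all illustrative choices. Running `fit_automl_and_predict` needs the `h2o` package and a Java runtime installed.

```python
def fit_automl_and_predict(train, test, features, target='winner',
                           max_runtime_secs=60):
    """Train H2O AutoML on `train` and return class probabilities for `test`.

    `train` and `test` are pandas DataFrames; `features` is a list of column names.
    """
    # Imported here so this helper can be defined without h2o being installed
    import h2o
    from h2o.automl import H2OAutoML

    # Initialise h2o session (starts or connects to a local H2O cluster)
    h2o.init()

    train_h2o = h2o.H2OFrame(train)
    test_h2o = h2o.H2OFrame(test)

    # Treat the 0/1 target as categorical so AutoML builds classifiers
    train_h2o[target] = train_h2o[target].asfactor()

    # Create an AutoML object and train the model
    aml = H2OAutoML(max_runtime_secs=max_runtime_secs, seed=42)
    aml.train(x=features, y=target, training_frame=train_h2o)

    # Predict with the best model; for a 0/1 target the result has
    # the columns `predict`, `p0` and `p1`
    return aml.predict(test_h2o).as_data_frame()


def probability_to_decimal_odds(probability):
    """Convert a win probability into fair decimal odds (no bookmaker margin)."""
    return 1.0 / probability


# A 50% chance corresponds to decimal odds of 2.0
print(probability_to_decimal_odds(0.5))
```

With the training and test sets in hand, a call such as `preds = fit_automl_and_predict(train, test, ['elo_1', 'elo_2', 'supercoach_average_last_10_1', 'supercoach_average_last_10_2'])` would return one row per round 1 game. Since `winner` is 1 when team 1 wins, `probability_to_decimal_odds(preds['p1'])` gives team 1's odds and `probability_to_decimal_odds(preds['p0'])` gives team 2's.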