# AFL player match stats - wrangling
I am interested in analysing publically available AFL data to look for insights into brownlow votes. First, I need to access player match data and wrangle it together. I began with data available from [AFL Tables](https://afltables.com/afl/afl_index.html), then supplemented it with some additional data from [Footywire](https://www.footywire.com/).

## Import libraries

In [75]:
import pandas as pd
pd.set_option('display.max_columns', None)

from bs4 import BeautifulSoup
import requests
import urllib.parse
import numpy as np
import itertools
import csv
import random
import time
import re
from selenium.webdriver import Chrome, ChromeOptions
from selenium.common.exceptions import TimeoutException, WebDriverException

## <u>Table of Contents</u> <a id='L'></a>

<font size = 4><b>[1. AFL Tables](#1)</b></font> <br>
&nbsp;&nbsp; <b>[1.1 Read and clean tables for one game](#1.1)</b> <br>
&nbsp;&nbsp;&nbsp;&nbsp; [1.2.1 Retrieve web links for 2022 games](#1.2.1) <br>
&nbsp;&nbsp;&nbsp;&nbsp; [1.2.2 Retrieve web links for 2015-2022 games](#1.2.2) <br>
&nbsp;&nbsp; <b>[1.2 Retrieve web links](#1.2)</b> <br>
&nbsp;&nbsp; <b>[1.3 Wrangle many tables](#1.3)</b> <br>
<font size = 4><b>[2. Footywire](#2)</b></font> <br>
&nbsp;&nbsp; <b>[2.1 Read and clean tables for one game](#2.1)</b> <br>
&nbsp;&nbsp; <b>[2.2 Retrieve web links](#2.2)</b> <br>
&nbsp;&nbsp;&nbsp;&nbsp; [2.2.1 Retrieve web links for 2022 games](#2.2.1) <br>
&nbsp;&nbsp;&nbsp;&nbsp; [2.2.2 Retrieve web links for 2015-2022 games](#2.2.2) <br>
&nbsp;&nbsp; <b>[2.3 Wrangle many tables](#2.3)</b> <br>
<font size = 4><b>[3. Merge AFL tables and Footywire dataframes](#3)</b></font> <br>
<font size = 4><b>[4. Add player positions](#4)</b></font> <br>
&nbsp;&nbsp; <b>[4.1 Add match URLs](#4.1)</b> <br>
&nbsp;&nbsp; <b>[4.2 Get player field positions](#4.2)</b> <br>
&nbsp;&nbsp;&nbsp;&nbsp; [4.2.1 Get field positions for each player in each match](#4.2.1) <br>
<font size = 4><b>[5 Add AFLCA votes](#5)</b></font> <br>
&nbsp;&nbsp; <b>[5.1 Get all AFLCA votes](#5.1)</b> <br>

## [1. AFL Tables](#L) <a id='1'></a>

### [1.1 Read and clean tables for one game](#L) <a id='1.1'></a>

In [2]:
pd.set_option('display.max_columns', None)

# Read sample game - first game of 2022 between Melbourne and Western Bulldogs
df = pd.read_html("https://afltables.com/afl/stats/games/2022/071120220316.html")

When we read a page using pandas, it will return a list of dataframes, but we only want the two tables of players stats, one for each team. These tables will typically be at index 2 and 4 of the list.

In [3]:
# Preview the table for the home team in the match
df[2].head()

Unnamed: 0_level_0,Melbourne Match Statistics [Season][Game by Game],Melbourne Match Statistics [Season][Game by Game],Melbourne Match Statistics [Season][Game by Game],Melbourne Match Statistics [Season][Game by Game],Melbourne Match Statistics [Season][Game by Game],Melbourne Match Statistics [Season][Game by Game],Melbourne Match Statistics [Season][Game by Game],Melbourne Match Statistics [Season][Game by Game],Melbourne Match Statistics [Season][Game by Game],Melbourne Match Statistics [Season][Game by Game],Melbourne Match Statistics [Season][Game by Game],Melbourne Match Statistics [Season][Game by Game],Melbourne Match Statistics [Season][Game by Game],Melbourne Match Statistics [Season][Game by Game],Melbourne Match Statistics [Season][Game by Game],Melbourne Match Statistics [Season][Game by Game],Melbourne Match Statistics [Season][Game by Game],Melbourne Match Statistics [Season][Game by Game],Melbourne Match Statistics [Season][Game by Game],Melbourne Match Statistics [Season][Game by Game],Melbourne Match Statistics [Season][Game by Game],Melbourne Match Statistics [Season][Game by Game],Melbourne Match Statistics [Season][Game by Game],Melbourne Match Statistics [Season][Game by Game],Melbourne Match Statistics [Season][Game by Game]
Unnamed: 0_level_1,#,Player,KI,MK,HB,DI,GL,BH,HO,TK,RB,IF,CL,CG,FF,FA,BR,CP,UP,CM,MI,1%,BO,GA,%P
0,12 ↑,"Bedford, Toby",5,3,4,9,,1.0,,2.0,1.0,1.0,,1.0,2.0,,,4.0,5.0,,1.0,2.0,1.0,,55.0
1,17,"Bowey, Jake",8,2,1,9,1.0,,,,2.0,1.0,1.0,1.0,,,,3.0,6.0,,,2.0,,,73.0
2,10,"Brayshaw, Angus",12,6,11,23,,,,3.0,2.0,2.0,,4.0,1.0,1.0,,4.0,19.0,,,1.0,,,83.0
3,50,"Brown, Ben",9,8,4,13,3.0,3.0,,,,1.0,,1.0,,1.0,,6.0,7.0,2.0,6.0,2.0,,,86.0
4,31,"Fritsch, Bayley",8,4,1,9,2.0,2.0,,1.0,,2.0,,2.0,,1.0,,3.0,5.0,1.0,3.0,,,1.0,81.0


There are a few things that we will want to clean up such as the NaN values. It is also worth noting that the the player stats tables have a MultiIndex column (e.g., "Melbourne Match Statistics [Season][Game by Game]", and then each column name).

In [4]:
type(df[2].columns)

pandas.core.indexes.multi.MultiIndex

Let's clean up the table by dropping the first row of headings as it will not be needed

In [5]:
# Drop top table heading
df[2] = df[2].droplevel(0, axis=1) # Melbourne
df[4] = df[4].droplevel(0, axis=1) # Western Bulldogs

In [6]:
df[2].head()

Unnamed: 0,#,Player,KI,MK,HB,DI,GL,BH,HO,TK,RB,IF,CL,CG,FF,FA,BR,CP,UP,CM,MI,1%,BO,GA,%P
0,12 ↑,"Bedford, Toby",5,3,4,9,,1.0,,2.0,1.0,1.0,,1.0,2.0,,,4.0,5.0,,1.0,2.0,1.0,,55.0
1,17,"Bowey, Jake",8,2,1,9,1.0,,,,2.0,1.0,1.0,1.0,,,,3.0,6.0,,,2.0,,,73.0
2,10,"Brayshaw, Angus",12,6,11,23,,,,3.0,2.0,2.0,,4.0,1.0,1.0,,4.0,19.0,,,1.0,,,83.0
3,50,"Brown, Ben",9,8,4,13,3.0,3.0,,,,1.0,,1.0,,1.0,,6.0,7.0,2.0,6.0,2.0,,,86.0
4,31,"Fritsch, Bayley",8,4,1,9,2.0,2.0,,1.0,,2.0,,2.0,,1.0,,3.0,5.0,1.0,3.0,,,1.0,81.0


We can include columns for club name, opposition club, round and year to help with identifying matches.

In [7]:
# Record club name in each row
df[2]['Club'] = df[0][1][1]
df[4]['Club'] = df[0][1][2]

# Record opponent club name in each row
df[2]['Opponent'] = df[0][1][2]
df[4]['Opponent'] = df[0][1][1]

# Record year and round in each row
df[2]['Round'] = df[4]['Round'] = df[0][1][0].split('Round: ')[1].split(" Venue")[0]
df[2]['Year'] = df[4]['Year'] = df[0][1][0].split('-')[2].split(' ')[0]

In [8]:
# Check the dataframe for team 2: Western Bulldogs
df[4]

Unnamed: 0,#,Player,KI,MK,HB,DI,GL,BH,HO,TK,RB,IF,CL,CG,FF,FA,BR,CP,UP,CM,MI,1%,BO,GA,%P,Club,Opponent,Round,Year
0,4,"Bontempelli, Marcus",13,7,4,17,1,,,7.0,3.0,4.0,1.0,3.0,1.0,1.0,,6.0,12.0,1.0,,3.0,,2.0,72.0,Western Bulldogs,Melbourne,1,2022
1,12,"Cordy, Zaine",1,,2,3,,,1.0,3.0,,,1.0,2.0,1.0,2.0,,1.0,1.0,,,4.0,,1.0,84.0,Western Bulldogs,Melbourne,1,2022
2,9,"Crozier, Hayden",14,12,5,19,1,,,3.0,3.0,2.0,,3.0,,2.0,,3.0,18.0,1.0,,2.0,1.0,,75.0,Western Bulldogs,Melbourne,1,2022
3,31,"Dale, Bailey",17,6,8,25,1,,,,6.0,3.0,1.0,2.0,1.0,,,4.0,17.0,,,2.0,,,83.0,Western Bulldogs,Melbourne,1,2022
4,35,"Daniel, Caleb",14,6,12,26,,,,2.0,7.0,2.0,,2.0,,1.0,,5.0,17.0,,,1.0,2.0,,80.0,Western Bulldogs,Melbourne,1,2022
5,5,"Dunkley, Josh",14,4,15,29,,1.0,,5.0,2.0,4.0,3.0,3.0,1.0,1.0,,12.0,17.0,,,3.0,1.0,1.0,84.0,Western Bulldogs,Melbourne,1,2022
6,44,"English, Tim",16,6,4,20,,1.0,18.0,2.0,4.0,1.0,8.0,2.0,8.0,1.0,,13.0,7.0,2.0,1.0,2.0,,,86.0,Western Bulldogs,Melbourne,1,2022
7,43,"Gardner, Ryan",6,3,1,7,,,,1.0,,,,1.0,1.0,,,1.0,6.0,,,5.0,,,100.0,Western Bulldogs,Melbourne,1,2022
8,29,"Hannan, Mitch",5,,3,8,1,,,,,1.0,,3.0,1.0,,,6.0,2.0,,,,,,72.0,Western Bulldogs,Melbourne,1,2022
9,7,"Hunter, Lachie",7,4,6,13,,,,3.0,,1.0,,,1.0,,,3.0,10.0,,,2.0,,,88.0,Western Bulldogs,Melbourne,1,2022


In [9]:
# Add result: whether the team won, lost or drew
team1_score = df[0][5][1].split('.')[2]
team2_score = df[0][5][2].split('.')[2]

if team1_score > team2_score:
    df[2]['Result'] = 'W'
    df[4]['Result'] = 'L'
elif team2_score > team1_score:
    df[2]['Result'] = 'L'
    df[4]['Result'] = 'W'
elif team1_score == team2_score:
    df[2]['Result'] = 'D'
    df[4]['Result'] = 'D'

In [10]:
# Add game margin
df[2]['Margin'] = int(team1_score) - int(team2_score)
df[4]['Margin'] = int(team2_score) - int(team1_score)

There are also soms rows at the bottom of the table that we don't need: rushed behinds and stat totals for each team.

In [11]:
# Check the bottom of the table
df[2].tail()

Unnamed: 0,#,Player,KI,MK,HB,DI,GL,BH,HO,TK,RB,IF,CL,CG,FF,FA,BR,CP,UP,CM,MI,1%,BO,GA,%P,Club,Opponent,Round,Year,Result,Margin
21,20,"Tomlinson, Adam",6,4,2,8,,,,1.0,1.0,,,2.0,,1.0,,3.0,6.0,,,4.0,,,100.0,Melbourne,Western Bulldogs,1,2022,W,26
22,7,"Viney, Jack",11,5,12,23,1,,,4.0,2.0,3.0,4.0,4.0,2.0,,,12.0,11.0,,1.0,1.0,,,79.0,Melbourne,Western Bulldogs,1,2022,W,26
23,Rushed,Rushed,Rushed,Rushed,Rushed,Rushed,Rushed,2.0,,,,,,,,,,,,,,,,,,Melbourne,Western Bulldogs,1,2022,W,26
24,Totals,Totals,213,96,137,350,14,13.0,44.0,46.0,40.0,56.0,30.0,72.0,17.0,30.0,4.0,146.0,204.0,16.0,17.0,54.0,8.0,8.0,,Melbourne,Western Bulldogs,1,2022,W,26
25,Opposition,Opposition,224,92,152,376,11,5.0,20.0,53.0,43.0,51.0,40.0,59.0,30.0,17.0,2.0,126.0,243.0,6.0,6.0,43.0,4.0,8.0,,Melbourne,Western Bulldogs,1,2022,W,26


In [12]:
# Replace NaN with 0s
df[2] = df[2].fillna(0)
df[4] = df[4].fillna(0)

In [13]:
# Drop 'Rushed behind' row and totals rows from the bottom
df[2] = df[2][(df[2]['#'] != 'Rushed') & (df[2]['#'] != 'Totals') & (df[2]['#'] != 'Opposition')]
df[4] = df[4][(df[4]['#'] != 'Rushed') & (df[4]['#'] != 'Totals') & (df[4]['#'] != 'Opposition')]

In [14]:
df[2]

Unnamed: 0,#,Player,KI,MK,HB,DI,GL,BH,HO,TK,RB,IF,CL,CG,FF,FA,BR,CP,UP,CM,MI,1%,BO,GA,%P,Club,Opponent,Round,Year,Result,Margin
0,12 ↑,"Bedford, Toby",5,3,4,9,0,1.0,0.0,2.0,1.0,1.0,0.0,1.0,2.0,0.0,0.0,4.0,5.0,0.0,1.0,2.0,1.0,0.0,55.0,Melbourne,Western Bulldogs,1,2022,W,26
1,17,"Bowey, Jake",8,2,1,9,1,0.0,0.0,0.0,2.0,1.0,1.0,1.0,0.0,0.0,0.0,3.0,6.0,0.0,0.0,2.0,0.0,0.0,73.0,Melbourne,Western Bulldogs,1,2022,W,26
2,10,"Brayshaw, Angus",12,6,11,23,0,0.0,0.0,3.0,2.0,2.0,0.0,4.0,1.0,1.0,0.0,4.0,19.0,0.0,0.0,1.0,0.0,0.0,83.0,Melbourne,Western Bulldogs,1,2022,W,26
3,50,"Brown, Ben",9,8,4,13,3,3.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,6.0,7.0,2.0,6.0,2.0,0.0,0.0,86.0,Melbourne,Western Bulldogs,1,2022,W,26
4,31,"Fritsch, Bayley",8,4,1,9,2,2.0,0.0,1.0,0.0,2.0,0.0,2.0,0.0,1.0,0.0,3.0,5.0,1.0,3.0,0.0,0.0,1.0,81.0,Melbourne,Western Bulldogs,1,2022,W,26
5,11,"Gawn, Max",11,4,3,14,1,0.0,34.0,1.0,2.0,5.0,4.0,9.0,2.0,6.0,0.0,12.0,3.0,2.0,0.0,3.0,0.0,0.0,97.0,Melbourne,Western Bulldogs,1,2022,W,26
6,4,"Harmes, James",6,1,11,17,1,0.0,0.0,5.0,1.0,3.0,2.0,4.0,2.0,1.0,0.0,7.0,10.0,1.0,1.0,3.0,0.0,0.0,77.0,Melbourne,Western Bulldogs,1,2022,W,26
7,29,"Hunt, Jayden",7,4,4,11,0,0.0,0.0,1.0,2.0,0.0,0.0,2.0,0.0,0.0,0.0,4.0,7.0,1.0,0.0,2.0,0.0,0.0,86.0,Melbourne,Western Bulldogs,1,2022,W,26
8,6,"Jackson, Luke",3,5,9,12,1,1.0,10.0,4.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,8.0,5.0,2.0,1.0,1.0,0.0,0.0,86.0,Melbourne,Western Bulldogs,1,2022,W,26
9,23,"Jordon, James",10,4,10,20,0,0.0,0.0,2.0,1.0,0.0,0.0,2.0,3.0,0.0,0.0,8.0,12.0,0.0,0.0,0.0,0.0,0.0,70.0,Melbourne,Western Bulldogs,1,2022,W,26


In [15]:
df[4]

Unnamed: 0,#,Player,KI,MK,HB,DI,GL,BH,HO,TK,RB,IF,CL,CG,FF,FA,BR,CP,UP,CM,MI,1%,BO,GA,%P,Club,Opponent,Round,Year,Result,Margin
0,4,"Bontempelli, Marcus",13,7,4,17,1,0.0,0.0,7.0,3.0,4.0,1.0,3.0,1.0,1.0,0.0,6.0,12.0,1.0,0.0,3.0,0.0,2.0,72.0,Western Bulldogs,Melbourne,1,2022,L,-26
1,12,"Cordy, Zaine",1,0,2,3,0,0.0,1.0,3.0,0.0,0.0,1.0,2.0,1.0,2.0,0.0,1.0,1.0,0.0,0.0,4.0,0.0,1.0,84.0,Western Bulldogs,Melbourne,1,2022,L,-26
2,9,"Crozier, Hayden",14,12,5,19,1,0.0,0.0,3.0,3.0,2.0,0.0,3.0,0.0,2.0,0.0,3.0,18.0,1.0,0.0,2.0,1.0,0.0,75.0,Western Bulldogs,Melbourne,1,2022,L,-26
3,31,"Dale, Bailey",17,6,8,25,1,0.0,0.0,0.0,6.0,3.0,1.0,2.0,1.0,0.0,0.0,4.0,17.0,0.0,0.0,2.0,0.0,0.0,83.0,Western Bulldogs,Melbourne,1,2022,L,-26
4,35,"Daniel, Caleb",14,6,12,26,0,0.0,0.0,2.0,7.0,2.0,0.0,2.0,0.0,1.0,0.0,5.0,17.0,0.0,0.0,1.0,2.0,0.0,80.0,Western Bulldogs,Melbourne,1,2022,L,-26
5,5,"Dunkley, Josh",14,4,15,29,0,1.0,0.0,5.0,2.0,4.0,3.0,3.0,1.0,1.0,0.0,12.0,17.0,0.0,0.0,3.0,1.0,1.0,84.0,Western Bulldogs,Melbourne,1,2022,L,-26
6,44,"English, Tim",16,6,4,20,0,1.0,18.0,2.0,4.0,1.0,8.0,2.0,8.0,1.0,0.0,13.0,7.0,2.0,1.0,2.0,0.0,0.0,86.0,Western Bulldogs,Melbourne,1,2022,L,-26
7,43,"Gardner, Ryan",6,3,1,7,0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,6.0,0.0,0.0,5.0,0.0,0.0,100.0,Western Bulldogs,Melbourne,1,2022,L,-26
8,29,"Hannan, Mitch",5,0,3,8,1,0.0,0.0,0.0,0.0,1.0,0.0,3.0,1.0,0.0,0.0,6.0,2.0,0.0,0.0,0.0,0.0,0.0,72.0,Western Bulldogs,Melbourne,1,2022,L,-26
9,7,"Hunter, Lachie",7,4,6,13,0,0.0,0.0,3.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,3.0,10.0,0.0,0.0,2.0,0.0,0.0,88.0,Western Bulldogs,Melbourne,1,2022,L,-26


In [16]:
# Combine both teams
df_concat = pd.concat([df[2], df[4]])

We will also alter the format of player names to match another dataset. We will need to merge the two, so the format of the player names needs to be consistent. We will only keep the first initial, and the last name (partially shortened for hyphenated last names.

In [17]:
# Change format to initial of first name, followed by a space, then last name
df_concat['Player'] = df_concat['Player'].apply(lambda x: f"{x.split(', ')[1][0]} {x.split(', ')[0]}")
# Change format of hyphenated last names to initialise first component
df_concat['Player'] = df_concat['Player'].apply(lambda x: f"{x.split('-')[0][:3]}-{x.split('-')[1]}" if '-' in x else x)

In [18]:
df_concat

Unnamed: 0,#,Player,KI,MK,HB,DI,GL,BH,HO,TK,RB,IF,CL,CG,FF,FA,BR,CP,UP,CM,MI,1%,BO,GA,%P,Club,Opponent,Round,Year,Result,Margin
0,12 ↑,T Bedford,5,3,4,9,0,1.0,0.0,2.0,1.0,1.0,0.0,1.0,2.0,0.0,0.0,4.0,5.0,0.0,1.0,2.0,1.0,0.0,55.0,Melbourne,Western Bulldogs,1,2022,W,26
1,17,J Bowey,8,2,1,9,1,0.0,0.0,0.0,2.0,1.0,1.0,1.0,0.0,0.0,0.0,3.0,6.0,0.0,0.0,2.0,0.0,0.0,73.0,Melbourne,Western Bulldogs,1,2022,W,26
2,10,A Brayshaw,12,6,11,23,0,0.0,0.0,3.0,2.0,2.0,0.0,4.0,1.0,1.0,0.0,4.0,19.0,0.0,0.0,1.0,0.0,0.0,83.0,Melbourne,Western Bulldogs,1,2022,W,26
3,50,B Brown,9,8,4,13,3,3.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,6.0,7.0,2.0,6.0,2.0,0.0,0.0,86.0,Melbourne,Western Bulldogs,1,2022,W,26
4,31,B Fritsch,8,4,1,9,2,2.0,0.0,1.0,0.0,2.0,0.0,2.0,0.0,1.0,0.0,3.0,5.0,1.0,3.0,0.0,0.0,1.0,81.0,Melbourne,Western Bulldogs,1,2022,W,26
5,11,M Gawn,11,4,3,14,1,0.0,34.0,1.0,2.0,5.0,4.0,9.0,2.0,6.0,0.0,12.0,3.0,2.0,0.0,3.0,0.0,0.0,97.0,Melbourne,Western Bulldogs,1,2022,W,26
6,4,J Harmes,6,1,11,17,1,0.0,0.0,5.0,1.0,3.0,2.0,4.0,2.0,1.0,0.0,7.0,10.0,1.0,1.0,3.0,0.0,0.0,77.0,Melbourne,Western Bulldogs,1,2022,W,26
7,29,J Hunt,7,4,4,11,0,0.0,0.0,1.0,2.0,0.0,0.0,2.0,0.0,0.0,0.0,4.0,7.0,1.0,0.0,2.0,0.0,0.0,86.0,Melbourne,Western Bulldogs,1,2022,W,26
8,6,L Jackson,3,5,9,12,1,1.0,10.0,4.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,8.0,5.0,2.0,1.0,1.0,0.0,0.0,86.0,Melbourne,Western Bulldogs,1,2022,W,26
9,23,J Jordon,10,4,10,20,0,0.0,0.0,2.0,1.0,0.0,0.0,2.0,3.0,0.0,0.0,8.0,12.0,0.0,0.0,0.0,0.0,0.0,70.0,Melbourne,Western Bulldogs,1,2022,W,26


### [1.2 Retrieve web links](#L) <a id='1.2'></a>
To scrape match stats for all the matches we need to retrieve the link for each game's page.

#### [1.2.1 Retrieve web links for 2022 games](#L) <a id='1.2.1'></a>

HTTP Request:

In [19]:
# Get Request
response = requests.get('https://afltables.com/afl/seas/2022.html')

# Status Code check
response.status_code

200

Soup Object:

In [20]:
soup = BeautifulSoup(response.content, 'html.parser')

Retrieve URLs:

In [21]:
url1 = 'https://afltables.com/afl'

# Concatenate the second half of each link to get all 2022 matches
match_stats_urls = [url1 + link.get('href').split('..')[1] for link in soup.findAll('a', href=True, text='Match stats')]

# Drop the last 9 games, which are finals. Brownlow votes are not awarded in finals matches
match_stats_urls = match_stats_urls[:-9]

#### [1.2.2 Retrieve web links for 2015-2022 games](#L) <a id='1.2.2'></a>
The above gets all URLs for all home & away season matches in 2022. Let's get URLs for all home & away season matches over the past 8 years.

In [22]:
url1 = 'https://afltables.com/afl'
match_stats_urls = []

for i in range(2022,2014,-1):
    
    # Get Request
    response = requests.get(f'https://afltables.com/afl/seas/{i}.html')
    soup = BeautifulSoup(response.content, 'html.parser')
    
    # Concatenate the second half of each link to get all 2022 matches
    urls = [url1 + link.get('href').split('..')[1] for link in soup.findAll('a', href=True, text='Match stats')]

    # Drop the last 9 games, which are finals. Brownlow votes are not awarded in finals matches
    urls = urls[:-9]
    
    print(f"Number of matches in {i}: {len(urls)}")
    
    match_stats_urls += urls

Number of matches in 2022: 198
Number of matches in 2021: 198
Number of matches in 2020: 153
Number of matches in 2019: 198
Number of matches in 2018: 198
Number of matches in 2017: 198
Number of matches in 2016: 198
Number of matches in 2015: 197


Due to covid, the 2020 season was shortened to 17 rounds while a match was cancelled in 2015 due to the death of Adelaide coach Phil Walsh. 

### [1.3 Wrangle many tables](#L) <a id='1.3'></a>
Now, we create a function to complete the steps in 1.1. Note, there is also a check to make sure that the list index references the correct table, which should have MultiIndex column headers, as mentioned previously. In [this](https://afltables.com/afl/stats/games/2021/071220210402.html) game, there is an additional sundry table before the players stats, which notes a new record for the Western bulldogs, but it does not have a MultiIndex. As a result, the indices of the correct tables are pushed from 2 & 4, to 3 & 5.

In [23]:
def wrangle_stats(page_link):
    '''
    Wrangle the match stats of each player in a game into a single dataframe and clean the data.
    '''
    
    df = pd.read_html(page_link)

    # Set the indices to find team player stats
    idx1 = 2
    
    # If any sundry tables appear on stats page, the correct indices will be pushed down on the page
    while type(df[idx1].columns) != pd.core.indexes.multi.MultiIndex:
        idx1 += 1
    idx2 = idx1 + 2
    
    # Drop top table heading
    df[idx1] = df[idx1].droplevel(0, axis=1)
    df[idx2] = df[idx2].droplevel(0, axis=1)

    # Record club name in each row
    df[idx1]['Club'] = df[0][1][1]
    df[idx2]['Club'] = df[0][1][2]

    # Record opponent club name in each row
    df[idx1]['Opponent'] = df[0][1][2]
    df[idx2]['Opponent'] = df[0][1][1]
    
    # Record year and round in each row
    df[idx1]['Round'] = df[idx2]['Round'] = df[0][1][0].split('Round: ')[1].split(" Venue")[0]
    df[idx1]['Year'] = df[idx2]['Year'] = df[0][1][0].split('-')[2].split(' ')[0]
    
    team1_score = df[0][5][1].split('.')[2]
    team2_score = df[0][5][2].split('.')[2]
    
    # Add result if the player's team won or lost the match
    if team1_score > team2_score:
        df[idx1]['Result'] = 'W'
        df[idx2]['Result'] = 'L'
    elif team2_score > team1_score:
        df[idx1]['Result'] = 'L'
        df[idx2]['Result'] = 'W'
    elif team1_score == team2_score:
        df[idx1]['Result'] = 'D'
        df[idx2]['Result'] = 'D'
    
    # Add game margin
    df[idx1]['Margin'] = int(team1_score) - int(team2_score)
    df[idx2]['Margin'] = int(team2_score) - int(team1_score)
    
    # Drop 'Rushed behind' row and totals rows from the bottom
    df[idx1] = df[idx1][(df[idx1]['#'] != 'Rushed') & (df[idx1]['#'] != 'Totals') & (df[idx1]['#'] != 'Opposition')]
    df[idx2] = df[idx2][(df[idx2]['#'] != 'Rushed') & (df[idx2]['#'] != 'Totals') & (df[idx2]['#'] != 'Opposition')]
    
    # Combine two teams into one dataframe
    df_concat = pd.concat([df[idx1], df[idx2]])
    
    # Remove NaN
    df_concat = df_concat.fillna(0)
    
    # Reformat player names
    df_concat['Player'] = df_concat['Player'].apply(lambda x: " ".join(reversed(x.split(', '))))
    
#     # Format player name to first initial, followed by a space, then last name
#     df_concat['Player'] = df_concat['Player'].apply(lambda x: f"{x.split(', ')[1][0]} {x.split(', ')[0]}")
#     # Format hyphenated last names so that first component is an initial only
#     df_concat['Player'] = df_concat['Player'].apply(lambda x: f"{x.split('-')[0][:3]}-{x.split('-')[1]}" if '-' in x else x)
    
    return df_concat

In [24]:
# Get dataframe ready
df = pd.DataFrame(dtype=wrangle_stats(match_stats_urls[0]).info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 46 entries, 0 to 22
Data columns (total 31 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   #         46 non-null     object 
 1   Player    46 non-null     object 
 2   KI        46 non-null     object 
 3   MK        46 non-null     object 
 4   HB        46 non-null     object 
 5   DI        46 non-null     object 
 6   GL        46 non-null     object 
 7   BH        46 non-null     float64
 8   HO        46 non-null     float64
 9   TK        46 non-null     float64
 10  RB        46 non-null     float64
 11  IF        46 non-null     float64
 12  CL        46 non-null     float64
 13  CG        46 non-null     float64
 14  FF        46 non-null     float64
 15  FA        46 non-null     float64
 16  BR        46 non-null     float64
 17  CP        46 non-null     float64
 18  UP        46 non-null     float64
 19  CM        46 non-null     float64
 20  MI        46 non-null     float64


In [25]:
# Iterate over each url, run the wrangling function to each match, and combine to the master dataframe
for match in match_stats_urls:
    df = pd.concat([df, wrangle_stats(match)])

In [26]:
df

Unnamed: 0,#,Player,KI,MK,HB,DI,GL,BH,HO,TK,RB,IF,CL,CG,FF,FA,BR,CP,UP,CM,MI,1%,BO,GA,%P,Club,Opponent,Round,Year,Result,Margin
0,12 ↑,Toby Bedford,5,3,4,9,0,1.0,0.0,2.0,1.0,1.0,0.0,1.0,2.0,0.0,0.0,4.0,5.0,0.0,1.0,2.0,1.0,0.0,55.0,Melbourne,Western Bulldogs,1,2022,W,26
1,17,Jake Bowey,8,2,1,9,1,0.0,0.0,0.0,2.0,1.0,1.0,1.0,0.0,0.0,0.0,3.0,6.0,0.0,0.0,2.0,0.0,0.0,73.0,Melbourne,Western Bulldogs,1,2022,W,26
2,10,Angus Brayshaw,12,6,11,23,0,0.0,0.0,3.0,2.0,2.0,0.0,4.0,1.0,1.0,0.0,4.0,19.0,0.0,0.0,1.0,0.0,0.0,83.0,Melbourne,Western Bulldogs,1,2022,W,26
3,50,Ben Brown,9,8,4,13,3,3.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,6.0,7.0,2.0,6.0,2.0,0.0,0.0,86.0,Melbourne,Western Bulldogs,1,2022,W,26
4,31,Bayley Fritsch,8,4,1,9,2,2.0,0.0,1.0,0.0,2.0,0.0,2.0,0.0,1.0,0.0,3.0,5.0,1.0,3.0,0.0,0.0,1.0,81.0,Melbourne,Western Bulldogs,1,2022,W,26
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17,17,Jake Melksham,10,5,9,19,1,0.0,0.0,2.0,2.0,4.0,3.0,4.0,1.0,2.0,0.0,4.0,14.0,0.0,0.0,0.0,0.0,0.0,86.0,Essendon,Collingwood,23,2015,W,3
18,20,Jackson Merrett,10,5,2,12,1,0.0,0.0,3.0,0.0,2.0,1.0,1.0,1.0,0.0,0.0,2.0,11.0,0.0,2.0,0.0,0.0,1.0,84.0,Essendon,Collingwood,23,2015,W,3
19,16,Tayte Pears,6,4,8,14,0,0.0,0.0,1.0,0.0,0.0,1.0,2.0,1.0,0.0,0.0,5.0,10.0,2.0,0.0,5.0,0.0,0.0,88.0,Essendon,Collingwood,23,2015,W,3
20,5,Brent Stanton,22,8,8,30,3,1.0,0.0,3.0,4.0,5.0,4.0,1.0,0.0,0.0,3.0,7.0,20.0,0.0,1.0,1.0,0.0,0.0,89.0,Essendon,Collingwood,23,2015,W,3


In [27]:
# Reformat Brisbane and GWS club names (for consistency with the next dataset)
df = df.replace({"Club": {"Brisbane Lions": "Brisbane", "Greater Western Sydney":"GWS"},
                 "Opponent": {"Brisbane Lions": "Brisbane", "Greater Western Sydney":"GWS"}})

In [28]:
# Remove arrows next to some player numbers, which represent substituting a player on/off
df['#'] = df['#'].replace({' \u2191':'',' \u2193':''}, regex=True)

## [2. Footywire](#L)<a id='2'></a>
[Footywire](https://www.footywire.com/) has some additional player statistics for games over the past 8 years such as effective disposals and metres gained, which could provide valuable insight into AFL brownlow voting, so we want to retrieve this data and add it to what we already have.

### [2.1 Read and clean tables for one game](#L) <a id='2.1'></a>

In [29]:
# Get Request
response = requests.get('https://www.footywire.com/afl/footy/ft_match_statistics?mid=10544&advv=Y') # First game of 2022

# Status Code check
response.status_code

200

In [30]:
soup = BeautifulSoup(response.content, 'html.parser')

In [31]:
# Get Footywire tables for first game of round 1, 2022
fw = pd.read_html("https://www.footywire.com/afl/footy/ft_match_statistics?mid=10544&advv=Y")

In [32]:
# Fix headings
fw[12].columns = fw[17].columns = fw[12].iloc[0]
fw[12] = fw[12][1:]
fw[17] = fw[17][1:]

In [33]:
fw[12].head(8)

Unnamed: 0,Player,CP,UP,ED,DE%,CM,GA,MI5,1%,BO,CCL,SCL,SI,MG,TO,ITC,T5,TOG%
1,C Petracca,16,22,29,76.3,0,1,2,0,0,3,6,13,869,4,3,0,87
2,C Oliver,15,20,24,75.0,0,0,0,2,1,3,3,8,549,8,7,2,81
3,J Viney,12,11,16,69.6,0,0,1,1,0,1,3,5,329,4,7,0,79
4,A Brayshaw,4,19,15,65.2,0,0,0,1,0,0,0,3,264,3,7,0,83
5,E Langdon,11,12,17,77.3,0,1,0,2,1,0,1,6,446,4,7,0,100
6,J Jordon,8,12,16,80.0,0,0,0,0,0,0,0,2,259,5,8,0,70
7,A N-Bullen,8,10,13,72.2,1,3,0,2,2,0,0,8,320,4,4,1,88
8,J Harmes,7,10,14,82.4,1,0,1,3,0,0,2,4,163,3,2,2,77


If the first team listed on the page activated a substitute during the match, then an additional line appears below that team's statistics table as a footnote. When reading the data using pandas, this will be in the form of another table, which means the index position of the second team will be pushed from 16 to 17.

In [34]:
# Check if arrow icon present in index 13 table (means substitute was activated)
fw[13][0][0][0] == '\u2197' 

True

In [35]:
# Add team names to each row
fw[12]['Club'] = fw[8]['Team'][0]
fw[17]['Club'] = fw[8]['Team'][1]

In [36]:
fw[12].head()

Unnamed: 0,Player,CP,UP,ED,DE%,CM,GA,MI5,1%,BO,CCL,SCL,SI,MG,TO,ITC,T5,TOG%,Club
1,C Petracca,16,22,29,76.3,0,1,2,0,0,3,6,13,869,4,3,0,87,Melbourne
2,C Oliver,15,20,24,75.0,0,0,0,2,1,3,3,8,549,8,7,2,81,Melbourne
3,J Viney,12,11,16,69.6,0,0,1,1,0,1,3,5,329,4,7,0,79,Melbourne
4,A Brayshaw,4,19,15,65.2,0,0,0,1,0,0,0,3,264,3,7,0,83,Melbourne
5,E Langdon,11,12,17,77.3,0,1,0,2,1,0,1,6,446,4,7,0,100,Melbourne


In [37]:
# Put both teams together
combined_fw = pd.concat([fw[12], fw[17]])

# Add additional columns (needed for merging dataframes later)
combined_fw['Round'] = soup.find('title').text.split('Round ')[1].split(' ')[0]
combined_fw['Year'] = soup.find('title').text.split(' ')[-1]

In [38]:
# Keep only columns of interest (note we already have some of the statistics from AFL tables)
combined_fw = combined_fw[['Player', 'ED', 'DE%', 'SI', 'MG', 'TO', 'ITC' , 'T5', 'Round', 'Year', 'Club']]

In [39]:
combined_fw

Unnamed: 0,Player,ED,DE%,SI,MG,TO,ITC,T5,Round,Year,Club
1,C Petracca,29,76.3,13,869,4,3,0,1,2022,Melbourne
2,C Oliver,24,75,8,549,8,7,2,1,2022,Melbourne
3,J Viney,16,69.6,5,329,4,7,0,1,2022,Melbourne
4,A Brayshaw,15,65.2,3,264,3,7,0,1,2022,Melbourne
5,E Langdon,17,77.3,6,446,4,7,0,1,2022,Melbourne
6,J Jordon,16,80,2,259,5,8,0,1,2022,Melbourne
7,A N-Bullen,13,72.2,8,320,4,4,1,1,2022,Melbourne
8,J Harmes,14,82.4,4,163,3,2,2,1,2022,Melbourne
9,M Gawn,8,57.1,8,326,4,4,0,1,2022,Melbourne
10,S May,11,78.6,1,384,3,5,0,1,2022,Melbourne


In [40]:
# Remove arrow symbols next to substitute/d players
combined_fw['Player'] = combined_fw['Player'].replace({' \u2197':'',' \u2199':''}, regex=True)

### [2.2 Retrieve web links](#L) <a id='2.2'></a>
As with AFL tables, we need to retrieve all the URLs for each match page

#### [2.2.1 Retrieve web links for 2022 games](#L) <a id='2.2.1'></a>

HTTP Request:

In [41]:
# Get Request for match summaries page 2022
response = requests.get('https://www.footywire.com/afl/footy/ft_match_list?year=2022')

# Status Code check
response.status_code

200

Soup object:

In [42]:
soup = BeautifulSoup(response.content, 'html.parser')

In [43]:
# We look for all 'tr' tags that have the class 'darkcolor' or 'lightcolor' as this is the pattern for row data in each table
table_rows = soup.findAll('tr', {'class':["darkcolor", 'lightcolor']})

Isolate the relative URLs:

In [44]:
# Narrow the search down further
table_rows[0].findAll('td', {'class': 'data'})

[<td class="data" height="24"> Wed 16 Mar 7:10pm</td>,
 <td class="data">
 <a href="th-melbourne-demons">Melbourne</a>
 v 
 <a href="th-western-bulldogs">Western Bulldogs</a>
 </td>,
 <td class="data">MCG</td>,
 <td align="center" class="data">58002</td>,
 <td align="center" class="data"><a href="ft_match_statistics?mid=10544">97-71</a></td>,
 <td class="data">
 <a href="ft_player_profile?pid=3800" rel="nofollow">J. Macrae</a> 39<br/>
 </td>,
 <td class="data">
 <a href="ft_player_profile?pid=6491" rel="nofollow">A. Naughton</a> 4<br/>
 </td>]

The link to the page we want in each row is in the 5th column (index 4), so we search for td tag, class data, index 4 and specify that we want the relative url.

In [45]:
table_rows[0].findAll('td', {'class': 'data'})[4].find('a').get('href')

'ft_match_statistics?mid=10544'

Then we store all the URLs in a list, concatenating the first part of the URL with relative URL for each page.

In [46]:
fw_urls = []
url_p1 = "https://www.footywire.com/afl/footy/"

# Find url to advanced stats page for each home and away match in 2022 season, ignore any invalid rows
for row in table_rows:
    try:
        fw_urls.append(f"{url_p1}{row.findAll('td', {'class': 'data'})[4].find('a').get('href')}&advv=Y")
    except:
        pass

In [47]:
fw_urls = fw_urls[:-9] # Cut out finals
fw_urls[:5]

['https://www.footywire.com/afl/footy/ft_match_statistics?mid=10544&advv=Y',
 'https://www.footywire.com/afl/footy/ft_match_statistics?mid=10545&advv=Y',
 'https://www.footywire.com/afl/footy/ft_match_statistics?mid=10546&advv=Y',
 'https://www.footywire.com/afl/footy/ft_match_statistics?mid=10547&advv=Y',
 'https://www.footywire.com/afl/footy/ft_match_statistics?mid=10548&advv=Y']

#### [2.2.2 Retrieve web links for 2015-2022 games](#L) <a id='2.2.2'></a>

In [48]:
table_rows = []

for i in range(2022,2014,-1):
    
    # Get Request
    response = requests.get(f'https://www.footywire.com/afl/footy/ft_match_list?year={i}')
    # Create soup object
    soup = BeautifulSoup(response.content, 'html.parser')
    # Get table rows data
    table_rows += soup.findAll('tr', {'class':["darkcolor", 'lightcolor']})

In [49]:
fw_urls = []
url_p1 = "https://www.footywire.com/afl/footy/"

# Find url to advanced stats page for each home and away match in 2022 season
for row in table_rows:
    try:
        fw_urls.append(f"{url_p1}{row.findAll('td', {'class': 'data'})[4].find('a').get('href')}&advv=Y")
    except:
        pass

In [50]:
len(fw_urls)

1610

### [2.3 Wrangle many tables](#L) <a id='2.3'> </a>

In [51]:
def wrangle_fw_stats(url):
    """
    Wrangle the footywire statistics
    """

    # Get Footywire tables
    fw = pd.read_html(url)
    
    # If substitute activated for team 1, the website shows an additional line below table 1, which pushses
    # the index for team 2 down to 17, otherwise it will be 16. '\u2197' is the unicode for the arrow symbol
    # that can be used to determine if team 1 activated a substitute in that match.
    if fw[13][0][0][0] == '\u2197':
        t2_idx = 17
    else:
        t2_idx = 16
    
    # Fix headings
    fw[12].columns = fw[t2_idx].columns = fw[12].iloc[0]
    fw[12] = fw[12][1:]
    fw[t2_idx] = fw[t2_idx][1:]
    
    # Add team names to each row
    fw[12]['Club'] = fw[8]['Team'][0]
    fw[t2_idx]['Club'] = fw[8]['Team'][1]
    
    # Put both teams together
    combined_fw = pd.concat([fw[12], fw[t2_idx]])
    
    # Add additional columns (for merging dataframes)
    combined_fw['Round'] = fw[7][0][1].split(',')[0].split(' ')[1]
    combined_fw['Year'] = fw[7][0][2].split(',')[1].split(' ')[3]
    
    # Keep only columns of interest (note we already have some of the statistics from AFL tables)
    combined_fw = combined_fw[['Player', 'ED', 'DE%', 'SI', 'MG', 'TO', 'ITC' , 'T5', 'Round', 'Year', 'Club']]
    
#     # Remove arrow symbols next to substitute/d players
#     combined_fw['Player'] = combined_fw['Player'].replace({' \u2197':'',' \u2199':''}, regex=True)
    
    # Reset index
    combined_fw.reset_index(inplace=True, drop=True)
    
    # Get full player names using BeautifulSoup
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    soup_contents = soup.findAll('tr', {'class':['lightcolor','darkcolor']})
    full_names = [x.find('a').get('title') for x in soup_contents if x.find('a') is not None]
    
    # Replace first initial names with list of full names
    combined_fw["Player"] = full_names
    
    # Get Request for basic stats page
    response = requests.get(url.strip('&advv=Y'))
    soup = BeautifulSoup(response.content, 'html.parser')

    # Get stat data
    results = soup.find('table', {'id':'match-statistics-div'}).findAll('td', {'class':'statdata'})

    # Get stat numbers only
    results = [int(x.text) for x in results]

    # Get AFL dreamteam and supercoach figures as a tuple for each player
    results = [(x,y) for (x,y) in zip(results[15::17], results[16::17])]

    # Add 0 stats for unused substitutes if necessary
    for index, row in combined_fw.iterrows():
        if row[1] == "Unused Substitute":
            results.insert(index, (0,0))
            
    # Add additional columns from basic stats page for AFL fantasy scores
    combined_fw[['AF','SC']] = results
    
    return combined_fw

In [52]:
# Get dataframe ready
fw = pd.DataFrame(dtype=wrangle_fw_stats(fw_urls[0]).info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 46 entries, 0 to 45
Data columns (total 13 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Player  46 non-null     object
 1   ED      46 non-null     object
 2   DE%     46 non-null     object
 3   SI      46 non-null     object
 4   MG      46 non-null     object
 5   TO      46 non-null     object
 6   ITC     46 non-null     object
 7   T5      46 non-null     object
 8   Round   46 non-null     object
 9   Year    46 non-null     object
 10  Club    46 non-null     object
 11  AF      46 non-null     int64 
 12  SC      46 non-null     int64 
dtypes: int64(2), object(11)
memory usage: 4.8+ KB


In [53]:
%%time

# Run wrangling function over each table and concatenate together
for url in fw_urls:
    fw = pd.concat([fw, wrangle_fw_stats(url)])

Wall time: 1h 59min 44s


In [54]:
fw.head(60)

Unnamed: 0,Player,ED,DE%,SI,MG,TO,ITC,T5,Round,Year,Club,AF,SC
0,Christian Petracca,29,76.3,13,869,4,3,0,1,2022,Melbourne,142,163
1,Clayton Oliver,24,75,8,549,8,7,2,1,2022,Melbourne,100,112
2,Jack Viney,16,69.6,5,329,4,7,0,1,2022,Melbourne,96,93
3,Angus Brayshaw,15,65.2,3,264,3,7,0,1,2022,Melbourne,86,79
4,Ed Langdon,17,77.3,6,446,4,7,0,1,2022,Melbourne,68,103
5,James Jordon,16,80,2,259,5,8,0,1,2022,Melbourne,73,74
6,Alex Neal-Bullen,13,72.2,8,320,4,4,1,1,2022,Melbourne,85,94
7,James Harmes,14,82.4,4,163,3,2,2,1,2022,Melbourne,68,88
8,Max Gawn,8,57.1,8,326,4,4,0,1,2022,Melbourne,79,92
9,Steven May,11,78.6,1,384,3,5,0,1,2022,Melbourne,59,76


In [55]:
# Remove string values for player that did not get on the ground, to 0
fw = fw.replace("Unused Substitute", 0)

## [3. Merge "AFL Tables" and "Footywire" dataframes](#L) <a id = '3'></a>

In [56]:
# Merge two dataframes on several columns
joined = pd.merge(df, fw, how='outer', indicator='True', on=['Player', 'Club', 'Round', 'Year'])

In [57]:
joined

Unnamed: 0,#,Player,KI,MK,HB,DI,GL,BH,HO,TK,RB,IF,CL,CG,FF,FA,BR,CP,UP,CM,MI,1%,BO,GA,%P,Club,Opponent,Round,Year,Result,Margin,ED,DE%,SI,MG,TO,ITC,T5,AF,SC,True
0,12,Toby Bedford,5,3,4,9,0,1.0,0.0,2.0,1.0,1.0,0.0,1.0,2.0,0.0,0.0,4.0,5.0,0.0,1.0,2.0,1.0,0.0,55.0,Melbourne,Western Bulldogs,1,2022,W,26.0,7,77.8,1,124,1,4,1,43.0,46.0,both
1,17,Jake Bowey,8,2,1,9,1,0.0,0.0,0.0,2.0,1.0,1.0,1.0,0.0,0.0,0.0,3.0,6.0,0.0,0.0,2.0,0.0,0.0,73.0,Melbourne,Western Bulldogs,1,2022,W,26.0,8,88.9,1,197,2,4,0,38.0,58.0,both
2,10,Angus Brayshaw,12,6,11,23,0,0.0,0.0,3.0,2.0,2.0,0.0,4.0,1.0,1.0,0.0,4.0,19.0,0.0,0.0,1.0,0.0,0.0,83.0,Melbourne,Western Bulldogs,1,2022,W,26.0,15,65.2,3,264,3,7,0,86.0,79.0,both
3,50,Ben Brown,9,8,4,13,3,3.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,6.0,7.0,2.0,6.0,2.0,0.0,0.0,86.0,Melbourne,Western Bulldogs,1,2022,W,26.0,8,61.5,11,245,1,0,0,77.0,87.0,both
4,31,Bayley Fritsch,8,4,1,9,2,2.0,0.0,1.0,0.0,2.0,0.0,2.0,0.0,1.0,0.0,3.0,5.0,1.0,3.0,0.0,0.0,1.0,81.0,Melbourne,Western Bulldogs,1,2022,W,26.0,5,55.6,6,251,2,0,0,53.0,61.0,both
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
78954,,Matthew Taberner,,,,,,,,,,,,,,,,,,,,,,,,Fremantle,,Final,2015,,,4,50,2,106,4,0,1,28.0,22.0,right_only
78955,,Matthew Priddis,,,,,,,,,,,,,,,,,,,,,,,,West Coast,,Final,2015,,,18,69.2,9,318,2,0,0,98.0,116.0,right_only
78956,,Matthew Priddis,,,,,,,,,,,,,,,,,,,,,,,,West Coast,,Final,2015,,,18,72,6,256,5,2,3,86.0,105.0,right_only
78957,,Chris Masten,,,,,,,,,,,,,,,,,,,,,,,,West Coast,,Final,2015,,,9,90,4,106,2,0,0,38.0,50.0,right_only


There were a few rows that failed to merge, due to differences in formatting or because of finals matches included in the second dataset:

In [58]:
joined[joined['True']=='right_only']['Round'].unique()

array(['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12',
       '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23',
       'Final', 'Finals'], dtype=object)

In [59]:
# Drop finals rounds
joined = joined[(joined['Round']!='Final')&(joined['Round']!='Finals')]

In [63]:
joined[joined['True']=='left_only']['Player'].unique()

array(['Tim English', 'Mitch Hannan', 'Lachie Hunter', 'Tom Liberatore',
       'Jack Macrae', 'Lochie OBrien', 'Matthew Owies', 'Zac Williams',
       'Dan Butler', 'Mitch Owens', 'Jordan de Goey', 'Brad Close',
       'Tom Stewart', 'Nik Cox', 'Nic Martin', 'Zach Merrett',
       'Lachie Ash', 'Matt de Boer', 'Matt Flynn', 'Harry Himmelberg',
       'Josh Kelly', 'Xavier OHalloran', 'Paddy McCartin',
       'Colin ORiordan', 'Cam Rayner', 'Tom Jonas', 'Lachie Jones',
       'Ollie Wines', 'Mitch Lewis', 'Connor Macdonald', 'Jaeger OMeara',
       'Mitch Hinge', 'Reilly OBrien', 'Lachie Schultz', 'Josh Kennedy',
       'Nic Naitanui', 'Xavier ONeill', 'Sam Collins', 'Nick Holman',
       'Lachie Weller', 'Lachie Fogarty', 'Lachie Plowman',
       'Mitch Duncan', 'Josh Walker', 'Kamdyn McIntosh',
       'Nathan ODriscoll', 'Mitch Wallis', 'Paddy Ryder', 'Mark OConnor',
       'Tim OBrien', 'Harry Petty', 'Matthew Cottrell', 'Oliver Dempsey',
       'Mitch Knevitt', 'Tom Berry', 'Zach S

In [61]:
joined[joined['True']=='right_only']['Player'].unique()

array(['Jackson Macrae', 'Timothy English', 'Thomas Liberatore',
       'Lachlan Hunter', 'Mitchell Hannan', "Lochie O'Brien",
       'Zachary Williams', 'Matt Owies', 'Daniel Butler',
       'Mitchito Owens', 'Jordan De Goey', 'Thomas Stewart',
       'Bradley Close', 'Zachary Merrett', 'Nicholas Martin',
       'Nikolas Cox', 'Joshua Kelly', 'Matthew Flynn', 'Matthew De Boer',
       'Harrison Himmelberg', 'Lachlan Ash', "Xavier O'Halloran",
       'Patrick McCartin', "Colin O'Riordan", 'Cameron Rayner',
       'Oliver Wines', 'Lachlan Jones', 'Thomas Jonas',
       'Connor MacDonald', "Jaeger O'Meara", 'Mitchell Lewis',
       'Mitchell Hinge', "Reilly O'Brien", 'Lachlan Schultz',
       "Xavier O'Neill", 'Joshua Kennedy', 'Nicholas Naitanui',
       'Lachlan Weller', 'Nicholas Holman', 'Samuel Collins',
       'Lachlan Fogarty', 'Lachlan Plowman', 'Mitchell Duncan',
       'Joshua Walker', "Nathan O'Driscoll", 'Kamdyn Mcintosh',
       'Mitchell Wallis', 'Patrick Ryder', "Mark O'Co

There are still some formatting inconsistencies in the player names of the two datasets, which are manually fixed below. The differences were output to csv and mapped in excel.

In [92]:
pd.DataFrame(joined[joined['True']=='right_only']['Player'].unique()).to_csv("player_names_1.csv")
pd.DataFrame(joined[joined['True']=='right_only']['Player'].unique()).to_csv("player_names_2.csv")

In [72]:
# Open mappings prepared in excel from above differences
reader = csv.DictReader(open('AFL_player_names_mapping.csv'))
names_map = {}

for row in reader:
    names_map[row['Key']] = row['Value']

In [94]:
# Standardise names using dictionary mapping
df.replace({"Player":names_map}, inplace=True)
fw.replace({"Player":names_map}, inplace=True)

Now we can merge the dataframes again.

In [95]:
# Merge two datasets, exluding finals stats
joined = pd.merge(df, fw, how='outer', indicator='True', on=['Player', 'Club', 'Round', 'Year'])
joined = joined[(joined['Round']!='Final')&(joined['Round']!='Finals')]

In [96]:
# Check no mismatches between left and right dataframes
joined[joined['True']=='left_only']['Player'].unique()

array([], dtype=object)

In [97]:
# Check no mismatches between left and right dataframes
joined[joined['True']=='right_only']['Player'].unique()

array([], dtype=object)

In [98]:
# Drop redundant column
joined = joined.drop('True', axis=1)

In [99]:
joined

Unnamed: 0,#,Player,KI,MK,HB,DI,GL,BH,HO,TK,RB,IF,CL,CG,FF,FA,BR,CP,UP,CM,MI,1%,BO,GA,%P,Club,Opponent,Round,Year,Result,Margin,ED,DE%,SI,MG,TO,ITC,T5,AF,SC
0,12,Toby Bedford,5,3,4,9,0,1.0,0.0,2.0,1.0,1.0,0.0,1.0,2.0,0.0,0.0,4.0,5.0,0.0,1.0,2.0,1.0,0.0,55.0,Melbourne,Western Bulldogs,1,2022,W,26.0,7,77.8,1,124,1,4,1,43,46
1,17,Jake Bowey,8,2,1,9,1,0.0,0.0,0.0,2.0,1.0,1.0,1.0,0.0,0.0,0.0,3.0,6.0,0.0,0.0,2.0,0.0,0.0,73.0,Melbourne,Western Bulldogs,1,2022,W,26.0,8,88.9,1,197,2,4,0,38,58
2,10,Angus Brayshaw,12,6,11,23,0,0.0,0.0,3.0,2.0,2.0,0.0,4.0,1.0,1.0,0.0,4.0,19.0,0.0,0.0,1.0,0.0,0.0,83.0,Melbourne,Western Bulldogs,1,2022,W,26.0,15,65.2,3,264,3,7,0,86,79
3,50,Ben Brown,9,8,4,13,3,3.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,6.0,7.0,2.0,6.0,2.0,0.0,0.0,86.0,Melbourne,Western Bulldogs,1,2022,W,26.0,8,61.5,11,245,1,0,0,77,87
4,31,Bayley Fritsch,8,4,1,9,2,2.0,0.0,1.0,0.0,2.0,0.0,2.0,0.0,1.0,0.0,3.0,5.0,1.0,3.0,0.0,0.0,1.0,81.0,Melbourne,Western Bulldogs,1,2022,W,26.0,5,55.6,6,251,2,0,0,53,61
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
68459,17,Jake Melksham,10,5,9,19,1,0.0,0.0,2.0,2.0,4.0,3.0,4.0,1.0,2.0,0.0,4.0,14.0,0.0,0.0,0.0,0.0,0.0,86.0,Essendon,Collingwood,23,2015,W,3.0,14,73.7,2,430,6,4,0,72,58
68460,20,Jackson Merrett,10,5,2,12,1,0.0,0.0,3.0,0.0,2.0,1.0,1.0,1.0,0.0,0.0,2.0,11.0,0.0,2.0,0.0,0.0,1.0,84.0,Essendon,Collingwood,23,2015,W,3.0,10,83.3,5,206,3,3,0,68,65
68461,16,Tayte Pears,6,4,8,14,0,0.0,0.0,1.0,0.0,0.0,1.0,2.0,1.0,0.0,0.0,5.0,10.0,2.0,0.0,5.0,0.0,0.0,88.0,Essendon,Collingwood,23,2015,W,3.0,9,64.3,3,63,3,6,0,51,53
68462,5,Brent Stanton,22,8,8,30,3,1.0,0.0,3.0,4.0,5.0,4.0,1.0,0.0,0.0,3.0,7.0,20.0,0.0,1.0,1.0,0.0,0.0,89.0,Essendon,Collingwood,23,2015,W,3.0,22,73.3,9,645,5,4,1,137,130


In [102]:
# Alter column order
cols = joined.columns.tolist()

cols = cols[:2] + cols[25:27] + cols[28:26:-1] + cols[29:31] + [cols[2]] + cols[4:6] + cols[31:33] + [cols[3]] + cols[6:8] + \
[cols[23]] + [cols[33]] + cols[8:16] + cols[17:23] + [cols[24]] + cols[-6:] + [cols[16]]

joined = joined[cols]

In [103]:
joined

Unnamed: 0,#,Player,Club,Opponent,Year,Round,Result,Margin,KI,HB,DI,ED,DE%,MK,GL,BH,GA,SI,HO,TK,RB,IF,CL,CG,FF,FA,CP,UP,CM,MI,1%,BO,%P,MG,TO,ITC,T5,AF,SC,BR
0,12,Toby Bedford,Melbourne,Western Bulldogs,2022,1,W,26.0,5,4,9,7,77.8,3,0,1.0,0.0,1,0.0,2.0,1.0,1.0,0.0,1.0,2.0,0.0,4.0,5.0,0.0,1.0,2.0,1.0,55.0,124,1,4,1,43,46,0.0
1,17,Jake Bowey,Melbourne,Western Bulldogs,2022,1,W,26.0,8,1,9,8,88.9,2,1,0.0,0.0,1,0.0,0.0,2.0,1.0,1.0,1.0,0.0,0.0,3.0,6.0,0.0,0.0,2.0,0.0,73.0,197,2,4,0,38,58,0.0
2,10,Angus Brayshaw,Melbourne,Western Bulldogs,2022,1,W,26.0,12,11,23,15,65.2,6,0,0.0,0.0,3,0.0,3.0,2.0,2.0,0.0,4.0,1.0,1.0,4.0,19.0,0.0,0.0,1.0,0.0,83.0,264,3,7,0,86,79,0.0
3,50,Ben Brown,Melbourne,Western Bulldogs,2022,1,W,26.0,9,4,13,8,61.5,8,3,3.0,0.0,11,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,6.0,7.0,2.0,6.0,2.0,0.0,86.0,245,1,0,0,77,87,0.0
4,31,Bayley Fritsch,Melbourne,Western Bulldogs,2022,1,W,26.0,8,1,9,5,55.6,4,2,2.0,1.0,6,0.0,1.0,0.0,2.0,0.0,2.0,0.0,1.0,3.0,5.0,1.0,3.0,0.0,0.0,81.0,251,2,0,0,53,61,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
68459,17,Jake Melksham,Essendon,Collingwood,2015,23,W,3.0,10,9,19,14,73.7,5,1,0.0,0.0,2,0.0,2.0,2.0,4.0,3.0,4.0,1.0,2.0,4.0,14.0,0.0,0.0,0.0,0.0,86.0,430,6,4,0,72,58,0.0
68460,20,Jackson Merrett,Essendon,Collingwood,2015,23,W,3.0,10,2,12,10,83.3,5,1,0.0,1.0,5,0.0,3.0,0.0,2.0,1.0,1.0,1.0,0.0,2.0,11.0,0.0,2.0,0.0,0.0,84.0,206,3,3,0,68,65,0.0
68461,16,Tayte Pears,Essendon,Collingwood,2015,23,W,3.0,6,8,14,9,64.3,4,0,0.0,0.0,3,0.0,1.0,0.0,0.0,1.0,2.0,1.0,0.0,5.0,10.0,2.0,0.0,5.0,0.0,88.0,63,3,6,0,51,53,0.0
68462,5,Brent Stanton,Essendon,Collingwood,2015,23,W,3.0,22,8,30,22,73.3,8,3,1.0,0.0,9,0.0,3.0,4.0,5.0,4.0,1.0,0.0,0.0,7.0,20.0,0.0,1.0,1.0,0.0,89.0,645,5,4,1,137,130,3.0


In [104]:
# Export to csv
joined.to_csv("afl_tables.csv", index=False)

## [4. Add player positions](#L) <a id = '4'></a>
We will need to use Selenium first to retrieve the page source as beautifulsoup does not work properly with javascript webpages.
An up to date webdriver is needed. We will scrape this data direct from the official [AFL website](https://www.afl.com.au/).

In [76]:
options = ChromeOptions()
options.headless = True # So we can scrape data without opening the webpage
options.add_argument("user-agent=Chrome/80.0.3987.132")
options.add_argument("--window-size=1920,1080")
driver = Chrome('chromedriver', options=options) # Webdriver is located in the same directory as this script

# Reduce page load timeouts, so they can be attempted again
driver.set_page_load_timeout(20)
driver.implicitly_wait(20)
driver.set_script_timeout(20)

In [77]:
# Get webpage for a season
driver.get('https://www.afl.com.au/fixture?Competition=1&CompSeason=43&MatchTimezone=MY_TIME&Regions=2\
&ShowBettingOdds=1&GameWeeks=1&Teams=1&Venues=13#byround')
# Get page source to create soup object
soup = BeautifulSoup(driver.page_source,'html')

### [4.1 Get match URLs](#L) <a id='4.1'></a>
Select AFL Premiership season:

In [78]:
# Find the premiership season IDs
comp_seasons = soup.find('div', {'class':'filter-list__group u-hide-until-tablet js-desktop-filters'})\
.find('div', {'class':'custom-select'}).find_next_sibling().findAll('li')

comp_seasons

[<li class="custom-select__option js-custom-select-option" data-value="52" tabindex="-1"> 2023 Toyota AFL Premiership </li>,
 <li class="custom-select__option js-custom-select-option custom-select__option--selected" data-value="43" tabindex="-1"> 2022 Toyota AFL Premiership </li>,
 <li class="custom-select__option js-custom-select-option" data-value="34" tabindex="-1"> 2021 Toyota AFL Premiership </li>,
 <li class="custom-select__option js-custom-select-option" data-value="20" tabindex="-1"> 2020 Toyota AFL Premiership </li>,
 <li class="custom-select__option js-custom-select-option" data-value="18" tabindex="-1"> 2019 Toyota AFL Premiership </li>,
 <li class="custom-select__option js-custom-select-option" data-value="14" tabindex="-1"> 2018 Toyota AFL Premiership </li>,
 <li class="custom-select__option js-custom-select-option" data-value="11" tabindex="-1"> 2017 Toyota AFL Premiership </li>,
 <li class="custom-select__option js-custom-select-option" data-value="9" tabindex="-1"> 2016

In [79]:
# Get the season ID value for the years we are interested in (2015-2022)
comp_seasons = [x.get('data-value') for x in comp_seasons][1:9]
comp_seasons

['43', '34', '20', '18', '14', '11', '9', '7']

In [81]:
# Get match relative URLs
match_urls = []

for season in comp_seasons:
    
    # Get webpage of matches for round
    driver.get(f'https://www.afl.com.au/fixture?Competition=1&CompSeason={season}&MatchTimezone=MY_TIME&Regions=2\
      &ShowBettingOdds=1&GameWeeks=1&Teams=1&Venues=13#byround')
    
    time.sleep(1)
              
    # Get page source to create soup object
    soup = BeautifulSoup(driver.page_source,'html')
    
    # Get number of H&A rounds
    round_list = soup.find('ul', {'class':'custom-radio__list js-radio-container'})\
                .findAll('label', {'class':'custom-radio__label'})
    
    # Get values (ignore finals)
    round_list = [int(x.text.strip()) for x in round_list if x.text.strip().isdigit()]
    
    for ha_round in round_list:
        
        # Get webpage of matches for round
        driver.get(f'https://www.afl.com.au/fixture?Competition=1&CompSeason={season}&MatchTimezone=MY_TIME&Regions=2\
                      &ShowBettingOdds=1&GameWeeks={ha_round}&Teams=1&Venues=13#byround')
        
        time.sleep(1.5)
        
        # Get page source to create soup object
        soup = BeautifulSoup(driver.page_source,'html')
        
        # Get match page relative URLs
        match_urls += [x.find('a').get('href') for x in soup.findAll(
                        'div', {'class':'match-list__item match-list__item--COMPLETED js-match-list-item'})]

In [82]:
# Check number of match URLS retrieved (should be 1538)
print(len(match_urls))

1538


### [4.2 Get player field positions](#L) <a id='4.2'></a>

In [83]:
# Get webpage for a match
driver.get('https://www.afl.com.au/afl/matches/915#line-ups')

# Get page source to create soup object
soup = BeautifulSoup(driver.page_source,'lxml')

# Sample data
soup.find('div', {'class':'team-lineups__positions-players'})

<div class="team-lineups__positions-players"> <span class="team-lineups__player"> <span class="team-lineups__player-number">[6] </span> Jake Lever, </span> <span class="team-lineups__player"> <span class="team-lineups__player-number">[12] </span> Daniel Talia, </span> <span class="team-lineups__player"> <span class="team-lineups__player-number">[16] </span> Luke Brown </span> </div>

In [84]:
# Get list of players from the lineups
players = soup.find('div', {'class':'team-lineups__wrapper'}).findAll('span', {'class':'team-lineups__player'})
players = [' '.join(x.text.strip(', ').split()[1:]) for x in players]

In [86]:
# Get round and year
comp_round = [soup.find('div', {'class':'mc-header__round-wrapper'}).text.split()[1]]*len(players)
comp_year = [soup.find('div', {'class':'mc-header__comp'}).text.split()[0]]*len(players)

In [87]:
# Find how many players are listed on IC (usually it will be 4 but there are some inconsistencies)
IC = len(soup.findAll('span', {'class':re.compile(r'team-lineups__position-meta-label team-lineups__position-meta-label')}, 
                      text='IC')[0].parent.findAll('span', {'class':'team-lineups__player'})
        )

In [88]:
# Broad positions (Defender, midfielder, forward, ruck and interchange for players who started on the bench)
gen_pos = ['Def']*12 + ['Mid']*6 + ['Fwd']*12 + ['Ruck'] + ['Mid']*2 + ['Ruck'] + ['Mid']*2 + ['IC']*IC*2

# Positions by fieldline as officially provided by the AFL website
pos = ['FB']*6 + ['HB']*6 + ['C']*6 + ['HF']*6 + ['FF']*6 + ['FOL']*6 + ['IC']*IC*2

# More granular positions (centre, ruck rover and rover all denoted as midfield, referring to inside-midfielders)
sub_pos = \
['Back pocket', 'Full-back', 'Back pocket', 'Back pocket', 'Full-back', 'Back pocket'] +\
['Half-back flank', 'Centre half-back', 'Half-back flank', 'Half-back flank', 'Centre half-back', 'Half-back flank'] +\
['Wing', 'Midfield', 'Wing', 'Wing', 'Midfield', 'Wing'] +\
['Half-forward flank', 'Centre half-forward', 'Half-forward flank',\
 'Half-forward flank', 'Centre half-forward', 'Half-forward flank'] +\
['Forward pocket', 'Full-forward', 'Forward pocket', 'Forward pocket', 'Full-forward', 'Forward pocket'] +\
['Ruck', 'Midfield', 'Midfield','Ruck', 'Midfield', 'Midfield'] +\
['Interchange']*IC*2

In [89]:
# Identify each team
home_team = soup.find('span', \
                      {'class':re.compile(r'team-lineups__team-name team-lineups__team-name--home active js-team-tab')})\
                        .text.strip()
away_team = soup.find('span', \
                      {'class':re.compile(r'team-lineups__team-name team-lineups__team-name--away js-team-tab')})\
                        .text.strip()

# Create team labels for players
teams = []

for x in range(6):
    teams += [home_team]*3 + [away_team]*3
    
teams += [home_team]*IC + [away_team]*IC

In [90]:
# If sub is listed as an extension to interchange bench instead of under the designation 'SUB'
if IC == 5:
    gen_pos[-6] = gen_pos[-1] = 'S'
    pos[-6] = pos[-1] = 'Sub'
    sub_pos[-6] = sub_pos[-1] = 'Substitute'

In [91]:
lineup = [(a,b,c,d,e,f,g) for a,b,c,d,e,f,g in zip(players, comp_round, comp_year, teams, gen_pos, pos, sub_pos)]

In [92]:
# Check if 'SUB' designation appears on match lineup
if 'SUB' in [x.text for x in soup.findAll('span', {'class':'team-lineups__meta-label'})]:
    try:
        subs = [x.text for x in soup.findAll('span', {'class':'team-lineups__player-name'})[-2:]]
        for i in range(2):
            lineup.append((subs[i], comp_round[0], comp_year[0], teams[i+2], 'Sub', 'S', 'Subtitute'))
    except:
        pass

In [93]:
pd.DataFrame(lineup, columns=['Player','Round','Year','Club','Pos','Gen-pos','Sub-pos'])

Unnamed: 0,Player,Round,Year,Club,Pos,Gen-pos,Sub-pos
0,Jake Lever,19,2015,Adelaide Crows,Def,FB,Back pocket
1,Daniel Talia,19,2015,Adelaide Crows,Def,FB,Full-back
2,Luke Brown,19,2015,Adelaide Crows,Def,FB,Back pocket
3,Dylan Grimes,19,2015,Richmond,Def,FB,Back pocket
4,Jake Batchelor,19,2015,Richmond,Def,FB,Full-back
5,Troy Chaplin,19,2015,Richmond,Def,FB,Back pocket
6,Rory Laird,19,2015,Adelaide Crows,Def,HB,Half-back flank
7,Kyle Hartigan,19,2015,Adelaide Crows,Def,HB,Centre half-back
8,Ricky Henderson,19,2015,Adelaide Crows,Def,HB,Half-back flank
9,Nick Vlastuin,19,2015,Richmond,Def,HB,Half-back flank


#### [4.2.1 Get field positions for each player in each match](#L) <a id='4.2.1'></a>
The following code takes a few hours to run as it needs enough time to access each match page.

In [103]:
match_lineups = []
ct = 0

# Main logic
while True:
    try:
        driver.get(f'https://www.afl.com.au{match_urls[ct]}#line-ups')
        time.sleep(1)
        # Get page source to create soup object
        soup = BeautifulSoup(driver.page_source,'lxml')

        # Get players from lineup
        players = soup.find('div', {'class':'team-lineups__wrapper'}).findAll('span', {'class':'team-lineups__player'})
        players = [' '.join(x.text.strip(', ').split()[1:]) for x in players]

        # Get round and year
        comp_round = [soup.find('div', {'class':'mc-header__round-wrapper'}).text.split()[1]]*len(players)
        comp_year = [soup.find('div', {'class':'mc-header__comp'}).text.split()[0]]*len(players)

        # Find how many players are listed on IC (usually it will be 4 but there are some inconsistencies)
        IC = len(soup.findAll('span', {'class':re.compile(r'team-lineups__position-meta-label team-lineups__position-meta-label')}, 
                      text='IC')[0].parent.findAll('span', {'class':'team-lineups__player'})
                )
        
        # Broad positions (Defender, midfielder, forward, ruck and interchange for players who started on the bench)
        gen_pos = ['Def']*12 + ['Mid']*6 + ['Fwd']*12 + ['Ruck'] + ['Mid']*2 + ['Ruck'] + ['Mid']*2 + ['IC']*IC*2

        # Positions by fieldline as officially provided by the AFL website
        pos = ['FB']*6 + ['HB']*6 + ['C']*6 + ['HF']*6 + ['FF']*6 + ['FOL']*6 + ['IC']*IC*2

        # More granular positions (centre, ruck rover and rover all denoted as midfield, referring to inside-midfielders)
        sub_pos = \
        ['Back pocket', 'Full-back', 'Back pocket', 'Back pocket', 'Full-back', 'Back pocket'] +\
        ['Half-back flank', 'Centre half-back', 'Half-back flank', 'Half-back flank', 'Centre half-back', 'Half-back flank'] +\
        ['Wing', 'Midfield', 'Wing', 'Wing', 'Midfield', 'Wing'] +\
        ['Half-forward flank', 'Centre half-forward', 'Half-forward flank',\
         'Half-forward flank', 'Centre half-forward', 'Half-forward flank'] +\
        ['Forward pocket', 'Full-forward', 'Forward pocket', 'Forward pocket', 'Full-forward', 'Forward pocket'] +\
        ['Ruck', 'Midfield', 'Midfield','Ruck', 'Midfield', 'Midfield'] +\
        ['Interchange']*IC*2
        
        # Identify each team
        home_team = soup.find('span', \
                              {'class':re.compile(r'team-lineups__team-name team-lineups__team-name--home active js-team-tab')})\
                                .text.strip()
        away_team = soup.find('span', \
                              {'class':re.compile(r'team-lineups__team-name team-lineups__team-name--away js-team-tab')})\
                                .text.strip()   
        # Assign teams to players
        teams = []
        for x in range(6):
            teams += [home_team]*3 + [away_team]*3
        teams += [home_team]*IC + [away_team]*IC

        # If sub is listed as an extension to interchange bench instead of under the designation 'SUB'
        if IC == 5:
            gen_pos[-6] = gen_pos[-1] = 'S'
            pos[-6] = pos[-1] = 'Sub'
            sub_pos[-6] = sub_pos[-1] = 'Substitute'
        
        lineup = [(a,b,c,d,e,f,g) for a,b,c,d,e,f,g in zip(players, comp_round, comp_year, teams, gen_pos, pos, sub_pos)]

        # Add subs to list of data (if substitutes used in matches for that season)
        # Check if 'SUB' designation appears on match lineup
        if 'SUB' in [x.text for x in soup.findAll('span', {'class':'team-lineups__meta-label'})]:
            try:
                subs = [x.text for x in soup.findAll('span', {'class':'team-lineups__player-name'})[-2:]]
                for i in range(2):
                    lineup.append((subs[i], comp_round[0], comp_year[0], teams[i+2], 'Sub', 'S', 'Substitute'))
            except:
                pass

        match_lineups.append(lineup)        
        
    # If certain errors are thrown, restart driver and continue where we left off
    except (TimeoutException, AttributeError, WebDriverException) as ex:
        print("Exception has been thrown. " + str(ex))
        print("Trying again")
        
        # Reset driver
        driver.quit()
        driver = Chrome('chromedriver', options=options) # Webdriver is located in the same directory as this script
        driver.set_page_load_timeout(20)
        driver.implicitly_wait(20)
        driver.set_script_timeout(20)
        continue
        
    else:
        ct +=1
        print(f'{round(100*ct/len(match_urls), 2)}% complete', end='\r')
        
        # Extraction complete condition
        if ct >= len(match_urls):
            break

Exception has been thrown. Message: timeout: Timed out receiving message from renderer: 19.437
  (Session info: headless chrome=111.0.5563.65)
Stacktrace:
Backtrace:
	(No symbol) [0x002537D3]
	(No symbol) [0x001E8B81]
	(No symbol) [0x000EB36D]
	(No symbol) [0x000DD4D3]
	(No symbol) [0x000DD241]
	(No symbol) [0x000DBC95]
	(No symbol) [0x000DC63A]
	(No symbol) [0x000E5FE5]
	(No symbol) [0x000F199E]
	(No symbol) [0x000F4DD6]
	(No symbol) [0x000DC993]
	(No symbol) [0x000F1724]
	(No symbol) [0x00151758]
	(No symbol) [0x0013B216]
	(No symbol) [0x00110D97]
	(No symbol) [0x0011253D]
	GetHandleVerifier [0x004CABF2+2510930]
	GetHandleVerifier [0x004F8EC1+2700065]
	GetHandleVerifier [0x004FC86C+2714828]
	GetHandleVerifier [0x00303480+645344]
	(No symbol) [0x001F0FD2]
	(No symbol) [0x001F6C68]
	(No symbol) [0x001F6D4B]
	(No symbol) [0x00200D6B]
	BaseThreadInitThunk [0x75C500F9+25]
	RtlGetAppContainerNamedObjectPath [0x776F7BBE+286]
	RtlGetAppContainerNamedObjectPath [0x776F7B8E+238]
	(No symbol) [

In [105]:
# Flatten list
match_lineups = [x for lineup in match_lineups for x in lineup]

# Convert list to DataFrame
df_pos = pd.DataFrame(match_lineups, columns=['Player','Round','Year','Club','Pos','Gen-pos','Sub-pos'])
df_pos.shape[0]

68463

Compared to player stats data, there is one row missing. The round 20 match in 2022 between Gold Coast and West Coast has omitted a player on West Coast's interchange bench. Cross-checking with other sources reveals that the player listed as the sub (Jai Culley) actually started on the bench and was subbed out in the fourth quarter for Hugh Dixon, who was omitted from the line-up on the official AFL page. This is fixed up below.

In [106]:
df_pos[(df_pos['Player']=="Jai Culley")]

Unnamed: 0,Player,Round,Year,Club,Pos,Gen-pos,Sub-pos
6943,Jai Culley,18,2022,West Coast Eagles,IC,IC,Interchange
7438,Jai Culley,19,2022,West Coast Eagles,Mid,FOL,Midfield
7772,Jai Culley,20,2022,West Coast Eagles,Sub,S,Substitute
8874,Jai Culley,23,2022,West Coast Eagles,IC,IC,Interchange


In [107]:
# Update Jai Culley row
df_pos.loc[(df_pos['Player']=='Jai Culley')&(df_pos['Year']=='2022')&(df_pos['Round']=='20'),
           ['Pos','Gen-pos','Sub-pos']] = ["IC","IC","Interchange"]

In [108]:
# Add missing Hugh Dixon row
df_pos.loc[df_pos.shape[0]] = ["Hugh Dixon", "20", "2022", "West Coast Eagles", "Sub", "S", "Substitute"]

In [109]:
# Save down a copy of dataframe
df_pos.to_csv('match_positions.csv', index=False)

In [110]:
driver.quit()

### [4.3 Add player positions to main DataFrame](#L) <a id='4.3'></a>

In [111]:
# Re-read the two csv files
df = pd.read_csv('afl_tables.csv')
df_pos = pd.read_csv('match_positions.csv')

We need to check the formatting of club names and standardise between the two datasets:

In [112]:
df['Club'].unique()

array(['Melbourne', 'Western Bulldogs', 'Carlton', 'Richmond', 'St Kilda',
       'Collingwood', 'Geelong', 'Essendon', 'GWS', 'Sydney', 'Brisbane',
       'Port Adelaide', 'Hawthorn', 'North Melbourne', 'Adelaide',
       'Fremantle', 'West Coast', 'Gold Coast'], dtype=object)

In [113]:
df_pos['Club'].unique()

array(['Melbourne', 'Western Bulldogs', 'Carlton', 'Richmond', 'St Kilda',
       'Collingwood', 'Geelong Cats', 'Essendon', 'GWS Giants',
       'Sydney Swans', 'Brisbane Lions', 'Port Adelaide', 'Hawthorn',
       'North Melbourne', 'Adelaide Crows', 'Fremantle',
       'West Coast Eagles', 'Gold Coast Suns'], dtype=object)

In [114]:
# Standardise team names so the DataFrames can be merged
df_pos = df_pos.replace({"Club": {
    "Geelong Cats":"Geelong",
    "GWS Giants":"GWS",
    "Sydney Swans":"Sydney",
    "Brisbane Lions":"Brisbane",
    "Adelaide Crows":"Adelaide",
    "West Coast Eagles":"West Coast",
    "Gold Coast Suns":"Gold Coast"
}})

Some manual mapping was also completed in excel to standardise some player names, needed for merging DataFrames:

In [115]:
# Open mappings prepared in excel from above differences
reader = csv.DictReader(open('AFL_player_names_mapping.csv'))
names_map = {}

for row in reader:
    names_map[row['Key']] = row['Value']

In [116]:
# Standardise names using dictionary mapping
df.replace({"Player":names_map}, inplace=True)
df_pos.replace({"Player":names_map}, inplace=True)

In [117]:
# Attempt join
merged = pd.merge(df, df_pos, how='outer', indicator='True', on=['Player', 'Club', 'Round', 'Year'])

In [118]:
merged

Unnamed: 0,#,Player,Club,Opponent,Year,Round,Result,Margin,KI,HB,DI,ED,DE%,MK,GL,BH,GA,SI,HO,TK,RB,IF,CL,CG,FF,FA,CP,UP,CM,MI,1%,BO,%P,MG,TO,ITC,T5,AF,SC,BR,Pos,Gen-pos,Sub-pos,True
0,12,Toby Bedford,Melbourne,Western Bulldogs,2022,1,W,26.0,5.0,4.0,9.0,7,77.8,3.0,0.0,1.0,0.0,1,0.0,2.0,1.0,1.0,0.0,1.0,2.0,0.0,4.0,5.0,0.0,1.0,2.0,1.0,55.0,124,1,4,1,43,46,0.0,Sub,S,Substitute,both
1,17,Jake Bowey,Melbourne,Western Bulldogs,2022,1,W,26.0,8.0,1.0,9.0,8,88.9,2.0,1.0,0.0,0.0,1,0.0,0.0,2.0,1.0,1.0,1.0,0.0,0.0,3.0,6.0,0.0,0.0,2.0,0.0,73.0,197,2,4,0,38,58,0.0,IC,IC,Interchange,both
2,10,Angus Brayshaw,Melbourne,Western Bulldogs,2022,1,W,26.0,12.0,11.0,23.0,15,65.2,6.0,0.0,0.0,0.0,3,0.0,3.0,2.0,2.0,0.0,4.0,1.0,1.0,4.0,19.0,0.0,0.0,1.0,0.0,83.0,264,3,7,0,86,79,0.0,Def,HB,Half-back flank,both
3,50,Ben Brown,Melbourne,Western Bulldogs,2022,1,W,26.0,9.0,4.0,13.0,8,61.5,8.0,3.0,3.0,0.0,11,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,6.0,7.0,2.0,6.0,2.0,0.0,86.0,245,1,0,0,77,87,0.0,Fwd,FF,Full-forward,both
4,31,Bayley Fritsch,Melbourne,Western Bulldogs,2022,1,W,26.0,8.0,1.0,9.0,5,55.6,4.0,2.0,2.0,1.0,6,0.0,1.0,0.0,2.0,0.0,2.0,0.0,1.0,3.0,5.0,1.0,3.0,0.0,0.0,81.0,251,2,0,0,53,61,0.0,Fwd,FF,Forward pocket,both
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
68459,17,Jake Melksham,Essendon,Collingwood,2015,23,W,3.0,10.0,9.0,19.0,14,73.7,5.0,1.0,0.0,0.0,2,0.0,2.0,2.0,4.0,3.0,4.0,1.0,2.0,4.0,14.0,0.0,0.0,0.0,0.0,86.0,430,6,4,0,72,58,0.0,Def,HB,Half-back flank,both
68460,20,Jackson Merrett,Essendon,Collingwood,2015,23,W,3.0,10.0,2.0,12.0,10,83.3,5.0,1.0,0.0,1.0,5,0.0,3.0,0.0,2.0,1.0,1.0,1.0,0.0,2.0,11.0,0.0,2.0,0.0,0.0,84.0,206,3,3,0,68,65,0.0,Fwd,HF,Half-forward flank,both
68461,16,Tayte Pears,Essendon,Collingwood,2015,23,W,3.0,6.0,8.0,14.0,9,64.3,4.0,0.0,0.0,0.0,3,0.0,1.0,0.0,0.0,1.0,2.0,1.0,0.0,5.0,10.0,2.0,0.0,5.0,0.0,88.0,63,3,6,0,51,53,0.0,Fwd,HF,Half-forward flank,both
68462,5,Brent Stanton,Essendon,Collingwood,2015,23,W,3.0,22.0,8.0,30.0,22,73.3,8.0,3.0,1.0,0.0,9,0.0,3.0,4.0,5.0,4.0,1.0,0.0,0.0,7.0,20.0,0.0,1.0,1.0,0.0,89.0,645,5,4,1,137,130,3.0,Def,HB,Half-back flank,both


In [119]:
# Get subset of data, being players that start on the bench
benched = merged[merged['Gen-pos'].isin(['S','IC'])]

# Get unique instances of players starting on bench by player, club (for duplicate player names), and year
benched = benched[['Player','Club','Year']].drop_duplicates()

# Convert to list of tuples
benched_tuples = list(benched.itertuples(index=False, name=None))

We want as many field positions as possible. However, in each match there are typically 4 players from each team that start on the interchange bench, including substitutes (although this number can also be 3 or 5 depending on rule changes between seasons). For any game that a player started as an interchange player (IC) or substitute (S), we want to try to replace this with an assumption of their field position. The logic of the below code works as follows: 

- If the player has started on the field in at least 1 game in the same year, For each 'IC' or 'S' game, randomly assign their position based on other field positions they started in for that same year. The probability will be in proportion to how often they started in those respective field positions.
- If they have only started on the bench in that year, expand the range to every year in the dataset. Sometimes a player changes clubs between seasons and so the 'Club' restriction is relaxed if necessary.
- If there is still no data to use to assign them a field position, they will be left as 'IC' or 'S'.

In [129]:
random.seed(1)

for t in benched_tuples:
    
    # Use tuples to select data from main dataset
    df_subset = merged.loc[(merged['Player']==t[0])&
                           (merged['Club']==t[1])&
                           (merged['Year']==t[2])
                          ]
    # Select non bench positions
    df_subset = df_subset.loc[(~merged['Pos'].isin(['Sub','IC'])), ['Pos','Gen-pos','Sub-pos']]
    
    # If no valid positions for selected year, select all years for selected player
    if df_subset.empty:
        df_subset = merged.loc[(merged['Player']==t[0])&
                               (merged['Club']==t[1])
                               ]
        df_subset = df_subset.loc[(~merged['Pos'].isin(['Sub','IC'])), ['Pos','Gen-pos','Sub-pos']]
    
    # If no valid positions for selected year and player has changed club during offseason
    if df_subset.empty:
        df_subset = merged.loc[(merged['Player']==t[0])]
        df_subset = df_subset.loc[(~merged['Pos'].isin(['Sub','IC'])), ['Pos','Gen-pos','Sub-pos']]

    # Convert subset to tuples
    positions = list(df_subset.itertuples(index=False, name=None))
    
    # Iterate through main DataFrame
    for idx, row in merged.loc[(merged['Player']==t[0])&
                                 (merged['Club']==t[1])&
                                 (merged['Year']==t[2])
                                 ].iterrows():
        
        # Reassign sub/IC to a random other position the player started in that year
        try:
            merged.loc[idx, ['Pos','Gen-pos','Sub-pos']]=random.choice(positions)
        
        # If a player has only ever started on the bench, there will be no data to use
        except IndexError:
            pass

In [130]:
# Check how many remaining rows without field positions
merged.loc[(merged['Pos'].isin(['Sub','IC']))]

Unnamed: 0,#,Player,Club,Opponent,Year,Round,Result,Margin,KI,HB,DI,ED,DE%,MK,GL,BH,GA,SI,HO,TK,RB,IF,CL,CG,FF,FA,CP,UP,CM,MI,1%,BO,%P,MG,TO,ITC,T5,AF,SC,BR,Pos,Gen-pos,Sub-pos,True
734,34,Jack Williams,West Coast,North Melbourne,2022,2,L,-15.0,1.0,1.0,2.0,1,50.0,0.0,0.0,0.0,0.0,0,1.0,1.0,0.0,1.0,1.0,5.0,0.0,3.0,1.0,1.0,0.0,0.0,0.0,0.0,62.0,24,1,0,0,1,-2,0.0,IC,IC,Interchange,both
950,45,Martin Frederick,Port Adelaide,Adelaide,2022,3,L,-4.0,3.0,1.0,4.0,2,50.0,1.0,1.0,0.0,0.0,2,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,4.0,0.0,1.0,1.0,0.0,18.0,79,1,0,0,24,20,0.0,Sub,S,Substitute,both
1228,28,Neil Erasmus,Fremantle,West Coast,2022,3,L,55.0,9.0,9.0,18.0,12,66.7,4.0,0.0,1.0,0.0,7,0.0,2.0,0.0,1.0,2.0,2.0,0.0,0.0,5.0,13.0,0.0,0.0,0.0,0.0,77.0,229,3,3,0,66,58,0.0,IC,IC,Interchange,both
1436,28,Neil Erasmus,Fremantle,GWS,2022,4,W,34.0,9.0,6.0,15.0,9,60.0,2.0,0.0,0.0,0.0,2,0.0,5.0,0.0,3.0,3.0,3.0,3.0,0.0,8.0,9.0,0.0,0.0,0.0,0.0,60.0,206,3,3,0,68,70,0.0,IC,IC,Interchange,both
1496,18,Louis Butler,Western Bulldogs,Richmond,2022,4,L,-38.0,9.0,4.0,13.0,5,38.5,1.0,0.0,0.0,0.0,1,0.0,1.0,0.0,2.0,0.0,2.0,0.0,0.0,5.0,8.0,0.0,0.0,2.0,0.0,77.0,233,4,3,1,42,43,0.0,IC,IC,Interchange,both
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
68163,46,Billy Evans,Brisbane,Western Bulldogs,2015,23,W,8.0,4.0,7.0,11.0,10,90.9,1.0,1.0,1.0,0.0,7,0.0,0.0,0.0,2.0,2.0,0.0,1.0,0.0,3.0,7.0,0.0,0.0,1.0,0.0,71.0,86,0,0,0,37,41,0.0,IC,IC,Interchange,both
68185,44,Brett Goodes,Western Bulldogs,Brisbane,2015,23,L,-8.0,5.0,6.0,11.0,9,81.8,2.0,1.0,0.0,1.0,3,0.0,2.0,0.0,1.0,1.0,3.0,1.0,2.0,4.0,7.0,0.0,0.0,2.0,0.0,73.0,130,5,2,0,42,51,0.0,IC,IC,Interchange,both
68222,44,Jacob Ballard,Fremantle,Port Adelaide,2015,23,W,-69.0,1.0,5.0,6.0,6,100.0,1.0,0.0,0.0,0.0,0,0.0,0.0,0.0,1.0,3.0,1.0,0.0,0.0,5.0,1.0,0.0,0.0,0.0,0.0,32.0,25,0,0,0,16,15,0.0,Sub,S,Substitute,both
68281,25,Clem Smith,Carlton,Hawthorn,2015,23,W,-57.0,2.0,3.0,5.0,4,80.0,1.0,0.0,0.0,0.0,2,0.0,2.0,0.0,2.0,0.0,0.0,0.0,0.0,2.0,3.0,0.0,0.0,0.0,0.0,24.0,86,2,0,1,23,18,0.0,Sub,S,Substitute,both


For the remaining rows without field positions we could:

- Randomly assign a position, <br>
- Attempt to manually wrangle, or <br>
- Drop the remaining rows<br>

For now we will just leave them as is.

## [5. AFLCA votes](#L) <a id = '5'></a>
We will also add an alternative performance measure, the vote for the AFL Coaches Association Award. The AFL Coaches Association’s AFL Champion Player Award is an award voted by the 18 coaching panels on a 5, 4, 3, 2, 1 basis after each home and away game, acknowledging outstanding effort by an individual player in a season.

In [120]:
# Get Request
response = requests.get('https://aflcoaches.com.au/awards/the-aflca-champion-player-of-the-year-award/leaderboard/'
                        '2022/20220123')
soup = BeautifulSoup(response.content, 'html.parser')

In [121]:
# Find one row of data
soup.find('div', {'class':'row border-bottom pt-1 pb-1 div-hover'}).text.split()

['10', 'Clayton', 'Oliver', '(MELB)']

In [122]:
# Get round
ha_round = soup.find('h2', {'class':'mt-2 text-center text-md-left'}).text.split(' ')[1]
# Get year
year = soup.find('option', {'selected':'"selected"'}).text

In [123]:
# Get points, players and club
aflca_scores = [x.text.split() for x in soup.findAll('div', {'class':'row border-bottom pt-1 pb-1 div-hover'})]
aflca_scores[:3]

[['10', 'Clayton', 'Oliver', '(MELB)'],
 ['7', 'Kysaiah', 'Pickett', '(MELB)'],
 ['4', 'Lachie', 'Neale', '(BL)']]

In [124]:
# Combine player first and last names together and remove brackets around team names
aflca_scores = [[x[0]] + [" ".join(x[1:-1])] + [x[-1][1:-1]] for x in aflca_scores]
aflca_scores[:3]

[['10', 'Clayton Oliver', 'MELB'],
 ['7', 'Kysaiah Pickett', 'MELB'],
 ['4', 'Lachie Neale', 'BL']]

In [125]:
aflca_scores = [x[1:] + [ha_round, year] + [x[0]] for x in aflca_scores]
aflca_scores[:3]

[['Clayton Oliver', 'MELB', '23', '2022', '10'],
 ['Kysaiah Pickett', 'MELB', '23', '2022', '7'],
 ['Lachie Neale', 'BL', '23', '2022', '4']]

In [126]:
round_links = ['https://aflcoaches.com.au' + \
                   x.get("href") for x in soup.find('div', {'class', 'rounds-carousel'}).findAll('a')]

### [5.1 Get all AFLCA votes](#L) <a id = '5.1'></a>

In [132]:
master_list = []

for y in range(2022,2014,-1):
    # Get Request
    response = requests.get(f'https://aflcoaches.com.au/awards/the-aflca-champion-player-of-the-year-award/leaderboard/{y}')
    soup = BeautifulSoup(response.content, 'html.parser')
    
    # Get list of round links
    round_links = ['https://aflcoaches.com.au' + \
                   x.get("href") for x in soup.find('div', {'class', 'rounds-carousel'}).findAll('a')]
    
    for link in round_links:
        response = requests.get(link)
        soup = BeautifulSoup(response.content, 'html.parser')
        
        # Get round and year
        ha_round = int(soup.find('h2', {'class':'mt-2 text-center text-md-left'}).text.split(' ')[1])
        year = int(soup.find('option', {'selected':'"selected"'}).text)
        
        # Get points, players and club
        aflca_scores = [x.text.split() for x in soup.findAll('div', {'class':'row border-bottom pt-1 pb-1 div-hover'})]
        # Combine player first and last names together and remove brackets around team names
        aflca_scores = [[x[0]] + [" ".join(x[1:-1])] + [x[-1][1:-1]] for x in aflca_scores]
        # Add round and year in
        aflca_scores = [x[1:] + [ha_round, year] + [x[0]] for x in aflca_scores]
        
        master_list += aflca_scores

In [133]:
# Convert to dataframe
aflca_df = pd.DataFrame(master_list, columns=["Player", "Club", "Round", "Year", "AFLCA"])

In [134]:
aflca_df

Unnamed: 0,Player,Club,Round,Year,AFLCA
0,Clayton Oliver,MELB,23,2022,10
1,Kysaiah Pickett,MELB,23,2022,7
2,Lachie Neale,BL,23,2022,4
3,Christian Petracca,MELB,23,2022,3
4,Luke Jackson,MELB,23,2022,2
...,...,...,...,...,...
10166,Shaun Burgoyne,HAW,1,2015,6
10167,Luke Breust,HAW,1,2015,4
10168,Luke Hodge,HAW,1,2015,2
10169,Jarryd Roughead,HAW,1,2015,1


In [135]:
# Check team name format
aflca_df['Club'].unique()

array(['MELB', 'BL', 'FRE', 'GWS', 'GCFC', 'GEEL', 'RICH', 'ESS', 'PORT',
       'ADEL', 'WB', 'HAW', 'CARL', 'COLL', 'STK', 'SYD', 'NMFC', 'WCE'],
      dtype=object)

In [136]:
# Standardise team names to facilitate joining with main DataFrame
aflca_df = aflca_df.replace({"Club": {
    "ADEL":"Adelaide",
    "BL":"Brisbane",
    "CARL":"Carlton",
    "COLL":"Collingwood",
    "ESS":"Essendon",
    "FRE":"Fremantle",
    "GEEL":"Geelong",
    "GCFC":"Gold Coast",
    "HAW":"Hawthorn",
    "MELB":"Melbourne",
    "NMFC":"North Melbourne",
    "PORT":"Port Adelaide",
    "RICH":"Richmond",
    "STK":"St Kilda",
    "SYD":"Sydney",
    "WCE":"West Coast",
    "WB":"Western Bulldogs"
}})

In [137]:
aflca_df

Unnamed: 0,Player,Club,Round,Year,AFLCA
0,Clayton Oliver,Melbourne,23,2022,10
1,Kysaiah Pickett,Melbourne,23,2022,7
2,Lachie Neale,Brisbane,23,2022,4
3,Christian Petracca,Melbourne,23,2022,3
4,Luke Jackson,Melbourne,23,2022,2
...,...,...,...,...,...
10166,Shaun Burgoyne,Hawthorn,1,2015,6
10167,Luke Breust,Hawthorn,1,2015,4
10168,Luke Hodge,Hawthorn,1,2015,2
10169,Jarryd Roughead,Hawthorn,1,2015,1


In [138]:
merged = merged.drop(['True'], axis=1)

In [139]:
temp = pd.merge(merged, aflca_df, how='outer', indicator='True', on=['Player', 'Club', 'Round', 'Year'])

In [140]:
# Reveal players that could not be joined with main dataframe
temp[temp['True']=='right_only']['Player'].unique()

array(['Tom J Lynch', 'Cameron Rayner', 'Josh J Kennedy', 'Nicholas Hind',
       'William Rioli', 'Mitchell Georgiades', 'Nicholas Newman',
       'Nick Martin', 'Joel Sudar-Jeffrey', 'Joshua Rachele', 'Tom Jonas',
       'Mitch W Brown', 'Matt De Boer', 'Callum L Brown',
       'Darcy MacPherson', 'Alexander Keath', 'Cameron Ellis-Yolmen',
       'David MacKay', 'Matthew Suckling', 'Cameron McCarthy',
       'Matthew Scharenberg', 'Jamie MacMillan', 'Mark Lecras',
       'Matthew White', 'Thomas Boyd', 'Tommy Sheridan', 'James Bartel',
       'Nathan Van Berlo', 'Michael Pyke'], dtype=object)

In [141]:
temp

Unnamed: 0,#,Player,Club,Opponent,Year,Round,Result,Margin,KI,HB,DI,ED,DE%,MK,GL,BH,GA,SI,HO,TK,RB,IF,CL,CG,FF,FA,CP,UP,CM,MI,1%,BO,%P,MG,TO,ITC,T5,AF,SC,BR,Pos,Gen-pos,Sub-pos,AFLCA,True
0,12.0,Toby Bedford,Melbourne,Western Bulldogs,2022,1,W,26.0,5.0,4.0,9.0,7.0,77.8,3.0,0.0,1.0,0.0,1.0,0.0,2.0,1.0,1.0,0.0,1.0,2.0,0.0,4.0,5.0,0.0,1.0,2.0,1.0,55.0,124.0,1.0,4.0,1.0,43.0,46.0,0.0,Fwd,HF,Half-forward flank,,left_only
1,17.0,Jake Bowey,Melbourne,Western Bulldogs,2022,1,W,26.0,8.0,1.0,9.0,8.0,88.9,2.0,1.0,0.0,0.0,1.0,0.0,0.0,2.0,1.0,1.0,1.0,0.0,0.0,3.0,6.0,0.0,0.0,2.0,0.0,73.0,197.0,2.0,4.0,0.0,38.0,58.0,0.0,Def,HB,Half-back flank,,left_only
2,10.0,Angus Brayshaw,Melbourne,Western Bulldogs,2022,1,W,26.0,12.0,11.0,23.0,15.0,65.2,6.0,0.0,0.0,0.0,3.0,0.0,3.0,2.0,2.0,0.0,4.0,1.0,1.0,4.0,19.0,0.0,0.0,1.0,0.0,83.0,264.0,3.0,7.0,0.0,86.0,79.0,0.0,Def,HB,Half-back flank,,left_only
3,50.0,Ben Brown,Melbourne,Western Bulldogs,2022,1,W,26.0,9.0,4.0,13.0,8.0,61.5,8.0,3.0,3.0,0.0,11.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,6.0,7.0,2.0,6.0,2.0,0.0,86.0,245.0,1.0,0.0,0.0,77.0,87.0,0.0,Fwd,FF,Full-forward,6,both
4,31.0,Bayley Fritsch,Melbourne,Western Bulldogs,2022,1,W,26.0,8.0,1.0,9.0,5.0,55.6,4.0,2.0,2.0,1.0,6.0,0.0,1.0,0.0,2.0,0.0,2.0,0.0,1.0,3.0,5.0,1.0,3.0,0.0,0.0,81.0,251.0,2.0,0.0,0.0,53.0,61.0,0.0,Fwd,FF,Forward pocket,,left_only
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
68653,,Josh J Kennedy,West Coast,,2015,5,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,10,right_only
68654,,Josh J Kennedy,West Coast,,2015,4,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1,right_only
68655,,David MacKay,Adelaide,,2015,3,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,6,right_only
68656,,Tom J Lynch,Gold Coast,,2015,3,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,5,right_only


In [142]:
# Open mappings prepared in excel from above differences
reader = csv.DictReader(open('AFL_player_names_mapping.csv'))
names_map = {}

for row in reader:
    names_map[row['Key']] = row['Value']

In [143]:
# Standardise names using dictionary mapping
aflca_df.replace({"Player":names_map}, inplace=True)

In [144]:
temp = pd.merge(merged, aflca_df, how='outer', indicator='True', on=['Player', 'Club', 'Round', 'Year'])
# Reveal players that could not be joined with main dataframe
temp[temp['True']=='right_only']['Player'].unique()

array([], dtype=object)

In [145]:
# Add AFLCA votes to main DataFrame
merged = pd.merge(merged, aflca_df, how='outer', indicator='True', on=['Player', 'Club', 'Round', 'Year'])\
            .drop(['True'], axis=1)

In [146]:
merged

Unnamed: 0,#,Player,Club,Opponent,Year,Round,Result,Margin,KI,HB,DI,ED,DE%,MK,GL,BH,GA,SI,HO,TK,RB,IF,CL,CG,FF,FA,CP,UP,CM,MI,1%,BO,%P,MG,TO,ITC,T5,AF,SC,BR,Pos,Gen-pos,Sub-pos,AFLCA
0,12,Toby Bedford,Melbourne,Western Bulldogs,2022,1,W,26.0,5.0,4.0,9.0,7,77.8,3.0,0.0,1.0,0.0,1,0.0,2.0,1.0,1.0,0.0,1.0,2.0,0.0,4.0,5.0,0.0,1.0,2.0,1.0,55.0,124,1,4,1,43,46,0.0,Fwd,HF,Half-forward flank,
1,17,Jake Bowey,Melbourne,Western Bulldogs,2022,1,W,26.0,8.0,1.0,9.0,8,88.9,2.0,1.0,0.0,0.0,1,0.0,0.0,2.0,1.0,1.0,1.0,0.0,0.0,3.0,6.0,0.0,0.0,2.0,0.0,73.0,197,2,4,0,38,58,0.0,Def,HB,Half-back flank,
2,10,Angus Brayshaw,Melbourne,Western Bulldogs,2022,1,W,26.0,12.0,11.0,23.0,15,65.2,6.0,0.0,0.0,0.0,3,0.0,3.0,2.0,2.0,0.0,4.0,1.0,1.0,4.0,19.0,0.0,0.0,1.0,0.0,83.0,264,3,7,0,86,79,0.0,Def,HB,Half-back flank,
3,50,Ben Brown,Melbourne,Western Bulldogs,2022,1,W,26.0,9.0,4.0,13.0,8,61.5,8.0,3.0,3.0,0.0,11,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,6.0,7.0,2.0,6.0,2.0,0.0,86.0,245,1,0,0,77,87,0.0,Fwd,FF,Full-forward,6
4,31,Bayley Fritsch,Melbourne,Western Bulldogs,2022,1,W,26.0,8.0,1.0,9.0,5,55.6,4.0,2.0,2.0,1.0,6,0.0,1.0,0.0,2.0,0.0,2.0,0.0,1.0,3.0,5.0,1.0,3.0,0.0,0.0,81.0,251,2,0,0,53,61,0.0,Fwd,FF,Forward pocket,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
68459,17,Jake Melksham,Essendon,Collingwood,2015,23,W,3.0,10.0,9.0,19.0,14,73.7,5.0,1.0,0.0,0.0,2,0.0,2.0,2.0,4.0,3.0,4.0,1.0,2.0,4.0,14.0,0.0,0.0,0.0,0.0,86.0,430,6,4,0,72,58,0.0,Def,HB,Half-back flank,
68460,20,Jackson Merrett,Essendon,Collingwood,2015,23,W,3.0,10.0,2.0,12.0,10,83.3,5.0,1.0,0.0,1.0,5,0.0,3.0,0.0,2.0,1.0,1.0,1.0,0.0,2.0,11.0,0.0,2.0,0.0,0.0,84.0,206,3,3,0,68,65,0.0,Fwd,FF,Forward pocket,
68461,16,Tayte Pears,Essendon,Collingwood,2015,23,W,3.0,6.0,8.0,14.0,9,64.3,4.0,0.0,0.0,0.0,3,0.0,1.0,0.0,0.0,1.0,2.0,1.0,0.0,5.0,10.0,2.0,0.0,5.0,0.0,88.0,63,3,6,0,51,53,0.0,Fwd,HF,Half-forward flank,
68462,5,Brent Stanton,Essendon,Collingwood,2015,23,W,3.0,22.0,8.0,30.0,22,73.3,8.0,3.0,1.0,0.0,9,0.0,3.0,4.0,5.0,4.0,1.0,0.0,0.0,7.0,20.0,0.0,1.0,1.0,0.0,89.0,645,5,4,1,137,130,3.0,Def,HB,Half-back flank,4


In [147]:
# Reokace NaN values with 0s for players that did not receive votes
merged['AFLCA'] = merged['AFLCA'].fillna(0)

In [148]:
merged.head()

Unnamed: 0,#,Player,Club,Opponent,Year,Round,Result,Margin,KI,HB,DI,ED,DE%,MK,GL,BH,GA,SI,HO,TK,RB,IF,CL,CG,FF,FA,CP,UP,CM,MI,1%,BO,%P,MG,TO,ITC,T5,AF,SC,BR,Pos,Gen-pos,Sub-pos,AFLCA
0,12,Toby Bedford,Melbourne,Western Bulldogs,2022,1,W,26.0,5.0,4.0,9.0,7,77.8,3.0,0.0,1.0,0.0,1,0.0,2.0,1.0,1.0,0.0,1.0,2.0,0.0,4.0,5.0,0.0,1.0,2.0,1.0,55.0,124,1,4,1,43,46,0.0,Fwd,HF,Half-forward flank,0
1,17,Jake Bowey,Melbourne,Western Bulldogs,2022,1,W,26.0,8.0,1.0,9.0,8,88.9,2.0,1.0,0.0,0.0,1,0.0,0.0,2.0,1.0,1.0,1.0,0.0,0.0,3.0,6.0,0.0,0.0,2.0,0.0,73.0,197,2,4,0,38,58,0.0,Def,HB,Half-back flank,0
2,10,Angus Brayshaw,Melbourne,Western Bulldogs,2022,1,W,26.0,12.0,11.0,23.0,15,65.2,6.0,0.0,0.0,0.0,3,0.0,3.0,2.0,2.0,0.0,4.0,1.0,1.0,4.0,19.0,0.0,0.0,1.0,0.0,83.0,264,3,7,0,86,79,0.0,Def,HB,Half-back flank,0
3,50,Ben Brown,Melbourne,Western Bulldogs,2022,1,W,26.0,9.0,4.0,13.0,8,61.5,8.0,3.0,3.0,0.0,11,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,6.0,7.0,2.0,6.0,2.0,0.0,86.0,245,1,0,0,77,87,0.0,Fwd,FF,Full-forward,6
4,31,Bayley Fritsch,Melbourne,Western Bulldogs,2022,1,W,26.0,8.0,1.0,9.0,5,55.6,4.0,2.0,2.0,1.0,6,0.0,1.0,0.0,2.0,0.0,2.0,0.0,1.0,3.0,5.0,1.0,3.0,0.0,0.0,81.0,251,2,0,0,53,61,0.0,Fwd,FF,Forward pocket,0


In [149]:
# Convert AFLCA column to integer and check values
merged = merged.astype({'AFLCA':'int'})
print(merged['AFLCA'].sum())
print(merged['AFLCA'].sum()/1538)

46140
30.0


There were a total of 1538 matches in the dataset and for each match, the two coaches award 5 + 4 + 3 + 2 + 1 = 15 votes to players (15 x 2 = 30).

In [154]:
# Alter column order again
cols = merged.columns.tolist()

cols = cols[:6] + cols[-4:-1] + cols[6:-7] + cols[-7:-5] + [cols[-1]] + [cols[-5]]

merged = merged[cols]

In [155]:
# Export to csv
merged.to_csv("afl_tables_2.csv", index=False)