# Albert Melbourne as a scraping case study

Now that I've figured out how many parkruns are in the Greater Melbourne area -- 49! -- I'm going to use Albert Melbourne as a case study to figure out how to scrape the data I want from each event, given the URLs all follow a common pattern.
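
Since the URLs follow a common pattern, the address for any event can be built from its slug. This is a minimal sketch, assuming the Albert Melbourne pattern holds for the other events. `latest_results_url` and `event_slugs` are my own illustrative names, and `albertmelbourne` is the only slug taken from this notebook.

```python
BASE_URL = 'https://www.parkrun.com.au'

def latest_results_url(event_slug):
    """Build the latest-results URL for one event from its slug (hypothetical helper)."""
    return f'{BASE_URL}/{event_slug}/results/latestresults/'

# Only 'albertmelbourne' comes from this notebook; other slugs would be appended here.
event_slugs = ['albertmelbourne']
urls = [latest_results_url(slug) for slug in event_slugs]
print(urls[0])
```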

**Scraping data from the Latest results page: https://www.parkrun.com.au/albertmelbourne/results/latestresults/**

In [1]:
import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ' 
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/114.0.0.0 Safari/537.36'
}

response = requests.get('https://www.parkrun.com.au/albertmelbourne/results/latestresults/', headers=headers)
response.raise_for_status()

In [2]:
doc = BeautifulSoup(response.text, 'html.parser')
doc

<!DOCTYPE html>

<html lang="en-US">
<head>
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<link href="/wp-content/themes/parkrun/favicons/apple-touch-icon.png" rel="apple-touch-icon" sizes="180x180"/>
<link href="/wp-content/themes/parkrun/favicons/favicon-32x32.png" rel="icon" sizes="32x32" type="image/png"/>
<link href="/wp-content/themes/parkrun/favicons/favicon-16x16.png" rel="icon" sizes="16x16" type="image/png"/>
<link href="/wp-content/themes/parkrun/favicons/site.webmanifest" rel="manifest"/>
<link color="#2b233d" href="/wp-content/themes/parkrun/favicons/safari-pinned-tab.svg" rel="mask-icon"/>
<link href="/wp-content/themes/parkrun/favicons/favicon.ico" rel="shortcut icon"/>
<meta content="#da532c" name="msapplication-TileColor"/>
<meta content="/wp-content/themes/parkrun/favicons/browserconfig.xml" name="msapplication-config"/>
<meta content="#ffffff" name="theme-color"/>
<meta content="Albert Park, Melbourne, Australia" name

In [3]:
results_table = doc.find('table')
results_table

<table class="Results-table Results-table--compact js-ResultsTable"><thead><tr class="Results-table-thead"><th class="Results-table-th Results-table-th--position">Position</th><th class="Results-table-th Results-table-th--name">parkrunner</th><th class="Results-table-th Results-table-th--gender">Gender</th><th class="Results-table-th Results-table-th--ageGroup">Age Group</th><th class="Results-table-th Results-table-th--club">Club</th><th class="Results-table-th Results-table-th--time">Time</th></tr></thead><tbody class="js-ResultsTbody"><tr class="Results-table-row" data-achievement="New PB!" data-agegrade="76.55" data-agegroup="SM20-24" data-club="Maitland Triathlon Club" data-gender="Male" data-groups="Maitland Triathlon Club" data-name="Cooper LEE" data-position="1" data-runs="121" data-vols="1"><td class="Results-table-td Results-table-td--position">1</td><td class="Results-table-td Results-table-td--name"><div class="compact"><a href="/albertmelbourne/parkrunner/2242132" target="

In [4]:
results_table = doc.find('table', class_='Results-table')

rows = results_table.find('tbody').find_all('tr')
rows

[<tr class="Results-table-row" data-achievement="New PB!" data-agegrade="76.55" data-agegroup="SM20-24" data-club="Maitland Triathlon Club" data-gender="Male" data-groups="Maitland Triathlon Club" data-name="Cooper LEE" data-position="1" data-runs="121" data-vols="1"><td class="Results-table-td Results-table-td--position">1</td><td class="Results-table-td Results-table-td--name"><div class="compact"><a href="/albertmelbourne/parkrunner/2242132" target="_top">Cooper LEE</a></div><div class="detailed">121 parkruns<span class="Results-tablet Results-tablet--inline"><span class="spacer"> | </span><span class="Results-table--M">
                               Male
                             </span>1<span class="Results-table--genderCount" data-gender="Male">
                            
                         </span></span><span class="spacer"> | </span><a class="milestone-r100 Results-table--clubIcon Results-table--100club" href="https://www.parkrun.com/about/our-clubs/" title="Member 
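
The `<tr>` tags in the output above already carry most of the values I want as `data-*` attributes (`data-name`, `data-position`, `data-agegroup`, `data-agegrade` and so on). A possible alternative to parsing the cell text is to read those attributes directly. This is a sketch on a cut-down copy of the first row. `row_to_record` is a hypothetical helper, and I haven't checked that every row carries every attribute, hence the `.get()` calls that return `None` when one is missing.

```python
from bs4 import BeautifulSoup

# Cut-down copy of the first row from the output above (most cells omitted).
sample_html = '''
<table><tbody>
<tr class="Results-table-row" data-achievement="New PB!" data-agegrade="76.55"
    data-agegroup="SM20-24" data-club="Maitland Triathlon Club" data-gender="Male"
    data-name="Cooper LEE" data-position="1" data-runs="121" data-vols="1">
  <td>1</td>
</tr>
</tbody></table>
'''

def row_to_record(row):
    """Collect the data-* attributes of one <tr> into a plain dict (hypothetical helper)."""
    return {
        'Position': row.get('data-position'),
        'Name': row.get('data-name'),
        'Gender': row.get('data-gender'),
        'Age Group': row.get('data-agegroup'),
        'Age Grade': row.get('data-agegrade'),
        'Club': row.get('data-club'),
    }

sample_rows = BeautifulSoup(sample_html, 'html.parser').find_all('tr')
records = [row_to_record(row) for row in sample_rows]
print(records[0])
```

The finish time does not appear among the attributes shown above, so it would still have to come from the Time cell.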

In [5]:
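# For each row, extract data such as position, name, time, etc.
# Note: the table has six columns (Position, parkrunner, Gender, Age Group, Club, Time),
# so cols[4] below is actually the Club and the Time is in cols[5]. The next cell corrects this.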
for row in rows:
    cols = row.find_all('td')
    if len(cols) < 5:
        continue  # skip malformed rows

    position = cols[0].text.strip()
    name = cols[1].text.strip()
    gender = cols[2].text.strip()
    age_category = cols[3].text.strip()
    time = cols[4].text.strip()

    print(f"{position}: {name} ({gender}, {age_category}) - {time}")

1: Cooper LEE121 parkruns | 
                              Male
                            1
                           
                         | 
                        Member of the 100 Club
                      SM20-24 | 76.55%
                          Maitland Triathlon Club (Male
                      1, SM20-2476.55% age grade) - Maitland Triathlon Club
2: Glenn FARRAR4 parkruns | 
                              Male
                            2
                           
                        SM30-34 | 75.61% (Male
                      2, SM30-3475.61% age grade) - 
3: Tom GRAY3 parkruns | 
                              Male
                            3
                           
                        JM15-17 | 79.81% (Male
                      3, JM15-1779.81% age grade) - 
4: Paris POLLOCK61 parkruns | 
                              Male
                            4
                           
                         | 
                        Member of the 50
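
The run-together text above comes from `.text` joining every nested string in a cell with nothing in between. Here is a small sketch of two ways around that, on a simplified name cell modelled on the markup above: take the `<a>` tag on its own, or pass a separator to `get_text()`.

```python
from bs4 import BeautifulSoup

# Simplified version of a name cell from the results table.
cell_html = (
    '<td><div class="compact">'
    '<a href="/albertmelbourne/parkrunner/2242132">Cooper LEE</a></div>'
    '<div class="detailed">121 parkruns</div></td>'
)
cell = BeautifulSoup(cell_html, 'html.parser').find('td')

raw_text = cell.text                                     # nested strings joined with no separator
spaced_text = cell.get_text(separator=' ', strip=True)   # each string stripped, then joined by a space
link_text = cell.find('a').get_text(strip=True)          # only the runner's name

print(repr(raw_text))
print(repr(spaced_text))
print(repr(link_text))
```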

In [6]:
# Turn into a dataframe instead of printing

import pandas as pd

data = []

for row in rows:
    cols = [td.text.strip() for td in row.find_all('td')]
    
    if len(cols) == 6:
        data.append({
            'Position': cols[0],
            'Name': cols[1],
            'Gender': cols[2],
            'Age Group': cols[3],
            'Club': cols[4],
            'Time': cols[5]
        })

# Create a DataFrame
df = pd.DataFrame(data)

# Show the top of the DataFrame
df.head(20)

Unnamed: 0,Position,Name,Gender,Age Group,Club,Time
0,1,Cooper LEE121 parkruns | \n ...,Male\n 1,SM20-2476.55% age grade,Maitland Triathlon Club,16:55New PB!
1,2,Glenn FARRAR4 parkruns | \n ...,Male\n 2,SM30-3475.61% age grade,,17:09New PB!
2,3,Tom GRAY3 parkruns | \n ...,Male\n 3,JM15-1779.81% age grade,,17:10First Timer!
3,4,Paris POLLOCK61 parkruns | \n ...,Male\n 4,VM35-3976.65% age grade,,17:25PB 17:06
4,5,William CANNON36 parkruns | \n ...,Male\n 5,SM30-3474.74% age grade,,17:33PB 16:11
5,6,Lauchlan THOMPSON46 parkruns | \n ...,Male\n 6,SM30-3473.26% age grade,Correbirras,17:42PB 17:37
6,7,Mark RICHARDS19 parkruns | \n ...,Male\n 7,SM25-2972.63% age grade,,17:47PB 16:16
7,8,Gavi PINCUS85 parkruns | \n ...,Male\n 8,SM30-3472.46% age grade,,17:51New PB!
8,9,Patrick ATKINSON84 parkruns | \n ...,Male\n 9,SM30-3471.85% age grade,,18:00New PB!
9,10,Unknown,,,,


In [7]:
# My df needs some cleaning, so let's look at what is being captured in its entirety.

df.iloc[0]

Position                                                     1
Name         Cooper LEE121 parkruns | \n                   ...
Gender                           Male\n                      1
Age Group                              SM20-2476.55% age grade
Club                                   Maitland Triathlon Club
Time                                              16:55New PB!
Name: 0, dtype: object

In [8]:
df.iloc[761]

Position                                                   762
Name         Alex DAWSON4 parkruns | \n                    ...
Gender                         Male\n                      407
Age Group                              VM40-4421.33% age grade
Club                                                          
Time                                           1:05:29PB 43:50
Name: 761, dtype: object

### Redoing my analysis from here: the first time I didn't get the cleaning step exactly right, and I want to do it all in one step instead of two.

In [10]:
# Time to use some regular expressions with the help of ChatGPT to clean this up. 

import re

data = []

for row in rows:
    cols = row.find_all('td')
    if len(cols) != 6:
        continue

    # Clean and extract values
    position = cols[0].text.strip()

    # Name is in an <a> tag; fall back to the whole cell's text if there is no link
    name_link = cols[1].find('a')
    name = name_link.text.strip() if name_link else cols[1].text.strip()

    # Clean gender and gender position
    gender_raw = cols[2].text.strip()
    gender_lines = gender_raw.split('\n')
    gender = gender_lines[0].strip()
    gender_pos = gender_lines[1].strip() if len(gender_lines) > 1 else ''

    # Age group might include age grade
    age_group_raw = cols[3].text.strip()
    # Updated regex to handle both age formats: SM20-24, JW10, JM11-14 etc.
    match = re.search(r'([A-Z]{2})(\d{2}(?:-\d{2})?)\s*(\d{2,3}\.\d{2}%)?', age_group_raw)
    if match:
        age_category = match.group(1)      # e.g., "SM", "JM", "JW"
        age_range = match.group(2)         # e.g., "20-24", "10"
        age_grade = match.group(3) or ''   # e.g., "76.55%"
    else:
        # Fallbacks if the regex doesn't match (rare cases)
        age_category = ''
        age_range = ''
        age_grade = ''

    club = cols[4].text.strip()

    # Time might include "New PB!" or other notes
    time_raw = cols[5].text.strip()
    time_match = re.match(r'(\d{1,2}:\d{2}(?::\d{2})?)(.*)?', time_raw)

    if time_match:
        time = time_match.group(1)
        note = time_match.group(2).strip() if time_match.group(2) else ''
    else:
        time = time_raw
        note = ''

    data.append({
        'Position': position,
        'Name': name,
        'Gender': gender,
        'Gender Pos': gender_pos,
        'Age Category': age_category,
        'Age Range': age_range,
        'Age Grade': age_grade,
        'Club': club,
        'Time': time,
        'Note': note
    })

cleaned_df = pd.DataFrame(data)
cleaned_df.head(15)

Unnamed: 0,Position,Name,Gender,Gender Pos,Age Category,Age Range,Age Grade,Club,Time,Note
0,1,Cooper LEE,Male,1.0,SM,20-24,76.55%,Maitland Triathlon Club,16:55,New PB!
1,2,Glenn FARRAR,Male,2.0,SM,30-34,75.61%,,17:09,New PB!
2,3,Tom GRAY,Male,3.0,JM,15-17,79.81%,,17:10,First Timer!
3,4,Paris POLLOCK,Male,4.0,VM,35-39,76.65%,,17:25,PB 17:06
4,5,William CANNON,Male,5.0,SM,30-34,74.74%,,17:33,PB 16:11
5,6,Lauchlan THOMPSON,Male,6.0,SM,30-34,73.26%,Correbirras,17:42,PB 17:37
6,7,Mark RICHARDS,Male,7.0,SM,25-29,72.63%,,17:47,PB 16:16
7,8,Gavi PINCUS,Male,8.0,SM,30-34,72.46%,,17:51,New PB!
8,9,Patrick ATKINSON,Male,9.0,SM,30-34,71.85%,,18:00,New PB!
9,10,Unknown,,,,,,,,
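
To check the two regular expressions without re-scraping, they can be run on strings copied from the raw output earlier in the notebook. This is a sketch only: `split_age_group` and `split_time` are hypothetical wrappers around the same patterns used in the cell above.

```python
import re

AGE_GROUP_RE = re.compile(r'([A-Z]{2})(\d{2}(?:-\d{2})?)\s*(\d{2,3}\.\d{2}%)?')
TIME_RE = re.compile(r'(\d{1,2}:\d{2}(?::\d{2})?)(.*)?')

def split_age_group(text):
    """Return (category, range, grade); empty strings when nothing matches."""
    match = AGE_GROUP_RE.search(text)
    if not match:
        return ('', '', '')
    return (match.group(1), match.group(2), match.group(3) or '')

def split_time(text):
    """Return (time, note); the raw text and an empty note when nothing matches."""
    match = TIME_RE.match(text)
    if not match:
        return (text, '')
    return (match.group(1), (match.group(2) or '').strip())

print(split_age_group('SM20-2476.55% age grade'))
print(split_time('16:55New PB!'))
print(split_time('1:05:29PB 43:50'))
```

The optional `(?::\d{2})?` group is what lets the same pattern handle both `MM:SS` and `H:MM:SS` times.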


In [11]:
cleaned_df = cleaned_df.drop(columns=['Club', 'Note'])
cleaned_df

Unnamed: 0,Position,Name,Gender,Gender Pos,Age Category,Age Range,Age Grade,Time
0,1,Cooper LEE,Male,1,SM,20-24,76.55%,16:55
1,2,Glenn FARRAR,Male,2,SM,30-34,75.61%,17:09
2,3,Tom GRAY,Male,3,JM,15-17,79.81%,17:10
3,4,Paris POLLOCK,Male,4,VM,35-39,76.65%,17:25
4,5,William CANNON,Male,5,SM,30-34,74.74%,17:33
...,...,...,...,...,...,...,...,...
757,758,Longfei WANG,Female,306,VW,35-39,23.57%,1:03:39
758,759,Nathan LAWRENCE,Male,406,VM,40-44,21.44%,1:03:40
759,760,Jo BUCKLE,Female,307,VW,55-59,27.30%,1:05:27
760,761,Joanne WATKINS,Female,308,VW,50-54,25.61%,1:05:28


In [12]:
# My first attempt was cleaned_df[cleaned_df['Name'] == 'Unknown'].value_counts(), which was overcomplicating things again!

cleaned_df['Name'].value_counts()

Name
Unknown              45
Cooper LEE            1
Yee Vien NG           1
Liam PHILP            1
Emily DEVINE          1
                     ..
Lauren SUTHERLAND     1
Rick DRURY            1
Charlie EVANS         1
Kirk CETINIC          1
Alex DAWSON           1
Name: count, Length: 718, dtype: int64

In [13]:
# There were 45 Unknowns + 2 runners who didn't give their gender + 407 male + 308 female = 762 parkrunners

cleaned_df['Gender'].value_counts()

Gender
Male      407
Female    308
           47
Name: count, dtype: int64
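
The blank entry in that count is an empty string rather than a missing value, so `isna()` on the column as it stands wouldn't pick it up. Here is a rough sketch of making the blanks explicit and separating the Unknowns from named runners with no gender, on a made-up three-row frame (`Placeholder RUNNER` is not a real row).

```python
import pandas as pd

sample = pd.DataFrame({
    'Name': ['Cooper LEE', 'Unknown', 'Placeholder RUNNER'],
    'Gender': ['Male', '', ''],
})

# Turn empty strings into proper missing values so pandas can count them.
gender = sample['Gender'].replace('', pd.NA)
is_unknown = sample['Name'].eq('Unknown')

blank_gender = int(gender.isna().sum())                      # Unknowns plus named runners with no gender
named_no_gender = int((gender.isna() & ~is_unknown).sum())   # named runners with no gender only
print(blank_gender, named_no_gender)

# The same bookkeeping for the real counts reported above.
total_runners = 45 + 2 + 407 + 308
print(total_runners)
```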

In [14]:
# May need to change some of the data types of my columns so I can manipulate them more easily.

cleaned_df.dtypes

Position        object
Name            object
Gender          object
Gender Pos      object
Age Category    object
Age Range       object
Age Grade       object
Time            object
dtype: object

In [15]:
# Changing time to time in seconds. 

def time_to_seconds(t):
    try:
        parts = t.strip().split(':')
        parts = [int(p) for p in parts if p != '']  # skip empty parts
        if len(parts) == 3:  # H:MM:SS
            return parts[0]*3600 + parts[1]*60 + parts[2]
        elif len(parts) == 2:  # MM:SS
            return parts[0]*60 + parts[1]
        else:
            return None  # unexpected format
    except Exception:
        return None  # catch malformed or missing values

cleaned_df['TimeSeconds'] = cleaned_df['Time'].apply(time_to_seconds)
cleaned_df

Unnamed: 0,Position,Name,Gender,Gender Pos,Age Category,Age Range,Age Grade,Time,TimeSeconds
0,1,Cooper LEE,Male,1,SM,20-24,76.55%,16:55,1015.0
1,2,Glenn FARRAR,Male,2,SM,30-34,75.61%,17:09,1029.0
2,3,Tom GRAY,Male,3,JM,15-17,79.81%,17:10,1030.0
3,4,Paris POLLOCK,Male,4,VM,35-39,76.65%,17:25,1045.0
4,5,William CANNON,Male,5,SM,30-34,74.74%,17:33,1053.0
...,...,...,...,...,...,...,...,...,...
757,758,Longfei WANG,Female,306,VW,35-39,23.57%,1:03:39,3819.0
758,759,Nathan LAWRENCE,Male,406,VM,40-44,21.44%,1:03:40,3820.0
759,760,Jo BUCKLE,Female,307,VW,55-59,27.30%,1:05:27,3927.0
760,761,Joanne WATKINS,Female,308,VW,50-54,25.61%,1:05:28,3928.0
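
As a sanity check on the conversion: 16:55 is `16*60 + 55 = 1015` seconds and 1:05:28 is `3600 + 5*60 + 28 = 3928` seconds, which matches the first and last rows shown above. A compact variant of the same idea, written as a sketch (`to_seconds` is my own name, not a pandas function), treats each colon as "multiply the running total by 60".

```python
def to_seconds(text):
    """Convert 'MM:SS' or 'H:MM:SS' to seconds; return None for anything else."""
    if not isinstance(text, str):
        return None
    parts = text.strip().split(':')
    if len(parts) not in (2, 3) or not all(p.isdigit() for p in parts):
        return None
    total = 0
    for part in parts:
        total = total * 60 + int(part)  # shift the running total up one unit, then add
    return total

print(to_seconds('16:55'))
print(to_seconds('1:05:28'))
print(to_seconds(''))
```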


In [16]:
cleaned_df.dtypes

Position         object
Name             object
Gender           object
Gender Pos       object
Age Category     object
Age Range        object
Age Grade        object
Time             object
TimeSeconds     float64
dtype: object

In [17]:
# Unknown runners are not given times, only positions.

cleaned_df['TimeSeconds'].value_counts(dropna=False)

TimeSeconds
NaN       45
1660.0     3
1711.0     3
1403.0     3
1635.0     3
          ..
1571.0     1
1574.0     1
1575.0     1
1576.0     1
3929.0     1
Name: count, Length: 633, dtype: int64

In [18]:
# Changing age grade to a decimal.

def clean_age_grade(grade):
    try:
        return float(grade.strip().replace('%', '')) / 100
    except (AttributeError, ValueError):
        return None  # blank or non-string value, e.g. Unknown runners

cleaned_df['AgeGradeDecimal'] = cleaned_df['Age Grade'].apply(clean_age_grade)
cleaned_df

Unnamed: 0,Position,Name,Gender,Gender Pos,Age Category,Age Range,Age Grade,Time,TimeSeconds,AgeGradeDecimal
0,1,Cooper LEE,Male,1,SM,20-24,76.55%,16:55,1015.0,0.7655
1,2,Glenn FARRAR,Male,2,SM,30-34,75.61%,17:09,1029.0,0.7561
2,3,Tom GRAY,Male,3,JM,15-17,79.81%,17:10,1030.0,0.7981
3,4,Paris POLLOCK,Male,4,VM,35-39,76.65%,17:25,1045.0,0.7665
4,5,William CANNON,Male,5,SM,30-34,74.74%,17:33,1053.0,0.7474
...,...,...,...,...,...,...,...,...,...,...
757,758,Longfei WANG,Female,306,VW,35-39,23.57%,1:03:39,3819.0,0.2357
758,759,Nathan LAWRENCE,Male,406,VM,40-44,21.44%,1:03:40,3820.0,0.2144
759,760,Jo BUCKLE,Female,307,VW,55-59,27.30%,1:05:27,3927.0,0.2730
760,761,Joanne WATKINS,Female,308,VW,50-54,25.61%,1:05:28,3928.0,0.2561
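
A vectorised alternative to `clean_age_grade`, sketched on two values from the table above plus one blank. `pd.to_numeric` with `errors='coerce'` turns anything it can't parse into `NaN`, so no `try`/`except` is needed.

```python
import pandas as pd

grades = pd.Series(['76.55%', '75.61%', ''])

# Drop the trailing percent sign, parse as numbers, then scale to a 0-1 decimal.
grade_decimal = pd.to_numeric(grades.str.rstrip('%'), errors='coerce') / 100
print(grade_decimal)
```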


In [19]:
cleaned_df['AgeGradeDecimal'].value_counts(dropna=False)

AgeGradeDecimal
NaN       47
0.5344     3
0.4372     3
0.5716     3
0.4663     3
          ..
0.6161     1
0.4937     1
0.5696     1
0.5066     1
0.2133     1
Name: count, Length: 654, dtype: int64

In [20]:
# Age Range is tied to gender because both are part of Age Group (I've split them here for analysis later), so people who don't give their gender will also be missing this.

cleaned_df['Age Range'].value_counts(dropna=False)

Age Range
25-29    130
30-34    100
35-39     83
45-49     63
55-59     60
40-44     53
50-54     48
20-24     47
          47
60-64     42
65-69     20
11-14     17
15-17     13
70-74     12
75-79     10
18-19      6
10         5
80-84      5
85-89      1
Name: count, dtype: int64

In [22]:
cleaned_df['Position'].value_counts(dropna=False)

Position
1      1
512    1
503    1
504    1
505    1
      ..
257    1
258    1
259    1
260    1
762    1
Name: count, Length: 762, dtype: int64

In [23]:
cleaned_df['Gender Pos'].value_counts(dropna=False)

Gender Pos
       47
1       2
205     2
212     2
211     2
       ..
338     1
337     1
336     1
335     1
407     1
Name: count, Length: 408, dtype: int64
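
`Gender Pos` is still text, and the 47 blanks mean a plain `astype('int64')` would raise an error. One option, shown as a sketch on a few sample values: coerce to numbers first, then use pandas' nullable `Int64` dtype so the gaps stay as `<NA>` and the rest stay whole numbers.

```python
import pandas as pd

gender_pos = pd.Series(['1', '', '407'])

# errors='coerce' turns the blank into NaN; 'Int64' (capital I) is the nullable integer dtype.
gender_pos_int = pd.to_numeric(gender_pos, errors='coerce').astype('Int64')
print(gender_pos_int)
```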

In [24]:
cleaned_df['Position'] = cleaned_df['Position'].astype('int64')

In [25]:
cleaned_df.dtypes

Position             int64
Name                object
Gender              object
Gender Pos          object
Age Category        object
Age Range           object
Age Grade           object
Time                object
TimeSeconds        float64
AgeGradeDecimal    float64
dtype: object

## Other things I considered doing but don't need, as they're outside the scope of my analysis

* Convert Age Range to a pandas Categorical type: https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html
* Reorder the columns --> a nice-to-have not a need-to-have
* Scrape this page as well: https://www.parkrun.com.au/albertmelbourne/results/eventhistory/
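
For the first bullet, this is roughly what the conversion would look like. It is a sketch, not something I ran on the full data. The category order is taken from the age ranges that appear in the counts earlier in the notebook, and blanks fall outside the categories, so they become missing values.

```python
import pandas as pd

age_order = ['10', '11-14', '15-17', '18-19', '20-24', '25-29', '30-34', '35-39',
             '40-44', '45-49', '50-54', '55-59', '60-64', '65-69', '70-74',
             '75-79', '80-84', '85-89']
age_dtype = pd.CategoricalDtype(categories=age_order, ordered=True)

# Values not listed in age_order (here the empty string) are converted to NaN.
ages = pd.Series(['25-29', '10', '', '85-89']).astype(age_dtype)
print(ages.min(), ages.max())
```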

In [26]:
cleaned_df.to_csv("albertmelbourne_scrapingresultseg.csv", index=False)
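
One thing to keep in mind when reading this file back in: CSV stores no dtypes, so pandas infers them again and blank strings come back as `NaN`. A quick round trip through an in-memory buffer (standing in for the file on disk) shows the effect on a two-row sample.

```python
import io
import pandas as pd

sample = pd.DataFrame({
    'Position': [1, 10],
    'Name': ['Cooper LEE', 'Unknown'],
    'Gender Pos': ['1', ''],
})

buffer = io.StringIO()
sample.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# 'Gender Pos' was text going in; coming back, the blank is NaN and the column is float.
print(reloaded.dtypes)
print(reloaded)
```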