Script to scrape Victorian public holidays and prepare onsite/remote school term dates.<br>

Victorian public holidays are taken from <b>victoriapublicholiday.com.au</b>.<br>

In [15]:
import pandas as pd
from bs4 import BeautifulSoup
import urllib3
import re

In [16]:
http = urllib3.PoolManager()

In [20]:
rows = []
columns = []
for year in range(2015, 2021):  # 2015 to 2020 inclusive
    url = f'https://victoriapublicholiday.com.au/{year}-holiday-list'
    response = http.request('GET', url)

    # Name the parser explicitly to avoid bs4's "no parser specified" warning
    soup = BeautifulSoup(response.data, 'html.parser')

    table = soup.find('div', attrs={"class": "table"})

    # The header cells are the same every year, so read them only once
    if not columns:
        columns = [c.text for c in table.find_all('th')]

    # Header rows contain no <td> cells and yield empty lists (dropped below)
    for row in table.find_all('tr'):
        rows.append([val.text for val in row.find_all('td')])

rows = [r for r in rows if r != []]

df = pd.DataFrame(data=rows, columns=columns)
df

Unnamed: 0,Holiday,Date,Holiday Type,Area
0,New Year's Day,"Thursday, 1 January 2015",Public,Vic Wide
1,Australia Day,"Monday, 26 January 2015",Public,Vic Wide
2,Labour Day,"Monday, 9 March 2015",Public,Vic Wide
3,Good Friday,"Friday, 3 April 2015",Public,Vic Wide
4,Saturday before Easter Sunday,"Saturday, 4 April 2015",Public,Vic Wide
...,...,...,...,...
77,Friday before AFL Grand Final Holiday,"Friday, 23 October 2020",Public,Vic Wide
78,*Melbourne Cup,"Tuesday, 3 November 2020",Public,Vic Most Areas
79,Christmas Day,"Friday, 25 December 2020",Public,Vic Wide
80,Boxing Day,"Saturday, 26 December 2020",Public,Vic Wide


In [21]:
df['Date'] = df['Date'].apply(lambda x: x.split(',')[1].strip())
df[['Holiday', 'Date']].to_csv("Vic_Holidays.csv", index=False)
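
The cell above keeps the dates as text such as `1 January 2015`. If real dates are needed later (for sorting or joining), they can be parsed with `pd.to_datetime`. The following is a minimal sketch on two sample rows copied from the table above; the `sample` frame and the `iso_dates` name are illustrative and not part of the notebook, and the month names are assumed to be parsed under an English locale.

```python
import pandas as pd

# Two sample rows copied from the scraped table (illustrative data only)
sample = pd.DataFrame(
    {
        "Holiday": ["New Year's Day", "Boxing Day"],
        "Date": ["Thursday, 1 January 2015", "Saturday, 26 December 2020"],
    }
)

# Drop the weekday prefix, as the notebook does, then parse day/month/year
day_month_year = sample["Date"].str.split(",").str[1].str.strip()
sample["Date"] = pd.to_datetime(day_month_year, format="%d %B %Y")

# ISO strings are unambiguous and sort chronologically
iso_dates = sample["Date"].dt.strftime("%Y-%m-%d").tolist()
```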

<b>In 2020 Victorian schools were partially and then completely moved to remote education.</b>
Public and private schools did not switch to remote learning at the same time. <b>For simplicity, assume all schools went remote at the same time.</b> Below is a timeline of relevant events extracted from news reports and @DanielAndrewsMP:

- Term 2 was remote from the start (https://twitter.com/DanielAndrewsMP/status/1247406308178849794).
- Remote learning ended on 26 May (https://twitter.com/DanielAndrewsMP/status/1259975151371776000/photo/1).
- Term 3 was remote from 4 August until the end of term, due to the stage 4 lockdown (https://www.abc.net.au/news/2020-08-02/coronavirus-changes-victorian-schools-and-child-care-explained/12516544).
- Term 4 started remote and stayed remote up to the last day of this data series.

<b>Adjust the school term schedule so that it covers onsite dates only.</b>
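
The timeline above maps onto the term table as three changes: Term 2 starts later, Term 3 finishes earlier, and Term 4 disappears. A minimal sketch of that mapping follows, selecting rows by term name and year rather than by position. The helper `apply_onsite_overrides` and the `onsite_overrides` dictionary are hypothetical names introduced for illustration; the dates are the ones used in this notebook.

```python
import pandas as pd

# A few rows in the same shape as "School term days.txt"
terms = pd.DataFrame(
    {
        "terms": ["Term 1", "Term 2", "Term 3", "Term 4", "Term 1", "Term 2"],
        "Start date": ["29 January 2020", "15 April 2020", "13 July 2020",
                       "5 October 2020", "30 January 2019", "23 April 2019"],
        "Finish date": ["24 March 2020", "26 June 2020", "18 September 2020",
                        "18 December 2020", "5 April 2019", "28 June 2019"],
    }
)

# Hypothetical encoding of the timeline: a dict of column changes per term,
# or None when the whole term was remote and should be removed.
onsite_overrides = {
    "Term 2": {"Start date": "26 May 2020"},     # onsite again from 26 May
    "Term 3": {"Finish date": "4 August 2020"},  # remote from 4 August
    "Term 4": None,                              # remote for the whole series
}


def apply_onsite_overrides(terms, overrides, year):
    """Return a copy of `terms` with the overrides applied to one year."""
    out = terms.copy()
    in_year = out["Start date"].str.endswith(str(year))
    drop = pd.Series(False, index=out.index)
    for term, change in overrides.items():
        mask = in_year & (out["terms"] == term)
        if change is None:
            drop |= mask
        else:
            for col, value in change.items():
                out.loc[mask, col] = value
    return out[~drop].reset_index(drop=True)


onsite_terms = apply_onsite_overrides(terms, onsite_overrides, 2020)
```

Matching on term name and year keeps the edits valid even if the row order of the input file changes; rows for other years are left untouched.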

In [9]:
with open("School term days.txt") as f:
    rows = f.readlines()

rows = [r.strip() for r in rows]

rows[:10]

['terms\tStart date\tFinish date',
 'Term 1\t29 January 2020\t24 March 2020',
 'Term 2\t15 April 2020\t26 June 2020',
 'Term 3\t13 July 2020\t18 September 2020',
 'Term 4\t5 October 2020\t18 December 2020',
 'Term 1\t30 January 2019\t5 April 2019',
 'Term 2\t23 April 2019\t28 June 2019',
 'Term 3\t15 July 2019\t20 September 2019',
 'Term 4\t7 October 2019\t20 December 2019',
 'Term 1\t30 January 2018\t29 March 2018']

In [10]:
tabs_pttn = re.compile('(\\t)+')
rows = [tabs_pttn.sub(", ", r).split(',') for r in rows]
rows[:10]

[['terms', ' Start date', ' Finish date'],
 ['Term 1', ' 29 January 2020', ' 24 March 2020'],
 ['Term 2', ' 15 April 2020', ' 26 June 2020'],
 ['Term 3', ' 13 July 2020', ' 18 September 2020'],
 ['Term 4', ' 5 October 2020', ' 18 December 2020'],
 ['Term 1', ' 30 January 2019', ' 5 April 2019'],
 ['Term 2', ' 23 April 2019', ' 28 June 2019'],
 ['Term 3', ' 15 July 2019', ' 20 September 2019'],
 ['Term 4', ' 7 October 2019', ' 20 December 2019'],
 ['Term 1', ' 30 January 2018', ' 29 March 2018']]
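
As an alternative to substituting tabs and splitting by hand, pandas can read tab-separated text directly. The sketch below is not what this notebook does; it uses an in-memory sample instead of the real file, and it assumes that fields may be separated by one or more tabs, which is what the `(\t)+` pattern above suggests.

```python
import io

import pandas as pd

# In-memory stand-in for "School term days.txt" (sample rows only);
# the second data row uses a double tab to mimic uneven separators.
raw = (
    "terms\tStart date\tFinish date\n"
    "Term 1\t29 January 2020\t24 March 2020\n"
    "Term 2\t\t15 April 2020\t26 June 2020\n"
)

# A regular-expression separator requires the Python parsing engine
terms = pd.read_csv(io.StringIO(raw), sep=r"\t+", engine="python")
```

Read this way, the values carry no leading spaces, so no separate stripping pass is needed.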

In [11]:
df = pd.DataFrame(data = rows[1:], columns = [r.strip() for r in rows[0]])
for col in df.columns:
    df[col] = df[col].apply(lambda x: x.strip())

In [13]:
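# Keep only the onsite part of each 2020 term (rows 1 to 3 are Terms 2 to 4 of 2020).
# Use .loc instead of chained indexing so the writes are applied to df itself.
df.loc[2, 'Finish date'] = '4 August 2020'  # Term 3: remote from 4 August
df.loc[1, 'Start date'] = '26 May 2020'     # Term 2: onsite again from 26 May
df = df.drop([3])                           # Term 4: remote for the whole data series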

In [14]:
df.to_csv('Vic_School_Terms.csv', index=False)