In [1]:
# COMP5339 Week 4 Tutorial
# Material last updated: 25 August 2025
# Note materials were designed with the Roboto Condensed font, which can be installed here: https://www.1001fonts.com/roboto-condensed-font.html

from IPython.display import HTML
HTML('''
    <style> body {font-family: "Roboto Condensed Light", "Roboto Condensed";} h2 {padding: 10px 12px; background-color: #E64626; position: static; color: #ffffff; font-size: 40px;} .text_cell_render p { font-size: 15px; } .text_cell_render h1 { font-size: 30px; } h1 {padding: 10px 12px; background-color: #E64626; color: #ffffff; font-size: 40px;} .text_cell_render h3 { padding: 10px 12px; background-color: #0148A4; position: static; color: #ffffff; font-size: 20px;} h4:before{ 
    content: "@"; font-family:"Wingdings"; font-style:regular; margin-right: 4px;} .text_cell_render h4 {padding: 8px; font-family: "Roboto Condensed Light"; position: static; font-style: italic; background-color: #FFB800; color: #ffffff; font-size: 18px; text-align: center; border-radius: 5px;}input[type=submit] {background-color: #E64626; border: solid; border-color: #734036; color: white; padding: 8px 16px; text-decoration: none; margin: 4px 2px; cursor: pointer; border-radius: 20px;}</style>
    <script> code_show=true; function code_toggle() {if (code_show){$('div.input').hide();} else {$('div.input').show();} code_show = !code_show} $( document ).ready(code_toggle);</script>
    <form action="javascript:code_toggle()"><input type="submit" value="Hide/show all code."></form>
''')

# Week 4 - Web Scraping

Not all data is presented as neatly as a structured dataframe of rows and columns, like we've become accustomed to in previous weeks. Often, meaningful information exists in unstructured or semi-structured formats. Take, for example, the internet. Worlds of information exist across millions of webpages, and extracting particular fields of interest from these pages is our focus today.

This will require the following Python libraries:
- **Requests**        for interacting with websites and web services
- **BeautifulSoup**   for webpage parsing
- **HTML5Lib**        for the actual parser that BeautifulSoup uses
- **Pandas**          for dataframe management

To use the above, you will need to have the following libraries installed (using either pip3 or Anaconda navigator):
- `bs4`
- `html5lib`
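
Before running the rest of the notebook, it can help to confirm that each package is importable from the active environment. The following is a minimal sketch using only the standard library; the helper name `check_packages` is hypothetical and not part of the tutorial materials.

```python
from importlib.util import find_spec

def check_packages(names):
    """Return {import name: True/False} depending on whether it can be found."""
    return {name: find_spec(name) is not None for name in names}

# These are import names, which can differ from the name used at install time
required = ["requests", "bs4", "html5lib", "pandas"]
for name, found in check_packages(required).items():
    print(f"{name:10s} {'OK' if found else 'missing'}")
```

Any package reported as missing can then be installed with pip3 or Anaconda Navigator, as described above.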

In [2]:
import sys
print(sys.executable)

c:\Users\slh\AppData\Local\Programs\Python\Python310\python.exe


In [3]:
import requests
import bs4
import pandas as pd

## 1. Scraping Data from a Webpage

We'll start with a familiar example, and read in the webpage for [this unit's outline](https://www.sydney.edu.au/units/COMP5339/2025-S2C-NE-CC) on the USYD website.

### 1.1 Webpage Retrieval and Parsing

The `requests` library can be used to `get()` the contents of a page, as seen below.



In [4]:
webpage_source = requests.get("https://www.sydney.edu.au/units/COMP5339/2025-S2C-NE-CC").text
print(webpage_source)


<!DOCTYPE HTML>
<html lang="en-US">
    <head><script>
;window.NREUM||(NREUM={});NREUM.init={session_replay:{enabled:true,block_selector:'',mask_text_selector:'*',sampling_rate:10.0,error_sampling_rate:100.0,mask_all_inputs:true,collect_fonts:true,inline_images:false,inline_stylesheet:true,mask_input_options:{}},distributed_tracing:{enabled:true},privacy:{cookies_enabled:true},ajax:{deny_list:["bam.nr-data.net"]}};

;NREUM.loader_config={accountID:"2511202",trustKey:"1322840",agentID:"1588991041",licenseKey:"9500fdf8a5",applicationID:"1588991041"};
;NREUM.info={beacon:"bam.nr-data.net",errorBeacon:"bam.nr-data.net",licenseKey:"9500fdf8a5",applicationID:"1588991041",sa:1};
;/*! For license information please see nr-loader-spa-1.265.1.min.js.LICENSE.txt */
</script>
    <meta http-equiv="content-type" content="text/html; charset=UTF-8"/>
    
    
    <meta name="robots" content="noindex"/>
    
    
    

    








    

	<meta charset="utf-8"/>
	<meta http-equiv="X-UA-Compatible" c

The output of this request is the raw webpage source code, which a web browser would normally parse and render as a visual webpage.

This webpage is written in **HTML** (the *HyperText Markup Language*), which organises content elements into a tree-like structure. We can interpret this content using an **HTML parser**. Several exist, but we'll be using *BeautifulSoup*.

In [5]:
from bs4 import BeautifulSoup
import html5lib
content = BeautifulSoup(webpage_source, 'html5lib')

### 1.2 Traversing the Tree

The key benefit of parsing the webpage is that we can now **locate and iterate through the HTML content** by either traversing the tree, or by selecting particular HTML tags, classes or identifiers. As a simple example, the webpage output contains a single instance of a `title` tag, as seen below:

`<title>Outline - The University of Sydney</title>`

This title is typically shown as the browser tab name, which you'll notice doesn't match what our code retrieves. In the web browser, the title reflects the unit of study name, "COMP5339: Semester 2, 2025". When the page first loads, though, there's a brief moment where the tab name is "Outline - The University of Sydney", matching our code. On more complex web pages, some content is generated dynamically on load (e.g. with JavaScript), so small disparities like this can occur. This is the only field we'll encounter in this tutorial's example that should differ.

Nonetheless, using BeautifulSoup, we can extract this information, by finding the first `title` tag within the content, and extracting its text:

In [6]:
print(content.title.text)

Outline - The University of Sydney


We can get even more in-depth, and specify a path to traverse. The example below seeks the `body` tag, then the first `div` within that, and the first `div` within that!

In another browser tab with the actual webpage open, try using the **Inspect Element** feature to follow the path, and confirm this is pulling the right information:

In [7]:
print(content.body.div.div.text)

COMP5339: Semester 2, 2025
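
To see how this attribute-style traversal behaves without depending on the live page, here is a small sketch on a made-up HTML snippet. The markup and its `id` values are invented for illustration, and it uses Python's built-in `html.parser` rather than `html5lib` so that it needs no extra install.

```python
from bs4 import BeautifulSoup

# A tiny, made-up page so the traversal can be followed by eye
html = """
<html>
  <head><title>Demo page</title></head>
  <body>
    <div id="outer">
      <div id="inner">First nested div</div>
      <div>Second nested div</div>
    </div>
    <div>Another top-level div</div>
  </body>
</html>
"""

# 'html.parser' is Python's built-in parser; the tutorial itself uses 'html5lib'
soup = BeautifulSoup(html, "html.parser")

title_text = soup.title.text   # first <title> anywhere in the tree
inner = soup.body.div.div      # body -> first div -> first div inside that
print(title_text)
print(inner.text, inner["id"])

# Attribute-style access only ever returns the FIRST match; a missing tag gives None
print(soup.body.table)
```

Because a missing tag yields `None`, a long chain such as `content.body.div.div` will raise an `AttributeError` if any step along the path does not exist, which is worth keeping in mind when a page's layout changes.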


### 1.3 CSS Selectors

All elements of an HTML document can be assigned a **class** (multiple elements can share a class) or an **id** (which should be unique within a page). These are the hooks that **Cascading Style Sheets** (CSS) use to apply formatting (for example, all elements containing `class='darktext'` might be defined as having a black text colour).

Take the website's header as an example. This is all contained within a div with the class _primaryNavigation_, which we can focus on using the `find()` function. By then narrowing it down using `header > div`, we can find a few **<a\>** tags (which are links). Let's extract the `class` of the first link that appears:

In [8]:
print(content.find('div', 'primaryNavigation').header.div.a['class'])

['hamburgerIcon']


From there, we can jump to the next **<a\>** tag using `find_next()` (older code may use the legacy alias `findNext()`). Within this element is an **img**, so we'll extract the class from that:

In [9]:
print(content.find('div', 'primaryNavigation').header.div.a.find_next('a').img['class'])

['unilogo']



The above examples are useful, but only allow us to find single elements. The `find_all()` function captures _all_ occurrences of an HTML element by tag, and optionally by class or ID. Headings are defined in HTML as **h1**, **h2**, etc. Finding the text from all occurrences of `h2` tags nicely recaps the page structure:

In [10]:
for heading in content.find_all('h2'):
    print(heading.text.strip())

# alternative one line solution:
# [x.text.strip() for x in content.find_all('h2')]

Overview
Overview
Assessment
Assessment
Learning support
Learning support
Weekly schedule
Weekly schedule
Learning outcomes
Learning outcomes
Responding to student feedback
Responding to student feedback
Additional information
Additional information
On this page
Useful links
Media
Student links
About us
Connect
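
BeautifulSoup can also take CSS selector strings directly through `select()` and `select_one()`, which is sometimes more compact than chaining `find()` calls. The sketch below runs on invented markup: the `primaryNavigation`, `hamburgerIcon` and `unilogo` names are borrowed from the examples above, while the `footer` id, the `social` class and the example.com links are purely hypothetical.

```python
from bs4 import BeautifulSoup

html = """
<div class="primaryNavigation">
  <a class="hamburgerIcon" href="#menu">Menu</a>
  <a href="/"><img class="unilogo" src="logo.png"></a>
</div>
<div id="footer">
  <a class="social" href="https://example.com/a">Platform A</a>
  <a class="social" href="https://example.com/b">Platform B</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# find_all with a class filter ('class_' because 'class' is a Python keyword)
social_links = soup.find_all("a", class_="social")

# The same query as a CSS selector: '#' selects by id, '.' selects by class
selected = soup.select("#footer a.social")

# Collect (text, link) pairs from each matching element
pairs = [(a.get_text(strip=True), a["href"]) for a in selected]
print(pairs)

# select_one returns only the first match (or None if nothing matches)
logo_class = soup.select_one("div.primaryNavigation img")["class"]
print(logo_class)
```

Note that `class` is returned as a list, since an element may carry several classes at once.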


**Task: Find the text and hyperlinks of all USYD social media platforms listed in the page footer.**

The footer section of all pages on USYD's website contains a few small icons on the left, each linking to a social media account run by the university. Use "Inspect Element" on the webpage to find the class of the `div` that contains this information, and within this, extract the **text** _and_ **link** of each.

In [11]:
### TO DO


### 1.4 HTML tables

Not all useful information on a webpage sits in plain text elements. HTML also features `table` elements, which are made up of `<tr>` rows, each consisting of `<td>` cells (or `<th>` cells for headers). In our example, the top of each UoS page contains an overview table of academic details. We can first locate the div that contains it by its **class**, in this case _teaching-staff__wrapper_, and explore its structure:

In [12]:
details = content.find('div', 'teaching-staff__wrapper')
details

<div class="teaching-staff__wrapper">
                                                            




    
    
    <div class="title">
<h3>Teaching staff</h3>
</div>



                                                            <table class="table table-striped table-bordered">
                                                                <tbody>
                                                                <tr>
                                                                    <th rowspan="1">Coordinator</th>
                                                                    <td>
                                                                        <span>
                                                                            Uwe Roehm,
                                                                        </span>
                                                                        <span><a href="mailto:uwe.roehm@sydney.edu.au?unit=COMP5339">uwe.roehm@sydney.edu.au<

Let's iterate through each row, and extract both its header (in **<th\>**), and the corresponding cell data (in **<td\>**).

In [13]:
for row in details.find_all('tr'):
    print(row.th.text.strip(), '=', row.td.text.strip())

Coordinator = Uwe Roehm,
                                                                        
                                                                        uwe.roehm@sydney.edu.au
Lecturer(s) = Uwe Roehm, uwe.roehm@sydney.edu.au
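
The Coordinator value above carries stray line breaks and padding, because `.text` returns the whitespace between nested tags as well. One way to tidy this is `get_text()` with a separator and `strip=True`. The sketch below uses a made-up table with a placeholder name and email, not the real page content.

```python
from bs4 import BeautifulSoup

html = """
<table>
  <tbody>
    <tr>
      <th>Coordinator</th>
      <td>
        <span>
            Ada Example,
        </span>
        <span><a href="mailto:ada@example.com">ada@example.com</a></span>
      </td>
    </tr>
    <tr>
      <th>Lecturer(s)</th>
      <td><span>Ada Example,</span> <span>ada@example.com</span></td>
    </tr>
  </tbody>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

rows = {}
for row in soup.find_all("tr"):
    header = row.th.get_text(strip=True)
    # Strips every text fragment, drops empty ones, and joins the rest with one space
    value = row.td.get_text(" ", strip=True)
    rows[header] = value

print(rows)
```

Both rows now produce the same clean string, regardless of how the original markup was indented.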


**Task: Extract the details of all assessments in the webpage.**

1. Use Inspect Element to locate the id/class of the div containing the assessment details (set this as `assessments`)
2. Create a list of `headers`, for the column headers (<th\>) of the table (e.g. ['Type', 'Description', ...])
3. For each row in the table, add a dictionary of values for that row (e.g. {'Type': 'Online task', 'Description': 'Weekly Homework', ...}) to the `data` list

Tip #1: The "Outcomes assessed" rows are not intended to be kept. Either skip these rows in your loop, or see if you can find a CSS class that would ignore these.

Tip #2: A different approach may be needed for cells with bold text, and cells without bold text, to avoid the longer description text being brought in.

In [14]:
### TO DO
assessments = '?' # use the find() function to locate the div containing the table

headers = []  # populate this list with the headers of the table

data = []
for row in '?':  # iterate through each row of the table
    assessment = {'Unit': 'COMP5339', 'Session': '2025-S2C-NE-CC'}  # start with a couple fields populated
    # iterate through each cell in the row, and add it to the 'assessment' dictionary
    data.append(assessment)  # add the dictionary of row values to our overall list 'data'

pd.DataFrame(data)  # return the results as a dataframe

Unnamed: 0,Unit,Session
0,COMP5339,2025-S2C-NE-CC


## 2. Web Crawling

Web scraping is powerful on a single page, but even more so when a script applies it across **multiple** webpages.

Note the legal/ethical cautions, and best practices:
1. Check the **robots.txt** to determine whether users are permitted to scrape pages, and at what frequency
2. Add **intentional delays** in the code to avoid congesting servers (or getting blocked from websites!)
3. Initially, just **practice** building your code over a single webpage or two. Only scale up to multiple pages once you are confident the code does as it is intended to!
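
The first point can be checked programmatically with Python's standard `urllib.robotparser` module. The sketch below parses a hypothetical robots.txt supplied as text, so it makes no network request; the rules and example.com URLs are invented and do not reflect any real site's policy.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed offline so that no request is made.
# Against a real site you would instead call parser.set_url(<robots.txt URL>)
# followed by parser.read().
robots_lines = """
User-agent: *
Disallow: /private/
Crawl-delay: 2
""".splitlines()

parser = RobotFileParser()
parser.parse(robots_lines)

allowed = parser.can_fetch("*", "https://example.com/units/COMP5339")
blocked = parser.can_fetch("*", "https://example.com/private/report")
delay = parser.crawl_delay("*")  # seconds requested between requests, or None

print(allowed, blocked, delay)
```

If `crawl_delay()` returns a number, it is a sensible lower bound for the pause you place between requests.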

### 2.1 Link Extraction

So far, we've been exploring the webpage for this year's occurrence of COMP5339. If we go back to the homepage of COMP5339, we can similarly parse the HTML content, and find links to all occurrences of the unit. Past occurrences are represented in the _archivedOutlines_ div, and current units are in the _currentOutlines_ div, so we'll pull links from them both.

In [15]:
page = requests.get("https://www.sydney.edu.au/units/COMP5339").text
content = BeautifulSoup(page, 'html5lib')
links = []
oldlinks = content.find('div', id='archivedOutlines').find_all('a')
for link in oldlinks:
    if link.has_attr('href'):
        links.append(link.get('href'))
newlinks = content.find('div', id='currentOutlines').find_all('a')
for link in newlinks:
    if link.has_attr('href'):
        links.append(link.get('href'))
print(links)

['/units/COMP5339/2023-S2C-NE-CC', '/units/COMP5339/2024-S1C-NE-CC', '/units/COMP5339/2024-S2C-NE-CC', '/units/COMP5339/2025-S1C-NE-CC', '/units/COMP5339/2025-S2C-NE-CC']


Note that the links seem incomplete: they start with a slash, rather than specifying a full URL. This implies pages on the same web domain. Therefore, we can prepend the domain to turn these into fully qualified hyperlinks:

In [16]:
for link in links:
    URL = 'http://sydney.edu.au'+link
    print(URL)

http://sydney.edu.au/units/COMP5339/2023-S2C-NE-CC
http://sydney.edu.au/units/COMP5339/2024-S1C-NE-CC
http://sydney.edu.au/units/COMP5339/2024-S2C-NE-CC
http://sydney.edu.au/units/COMP5339/2025-S1C-NE-CC
http://sydney.edu.au/units/COMP5339/2025-S2C-NE-CC
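
String concatenation works here because every link starts with a slash. As an alternative sketch, the standard library's `urljoin()` resolves a link against the page it was found on, and also copes with links that are already absolute. The base URL below is the unit homepage requested earlier; the example.com link is only an illustration.

```python
from urllib.parse import urljoin

base = "https://www.sydney.edu.au/units/COMP5339"
relative_links = ["/units/COMP5339/2024-S2C-NE-CC", "/units/COMP5339/2025-S2C-NE-CC"]

# Resolve each link relative to the page it was scraped from
full_urls = [urljoin(base, link) for link in relative_links]
for url in full_urls:
    print(url)

# Links that are already absolute pass through unchanged
absolute = urljoin(base, "https://example.com/page")
print(absolute)
```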


### 2.2 Link Traversal

**Task: Create a function that receives a URL, and returns the assessment data.**

The function is set up below, for you to paste in your answer from the task in Section 1.4. Only a couple of adjustments are needed:
1. Your previous code worked with a predefined `content` variable. This function should receive the URL, retrieve its web contents, parse its HTML, and then proceed with this.
2. Our previous row-by-row `assessment` dictionary had the unit and session hardcoded in. Try updating this to reflect this information dynamically from the URL itself.
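
For the second adjustment, one possible approach is to read the unit code and session from the last two segments of the URL path. This is only a sketch: the helper name `unit_and_session` is hypothetical, and it assumes every outline URL follows the `/units/<unit>/<session>` pattern seen in the links above.

```python
from urllib.parse import urlparse

def unit_and_session(url):
    """Split '.../units/<unit>/<session>' into its last two path segments."""
    parts = urlparse(url).path.strip("/").split("/")
    unit, session = parts[-2], parts[-1]
    return unit, session

unit, session = unit_and_session("https://www.sydney.edu.au/units/COMP5313/2025-S1C-NE-CC")
print(unit, session)
```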

Once you are confident your function is correct, test it using the last line of the cell below, which calls it on [COMP5313](https://www.sydney.edu.au/units/COMP5313/2025-S1C-NE-CC).

In [17]:
### TO DO
def findAssessments(URL):
    # retrieve the URL first, then:
    """
    paste in your code from the task in Section 1.4, but adjust the initial 'assessment' dictionary to actually detail the true unit code and session from the URL
    """

    return pd.DataFrame(data)

findAssessments('https://www.sydney.edu.au/units/COMP5313/2025-S1C-NE-CC')

Unnamed: 0,Unit,Session
0,COMP5339,2025-S2C-NE-CC


Once this has been achieved, we can test it by iterating over the links we located in Section 2.1.

Note an explicit delay of two seconds has been added in between each request, using the `.sleep()` function from the `time` module.

In [18]:
import time as t
df = pd.DataFrame(columns=['Unit', 'Session']+headers[:5])  # establishing a blank dataframe to be populated
for link in links:  # for each link we found earlier
    URL = 'http://sydney.edu.au'+link  # establishing its full address
    print(URL)  # printing it to summarise our progress
    t.sleep(2)  # waiting for two seconds before requesting
    df = pd.concat([df, findAssessments(URL)])  # merging the new data with our existing df

df

http://sydney.edu.au/units/COMP5339/2023-S2C-NE-CC
http://sydney.edu.au/units/COMP5339/2024-S1C-NE-CC
http://sydney.edu.au/units/COMP5339/2024-S2C-NE-CC
http://sydney.edu.au/units/COMP5339/2025-S1C-NE-CC
http://sydney.edu.au/units/COMP5339/2025-S2C-NE-CC


Unnamed: 0,Unit,Session
0,COMP5339,2025-S2C-NE-CC
0,COMP5339,2025-S2C-NE-CC
0,COMP5339,2025-S2C-NE-CC
0,COMP5339,2025-S2C-NE-CC
0,COMP5339,2025-S2C-NE-CC
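
Notice that the index value 0 repeats in the output above, because each dataframe keeps its own index when concatenated. A common alternative pattern, sketched below, is to collect the per-page dataframes in a list and concatenate once with `ignore_index=True`. The `fake_scrape` function and its rows are stand-ins invented for illustration; they are not the tutorial's `findAssessments()`.

```python
import pandas as pd

def fake_scrape(unit, session):
    """Stand-in for a scraping function; returns one small dataframe per page."""
    return pd.DataFrame([{"Unit": unit, "Session": session, "Type": "Exam"}])

# Hypothetical (unit, session) pairs standing in for the crawled links
pages = [("COMP0001", "2024-S1C"), ("COMP0001", "2025-S1C")]

# Gather every page's dataframe first, then concatenate a single time
frames = [fake_scrape(unit, session) for unit, session in pages]
combined = pd.concat(frames, ignore_index=True)  # renumber rows 0..n-1
print(combined)
```

Concatenating once at the end also avoids copying the growing dataframe on every iteration of the loop.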


And there we have it! A simple, brief example of crawling and scraping to collate data summarised from websites.

## 3. Data Storage

As mentioned in previous coverage of Pandas, exporting to a CSV file is quite simple using the `.to_csv()` function. This should create a CSV in your working directory, containing the information we collated.

In [19]:
df.to_csv("assessments.csv", index=False)

## 4. Application

The fun (but way overboard) **OPTIONAL** extra task.

For the last couple of years, OLEs (Open Learning Environment units) have been a requirement of all undergraduate degrees at USYD. What if we could extract assessment information **for all OLEs at the university**, thereby enabling students to narrow down those that are most appealing to them? Perhaps one student is interested in finding the OLE with the _fewest_ assessments, but including at least one presentation, for example. Perhaps another seeks OLEs with a group work element above a weighting of 20%. The possibilities are plentiful.

**OPTIONAL Task: Extract the list of all OLE UoS codes/titles.**

Run the below cell to request the OLE page, then (again using Inspect Element), extract the list of all OLE titles (e.g. 'OLET1622 Numbers and Numerics') available.

In [20]:
OLEpage = requests.get("https://www.sydney.edu.au/handbooks/interdisciplinary_studies/open_learning_environment/open_learning_environment_ad_table.html").text
OLEcontent = BeautifulSoup(OLEpage, 'html5lib')

In [21]:
### TO DO
uoslist = '?'

**OPTIONAL Task: For each OLE found, extract all assessments for their most recent outline.**

Iterate through the list of units extracted above, and visit the UoS page for each. Within this UoS page, find the first link that appears in _currentOutlines_ (if one exists), and apply the same `findAssessments()` function we developed earlier, again making sure to leave an intentional delay between visiting web pages.

A `progress()` helper function is included below, so that the time taken for the cell to run is reported. It is also recommended to print the unit code as each next one is reached, so that your progress can be monitored.

A limit on your list of OLEs has also been added by default, so that only the first three pages are processed. Only remove this once you are confident your code will run smoothly on the remaining pages.

In [22]:
### TO DO

# helper function to report how long the cell took to run
def progress(t0):
    print('Completed in ' + str(round((t.time()-t0)/60, 1)) + ' minutes.')
    
t0 = t.time()
OLEdf = pd.DataFrame(columns=['Unit', 'Session']+headers[:5])  # establishing a blank dataframe to be populated
for i, uoscode in enumerate(uoslist[:3]):  # purposefully limiting to just the first few for now
    print(f'({i}/{len(uoslist)}) {uoscode}')  # printing what element in the list we're up to
    t.sleep(2)  # wait two seconds before requesting anything
    # request the URL for this unit and locate the first link in the unit outlines table, if it exists
    # go to the first link in this table, and use the findAssessments() function to extract its assessment info
    # final line should be something like: OLEdf = pd.concat([OLEdf, findAssessments(URL)])

progress(t0)

OLEdf

(0/1) ?
Completed in 0.0 minutes.


Unnamed: 0,Unit,Session


From there, you are welcome to ingest this into your database server and query it at will to discover your "ideal" OLE. Any findings from this are gladly welcomed! If sufficient interest is garnered, we may even create an Ed thread to discuss some student findings, and (more importantly from an educational perspective), the queries used to discover them :)
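
As a rough illustration of the kind of query this enables, the sketch below loads a few made-up rows into an in-memory SQLite database and finds units with at least one presentation, ordered by how many assessments they have. The table name, column names, unit codes and weights are all hypothetical, and SQLite is used here only because it ships with Python; your own database server and schema may differ.

```python
import sqlite3

# Made-up rows standing in for scraped OLE assessments: (unit, type, weight %)
rows = [
    ("OLET0001", "Presentation", 30),
    ("OLET0001", "Online task", 70),
    ("OLET0002", "Group project", 25),
    ("OLET0002", "Exam", 75),
    ("OLET0003", "Online task", 100),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE assessments (unit TEXT, type TEXT, weight INTEGER)")
conn.executemany("INSERT INTO assessments VALUES (?, ?, ?)", rows)

# Units with at least one presentation, ordered by their number of assessments
query = """
    SELECT unit, COUNT(*) AS n_assessments
    FROM assessments
    GROUP BY unit
    HAVING SUM(type = 'Presentation') > 0
    ORDER BY n_assessments
"""
result = conn.execute(query).fetchall()
print(result)
conn.close()
```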