# Scraper for UvA Courses


NOTE: AI tools, specifically ChatGPT and Deepseek, were used to generate almost all of the code in this notebook, because parsing websites with libraries such as BS4 and Selenium is outside the scope of the course. I therefore worked one level of abstraction above the code: I understood the inputs and outputs of the generated code, but not every detail of how it works.

The first part of the notebook deals with scraping all the course links from the UvA course catalogue. This code was generated using ChatGPT and then slightly tweaked. Only the first three prompts in the conversation were used to generate the code; the rest of the conversation was not used. The transcript for this conversation can be found here: https://chatgpt.com/share/6834a603-651c-800d-9f41-0290db8d714d

The second part of the notebook deals with parsing the HTML of each course link to find information useful for the recommender system. This code was generated using Deepseek. Deepseek does not yet provide an option to share conversations, so each prompt used is listed in a Markdown cell at the end of the notebook.

## Scraping Course Links from Course Catalogue

First, we import the necessary libraries to create a function to scrape the webpage links of all the courses from the UvA course catalogue website.

In [7]:
#pip install selenium

In [8]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
import re

Next, we create the function itself. The UvA course catalogue shows 20 courses per page, so the function needs to parse the HTML of the current page, collect the links of the 20 courses on it, and then click the "next" button to go to the next page.

In [9]:
def scrape_course_links():
    base_url = "https://studiegids.uva.nl/xmlpages/page/2024-2025-en/search-course"
    course_links = []

    # Set up headless Chrome browser
    options = Options()
    options.add_argument("--headless")
    options.add_argument("--disable-gpu")
    options.add_argument("--no-sandbox")
    driver = webdriver.Chrome(options=options)

    wait = WebDriverWait(driver, 10)
    driver.get(base_url)

    # There are 20 courses per page and 3802 courses, so 200 pages (4000 courses) is a safe upper bound.
    # max_pages was lowered while debugging so as not to scrape the entire catalogue each time.
    cur_page = 0
    max_pages = 200

    while cur_page<max_pages:
        time.sleep(2)  # wait for JS content to load

        # Wait until the search results container is present on the page
        wait.until(EC.presence_of_element_located((By.ID, "search-results")))

        # Get course links freshly on each iteration
        course_elements = driver.find_elements(By.CSS_SELECTOR, "div#search-results a[href*='/search-course/course/']")
        for i in range(len(course_elements)):
            try:
                # Refresh the element to avoid stale reference
                element = driver.find_elements(By.CSS_SELECTOR, "div#search-results a[href*='/search-course/course/']")[i]
                link = element.get_attribute("href")
                if link and link not in course_links:
                    course_links.append(link)
            except Exception as e:
                print(f"Skipping element due to error: {e}")
                continue

        # Try clicking "next"
        try:
            next_button = wait.until(EC.element_to_be_clickable((By.XPATH, "//a[contains(text(), 'next')]")))
            driver.execute_script("arguments[0].scrollIntoView(true);", next_button)
            time.sleep(1)
            next_button.click()
            
            # Wait for page to change (wait for current button to become stale)
            WebDriverWait(driver, 10).until(
                EC.staleness_of(next_button)
            )
            
            # Wait for new results
            WebDriverWait(driver, 20).until(
                EC.presence_of_element_located((By.ID, "search-results"))
            )
            time.sleep(1)  # Additional buffer
            
            cur_page += 1
        except Exception:  # avoid a bare except, which would also swallow KeyboardInterrupt
            print("No more pages found. Exiting loop.")
            break

    driver.quit()
    return course_links
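
The `link not in course_links` check in the function above scans the whole list for every link. That is fine for a few thousand courses, but the cost grows with the size of the list. Below is a minimal sketch of an order-preserving alternative that mirrors the list in a set, together with the page-count arithmetic behind `max_pages`. The helper names `add_unique` and `pages_needed` and the `example.org` URLs are hypothetical and not part of the notebook.

```python
import math


def pages_needed(n_courses, per_page=20):
    """Number of result pages required to list n_courses."""
    return math.ceil(n_courses / per_page)


def add_unique(link, links, seen):
    """Append link to links only if it has not been seen before.

    `seen` is a set mirroring `links`, so the membership test does not
    need to scan the list. Returns True when the link was added.
    """
    if not link or link in seen:
        return False
    seen.add(link)
    links.append(link)
    return True


links, seen = [], set()
hrefs = [
    "https://example.org/course/1",
    "https://example.org/course/2",
    "https://example.org/course/1",  # duplicate, skipped
    None,                            # missing href, skipped
]
for href in hrefs:
    add_unique(href, links, seen)

print(links)               # the two unique links, in first-seen order
print(pages_needed(3802))  # 191, so max_pages = 200 leaves some headroom
```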

Now, we can simply execute the function and save the course links inside a .txt file, which we can later parse to obtain the information we need for the recommender:

In [10]:
links = scrape_course_links()
print(f"Found {len(links)} courses.")
    
output_file = "datasets/uva_course_links.txt"
with open(output_file, "w", encoding="utf-8") as f:
    for link in links:
        if re.search(r"studiegids\.uva\.nl/xmlpages/page/2024-2025-en/search-course/course/\d{6}", link):
            f.write(link + "\n")

No more pages found. Exiting loop.
Found 3812 courses.
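
To illustrate what the filter in the cell above keeps, here is a small sketch that applies the same pattern to three URLs. The first is built from the course id 122456 that appears in the Deepseek prompts; the other two are made-up counterexamples.

```python
import re

# Same pattern as in the cell above, compiled once for reuse
COURSE_URL = re.compile(
    r"studiegids\.uva\.nl/xmlpages/page/2024-2025-en/search-course/course/\d{6}"
)

candidates = [
    "https://studiegids.uva.nl/xmlpages/page/2024-2025-en/search-course/course/122456",
    "https://studiegids.uva.nl/xmlpages/page/2024-2025-en/search-course",
    "https://studiegids.uva.nl/xmlpages/page/2024-2025-en/search-course/course/12",
]

kept = [url for url in candidates if COURSE_URL.search(url)]
print(kept)  # only the first URL, which contains a six-digit course id
```

Note that the pattern is not anchored at the end, so a URL with more than six digits after `/course/` would also pass the filter.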


## Obtaining Dataset from Course Links

Now, we need to parse the HTML of the page at each course link in order to obtain the data for the dataset. Each page contains a table in which we find the following data points:
1. Course name
2. Course catalogue number
3. College/graduate
4. Language of instruction
5. Time period
6. Is part of

Underneath the table there is also a course description, which is the most important piece of information for the recommender, as this is what we are going to use to compute the similarity between courses. The course description typically contains several headings, but the ones we will extract are "Objectives" and "Contents", as these are the most descriptive of the course itself. We don't think that other headings such as "Study materials" or "Assessment" would be useful: these sections contain few discriminative tokens that differentiate one course from another, so including them would mostly add noise.

We will use Beautiful Soup 4 to parse the HTML of each course link, and pandas to process the information into a dataframe.


First, we define a function that takes the cells of a table row from the course page as its parameter and returns the corresponding time period. This needs a separate function because the page represents the blocks in which a course runs as coloured boxes rather than as text, which makes the time period harder to scrape directly. The function is used later inside a larger function that scrapes all the information from a course page at once.

In [11]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
from tqdm import tqdm
import csv

In [12]:
def extract_time_period(cells):
    try:
        time_periods = []
        
        # Find the time period table
        time_table = cells[1].find('table')
        if not time_table:
            return None
        
        # Process each semester column
        for sem_col in time_table.find_all('td', class_=lambda x: x and ('sem1' in x or 'sem2' in x)):
            # Determine semester and base number
            semester = 1 if 'sem1' in sem_col.get('class', []) else 2
            base_num = 0 if semester == 1 else 3
            
            # Find all red blocks that aren't hidden
            red_blocks = sem_col.select('div.red:not(.hide)')
            
            for block in red_blocks:
                # Get block number from class (handles block-2 and block-2-3 formats)
                block_class = next((c for c in block.get('class', []) if c.startswith('block-')), None)
                if not block_class:
                    continue
                
                # Extract numbers
                parts = block_class.split('-')[1:]
                if not parts:
                    continue
                
                # Convert to final numbers
                if len(parts) == 1:  # Single block (e.g., block-2)
                    block_num = int(parts[0])
                    time_periods.append(str(base_num + block_num))
                else:  # Range (e.g., block-2-3)
                    start = int(parts[0])
                    end = int(parts[1])
                    time_periods.append(f"{base_num + start}-{base_num + end}")
                    
        return ', '.join(time_periods) if time_periods else None
    
    except Exception as e:
        print(f"Error processing time period: {e}")
        return None
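
The numbering logic inside `extract_time_period` can be seen in isolation with a small sketch that needs no HTML at all. The helper name `block_class_to_period` is hypothetical. It only mirrors the `base_num` arithmetic above, assuming the class names follow the `block-2` and `block-2-3` formats described in the prompts at the end of the notebook.

```python
def block_class_to_period(block_class, semester):
    """Convert a class such as 'block-2' or 'block-2-3' to a 1-6 period string.

    Semester 1 covers blocks 1-3 and semester 2 covers blocks 4-6,
    so blocks in semester 2 are shifted by 3.
    """
    base_num = 0 if semester == 1 else 3
    parts = block_class.split("-")[1:]          # drop the leading 'block'
    numbers = [base_num + int(part) for part in parts]
    return "-".join(str(number) for number in numbers)


print(block_class_to_period("block-3", semester=1))    # 3
print(block_class_to_period("block-2", semester=2))    # 5
print(block_class_to_period("block-2-3", semester=2))  # 5-6
```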

Next, we define a function that takes a course link as its parameter and returns a dictionary whose keys are the field names (the table labels, plus the URL and the course description) and whose values are the scraped information.

In [13]:
def extract_course_info(url):
    try:
        headers = {'User-Agent': 'Mozilla/5.0'}
        response = requests.get(url.strip(), headers=headers, timeout=10)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')
        
        # Extract course name
        breadcrumb = soup.find('nav', id='breadcrumb')
        course_name = breadcrumb.find('li', class_='current').find('a').get_text(strip=True) if breadcrumb else None
        
        # Initialize course data (all fields will be strings)
        course_data = {
            'url': url.strip(),
            'course_name': course_name,
            'course_catalogue_number': '',
            'college_graduate': '',
            'language_of_instruction': '',
            'time_period': '',
            'is_part_of': '',
            'course_description': ''
        }

        # Extract table data
        item_info = soup.find('div', class_='item-info')
        if item_info:
            for row in item_info.find_all('tr'):
                cells = row.find_all('td')
                if len(cells) >= 2:
                    label = cells[0].get_text(strip=True).lower()
                    value = cells[1].get_text(strip=True)
                    if 'course catalogue number' in label:
                        course_data['course_catalogue_number'] = value
                    elif 'language of instruction' in label:
                        course_data['language_of_instruction'] = value
                    elif 'time period(s)' in label:
                        course_data['time_period'] = extract_time_period(cells) or ''  # keep the field a string
                    elif 'college/graduate' in label:
                        course_data['college_graduate'] = value
                    elif 'is part of' in label:
                        links = [link.get_text(strip=True) for link in cells[1].find_all('a')]
                        course_data['is_part_of'] = ', '.join(links) if links else ''

        # Extract description sections
        def extract_section(heading_id):
            heading = soup.find('h4', id=heading_id)
            if not heading:
                return ''
            
            content = []
            next_elem = heading.find_next_sibling()
            while next_elem and next_elem.name != 'h4':
                if next_elem.name == 'p':
                    content.append(next_elem.get_text(strip=True))
                elif next_elem.name == 'ul':
                    content.extend(li.get_text(strip=True) for li in next_elem.find_all('li'))
                next_elem = next_elem.find_next_sibling()
            
            if content:
                return ' '.join(content)
            return ''

        objectives = extract_section('leerdoel')
        contents = extract_section('inhoud')
        
        # Combine all description parts
        description_parts = [part for part in [objectives, contents] if part]
        if description_parts:
            course_data['course_description'] = '\n'.join(description_parts)

        return course_data
    except Exception as e:
        print(f"Error processing {url.strip()}: {str(e)}")
        return {
            'url': url.strip(),
            'course_name': '',
            'course_catalogue_number': '',
            'college_graduate': '',
            'language_of_instruction': '',
            'time_period': '',
            'is_part_of': '',
            'course_description': '',
            'error': str(e)
        }
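
Both the success path and the error path of `extract_course_info` build their dictionary by hand, so the two sets of keys have to be kept in sync manually. One possible way to keep them aligned is a shared template, sketched below. The names `FIELDS` and `empty_course_record` and the `example.org` URLs are hypothetical and not part of the notebook.

```python
# Column names used by the scraper, in the order they appear in the CSV
FIELDS = (
    "course_name",
    "course_catalogue_number",
    "college_graduate",
    "language_of_instruction",
    "time_period",
    "is_part_of",
    "course_description",
)


def empty_course_record(url):
    """Return a record with every column present and set to an empty string."""
    record = {"url": url.strip()}
    record.update({field: "" for field in FIELDS})
    return record


ok = empty_course_record(" https://example.org/course/1 \n")
failed = {**empty_course_record("https://example.org/course/2"), "error": "timeout"}

print(list(ok))                # same columns, in the same order, for every record
print(set(failed) - set(ok))   # the error record only adds the 'error' key
```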

Now we can simply read the course links from the .txt file we created earlier, scrape the information from each course link utilising the functions we just made, save this information in a pandas dataframe, and then convert the dataframe into a .csv file.

In [14]:
try:
    with open("datasets/uva_course_links.txt", 'r', encoding="utf-8") as f:  # same encoding the file was written with
        urls = [line.strip() for line in f if line.strip()]
        
    results = []
    for url in tqdm(urls, desc="Scraping courses"):
        results.append(extract_course_info(url))
        
    # Convert to DataFrame and save with proper CSV formatting
    df = pd.DataFrame(results)
        
    # Replace None with empty string to avoid issues
    df = df.fillna('')
        
    # Save with quoting all fields and proper encoding
    df.to_csv(
        "datasets/recommender_dataset.csv",
        index=False,
        quoting=csv.QUOTE_ALL,
        quotechar='"',
        encoding='utf-8-sig'  # UTF-8 with a byte order mark, which helps spreadsheet tools detect the encoding
    )
    print(f"Successfully saved {len(df)} courses to datasets/recommender_dataset.csv")

except Exception as e:
    print(f"Error: {str(e)}")

Scraping courses: 100%|█████████████████████████████████████████████████████████████| 3812/3812 [15:01<00:00,  4.23it/s]


Successfully saved 3812 courses to datasets/recommender_dataset.csv
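
The course description joins objectives and contents with a newline, so it is worth checking that a quoted field containing a newline still comes back as a single value. The sketch below uses only the standard `csv` module and an in-memory buffer with made-up data. pandas accepts the same `csv.QUOTE_*` constants in `to_csv`, but this is an illustration rather than a test of the pandas call itself.

```python
import csv
import io

rows = [
    {
        "course_name": "Example course",
        "course_description": "Objectives text\nContents text",
    },
]

# Write one record with every field quoted, as in the cell above
buffer = io.StringIO()
writer = csv.DictWriter(
    buffer,
    fieldnames=["course_name", "course_description"],
    quoting=csv.QUOTE_ALL,
)
writer.writeheader()
writer.writerows(rows)

# Read it back: the embedded newline stays inside one quoted field
buffer.seek(0)
parsed = list(csv.DictReader(buffer))

print(csv.QUOTE_ALL)  # 1, which is why quoting=1 means "quote every field"
print(len(parsed))    # 1 logical record despite the newline in the description
print(parsed[0]["course_description"] == rows[0]["course_description"])  # True
```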


## Deepseek Prompts

NOTE: since I sometimes passed HTML directly to Deepseek, I had to insert stray apostrophes throughout the HTML below so that it is not rendered as markup in the Markdown cell.

**Prompt 1**: I have a CSV file with 3802 links to webpages, where each is a webpage for a university course. I want to extract the following from these webpages:
1. Course name
2. Course catalogue number
3. Time period(s)
4. College/graduate
5. Langauge of instruction
6. Is part of
7. Course description
Here is the html file format that these webpages follow. Use this to write code to extract the previously mentioned characteristics from the webpages inside the csv file:


**Prompt 2**: its actually a .txt file that contains the course links, not a csv file, my bad

**Prompt 3**: This script works great, except there are two problems:
1. The course name is not saved properly. Change the code so that it saves it properly. I will also provide the html for the website once more, the course name is found in: <div''' class="inner-spacer"><h1'> 9A: Machine Learning and Deep Learning Brush-up</h1>
2. The time period is not saved properly. The way the website represents the time period is by six boxes, as seen in the html code as <div class="block-3 red"'><img src="/xmlpages/resources/TXP/uva/studiegidswebsite/img/red-block.png" alt="" width="16" height="16"/'>. The way I want you to represent the time period is by looking at the number in the block-x-red section. If it is "block-3 red" as shown here, then make the time period "3".

Now here is the html code for the website:

**Prompt 4**: Once agan, the course name is not found properly. Use this part of the html in order to find the course's name:
<li' class="current"><a 'href="/xmlpages/page/2024-2025-en/search-course/course/122456/">9A: Machine Learning and Deep Learning Brush-up</a'>.
In this example, the course name is 9A: Machine Learning and Deep Learning Brush Up

**Prompt 5**: The course names works correctly now. I want you to fix one more thing:
The time period is represented by two classes, sem1 and sem2. I want you to represent the semester in the following way:
If the red block is in the sem1 class, the number of the red block should correspond to 1-3. If the red block is in the sem2 class, the red block should correspond to 4-6. So if there is block-2 red in sem2, it should be saved as 5. Please modify the code the account for this.

**Prompt 6**: The time period might be a range also, so for example the course might be available in semester 1 from block 2-3. Modify the code to account for this range

**Prompt 7**: I'd like you to modify the following part of the code, extending the course description to also include objectives, i.e. the h4 heading with id = "leerdoel". Combine the objectives and contents in order to make the course description:

#' Extract course description (content after the "Contents" heading)
        contents_heading = soup.find('h4', id='inhoud')
        if contents_heading:
            description = []
            next_elem = contents_heading.find_next_sibling()
            while next_elem and next_elem.name != 'h4':
                if next_elem.name == 'p':
                    description.append(next_elem.get_text(strip=True))
                elif next_elem.name == 'ul':
                    for li in next_elem.find_all('li'):
                        description.append(li.get_text(strip=True))
                next_elem = next_elem.find_next_sibling()
            course_data['course_description'] = ' '.join(description) if description else None

**Prompt 8**: The objectives usually contain bullet points, as can be seen in an example in the following html code:
<'ul class="bullets-outside"><l'i>have a better understanding of long-term human impact on the lived environment;</'li><'li>have insight in the ways in which cultures in the past have reacted to environmental change;</'li><'li>understand the effect of modern climate change on archaeological heritage;</li'><'li>have knowledge of the ways in which the discipline of archaeology can contribute to environmental debates;</li'><'li>be aware of theories concerning social and cultural resilience and able to relate these to case studies in different parts of the world.</'li></'ul><'p>

**Prompt 9**: This is good, but what I think is happening is that the following code is putting each bullet point etc in a new row of the .csv file, which is not what is meant to happen. Instead, the entire course description should be one string which is inside the course description column:
if description_parts:
            # Filter out None values and join with newlines
            course_data['course_description'] = '\n'.join(filter(None, [s.strip() if isinstance(s, str) else s for s in description_parts]))
        else:
            course_data['course_description'] = None

**Prompt 10**: what is the escapechar

**Prompt 11**: What does quoting =1 do

**Prompt 12**: Ok, now we are going to do something tricky: try to also include the semester the course is in. Here is some useful information to take account of when generating the code:
1. There are two semesters, semester 1 and semester 2
2. Both semesters have 3 blocks inside of them, for a total of 6 blocks. Hence, I'd like the semester to be represented by an integer between 1 and 6, which can also be a range e.g. 2-3
3. The semester is decided by red-blocks in the html code. After a red block appears, the time period is also provided. However, you need to account which semester the red block is in. If a red block appears for time period 2-3 in semester 2, the semester in the csv file should be 5-6.
4. Here is a snippet of the html code from a course website which include the red blocks, in this case for semester 2 time period 2-3, which would translate to 5-6 if we were to include it in the csv:
<'p class="meta">Time period(s)<'/p></'td><'td><'table style="width:inherit;"><'tr class="result"><'th style="width:60px" class="sem1">Sem. 1</'th><'th style="width:60px" class="sem2">Sem. 2</'th></'tr><'tr class="result"><'td class="sem1"><'div class="sem-blocks"><'img src="/xmlpages/resources/TXP/uva/studiegidswebsite/img/grey-blocks.png" alt="" width="56" height="16"/><'div class="hide red"><'img src="/xmlpages/resources/TXP/uva/studiegidswebsite/img/red-block.png" alt="" width="16" height="16"/><'/div></'div><'/td><'td class="sem2"><'div class="sem-blocks"><'img src="/xmlpages/resources/TXP/uva/studiegidswebsite/img/grey-blocks.png" alt="" width="56" height="16"/><'div class="block-2-3 red"><'img src="/xmlpages/resources/TXP/uva/studiegidswebsite/img/red-block.png" alt="" width="16" height="16"/>

**Prompt 13**: This did not work. One thing to keep in mind that may help is that there are two classes, sem1 and sem2. Inside each of these classes there is the sem-blocks classes, and within that class there should be the red-blocks, as can be seen in this html code:
<'p class="meta">Time period(s)<'/p><'/td><'td><'table style="width:inherit;"><'tr class="result"><'th style="width:60px" class="sem1">Sem. 1</'th><'th style="width:60px" class="sem2">Sem. 2</'th></'tr><'tr class="result"><'td class="sem1"><'div class="sem-blocks"><'img src="/xmlpages/resources/TXP/uva/studiegidswebsite/img/grey-blocks.png" alt="" width="56" height="16"/><'div class="hide red"><'img src="/xmlpages/resources/TXP/uva/studiegidswebsite/img/red-block.png" alt="" width="16" height="16"/></'div></'div></'td><'td class="sem2"><'div class="sem-blocks"><'img src="/xmlpages/resources/TXP/uva/studiegidswebsite/img/grey-blocks.png" alt="" width="56" height="16"/><'div class="block-2-3 red"><'img src="/xmlpages/resources/TXP/uva/studiegidswebsite/img/red-block.png" alt="" width="16" height="16"/>

**Prompt 14**: It seems that there is something wrong with the logic somewhere, as the output csv has the time_period column completely empty

**Prompt 15**: What does this line do:
red_blocks = sem_col.find_all('div', class_=lambda x: x and 'red' in x and 'hide' not in x)

**Prompt 16**: This is interesting, as I found using debug statements that tthis line captured the following multiple times:
[<'div class="hide red"><' alt="" height="16" src="/xmlpages/resources/TXP/uva/studiegidswebsite/img/red-block.png" width="16"/></'div>]