# Scraping the UCSD course catalog

The UCSD course catalog contains 6,000+ course descriptions across 80+ pages. The HTML is messy, inconsistent, and full of surprises, so this was lots of fun.


First, I get the links to the individual program pages (e.g. CSE, LANG, MATH) from the "cover" of the course catalog. I just grab every link labeled "courses", which I detect by an `href` that starts with `../courses`.


In [2]:
import requests
import bs4
from bs4 import BeautifulSoup
import json
from collections import defaultdict
from tabulate import tabulate
import re

page = requests.get("https://catalog.ucsd.edu/front/courses.html")
cover = BeautifulSoup(page.text, "html.parser")
links = cover.find_all("a", href=lambda href: href and href.startswith("../courses"))
urls = ["https://catalog.ucsd.edu" + link["href"][2:] for link in links]


def get_dept_abbr(url):
    # drop the 33-character prefix "https://catalog.ucsd.edu/courses/" and the ".html" suffix
    return url[33:-5].upper()


with open("urls.txt", "w") as f:
    for url in urls:
        f.write(url + "\n")
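
The slice `url[33:-5]` depends on the exact length of the `https://catalog.ucsd.edu/courses/` prefix. A minimal sketch of a less position-dependent alternative, using only the standard library; `dept_abbr_from_url` is a hypothetical name, not something the notebook defines:

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def dept_abbr_from_url(url: str) -> str:
    # Hypothetical helper: take the last path segment, drop its extension, uppercase it.
    return PurePosixPath(urlparse(url).path).stem.upper()


print(dept_abbr_from_url("https://catalog.ucsd.edu/courses/CSE.html"))  # CSE
```

For the catalog URLs built above, this should agree with the slice-based version.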


In [3]:
# fetch all the pages (takes like 20-30 seconds)
pages = [BeautifulSoup(requests.get(url).content, "html.parser") for url in urls]


I used the code below to list the rarest combinations of tag name and classes among the direct children of each page's content `div`, along with the pages they appear on. It helped me find all sorts of oddities and edge cases.


In [4]:
def key(tag):
    return (tag.name, tuple(tag.attrs.get("class") or []))


tag_types = {}

with open("edge-cases.txt", "w") as f:
    for url, page in zip(urls, pages):
        # f.write('\n\n==========================================\n\n' + url + '\n\n==========================================\n\n')
        content = page.find(class_="col-md-12 blank-slate").children
        for tag in content:
            if not isinstance(tag, bs4.element.Tag):
                continue

            k = key(tag)
            if k not in tag_types:
                tag_types[k] = [0, set()]
            tag_types[k][0] += 1
            tag_types[k][1].add(url)

    data = [[v[0], k[0], " ".join(k[1]), " ".join(v[1])] for k, v in tag_types.items()]
    data.sort()

    f.write(tabulate(data, headers=["Count", "Tag", "Classes", "URLs"]))
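
The counting idea in isolation, as a rough sketch: tally each (tag name, classes) key and sort ascending so the rarest combinations surface first. The keys below are made up for illustration and stand in for `key(tag)` on real pages.

```python
from collections import Counter

# Hypothetical (tag name, classes) keys, standing in for key(tag) on real pages.
keys = [
    ("p", ("course-name",)),
    ("p", ("made-up-class",)),
    ("p", ("course-name",)),
    ("p", ("made-up-class",)),
    ("h2", ()),
    ("p", ("course-note",)),
]

counts = Counter(keys)
# Sort ascending by count so the rarest combinations come first.
rarest_first = sorted(counts.items(), key=lambda item: item[1])

for (name, classes), count in rarest_first:
    print(count, name, " ".join(classes))
```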


The HTML is messy, but at least every page is structured similarly: all of the info I care about is in a `div` with class `"col-md-12 blank-slate"`. After that, I classify each element as either a header (if so, which one), a course name (e.g. CSE 100. Advanced Data Structures), a course number (e.g. `cse100`, usually found in an `a` tag), or text. Text is assigned to the last seen course/header. Text that is assigned to a header must be between the header and its corresponding courses, so it will end up containing notes that apply to all the following courses.


In [5]:
# # classes i definitely dont care about
# ignored_classes = set([
#     'course-head',
#     'basic-offset-top-only',
#     'faculty-staff-subhead',
#     'note',
#     'course-note',
#     'alphabreak',
#     'sectionNav',
#     'program-contact-info',
#     'anchor-parent',
#     'courseFacLink',
#     'course-list-overview',
#     'course-prerequisite-paragraph',
#     'course-list-courses',
#     'course-disclaimer',
# ])

# # tag names i definitely dont care about
# ignored_tag_names = set([
#     'ul', 'table', 'a'
# ])
    

def get_type(tag) -> str:
    if type(tag) is bs4.element.NavigableString:
        return None if tag.string.strip() == '' else 'plain-string'
    
    if type(tag) is not bs4.element.Tag:
        return None
    
    if re.fullmatch(r"h[1-6]", tag.name):  # headings only, so a tag like <hr> is not mistaken for one
        return tag.name

    classes = tag.attrs.get("class")
    if not classes:
        return "text"

    if classes[0] == "course-name":
        return "name"

    if classes[0] == "anchor-parent":
        return "num"

    return "text"


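def get_tags(page: BeautifulSoup):
    tags = page.find(class_="col-md-12 blank-slate").children
    # classify each child once, then drop the ones with no type (e.g. whitespace-only strings)
    typed = ((tag, get_type(tag)) for tag in tags)
    return [(tag, tag_type) for tag, tag_type in typed if tag_type is not None]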
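
The "text is assigned to the last seen course/header" rule can be sketched on a toy stream of already-classified items. The items below are hypothetical; the notebook's `get_clean_data` applies the same idea to real tags.

```python
# Hypothetical pre-classified stream of (type, text) pairs, in page order.
items = [
    ("h2", "Lower Division"),
    ("text", "Note that applies to every lower-division course."),
    ("name", "CSE 100. Advanced Data Structures"),
    ("text", "First paragraph of the description."),
    ("text", "Second paragraph."),
]

records = []
for kind, text in items:
    if kind.startswith("h") or kind == "name":
        # Headers and course names open a new record.
        records.append({"type": kind, "title": text, "desc": ""})
    elif records:
        # Text attaches to whichever record was opened most recently.
        last = records[-1]
        last["desc"] += ("\n" if last["desc"] else "") + text

for record in records:
    print(record)
```

Text that follows a header but precedes its first course lands in the header's `desc`, which is how section-wide notes end up on the header.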


Now I extract and clean the raw data and normalize the course IDs; each course gets parsed further in a later step. The most important thing I rely on is that each course title is a paragraph with class `course-name`. I use the headers to give the courses categories. I write everything to a JSON file so I can manually fix typos and formatting and add or remove content. The JSON is a list of the 82 pages, each of which holds a list of headers and courses.

Remember to output to `pages-data-2.json` first and diff it with `pages-data.json`, so the manual fixes in `pages-data.json` are not overwritten.


In [6]:
def get_clean_data(page: BeautifulSoup):
    def get_backlinks(title):
        first_period = title.find('.')
        if first_period != -1:
            ret = re.split(r', |/', title[:first_period])
            if all(re.fullmatch(r'[A-Z]+ \d+[A-Z]*', s) for s in ret):
                return ret
        
        print('add backlinks: ' + title)
        return []

    def id_to_backlink(id: str):
        id = id.upper()
        if id.find('-') != -1:
            id = id.replace('-', ' ')
        else:
            for i, char in enumerate(id):
                if char.isdigit():
                    id = id[:i] + " " + id[i:]
                    break
        
        if not re.fullmatch(r'[A-Z]+ \d+[A-Z]*', id):
            print('fix id: ' + id)
        return id
    
    # if this tag (or any tag it contains) has a matching id and name, return the id
    def get_course_num(tag):
        def first_child(tag): # returns first child of a tag that is also a tag
            for c in tag.children:
                if type(c) is bs4.element.Tag:
                    return c

        if type(tag) is not bs4.element.Tag:
            return
        while tag:
            id = tag.get('id')
            name = tag.get('name')
            if id == '': id = None
            if name == '': name = None
            if tag.attrs.get('class') == ['anchor']:
                return id or name
            if id and name and id == name:
                return id
            tag = first_child(tag)


    # list of course numbers
    # cleared out after each header/course name
    # used to label courses
    cur_backlinks = []

    ret = [] # list of courses and headers in JSON format

    seen_courses_header = False
    started = False
    for tag, tag_type in get_tags(page):
        
        if not started:
            # start at the first .course-name, the first .anchor-parent, or the first header after "Courses"
            if type(tag) is not bs4.element.Tag:
                continue
            if tag.text == 'Courses':
                seen_courses_header = True
                continue

            if tag_type in ('name', 'num'):
                started = True
            elif tag_type[0] == 'h' and seen_courses_header:
                started = True
            else:
                continue
        
        if tag_type[0] == 'h':
            title = tag.text.strip()
            if title == title.upper():
                title = title.title()
            ret.append({
                'type': tag_type,
                'content': title,
                'desc': '',
            })
            cur_backlinks = []
        elif tag_type == 'name':
            title = tag.text.strip()
            if title == title.upper():
                title = title.title()
            if '.' not in title:
                print('add period: ' + title)
            ret.append({
                'type': 'course',
                'title': title,
                'desc': '',
                'backlinks': cur_backlinks[:] if cur_backlinks else get_backlinks(title)
            })
            cur_backlinks = []
        elif get_course_num(tag):
            cur_backlinks.append(id_to_backlink(get_course_num(tag)))
        elif ret:
            text = tag.text if tag_type == 'text' else tag.string
            text = text.strip()
            if text:
                if ret[-1]['desc'] != '':
                    ret[-1]['desc'] += '\n'
                ret[-1]['desc'] += text
    
    return ret

def get_h1(page: BeautifulSoup):
    return page.find('h1').text

pages_data = [{
    'url': url,
    'dept': get_h1(page),
    'deptAbbr': get_dept_abbr(url),
    'content': get_clean_data(page)
} for url, page in zip(urls, pages)]

with open('pages-data-2.json', 'w') as f:
    json.dump(pages_data, f)



fix id: CLAS 196A B
add period: COMM 114M CSI: Communication and the Law (4)
fix id: EDS128 A
add period: GPCO 468: Evaluating Technological Innovation (4)
add period: GPPS 481: The Political Economy of Authoritarian Regimes (4)
add period: HIUS 178/278 The Atlantic World, 1400–1800 (4)
fix id: HLP
fix id: DS
fix id: LITWORLD
fix id: SPACER
add backlinks: Electives. Varies (12)
add backlinks: HIGR 236A-B. Seminar in History of Science (4-4)
add backlinks: HISC 163/263. History, Science, and Politics of Climate Change (4)
add backlinks: HISC 167/267. Gender and Science (4)
add backlinks: HISC 173/273. Seminar on Darwin and Darwinisms (4)
add backlinks: HISC 180/280. Science and Public Policy (4)
fix id: SIO 182B2
fix id: SIOB 273A2
add period: SOCI 123 Japanese Culture Inside/Out: A Transnational Perspective (4)
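
Several of the `add backlinks` lines above are slash-separated titles such as `HISC 163/263`, where the second part is a bare number. One possible way to expand them automatically, as a sketch only: `expand_codes` is a hypothetical helper, and it assumes a bare number shares the subject code of the entry before it, which is a guess about the catalog's shorthand.

```python
import re

CODE = re.compile(r"[A-Z]+ \d+[A-Z]*")


def expand_codes(prefix: str) -> list[str]:
    """Hypothetical helper: turn 'HISC 163/263' into ['HISC 163', 'HISC 263']."""
    codes = []
    subject = None
    for part in re.split(r", |/", prefix):
        if CODE.fullmatch(part):
            subject = part.split(" ")[0]
        elif subject and re.fullmatch(r"\d+[A-Z]*", part):
            # A bare number inherits the subject of the code before it.
            part = f"{subject} {part}"
        else:
            return []  # still needs manual attention
        codes.append(part)
    return codes


print(expand_codes("HISC 163/263"))
print(expand_codes("HIGR 236A-B"))
```

Ranges like `236A-B` are deliberately left alone here and still come back empty.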


I manually did the following cleanup:
- Added the missing period to the 5 titles flagged by `add period` above (searchable with the regex `"type": "course",\n.*"title": [^.]*$`)
- Removed the 2 instances of `"Not offered until"`
- Replaced the regex ` [Pp]rerequisites: none\.?"` with `"`
- Replaced the regex `(\\n){2,}` with `\n`
- Fixed periods with no space after them, found with the regex `\.[A-Z][a-z]`
- Replaced colon-space-space with colon-space
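
I made these edits by hand in the JSON file, but most of them could be approximated in code. A rough sketch, assuming it runs on decoded description strings rather than on the raw JSON text (so the patterns differ slightly from the ones listed above); `clean_desc` is a hypothetical name:

```python
import re


def clean_desc(desc: str) -> str:
    # Drop a trailing "Prerequisites: none." (with or without the final period).
    desc = re.sub(r"\s*[Pp]rerequisites: none\.?$", "", desc)
    # Collapse runs of blank lines into a single newline.
    desc = re.sub(r"\n{2,}", "\n", desc)
    # Insert the missing space after a period that runs into the next sentence.
    desc = re.sub(r"\.(?=[A-Z][a-z])", ". ", desc)
    # Colon followed by two spaces becomes colon followed by one space.
    return desc.replace(":  ", ": ")


sample = "Intro to sets.Logic and proofs.\n\n\nNote:  bring a laptop. Prerequisites: none."
print(clean_desc(sample))
```

The period rule can misfire on abbreviations, so reviewing its matches by hand is still safer than applying it blindly.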

In [9]:

with open("pages-data.json", "r") as f:
    pages_data = json.load(f)

# courses: list[str] = []
# for page in pages_data:
#     for course in page['content']:
#         if course['type'] == 'course':
#             courses.append(course['desc'])

# courses.sort(key=lambda x: -len(x))

# for course in courses:
#     print(course)
#     print('\n' * 4 + '-'*100 + '\n' * 5)

# print the header outline of every page
for page_data in pages_data:
    print("\n\n" + page_data["url"] + "\n")
    for obj in page_data["content"]:
        if obj["type"][0] == "h":
            print("#" * int(obj["type"][1]) + " " + obj["content"])

with open('temp.json', 'w') as f:
    json.dump(pages_data, f)




https://catalog.ucsd.edu/courses/AIP.html



https://catalog.ucsd.edu/courses/AASM.html

## Lower Division
## Upper Division


https://catalog.ucsd.edu/courses/AWP.html

## Lower Division
## Upper Division
## Graduate


https://catalog.ucsd.edu/courses/ANTH.html

## Lower Division
## Upper Division
## Anthropology: Archaeology
## Anthropology: Biological Anthropology
## Anthropology: Sociocultural Anthropology
## Graduate


https://catalog.ucsd.edu/courses/AUDL.html



https://catalog.ucsd.edu/courses/BIOI.html



https://catalog.ucsd.edu/courses/BIOL.html

## Lower Division
## Upper Division
### Biochemistry
### Genetics, Cellular and Developmental Biology of Plants and Animals
### Ecology, Behavior, and Evolution
### Molecular Biology, Microbiology
### Physiology and Neuroscience
### Special Courses
## Graduate


https://catalog.ucsd.edu/courses/BIOM.html



https://catalog.ucsd.edu/courses/CHEM.html

## Lower Division
## Upper Division
## Graduate


https://catalog.ucsd.edu/course

Now, I factor out all of the following metadata about each course:

- short/long name
- units
- sub-category (maybe later)
- department
- link
- description (everything before "Prerequisites: ")
- details (everything after)

I will still group the courses by department, like the original course catalog does.


In [24]:
with open("pages-data.json", "r") as f:
    parsed_data = json.load(f)

# cur_path = []  # [h2] or [h2, h3] or [h2, h3, h4]

course_id = 0

course_urls = []

for dept in parsed_data:
    del dept['url']
    for course in dept['content']:
        if course['type'] != 'course':
            continue
        
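        # parse title, e.g. "CSE 100. Advanced Data Structures (4)"
        title = course['title']
        i = title.find('.')  # end of the short name
        j = title.find('(')  # start of the units
        if j == -1: j = len(title)  # no units listed
        short_name = title[:i].strip()
        long_name = title[i+1 : j].strip()
        units = title[j+1 : -1]  # drops the closing parenthesis; empty if no units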

        del course['title']
        course['shortName'] = short_name
        course['longName'] = long_name
        course['units'] = units

        # parse description
        desc = course['desc']
        i = desc.find('Prerequisites:')
        if i == -1: i = len(desc)
        details = desc[i:].strip()
        desc = desc[:i].strip()
        course['desc'] = desc
        course['details'] = details
        course['id'] = course_id

        # give this course a url using a backlink TODO
        url = 'ucsdcourses.vercel.app/c/'
        course['url'] = url
        course_urls.append(url)
        course_id += 1

        # if len(course['backlinks']) == 1 and course['backlinks'][0] != course['shortName']:
        #     print(course['shortName'])

with open('parsed-data-2.json', 'w') as f:
    json.dump(parsed_data, f)
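
The two splitting steps, pulled out into standalone functions as a sketch. `split_title` and `split_desc` are hypothetical names; the logic mirrors the cell above, except that the description split uses `str.partition`.

```python
def split_title(title: str) -> dict:
    i = title.find(".")  # end of the short name
    j = title.find("(")  # start of the units
    if j == -1:
        j = len(title)
    return {
        "shortName": title[:i].strip(),
        "longName": title[i + 1 : j].strip(),
        "units": title[j + 1 : -1],
    }


def split_desc(desc: str) -> tuple[str, str]:
    # Everything before "Prerequisites:" is the description; the rest is details.
    before, marker, after = desc.partition("Prerequisites:")
    return before.strip(), (marker + after).strip()


print(split_title("HISC 167/267. Gender and Science (4)"))
print(split_desc("Entropy. Compression. Prerequisites: MATH 20B and CSE 21."))
```

A title with no period makes `find` return -1, so the short name would swallow almost the whole title; that is why the missing periods had to be added before this step.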



Now, it's time to link the courses together.

First, I get the subject codes (e.g. CSE, LANG) and the major codes (e.g. CS27, UNHA) from their respective pages.

In [19]:
# major codes
soup = BeautifulSoup(
    requests.get("https://blink.ucsd.edu/instructors/academic-info/majors/major-codes.html").content, "html.parser"
)
major_codes = []
for row in soup.find_all("tr"):
    cells = row.find_all("td")
    # skip header rows and rows whose code cell is too short to be a major code
    if len(cells) < 2 or len(cells[-2].text) < 3:
        continue
    major_codes.append([cells[-2].text, cells[-1].text])


# subject codes
soup = BeautifulSoup(
    requests.get("https://blink.ucsd.edu/instructors/courses/schedule-of-classes/subject-codes.html").content,
    "html.parser",
)
subject_codes = []
for row in soup.find_all("tr"):
    cells = row.find_all("td")
    if not cells or len(cells) < 2:
        continue
    subject_codes.append([cells[0].text, cells[1].text])

with open('major-codes.json', 'w') as f:
    json.dump(major_codes, f)

with open('subject-codes.json', 'w') as f:
    json.dump(subject_codes, f)

# manually remove "(see ...)"

In [20]:
with open('major-codes.json', 'r') as f:
    major_codes = json.load(f)

with open('subject-codes.json', 'r') as f:
    subject_codes = json.load(f)

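Then, I index the backlinks: each course code maps to the id of a single course. When several courses claim the same code, I keep the one with the fewest backlinks.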

In [27]:
with open('parsed-data.json') as f:
    svelte_markup = json.load(f)

backlinks = defaultdict(list)

for dept in svelte_markup:
    for course in dept['content']:
        if course['type'] != 'course':
            continue
        
        for link in course['backlinks']:
            backlinks[link].append((len(course['backlinks']), course['id']))

backlinks = {link: min(ids)[1] for link, ids in backlinks.items()}
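
A toy run of the same indexing logic, with two hypothetical courses that both claim `POLI 12`, shows how `min` resolves the tie:

```python
from collections import defaultdict

# Hypothetical courses: a shared listing and a dedicated entry for the same code.
courses = [
    {"id": 0, "backlinks": ["POLI 12", "POLI 12D"]},
    {"id": 1, "backlinks": ["POLI 12"]},
]

candidates = defaultdict(list)
for course in courses:
    for code in course["backlinks"]:
        candidates[code].append((len(course["backlinks"]), course["id"]))

# min() compares tuples left to right: fewest backlinks first, then lowest id.
index = {code: min(pairs)[1] for code, pairs in candidates.items()}
print(index)  # {'POLI 12': 1, 'POLI 12D': 0}
```

The course that lists only `POLI 12` wins that code, since it is the more specific entry.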

In [41]:
s = 'The issues of war/peace, nationalism/internationalism, and economic growth/redistribution will be examined in both historical and theoretical perspectives. POLI 12 is Lecture only, and POLI 12D is Lecture plus Discussion section. These courses are equivalents of each other in regard to major requirements, and students may not receive credit for both 12 and 12D.'
for m in reversed(list(re.finditer(r'[A-Z]{2,} \d+[A-Z]*', s))):
    i, j = m.span()
    link = s[i:j]
    if link not in backlinks:
        continue
    course_link = f'<CourseLink id=backlinks["{s[i:j]}"]>'
    s = s[:i] + course_link + s[j:]
s

'The issues of war/peace, nationalism/internationalism, and economic growth/redistribution will be examined in both historical and theoretical perspectives. <CourseLink id=backlinks["POLI 12"]> is Lecture only, and <CourseLink id=backlinks["POLI 12D"]> is Lecture plus Discussion section. These courses are equivalents of each other in regard to major requirements, and students may not receive credit for both 12 and 12D.'

Finally, I add Svelte components to each description: a `<MajorLink code={code} name={name} />` for every major code, and a `<CourseLink id={id} />` for every course code found in the backlink index.


In [51]:
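def add_markup(desc):
    # wrap every known course code in a CourseLink component, working right to left
    # so earlier match positions stay valid as the string grows
    for m in reversed(list(re.finditer(r'[A-Z]{2,} \d+[A-Z]*', desc))):
        i, j = m.span()
        link = desc[i:j]
        if link not in backlinks:
            continue
        course_link = f'<CourseLink id={backlinks[link]} />'
        desc = desc[:i] + course_link + desc[j:]

    # wrap every known major code in a MajorLink component
    for code, major in major_codes:
        desc = desc.replace(code, f'<MajorLink code="{code}" name="{major}" />')

    return desc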

# for dept in svelte_markup:
#     for course in dept['content']:
#         if course['type'] != 'course':
#             continue

In [52]:
test = 'Distributions over the real line. Independence, expectation, conditional expectation, mean, variance. Hypothesis testing. Learning classifiers. Distributions over R^n, covariance matrix. Binomial, Poisson distributions. Chernoff bound. Entropy. Compression. Arithmetic coding. Maximal likelihood estimation. Bayesian estimation. CSE 103 is not duplicate credit for ECE 109, ECON 120A, or MATH 183. Prerequisites: MATH 20B and CSE 21 or MATH 154 or MATH 158 or MATH 184 or MATH 188; restricted to CS25, CS26, CS27, and CS28 majors. Other students will be allowed as space permits.'
add_markup(test)

'Distributions over the real line. Independence, expectation, conditional expectation, mean, variance. Hypothesis testing. Learning classifiers. Distributions over R^n, covariance matrix. Binomial, Poisson distributions. Chernoff bound. Entropy. Compression. Arithmetic coding. Maximal likelihood estimation. Bayesian estimation. <CourseLink id=1841 /> is not duplicate credit for <CourseLink id=2010 />, <CourseLink id=1372 />, or <CourseLink id=4620 />. Prerequisites: <CourseLink id=4546 /> and <CourseLink id=1829 /> or <CourseLink id=4591 /> or <CourseLink id=4595 /> or <CourseLink id=4621 /> or <CourseLink id=4626 />; restricted to <MajorLink code="CS25" name="Computer Engineering" />, <MajorLink code="CS26" name="Computer Science (B.S.)" />, <MajorLink code="CS27" name="Computer Science with a Specialization in Bioinformatics" />, and CS28 majors. Other students will be allowed as space permits.'
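
Repeated `str.replace` calls rescan markup that earlier replacements inserted. A single-pass variant with `re.sub` and a callback avoids that; this is a sketch of an alternative, not what the notebook runs. `add_markup_once` and its dictionary arguments are hypothetical, and the pattern assumes major codes are four-character tokens like `CS27` or `UNHA`.

```python
import re

# Assumption: course codes look like "MATH 154" and major codes are four
# characters such as "CS27" or "UNHA".
TOKEN = re.compile(r"\b[A-Z]{2,} \d+[A-Z]*\b|\b[A-Z0-9]{4}\b")


def add_markup_once(desc: str, major_names: dict, course_ids: dict) -> str:
    def replace(match: re.Match) -> str:
        token = match.group(0)
        if token in course_ids:
            return f"<CourseLink id={course_ids[token]} />"
        if token in major_names:
            return f'<MajorLink code="{token}" name="{major_names[token]}" />'
        return token  # unknown token: leave the text untouched

    return TOKEN.sub(replace, desc)


print(
    add_markup_once(
        "CSE 21 or MATH 154; restricted to CS25 and CS28 majors.",
        {"CS25": "Computer Engineering"},
        {"CSE 21": 1829, "MATH 154": 4591},
    )
)
```

As in the output above, a code missing from the lookup tables (here `CS28`) passes through unchanged.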