# Scraping with Beautiful Soup: Working notebook

## Quick note

Attributes of a tag become a dictionary in a `bs4` tag object.

```html
<a href="https://example.com/" class="no-underline">Click!</a>
```

This `a` tag has the attributes:
```python
{
    "href": "https://example.com/",
    "class": ["no-underline"]
}
```
If the tag object is `t`, then the dict above is `t.attrs`. Because `class` can hold several values, Beautiful Soup stores it as a list of strings.

But `t.attrs["href"]` can be replaced by the shorthand `t["href"]`.
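
A quick check of the two spellings, as a minimal sketch that parses the one-line HTML snippet from this note (the variable names are just for illustration). It also shows `t.get(...)`, which returns `None` for an attribute the tag does not have instead of raising `KeyError`.

```python
import bs4

html = '<a href="https://example.com/" class="no-underline">Click!</a>'
soup = bs4.BeautifulSoup(html, "html.parser")
t = soup.a  # first <a> tag in the document

href_long = t.attrs["href"]  # explicit dictionary lookup
href_short = t["href"]       # shorthand for the same lookup
classes = t["class"]         # multi-valued attribute: a list of class names
missing = t.get("title")     # no such attribute, so .get() gives None

print(href_long, href_short, classes, missing)
```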


## Exploring some Beautiful Soup features

In [3]:
import bs4
import urllib.request  # urlopen lives in the urllib.request submodule
import time
import os

In [5]:
# Once and only once, store a local copy

if os.path.exists("local_copy_dumas_io.html"):
    print("Local copy already exists")
else:
    #download
    with urllib.request.urlopen("https://www.dumas.io/") as http_fp:
        with open("local_copy_dumas_io.html","wb") as local_fp:
            local_fp.write( http_fp.read() )
    print("Downloaded a local copy!")

Local copy already exists
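
The download-once cell can be wrapped in a small helper so the same pattern is reusable for other pages. This is only a sketch: `fetch_once` is a hypothetical name, not something from `urllib` or `bs4`, and it reads the whole response before opening the output file so a failed request does not leave an empty file behind.

```python
import os
import urllib.request


def fetch_once(url, path):
    """Download url to path unless path already exists.

    Returns True if a download happened, False if the local copy was reused.
    (Hypothetical helper for this notebook.)
    """
    if os.path.exists(path):
        return False
    with urllib.request.urlopen(url) as http_fp:
        data = http_fp.read()
    # Write only after the read succeeded, so an error above
    # cannot leave behind an empty file that looks like a cached copy.
    with open(path, "wb") as local_fp:
        local_fp.write(data)
    return True
```

Usage would look like `fetch_once("https://www.dumas.io/", "local_copy_dumas_io.html")`.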


In [7]:
# Convert the local copy to a DOM (soup)
with open("local_copy_dumas_io.html","rb") as fp:
    soup = bs4.BeautifulSoup(fp,"html.parser")

In [10]:
# What kind of tag contains the section heading "Teaching"?

for tag in soup.find_all(True):
    if tag.string == "Teaching":
        print("Found it!")
        teaching_heading = tag
        break

Found it!
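
The loop-and-break search above can also be written with `find`, which accepts a function and returns the first tag for which it is true. A minimal sketch on made-up HTML (the markup here is an assumption, not the real page):

```python
import bs4

html = """
<div>
  <h3>Research</h3>
  <p>Some text</p>
  <h3>Teaching</h3>
  <ul><li>Course</li></ul>
</div>
"""
soup = bs4.BeautifulSoup(html, "html.parser")

# find() stops at the first tag for which the function returns True,
# which mirrors the for/if/break pattern.
heading = soup.find(lambda tag: tag.string == "Teaching")

# find() returns None when nothing matches, so a missing heading is easy to detect.
missing = soup.find(lambda tag: tag.string == "Publications")

print(heading.name, missing)
```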


In [12]:
# What kind of tag?
teaching_heading.name

'h3'

In [13]:
# Are there any tags inside this one?
teaching_heading.contents

['Teaching']
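
For comparison, here is how `.contents`, `.string`, and `.get_text()` behave when a tag has more than one child. This is a sketch on invented HTML; the heading mirrors the one above, and the paragraph is made up.

```python
import bs4

html = "<h3>Teaching</h3><p>Office hours: <b>Mon</b> 2pm</p>"
soup = bs4.BeautifulSoup(html, "html.parser")

h3 = soup.h3
p = soup.p

h3_children = h3.contents  # a single text node
p_children = p.contents    # text node, <b> tag, text node
p_string = p.string        # None, because there is no single string child
p_text = p.get_text()      # all text inside the tag, joined together

print(h3_children, len(p_children), p_string, p_text)
```

Since `.string` can be `None` for tags with mixed content, `get_text()` is often the safer choice when only the text matters.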

In [15]:
UIC_course_list = teaching_heading.parent.ul.ul # <---- courses at UIC
                    #              ^-div  ^- list of places
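
To see what `.parent.ul.ul` is doing, here is a sketch on a small stand-in for the page structure (the HTML is an assumption modeled on the comment above: a div holding the heading and a list of places, each with its own list of courses). It also shows that `find_all` searches every nesting level unless `recursive=False` is passed.

```python
import bs4

html = """
<div>
  <h3>Teaching</h3>
  <ul>
    <li>UIC
      <ul><li><a href="/a/">Course A</a></li><li><a href="/b/">Course B</a></li></ul>
    </li>
    <li>Elsewhere
      <ul><li>Course C</li></ul>
    </li>
  </ul>
</div>
"""
soup = bs4.BeautifulSoup(html, "html.parser")
heading = soup.h3

container = heading.parent        # the enclosing <div>
places = container.ul             # first <ul> found inside the div
first_place_courses = places.ul   # first <ul> nested inside that list

all_items = places.find_all("li")                   # every level of nesting
top_items = places.find_all("li", recursive=False)  # direct children only

print(container.name, len(all_items), len(top_items))
```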

In [18]:
UIC_courses = UIC_course_list.find_all("li")

In [19]:
len(UIC_courses)

34

In [23]:
for link in UIC_course_list.find_all("a"):
    print("Link text:",link.string)
    print("Link dest:",link["href"])
    print()

Link text: MCS 275: Programming Tools and File Management
Link dest: https://uic.blackboard.com/ultra/courses/_267469_1/outline

Link text: Math 547: Algebraic Topology I
Link dest: /teaching/2023/fall/math547/

Link text: MCS 275: Programming Tools and File Management
Link dest: /teaching/2023/spring/mcs275/

Link text: Math 549: Differentiable Manifolds I
Link dest: /teaching/2022/fall/math549/

Link text: MCS 275: Programming Tools and File Management
Link dest: /teaching/2022/spring/mcs275/

Link text: MCS 260: Introduction to Computer Science
Link dest: /teaching/2021/fall/mcs260/

Link text: MCS 275: Programming Tools and File Management
Link dest: /teaching/2021/spring/mcs275/

Link text: MCS 260: Introduction to Computer Science
Link dest: /teaching/2020/fall/mcs260/

Link text: Math 445: Introduction to Topology I
Link dest: /teaching/2019/spring/math445/

Link text: Math 550: Differentiable Manifolds II
Link dest: /teaching/2019/spring/math550/

Link text: Math 320: Linear Al
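
Most of these destinations are relative to the site root. If absolute URLs are needed, `urllib.parse.urljoin` can resolve them against the page address. A small sketch using two of the `href` values printed above, assuming `https://www.dumas.io/` (the page that was downloaded) is the right base:

```python
from urllib.parse import urljoin

base = "https://www.dumas.io/"  # address of the page the links came from
hrefs = [
    "/teaching/2023/fall/math547/",
    "https://uic.blackboard.com/ultra/courses/_267469_1/outline",
]

# Relative paths are resolved against the base; absolute URLs pass through unchanged.
absolute = [urljoin(base, h) for h in hrefs]

for url in absolute:
    print(url)
```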

## UIC academic calendar scraper work

In [24]:
import bs4
import urllib.request  # urlopen lives in the urllib.request submodule
import time
import os
# Once and only once, store a local copy

if os.path.exists("uic_academic_calendar.html"):
    print("Local copy already exists")
else:
    #download
    with urllib.request.urlopen("https://catalog.uic.edu/ucat/academic-calendar/") as http_fp:
        with open("uic_academic_calendar.html","wb") as local_fp:
            local_fp.write( http_fp.read() )
    print("Downloaded a local copy!")

Downloaded a local copy!


In [25]:
with open("uic_academic_calendar.html","rb") as fp:
    soup = bs4.BeautifulSoup(fp,"html.parser")

In [27]:
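# Find the h2 that introduces the 2023-2024 calendar
target_h2 = None  # reset, so a stale value from an earlier run is not reused
for tag in soup.find_all("h2"):
    if "2023-2024 academic calendar" in tag.text.lower():
        target_h2 = tag
        break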

In [30]:
ay_table = target_h2.find_next_sibling("table")  # first table after the h2
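
`find_next_sibling` only looks at later tags that share the same parent, and it skips siblings that do not match. A minimal sketch on invented HTML (the real catalog page may be laid out differently):

```python
import bs4

html = """
<div>
  <h2>2022-2023 Academic Calendar</h2>
  <table id="old"><tr><td>x</td></tr></table>
  <h2>2023-2024 Academic Calendar</h2>
  <p>Dates are subject to change.</p>
  <table id="new"><tr><td>y</td></tr></table>
</div>
"""
soup = bs4.BeautifulSoup(html, "html.parser")
h2 = soup.find_all("h2")[1]  # the second heading

# Skips the <p> and ignores the table that comes before this heading.
table = h2.find_next_sibling("table")

# No further tables follow at this level, so the result is None.
none_left = table.find_next_sibling("table")

print(table["id"], none_left)
```

If the table were nested inside another tag rather than being a sibling of the heading, `find_next("table")` would be the method to try instead.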

In [42]:
import csv

with open("uic_ay2023_2024.csv","w",newline="",encoding="UTF-8") as fp:
    writer = csv.writer(fp)
    semester = None  # current semester heading; None means "skip these rows"
    for row in ay_table.find_all("tr"):
        tds = row.find_all("td")
        if len(tds) != 2:
            continue  # not a two-column row
        if tds[1].string is None:
            # Second cell has no string: treat the row as a semester heading
            if "summer" in tds[0].string.lower():
                semester = None  # how we'll ignore summer
            else:
                semester = tds[0].string
        else:
            # Ordinary row: write it out under the current semester
            if semester is not None:
                writer.writerow([semester,tds[0].string,tds[1].string])
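
To check the row logic without the downloaded page, the same loop can be run on a tiny made-up table and written to an in-memory buffer. Everything in the HTML below is placeholder data, and it assumes heading rows have an empty second cell, as the loop above expects.

```python
import csv
import io

import bs4

# Placeholder table: heading rows have an empty second cell.
html = """
<table>
  <tr><td>Fall Semester</td><td></td></tr>
  <tr><td>Date A</td><td>Event A</td></tr>
  <tr><td>Summer Session</td><td></td></tr>
  <tr><td>Date B</td><td>Event B</td></tr>
  <tr><td>Spring Semester</td><td></td></tr>
  <tr><td>Date C</td><td>Event C</td></tr>
  <tr><td colspan="2">A one-cell row that should be skipped</td></tr>
</table>
"""
table = bs4.BeautifulSoup(html, "html.parser").table

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
semester = None
for row in table.find_all("tr"):
    tds = row.find_all("td")
    if len(tds) != 2:
        continue
    if tds[1].string is None:
        # Heading row: start a new semester, or switch off output for summer
        if "summer" in tds[0].string.lower():
            semester = None
        else:
            semester = tds[0].string
    elif semester is not None:
        writer.writerow([semester, tds[0].string, tds[1].string])

rows = buffer.getvalue().splitlines()
for line in rows:
    print(line)
```

Only the fall and spring rows should appear in the output; the summer row and the one-cell row are dropped.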