## Level 1 parsing: html.parser

An event-based processor that converts HTML text into a stream of logical items (start tags, end tags, text, etc.).

In [2]:
from urllib.request import urlopen
import time
from html.parser import HTMLParser

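class RecordTitleHTMLParser(HTMLParser):
    """Print every parse event and record the text inside <title>."""
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.page_title = ""
        self.looking = False  # True while we are inside the <title> tag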
        
    def handle_starttag(self, tag, attrs):
        print("Encountered a start tag:", tag)
        if tag == "title":
            self.looking = True

    def handle_endtag(self, tag):
        print("Encountered an end tag :", tag)
        if tag == "title":
            self.looking = False

    def handle_data(self, data):
        print("Encountered some data  :", data)
        if self.looking:
            self.page_title += data


with urlopen("http://example.com") as fp:
    # Decode using the charset from the HTTP headers, falling back to UTF-8
    html_str = fp.read().decode(fp.headers.get_content_charset(failobj="utf-8"))
    
parser = RecordTitleHTMLParser()
parser.feed(html_str)

Encountered some data  : 

Encountered a start tag: html
Encountered some data  : 

Encountered a start tag: head
Encountered some data  : 
    
Encountered a start tag: title
Encountered some data  : Example Domain
Encountered an end tag : title
Encountered some data  : 

    
Encountered a start tag: meta
Encountered an end tag : meta
Encountered some data  : 
    
Encountered a start tag: meta
Encountered an end tag : meta
Encountered some data  : 
    
Encountered a start tag: meta
Encountered an end tag : meta
Encountered some data  : 
    
Encountered a start tag: style
Encountered some data  : 
    body {
        background-color: #f0f0f2;
        margin: 0;
        padding: 0;
        font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;
        
    }
    div {
        width: 600px;
        margin: 5em auto;
        padding: 2em;
        background-color: #fdfdff;
        border-radius: 0.5em;
      

In [3]:
print("The title of the page was: ",parser.page_title)

The title of the page was:  Example Domain
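
The same event stream can be collected in a list instead of printed, which also shows what the `attrs` argument looks like. This is a minimal sketch that needs no network access; `EventLogHTMLParser`, `sample_html`, and `log_parser` are made-up names for illustration, and whitespace-only text is skipped to keep the log short.

```python
from html.parser import HTMLParser

class EventLogHTMLParser(HTMLParser):
    """Hypothetical helper: store parse events in a list instead of printing."""
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.events = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs; store it as a dict
        self.events.append(("start", tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Skip text that is only whitespace (newlines, indentation)
        if data.strip():
            self.events.append(("data", data.strip()))

sample_html = '<p>See <a href="https://example.com/">this page</a>.</p>'
log_parser = EventLogHTMLParser()
log_parser.feed(sample_html)
log_parser.close()  # flush any buffered text

for event in log_parser.events:
    print(event)
```

Each event is a tuple whose first entry names the event type, so later code can filter on it (for example, keeping only the `"start"` events whose tag is `"a"` to collect links).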


## Level 2 parsing: Convert to a DOM
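
A DOM represents the document as a tree of nested elements rather than a flat stream of events. As a rough sketch of the idea (not how Beautiful Soup is actually implemented), the Level 1 events can be assembled into a tree using a stack of currently open tags. The names `Node`, `TreeBuilderHTMLParser`, and `VOID_TAGS` are hypothetical, and the list of void tags is deliberately incomplete.

```python
from html.parser import HTMLParser

# Assumption: a small, incomplete set of elements that never have an end tag
VOID_TAGS = {"meta", "br", "img", "link", "hr", "input"}

class Node:
    """Hypothetical tree node: a tag with attributes and children."""
    def __init__(self, tag, attrs=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.children = []  # mix of Node objects and text strings

    def find(self, tag):
        """Return the first descendant with the given tag, or None."""
        for child in self.children:
            if isinstance(child, Node):
                if child.tag == tag:
                    return child
                found = child.find(tag)
                if found is not None:
                    return found
        return None

    def get_text(self):
        """Concatenate all text inside this node."""
        return "".join(
            c if isinstance(c, str) else c.get_text() for c in self.children
        )

class TreeBuilderHTMLParser(HTMLParser):
    """Build a tree of Node objects from parse events."""
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.root = Node("[document]")
        self.stack = [self.root]  # tags that are currently open

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest matching open tag; ignore stray end tags
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        self.stack[-1].children.append(data)

sample_html = (
    "<html><head><title>Example Domain</title><meta charset='utf-8'></head>"
    "<body><h1>Hello</h1></body></html>"
)
builder = TreeBuilderHTMLParser()
builder.feed(sample_html)
builder.close()
tree = builder.root

print(tree.find("title").get_text())
print(tree.find("h1").get_text())
```

Once the tree exists, questions like "what is the title?" become searches on an object instead of bookkeeping inside event handlers. A real DOM library handles many more cases (malformed markup, comments, entities) than this sketch does.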

In [5]:
import time
from urllib.request import urlopen
from bs4 import BeautifulSoup

time.sleep(1)  # brief pause between requests
with urlopen("http://example.com/") as fp:
    soup_ex = BeautifulSoup(fp,"html.parser")

In [7]:
# Full HTML
print("Got an object of type:")
print(type(soup_ex))  # a bare expression mid-cell is not displayed, so print it
print("\nHere is the full HTML code it corresponds to:")
print(soup_ex)

Got an object of type:
<class 'bs4.BeautifulSoup'>

Here is the full HTML code it corresponds to:
<!DOCTYPE html>

<html>
<head>
<title>Example Domain</title>
<meta charset="utf-8"/>
<meta content="text/html; charset=utf-8" http-equiv="Content-type"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<style type="text/css">
    body {
        background-color: #f0f0f2;
        margin: 0;
        padding: 0;
        font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;
        
    }
    div {
        width: 600px;
        margin: 5em auto;
        padding: 2em;
        background-color: #fdfdff;
        border-radius: 0.5em;
        box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);
    }
    a:link, a:visited {
        color: #38488f;
        text-decoration: none;
    }
    @media (max-width: 700px) {
        div {
            margin: 0 auto;
            width: auto;
        }
    }
    </style>
</head>
<body>
<div>
<h1

In [8]:
# Print a nicely indented form of the same
print(soup_ex.prettify())


<!DOCTYPE html>
<html>
 <head>
  <title>
   Example Domain
  </title>
  <meta charset="utf-8"/>
  <meta content="text/html; charset=utf-8" http-equiv="Content-type"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <style type="text/css">
   body {
        background-color: #f0f0f2;
        margin: 0;
        padding: 0;
        font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;
        
    }
    div {
        width: 600px;
        margin: 5em auto;
        padding: 2em;
        background-color: #fdfdff;
        border-radius: 0.5em;
        box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);
    }
    a:link, a:visited {
        color: #38488f;
        text-decoration: none;
    }
    @media (max-width: 700px) {
        div {
            margin: 0 auto;
            width: auto;
        }
    }
  </style>
 </head>
 <body>
  <div>
   <h1>
    Example Domain
   </h1>
   <p>
    This dom

In [11]:
# Page title
# (Attribute .title means "find and return the first title tag")
soup_ex.title.get_text()

'Example Domain'

### MCS 275 lecture titles

In [12]:
import time
from urllib.request import urlopen
from bs4 import BeautifulSoup

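def lecture_title(n):
    """Return the page title of the slides for MCS 275 lecture number `n`."""
    time.sleep(0.5)  # brief pause so repeated calls are spaced out
    url = "https://www.dumas.io/teaching/2024/spring/mcs275/slides/lecture{}.html".format(n)
    with urlopen(url) as fp:
        soup = BeautifulSoup(fp, "html.parser")
    return soup.title.get_text()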

In [13]:
lecture_title(23)

'Lec 23: Julia sets'

### Slide headings

In [16]:
time.sleep(0.5)
with urlopen("https://www.dumas.io/teaching/2024/spring/mcs275/slides/lecture37.html") as fp:
    soup = BeautifulSoup(fp,"html.parser")

In [17]:
for tag in soup.find_all("section"):  # section means slide
    h2 = tag.h2 # first h2 tag in a section
    if h2 is not None:
        print(h2.get_text())

Working with APIs and HTML
Getting data from the web
API usage example
HTML but no API?
Simple HTML processing
HTML document as an object
DOM
Beautiful Soup
Minimal soup
Minimal soup
Minimal soup
Scraping and spiders
Minimal soup
BS4 basics
Working with tags
Searching
Simulating CSS
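
The same `find_all` pattern can be tried without a network request by parsing a string. This sketch assumes `bs4` is installed; the HTML in `slides_html` is invented for illustration and only mimics the one-`section`-per-slide layout. Note that a section with no `h2` gives `None` for `.h2`, which is why the check is needed.

```python
from bs4 import BeautifulSoup

# Made-up miniature slide deck: the second slide has no heading
slides_html = """
<section><h2>First slide</h2><p>Some text</p></section>
<section><p>A slide without a heading</p></section>
<section><h2>Third slide</h2></section>
"""

slide_soup = BeautifulSoup(slides_html, "html.parser")

headings = []
for tag in slide_soup.find_all("section"):
    h2 = tag.h2  # first h2 anywhere inside this section, or None
    if h2 is not None:
        headings.append(h2.get_text())

print(headings)
```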
