# Beautiful Soup examples

## MCS 275 Spring 2021 - Instructor David Dumas

### Lecture 40

Beautiful Soup documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

Import the module (this only needs to be done once per session):

In [1]:
from bs4 import BeautifulSoup

Parse a single HTML file into a DOM-like data structure in a variable `soup`:

(This is one of the slide presentations from MCS 275.)

In [2]:
with open("html-for-scraping/lecture40.html") as fobj:
    soup = BeautifulSoup(fobj,"html.parser")

Get the title of that lecture (the string that is the only text node under the title tag):

In [3]:
soup.head.title.string

'Lec 40: Parsing and scraping HTML'
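
The `.string` attribute works here because the title tag has exactly one child, a text node. As a rough sketch (the markup below is made up for illustration and is not one of the lecture files), here is what happens when a tag has more than one child, and how `get_text()` can be used instead:

```python
from bs4 import BeautifulSoup

# Hypothetical markup, made up for this illustration.
doc = BeautifulSoup(
    "<html><head><title>Hello</title></head>"
    "<body><h1>Hello</h1> This is my document.</body></html>",
    "html.parser",
)

title_string = doc.title.string   # title has exactly one child, a text node
body_string = doc.body.string     # body has two children, so this is None
body_text = doc.body.get_text()   # concatenates all text found under body

print(repr(title_string))
print(repr(body_string))
print(repr(body_text))
```

So `.string` is convenient for simple tags like title, while `get_text()` is the safer choice when a tag may contain a mix of text and other tags.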

How many slides were in that lecture?

In [4]:
# each slide is a <section> tag.
len(soup.find_all("section"))

21

(This count is only approximately right; in reveal.js, nested section tags are used to create slides that appear below others, and that feature is used here.  The true slide count would be the number of section tags that don't contain other section tags.  How would you find that?)
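
One possible answer to the question above, as a minimal sketch: keep only the section tags that have no section tag anywhere inside them. The HTML string here is a made-up stand-in for a reveal.js slide deck, not the actual lecture file.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for a reveal.js deck: slides A and C are ordinary,
# while B1 and B2 are nested inside an outer section (a vertical stack).
html = """
<div class="slides">
  <section>A</section>
  <section>
    <section>B1</section>
    <section>B2</section>
  </section>
  <section>C</section>
</div>
"""
deck = BeautifulSoup(html, "html.parser")

all_sections = deck.find_all("section")
# A section is an actual slide if no other section appears inside it.
leaf_sections = [s for s in all_sections if s.find("section") is None]

print(len(all_sections), len(leaf_sections))
```

Here `find_all` counts five section tags, but only four of them are slides; the outer wrapper around B1 and B2 is excluded.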

Let's do the same thing, but for every html file in the `html-for-scraping` directory (several of the MCS 275 lectures).

In [5]:
import os

DATADIR="html-for-scraping"

for fn in os.listdir(DATADIR):
    if not fn.endswith(".html"):
        continue
    with open(os.path.join(DATADIR,fn)) as fobj:
        soup = BeautifulSoup(fobj,"html.parser")
    print(fn,soup.head.title.string)

lecture17.html Lec 17: Quicksort
lecture40.html Lec 40: Parsing and scraping HTML
lecture23.html Lec 23: CSV and JSON
lecture22.html Lec 22: set and defaultdict


Remark: A cleaner way to get all files that end in .html would be to use `glob.glob("html-for-scraping/*.html")`. But we didn't discuss the `glob` module, so I used `os.listdir`.
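
For reference, here is a rough sketch of what the `glob` version might look like. The function name `html_titles` is my own invention, the files are assumed to be UTF-8, and I've added `sorted` because neither `glob.glob` nor `os.listdir` promises any particular order (which is why the listing above is not in lecture order).

```python
import glob
import os
from bs4 import BeautifulSoup

def html_titles(datadir):
    """Return (filename, title) pairs for every .html file in datadir."""
    results = []
    # sorted() gives a predictable order; glob itself does not promise one.
    for path in sorted(glob.glob(os.path.join(datadir, "*.html"))):
        with open(path, encoding="utf-8") as fobj:  # assumes UTF-8 files
            soup = BeautifulSoup(fobj, "html.parser")
        # soup.title is None if the document has no title tag.
        title = soup.title.string if soup.title is not None else None
        results.append((os.path.basename(path), title))
    return results

for fn, title in html_titles("html-for-scraping"):
    print(fn, title)
```

Checking for a missing title tag avoids an `AttributeError` if one of the files turns out not to have one.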

## Examples with example.com front page

The next cell retrieves https://example.com/.  Be careful to avoid making frequent automated requests to any web server, and to follow a site's terms of use and robots.txt rules.  Here, I've added a 1-second delay to make sure this cell can never make more than 1 request per second.

In [6]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import time

time.sleep(1)
with urlopen("https://example.com/") as response:
    soup = BeautifulSoup(response,"html.parser")

Note: If we were going to work with the contents of this page many times, it would be better to download it to a file and then parse the file.  That way, there would only be one network request, rather than a new request each time the program is run.
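
A minimal sketch of that download-once approach is below. The function name `soup_from_cache` and the cache filename are hypothetical; the idea is simply to skip the network request whenever the file already exists.

```python
import os
import time
from urllib.request import urlopen
from bs4 import BeautifulSoup

def soup_from_cache(url, cache_path, delay=1.0):
    """Parse cache_path, downloading url into it first if it is missing."""
    if not os.path.exists(cache_path):
        time.sleep(delay)  # stay polite even if this is called in a loop
        with urlopen(url) as response:
            data = response.read()
        with open(cache_path, "wb") as fobj:
            fobj.write(data)
    # Parse the local copy; no network access is needed from here on.
    with open(cache_path, "rb") as fobj:
        return BeautifulSoup(fobj, "html.parser")

# Example call (makes a network request only the first time it is run):
# soup = soup_from_cache("https://example.com/", "example.html")
```

A real scraper would probably also want some way to refresh a stale copy, but that is beyond what this sketch tries to do.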

In [7]:
# printing a BeautifulSoup object shows the corresponding
# HTML
soup

<!DOCTYPE html>

<html>
<head>
<title>Example Domain</title>
<meta charset="utf-8"/>
<meta content="text/html; charset=utf-8" http-equiv="Content-type"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<style type="text/css">
    body {
        background-color: #f0f0f2;
        margin: 0;
        padding: 0;
        font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;
        
    }
    div {
        width: 600px;
        margin: 5em auto;
        padding: 2em;
        background-color: #fdfdff;
        border-radius: 0.5em;
        box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);
    }
    a:link, a:visited {
        color: #38488f;
        text-decoration: none;
    }
    @media (max-width: 700px) {
        div {
            margin: 0 auto;
            width: auto;
        }
    }
    </style>
</head>
<body>
<div>
<h1>Example Domain</h1>
<p>This domain is for use in illustrative example

In [8]:
# But it's actually a BeautifulSoup object, which has
# many methods and attributes.
type(soup)

bs4.BeautifulSoup

In [9]:
# First p in the document
soup.p

<p>This domain is for use in illustrative examples in documents. You may use this
    domain in literature without prior coordination or asking for permission.</p>

In [10]:
soup.find_all("p")[1]  # second p in the document

<p><a href="https://www.iana.org/domains/example">More information...</a></p>
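
Once we have a tag, we often want its attributes rather than the tag itself. As a small sketch, using a fragment abridged from the example.com output above (so that no network request is needed), the link target can be read with dictionary-style indexing:

```python
from bs4 import BeautifulSoup

# Fragment abridged from the example.com page shown above.
doc = BeautifulSoup(
    '<div><h1>Example Domain</h1>'
    '<p>This domain is for use in illustrative examples in documents.</p>'
    '<p><a href="https://www.iana.org/domains/example">More information...</a></p>'
    '</div>',
    "html.parser",
)

link = doc.find_all("p")[1].a   # the a tag inside the second p
print(link["href"])             # attribute lookup works like a dict
print(link.string)              # the text of the link
```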

In [11]:
soup.div.p  # first p tag inside the first div

<p>This domain is for use in illustrative examples in documents. You may use this
    domain in literature without prior coordination or asking for permission.</p>

In [12]:
# Is the first p that appears in a div actually the first p
# in the whole document?  (== compares markup; `is` would
# test whether both are the same object in the tree.)
soup.div.p == soup.p

True
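
A caveat about the comparison above: Beautiful Soup considers two tags equal when they represent the same markup, even if they sit in different places in the document. The following sketch, with made-up markup, shows the difference between `==` and `is`:

```python
from bs4 import BeautifulSoup

# Made-up markup containing two p tags with identical contents.
doc = BeautifulSoup("<div><p>same</p></div><p>same</p>", "html.parser")
first_p, second_p = doc.find_all("p")

same_markup = (first_p == second_p)   # True: identical markup
same_object = (first_p is second_p)   # False: two different tags in the tree
first_again = (doc.div.p is doc.p)    # True: both expressions reach the first p

print(same_markup, same_object, first_again)
```

So to check whether two expressions lead to the very same tag in the tree, `is` is the more precise test.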

## Examples with an HTML string

In [13]:
soup = BeautifulSoup("""
<html><head><title>Hello</title>
<body><h1>Hello</h1> This is my document.
<strong>Mine.</strong></body></html>""","html.parser")

In [14]:
soup


<html><head><title>Hello</title>
<body><h1>Hello</h1> This is my document.
<strong>Mine.</strong></body></head></html>

In [15]:
print(soup.prettify())

<html>
 <head>
  <title>
   Hello
  </title>
  <body>
   <h1>
    Hello
   </h1>
   This is my document.
   <strong>
    Mine.
   </strong>
  </body>
 </head>
</html>
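
Notice that in the output above, the body tag ended up nested inside head. The string we parsed never closes the head tag, and `html.parser` does not repair that for us. As a quick sketch of the difference an explicit `</head>` makes (other parsers that Beautiful Soup supports may handle the unclosed case differently):

```python
from bs4 import BeautifulSoup

# Same kind of document as above, with head never closed.
unclosed = BeautifulSoup(
    "<html><head><title>Hello</title>"
    "<body><h1>Hello</h1></body></html>",
    "html.parser",
)
# The same document with an explicit </head>.
closed = BeautifulSoup(
    "<html><head><title>Hello</title></head>"
    "<body><h1>Hello</h1></body></html>",
    "html.parser",
)

print(unclosed.body.parent.name)  # parent of body when head is never closed
print(closed.body.parent.name)    # parent of body when head is closed
```

Shortcuts like `soup.title` and `soup.h1` still work either way, since they search the whole tree, but code that depends on the exact parent of a tag would notice the difference.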


In [16]:
soup.title

<title>Hello</title>

In [17]:
type(soup.title)

bs4.element.Tag

In [18]:
soup.h1

<h1>Hello</h1>

In [19]:
soup.find_all("h1")

[<h1>Hello</h1>]