## Scraping the Web

Another way to get data is by scraping it from web pages. Fetching web pages, it turns out, is pretty easy; getting meaningful structured information out of them less so.

Pages on the web are written in HTML, in which text is (ideally) marked up into elements and their attributes:

```
<html>
    <head>
        <title>A web page</title>
    </head>
    <body>
        <p id="author">Joel Grus</p>
        <p id="subject">Data Science</p>
    </body>
</html>
```

In a perfect world, where all web pages were marked up semantically for our benefit, we would be able to extract data using rules like “find the `<p>` element whose `id` is `subject` and return the text it contains.”
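
As a rough illustration of such a rule, here is a minimal sketch that uses only Python's standard-library `html.parser` on the sample page above. The `TextById` class is a hypothetical helper written for this example, and it assumes the markup is well formed, which real pages often are not.

```python
from html.parser import HTMLParser


class TextById(HTMLParser):
    """Collect the text inside the first <p> whose id equals target_id."""

    def __init__(self, target_id: str) -> None:
        super().__init__()
        self.target_id = target_id
        self.inside = False   # True while we are inside the matching <p>
        self.done = False     # True once the matching <p> has been closed
        self.chunks = []      # pieces of text found inside the matching <p>

    def handle_starttag(self, tag, attrs):
        if tag == "p" and not self.done and dict(attrs).get("id") == self.target_id:
            self.inside = True

    def handle_endtag(self, tag):
        if tag == "p" and self.inside:
            self.inside = False
            self.done = True

    def handle_data(self, data):
        if self.inside:
            self.chunks.append(data)


page = """
<html>
    <head>
        <title>A web page</title>
    </head>
    <body>
        <p id="author">Joel Grus</p>
        <p id="subject">Data Science</p>
    </body>
</html>
"""

parser = TextById("subject")
parser.feed(page)
subject = "".join(parser.chunks)
print(subject)  # Data Science
```

Even for this tiny rule the bookkeeping is noticeable, which is one reason to reach for a dedicated library.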

In the actual world, HTML is not generally well formed, let alone annotated. This means we’ll need help making sense of it.

We will be using three libraries to get data out of HTML:

- Beautiful Soup, which builds a tree out of the various elements on a web page and provides a simple interface for accessing them
- Requests, which is a much nicer way of making HTTP requests than anything that’s built into Python
- html5lib, which handles HTML that's not perfectly formed better than Python's built-in HTML parser

If you installed Anaconda, these libraries should have already been installed. Otherwise, you may need to install them yourself.

In [1]:
# From Jupyter Notebook, run
!pip install beautifulsoup4 requests html5lib



From a console or shell prompt, run:

`python -m pip install beautifulsoup4 requests html5lib`

To use Beautiful Soup, we pass a string containing HTML to the `BeautifulSoup` constructor, along with the name of the parser to use. In our examples, the HTML will be the result of a call to `requests.get`:

In [2]:
from bs4 import BeautifulSoup
import requests

url = ("https://raw.githubusercontent.com/joelgrus/data/master/getting-data.html")
html = requests.get(url).text
print(html)
soup = BeautifulSoup(html, 'html5lib')

<!doctype html>
<html lang="en-US">
<head>
    <title>Getting Data</title>
    <meta charset="utf-8">
</head>
<body>
    <h1>Getting Data</h1>
    <div class="explanation">
        This is an explanation.
    </div>
    <div class="comment">
        This is a comment.
    </div>
    <div class="content">
        <p id="p1">This is the first paragraph.</p>
        <p class="important">This is the second paragraph.</p>
    </div>
    <div class="signature">
        <span id="name">Joel</span>
        <span id="twitter">@joelgrus</span>
        <span id="email">joelgrus-at-gmail</span>
    </div>
</body>
</html>



We’ll typically work with `Tag` objects, which correspond to the tags representing the structure of an HTML page.

For example, to find the first `<p>` tag (and its contents), you can use:

In [7]:
first_paragraph = soup.find('p')        # or just soup.p
print(first_paragraph)

<p id="p1">This is the first paragraph.</p>


You can get the text contents of a `Tag` using its `text` property:

In [8]:
first_paragraph_text = soup.p.text
first_paragraph_words = soup.p.text.split()

print(first_paragraph_text)
print(first_paragraph_words)

This is the first paragraph.
['This', 'is', 'the', 'first', 'paragraph.']


And you can extract a tag’s attributes by treating it like a `dict`:

In [9]:
first_paragraph_id = soup.p['id']       # raises KeyError if no 'id'
first_paragraph_id2 = soup.p.get('id')  # returns None if no 'id'

print(first_paragraph_id)
print(first_paragraph_id2)

p1
p1


You can get multiple tags at once as follows:

In [10]:
all_paragraphs = soup.find_all('p')  # or just soup('p')
paragraphs_with_ids = [p for p in soup('p') if p.get('id')]

print(all_paragraphs)
print(paragraphs_with_ids)

[<p id="p1">This is the first paragraph.</p>, <p class="important">This is the second paragraph.</p>]
[<p id="p1">This is the first paragraph.</p>]


Frequently, you’ll want to find tags with a specific class:

In [11]:
important_paragraphs = soup('p', {'class' : 'important'}) 
important_paragraphs2 = soup('p', 'important')

print(important_paragraphs)
print(important_paragraphs2)

[<p class="important">This is the second paragraph.</p>]
[<p class="important">This is the second paragraph.</p>]


And you can combine these methods to implement more elaborate logic. For example, if you want to find every `<span>` element that is contained inside a `<div>` element, you could do this:

In [12]:
spans_inside_divs = [span
                     for div in soup('div')     # for each <div> on the page
                     for span in div('span')]   # find each <span> inside it

print(spans_inside_divs)

[<span id="name">Joel</span>, <span id="twitter">@joelgrus</span>, <span id="email">joelgrus-at-gmail</span>]
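
One edge case worth knowing about: `div('span')` searches all descendants, so a `<span>` that sits inside nested `<div>` elements is found once per enclosing `<div>`. The sketch below shows this on a small made-up snippet (not the page above) and compares it with Beautiful Soup's CSS selector method `select`, which returns each matching element once. It uses Python's built-in `html.parser` so that it runs without html5lib.

```python
from bs4 import BeautifulSoup

# A made-up snippet with a <div> nested inside another <div>.
html = """
<div class="outer">
    <div class="signature">
        <span id="name">Joel</span>
    </div>
</div>
<span id="outside">not inside a div</span>
"""
soup = BeautifulSoup(html, "html.parser")

# The nested comprehension visits the span once for each enclosing <div>.
nested = [span
          for div in soup("div")
          for span in div("span")]

# The CSS selector "div span" means "any <span> that is a descendant of a <div>".
selected = soup.select("div span")

print(len(nested))                  # 2
print([s["id"] for s in selected])  # ['name']
```

If duplicates matter for your analysis, either approach works as long as you deduplicate deliberately.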


Of course, the important data won’t typically be labeled as class="important". You’ll need to carefully inspect the source HTML, reason through your selection logic, and worry about edge cases to make sure your data is correct.
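
As one small example of such an edge case, an element can carry several classes at once, and Beautiful Soup treats `class` as a list of values. The snippet below is made up for illustration and uses Python's built-in `html.parser` so that it runs without html5lib.

```python
from bs4 import BeautifulSoup

# A made-up snippet: one <p> has two classes, one has a similar-looking class.
html = """
<p class="important note">First</p>
<p class="unimportant">Second</p>
<p>Third</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Matching is done per class value, so "important note" matches
# but "unimportant" does not.
matches = soup("p", "important")
print([p.text for p in matches])   # ['First']

# The class attribute comes back as a list, not a single string.
print(matches[0]["class"])         # ['important', 'note']

# Tags without a class have no 'class' key, so .get is the safe accessor.
print(soup("p")[2].get("class"))   # None
```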

## Example: Keeping Tabs on Congress

The VP of Policy at your start-up company is worried about potential regulation of the data science industry and asks you to quantify what Congress is saying on the topic. In particular, he wants you to find all the representatives who have press releases about "data."

There is a page with links to all of the representatives' websites at https://www.house.gov/representatives

And if you "view source," all of the links to the websites look like:
```
<td>
    <a href="https://jayapal.house.gov">Jayapal, Pramila</a>
</td>
```
Let’s start by collecting all of the URLs linked to from that page:

In [13]:
from bs4 import BeautifulSoup
import requests

url = "https://www.house.gov/representatives"
text = requests.get(url).text
soup = BeautifulSoup(text, "html5lib")

all_urls = [a['href']
            for a in soup('a')     # Note: the <a> tag is used to define a hyperlink
            if a.has_attr('href')]
print(len(all_urls))

966


This returns way too many URLs. If you look at them, the ones we want start with either `http://` or `https://`, have some kind of name, and end with either `.house.gov` or `.house.gov/`.

This is a good place to use a regular expression:

In [14]:
import re

# Must start with http:// or https://
# Must end with .house.gov or .house.gov/
# For references on generating regular expressions, see
# https://docs.python.org/3/library/re.html
# https://www.w3schools.com/python/python_regex.asp
regex = r"^https?://.*\.house\.gov/?$"

# Let's write some tests!
assert re.match(regex, "http://joel.house.gov")
assert re.match(regex, "https://joel.house.gov")
assert re.match(regex, "http://joel.house.gov/")
assert re.match(regex, "https://joel.house.gov/")
assert not re.match(regex, "joel.house.gov")
assert not re.match(regex, "http://joel.house.com")
assert not re.match(regex, "https://joel.house.gov/biography")

# And now apply
good_urls = [url for url in all_urls if re.match(regex, url)]
print(len(good_urls))

872
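
Because the pattern uses `.*`, it also accepts URLs where `.house.gov` only appears at the end of a query string rather than in the host name. That is unlikely to matter on this page, but if you wanted a stricter check, one possible sketch is to parse the URL with the standard library instead. The function name `is_house_site` is made up for this example.

```python
import re
from urllib.parse import urlparse

regex = r"^https?://.*\.house\.gov/?$"


def is_house_site(url: str) -> bool:
    """True for http(s) URLs whose host ends in .house.gov and that have
    no path beyond an optional trailing slash."""
    parts = urlparse(url)
    host = parts.hostname or ""
    return (parts.scheme in ("http", "https")
            and host.endswith(".house.gov")
            and parts.path in ("", "/")
            and not parts.query
            and not parts.fragment)


# Same behavior as the regex on the earlier test cases
assert is_house_site("http://joel.house.gov")
assert is_house_site("https://joel.house.gov/")
assert not is_house_site("joel.house.gov")
assert not is_house_site("http://joel.house.com")
assert not is_house_site("https://joel.house.gov/biography")

# A URL the regex accepts but the stricter check rejects
tricky = "http://example.com/?x=.house.gov"
assert re.match(regex, tricky)
assert not is_house_site(tricky)
```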


That’s still way too many, as there are only 435 representatives. If you look at the list, there are a lot of duplicates. Let’s use `set` to get rid of them:

In [15]:
good_urls = list(set(good_urls))
print(len(good_urls))

436


So the number did not turn out to be exactly 435. Maybe someone has more than one website. In any case, this is good enough.
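
One thing `set` does not do is treat `https://joel.house.gov` and `https://joel.house.gov/` as the same site, and it also discards the original order. Whether that explains the extra URL here is only a guess, but if it mattered, a minimal sketch of a slash-insensitive, order-preserving deduplication could look like this. The `dedupe_urls` helper and the sample URLs are made up for illustration.

```python
from typing import Iterable, List


def dedupe_urls(urls: Iterable[str]) -> List[str]:
    """Drop duplicates, ignoring trailing slashes and keeping first-seen order."""
    normalized = (url.rstrip("/") for url in urls)
    # dict keys are unique and remember insertion order
    return list(dict.fromkeys(normalized))


sample = ["https://joel.house.gov",
          "https://joel.house.gov/",
          "https://grus.house.gov/",
          "https://joel.house.gov"]

print(len(set(sample)))     # 3
print(dedupe_urls(sample))  # ['https://joel.house.gov', 'https://grus.house.gov']
```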

When we look at the sites, most of them have a link to press releases. For example:

In [16]:
html = requests.get('https://susielee.house.gov/').text
soup = BeautifulSoup(html, 'html5lib')

# Use a set because the links might appear multiple times.
links = {a['href'] for a in soup('a') if 'press releases' in a.text.lower()}
print(links) # {'/media/press-releases'}

{'/media/press-releases'}


Notice that this is a relative link, which means we need to remember the originating site. Let’s do some scraping:

In [17]:
from typing import Dict, Set

press_releases: Dict[str, Set[str]] = {}
    
for house_url in good_urls:
    html = requests.get(house_url).text
    soup = BeautifulSoup(html, 'html5lib')
    pr_links = {a['href'] for a in soup('a') if 'press releases' in a.text.lower()}
    
    print(f"{house_url}: {pr_links}")
    press_releases[house_url] = pr_links

https://lowenthal.house.gov: {'/media/press-releases'}
https://finkenauer.house.gov/: {'/media/press-releases'}
https://gregmurphy.house.gov: {'/media/press-releases'}
https://davidscott.house.gov/: {'/News/DocumentQuery.aspx?DocumentTypeID=377'}
https://chuygarcia.house.gov/: {'/media/press-releases'}
https://stevens.house.gov/: {'/media/press-releases'}
https://perry.house.gov/: set()
https://crow.house.gov/: {'/media/press-releases'}
https://bustos.house.gov: {'https://bustos.house.gov/category/press-release/'}
https://arrington.house.gov: {'/news/documentquery.aspx?DocumentTypeID=27'}
https://fletcher.house.gov: {'/news/documentquery.aspx?DocumentTypeID=27'}
https://byrne.house.gov/: {'/media-center/press-releases'}
https://calvert.house.gov/: {'/media/press-releases'}
https://jacksonlee.house.gov/: {'/media-center/press-releases'}
https://dennyheck.house.gov: {'/media-center/press-releases'}
https://mfume.house.gov/: {'/media/press-releases'}
https://waltz.house.gov: {'/news/docum

https://allen.house.gov: {'/news/documentquery.aspx?DocumentTypeID=27'}
https://foster.house.gov: {'/media/press-releases'}
https://crist.house.gov: {'/news/documentquery.aspx?DocumentTypeID=27'}
https://torressmall.house.gov/: {'/media/press-releases'}
https://schweikert.house.gov/: {'/media-center/press-releases'}
https://wenstrup.house.gov: {'/news/documentquery.aspx?DocumentTypeID=2491'}
https://kevinmccarthy.house.gov/: {'/media-center/press-releases'}
https://chabot.house.gov/: {'/news/documentquery.aspx?DocumentTypeID=2508'}
https://jordan.house.gov/: {'/News/DocumentQuery.aspx?DocumentTypeID=1611'}
https://bost.house.gov/: {'/media-center/press-releases'}
https://loudermilk.house.gov: {'/news/documentquery.aspx?DocumentTypeID=27'}
https://vargas.house.gov: {'/media-center/press-releases'}
https://davis.house.gov: {'/press-releases/'}
https://huizenga.house.gov/: {'/News/DocumentQuery.aspx?DocumentTypeID=2041'}
https://thornberry.house.gov: {'/News/DocumentQuery.aspx?DocumentTyp

https://pascrell.house.gov/: {'/news/documentquery.aspx?DocumentTypeID=27'}
https://mikelevin.house.gov: {'/media/press-releases'}
https://pingree.house.gov/: set()
https://ruppersberger.house.gov: {'/news-room/press-releases'}
https://schakowsky.house.gov: {'/media/press-releases'}
https://fudge.house.gov/: set()
https://ferguson.house.gov: {'/news/documentquery.aspx?DocumentTypeID=27'}
https://stivers.house.gov/: {'/News/DocumentQuery.aspx?DocumentTypeID=2054'}
https://barragan.house.gov: {'https://barragan.house.gov/category/press-releases/'}
https://cox.house.gov: {'/media/press-releases'}
https://guthrie.house.gov/: set()
https://hayes.house.gov: {'/media/press-releases'}
https://gonzalez-colon.house.gov: {'/media/press-releases'}
https://jhb.house.gov/: {'/News/DocumentQuery.aspx?DocumentTypeID=2113'}
https://visclosky.house.gov/: {'/media-center/latest-news', '/media-center/press-releases'}
https://velazquez.house.gov: {'/media-center/press-releases'}
https://pressley.house.gov:

https://gottheimer.house.gov: {'/news/documentquery.aspx?DocumentTypeID=27'}
https://ruiz.house.gov: {'/media-center/press-releases'}
https://zeldin.house.gov/: {'/media-center/press-releases'}
https://doggett.house.gov: set()
https://davids.house.gov/: {'/media/press-releases'}
https://lamalfa.house.gov: {'/media-center/press-releases'}
https://joyce.house.gov: {'/press-releases/'}
https://cole.house.gov: {'/media-center/press-releases'}
https://tomgraves.house.gov/: {'/News/DocumentQuery.aspx?DocumentTypeID=2000'}
https://ebjohnson.house.gov: {'/media-center/press-releases'}
https://vantaylor.house.gov/: set()
https://courtney.house.gov/: {'/media-center/press-releases'}
https://hollingsworth.house.gov: set()
https://keller.house.gov: {'/media/press-releases'}
https://garamendi.house.gov/: {'/media/press-releases'}
https://aguilar.house.gov/: {'/media-center/press-releases'}
https://kim.house.gov/: {'/media/press-releases'}
https://bobbyscott.house.gov: {'/media-center/press-releases

Note: Normally it is impolite to scrape a site freely like this. Most sites will have a robots.txt file that indicates how frequently you may scrape the site (and which paths you’re not supposed to scrape), but since it’s Congress we don’t need to be particularly polite.
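
If you did want to check a site's rules before scraping it, Python's standard library includes `urllib.robotparser`. The sketch below parses a made-up `robots.txt` from a string so that it runs without network access; the site name and the rules are assumptions for illustration, not the contents of any real file.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt, for illustration only.
robots_txt = """\
User-agent: *
Crawl-delay: 2
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

site = "https://example.house.gov"   # hypothetical site
allowed = parser.can_fetch("*", f"{site}/media/press-releases")
blocked = parser.can_fetch("*", f"{site}/private/notes")
delay = parser.crawl_delay("*")      # seconds between requests, or None if unspecified

print(allowed, blocked, delay)       # True False 2
```

Against a live site you would instead call `parser.set_url(...)` with the address of its `robots.txt` and then `parser.read()`, and you could pass `delay` to `time.sleep` between requests.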

If you watch these as they scroll by, you’ll see a lot of `/media/press-releases` and `/media-center/press-releases`, as well as various other addresses. One of these URLs is https://susielee.house.gov/media/press-releases.
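
Building a full URL like that one from a site and a link is slightly fiddly: some site URLs end in a slash and some don't, most links are relative, and a few are already absolute. A minimal sketch using the standard library's `urllib.parse.urljoin`, with a few of the combinations from the output above:

```python
from urllib.parse import urljoin

# Site URL with a trailing slash, relative link
print(urljoin("https://susielee.house.gov/", "/media/press-releases"))
# https://susielee.house.gov/media/press-releases

# Site URL without a trailing slash, relative link
print(urljoin("https://lowenthal.house.gov", "/media/press-releases"))
# https://lowenthal.house.gov/media/press-releases

# Link that is already absolute: urljoin returns it unchanged
print(urljoin("https://bustos.house.gov",
              "https://bustos.house.gov/category/press-release/"))
# https://bustos.house.gov/category/press-release/

# Plain string formatting, by contrast, can produce extra slashes
naive = f"{'https://susielee.house.gov/'}/{'/media/press-releases'}"
print(naive)
# https://susielee.house.gov///media/press-releases
```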

Remember that our goal is to find out which congresspeople have press releases mentioning "data." We’ll write a slightly more general function that checks whether a page of press releases mentions any given term.

If you visit the site and view the source, it seems like there’s a snippet from each press release inside a `<p>` tag, so we’ll use that as our first attempt:

In [18]:
def paragraph_mentions(text: str, keyword: str) -> bool:
    """
    Returns True if a <p> inside the text mentions {keyword}
    """
    soup = BeautifulSoup(text, 'html5lib')
    paragraphs = [p.get_text() for p in soup('p')]

    return any(keyword.lower() in paragraph.lower() for paragraph in paragraphs)

Let’s write a quick test for it:

In [19]:
text = """<body><h1>Facebook</h1><p>Twitter</p>"""
assert paragraph_mentions(text, "twitter")       # is inside a <p>
assert not paragraph_mentions(text, "facebook")  # not inside a <p>
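
Note that `paragraph_mentions` does a substring check, so the keyword "data" also matches words such as "metadata" or "database". Whether that is what you want depends on the question being asked. If you need whole-word matches only, one possible variant uses a regular expression with word boundaries. The `mentions_word` function is a hypothetical helper that works on plain text, not HTML.

```python
import re


def mentions_word(text: str, keyword: str) -> bool:
    """True if keyword appears in text as a whole word, ignoring case."""
    pattern = rf"\b{re.escape(keyword)}\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


assert mentions_word("A new Data privacy bill", "data")
assert mentions_word("We need better data.", "data")

# Substring matching says yes here, whole-word matching says no
sentence = "Updated the metadata schema"
assert "data" in sentence.lower()
assert not mentions_word(sentence, "data")
```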

At last we’re ready to find the relevant congresspeople and give their names to the VP:

In [None]:
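from urllib.parse import urljoin

for house_url, pr_links in press_releases.items():
    for pr_link in pr_links:
        # urljoin handles relative links ('/media/press-releases'),
        # links that are already absolute, and trailing slashes
        url = urljoin(house_url, pr_link)
        text = requests.get(url).text

        if paragraph_mentions(text, 'data'):
            print(f"{house_url}")
            break # done with this house_url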

Note: If you look at the various “press releases” pages, most of them are paginated with only 5 or 10 press releases per page. This means that we only retrieved the few most recent press releases for each congressperson. A more thorough solution would have iterated over the pages and retrieved the full text of each press release.
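
How to walk through the pages depends entirely on each site, so the following is only a sketch under loud assumptions: it pretends that pages are selected with a `?page=N` query parameter (a hypothetical URL scheme, not something confirmed for any of these sites), and it takes the fetching function and the "does this page contain any releases?" check as arguments. The demonstration uses a fake in-memory site instead of real requests.

```python
from typing import Callable, Iterator


def iter_pages(base_url: str,
               fetch: Callable[[str], str],
               has_releases: Callable[[str], bool],
               max_pages: int = 20) -> Iterator[str]:
    """Yield the HTML of successive pages until one has no press releases.

    ASSUMPTION: pages are addressed as base_url?page=0, ?page=1, ...
    """
    for page in range(max_pages):
        text = fetch(f"{base_url}?page={page}")
        if not has_releases(text):
            break
        yield text


# A fake site with two pages of releases, standing in for real HTTP requests.
base = "https://example.house.gov/media/press-releases"   # hypothetical URL
fake_site = {
    f"{base}?page=0": "<p>Data privacy bill introduced</p>",
    f"{base}?page=1": "<p>Farm bill update</p>",
}


def fake_fetch(url: str) -> str:
    return fake_site.get(url, "")


pages = list(iter_pages(base, fake_fetch, has_releases=lambda html: "<p>" in html))
print(len(pages))  # 2
```

With a real site, `fetch` could be something like `lambda url: requests.get(url).text`, and each yielded page could be passed to `paragraph_mentions`.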