# Web Scraping with BeautifulSoup

Sometimes there is no readily available data set for the project you are interested in. However, the data may exist on the web, where you can scrape it. One way to do this in Python is to use `BeautifulSoup`.

## What we will accomplish in this notebook

In this notebook we will:
- Discuss the structure of HTML code,
- Introduce the `bs4` package,
- Parse simple HTML code with `BeautifulSoup`,
- Review how to request the HTML code from a url,
- Scrape data from an actual webpage and
- Touch on some of the issues that may arise when web scraping.

In [1]:
## Import base packages we'll use
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from seaborn import set_style
set_style("whitegrid")

## Scraping data with `BeautifulSoup`

### Importing `BeautifulSoup`

In order to use `BeautifulSoup` we first need to make sure that we have it installed on our computer. Try to run the following code chunks.

In [2]:
## this imports bs4, the package that provides BeautifulSoup
import bs4

In [3]:
## Run this to check your version
## I wrote this notebook with version  4.10.0
print(bs4.__version__)

4.10.0


If the above code does not work you will need to install the package before being able to run the code in this notebook. Here are installation instructions from the `bs4` documentation:
- Via conda: <a href="https://anaconda.org/conda-forge/bs4">https://anaconda.org/conda-forge/bs4</a>,
- Via pip: <a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-beautiful-soup">https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-beautiful-soup</a>.

### The structure of an HTML page

`BeautifulSoup` takes in an HTML document and will 'parse' it for you so that you can extract the information you want. To best understand what that means we will look at a toy example of a webpage. To see what the snippet of HTML code below looks like in a web browser click here <a href="SampleHTML.html">SampleHTML.html</a>.

In [4]:
## This is an html chunk
## It has a head and a body, just like you
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
</body>
</html>
"""

We can now use `BeautifulSoup` to parse this simple HTML chunk.

In [5]:
## First we import the BeautifulSoup object
from bs4 import BeautifulSoup

In [6]:
## Now we make a BeautifulSoup object out of the html code
## The first input is the html code
## The second input is how you want BeautifulSoup
## to parse the code
soup = BeautifulSoup(html_doc,'html.parser')

In [7]:
## Let's use the prettify method to make our html pretty and see what it has to say
## Ideally this is how someone writing pure html code would write their code
print(soup.prettify())

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>



HTML files have a natural tree structure, which we will briefly cover now. Here is the tree of our sample HTML:

<img src = "html_tree.png" width = "50%"></img>

Each level in the tree represents a 'generation' of the HTML code. For instance, the `body` has three `p` children, and the leftmost `p` has one `b` child. `BeautifulSoup` helps us traverse these trees to gather the data we want.
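As a rough sketch of what these generations look like in code, the block below rebuilds a shortened copy of the sample document and inspects children and parents. The names `tree_doc` and `tree_soup` are introduced here only for illustration, so the `html_doc` and `soup` objects used elsewhere in this notebook are left untouched.

```python
from bs4 import BeautifulSoup

## A shortened copy of the sample document (illustration only)
tree_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters.</p>
<p class="story">...</p>
</body>
</html>
"""
tree_soup = BeautifulSoup(tree_doc, 'html.parser')

## recursive=False restricts the search to direct children only
body_p_children = tree_soup.body.find_all('p', recursive=False)
print(len(body_p_children))

## .children iterates over direct children; plain text has name None
first_p_child_names = [child.name for child in body_p_children[0].children
                       if child.name is not None]
print(first_p_child_names)

## .parent moves one generation up the tree
print(tree_soup.b.parent.name)
print(tree_soup.b.parent.parent.name)
```

The first two prints correspond to the counts in the tree picture: three `p` children under `body`, and a single `b` child under the leftmost `p`.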

In [8]:
## Below are some examples of beautifulsoup methods and 
## attributes that help us better understand the structure 
## of html code

In [9]:
## We can traverse to the "title" by working our way through
## the tree
print(soup.head.title)
print()

<title>The Dormouse's story</title>



In [13]:
## Notice we can also get the title like so
## This is because this is the first and only title 
## in the code
print(soup.title)

<title>The Dormouse's story</title>


In [11]:
## What if I just want the text from the title?
print(soup.title.text)

The Dormouse's story


In [12]:
## What html structure is the title's parent?
print(soup.title.parent)
print(soup.title.parent.name)

<head><title>The Dormouse's story</title></head>
head


In [14]:
## What is the first a of the html document?
print(soup.a)

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>


In [15]:
## What is the first a's class?
print(soup.a['class'])

['sister']


In [17]:
## There are multiple a's. Can I find all of them?
print(soup.find_all('a'))


for a in soup.find_all('a'):
    print()
    print(a['class'], a.text)

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

['sister'] Elsie

['sister'] Lacie

['sister'] Tillie


#### Exercises

Take a moment and try to complete the following exercises.

In [20]:
## Find the first p of the document
## What is the first p's class? What string is in that p?
print(soup.p)

print()

print(soup.p['class'], soup.p.text)

<p class="title"><b>The Dormouse's story</b></p>

['title'] The Dormouse's story


In [22]:
## For all of the a's in the document find their href

for a in soup.find_all('a'):
    print(a['href'])
    print()


http://example.com/elsie

http://example.com/lacie

http://example.com/tillie



## Scraping real webpages

Let's now pivot to a real webpage. In this example, suppose we want to scrape information about the articles on FiveThirtyEight's Sports page, found here: <a href="https://fivethirtyeight.com/sports/">https://fivethirtyeight.com/sports/</a>.

### Sending a request

In order to scrape that data we need the HTML code associated with the page. In Python we can get it with the `requests` package.

In [23]:
import requests

In [24]:
## This is the url for the HTML code we want
url = "https://fivethirtyeight.com/sports/"

## We send a request to the website's server with
## the following code
requests.get(url)

<Response [200]>

Now we will want to store this response in a variable so we can access its contents. First, note that if the request was successful we should see `<Response [200]>` above. This tells us that the request was received and the data was returned successfully. If we instead saw something like `404` or `500`, we would know that something went wrong. For a list of possible response codes see <a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Status#server_error_responses">https://developer.mozilla.org/en-US/docs/Web/HTTP/Status#server_error_responses</a>.

In [25]:
r = requests.get(url)

In [26]:
r.status_code

200
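
The numeric codes can be turned into readable descriptions with the standard library's `http.HTTPStatus` enum. The helper below, `describe_status`, is a hypothetical name introduced for this sketch; it is not part of `requests`.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short readable description of an HTTP status code."""
    ## HTTPStatus(code) raises a ValueError for codes it does not know
    phrase = HTTPStatus(code).phrase
    if 200 <= code < 300:
        kind = "success"
    elif 300 <= code < 400:
        kind = "redirect"
    elif 400 <= code < 500:
        kind = "client error"
    elif 500 <= code < 600:
        kind = "server error"
    else:
        kind = "informational"
    return f"{code} {phrase} ({kind})"

## The three codes mentioned in the text above
for code in (200, 404, 500):
    print(describe_status(code))
```

With the response stored in `r`, calling `describe_status(r.status_code)` would describe the outcome of our own request.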

In [27]:
## The HTML code is stored in r.content
r.content



In [28]:
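## We can now parse this data with BeautifulSoup
## Naming the parser explicitly avoids a warning and keeps
## the results consistent with the earlier example
soup = BeautifulSoup(r.content, 'html.parser')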

In [29]:
soup.head.title

<title>Sports – FiveThirtyEight</title>

### Web developer tools

As we can see, this is much messier than our simple example above. 

We only want three pieces of information for each article/podcast/posting listed on the page:
1. The title,
2. The author and
3. The associated url.

To home in on this information we can use your browser's web developer tools. Below are instructions for locating the web developer tools in Mozilla Firefox, Google Chrome and Safari:
- Mozilla Firefox: Go to `Browser Tools` in the `Tools` dropdown menu and click on `Web Developer Tools`. You should see something like this:
<img src="firefox_develop.jpg" width="80%"></img>
- Google Chrome: Go to `Developer` in the `View` dropdown menu and click on `Developer Tools`. You should see something like this:
<img src="chrome_develop.jpg" width="80%"></img>
- Safari: Go to the `Develop` dropdown menu and select `Show Web Inspector`. If there is no `Develop` dropdown menu, follow the instructions on this page: <a href="https://support.apple.com/guide/safari/use-the-developer-tools-in-the-develop-menu-sfri20948/mac">https://support.apple.com/guide/safari/use-the-developer-tools-in-the-develop-menu-sfri20948/mac</a>. You should see something like this once opened:
<img src="safari_develop.jpg" width="80%"></img>

<i>Note that the images above will be slightly different than what you see because 538 will have different articles at the time you access the page. These images were added 3-10-2022.</i>

<br>

The web developer tools will allow you to find out in what HTML elements certain pieces of data live. For example, you should be able to hover over an item on the webpage and it will highlight what HTML structure holds it. Here we can see what that looks like for the banner article (the large one at the top) of this page:
<img src="div_highlight.png" width="30%"></img>

So the article title and author are stored in a `div` element with the `class` `post-info`. We can use the `soup`'s `find` and `find_all` methods to get the data we want.

In [30]:
## .find takes in an HTML element type
## and a dictionary that specifies additional qualifications you
## want the element to have
## and returns the first such element in the HTML code that
## meets the desired requirements.
##############
##############
## Here we find the first div with class=post-info
soup.find('div', {'class':"post-info"})

<div class="post-info">
<p class="topic"><a class="term" href="https://fivethirtyeight.com/tag/golf/" name="&amp;lpos=fivethirtyeight&amp;lid=mem:slug:golf">Golf</a></p>
<time class="updated visually-hidden" title="2022-04-06T14:00:00+00:00">April 6, 2022 10:00 AM</time>
<h2 class="article-title entry-title">
<a href="https://fivethirtyeight.com/features/if-tiger-woods-tees-off-at-the-masters-hell-be-playing-to-win/" name="&amp;lpos=fivethirtyeight&amp;lid=sports+mem:story1">If Tiger Woods Tees Off At The Masters, He’ll Be Playing To Win</a>
</h2>
<p class="byline vcard">
	By <a class="author url fn" href="https://fivethirtyeight.com/contributors/alex-kirshner/" name="&amp;lpos=fivethirtyeight&amp;lid=mem:byline:alex kirshner" rel="author" title="">Alex Kirshner</a></p>
</div>

In [31]:
## .find_all follows the same syntax as .find,
## but returns all HTML elements that satisfy the
## requirements you provide
##############
##############
## Here we find every div with class=post-info
soup.find_all('div', {'class':"post-info"})

[<div class="post-info">
 <p class="topic"><a class="term" href="https://fivethirtyeight.com/tag/golf/" name="&amp;lpos=fivethirtyeight&amp;lid=mem:slug:golf">Golf</a></p>
 <time class="updated visually-hidden" title="2022-04-06T14:00:00+00:00">April 6, 2022 10:00 AM</time>
 <h2 class="article-title entry-title">
 <a href="https://fivethirtyeight.com/features/if-tiger-woods-tees-off-at-the-masters-hell-be-playing-to-win/" name="&amp;lpos=fivethirtyeight&amp;lid=sports+mem:story1">If Tiger Woods Tees Off At The Masters, He’ll Be Playing To Win</a>
 </h2>
 <p class="byline vcard">
 	By <a class="author url fn" href="https://fivethirtyeight.com/contributors/alex-kirshner/" name="&amp;lpos=fivethirtyeight&amp;lid=mem:byline:alex kirshner" rel="author" title="">Alex Kirshner</a></p>
 </div>,
 <div class="post-info">
 <p class="topic">
 <time class="datetime updated" title="2022-04-06T06:00:58-04:00">Apr. 6, 2022</time>
 </p>
 <div class="tease-meta">
 <div class="tease-meta-content">
 <h2 c

In [34]:
## let's get the title for the first post
soup.find('div', {'class':"post-info"}).find('h2', {'class':"article-title entry-title"}).text

'\nIf Tiger Woods Tees Off At The Masters, He’ll Be Playing To Win\n'

Now we can write a loop that finds all of the titles for articles contained in `div`s with `class="post-info"` using `find_all`.

In [36]:
## .find_all returns a list, so we can index it.
## Here we look at the last post-info div. Notice that
## its title sits in an h3 element rather than an h2
soup.find_all('div', {'class':"post-info"})[-1]

<div class="post-info hentry no-image">
<time class="updated visually-hidden" title="2022-04-05T14:58:03+00:00">April 5, 2022 10:58 AM</time>
<h3 class="article-title entry-title">
<a data-adl-event-name="content select interaction" data-content_id="330397" data-content_select_type="fte_features" data-content_title="AL West Preview: The Mariners And Angels Are (Maybe) Coming For The Astros’ Crown" data-placement="More stories" data-position_number="3" href="https://fivethirtyeight.com/features/al-west-preview-the-mariners-and-angels-are-maybe-coming-for-the-astros-crown/" name="&amp;lpos=fivethirtyeight&amp;lid=sports+mostpopular:story3">
<span id="">AL West Preview: The Mariners And Angels Are (Maybe) Coming For The Astros’ Crown</span> </a>
</h3>
<span class="vcard visually-hidden"><span class="fn">FiveThirtyEight</span></span>
</div>
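
Note that the `div` above has `class="post-info hentry no-image"`, yet it was returned by a search for `post-info`. As far as the `bs4` documentation describes it, a single class name matches any element that lists that class, while a string containing several class names is compared against the full `class` attribute. The sketch below illustrates this on a small made-up snippet; `snippet` and `demo` are hypothetical names and the HTML is not taken from 538.

```python
from bs4 import BeautifulSoup

## A made-up snippet with three divs (illustration only)
snippet = """
<div class="post-info">first</div>
<div class="post-info hentry no-image">second</div>
<div class="other">third</div>
"""
demo = BeautifulSoup(snippet, 'html.parser')

## A single class name matches every div that lists that class
single_class = demo.find_all('div', {'class': "post-info"})
print([div.text for div in single_class])

## Several class names in one string match the full attribute value
full_string = demo.find_all('div', {'class': "post-info hentry no-image"})
print([div.text for div in full_string])
```

This is why searching for `"article-title entry-title"` works in this notebook: that string is exactly the `class` attribute of the headings we are after.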

In [37]:
## use find_all
for div in soup.find_all('div', {'class':"post-info"}):
    ## it's good practice to check that the element you're expecting
    ## to be there is actually there before using it
    if div.find('h2', {'class':"article-title entry-title"}):
        ## I clean the text here to remove annoying white space
        print(div.find('h2', {'class':"article-title entry-title"}).text.replace("\n", " ").strip())
    elif div.find('h3', {'class':"article-title entry-title"}):
        print(div.find('h3', {'class':"article-title entry-title"}).text.replace("\n", " ").strip())
    else:
        print("Title not in h2 or h3 elements")
    print("")

If Tiger Woods Tees Off At The Masters, He’ll Be Playing To Win

2022 MLB Predictions

How Our MLB Forecast Is Changing For 2022

The Raptors Don’t Need Bigs To Pound The Ball Inside

AL West Preview: The Mariners And Angels Are (Maybe) Coming For The Astros’ Crown

NL West Preview: The Dodgers Are Still Trying To Outspend (And Out-Talent) Everyone Else

NFL General Managers Still Love The Senior Bowl. But Do Those Players Pan Out?

NL Central Preview: Who Will Play Spoiler To The Brewers?

What The MLB Lockout Can Tell Us About Political Fandom And Sports Partisanship

How Each Team In The Women’s Final Four Can Win It All

If Tiger Woods Tees Off At The Masters, He’ll Be Playing To Win

How Our MLB Forecast Is Changing For 2022

AL West Preview: The Mariners And Angels Are (Maybe) Coming For The Astros’ Crown



Here we can notice one of the difficulties in web scraping: the messiness of HTML code.

At the end of our titles we might notice a few repeats. This is due to 538's "Top Sports Stories Today" section in the bottom right of the page, whose titles are also stored in a `div` with `class="post-info"`. While this does not result in a messy webpage, it does make our task slightly more complicated.

It helps to notice that the `div`s of articles that are not the banner article are themselves contained within a `div` with `class="posts content-area"`. We can thus first `find` that `div` and then `find_all` `post-info` `div`s within.

In [38]:
## use find_all
for div in soup.find('div', {'class':"posts content-area"}).find_all('div', {'class':"post-info"}):
    ## it's good practice to check that the element you're expecting
    ## to be there is actually there before using it
    if div.find('h2', {'class':"article-title entry-title"}):
        ## I clean the text here to remove annoying white space
        print(div.find('h2', {'class':"article-title entry-title"}).text.replace("\n", " ").strip())
    elif div.find('h3', {'class':"article-title entry-title"}):
        print(div.find('h3', {'class':"article-title entry-title"}).text.replace("\n", " ").strip())
    else:
        print("Title not in h2 or h3 elements")
    print("")

2022 MLB Predictions

How Our MLB Forecast Is Changing For 2022

The Raptors Don’t Need Bigs To Pound The Ball Inside

AL West Preview: The Mariners And Angels Are (Maybe) Coming For The Astros’ Crown

NL West Preview: The Dodgers Are Still Trying To Outspend (And Out-Talent) Everyone Else

NFL General Managers Still Love The Senior Bowl. But Do Those Players Pan Out?

NL Central Preview: Who Will Play Spoiler To The Brewers?

What The MLB Lockout Can Tell Us About Political Fandom And Sports Partisanship

How Each Team In The Women’s Final Four Can Win It All



In [39]:
## banner article
print(soup.find('div', {'class':"featured-banner"}).find('div', {'class':"post-info"}).find('h2', {'class':"article-title entry-title"}).text.strip())
print()

## get remaining articles in results
## use find_all
for div in soup.find('div', {'class':"posts content-area"}).find_all('div', {'class':"post-info"}):
    ## it's good practice to check that the element you're expecting
    ## to be there is actually there before using it
    if div.find('h2', {'class':"article-title entry-title"}):
        ## I clean the text here to remove annoying white space
        print(div.find('h2', {'class':"article-title entry-title"}).text.replace("\n", " ").strip())
    elif div.find('h3', {'class':"article-title entry-title"}):
        print(div.find('h3', {'class':"article-title entry-title"}).text.replace("\n", " ").strip())
    else:
        print("Title not in h2 or h3 elements")
    print("")

If Tiger Woods Tees Off At The Masters, He’ll Be Playing To Win

2022 MLB Predictions

How Our MLB Forecast Is Changing For 2022

The Raptors Don’t Need Bigs To Pound The Ball Inside

AL West Preview: The Mariners And Angels Are (Maybe) Coming For The Astros’ Crown

NL West Preview: The Dodgers Are Still Trying To Outspend (And Out-Talent) Everyone Else

NFL General Managers Still Love The Senior Bowl. But Do Those Players Pan Out?

NL Central Preview: Who Will Play Spoiler To The Brewers?

What The MLB Lockout Can Tell Us About Political Fandom And Sports Partisanship

How Each Team In The Women’s Final Four Can Win It All



We now have code to get all of the article titles on the page including both the banner article and the list of articles beneath.
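Since the same `h2`/`h3` check is repeated in every loop above, one option is to wrap it in a small helper. The sketch below is only one possible way to do so. The names `get_title`, `sample_html` and `sample_soup` are hypothetical, and the HTML is a made-up imitation of the page's structure rather than real 538 markup. It relies on `find` accepting a list of tag names and on `get_text` with `strip=True` to remove white space.

```python
from bs4 import BeautifulSoup

def get_title(post_div):
    """Return the cleaned title of a post-info div, or None if there is none."""
    ## Accept either an h2 or an h3 that lists the article-title class
    heading = post_div.find(['h2', 'h3'], {'class': "article-title"})
    if heading is None:
        return None
    ## strip=True trims each piece of text and drops empty pieces
    return heading.get_text(" ", strip=True)

## Made-up HTML imitating the three cases we ran into above
sample_html = """
<div class="post-info"><h2 class="article-title entry-title">
<a href="https://example.com/a">First Title</a>
</h2></div>
<div class="post-info hentry"><h3 class="article-title entry-title">
<a href="https://example.com/b">
<span id="">Second Title</span> </a>
</h3></div>
<div class="post-info"><p>No heading here</p></div>
"""
sample_soup = BeautifulSoup(sample_html, 'html.parser')

titles = [get_title(div)
          for div in sample_soup.find_all('div', {'class': "post-info"})]
print(titles)
```

Returning `None` for a missing heading leaves it to the calling code to decide whether to skip the post or report it.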

##### Exercise

Take some time to write code to retrieve all of the article authors and article URLs below.

In [41]:
### URLS ###

## banner article
print(soup.find('div', {'class':"featured-banner"}).find('div', {'class':"post-info"}).find('h2', {'class':"article-title entry-title"}).a['href'])
print()

## get remaining articles in results
## use find_all
for div in soup.find('div', {'class':"posts content-area"}).find_all('div', {'class':"post-info"}):
    ## it's good practice to check that the element you're expecting
    ## to be there is actually there before using it.
    ## find returns None when nothing matches, so we check the
    ## heading first and only then look for its a child
    h2 = div.find('h2', {'class':"article-title entry-title"})
    h3 = div.find('h3', {'class':"article-title entry-title"})
    if h2 is not None and h2.a is not None:
        print(h2.a['href'])
    elif h3 is not None and h3.a is not None:
        print(h3.a['href'])
    else:
        print("href not in h2 or h3 elements")
    print("")



https://fivethirtyeight.com/features/if-tiger-woods-tees-off-at-the-masters-hell-be-playing-to-win/

https://projects.fivethirtyeight.com/2022-mlb-predictions/

https://fivethirtyeight.com/features/how-our-mlb-forecast-is-changing-for-2022/

https://fivethirtyeight.com/features/the-raptors-dont-need-bigs-to-pound-the-ball-inside/

https://fivethirtyeight.com/features/al-west-preview-the-mariners-and-angels-are-maybe-coming-for-the-astros-crown/

https://fivethirtyeight.com/features/nl-west-preview-the-dodgers-are-still-trying-to-outspend-and-out-talent-everyone-else/

https://fivethirtyeight.com/features/nfl-general-managers-still-love-the-senior-bowl-but-do-those-players-pan-out/

https://fivethirtyeight.com/features/nl-central-preview-who-will-play-spoiler-to-the-brewers/

https://fivethirtyeight.com/features/what-the-mlb-lockout-can-tell-us-about-political-fandom-and-sports-partisanship/

https://fivethirtyeight.com/features/how-each-team-in-the-womens-final-four-can-win-it-all/



In [44]:
### authors ###
## banner article
print(soup.find('div', {'class':"featured-banner"}).find('div', {'class':"post-info"}).find('p', {'class':"byline vcard"}).text.strip())
print()

## get remaining articles in results
## use find_all
for div in soup.find('div', {'class':"posts content-area"}).find_all('div', {'class':"post-info"}):
    ## it's good practice to check that the element you're expecting
    ## to be there is actually there before using it
    if div.find('p', {'class':"single-metadata card vcard"}):
        ## I clean the text here to remove annoying white space
        print(div.find('p', {'class':"single-metadata card vcard"}).text.strip())
    else:
        print("Author not in the expected p element")
    print("")



By Alex Kirshner

By Jay Boice

By Neil Paine and Jay Boice

By Louis Zatzman

By Neil Paine

By Neil Paine

By Josh Hermsmeyer

By Neil Paine

By Nathaniel Rakich and Jean Yi

By Howard Megdal



## Common problems while web scraping

### Messy or inconsistent HTML code

We have seen one problem that you can encounter while web scraping: small, messy differences in HTML code that make automating your scraping more difficult. It is important to note that 538 is actually not very messy in the grand scheme of the world wide web. For example, you can come across websites that do not label their HTML elements with `id`s or `class`es or any other kind of distinguishing metadata. This makes automation incredibly difficult. Other websites may offer no consistency from page to page. In such cases there may not be a quick or easy fix; you typically just have to hack something together and hope it works.

### Too many requests

Repeatedly sending requests to the same website can raise a flag at the site's server, after which your IP address may be blocked from receiving responses for some period of time. This is why it is good practice to space out your requests to a single website. You can do so with the `sleep` function in the `time` module, <a href="https://docs.python.org/3/library/time.html#time.sleep">https://docs.python.org/3/library/time.html#time.sleep</a>. While this decreases your risk of being flagged as a bot/scraper, it is also just being a good denizen of the internet. Sending too many requests to a single website in a short amount of time can interfere with that website's ability to function for other visitors.
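A minimal sketch of spacing out requests with `time.sleep` is shown below. The function `fetch_politely` and its parameters are hypothetical names introduced for illustration. To keep the example runnable without touching a real server, it uses a stand-in `fake_fetch` function and a short delay; the delay you actually want is an assumption that depends on the site.

```python
import time

def fetch_politely(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each url, pausing between consecutive calls."""
    results = []
    for i, url in enumerate(urls):
        ## No need to wait before the very first request
        if i > 0:
            time.sleep(delay_seconds)
        results.append(fetch(url))
    return results

## Stand-in for a real request so that no server is contacted here
def fake_fetch(url):
    return f"fetched {url}"

demo_urls = ["https://example.com/a", "https://example.com/b"]
demo_results = fetch_politely(demo_urls, fake_fetch, delay_seconds=0.2)
print(demo_results)
```

With real pages, `fetch` could be `requests.get`, for example `fetch_politely(urls, requests.get, delay_seconds=2)`.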

### Bot detection

Some websites have been set up to detect bot/scraper activity regardless of the number of times you send a request. Sometimes there are ways around this, but the specific approach depends upon how the website is blocking your request. To counter such detection, do a web search for the specific error or response code you are getting and look for a helpful Stack Overflow or Stack Exchange post.

### User interactive content

Some of the content on a page may depend on the actions of a user visiting that page. For example, there are websites where data tables do not load until the user has clicked a button or scrolled down the page. Again, there are workarounds for this, but the answer will depend upon the specific issue you are encountering. Do a web search and there is likely to be a solution.

## Summary

In this notebook we touched on how you can parse HTML code with the `bs4` package. We looked at both a simple phony example and an example from a live website. If you are interested in learning more about `bs4` I encourage you to consult their documentation, <a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/">https://www.crummy.com/software/BeautifulSoup/bs4/doc/</a>.

--------------------------

This notebook was written for the Erdős Institute Cőde Data Science Boot Camp by Matthew Osborne, Ph.D., 2022.

Any potential redistributors must seek and receive permission from Matthew Tyler Osborne, Ph.D. prior to redistribution. Redistribution of the material contained in this repository is conditional on acknowledgement of Matthew Tyler Osborne, Ph.D.'s original authorship and sponsorship of the Erdős Institute as subject to the license (see License.md)