# Web Scraping with `BeautifulSoup`

Sometimes you won't be able to find a neat data set for your project. However, if the data you want exists on the internet, there is a chance that you can scrape it yourself. Let's learn how to do that in Python.

## What We'll Accomplish in this Notebook

In this notebook we'll do the following:
- Learn about the structure of an HTML page,
- Introduce the `BeautifulSoup` package,
- Parse HTML code with a toy example,
- Scrape data from some saved HTML code,
- Download then scrape an actual webpage.

Let's get going!

In [1]:
## Import base packages we'll use
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from seaborn import set_style
set_style("whitegrid")

## Scraping Data with `BeautifulSoup`

### Importing `BeautifulSoup`

In order to use `BeautifulSoup` to scrape web data we first need to make sure that we have it installed on our computer. Try to run the following code:

In [2]:
## This imports BeautifulSoup
from bs4 import BeautifulSoup

If the above code does not work you'll need to install the package before being able to run the code in this notebook. Here are installation instructions from the `BeautifulSoup` documentation, <a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-beautiful-soup">https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-beautiful-soup</a>.
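As a small sketch of how you might check the installation from inside the notebook, the cell below looks for the `bs4` module without raising an error when it is missing. The variable name `bs4_available` is just an illustrative choice. The package is published on PyPI under the name `beautifulsoup4`.

```python
import importlib.util

# find_spec returns None when a module cannot be imported
bs4_available = importlib.util.find_spec("bs4") is not None

if bs4_available:
    import bs4
    print("BeautifulSoup version:", bs4.__version__)
else:
    print("bs4 is not installed; try: pip install beautifulsoup4")
```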

### The Structure of an HTML Page

`BeautifulSoup` takes in an HTML document and 'parses' it for you so that you can extract the information you want. To best understand what that means, let's look at a toy example of a webpage. To see what the snippet of HTML code below looks like in a web browser, click here: <a href="SampleHTML.html">SampleHTML.html</a>.

In [3]:
## This is an html chunk
## It has a head and a body, just like you
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
</body>
</html>
"""

We can now use `BeautifulSoup` to parse this simple HTML chunk.

In [4]:
## Now we make a BeautifulSoup object out of the html code
## The first input is the html code
## The second input is how you want BeautifulSoup
## to parse the code
soup = BeautifulSoup(html_doc,'html.parser')

In [5]:
## Let's use the prettify method to make our html pretty and see what it has to say
## Ideally this is how someone writing pure html code would write their code
print(soup.prettify())

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>


HTML files have a natural tree structure that we'll briefly cover now. Here's the tree of our sample HTML:

<img src = "html_tree.png" width = "50%"></img>

Each level in the tree represents a 'generation' of the HTML code. For instance, the body has three p children, and the leftmost p has one b child. `BeautifulSoup` helps us traverse these trees to gather the data we want.
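To make the 'generations' idea concrete, here is a minimal sketch that walks a trimmed copy of the toy document and prints each tag indented by its depth. The helper `tag_tree` is a hypothetical function written for this illustration, not part of `BeautifulSoup`.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# A trimmed copy of the toy document (the a tags are left out)
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time...</p>
<p class="story">...</p>
</body>
</html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

def tag_tree(tag, depth=0):
    """Return (depth, tag name) pairs for tag and all of its tag descendants."""
    rows = [(depth, tag.name)]
    for child in tag.children:
        # .children also yields text nodes, so keep only Tag objects
        if isinstance(child, Tag):
            rows.extend(tag_tree(child, depth + 1))
    return rows

tree = tag_tree(soup.html)
for depth, name in tree:
    print("  " * depth + name)
```

Each extra level of indentation in the printout corresponds to one level further down the tree.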

We'll now traverse this html sapling.

In [6]:
## Below are some examples of beautifulsoup methods and 
## attributes that help us better understand the structure 
## of html code

In [7]:
## We can traverse to the "title" by working our way through
## the tree
print(soup.head.title)
print()

<title>The Dormouse's story</title>



In [8]:
## Notice we can also get the title like so
## This is because this is the first and only title 
## in the code
print(soup.title)

<title>The Dormouse's story</title>


In [9]:
## What if I just want the text from the title?
print(soup.title.text)

The Dormouse's story


In [10]:
## What html structure is the title's parent?
print(soup.title.parent)
print(soup.title.parent.name)

<head><title>The Dormouse's story</title></head>
head


In [11]:
## What is the first a of the html document?
print(soup.a)

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>


In [12]:
## What is the first a's class?
print(soup.a['class'])

['sister']


In [13]:
## There are multiple a's. Can I find all of them?
print(soup.find_all('a'))
for a in soup.find_all('a'):
    print()
    print(a['class'], a.text)

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

['sister'] Elsie

['sister'] Lacie

['sister'] Tillie
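The cells above use `find_all` and dictionary-style attribute access. As a supplementary sketch on a shortened copy of the toy document, the example below contrasts `find` with `find_all`, and shows the `.get` method, which returns `None` for a missing attribute instead of raising a `KeyError`.

```python
from bs4 import BeautifulSoup

# A shortened copy of the toy document
html_doc = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# find returns only the first match, find_all returns every match
first_link = soup.find("a")
all_links = soup.find_all("a")

# .get returns None when the attribute is missing
link_ids = [a.get("id") for a in all_links]
missing = first_link.get("title")

# find_all also accepts keyword filters on attributes such as id
lacie = soup.find_all("a", id="link2")

print(link_ids)
print(missing)
print(lacie[0].text)
```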


#### You Code

Take a breakout session and try the following coding exercises. This may go quickly, so don't feel bad if you are unable to complete the exercise before time is up. You can always come back to finish it later!

In [14]:
## Find the first p of the document
## What is the first p's class? What string is in that p?


## Sample Solution
print(soup.p)
print(soup.p['class'])
print(soup.p.text)

<p class="title"><b>The Dormouse's story</b></p>
['title']
The Dormouse's story


In [15]:
## For all of the a's in the document find their href

## Sample Solution
for a in soup.find_all('a'):
    print(a['href'])

http://example.com/elsie
http://example.com/lacie
http://example.com/tillie


Now that we've got some experience, let's move on to some slightly more advanced parsing.

### Now We're of Drinking Age

I've included in this repository some HTML code from an <a href = "https://untappd.com/home">Untappd</a> search. I went to Untappd, found the page for <a href = "https://www.seventhsonbrewing.com">Seventh Son</a>, a brewery near my apartment, clicked on their beer list, and saved only the HTML code from the results. You can see the HTML file here: <a href="SeventhSon.html">SeventhSon.html</a>. We can read in that file with the following code.

In [16]:
## This will open the html file so we can parse its code

seventh_son_beer_search = open("SeventhSon.html", "r", encoding='utf8')

## This code may be unneeded; it is here in case the above code
## causes an error
#seventh_son_beer_search = open("SeventhSon.html", 'r')

## Let's look at what we just did
seventh_son_beer_search

<_io.TextIOWrapper name='SeventhSon.html' mode='r' encoding='utf8'>
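The cell above leaves the file open. A common alternative, sketched below, is to open the file inside a `with` block so that Python closes it automatically. To keep the sketch self-contained it writes a tiny stand-in file first; the file name `sample.html` and its contents are made up for illustration, and in the notebook you would open `SeventhSon.html` instead.

```python
import tempfile
from pathlib import Path
from bs4 import BeautifulSoup

# Write a tiny stand-in file so the example runs on its own
tmp_dir = tempfile.mkdtemp()
sample_path = Path(tmp_dir) / "sample.html"
sample_path.write_text(
    '<div class="beer-item"><p class="name">Humulus Nimbus</p></div>',
    encoding="utf8",
)

# The with block closes the file automatically once we leave it
with open(sample_path, "r", encoding="utf8") as html_file:
    sample_soup = BeautifulSoup(html_file, "html.parser")

print(html_file.closed)
print(sample_soup.find("p", {"class": "name"}).text)
```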

In [17]:
## You write code here to make a soup object of our code
## We'll do this together, not in a breakout session
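## Sample Answer
## Note: "html" lets BeautifulSoup choose among the HTML parsers installed
## on your computer, so the parsed output can differ between machines
soup = BeautifulSoup(seventh_son_beer_search,"html")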

In [18]:
## Look at the code using prettify
## Again we'll do this together, not in a breakout session
## Sample Answer
print(soup.prettify())

<html>
 <body>
  <div class="beer-container">
   <div class="beer-item" data-bid="382779">
    <a class="label" href="/b/seventh-son-brewing-company-humulus-nimbus/382779">
     <img src="https://untappd.akamaized.net/site/beer_logos/beer-382779_814af_sm.jpeg"/>
    </a>
    <div class="beer-details">
     <p class="name">
      <a href="/b/seventh-son-brewing-company-humulus-nimbus/382779">
       Humulus Nimbus
      </a>
     </p>
     <p class="style">
      Pale Ale - American
     </p>
     <p class="desc desc-half-382779">
      A pale golden ale that is both super crisp and super hop forward with a refreshing mouthfeel and a summer friendly 6% abv. Mosaic &amp; simcoe hops lend tart…
      <a class="read-more-beerlist track-click" data-bid="382779" data-href=":readmorebeer" data-track="brewerylist" href="#">
       Read More
      </a>
     </p>
     <p class="desc desc-full-382779" style="display: none;">
      A pale golden ale that is both super crisp and super hop forward w

As we can see from the `prettify()` output, this HTML code is more complicated than our toy example from above, but `BeautifulSoup` is able to handle it all the same. Let's write some code to go through the HTML, grab the beer names, and store those names in a list.

To do that, let's learn a little more about `BeautifulSoup`'s functionality. Looking at the `prettify()` output, we see that each beer is contained in a `div` with the class `"beer-item"`. We can use that class information to our advantage.
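As a rough sketch of searching by class, the example below runs three equivalent searches on a made-up snippet that imitates the structure of the saved search results. The beer names and `data-bid` values are invented for illustration. The dictionary form is the one used in this notebook; the `class_` keyword and the CSS selector passed to `select` are alternatives that `BeautifulSoup` also supports.

```python
from bs4 import BeautifulSoup

# A made-up snippet that imitates the structure of the saved search results
sample_html = """
<div class="beer-container">
  <div class="beer-item" data-bid="1">
    <div class="beer-details"><p class="name"><a href="#">First Beer </a></p></div>
  </div>
  <div class="beer-item" data-bid="2">
    <div class="beer-details"><p class="name"><a href="#">Second Beer</a></p></div>
  </div>
</div>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Three ways to ask for every div whose class is "beer-item"
by_dict = sample_soup.find_all("div", {"class": "beer-item"})
by_keyword = sample_soup.find_all("div", class_="beer-item")
by_selector = sample_soup.select("div.beer-item")

# get_text(strip=True) removes the stray whitespace around each name
names = [item.find("p", class_="name").get_text(strip=True) for item in by_dict]

print(len(by_dict), len(by_keyword), len(by_selector))
print(names)
```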

In [19]:
## We can find all 'div's with the 'class' = "beer-item"
## by using a dictionary argument to find_all
cooler = soup.find_all('div',{'class':"beer-item"})

In [20]:
cooler[0]

<div class="beer-item" data-bid="382779">
<a class="label" href="/b/seventh-son-brewing-company-humulus-nimbus/382779">
<img src="https://untappd.akamaized.net/site/beer_logos/beer-382779_814af_sm.jpeg"/>
</a><div class="beer-details">
<p class="name"><a href="/b/seventh-son-brewing-company-humulus-nimbus/382779">Humulus Nimbus </a></p>
<p class="style">Pale Ale - American</p>
<p class="desc desc-half-382779">A pale golden ale that is both super crisp and super hop forward with a refreshing mouthfeel and a summer friendly 6% abv. Mosaic &amp; simcoe hops lend tart… <a class="read-more-beerlist track-click" data-bid="382779" data-href=":readmorebeer" data-track="brewerylist" href="#">Read More</a> </p>
<p class="desc desc-full-382779" style="display: none;">A pale golden ale that is both super crisp and super hop forward with a refreshing mouthfeel and a summer friendly 6% abv. Mosaic &amp; simcoe hops lend tart blueberry and fragrant pine to a pleasingly bitter dandelion finish. We wan

In [21]:
len(cooler)

113

In [22]:
## Now we can use what we just learned to extract the name from the first beer
## The name is contained in a p element with class "name"
print(cooler[0].find("p",{'class':"name"}).text)

Humulus Nimbus 


In [23]:
## Using a list comprehension we can make a list that contains
## all of the beer names
beer_names = [beer.find("p",{'class':"name"}).text for beer in cooler]

In [24]:
for beer in beer_names[:5]:
    print(beer)

Humulus Nimbus 
Proliferous
The Scientist
Seventh Son American Strong Ale
Stone Fort Oat Brown


#### You Code

You've been hired by a competitor to Seventh Son. They want a dataframe of all of Seventh Son's beers that includes each beer's name, type, ABV, and IBU. Use `BeautifulSoup` to give them this.

Again these sessions can go quickly, so it's okay if you're unable to complete it right now. Just try your best, and you can always finish it later!

In [25]:
## Your Code here

## Sample Answer
## Each field lives in a p element whose class names the field
beers = [beer.find("p",{'class':"name"}).text for beer in cooler]
types = [beer.find("p",{'class':"style"}).text for beer in cooler]
## The ibu and abv text includes newlines and a unit label, so remove both
ibus = [beer.find("p",{'class':"ibu"}).text.replace("\n","").replace(" IBU","") for beer in cooler]
abvs = [beer.find("p",{'class':"abv"}).text.replace("\n","").replace(" ABV","") for beer in cooler]

## Combine the lists into a single dataframe
beer_df = pd.DataFrame({'name':beers,'type':types,'ibu':ibus,'abv':abvs})

beer_df.head(10)

|   | name | type | ibu | abv |
|---|------|------|-----|-----|
| 0 | Humulus Nimbus | Pale Ale - American | 53 | 6% |
| 1 | Proliferous | IPA - Imperial / Double | 85 | 8.3% |
| 2 | The Scientist | IPA - American | 75 | 7% |
| 3 | Seventh Son American Strong Ale | Strong Ale - American | 40 | 7.7% |
| 4 | Stone Fort Oat Brown | Brown Ale - English | 21 | 5.25% |
| 5 | Oubliette | Stout - American Imperial / Double | 99 | 12% |
| 6 | Syzygy | IPA - Imperial / Double | 100 | 10.9% |
| 7 | Assistant Manager | Golden Ale | 36 | 4.5% |
| 8 | Goo Goo Muck | Sour - Farmhouse IPA | 50 | 7.6% |
| 9 | Golden Ratio | IPA - American | 68 | 7% |
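Note that scraped values arrive as strings, so the `ibu` and `abv` columns above are text rather than numbers. One possible follow-up step, sketched here on the first three rows of the table typed in by hand, converts them with `pd.to_numeric`. The dataframe name `sample_df` is an illustrative choice.

```python
import pandas as pd

# The first rows of the table above, typed in by hand
sample_df = pd.DataFrame({
    "name": ["Humulus Nimbus", "Proliferous", "The Scientist"],
    "ibu": ["53", "85", "75"],
    "abv": ["6%", "8.3%", "7%"],
})

# errors="coerce" turns anything that is not a number into NaN
sample_df["ibu"] = pd.to_numeric(sample_df["ibu"], errors="coerce")

# Remove the percent sign before converting the ABV strings to floats
sample_df["abv"] = pd.to_numeric(sample_df["abv"].str.rstrip("%"), errors="coerce")

print(sample_df.dtypes)
print(sample_df["abv"].mean())
```

With numeric columns, summaries such as the mean ABV or a histogram of IBU become one-liners.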


In [26]:
## You Code
## Use pandas to find how many of each type of beer
## exist in the data set

## Sample Answer
beer_df['type'].value_counts()

IPA - American                          17
Saison / Farmhouse Ale                  11
IPA - Imperial / Double                  9
Pale Ale - American                      7
IPA - New England                        6
IPA - Brut                               4
Smoked Beer                              4
Stout - American Imperial / Double       4
Sour - Farmhouse IPA                     3
Pale Ale - New England                   3
IPA - Black / Cascadian Dark Ale         3
Stout - Imperial / Double                3
Lager - IPL (India Pale Lager)           2
Lager - Euro                             2
Bière de Garde                           2
Stout - American                         2
Barleywine - American                    2
Blonde Ale - Belgian Blonde / Golden     2
Brown Ale - English                      1
Belgian Tripel                           1
Dark Ale                                 1
Witbier                                  1
Barleywine - English                     1
Pale Ale - 

Great! We're finally getting familiar enough with BeautifulSoup to move on to an actual website.

### Surfing the web

Here's our hypothetical project. You're hired by someone who wants to start a FiveThirtyEight-like website, but hates writing. Their goal is to create a bot that uses a natural language processing algorithm to generate new 538-like articles from previous 538 articles. They're too busy working on the algorithm, so they've outsourced the job of scraping the article content to us.

Their desired output is a compilation of 538's articles. The data they need is each article's title, author, and text.

Let's go through how to get the title, author, and text for one specific article.

Let's try this article, <a href="https://fivethirtyeight.com/features/the-mavericks-bet-big-on-kristaps-porzingis-its-paying-off-on-offense-but-what-about-defense/">The Mavericks Bet Big On Kristaps Porziņģis. It’s Paying Off On Offense, But What About Defense?</a>.

In [27]:
## This will let us get the html code from 538
from urllib.request import urlopen

In [28]:
## First we'll tell python to grab the html code for us
html = urlopen("https://fivethirtyeight.com/features/" +
               "the-mavericks-bet-big-on-kristaps-porzingis" +
               "-its-paying-off-on-offense-but-what-about-defense/")

In [29]:
## Then we make it into a soup object
soup = BeautifulSoup(html,"html.parser")
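The two cells above download the page and parse it in separate steps. As a hedged sketch of how those steps could be bundled, the hypothetical helper `fetch_soup` below adds a timeout, a placeholder `User-Agent` header, and explicit decoding. The header value and the default timeout are arbitrary assumptions, and nothing here is specific to 538.

```python
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

def fetch_soup(url, timeout=10):
    """Download url and return a BeautifulSoup object (hypothetical helper)."""
    # Some sites reject requests without a User-Agent header;
    # the value below is an arbitrary placeholder
    request = Request(url, headers={"User-Agent": "scraping-notebook-example"})
    # The with block closes the connection once the page has been read
    with urlopen(request, timeout=timeout) as response:
        raw_bytes = response.read()
        # Fall back to utf-8 when the server does not declare a charset
        charset = response.headers.get_content_charset() or "utf-8"
    return BeautifulSoup(raw_bytes.decode(charset, errors="replace"), "html.parser")

# Usage (needs an internet connection), where article_url holds
# the 538 address used above:
# soup = fetch_soup(article_url)
```

Before scraping a real site at any scale, it is worth checking its terms of use and spacing out your requests.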

In [30]:
print(soup.prettify())

<!DOCTYPE html>
<html class="no-js" lang="en-US">
 <head>
  <meta charset="utf-8"/>
  <meta content="width=device-width, initial-scale=1.0" name="viewport"/>
  <script src="https://dcf.espn.com/TWDC-DTCI/prod/Bootstrap.js">
  </script>
  <title>
   The Mavericks Bet Big On Kristaps Porziņģis. It’s Paying Off On Offense, But What About Defense? | FiveThirtyEight
  </title>
  <!-- Jetpack Site Verification Tags -->
  <link as="font" crossorigin="anonymous" href="https://fivethirtyeight.com/wp-content/themes/espn-fivethirtyeight/assets/fonts/AtlasGrotesk-Bold-Web.woff2" rel="preload" type="font/woff2"/>
  <link as="font" crossorigin="anonymous" href="https://fivethirtyeight.com/wp-content/themes/espn-fivethirtyeight/assets/fonts/AtlasGrotesk-Regular-Web.woff2" rel="preload" type="font/woff2"/>
  <link as="font" crossorigin="anonymous" href="https://fivethirtyeight.com/wp-content/themes/espn-fivethirtyeight/assets/fonts/decimamonopro-webfont.woff2" rel="preload" type="font/woff2"/>
  <link

### Web Developer Tools

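We can see that reading through all of this code by hand would be tedious. Let's learn about your web browser's developer tools, which let you inspect a page and find the HTML element that holds a given piece of content.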

#### Finding the Author

In particular let's see how to find the author field!

In [31]:
## Our search was seeded by our look at the Web Developer Tools
soup.find('p',{'class':"single-metadata single-byline vcard"})

<p class="single-metadata single-byline vcard">By <a class="author url fn" href="https://fivethirtyeight.com/contributors/jared-dubin/" rel="author" title="">Jared Dubin</a></p>

In [32]:
## we can clean this up a little bit
soup.find('p',{'class':"single-metadata single-byline vcard"}).text.replace("By ", "")

'Jared Dubin'
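Removing `"By "` with `replace` works for this article, but it would also alter any name that happens to contain that string. An alternative sketch, using the byline element printed above as a string, reads the author name straight from the `a` tag. Collecting the names in a list is meant to cover articles with several authors, under the unverified assumption that each author gets their own link.

```python
from bs4 import BeautifulSoup

# The byline element printed above, copied in as a string
byline_html = (
    '<p class="single-metadata single-byline vcard">By '
    '<a class="author url fn" href="https://fivethirtyeight.com/contributors/jared-dubin/" '
    'rel="author" title="">Jared Dubin</a></p>'
)

# class_ matches a tag that has this class among its classes
byline = BeautifulSoup(byline_html, "html.parser").find("p", class_="single-byline")

# Collect every author link; get_text(strip=True) trims stray whitespace
authors = [a.get_text(strip=True) for a in byline.find_all("a", class_="author")]
print(authors)
```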

#### You Code

In our final breakout practice session of this notebook, you'll work on getting the remaining desired data: the title and the text of the article.

In [33]:
## You code
## Find the title here

## Sample Solution
soup.find('h1',{'class':"article-title article-title-single entry-title"}).text.replace("\n","").replace("\t","")

'The Mavericks Bet Big On Kristaps Porziņģis. It’s Paying Off On Offense, But What About Defense?'

In [34]:
## You code
## Find the article text here

## Sample Solution
[p.text for p in soup.find('div',{'class':"entry-content single-post-content"}).find_all("p")]


['Midway through Luka Dončić’s rookie season, the Dallas Mavericks seized on an opportunity that doesn’t come around all that often. Then-New York Knicks star Kristaps Porziņģis, still recovering from an ACL tear at the time, informed Knicks management that he wanted out of New York. Mere hours after news of the fateful meeting between Porziņģis and the front office broke, the Knicks and Mavericks reached a deal to send him to Dallas.',
 'Based on both the package the Mavericks sent to New York and the post-trade comments of head coach Rick Carlisle, it was clear pretty immediately that the team was thinking big. “We obviously think Porziņģis is a great young talent, similar in many ways to Dirk [Nowitzki],” Carlisle told ESPN Radio. “This is kind of a Dirk-and-[Steve] Nash type of situation, only these guys are taller.”',
 'Porziņģis sat out the remainder of that season as he finished rehabbing his injured knee, but the Mavericks nonetheless committed to him in full that summer. Dalla
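Putting the three pieces together, here is a minimal sketch of a function that returns the title, author, and text in one dictionary. The function `extract_article` and the sample page are made up for illustration; the page only reuses the class names found above. The sketch assumes all three elements exist, so a page with a different layout would raise an `AttributeError`.

```python
from bs4 import BeautifulSoup

# A made-up page that reuses the class names found above
sample_page = """
<html><body>
<h1 class="article-title article-title-single entry-title">
    A Sample Headline
</h1>
<p class="single-metadata single-byline vcard">By <a class="author url fn" href="#">Sample Author</a></p>
<div class="entry-content single-post-content">
<p>First paragraph.</p>
<p>Second paragraph.</p>
</div>
</body></html>
"""

def extract_article(soup):
    """Return the title, author, and text of an article page as a dict."""
    title = soup.find("h1", class_="article-title")
    byline = soup.find("p", class_="single-byline")
    body = soup.find("div", class_="entry-content")
    return {
        "title": title.get_text(strip=True),
        "author": byline.a.get_text(strip=True),
        # Join the paragraphs with blank lines to form one string
        "text": "\n\n".join(p.get_text() for p in body.find_all("p")),
    }

article = extract_article(BeautifulSoup(sample_page, "html.parser"))
print(article)
```

Running a function like this over a list of article pages would give one dictionary per article, which `pd.DataFrame` can turn into a table.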

## Going Beyond BeautifulSoup, APIs

As we've seen above, `BeautifulSoup` can be quite powerful, but scraping code can also be a hassle to write. Some websites and apps are popular enough that people have taken the time to write Python wrappers for their APIs.

In layman's terms, an API is a way for you to "talk" to the website and tell it what data you want to get back. For a better explanation, watch this short YouTube video on your own time: <a href="https://www.youtube.com/watch?v=s7wmiS2mSXY">https://www.youtube.com/watch?v=s7wmiS2mSXY</a>. A Python wrapper for an API is just a Python package that was written so you can communicate with the API in Python as opposed to some other programming language.
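To give a feel for the difference from scraping, here is a rough sketch of one API exchange. The endpoint `api.example.com`, its parameters, and the response body are all hypothetical, and no request is actually sent. The point is only that you describe what you want in the URL and typically get structured data such as JSON back, rather than HTML you have to parse.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint and parameters, for illustration only
base_url = "https://api.example.com/v1/beers"
params = {"brewery": "seventh-son", "limit": 2}

# The request describes the data we want as query parameters
request_url = base_url + "?" + urlencode(params)
print(request_url)

# A made-up response body in JSON format
response_body = (
    '{"beers": [{"name": "Humulus Nimbus", "abv": 6.0}, '
    '{"name": "Proliferous", "abv": 8.3}]}'
)

# json.loads turns the text into Python dictionaries and lists
data = json.loads(response_body)
names = [beer["name"] for beer in data["beers"]]
print(names)
```

A Python wrapper hides steps like these behind ordinary function calls.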

Some popular Python wrappers include:
- <a href="https://www.tweepy.org/">Tweepy</a>
- <a href="https://spotipy.readthedocs.io/en/2.15.0/">Spotipy</a>
- <a href="https://praw.readthedocs.io/en/latest/">praw</a>
- <a href="https://github.com/realpython/list-of-python-api-wrappers">And more!</a>

Many APIs require you to have developer credentials (for example the Twitter API), so if you have a project idea that may utilize an API you'll want to apply for the credentials ASAP. Some can take a long time to grant you access.

## That's It!

That's it for this notebook. For additional practice, check out the corresponding homework notebook, which has problems that practice what we learned here and problems that expand on it with more advanced scraping.

See you in the next notebook, where we briefly discuss how to use Python to access a database.

This notebook was written for the Erdős Institute Cőde Data Science Boot Camp by Matthew Osborne, Ph.D., 2021.

Redistribution of the material contained in this repository is conditional on acknowledgement of Matthew Tyler Osborne, Ph.D.'s original authorship and sponsorship of the Erdős Institute as subject to the license (see License.md).