<br>
<h1><TT>It's Officially Legal so Let's Scrape the Web</TT></h1>
<br>

Kimberly Fessel  
- Twitter - @kimberlyfessel 
- LinkedIn - kimberlyfessel

<br>
<h2> <TT> Scraping Basics </TT> </h2>

<br>

---



# Introduction to Google Colab and `BeautifulSoup`


### Google Colab

- Executes Python code on the fly
- Interactivity allows for instant feedback
- Memory persists across cells
- `shift+enter` runs the current cell
- Use [markdown](https://blog.ghost.org/markdown/) (TEXT) mode for adding text like this

### BeautifulSoup

- open-source Python library
- extract data from HTML files
- understands HTML structure by working with a [parser](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser) (`lxml`, `html5lib`, etc.) 
- [BeautifulSoup documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) for reference
<br> <br>

`BeautifulSoup` does not actually gather information from the web.  We will use the `requests` library for that.
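To make that division of labor concrete, here is a minimal sketch, assuming `requests` and `bs4` are installed. The helper name `fetch_soup` and its `timeout` default are illustrative choices, not part of either library. Downloading is `requests`' job, while parsing works on any HTML string, so the last lines run without touching the network.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10):
    """Download a page with requests, then hand the HTML text to BeautifulSoup.

    Hypothetical helper for illustration; it is defined here but not called.
    """
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raise an exception on 4xx/5xx responses
    # 'html.parser' ships with Python; 'lxml' also works if it is installed
    return BeautifulSoup(response.text, 'html.parser')


# Parsing needs no network at all: any HTML string will do.
offline_soup = BeautifulSoup('<p>Hello, scraper!</p>', 'html.parser')
print(offline_soup.p.text)
```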

# Learn to Scrape with Simple Inline HTML

Let's start with this simple HTML page given below as a string: 

In [1]:
simple_html = """
<html>

<head>
  <style>
    li {font-size: 18px;}
  </style>
</head>

<body>
  <div style="border-style: dotted; padding: 10px">
    <h1>Today's Learning Objectives</h1>
    <ul>
      <li>Decipher basic HTML</li>
      <li>Retrieve information from Internet</li>
      <li>Parse web data</li>
      <li>Gather and prepare data systematically</li>
    </ul>
    <br>
  </div>
</body>

</html>
"""

**EXERCISE: Quick HTML Review**
> What tags do we see on this page?  
> What attributes?  
> What's the inner HTML text of the header?

Now we will tell Python to render this string as HTML.

In [2]:
from IPython.display import display, HTML
display(HTML(simple_html)) 

This simple "page" contains a list of learning objectives for today's workshop. Now we will see how `BeautifulSoup` can extract information from this HTML.

First we need to import `BeautifulSoup` and parse the HTML string.

In [3]:
from bs4 import BeautifulSoup as bs

In [4]:
soup = bs(simple_html)

In [82]:
soup


<html>
<head>
<style>
    li {font-size: 18px;}
  </style>
</head>
<body>
<div style="border-style: dotted; padding: 10px">
<h1>Today's Learning Objectives</h1>
<ul>
<li>Decipher basic HTML</li>
<li>Retrieve information from Internet</li>
<li>Parse web data</li>
<li>Gather and prepare data systematically</li>
</ul>
<br/>
</div>
</body>
</html>

When we print out `soup`, it looks like `BeautifulSoup` hasn't done anything!  But no worries -- it has indeed parsed our HTML, and `BeautifulSoup` now knows how to navigate through the HTML DOM.

In [7]:
type(soup)

bs4.BeautifulSoup
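As a small illustration of what "navigating" means, the sketch below parses a shortened copy of the page and walks the tree with dot notation. It names the parser explicitly (`'html.parser'`, which ships with Python) as an assumption for portability; the cells in this notebook let `BeautifulSoup` pick one, which can trigger a "no parser was explicitly specified" warning. The variable name `demo_soup` is only for this example.

```python
from bs4 import BeautifulSoup

html = """
<html><body>
  <div style="border-style: dotted; padding: 10px">
    <h1>Today's Learning Objectives</h1>
    <ul><li>Decipher basic HTML</li><li>Parse web data</li></ul>
  </div>
</body></html>
"""

# Naming the parser explicitly keeps results consistent across machines
demo_soup = BeautifulSoup(html, 'html.parser')

print(demo_soup.h1.name)         # tag name of the first h1
print(demo_soup.h1.parent.name)  # the tag that contains that h1
print(demo_soup.div['style'])    # attribute lookup on the first div
```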

### Find by tag

We begin by using the `find()` method to extract the header of our HTML.

In [8]:
soup.find('h1')

<h1>Today's Learning Objectives</h1>

In [9]:
type(soup.find('h1'))

bs4.element.Tag

`find()` returns a tagged element, but we can grab just the inner HTML text instead.

In [10]:
soup.find('h1').text

"Today's Learning Objectives"

In [11]:
type(soup.find('h1').text)

str

We now have a way to extract text from a webpage -- powerful stuff!  

What do you think will be returned if we look for list tags (`li`)?

In [12]:
soup.find('li')

<li>Decipher basic HTML</li>

**Warning**: `BeautifulSoup` returns ONLY the FIRST matching element when we use `find()`.
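A related detail worth seeing once: when nothing matches, `find()` returns `None` rather than raising an error. The sketch below uses a tiny made-up HTML string (not the workshop page) to show both cases and one way to guard against the missing element.

```python
from bs4 import BeautifulSoup

demo_soup = BeautifulSoup('<ul><li>first</li><li>second</li></ul>', 'html.parser')

first_item = demo_soup.find('li')   # only the first <li> comes back
missing = demo_soup.find('table')   # there is no <table> in this HTML

print(first_item.text)
print(missing)

# Guard before touching .text, since None has no such attribute
table_text = missing.text if missing is not None else ''
```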

### Find all

If we would like `BeautifulSoup` to return ALL matching elements, we can use `find_all()` instead.

In [13]:
soup.find_all('li')

[<li>Decipher basic HTML</li>,
 <li>Retrieve information from Internet</li>,
 <li>Parse web data</li>,
 <li>Gather and prepare data systematically</li>]

In [24]:
type(soup.find_all('li'))

bs4.element.ResultSet

Using `find_all()` yields a result set containing all of the list items on the "page."  You can basically think of a result set as acting like a list.
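Here is a short sketch of that list-like behavior, using a made-up three-item list rather than the workshop page. Length, indexing, slicing, and iteration all work as they would on an ordinary Python list.

```python
from bs4 import BeautifulSoup

demo_soup = BeautifulSoup('<ul><li>a</li><li>b</li><li>c</li></ul>', 'html.parser')
items = demo_soup.find_all('li')

print(len(items))                     # number of matches
print(items[0].text)                  # indexing gives a single Tag
print([li.text for li in items[1:]])  # slicing and iteration work too
print(isinstance(items, list))        # a ResultSet is a list subclass
```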

**Warning**: `BeautifulSoup` does not allow you to apply `.text` to a result set.  The following code **will fail**.

In [14]:
soup.find_all('li').text

AttributeError: ResultSet object has no attribute 'text'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?

Instead, you must apply `.text` to each item in the result set individually.

In [15]:
for item in soup.find_all('li'):
  print(item.text)

Decipher basic HTML
Retrieve information from Internet
Parse web data
Gather and prepare data systematically


In [16]:
learning_objectives = [item.text for item in soup.find_all('li')]

learning_objectives

['Decipher basic HTML',
 'Retrieve information from Internet',
 'Parse web data',
 'Gather and prepare data systematically']

**Tip**: The two **most common mistakes** I see in web scraping with `BeautifulSoup` are:
- Using `find()` when you really want `find_all()`
- Attempting to apply `.text` to a result set like the output of `find_all()`

### Exercises

For the exercises that follow, please use this HTML code describing today's agenda and tools:

In [18]:
workshop_html = """
<html>

<body>
  <h1>Today's Workshop</h1>
  <div id='agenda' style="background-color: aliceblue">
    <h2>Agenda</h2>
    <p>Today's workshop is comprised of three main sections:</p>
    <ol>
      <li>HTML Basics</li>
      <li>Scraping Basics</li>
      <li>Scraping Pipeline</li>
    </ol>
  </div>
  
  <div id='tools' style='background-color: honeydew'>
    <h2>Tools</h2>
    <p>You will be learning about two primary Python libraries:</p>  
    <ol>
      <li>BeautifulSoup</li>
      <li>requests</li>
    </ol>
  </div>
</body>

</html>
"""

In [19]:
from IPython.display import display, HTML
display(HTML(workshop_html)) 

**Exercise 1 - Finding the header**  _(Solutions to all exercises are provided at the bottom of the notebook.)_
> Parse `workshop_html` with `BeautifulSoup`.  Find the main header text (`h1`) and save it in a variable.  Verify that you have the text by checking the `type` of your variable.

In [20]:
workshop_soup = bs(workshop_html)

In [78]:
workshop_header = workshop_soup.find('h1').text

In [81]:
type(workshop_header)

str

**Exercise 2 - Finding the paragraphs**

Now find all the paragraphs in `workshop_html` and print out the text that you find.

In [26]:
workshop_paragraphs = workshop_soup.find_all('p')

In [27]:
for p in workshop_paragraphs:
    print(p.text)

Today's workshop is comprised of three main sections:
You will be learning about two primary Python libraries:


**BONUS: Exercise 3 - Finding the agenda items**

Create a list of all of the agenda items for today's workshop.  Be sure to store only the TEXT for the AGENDA items!

In [75]:
workshop_agenda_items = workshop_soup.find(id='agenda').findChildren('li')

In [76]:
workshop_agenda_items

[<li>HTML Basics</li>, <li>Scraping Basics</li>, <li>Scraping Pipeline</li>]

In [77]:
type(workshop_agenda_items)

bs4.element.ResultSet

In [80]:
agenda_items = [li.text for li in workshop_soup.find_all('li')[:3]]
agenda_items

['HTML Basics', 'Scraping Basics', 'Scraping Pipeline']

# Scrape Test Webpage

In the last exercise, we found out that HTML tags alone often aren't granular enough to isolate the elements we want.  

Let's work with a more complicated HTML file to see what other options are available.

First download this file to your computer.

In [86]:
!wget https://raw.github.com/kimfetti/Conferences/master/PyCon_2020/pycon_info.html

--2021-04-03 06:44:01--  https://raw.github.com/kimfetti/Conferences/master/PyCon_2020/pycon_info.html
Resolving raw.github.com (raw.github.com)... 185.199.109.133, 185.199.108.133, 185.199.111.133, ...
Connecting to raw.github.com (raw.github.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://raw.githubusercontent.com/kimfetti/Conferences/master/PyCon_2020/pycon_info.html [following]
--2021-04-03 06:44:01--  https://raw.githubusercontent.com/kimfetti/Conferences/master/PyCon_2020/pycon_info.html
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.110.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2953 (2.9K) [text/plain]
Saving to: ‘pycon_info.html.1’


2021-04-03 06:44:02 (199 KB/s) - ‘pycon_info.html.1’ saved [2953/2953]



Double click on this file to view it in your browser.  Once you have gotten a feel for the structure, read the file in and save it as a string. 

In [83]:
with open('pycon_info.html') as f:
    pycon_html = f.read()

In [84]:
print(pycon_html)

<html>
    <head>
        <title>PyCon 2020 Info</title>

        <style>
            body {
                background-color: cornsilk;
            }

            h1 {
                font-size: 40px;
                font-family: courier new, arial;
                text-align: center;
                margin-top: 50px;
            }

            a {
                color: #411B2D;
                font-size: 20px;
            }

            p {
                font-size: 20px;
            }

            a:hover{
                color: white;
                background-color: #411B2D;
            }

            #toolbar {
                background-color: #F3B643;
                font-family: courier new, arial;
                font-weight: bold;
                font-size: 16px;
                display: flex;
                justify-content: space-around;
                flex-direction: row;
                border: 1px solid black;
                border-radius: 1px;
                marg

Since our HTML is a string, we can parse it with `BeautifulSoup` and begin collecting data.  

Let's say we are interested in gathering titles and links of events happening today.  Links can be found by looking for anchor, `a`, tags.  

In [87]:
soup = bs(pycon_html)

In [88]:
soup.find_all('a')

[<a href="https://us.pycon.org/2020/about/">WHAT IS PYCON?</a>,
 <a href="https://us.pycon.org/2020/schedule/tutorials/">TUTORIAL SCHEDULE</a>,
 <a href="https://us.pycon.org/2020/speaking/">SPEAKING AT PYCON</a>,
 <a href="https://us.pycon.org/2020/psf/">PYTHON SOFTWARE FOUNDATION</a>,
 <a href="https://us.pycon.org/2020/schedule/presentation/50/">Foundations of Numerical Computing in Python</a>,
 <a href="https://us.pycon.org/2020/schedule/presentation/72/">It's Officially Legal so Let's Scrape the Web</a>,
 <a href="https://us.pycon.org/2020/schedule/presentation/54/">A Beginner's Guide to Befriending Python</a>,
 <a href="https://us.pycon.org/2020/schedule/presentation/55/">Scalable Computing with Dask</a>,
 <a href="https://us.pycon.org/2020/schedule/presentation/63/">Creating a Great Python Package</a>,
 <a href="https://us.pycon.org/2020/schedule/presentation/45/">Minimum Viable Documentation</a>,
 <a href="https://us.pycon.org/2020/schedule/presentation/74/">Effective Data Visu

Whoa -- there are many more links on this page than just today's events!

### Find by attribute

In order to drill down to just the links we are interested in, notice that today's events are contained within a `div` that has `id=today`.  We can first isolate this `div` by searching for it by its `id`.

In [89]:
today_div = soup.find(id='today')

today_div

<div class="events" id="today">
<h2>A Selection of Today's Events</h2>
<p> Room 309, 9:00 am - <a href="https://us.pycon.org/2020/schedule/presentation/50/">Foundations of Numerical Computing in Python</a></p>
<p> Room 315, 9:00 am - <a href="https://us.pycon.org/2020/schedule/presentation/72/">It's Officially Legal so Let's Scrape the Web</a></p>
<p> Room 317, 1:20 pm - <a href="https://us.pycon.org/2020/schedule/presentation/54/">A Beginner's Guide to Befriending Python</a></p>
<p> Room 318, 1:20 pm -<a href="https://us.pycon.org/2020/schedule/presentation/55/">Scalable Computing with Dask</a></p>
</div>

In [90]:
type(today_div)

bs4.element.Tag

Now we will look for all of the anchor tags that are contained within this division.

In [91]:
today_div.find_all('a')

[<a href="https://us.pycon.org/2020/schedule/presentation/50/">Foundations of Numerical Computing in Python</a>,
 <a href="https://us.pycon.org/2020/schedule/presentation/72/">It's Officially Legal so Let's Scrape the Web</a>,
 <a href="https://us.pycon.org/2020/schedule/presentation/54/">A Beginner's Guide to Befriending Python</a>,
 <a href="https://us.pycon.org/2020/schedule/presentation/55/">Scalable Computing with Dask</a>]
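The same two-step search can also be written as a single CSS selector with `select()`. This is an alternative that the workshop itself does not use, sketched here on a trimmed-down stand-in for the PyCon page (two events copied from the output above, with the room text shortened) so that it runs on its own.

```python
from bs4 import BeautifulSoup

# Trimmed-down stand-in for the PyCon page
html = """
<div class="events" id="today">
  <p>Room 309 - <a href="https://us.pycon.org/2020/schedule/presentation/50/">Foundations of Numerical Computing in Python</a></p>
</div>
<div class="events" id="tomorrow">
  <p>Room 316 - <a href="https://us.pycon.org/2020/schedule/presentation/63/">Creating a Great Python Package</a></p>
</div>
"""
demo_soup = BeautifulSoup(html, 'html.parser')

# Two steps: isolate the div, then search inside it
two_step = demo_soup.find(id='today').find_all('a')

# One step: "#today a" means any <a> inside the element with id="today"
one_step = demo_soup.select('#today a')

print([a.text for a in one_step])
```

Both approaches return the same anchors here; which one reads better is largely a matter of taste.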

**Tip**:  You can find elements by pretty much any attribute.  Let's find the elements that are members of the `events` class.

In [92]:
soup.find_all(class_ = 'events')

[<div class="events" id="today">
 <h2>A Selection of Today's Events</h2>
 <p> Room 309, 9:00 am - <a href="https://us.pycon.org/2020/schedule/presentation/50/">Foundations of Numerical Computing in Python</a></p>
 <p> Room 315, 9:00 am - <a href="https://us.pycon.org/2020/schedule/presentation/72/">It's Officially Legal so Let's Scrape the Web</a></p>
 <p> Room 317, 1:20 pm - <a href="https://us.pycon.org/2020/schedule/presentation/54/">A Beginner's Guide to Befriending Python</a></p>
 <p> Room 318, 1:20 pm -<a href="https://us.pycon.org/2020/schedule/presentation/55/">Scalable Computing with Dask</a></p>
 </div>,
 <div class="events" id="tomorrow">
 <h2>Coming Up Tomorrow</h2>
 <p> Room 316, 9:00 am - <a href="https://us.pycon.org/2020/schedule/presentation/63/">Creating a Great Python Package</a></p>
 <p> Room 319, 9:00 am - <a href="https://us.pycon.org/2020/schedule/presentation/45/">Minimum Viable Documentation</a></p>
 <p> Room 309, 1:20 pm - <a href="https://us.pycon.org/2020/sc

Passing a dictionary of attributes works as well.

In [93]:
soup.find_all(attrs={'class':'events', 'id': 'tomorrow'}) 

[<div class="events" id="tomorrow">
 <h2>Coming Up Tomorrow</h2>
 <p> Room 316, 9:00 am - <a href="https://us.pycon.org/2020/schedule/presentation/63/">Creating a Great Python Package</a></p>
 <p> Room 319, 9:00 am - <a href="https://us.pycon.org/2020/schedule/presentation/45/">Minimum Viable Documentation</a></p>
 <p> Room 309, 1:20 pm - <a href="https://us.pycon.org/2020/schedule/presentation/74/">Effective Data Visualization</a>
 </p></div>]
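The dictionary form is especially handy for attribute names that are not valid Python keyword arguments. The sketch below assumes a hypothetical `data-room` attribute, which does not appear in the PyCon page and is invented purely for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the data-room attribute is invented for this example
html = """
<p data-room="309"><a href="/talk/50/">Numerical Computing</a></p>
<p data-room="315"><a href="/talk/72/">Scrape the Web</a></p>
<p>No room assigned yet</p>
"""
demo_soup = BeautifulSoup(html, 'html.parser')

# data-room contains a hyphen, so it cannot be a keyword argument; use attrs
room_315 = demo_soup.find_all(attrs={'data-room': '315'})

# Passing True matches any element that has the attribute at all
with_room = demo_soup.find_all(attrs={'data-room': True})

print(room_315[0].a.text)
print(len(with_room))
```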

### Retrieve attributes

If we want to just get the names of today's events, we can simply cycle through today's links and collect the `.text`.

In [94]:
today_text = [link.text for link in today_div.find_all('a')]

today_text

['Foundations of Numerical Computing in Python',
 "It's Officially Legal so Let's Scrape the Web",
 "A Beginner's Guide to Befriending Python",
 'Scalable Computing with Dask']

But what would we do if we wanted the **hyperlinks** to each of those events?

`BeautifulSoup` allows you to retrieve element attributes.  You reference them with the same square-bracket syntax you would use to look up a dictionary key.

In [95]:
today_div.find('a')

<a href="https://us.pycon.org/2020/schedule/presentation/50/">Foundations of Numerical Computing in Python</a>

In [96]:
today_div.find('a')['href']

'https://us.pycon.org/2020/schedule/presentation/50/'

In [97]:
type(today_div.find('a')['href'])

str

In [98]:
today_links = [link['href'] for link in today_div.find_all('a')]

today_links

['https://us.pycon.org/2020/schedule/presentation/50/',
 'https://us.pycon.org/2020/schedule/presentation/72/',
 'https://us.pycon.org/2020/schedule/presentation/54/',
 'https://us.pycon.org/2020/schedule/presentation/55/']
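One caveat with the dictionary-style lookup is that it raises a `KeyError` when an element lacks the attribute. The sketch below uses a made-up pair of anchors, one without an `href`, to show the safer `.get()` lookup and a way to skip anchors that have no link.

```python
from bs4 import BeautifulSoup

html = '<a href="https://us.pycon.org/2020/about/">About</a><a>No link here</a>'
demo_soup = BeautifulSoup(html, 'html.parser')
with_href, without_href = demo_soup.find_all('a')

print(with_href.attrs)           # all attributes as a dictionary
print(without_href.get('href'))  # .get() returns None instead of raising

try:
    without_href['href']         # square brackets raise KeyError when missing
except KeyError:
    print('no href on this anchor')

# Keep only anchors that actually carry an href
links = [a['href'] for a in demo_soup.find_all('a', href=True)]
print(links)
```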

### Exercises

**Exercise 4 - Tomorrow's event tuples** 
> Create a list of tuples for each of tomorrow's events.  The first element in your tuples will be the event title and the second will be the event link.

In [105]:
tomorrow_tuples = [(link.text, link['href']) for link in soup.find(id='tomorrow').find_all('a')]

In [106]:
tomorrow_tuples

[('Creating a Great Python Package',
  'https://us.pycon.org/2020/schedule/presentation/63/'),
 ('Minimum Viable Documentation',
  'https://us.pycon.org/2020/schedule/presentation/45/'),
 ('Effective Data Visualization',
  'https://us.pycon.org/2020/schedule/presentation/74/')]

**Exercise 5 - Finding the event headers** 
> Using `pycon_html` find the header text for today's and tomorrow's events by referencing the `events` class.

In [109]:
event_days = soup.find_all(class_ = 'events')
print(event_days)

[<div class="events" id="today">
<h2>A Selection of Today's Events</h2>
<p> Room 309, 9:00 am - <a href="https://us.pycon.org/2020/schedule/presentation/50/">Foundations of Numerical Computing in Python</a></p>
<p> Room 315, 9:00 am - <a href="https://us.pycon.org/2020/schedule/presentation/72/">It's Officially Legal so Let's Scrape the Web</a></p>
<p> Room 317, 1:20 pm - <a href="https://us.pycon.org/2020/schedule/presentation/54/">A Beginner's Guide to Befriending Python</a></p>
<p> Room 318, 1:20 pm -<a href="https://us.pycon.org/2020/schedule/presentation/55/">Scalable Computing with Dask</a></p>
</div>, <div class="events" id="tomorrow">
<h2>Coming Up Tomorrow</h2>
<p> Room 316, 9:00 am - <a href="https://us.pycon.org/2020/schedule/presentation/63/">Creating a Great Python Package</a></p>
<p> Room 319, 9:00 am - <a href="https://us.pycon.org/2020/schedule/presentation/45/">Minimum Viable Documentation</a></p>
<p> Room 309, 1:20 pm - <a href="https://us.pycon.org/2020/schedule/pres

In [117]:
headers = [day.find('h2').text for day in event_days]

In [118]:
headers

["A Selection of Today's Events", 'Coming Up Tomorrow']
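Exercises 4 and 5 can be combined into one pass that maps each day's header to its list of `(title, link)` tuples. The following is only one possible way to organize the data, sketched on a trimmed-down stand-in for the PyCon page so that it runs on its own. The names `demo_soup` and `schedule` are illustrative.

```python
from bs4 import BeautifulSoup

# Trimmed-down stand-in for the PyCon page, using two of its events
html = """
<div class="events" id="today">
  <h2>A Selection of Today's Events</h2>
  <p><a href="https://us.pycon.org/2020/schedule/presentation/55/">Scalable Computing with Dask</a></p>
</div>
<div class="events" id="tomorrow">
  <h2>Coming Up Tomorrow</h2>
  <p><a href="https://us.pycon.org/2020/schedule/presentation/45/">Minimum Viable Documentation</a></p>
</div>
"""
demo_soup = BeautifulSoup(html, 'html.parser')

schedule = {}
for day in demo_soup.find_all(class_='events'):
    header = day.find('h2').text
    # Search inside this day's div only, so events stay grouped by day
    schedule[header] = [(a.text, a['href']) for a in day.find_all('a')]

print(schedule)
```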

---

# Solutions

  **Exercise 0: Quick HTML Review**
> What tags do we see on this page? <br>
`html`, `head`, `style`, `body`, `div`, `h1`, `ul` (unordered list), `li` (list item), `br` (line break)

> What attributes? <br>
`style` for the `div` container

> What's the inner HTML text of the header? <br>
"Today's Learning Objectives"

 **Exercise 1 - Finding the header**

> Parse `workshop_html` with `BeautifulSoup`.  Find the main header text (`h1`) and save it in a variable.  Verify that you have the text by checking the `type` of your variable.

In [None]:
soup = bs(workshop_html)

In [None]:
header = soup.find('h1').text

print(header)

In [None]:
type(header)

 **Exercise 2 - Finding the paragraphs**

Now find all the paragraphs in `workshop_html` and print out the text that you find.

In [None]:
soup.find_all('p')

In [None]:
for paragraph in soup.find_all('p'):
  print(paragraph.text)

 **BONUS: Exercise 3 - Finding the agenda items**

Create a list of all of the agenda items for today's workshop.  Be sure to store only the TEXT for the AGENDA items!

In [None]:
agenda_items = [li.text for li in soup.find_all('li')[:3]]

print(agenda_items)

In [None]:
#Later we will learn a better way: 
#  First look for the div that contains the agenda items

agenda_div = soup.find('div', id='agenda')

agenda_items = [li.text for li in agenda_div.find_all('li')]

print(agenda_items)

 **Exercise 4 - Tomorrow's event tuples** 
> Create a list of tuples for each of tomorrow's events.  The first element in your tuples will be the event title and the second will be the event link.

In [None]:
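# `soup` was reassigned to the workshop HTML in Exercise 1, so re-parse the PyCon page
soup = bs(pycon_html)
tomorrow_tuples = [(a.text, a['href']) for a in soup.find(id='tomorrow').find_all('a')]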

tomorrow_tuples

 **Exercise 5 - Finding the event headers** 
> Using `pycon_html` find the header text for today's and tomorrow's events by referencing the `events` class.

In [None]:
event_headers = [div.find('h2') for div in soup.find_all(class_='events')]

In [None]:
event_header_text = [header.text for header in event_headers]

event_header_text