# Web Scraping with Beautiful Soup

* * * 

<div class="alert alert-success">  
    
### Learning Objectives 
    
* Understand when and when not to resort to web scraping.
* Become confident in using BeautifulSoup as a tool for web scraping.
* Understand the difference between tags, attributes, and attribute values.
* Use BeautifulSoup on a real-world website.
</div>


### Icons used in this notebook
🔔 **Question**: A quick question to help you understand what's going on.<br>
🥊 **Challenge**: Interactive exercise. We'll work through these in the workshop!<br>
⚠️ **Warning**: Heads-up about tricky stuff or common mistakes.<br>
💡 **Tip**: How to do something a bit more efficiently or effectively.<br>
🎬 **Demo**: Showing off something more advanced – so you know what Python can be used for!<br>

### Sections
1. [To Scrape Or Not To Scrape](#when)
2. [Installation](#install)
3. [BeautifulSoup: A Quick Example](#ex)
4. [Our Data](#data)
5. [Extracting and Parsing HTML](#extract)
6. [Scraping the Illinois General Assembly](#scrape)

<a id='when'></a>

# To Scrape Or Not To Scrape

When we'd like to access data from the web, we should first check whether the website we are interested in offers a Web API. Platforms like Twitter, Reddit, and the New York Times offer APIs. **Check out D-Lab's [Python Web APIs](https://github.com/dlab-berkeley/Python-Web-APIs) workshop if you want to learn how to use APIs.**

However, there are often cases when a Web API does not exist. In these cases, we may have to resort to web scraping, where we download the underlying HTML of a web page and extract the information we want directly from it. There are several packages in Python we can use to accomplish these tasks. We'll focus on two packages: Requests and Beautiful Soup.

<a id='install'></a>

# Installation

We will use two main packages: [Requests](http://docs.python-requests.org/en/latest/user/quickstart/) and [Beautiful Soup](http://www.crummy.com/software/BeautifulSoup/bs4/doc/). Go ahead and install these packages, if you haven't already:

In [None]:
%pip install requests

In [None]:
%pip install beautifulsoup4

We'll also install the `lxml` package, which helps support some of the parsing that Beautiful Soup performs:

In [None]:
%pip install lxml

In [None]:
# Import required libraries
from bs4 import BeautifulSoup
from datetime import datetime
import requests
import time

<a id='ex'></a>

# BeautifulSoup: A Quick Example

Let's consider a simple HTML structure:

In [None]:
html_content = """<html>
    <head>
        <title>Sample Page</title>
        <meta name="description" content="This is a sample page for BeautifulSoup explanation.">
    </head>
    <body>
        <div class="container" id="main-container">
            <h1 class="header">Welcome to the Sample Page</h1>
            <p class="text" style="color: blue;">First paragraph.</p>
            <p class="text" data-info="example">Second paragraph.</p>
            <a href="https://www.example.com" class="link">Visit Example</a>
            <div class="nested">
                <p class="text">Nested paragraph.</p>
                <img src="sample.jpg" alt="Sample Image" class="image">
            </div>
        </div>
    </body>
</html>"""


We can call `BeautifulSoup` on this `html_content`. This will return an object (called a **soup object**) which contains all of the HTML in the original document.

In [None]:
soup = BeautifulSoup(html_content, 'html.parser')

In [None]:
# Let's have a look
print(soup.prettify())

💡 **Tip:** `.prettify()` is a really useful method that re-indents the HTML, placing each tag on its own line according to how deeply it is nested. This makes it a lot more readable!

The output looks pretty similar to the original, but now it's organized in a `soup` object that allows us to more easily traverse the HTML.

## `find_all`

Let's search through this HTML using `BeautifulSoup`. We will search for ALL `p` tags in the HTML:

In [None]:
paragraphs = soup.find_all('p')
for para in paragraphs:
    print(para)

There are a lot of methods we can use to get more specific data (such as the text content itself), but this is the basic functionality of `BeautifulSoup`. Let's now look at a real-world example.
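As a small illustration of getting at the text content itself, here is a minimal sketch that uses `get_text()`. It works on a trimmed-down copy of the sample page (the names `snippet` and `snippet_soup` are only used here) so that it runs on its own.

```python
from bs4 import BeautifulSoup

# A trimmed-down copy of the sample page, so this cell runs on its own
snippet = """
<div class="container">
    <p class="text">First paragraph.</p>
    <p class="text" data-info="example">Second paragraph.</p>
</div>
"""

snippet_soup = BeautifulSoup(snippet, "html.parser")

# get_text() returns only the text inside a tag, without the markup
paragraph_texts = [p.get_text() for p in snippet_soup.find_all("p")]
print(paragraph_texts)
```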

## 🥊 Challenge 1: Find h1

We can also use `find()` to find the first available tag in this HTML. Use it to find the `h1` tag in the soup!


In [None]:
# YOUR CODE HERE
soup.find('h1')
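The two search methods differ in what they hand back, which matters once a tag is missing from the page. The sketch below uses a tiny made-up document (`demo_soup` is an illustrative name) to show the difference.

```python
from bs4 import BeautifulSoup

demo_soup = BeautifulSoup("<h1>Title</h1><p>One</p><p>Two</p>", "html.parser")

first_p = demo_soup.find("p")        # a single Tag: the first match
all_p = demo_soup.find_all("p")      # a list-like collection of every match
missing = demo_soup.find("table")    # None, because there is no <table>
empty = demo_soup.find_all("table")  # an empty collection, not None

print(first_p.get_text(), len(all_p), missing, len(empty))
```

Because `find()` can return `None`, it is worth checking the result before calling methods on it.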


<a id='data'></a>
# Our Data

Our case study will be scraping information on the [state senators of Illinois](http://www.ilga.gov/senate).

**Let's open this website to take a look at its structure!**

Here's what happens if you click "Inspect" in your browser:

<img src="../img/inspect.png" alt="inspect in browser" width="700"/>

On the right-hand side, you see the HTML that makes up the website. To the right of that is the CSS linked to those elements.

Right-clicking on any part of the webpage and selecting "Inspect" will automatically show you the part of the HTML that you are highlighting.

💡 **Tip**: If you want to see the full HTML code, you can right-click on the webpage and select "View Page Source".


<a id='extract'></a>

# Extracting and Parsing HTML 

In order to successfully scrape and analyse HTML, we'll be going through the following 4 steps:
1. Make a GET request
2. Parse the page with `BeautifulSoup`
3. Search for HTML elements
4. Get attributes and text of these elements

## Step 1: Make a GET Request to Obtain a Page's HTML

We can use the Requests library to:

1. Make a GET request to the page, and
2. Read in the webpage's HTML code.

The process of making a request and obtaining a result resembles that of the Web API workflow. Now, however, we're making a request directly to the website, and we're going to have to parse the HTML ourselves. This is in contrast to being provided data organized into a more straightforward `JSON` or `XML` output.

In [None]:
# Make a GET request
req = requests.get('http://www.ilga.gov/senate/default.asp')
# Read the content of the server’s response
src = req.text
# View some output
print(src[:1000])
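Requests can fail (the page may have moved, or the server may be slow), so a slightly more defensive version of this step can be useful. The helper below is a sketch, not part of the original workflow: `fetch_html` is a hypothetical name, and the 10-second timeout is an arbitrary assumption.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the HTML of `url` as text, raising an error if the request fails."""
    # timeout stops the call from hanging forever on an unresponsive server
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses
    response.raise_for_status()
    return response.text
```

It could then be called as `src = fetch_html('http://www.ilga.gov/senate/default.asp')`.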

## Step 2: Parse the Page with `BeautifulSoup`

Now, we use the `BeautifulSoup` constructor to parse the response into an HTML tree. This returns a **soup object** which contains all of the HTML in the original document.

⚠️ **Warning**: If you run into an error about a parser library, make sure you've installed the `lxml` package to provide Beautiful Soup with the necessary parsing tools.

In [None]:
# Parse the response into an HTML tree
soup = BeautifulSoup(src, 'lxml')
# Take a look
print(soup.prettify()[:1000])

## Step 3: Search for HTML Elements

Beautiful Soup has a number of functions to find useful components on a page. Beautiful Soup lets you find elements by their:

1. HTML Tags
2. HTML Attributes
3. CSS Selectors

Let's search first for **HTML tags**, like we did before. 

The function `find_all` searches the `soup` tree for every element with a particular HTML tag and returns all of those elements.

🔔 **Question**: What does the example below do?

In [None]:
# Find all elements with a certain tag
a_tags = soup.find_all("a")
print(a_tags[:10])

How many links did we obtain?

In [None]:
print(len(a_tags))

That's a lot! Many elements on a page will have the same HTML tag. For instance, if you search for everything with the `a` tag, you're likely to get many hits, most of which you might not want. Remember, the `a` tag defines a hyperlink, so you'll usually find many on any given page.

What if we wanted to search for HTML tags with certain attributes, such as particular CSS classes? 

We can do this by passing an additional argument to `find_all`. In the example below, we find all the `a` tags and keep only those with `class_="sidemenu"` (the trailing underscore is needed because `class` is a reserved word in Python). That means we'll only get the `a` tags whose `class` attribute has the value `sidemenu`.

In [None]:
# Get only the 'a' tags in 'sidemenu' class
# (calling soup(...) directly is shorthand for soup.find_all(...))
side_menus = soup.find_all("a", class_="sidemenu")
side_menus[:5]
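One detail worth knowing: an element can carry several classes at once, and `class_` matches if any one of them equals the value you pass. The sketch below uses a small made-up menu (the `href` values are placeholders) and also shows the equivalent `attrs` dictionary form.

```python
from bs4 import BeautifulSoup

# Made-up menu; the second link has two classes
menu_html = """
<a class="sidemenu" href="/a">A</a>
<a class="sidemenu active" href="/b">B</a>
<a class="mainmenu" href="/c">C</a>
"""
menu_soup = BeautifulSoup(menu_html, "html.parser")

# class_ matches any one of an element's classes, so "sidemenu active" counts
by_keyword = menu_soup.find_all("a", class_="sidemenu")
# The attrs dictionary is another way to write the same filter
by_attrs = menu_soup.find_all("a", attrs={"class": "sidemenu"})

print([a["href"] for a in by_keyword])
print([a["href"] for a in by_attrs])
```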

## `find_all` and `select`

Another way to search for elements on a website is via a **CSS selector**. This method is particularly useful when you're familiar with CSS and want to leverage that knowledge to navigate and search through the document.

For this we can use a method called `select()`. You pass a CSS selector as a string to `.select()`, and it returns all elements that match that selector.

For instance, we can use `"a.sidemenu"` as a CSS selector, which returns all `a` tags with class `sidemenu`--just like we did above!

In [None]:
# Get elements with "a.sidemenu" CSS Selector.
selected = soup.select("a.sidemenu")
selected[:5]
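CSS selectors can express more than "tag plus class". The following sketch, run on a small made-up page modelled on the earlier sample HTML, shows a descendant selector, an attribute selector, and `select_one()`, which returns a single element instead of a list.

```python
from bs4 import BeautifulSoup

page = """
<div id="main-container">
    <a class="link" href="https://www.example.com">Visit Example</a>
    <div class="nested">
        <p class="text">Nested paragraph.</p>
        <a class="link">No destination</a>
    </div>
</div>
"""
page_soup = BeautifulSoup(page, "html.parser")

# "div.nested p": any <p> located somewhere inside <div class="nested">
nested_paragraphs = page_soup.select("div.nested p")
# "a[href]": only the <a> tags that actually carry an href attribute
links_with_href = page_soup.select("a[href]")
# select_one() returns the first match (or None) rather than a list
container = page_soup.select_one("#main-container")

print(len(nested_paragraphs), len(links_with_href), container["id"])
```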

## 🥊 Challenge 2: Find All

Use BeautifulSoup to find all the `a` elements with class `mainmenu`.

In [None]:
# YOUR CODE HERE


## Step 4: Get Text or Attribute Values

Once we identify elements, we want to access the information in them. Usually, we will be interested in the webpage text or in attribute values.

To do this, we first get a tag object. For instance, let's grab the first `a` tag with the `sidemenu` class:

In [None]:
# Get all sidemenu links as a list
side_menu_links = soup.select("a.sidemenu")

# Examine the first link
first_link = side_menu_links[0]
print(first_link)

What we just printed is a Beautiful Soup `Tag` object. It's a little piece of HTML. To recap:
* `<a>` is the element or tag.
* `class` is an attribute.
* `"sidemenu"` is the value of the `class` attribute.
* `href` is another attribute.
* `"/senate/default.asp"` is the value of the `href` attribute.
* `"Members"` is the text content of the `<a>` element.


To get the text of a BeautifulSoup object, we can call a Python attribute called `text`.

In [None]:
print(first_link.text)
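To connect the recap above to code, the sketch below rebuilds the same link from a string (so it runs without a network request) and reads each of its parts. The names `link_soup` and `link` are only used for this illustration.

```python
from bs4 import BeautifulSoup

link_soup = BeautifulSoup(
    '<a class="sidemenu" href="/senate/default.asp">Members</a>', "html.parser"
)
link = link_soup.find("a")

print(link.name)        # the tag name
print(link.attrs)       # all attributes, as a dictionary
print(link["class"])    # class can hold several values, so it comes back as a list
print(link.get_text())  # the text content
```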

## Getting URLs

Sometimes we want the value of certain attributes. This is particularly relevant for `a` tags, or links, where the `href` attribute tells us where the link goes.

💡 **Tip**: You can access a tag’s attributes by treating the tag like a dictionary:

In [None]:
print(first_link['href'])
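Two practical wrinkles come up with `href` values: some `a` tags have no `href` at all, and many links are relative to the site root. The sketch below shows one way to handle both, using `.get()` and the standard library's `urljoin`. The second link is made up to illustrate the missing-attribute case.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

base_url = "http://www.ilga.gov/senate/default.asp"
nav_soup = BeautifulSoup(
    '<a class="sidemenu" href="/senate/default.asp">Members</a>'
    '<a class="sidemenu">No link</a>',
    "html.parser",
)

# tag["href"] raises KeyError when the attribute is missing; .get() returns None
hrefs = [tag.get("href") for tag in nav_soup.find_all("a")]
print(hrefs)

# href=True keeps only tags that have the attribute; urljoin makes URLs absolute
absolute_urls = [
    urljoin(base_url, tag["href"]) for tag in nav_soup.find_all("a", href=True)
]
print(absolute_urls)
```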

## 🥊 Challenge 3: Extract specific attributes

Extract the `href` attribute of every `a` tag with class `mainmenu`.

In [None]:
# YOUR CODE HERE


<a id='scrape'></a>

# Scraping the Illinois General Assembly

Believe it or not, those are really the fundamental tools you need to scrape a website. Once you spend more time familiarizing yourself with HTML and CSS, then it's simply a matter of understanding the structure of a particular website and intelligently applying the tools of Beautiful Soup and Python.

Let's apply these skills to scrape the [Illinois 98th General Assembly](http://www.ilga.gov/senate/default.asp?GA=98).

Specifically, our goal is to scrape information on each senator, including their name, district, and party.

## Scrape and Soup the Webpage

Let's scrape and parse the webpage, using the tools we learned in the previous section.

In [None]:
# Make a GET request
req = requests.get('http://www.ilga.gov/senate/default.asp?GA=98')
# Read the content of the server’s response
src = req.text
# Soup it
soup = BeautifulSoup(src, "lxml")

## Search for the Table Elements

Our goal is to obtain the elements in the table on the webpage. Remember: rows are identified by the `tr` tag. Let's use `find_all` to obtain these elements.

In [None]:
# Get all table row elements
rows = soup.find_all("tr")
len(rows)

⚠️ **Warning**: Keep in mind: `find_all` gets *all* the elements with the `tr` tag. We only want some of them. If we use the 'Inspect' function in Google Chrome and look carefully, then we can use some CSS selectors to get just the rows we're interested in. Specifically, we want the inner rows of the table:

In [None]:
# Select every 'tr' that is nested inside two other 'tr' elements
rows = soup.select('tr tr tr')

for row in rows[:5]:
    print(row, '\n')
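The selector `'tr tr tr'` may look odd at first. A space in a CSS selector means "somewhere inside", so it matches rows that sit inside a row that itself sits inside another row, which only happens when tables are nested. The sketch below reproduces that idea on a small made-up nested table.

```python
from bs4 import BeautifulSoup

# Made-up page layout: a table inside a table inside a table
nested_tables = """
<table><tr><td>
  <table><tr><td>
    <table>
      <tr><td class="detail">Inner row 1</td></tr>
      <tr><td class="detail">Inner row 2</td></tr>
    </table>
  </td></tr></table>
</td></tr></table>
"""
table_soup = BeautifulSoup(nested_tables, "html.parser")

all_rows = table_soup.find_all("tr")        # rows at every nesting level
inner_rows = table_soup.select("tr tr tr")  # only rows with two 'tr' ancestors

print(len(all_rows), len(inner_rows))
print([row.get_text(strip=True) for row in inner_rows])
```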

It looks like we want everything after the first two rows. Let's work with a single row to start, and build our loop from there.

In [None]:
example_row = rows[2]
print(example_row.prettify())

Let's break this row down into its component cells/columns using the `select` method with CSS selectors. Looking closely at the HTML, there are a couple of ways we could do this.

* We could identify the cells by their tag `td`.
* We could use the class name `.detail`.
* We could combine both and use the selector `td.detail`.

In [None]:
for cell in example_row.select('td'):
    print(cell)
print()

for cell in example_row.select('.detail'):
    print(cell)
print()

for cell in example_row.select('td.detail'):
    print(cell)
print()

We can confirm that these are all the same.

In [None]:
assert example_row.select('td') == example_row.select('.detail') == example_row.select('td.detail')

Let's use the selector `td.detail` to be as specific as possible.

In [None]:
# Select only those 'td' tags with class 'detail' 
detail_cells = example_row.select('td.detail')
detail_cells

Most of the time, we're interested in the actual **text** of a website, not its tags. Recall that to get the text of an HTML element, we use the `text` member:

In [None]:
# Keep only the text in each of those cells
row_data = [cell.text for cell in detail_cells]

print(row_data)
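Cell text often comes with stray newlines or spaces around it. If that becomes a problem, `get_text(strip=True)` is one option for cleaning it up. The cell below is a made-up example with a placeholder name, not taken from the actual page.

```python
from bs4 import BeautifulSoup

cell_soup = BeautifulSoup(
    '<td class="detail">\n  <a href="#">Example Name</a>\n</td>', "html.parser"
)
cell = cell_soup.find("td")

raw_text = cell.text                    # keeps the surrounding whitespace
clean_text = cell.get_text(strip=True)  # strips whitespace from each piece of text

print(repr(raw_text))
print(repr(clean_text))
```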

Looks good! Now we just use our basic Python knowledge to get the elements of this list that we want. Remember, we want the senator's name, their district, and their party.

In [None]:
print(row_data[0]) # Name
print(row_data[3]) # District
print(row_data[4]) # Party

## Getting Rid of Junk Rows

We saw at the beginning that not all of the rows we got actually correspond to a senator. We'll need to do some cleaning before we can proceed forward. Take a look at some examples:

In [None]:
print('Row 0:\n', rows[0], '\n')
print('Row 1:\n', rows[1], '\n')
print('Last Row:\n', rows[-1])

When we write our for loop, we only want it to apply to the relevant rows. So we'll need to filter out the irrelevant rows. The way to do this is to compare some of these to the rows we do want, see how they differ, and then formulate that in a conditional.

As you can imagine, there are a lot of possible ways to do this, and it'll depend on the website. We'll show some here to give you an idea of how to do this.

In [None]:
# Bad rows
print(len(rows[0]))
print(len(rows[1]))

# Good rows
print(len(rows[2]))
print(len(rows[3]))
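A note on what `len(row)` measures: for a tag, it counts the tag's direct children, and whitespace between child tags counts as a child too. The sketch below shows this on two made-up rows that differ only in whitespace, which is one reason a length check can be fragile.

```python
from bs4 import BeautifulSoup

compact = BeautifulSoup("<tr><td>a</td><td>b</td></tr>", "html.parser").find("tr")
spaced = BeautifulSoup(
    "<tr>\n<td>a</td>\n<td>b</td>\n</tr>", "html.parser"
).find("tr")

print(len(compact))                # only the two <td> children
print(len(spaced))                 # the two <td> children plus three newline strings
print(len(spaced.find_all("td")))  # counting cells directly ignores the whitespace
```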

Perhaps good rows have a length of 5. Let's check:

In [None]:
good_rows = [row for row in rows if len(row) == 5]

# Let's check some rows
print(good_rows[0], '\n')
print(good_rows[-2], '\n')
print(good_rows[-1])

We found a footer row in our list that we'd like to avoid. Let's try something else:

In [None]:
rows[2].select('td.detail') 

In [None]:
# Bad row
print(rows[-1].select('td.detail'), '\n')

# Good row
print(rows[5].select('td.detail'), '\n')

# How about this?
good_rows = [row for row in rows if row.select('td.detail')]

print("Checking rows...\n")
print(good_rows[0], '\n')
print(good_rows[-1])

Looks like we found something that worked!

## Loop it All Together

Now that we've seen how to get the data we want from one row, as well as filter out the rows we don't want, let's put it all together into a loop.

In [None]:
# Define storage list
members = []

# Get rid of junk rows
valid_rows = [row for row in rows if row.select('td.detail')]

# Loop through all rows
for row in valid_rows:
    # Select only those 'td' tags with class 'detail'
    detail_cells = row.select('td.detail')
    # Keep only the text in each of those cells
    row_data = [cell.text for cell in detail_cells]
    # Collect information
    name = row_data[0]
    district = int(row_data[3])
    party = row_data[4]
    # Store in a tuple
    senator = (name, district, party)
    # Append to list
    members.append(senator)

In [None]:
# Should be 61
len(members)

Let's take a look at what we have in `members`.

In [None]:
print(members[:5])
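If you expect to repeat this for several pages, it can help to wrap the steps in a function. The sketch below is one possible arrangement, not the only one: `parse_members` and its `row_selector` parameter are hypothetical names, and the sample table uses made-up values laid out like the columns above (name first, district fourth, party fifth).

```python
from bs4 import BeautifulSoup


def parse_members(html, row_selector="tr"):
    """Return (name, district, party) tuples for rows that have 'td.detail' cells."""
    table_soup = BeautifulSoup(html, "html.parser")
    members = []
    for row in table_soup.select(row_selector):
        cells = [cell.get_text(strip=True) for cell in row.select("td.detail")]
        if len(cells) < 5:
            continue  # skip header, footer, and other junk rows
        members.append((cells[0], int(cells[3]), cells[4]))
    return members


# Hypothetical rows with made-up values
sample_table = """
<table>
  <tr><td class="header">Senator</td><td class="header">District</td></tr>
  <tr><td class="detail">Senator A</td><td class="detail">Bills</td>
      <td class="detail">Committees</td><td class="detail">1</td>
      <td class="detail">D</td></tr>
  <tr><td class="detail">Senator B</td><td class="detail">Bills</td>
      <td class="detail">Committees</td><td class="detail">2</td>
      <td class="detail">R</td></tr>
  <tr><td>Footer</td></tr>
</table>
"""

print(parse_members(sample_table))
```

For the nested tables on the page above, you would pass `row_selector="tr tr tr"` so that only the inner rows are considered.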

<div class="alert alert-success">

## ❗ Key Points

* `BeautifulSoup` creates so-called soup objects from HTML that you can search through.
* The `find_all()` method searches through a soup object for a specified tag and attributes, e.g. `find_all('a', class_='sidemenu')`.
* The `select()` method searches through a soup object using CSS selectors, e.g. `select('a.sidemenu')`.
* Scraping is often a matter of searching through HTML code and, step by step, getting the right subset of information.
</div>