# Introduction to scraping in Python

## Setup

1. Download this notebook from <span style="font-size:2em">https://is.gd/MCO510Scraping</span> and put it in the same folder as other notebooks for this class so you can follow along.

## Goal

- Build on the concepts from the [Scraping 101](https://docs.google.com/presentation/d/1W4esHT93tKQVUHJLhyVM8f2je-DLWKu-MtzZ7-I6hi8/edit) lecture
- Think about how to plan and structure a scraping project
- Learn some skills in Python that can help us start to write scrapers

## Assumptions

- Familiarity with HTML
- Familiarity with CSS selectors
- This course hasn't really covered Python flow control so far

If any of these isn't true for you, that's OK. We'll deal.

## Notebook conventions

### Emoji garden

These are emojis that we'll use in this notebook to draw attention to certain concepts.

⚠️: Take extra caution or pay extra attention to this!

### Identifying cells

We'll start code cells with a comment with a short slug. That way, if anyone has a problem, it's easy to be on the same page about where in the notebook they're working.

In [1]:
# example-slug

# This cell doesn't do anything, but it shows the convention we'll use for identifying cells.

### Javelinas

These are placeholders to hide a solution when you should try some code or think about a question before you read the solution. Experiment, play around with some Python code, or collect your thoughts before scrolling down.

Now let's scrape some HTML like this javelina is scraping out a place to rest.

![](http://www.javelinahunter.com/images/sj2.jpg)

## ⚠️ Our first warning: we're scraping a real page

For this tutorial, I decided to talk about scraping a real website that has some structural issues that make it subtly tricky to scrape. With that in mind, we won't have the satisfaction of ending up with a running scraper, but we will encounter some things that will prepare us for some of the challenges we may find when scraping other sites.

## Libraries

This is the reusable code from others that we're bringing into our notebook to help us scrape HTML.

- [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/): Used to parse and extract information from HTML documents.

Beautiful Soup should already have been installed along with Anaconda, but we should double-check. This command should show information about the installed version.

In [None]:
# check-bs4-installed

# Check whether the Beautiful Soup package is installed

!conda list beautifulsoup4

# packages in environment at C:\Users\ghing\anaconda3:
#
# Name                    Version                   Build  Channel
beautifulsoup4            4.10.0             pyh06a4308_0  


If the above command didn't show output, you may have to install the Beautiful Soup package with this command:

In [None]:
# install-bs4

# If that didn't return output, we'll have to run this to install it

!conda install -c anaconda beautifulsoup4

## Why would we write a scraper in Python?

The skills we're going to go over today are some of the fundamental building blocks for writing a scraper in Python.

But since we're just starting out, our code won't be able to do much more than the tools we already learned about. When tools like Google Sheets are so powerful, why wouldn't we just use them?

![](https://www.arizonahighways.com/sites/default/files/1109_javelina_0.jpg)

Scrapers written in Python are good for:

- Scale: When you need to scrape hundreds or thousands of pages.
- Automation: When you need to scrape changing data on the same page regularly
- Reproducibility: Someone else (or future you) can run your scraper and it should work the same way. Other tools might require writing many grafs in a data diary, or taking a bunch of screenshots to document a process.
- Learning: If it takes about the same time to write a program as to copy/paste and clean up or manually enter data by hand, maybe it's worth learning something new that can prepare for when tasks are too big to do manually.

## "Scraping" is actually a few different steps

- Assess, make a plan
- Download the content (in many cases, HTML)
- **Extract elements in the content as structured data.** ⚠️ This is our focus today.
- Clean and/or transform the data

## Why should you break your scraping workflow into these steps?

![](https://cdn.branchcms.com/w5pOxEB39B-1533/images/blog/javelina-and-young.jpg)

- Respect your time!
- Mix and match strategies: automate the boring, time consuming parts; do the parts that are 
- Have an archive in case the pages go away
- Lets you test and make changes without having to make a ton of requests

## Assess, make a plan

- Do I have to scrape?
  - Can I file a records request?
  - Can I just ask for it?
  - Has someone else done this work?
    - Academics
    - Other news orgs
- How many records are there?
- Are they split across multiple pages?
- Can I really get all the data?
- How long would it take to do manually? 
  - Try timing manually entering one page, or 10 records?
- Are there any barriers to scraping (CAPTCHA, login, accepting TOS)?
- Are the URLs predictable?
- Is the data in the HTML or does it come from an API?
- Is the request a GET or a POST?
- Does the format of the data change?
  - Check to see if the pattern or structure of the data changes by
    - time period
    - category
    - jurisdiction
  - More differences means more coding and time!
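
To make the "Are the URLs predictable?" question concrete, here is a minimal sketch of generating page URLs for a site that numbers its result pages. The base URL and the `page` parameter are hypothetical, made up for illustration. On a real site, click through the results and watch what actually changes in the address bar. The sketch uses a `for` loop, which we cover later in this notebook.

```python
# predictable-urls-sketch

from urllib.parse import urlencode

# Hypothetical search URL and query parameter, for illustration only.
BASE_URL = "https://example.com/search"


def build_page_urls(last_page):
    """Return one URL per results page, from page 1 through last_page."""
    urls = []
    for page_number in range(1, last_page + 1):
        # urlencode() turns {"page": 2} into "page=2"
        query = urlencode({"page": page_number})
        urls.append(f"{BASE_URL}?{query}")
    return urls


page_urls = build_page_urls(3)

for url in page_urls:
    print(url)
```

If the URLs don't follow a pattern like this, that's a sign the download step will need more work.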

## Example: Massage Therapist Search Tool

https://directorymassagetherapy.az.gov/dirsearch/

## Download the content

Limitations aside, we'll scrape this data.

We're just going to do this manually for the sake of expedience so we can jump straight into extracting information.

If you wanted to do this with code, try the [requests](https://docs.python-requests.org/en/latest/user/quickstart/#response-content) package.

In your browser's menu: `File` > `Save Page As`

Specify `massage_therapist_results.html` as the file name.

⚠️ As with other data in notebooks, be sure to save this in the same directory as your notebook. Or, be ready to determine the path to the HTML file in your code.
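
If you'd rather do the download step with code, here is a minimal sketch. It uses `urllib.request` from Python's standard library instead of the requests package so that it runs without installing anything. The `download_page` function name is made up for this example. It skips the download when a saved copy already exists, which gives you an archive and keeps you from making the same request over and over while you test.

```python
# download-page-sketch

from pathlib import Path
from urllib.request import urlopen


def download_page(url, path):
    """Save the raw bytes at `url` to `path`, unless we already have a copy."""
    path = Path(path)

    # Reuse the saved copy so we don't request the page again.
    if path.exists():
        return path

    with urlopen(url) as response:
        # Write bytes, not text, so the page's original encoding is preserved.
        path.write_bytes(response.read())

    return path


# Example call, commented out so this cell doesn't make a network request.
# The URL is a placeholder. The real results page may only appear after
# submitting the search form, which this sketch doesn't handle.
# download_page("https://example.com/results", "results.html")
```

To force a fresh download, delete the saved file and call the function again.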

## Open the file in your browser

Use `File` > `Open` in your browser's menu or open a new tab and drag and drop the HTML file, `massage_therapist_results.html` into the tab to open the local copy of the HTML.

### Then open the developer tools

Right click/ctrl+click somewhere in the data we want to scrape and choose `Inspect` from the context menu to open the developer tools with the `Elements` tab open.

Doing this will help us figure out how to select specific pieces of data with Python code.

## Extract elements in the content as structured data

### Hello Beautiful Soup

Let's load the downloaded HTML into a `BeautifulSoup` object, which represents the document as a nested data structure.

First, we have to import the `BeautifulSoup` *class* from the Beautiful Soup *package*, named `bs4`.

In [2]:
# import-bs4

# Import BeautifulSoup which we'll use to help us scrape information from
# HTML documents.

from bs4 import BeautifulSoup

In [3]:
# create-path

# Create a path to the HTML file

from pathlib import Path

# You probably want to do this:
results_html_path = Path("massage_therapist_results.html")

# This code is just to make this work on the instructor's computer, which works a little differently.
if not results_html_path.exists():
    results_html_path = Path.cwd() / "data" / "source" / "massage_therapist_results.html"

Now try to open the HTML file and use it to create an object that represents the nested structure of the document. It's conventional to call the variable for that nested representation `soup`, but this is just a convention. If there's something that makes more sense to you, modify the code to name the variable as you like.

Can anyone describe what this code is doing?

In [4]:
# create-soup-fail

# Try to load the data into a BeautifulSoup object, which represents
# the document as a nested data structure.
#
# This should produce an error.

with open(results_html_path) as f:
    soup = BeautifulSoup(f, "html.parser")

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 35789: invalid start byte

We ran into an error! This is because the file uses an encoding other than the one Python assumed when opening it (UTF-8, according to the error message).

Encodings are a somewhat complicated and annoying concept.

tl;dr: If you get a `UnicodeDecodeError` when reading a text file, try a different encoding. `latin-1` is common.

In [5]:
# create-soup

# Try to load the data into a BeautifulSoup object,
# which represents the document as a nested data structure.
# This time, use the right encoding.

with open(results_html_path, encoding="latin-1") as f:
    # Create the BeautifulSoup object.
    # "html.parser" is the parser that Beautiful Soup should use to
    # break the HTML up into a hierarchy that we can traverse and
    # filter. "html.parser" should work well for many cases, but
    # you might want to consider another parser if the HTML
    # is badly formed.
    soup = BeautifulSoup(f, "html.parser")
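
To see why the first attempt failed, here is a small sketch that decodes the single byte from the error message, `0x96`, three different ways. It also tries `cp1252` (Windows-1252), a close relative of `latin-1`. Whether this particular page was really written in Windows-1252 is a guess, not something the page tells us.

```python
# encoding-sketch

# The byte from the error message above.
raw = b"\x96"

# UTF-8 can't decode this byte on its own, so Python raises an error.
# try/except lets us catch the error instead of stopping the cell.
try:
    raw.decode("utf-8")
    utf8_worked = True
except UnicodeDecodeError:
    utf8_worked = False

# latin-1 maps every possible byte to some character, so it never fails.
# That doesn't guarantee the character is the one the author meant.
as_latin1 = raw.decode("latin-1")

# In Windows-1252 (cp1252), byte 0x96 is an en dash.
as_cp1252 = raw.decode("cp1252")

print(utf8_worked)      # False
print(repr(as_latin1))  # '\x96', an invisible control character
print(repr(as_cp1252))  # '–'
```

If text that was decoded without errors still shows odd or invisible characters, trying a different encoding is a reasonable next step.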

Let's look at the HTML.

`soup.prettify()` returns the HTML with nice indentation, which we can then print. Unfortunately, this doesn't help much with this long document. Good thing we have our developer tools!

In [6]:
# pretty-print-soup

# Pretty print the HTML

print(soup.prettify())

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML+RDFa 1.1//EN">
<html class="js" dir="ltr" lang="en" version="HTML+RDFa 1.1" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/terms/" xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:foaf="http://xmlns.com/foaf/0.1/" xmlns:og="http://ogp.me/ns#" xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#" xmlns:sioc="http://rdfs.org/sioc/ns#" xmlns:sioct="http://rdfs.org/sioc/types#" xmlns:skos="http://www.w3.org/2004/02/skos/core#" xmlns:xsd="http://www.w3.org/2001/XMLSchema#">
 <!-- InstanceBegin template="/Templates/_pages_.dwt.asp" codeOutsideHTMLIsLocked="false" -->
 <head profile="http://www.w3.org/1999/xhtml/vocab">
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <script async="" src="./files/ga.js" type="text/javascript">
  </script>
  <script type="text/javascript">
   window.NREUM||(NREUM={}),__nr_require=function t(n,e,o){function r(a){if(!e[a]){var i=e[a]={exports:{}};n[a][0].call(i

## CSS selectors

Beautiful Soup has its own method of [navigating](https://beautiful-soup-4.readthedocs.io/en/latest/index.html?highlight=select#navigating-the-tree) and [searching](https://beautiful-soup-4.readthedocs.io/en/latest/index.html?highlight=select#searching-the-tree) the HTML document.

However, we're going to use [CSS selectors](https://beautiful-soup-4.readthedocs.io/en/latest/index.html?highlight=select#css-selectors) to specify elements. This is for a few reasons:

- Maybe you already know a little CSS. Or, learning it can be helpful for visualizing data or making web pages/apps.
- Browser developer tools can tell us the selector for a particular element

### `soup.select_one()` gets one element

If there is more than one matching element, it returns the first one. The value returned by the `select_one()` method is a `Tag`.

In [7]:
# get-first-graf

# Get the first paragraph on the page

first_graf = soup.select_one("p")

type(first_graf)

bs4.element.Tag

Displaying a `Tag` object, as you would a string, shows the HTML of the element.

In [8]:
# display-first-graf

# Display the HTML of the first paragraph element.

first_graf

<p>To find a Massage Therapist, use the Massage Therapist Search Tool. Click on search, then use...</p>

### `soup.select()` returns all matching elements as a list

In [9]:
# get-all-grafs

# Get and display all the paragraphs on the page

all_grafs = soup.select("p")

all_grafs

[<p>To find a Massage Therapist, use the Massage Therapist Search Tool. Click on search, then use...</p>,
 <p>If you are applying for a Massage Therapy License in Arizona or if you are Renewing, click on...</p>,
 <p>Anyone claiming to have been harmed by a massage therapist in Arizona may file a complaint...</p>]
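
Here is a small sketch comparing the two methods side by side, including what each returns when nothing matches. The HTML is a tiny made-up document, not the real page, and `mini_soup` is a name chosen for this example so it doesn't overwrite `soup`.

```python
# select-vs-select-one-sketch

from bs4 import BeautifulSoup

# A tiny made-up document, not the real page.
html = "<div><p>First</p><p>Second</p></div>"
mini_soup = BeautifulSoup(html, "html.parser")

first = mini_soup.select_one("p")        # a single Tag
every = mini_soup.select("p")            # a list of Tags
missing = mini_soup.select_one("table")  # None, because nothing matches
no_tables = mini_soup.select("table")    # an empty list, because nothing matches

print(first.get_text())
print(len(every))
print(missing)
print(len(no_tables))
```

Because `select_one()` returns `None` when nothing matches, calling a method such as `.get_text()` on the result raises an `AttributeError`. If you see that error, check your selector first.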

We can use our browser developer tools to help find the CSS selector for a particular element.

In the `Elements` tab of developer tools, right/ctrl+click on a particular element and choose `Copy` > `Copy selector` from the context menu.

Try to use `soup.select_one()` and the selector copied from the developer tools to get the *outer* table that contains the list of massage therapists.

In [10]:
# select-table-entry

# Get the table element for the data that we want to scrape

table = soup.select_one(".replace-this-with-your-selector")

table

![](https://www.gannett-cdn.com/-mm-/9d39186ff6bbd045462cf0211336ac42a105219b/c=0-11-271-164/local/-/media/2018/03/22/Phoenix/Phoenix/636573333519944047-javelina2.PNG?auto=webp&format=pjpg&width=1200)

In [11]:
# select-table

# Get the table element for the data that we want to scrape

# If you used your browser's developer tools, it might look something like this:
table = soup.select_one("#block-system-main > div > div > div > div > div > div.views-field.views-field-body > div > div > center > table")

# If you get stuck, type this simpler selector
#table = soup.select_one(".views-field.views-field-body table")

print(table.prettify())

<table bgcolor="#ACC0EA" border="1" bordercolor="#456198" cellpadding="0" cellspacing="0" width="390">
 <tr>
  <td align="center">
   <font face="Verdana, Arial, Helvetica, sans-serif" size="-1">
    <strong>
     Directory Results
    </strong>
   </font>
   <table bgcolor="#DCE4F1" border="0" cellpadding="4" cellspacing="0" width="378">
    <tr>
     <td bgcolor="#456198" colspan="2" height="1">
     </td>
    </tr>
    <tr>
     <td align="left" bgcolor="#DCE4F1">
      <font face="Verdana, Arial, Helvetica, sans-serif" size="-1">
       <strong>
        Ciuffardi,
                                        Diana
       </strong>
      </font>
     </td>
     <td align="right" bgcolor="#DCE4F1">
      <font face="Verdana, Arial, Helvetica, sans-serif" size="-1">
       Active
                                         #MT-26872
      </font>
     </td>
    </tr>
    <tr>
     <td align="right" bgcolor="#DCE4F1" valign="top">
      <font face="Verdana, Arial, Helvetica, sans-serif" size="

### Tag elements have `select()` and `select_one()` methods too

You may have noticed that there's a table inside a table. We can call `select_one()` on the table object we just grabbed to get the innermost table. In this case, the selector will be relative to the parent element. Also, we can no longer just copy and paste the selector from the browser developer tools, but working on a branch of the big HTML tree makes the selectors easier to figure out.

In [12]:
# select-inner-table

# Get the inner-most table

inner_table = table.select_one("table")

print(inner_table.prettify())

<table bgcolor="#DCE4F1" border="0" cellpadding="4" cellspacing="0" width="378">
 <tr>
  <td bgcolor="#456198" colspan="2" height="1">
  </td>
 </tr>
 <tr>
  <td align="left" bgcolor="#DCE4F1">
   <font face="Verdana, Arial, Helvetica, sans-serif" size="-1">
    <strong>
     Ciuffardi,
                                        Diana
    </strong>
   </font>
  </td>
  <td align="right" bgcolor="#DCE4F1">
   <font face="Verdana, Arial, Helvetica, sans-serif" size="-1">
    Active
                                         #MT-26872
   </font>
  </td>
 </tr>
 <tr>
  <td align="right" bgcolor="#DCE4F1" valign="top">
   <font face="Verdana, Arial, Helvetica, sans-serif" size="-1">
    6420 S Camino Dela Tierra #1305
    <br/>
    Tucson, AZ 85746
   </font>
  </td>
  <td align="right" bgcolor="#DCE4F1" valign="top">
   <font face="Verdana, Arial, Helvetica, sans-serif" size="-1">
    Issued: 4/2/2020
    <br/>
    Expires: 1/29/2023
    <br/>
    <!--Status:&nbsp;Approved-->
   </font>
  </td>
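
The same idea on a smaller scale: a minimal sketch using a simplified, made-up stand-in for the nested tables. The `id` attributes are added here only so we can tell the two tables apart. The real page doesn't have them.

```python
# relative-select-sketch

from bs4 import BeautifulSoup

# A simplified, made-up stand-in for the nested tables on the real page.
html = """
<table id="outer">
  <tr><td>
    <table id="inner">
      <tr><td>Inner cell</td></tr>
    </table>
  </td></tr>
</table>
"""
mini_soup = BeautifulSoup(html, "html.parser")

# Called on the whole document, "table" matches the outer table first.
outer = mini_soup.select_one("table")

# Called on the outer table, the search only looks at elements inside it,
# so the same selector finds the inner table.
inner = outer.select_one("table")

print(outer["id"])
print(inner["id"])
```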

### The `.get_text()` method of a `Tag` lets you access the text of all the children

Let's grab the second row of the table.

You can see that the data is contained in two columns (`<td>` elements).

In [13]:
# select-second-row

# Select and display the second row of the data table.

second_row = inner_table.select("tr")[1]

print(second_row.prettify())

<tr>
 <td align="left" bgcolor="#DCE4F1">
  <font face="Verdana, Arial, Helvetica, sans-serif" size="-1">
   <strong>
    Ciuffardi,
                                        Diana
   </strong>
  </font>
 </td>
 <td align="right" bgcolor="#DCE4F1">
  <font face="Verdana, Arial, Helvetica, sans-serif" size="-1">
   Active
                                         #MT-26872
  </font>
 </td>
</tr>



`second_row.get_text()` will show the text inside all the children of an element, without the HTML.

In [14]:
# display-second-row-text

# Display the text inside all elements

print(second_row.get_text())




                                         Ciuffardi,
                                        Diana 



                                        
                                         Active
                                         #MT-26872
                                    




Try to get the text inside the left-hand table cell `<td>` element, the one that contains the massage therapist's name.

In [15]:
# get-lhs-cell-text-blank

# Get and display the massage therapist's name.

# Try your solution here

![](https://media.hswstatic.com/eyJidWNrZXQiOiJjb250ZW50Lmhzd3N0YXRpYy5jb20iLCJrZXkiOiJnaWZcL2phdmVsaW5hLmpwZyIsImVkaXRzIjp7InJlc2l6ZSI6eyJ3aWR0aCI6ODI4fSwidG9Gb3JtYXQiOiJhdmlmIn19)

In [16]:
# get-lhs-cell-text

# Get and display the massage therapist's name.

print(second_row.select_one("td").get_text())



                                         Ciuffardi,
                                        Diana 
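
The name comes back surrounded by a lot of whitespace, including a line break in the middle. One common way to tidy it up, sketched here on a simplified copy of the cell, is to split the text on whitespace and join the pieces back together with single spaces.

```python
# clean-cell-text-sketch

from bs4 import BeautifulSoup

# A simplified copy of the name cell, keeping its messy whitespace.
html = """
<td align="left">
  <strong>
    Ciuffardi,
                 Diana
  </strong>
</td>
"""
cell = BeautifulSoup(html, "html.parser").select_one("td")

raw_text = cell.get_text()

# .split() with no arguments breaks the text apart on any run of
# whitespace (spaces, tabs, line breaks). " ".join() glues the pieces
# back together with a single space between them.
clean_text = " ".join(raw_text.split())

print(clean_text)
```

A plain `.strip()` would only remove whitespace at the start and end, leaving the line break between the last and first name.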



## Step back: what makes this page tricky to scrape?

Hint: Remember, HTML is hierarchical.

![](https://i.pinimg.com/originals/ca/ea/99/caea991ceb006d69990be6e02ece6711.jpg)

Each massage therapist's record isn't contained in a parent element. Instead, each piece of information is in a separate table row, with no structural relationship to related rows.

Also, multiple pieces of information are contained inside the same element.

This isn't an impossible situation, but it requires looking for other flags that we can use to detect where one record or piece of information starts and ends.

## Loops

In this course, you've mostly worked with tabular data using Pandas. In many cases, Pandas allows manipulating data in a table without stepping through it row by row (Python programmers often use the term "iterating" or "iteration" to describe this).

However, for programs that aren't transforming or aggregating tabular data, Python, like most other programming languages, has a feature called loops.

The most frequently used loop in Python is the [for loop](https://realpython.com/python-for-loop/).

It takes the basic form:

```
for <var> in <list>:
    # Do something with <var>
    <statement(s)>
```

Here's an example of a really simple for loop that prints Arizona county names:

In [17]:
# basic-for-loop-example

# Iterate over Arizona county names and display them

az_counties = [
    "Apache County",
    "Cochise County",
    "Coconino County",
    "Gila County",
    "Graham County",
    "Greenlee County",
    "La Paz County",
    "Maricopa County",
    "Mohave County",
    "Navajo County",
    "Pima County",
    "Pinal County",
    "Santa Cruz County",
    "Yavapai County",
    "Yuma County",
]

for county in az_counties:
    print(county)

Apache County
Cochise County
Coconino County
Gila County
Graham County
Greenlee County
La Paz County
Maricopa County
Mohave County
Navajo County
Pima County
Pinal County
Santa Cruz County
Yavapai County
Yuma County


## If statements

Another essential flow control feature of Python and most other languages is the conditional: a way to execute code only if some condition is met. Conditionals are written using `if`.

If statements take the form:

```
if <condition>:
    # If the <condition> is met, do something
    <statement(s)>
```

Here's a really basic example where we modify the loop above to only print one county's name.

In [18]:
# basic-for-if-loop-example

# Iterate over Arizona county names and display only Maricopa's

az_counties = [
    "Apache County",
    "Cochise County",
    "Coconino County",
    "Gila County",
    "Graham County",
    "Greenlee County",
    "La Paz County",
    "Maricopa County",
    "Mohave County",
    "Navajo County",
    "Pima County",
    "Pinal County",
    "Santa Cruz County",
    "Yavapai County",
    "Yuma County",
]

for county in az_counties:
    if county == "Maricopa County":
        print(county)

Maricopa County
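
Printing is handy for checking our work, but a scraper usually needs to keep what it finds. A minimal sketch of that pattern: start with an empty list and add matching items to it inside the loop. The shorter `some_counties` list and the "starts with P" condition are just for illustration.

```python
# collect-matches-sketch

# A shorter list than the one above, to keep the example small.
some_counties = ["Gila County", "Maricopa County", "Pima County", "Pinal County"]

# Start with an empty list, then add to it inside the loop.
p_counties = []

for county in some_counties:
    # Only keep counties whose names start with "P"
    if county.startswith("P"):
        p_counties.append(county)

print(p_counties)
```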


## Putting it together

Here's an example of using a `for` loop and an `if` statement to print only the rows that contain the word "Active", which gives us the names and license numbers of active licenses.

In [19]:
# for-if-table-filter

# Only display text for active licenses

for tr in inner_table.select("tr"):
    row_text = tr.get_text().strip()
    
    if "Active" in row_text:
        print(row_text)

Ciuffardi,
                                        Diana 



                                        
                                         Active
                                         #MT-26872
Giessman,
                                        Marina 



                                        
                                         Active
                                         #MT-26969
Han,
                                        Katherine 



                                        
                                         Active
                                         #MT-28071
Malaela,
                                        Karalee 



                                        
                                         Active
                                         #MT-21228
Tovar,
                                        Emily 



                                        
                                         Active
                                         #MT-27882
U
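
Going one step further, here is a sketch of one way to deal with the problem from the "Step back" section: related rows aren't grouped under a parent element. It treats a row containing a bold (`<strong>`) name as the flag that a new record starts, and attaches the following row to that record. The HTML is a simplified stand-in for the real page, and the second therapist is invented for the example. The real page may have cases this doesn't handle, so treat it as a starting point.

```python
# group-rows-into-records-sketch

from bs4 import BeautifulSoup

# Simplified stand-in for the inner table. The second record is made up.
html = """
<table>
  <tr><td colspan="2" height="1"></td></tr>
  <tr><td><strong>Ciuffardi, Diana</strong></td><td>Active #MT-26872</td></tr>
  <tr><td>6420 S Camino Dela Tierra #1305<br/>Tucson, AZ 85746</td>
      <td>Issued: 4/2/2020<br/>Expires: 1/29/2023</td></tr>
  <tr><td colspan="2" height="1"></td></tr>
  <tr><td><strong>Example, Pat</strong></td><td>Expired #MT-00000</td></tr>
  <tr><td>123 Example St<br/>Phoenix, AZ 85001</td>
      <td>Issued: 1/1/2015<br/>Expires: 1/1/2017</td></tr>
</table>
"""
mini_soup = BeautifulSoup(html, "html.parser")

records = []
current = None  # The record we're in the middle of building

for tr in mini_soup.select("tr"):
    cells = tr.select("td")

    # Separator rows have only one cell, so this skips them.
    if len(cells) == 2:
        # separator=" " keeps text on either side of a <br/> from
        # running together. split/join collapses extra whitespace.
        left = " ".join(cells[0].get_text(separator=" ").split())
        right = " ".join(cells[1].get_text(separator=" ").split())

        if tr.select_one("strong") is not None:
            # A bold name is our flag that a new record starts here.
            # Assumes the license number is the last word in the cell.
            status, license_number = right.rsplit(" ", 1)
            current = {
                "name": left,
                "status": status,
                "license": license_number,
            }
            records.append(current)
        elif current is not None:
            # Otherwise, this row belongs to the record we just started.
            current["address"] = left
            current["dates"] = right

for record in records:
    print(record)
```

Each record ends up as a dictionary, and a list of dictionaries like `records` can be passed straight to `pd.DataFrame()` if you want to keep working in Pandas.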

## Next steps

### More scraping learning and practice

- NICAR tipsheets
  - [asuozzo/nicar2022-python-scraping](https://github.com/asuozzo/nicar2022-python-scraping)
  - [asuozzo/nicar2019-scraping](https://github.com/asuozzo/nicar2019-scraping)

### Using a real browser

- [Playwright](https://playwright.dev/python/docs/intro)
- [Selenium](https://selenium-python.readthedocs.io/)

### Really massive scrapers

- [Scrapy](https://scrapy.org/)

### Scraping PDFs

- [pdfplumber](https://github.com/jsvine/pdfplumber)

## References

- [Beautiful Soup Documentation](https://beautiful-soup-4.readthedocs.io/en/latest/)
- [CSS selectors](https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_Selectors)
- [HTML elements reference](https://developer.mozilla.org/en-US/docs/Web/HTML/Element)