# Web scraping

Scraping unstructured information from the web into structured data is a common use of Python in the newsroom. In this session, we're going to use a module called [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) to do most of the heavy lifting.

### Quick overview: HTML

To scrape a web page, you need to _sort of_ understand how web pages are made.

Web pages are written in HTML. HTML elements are represented (_usually_) by a pair of tags -- an opening tag and a closing tag.

A table, for example, starts with `<table>` and ends with `</table>`. The first tag tells the browser: "Hey! I got a table here! Render it as a table." The closing tag (note the forward slash) tells the browser: "Hey! I'm all done with that table, thanks." Inside the table are nested tags representing rows and cells.

(There's more to it, but that's probably good for now.)
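
If the opening-and-closing idea feels abstract, here's a minimal sketch using Python's built-in `html.parser` module (not BeautifulSoup, which we'll get to in a minute). The `TagPrinter` class and the one-row table are made up for illustration; the sketch just records each tag as the parser meets it, indented by how deeply it's nested.

```python
from html.parser import HTMLParser


class TagPrinter(HTMLParser):
    """Record each opening and closing tag, indented by nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        # an opening tag: record it, then go one level deeper
        self.lines.append('  ' * self.depth + '<' + tag + '>')
        self.depth += 1

    def handle_endtag(self, tag):
        # a closing tag: come back up one level, then record it
        self.depth -= 1
        self.lines.append('  ' * self.depth + '</' + tag + '>')


# a made-up, one-row table just for illustration
sample_html = '<table><tr><td>1</td><td>Rain in Soho</td></tr></table>'

printer = TagPrinter()
printer.feed(sample_html)
printer.close()
print('\n'.join(printer.lines))
```

The `td` tags end up indented inside the `tr`, which is indented inside the `table` -- that nesting is the "tree" we'll be climbing around in later.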

### Inspect the source!

If I'm thinking about scraping a page, the first thing I do is look at the HTML code that underpins the page. You can do this right from your browser -- I like to use Chrome but Firefox has some good developer tools, as well. (Maybe IE does too, who knows lol)

To "view source" in Chrome, you'd hit `Ctrl+U` on a PC and `Cmd+Opt+U` on a Mac. It's also in the menu bar: View -> Developer -> View Page Source.

You'll get a page showing you all the HTML code that makes up that page.

### BeautifulSoup

BeautifulSoup turns HTML into data objects that Python can work with, allowing you to walk up and down the HTML tag "tree" and target specific elements on the page.

## Warmup

Before we start operating on a live page, let's practice on a sample file at `../practice-table.html`. Normally, the first step to scrape a web page is to fetch it -- later, we'll use `requests` for this again -- but for this example let's pretend we've already got a local copy.

The Mountain Goats [have a new album out](https://themountaingoats.bandcamp.com/album/goths) (it is good, you should buy it); the HTML we're going to operate on is just a `<table>` showing the track listing.

Let's start by importing the `BeautifulSoup` class from our module, which is called `bs4`.

In [1]:
from bs4 import BeautifulSoup

Next, we're going to open the file and use the `read()` method to read its contents into memory. Then let's print the contents of the file.

In [2]:
with open('../practice-table.html', 'r', encoding='utf-8') as html_file:
    html_code = html_file.read()
    print(html_code)

<html>
<table id="empty-table-to-throw-you-off"></table>
<table class="song-table" id="my-cool-table" style="width: 95%;">
  <thead>
    <tr>
      <th>Track Number</th>
      <th>Song Title</th>
      <th>Duration</th>
      <th>Artist</th>
      <th>Album</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>1</td>
      <td>Rain in Soho</td>
      <td>4:47</td>
      <td>The Mountain Goats</td>
      <td>Goths</td>
    </tr>
    <tr>
      <td>2</td>
      <td>Andrew Eldritch is Moving Back to Leeds</td>
      <td>4:19</td>
      <td>The Mountain Goats</td>
      <td>Goths</td>
    </tr>
    <tr>
      <td>3</td>
      <td>The Grey King and the Silver Flame Attunement</td>
      <td>4:55</td>
      <td>The Mountain Goats</td>
      <td>Goths</td>
    </tr>
    <tr>
      <td>4</td>
      <td>We Do it Different on the West Coast</td>
      <td>5:21</td>
      <td>The Mountain Goats</td>
      <td>Goths</td>
    </tr>
    <tr>
      <td>5</td>
      <td>Unicorn Tolerance</td>
      <

Now let's feed the file contents to a BeautifulSoup object and assign the result to the variable `soup`. You'll likely get a warning unless you also pass `'html.parser'` as the second argument, which tells BeautifulSoup which parser to use. Now print `type(soup)`.

In [3]:
with open('../practice-table.html', 'r', encoding='utf-8') as html_file:
    html_code = html_file.read()
    soup = BeautifulSoup(html_code, 'html.parser')
    print(type(soup))

<class 'bs4.BeautifulSoup'>


Cool. We're locked and loaded. Our string of HTML is now a tree that we can climb through to find the things we want.

There are several ways to isolate the table we want using the `find` or `find_all` methods -- by class, by ID, by position on the page, by style. (There are others.) Let's try:

In [4]:
with open('../practice-table.html', 'r', encoding='utf-8') as html_file:
    html_code = html_file.read()
    soup = BeautifulSoup(html_code, 'html.parser')
    
    # by position on the page
    # find_all returns a list of matching elements, and we want the second ([1]) one
    # song_table = soup.find_all('table')[1]
    
    # by class name
    # => with `find`, you can pass in a dictionary of element attributes to match on
    # song_table = soup.find('table', {'class': 'song-table'})
    
    # by ID
    # song_table = soup.find('table', {'id': 'my-cool-table'})
    
    # by style
    song_table = soup.find('table', {'style': 'width: 95%;'})
    
    print(song_table)

<table class="song-table" id="my-cool-table" style="width: 95%;">
<thead>
<tr>
<th>Track Number</th>
<th>Song Title</th>
<th>Duration</th>
<th>Artist</th>
<th>Album</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>Rain in Soho</td>
<td>4:47</td>
<td>The Mountain Goats</td>
<td>Goths</td>
</tr>
<tr>
<td>2</td>
<td>Andrew Eldritch is Moving Back to Leeds</td>
<td>4:19</td>
<td>The Mountain Goats</td>
<td>Goths</td>
</tr>
<tr>
<td>3</td>
<td>The Grey King and the Silver Flame Attunement</td>
<td>4:55</td>
<td>The Mountain Goats</td>
<td>Goths</td>
</tr>
<tr>
<td>4</td>
<td>We Do it Different on the West Coast</td>
<td>5:21</td>
<td>The Mountain Goats</td>
<td>Goths</td>
</tr>
<tr>
<td>5</td>
<td>Unicorn Tolerance</td>
<td>5:25</td>
<td>The Mountain Goats</td>
<td>Goths</td>
</tr>
<tr>
<td>6</td>
<td>Stench of the Unburied</td>
<td>4:30</td>
<td>The Mountain Goats</td>
<td>Goths</td>
</tr>
<tr>
<td>7</td>
<td>Wear Black</td>
<td>4:11</td>
<td>The Mountain Goats</td>
<td>Goths</td>
</tr>
<tr

We've targeted the correct table. Now what if we wanted to print a list of track numbers and song titles? Look at the structure of the table -- a `table` has rows represented by the tag `tr`, and within each row there are cells represented by `td`. The `find_all()` method, you'll recall, returns a _list_. And we know how to iterate over lists: with a `for` loop.

In [5]:
with open('../practice-table.html', 'r', encoding='utf-8') as html_file:
    html_code = html_file.read()
    soup = BeautifulSoup(html_code, 'html.parser')

    song_table = soup.find('table', {'class': 'song-table'})
    
    table_rows = song_table.find_all('tr')
    
    # let's skip the header row
    # more on list slicing: http://pythoncentral.io/how-to-slice-listsarrays-and-tuples-in-python/
    for row in table_rows[1:]:
        # get a list of cells in the row
        cols = row.find_all('td')
        
        # the track number is in the first ([0]) "column"
        # the `.string` attribute gets the contents of a BeautifulSoup Tag object
        track_number = cols[0].string
        
        # the song title is in the second ([1]) "column"
        song_title = cols[1].string

        print(track_number + '.', song_title)


1. Rain in Soho
2. Andrew Eldritch is Moving Back to Leeds
3. The Grey King and the Silver Flame Attunement
4. We Do it Different on the West Coast
5. Unicorn Tolerance
6. Stench of the Unburied
7. Wear Black
8. Paid in Cocaine
9. Rage of Travers
10. Shelved
11. For the Portuguese Goths Metal Bands
12. Abandoned Flesh


---

Now let's work on a live example.

### Scraping etiquette

**Rule No. 1: Don't hammer their servers.** If feasible, save a copy of the page locally so you only need to fetch it once as you write your script. For me, at least, scraping involves a lot of "try this and see if it works," which can involve running the same script dozens (or hundreds) of times as you refine it. If you're requesting multiple pages, pause for a little bit between requests so you don't overload them.

That's good for now. We can go over other rules as problems arise.
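
Here's one way the "save a local copy" advice could look in code. This is a rough sketch, not the only way to do it: `fetch_cached()` is a made-up helper, and it takes the fetching function as an argument so we can try it out here with a stand-in (`fake_fetch`) instead of hitting a real server.

```python
import tempfile
from pathlib import Path


def fetch_cached(url, cache_path, fetch):
    """Return the HTML for `url`, calling `fetch(url)` only if no local copy exists."""
    cache_file = Path(cache_path)

    # if we've already saved the page, read it from disk and skip the request
    if cache_file.exists():
        return cache_file.read_text(encoding='utf-8')

    # otherwise fetch it once and save a copy for next time
    html = fetch(url)
    cache_file.write_text(html, encoding='utf-8')
    return html


# a stand-in for a real request, so this example never touches the network
calls = []

def fake_fetch(url):
    calls.append(url)
    return '<html><table></table></html>'


with tempfile.TemporaryDirectory() as tmp_dir:
    cache_path = Path(tmp_dir) / 'page.html'
    first = fetch_cached('http://example.com/', cache_path, fake_fetch)
    second = fetch_cached('http://example.com/', cache_path, fake_fetch)

# the stand-in fetcher only ran once; the second call read the saved file
print(len(calls))
```

In a real script, you could pass something like `lambda u: requests.get(u).text` as the `fetch` argument and point `cache_path` at a file next to your script.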

We are going to scrape [this website](http://www.nrc.gov/reactors/operating/list-power-reactor-units.html) in just Four! Easy! Steps!

1. Import libraries
2. Grab the contents of the web page
3. Parse the contents of the web page and target the data table
4. Loop over the table and write the contents to a CSV file

### 1. Import libraries

We'll need `requests` to fetch the page, `bs4` (the BeautifulSoup class, at least) to parse it, and our old friend `csv` to write out to a new file.

In [6]:
import requests
from bs4 import BeautifulSoup
import csv

### 2. Grab the contents of the web page

`requests` has a method called `get()` that -- you guessed it! -- gets a page. The `text` attribute of the result will give you the string of HTML we need to hand off to BeautifulSoup.

In [7]:
url = 'http://www.nrc.gov/reactors/operating/list-power-reactor-units.html'
web_page = requests.get(url).text

### 3. Parse the contents, target the data table

Lucky for us, there's only one `<table>` on the page. How do I know this? I viewed the source code and `Ctrl+F`'d for `<table`. So we can use the `find()` method to get it once we make soup out of that string of HTML.

In [8]:
soup = BeautifulSoup(web_page, 'html.parser')
table = soup.find('table')

### 4. Loop over the table, write the contents to a CSV

Here's where the action is:

- Write a function, `cleanString()`, that strips whitespace and garbage characters that throw off Windows
- In a `with` block, open a file to write to -- call it `reactors.csv` or something -- in `w` mode
- Define your list of headers
- Create a writer or DictWriter object
- Write headers to the file
- Loop over the rows in the table we targeted, extracting the data and writing to the file
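
As a quick refresher on the `DictWriter` part of those steps, here's a small sketch with made-up rows and only two columns. It writes to an `io.StringIO` object, which behaves like a file but lives in memory, so you can see the CSV text without creating a file.

```python
import csv
import io

# made-up rows standing in for data pulled out of a table
rows = [
    {'name': 'Example Unit 1', 'nrc_region': '4'},
    {'name': 'Example Unit 2', 'nrc_region': '1'},
]

# an in-memory "file" -- with a real file you'd use open(..., 'w', newline='')
outfile = io.StringIO()

# the fieldnames list sets the column order and the header row
writer = csv.DictWriter(outfile, fieldnames=['name', 'nrc_region'])
writer.writeheader()

for row in rows:
    writer.writerow(row)

csv_text = outfile.getvalue()
print(csv_text)
```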

The first cell in each row is kind of tricky: It has a `<br>` tag to break the line, so you'll need to use the [`contents`](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#contents-and-children) attribute to get a tag list, then target the items in that list. You can access the `'href'` attribute of the link in the same way that you'd get a value from a dictionary -- with bracket notation.
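
To see what `contents` gives you, here's a small sketch on a made-up cell shaped like the one just described: a link, a `<br>`, then some text. The name, link path and ID number are placeholders, not real values from the page.

```python
from bs4 import BeautifulSoup

# a made-up cell: a link, a <br>, then an ID number
cell_html = '<td><a href="/example/unit-1.html">Example Unit 1</a><br>05000000</td>'

cell = BeautifulSoup(cell_html, 'html.parser').find('td')

# `contents` is a list of the cell's direct children: [link, <br>, text]
parts = cell.contents

# the first item is the link -- grab its text and its href
link = parts[0]
name = link.string
href = link['href']

# the second item is the <br> tag, so the ID is the third item
reactor_id = parts[2].string

print(name, href, reactor_id)
```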

In [9]:
def cleanString(string_in):
    
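    # line breaks, non-breaking spaces and other stray characters to strip out
    # ('\x0A' is the same character as '\n', so it only needs to be listed once;
    # the backslash in '\\D0A' is doubled so Python reads it as a literal backslash)
    chars_to_replace = ['\n', '\r', '\xa0', '\xc2', '\\D0A']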

    if not string_in:
        return ''
    else:
        for char in chars_to_replace:
            string_in = string_in.replace(char, '')
            
        return string_in.strip()

# newline='' lets the csv module handle line endings (avoids blank rows on Windows)
with open('reactors.csv', 'w', encoding='utf-8', newline='') as outfile:
    
    headers = ['name', 'url', 'reactor_id', 'license_no',
               'reactor_type', 'location', 'owner_operator', 'nrc_region']

    writer = csv.DictWriter(outfile, fieldnames=headers)
    
    writer.writeheader()
    
    # loop over the rows in the table
    for row in table.find_all('tr')[1:]:
        
        # get a list of cells in the row
        cols = row.find_all('td')

        # burst this cell into a list with `contents`
        name_cell_contents = cols[0].contents
        
        # the first item is the name
        name = cleanString(name_cell_contents[0].string)
        
        # the name is wrapped in a link -- let's grab the href
        # and prepend the domain
        url = 'https://www.nrc.gov' + name_cell_contents[0]['href']
        
        # the reactor ID is the third thing in the list
        reactor_id = cleanString(name_cell_contents[2].string)

        license_number = cleanString(cols[1].string)
        reactor_type = cleanString(cols[2].string)
        location = cleanString(cols[3].string)
        owner_operator = cleanString(cols[4].string)
        nrc_region = cleanString(cols[5].string)
        
        # write out to file
        writer.writerow({
            'name': name,
            'url': url,
            'reactor_id': reactor_id,
            'license_no': license_number,
            'reactor_type': reactor_type,
            'location': location,
            'owner_operator': owner_operator,
            'nrc_region': nrc_region
        })

## Next step: Scraping data from more than one page

In theory, we could have just imported that table into Excel. Let's _kick things up a notch_.

Problem: We scraped our table, but there is one key piece of information on each reactor's detail page that we want to grab, too: A link to the PDF of its operating license.

To get this link, we'll repeat the steps we just took to scrape the overview table -- but we'll tweak it a little bit, adding a function that extracts the link from each detail page.

### Extraction function!

Let's write an extraction function, `fetchLicensePDF()`, that will take one argument, `detail_html` -- the string of HTML from a reactor's detail page -- and return the extra link we wanna grab.

The PDF link is in a list inside another table cell, but we're going to target it using the link's text ("Plant Operating License"). It's a relative link, so we're going to prepend the domain before we write out to our file.
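
Gluing the domain onto the front of the link with `+` works fine here. As an aside, the standard library's `urljoin()` function does the same job and also copes with links that are already absolute. This is just an alternative sketch, not what the code in this session uses, and both paths below are made up for illustration.

```python
from urllib.parse import urljoin

# the page the link was found on (this path is made up for illustration)
page_url = 'https://www.nrc.gov/info-finder/reactors/example.html'

# a relative link that starts at the site root (also made up)
relative_link = '/docs/example-license.pdf'

# urljoin works out the full URL from the page's URL and the relative link
full_link = urljoin(page_url, relative_link)
print(full_link)

# a link that's already absolute comes back unchanged
already_absolute = urljoin(page_url, 'https://example.com/a.pdf')
print(already_absolute)
```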

Some of the pages don't have this link, so we're going to use a [`try/except`](https://docs.python.org/3/tutorial/errors.html) statement to catch errors.

In [10]:
def fetchLicensePDF(detail_html):
    soup = BeautifulSoup(detail_html, 'html.parser')
    
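    # `string` matches on the link's text (older code uses `text` for this)
    link = soup.find('a', string='Plant Operating License')

    try:
        # find() returns None if there's no such link, and None['href'] raises a TypeError
        return 'https://www.nrc.gov' + link['href']
    except (TypeError, KeyError):
        return ''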

We have our functions. We're ready to write out to a new file -- let's call this one `reactors-detail.csv`. We're going to repeat the same loop we did when we scraped the first table, except this time, we're going to do two more things:

1. Fetch each reactor's detail page and call the `fetchLicensePDF()` function on its HTML, then write out the returned link alongside the other data
2. Pause for a couple seconds using the `time` module

In [11]:
import time

# newline='' lets the csv module handle line endings (avoids blank rows on Windows)
with open('reactors-detail.csv', 'w', encoding='utf-8', newline='') as outfile:
    
    headers = ['name', 'url', 'reactor_id', 'license_no',
               'reactor_type', 'location', 'owner_operator',
               'nrc_region', 'license_link']

    writer = csv.DictWriter(outfile, fieldnames=headers)
    
    writer.writeheader()
    
    for row in table.find_all('tr')[1:]:
        
        cols = row.find_all('td')

        name_cell_contents = cols[0].contents
        name = cleanString(name_cell_contents[0].string)
        url = 'https://www.nrc.gov' + name_cell_contents[0]['href']
        reactor_id = cleanString(name_cell_contents[2].string)

        license_number = cleanString(cols[1].string)
        reactor_type = cleanString(cols[2].string)
        location = cleanString(cols[3].string)
        owner_operator = cleanString(cols[4].string)
        nrc_region = cleanString(cols[5].string)
        
        # now get the license PDF with these two lines
        r = requests.get(url)
        license_link = fetchLicensePDF(r.text)            
                
        writer.writerow({
            'name': name,
            'url': url,
            'reactor_id': reactor_id,
            'license_no': license_number,
            'reactor_type': reactor_type,
            'location': location,
            'owner_operator': owner_operator,
            'nrc_region': nrc_region,
            'license_link': license_link
        })
                
        print('Scraped data for ' + name)
        time.sleep(2)
        
    # the backslash is doubled so Python doesn't read `\o` as an escape sequence
    print('Done! \\o/')

Scraped data for Arkansas Nuclear 1
Scraped data for Arkansas Nuclear 2
Scraped data for Beaver Valley 1
Scraped data for Beaver Valley 2
Scraped data for Braidwood 1
Scraped data for Braidwood 2
Scraped data for Browns Ferry 1


KeyboardInterrupt: 