## Webscraping with BeautifulSoup

Last week, we used the Inspector tool in our browser to take a look at patterns in the HTML of some government websites. This week, we'll use Python and a package called BeautifulSoup to parse that HTML into structured data that we can use. We'll be using an example from one of your homework submissions today.

The notebook below is a skeleton of a generic scraper that we'll adapt to the structure of the website selected by the class. This is a pretty typical workflow for a scraper project because you'll often use older scrapers you've written as examples for how to scrape new websites.

In case you haven't already, let's install BeautifulSoup4 and lxml using pip3.

```
pip3 install beautifulsoup4

pip3 install lxml
```

Then we'll import the three open source packages we'll be working with today: `BeautifulSoup` (aka `bs4`), `requests` and `pandas`. We'll also be using Python's built-in `time` package.

In [12]:
import time
from bs4 import BeautifulSoup
import requests
import pandas as pd

## Ethical scraping: Set our header

To scrape a website, we'll use `requests` to send an https request and get back a response containing the html we want to parse. This is the same thing that happens when you type a url into your browser and press enter. To be ethical and up front about our scraping, it's good practice to include a note in the request header saying who we are, what we're doing and how to contact us if there's an issue.

In [13]:
header_string = ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 '
                 '(KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36 '
                 'Hey there, Chad Day here at The Wall Street Journal. '
                 'I am scraping some public data from your site. '
                 'You can reach me at chad.day@wsj.com.')

header = {'User-Agent': header_string}

print(header)

{'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36 Hey there, Chad Day here at The Wall Street Journal. I am scraping some public data from your site. You can reach me at chad.day@wsj.com.'}
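
Being a considerate scraper also means not hammering the server. The single-page example in this notebook doesn't need it, but when a scraper requests many pages, a common courtesy is to pause between requests and to stop on HTTP errors. Here is a minimal sketch, assuming a hypothetical helper named `polite_get` and an arbitrary one-second pause (neither is part of `requests`):

```python
import time

import requests


def polite_get(url, headers, pause_seconds=1.0, timeout=30, get=requests.get):
    """Fetch one page, raise on HTTP errors, then pause before returning.

    `polite_get`, `pause_seconds` and the injectable `get` argument are
    illustrative names, not part of the `requests` library.
    """
    response = get(url, headers=headers, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    time.sleep(pause_seconds)    # give the server a break before the next request
    return response
```

Inside a loop over many urls, each call would wait `pause_seconds` before the next request goes out.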


## Define the url

Copy and paste the url into a string and assign it to a variable, like we do with `url` below.

In [16]:
#url = 'https://extapps2.oge.gov/FOIAStatus/FOIAResponse.nsf/2FC940AD2A3190BD8525811B004560CE'

url = 'https://oig.nasa.gov/audits/auditReports.html'

url

'https://oig.nasa.gov/audits/auditReports.html'

## Send the request

Now we'll put them together and get back the html we want to parse. This is often called "making the soup." Below, `requests` returns the response from the site. The page html is stored in the `.text` attribute of the response. We pass that text to BeautifulSoup and tell it to parse it using `lxml`, a parser library that turns the text into hierarchical structured data that we can navigate.
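
To see what the parsed result looks like without hitting a live site, here is a small sketch that makes soup from a hard-coded string. The html snippet is made up for illustration, and the sketch uses Python's built-in `html.parser` so it runs even if `lxml` isn't installed.

```python
from bs4 import BeautifulSoup

# A made-up page standing in for response.text
html_text = """
<html>
  <head><title>Example reports</title></head>
  <body>
    <h1>Audit reports</h1>
    <p class="intro">Reports are listed by fiscal year.</p>
  </body>
</html>
"""

example_soup = BeautifulSoup(html_text, "html.parser")

# The parsed tree can be navigated by tag name, from parent to child.
page_title = example_soup.title.text
heading = example_soup.body.h1.text
intro_classes = example_soup.p.get("class")  # class values come back as a list

print(page_title)
print(heading)
print(intro_classes)
```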

In [17]:
response = requests.get(url, headers=header)

soup = BeautifulSoup(response.text, "lxml")

soup

<!DOCTYPE html>
<html lang="en">
<!-- #BeginTemplate "/Templates/Default.dwt" -->
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<!-- #BeginEditable "title" -->
<title>NASA OIG</title>
<!-- #EndEditable -->
<link href="/favicon.ico?" rel="shortcut icon" type="image/vnd.microsoft.icon"/>
<!-- Bootstrap -->
<link href="/css/bootstrap.min.css" rel="stylesheet"/>
<script async="" src="https://www.google-analytics.com/analytics.js"></script>
<script id="_fed_an_ua_tag" language="javascript" src="https://dap.digitalgov.gov/Universal-Federated-Analytics-Min.js?agency=NASA"></script>
<!-- HTML5 Shim and Respond.js IE8 support of HTML5 elements and media queries -->
<!--[if lt IE 9]>
          <script src="https://oss.maxcdn.com/html5shiv/3.7.2/html5shiv.min.js"></script>
          <script src="https://oss.maxcdn.com/respond/1.4.2/respond.min.

## Inspect the html on the page and try out some searches

BeautifulSoup uses the tags, classes, ids and text of the html to locate the pieces you want. The most common methods are `.find()` and `.find_all()`. They do what they sound like: find pieces of the html that match your search criteria. `.find()` locates the first object that matches the criteria, while `.find_all()` locates all of them and returns a list of what it finds.
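
The difference between the two methods is easiest to see side by side. Here is a small sketch on made-up html (the links and class names are placeholders, not taken from either site):

```python
from bs4 import BeautifulSoup

# Made-up html with three links, two of which share a class
html_text = """
<ul>
  <li><a class="report" href="/docs/a.pdf">Report A</a></li>
  <li><a class="report" href="/docs/b.pdf">Report B</a></li>
  <li><a class="other" href="/about.html">About</a></li>
</ul>
"""
example_soup = BeautifulSoup(html_text, "html.parser")

first_link = example_soup.find("a")          # first match only
all_links = example_soup.find_all("a")       # every match, in page order
report_links = example_soup.find_all("a", attrs={"class": "report"})
missing = example_soup.find("table")         # no match gives None, not an error

print(first_link.text)
print(len(all_links), len(report_links))
print(missing)
```

Because `.find()` gives back `None` when nothing matches, chaining another call onto a failed search raises an `AttributeError`, which is usually the first sign that a page's structure differs from what you expected.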

Let's see it in this example using an html table of documents released by the Office of Government Ethics under FOIA. You can find the site [here](https://extapps2.oge.gov/FOIAStatus/FOIAResponse.nsf/2FC940AD2A3190BD8525811B004560CE).

This site has a very basic html table in it that we want to extract. The pattern looks like the one below. First, we see a `<table>` tag followed by a `<tbody>` tag signifying the start of the body of the table. Then we see a series of `<tr>` tags, which signify the rows of the table. The first row has `<th>` tags for the headers, and the subsequent rows have `<td>` tags containing cells of data.

```
<table>
  <tbody>
    <tr>
      <th>Tracking Number and Date of Release</th>
      <th>Description of Records Sought</th>
      <th>Attachment</th>
    </tr>
    <tr>
      <td>FY 18 - 002 (07/19/2018)</td>
      <td>Description ... </td>
      <td>
        <a href="url...">Link text ... </a>
      </td>
    </tr>
    ...
  </tbody>
</table>
```

We'll leverage these patterns, or similar ones in our class example, to extract the data using a for loop and the Python list data structure. As a reminder, a for loop lets us do something to each item in a list, one after another.
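
The table pattern above can be walked with a for loop roughly like this. It is only a sketch: the two data rows are placeholders (the second is invented), and the dictionary keys are illustrative names.

```python
from bs4 import BeautifulSoup

# Placeholder table following the pattern above; the cell values are made up.
html_text = """
<table>
  <tbody>
    <tr><th>Tracking Number and Date of Release</th><th>Description of Records Sought</th><th>Attachment</th></tr>
    <tr><td>FY 18 - 002 (07/19/2018)</td><td>First description</td><td><a href="/first.pdf">First link</a></td></tr>
    <tr><td>FY 18 - 003 (08/01/2018)</td><td>Second description</td><td><a href="/second.pdf">Second link</a></td></tr>
  </tbody>
</table>
"""
example_soup = BeautifulSoup(html_text, "html.parser")

table = example_soup.find("table")
table_rows = table.find_all("tr")

table_records = []
for row in table_rows[1:]:  # skip the header row, which holds <th> tags
    cells = row.find_all("td")
    table_records.append({
        "tracking": cells[0].text,
        "description": cells[1].text,
        "link": cells[2].find("a").get("href"),
    })

print(table_records)
```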

## Find the section of the page containing our data

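In the OGE example, it's the `<table>` tag, but which one? There are multiple tables on the page. On the NASA page we're scraping today, each fiscal year's reports sit inside a `<div>` with the class `tab-pane fade`. Let's use `find_all` to create a list of those sections and then select the one we want using its index in the list.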
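
One thing worth checking when searching by class: a class string with a space in it is matched as an exact string, so an element carrying extra classes can be missed. The sketch below uses made-up html to show the difference between that search and a CSS selector via `.select()`. Whether the live page has a section like the first one here is an assumption, not something this notebook confirms.

```python
from bs4 import BeautifulSoup

# Made-up tabs: the first carries extra classes, the second does not.
html_text = """
<div class="tab-pane fade in active" id="FY-2022"><ul></ul></div>
<div class="tab-pane fade" id="FY-2021"><ul></ul></div>
"""
example_soup = BeautifulSoup(html_text, "html.parser")

# A multi-word class string is compared against the whole class attribute.
exact = example_soup.find_all("div", attrs={"class": "tab-pane fade"})

# A CSS selector matches any div that has both classes, whatever else it has.
by_selector = example_soup.select("div.tab-pane.fade")

# A section can also be looked up directly by its id.
fy_2021 = example_soup.find("div", id="FY-2021")

print([div.get("id") for div in exact])
print([div.get("id") for div in by_selector])
```

Comparing the length of the list you get back against what you can see in the browser is a quick sanity check on whichever search you choose.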

In [19]:
years = soup.find_all('div', attrs={'class': 'tab-pane fade'})

print(f'There are {len(years)} years on this page.')

str(years[0])

There are 26 years on this page.


'<div class="tab-pane fade" id="FY-2021">\n<ul class="StackedContainer">\n<li class="RowContainer">\n<div><strong>September 27, 2021</strong></div>\n<div><a href="/docs/IG-21-028.pdf" target="_blank">Final Memorandum, Summary of Results of Incurred Cost Audits</a> <nobr class="nobr">(IG-21-028)</nobr></div>\n</li>\n<li class="RowContainer">\n<div><strong>September 8, 2021</strong></div>\n<div><a href="/docs/IG-21-027.pdf" target="_blank">NASA\'s Construction of Facilities</a> <nobr class="nobr">(IG-21-027)</nobr></div>\n</li>\n<li class="RowContainer">\n<div><strong>August 10, 2021</strong></div>\n<div><a href="/docs/IG-21-025.pdf" target="_blank">NASA\'s Development of Next-Generation Spacesuits</a> <nobr class="nobr">(IG-21-025)</nobr></div>\n</li>\n<li class="RowContainer">\n<div><strong>August 9, 2021</strong></div>\n<div><a href="/docs/IG-21-024.pdf" target="_blank">Review of Coronavirus Aid, Relief, and Economic Security (CARES) Act Funding</a> <nobr class="nobr">(IG-21-024)</nob

## Find the rows

Now that we've located the section we want, let's create a list of the rows in it using `find_all()`. In an html table, the rows would be `<tr>` tags. On the NASA page, each row is an `<li>` tag.

In [20]:
rows = years[0].find_all('li')

rows[0]

<li class="RowContainer">
<div><strong>September 27, 2021</strong></div>
<div><a href="/docs/IG-21-028.pdf" target="_blank">Final Memorandum, Summary of Results of Incurred Cost Audits</a> <nobr class="nobr">(IG-21-028)</nobr></div>
</li>

## Find the data 

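Let's take a look at the data in our rows. In a table like the OGE example, we'd skip the first row because it only contains the headers. The NASA list has no header row, so every row holds data. We'll use the second row, `rows[1]`, as our example.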

In [33]:
data = rows[1].find_all('div')

data

[<div><strong>September 8, 2021</strong></div>,
 <div><a href="/docs/IG-21-027.pdf" target="_blank">NASA's Construction of Facilities</a> <nobr class="nobr">(IG-21-027)</nobr></div>]

In [40]:
data[1].text

"NASA's Construction of Facilities (IG-21-027)"

In [38]:
data[1].find('a')

<a href="/docs/IG-21-027.pdf" target="_blank">NASA's Construction of Facilities</a>

In [32]:
data[1].find('a').get('href')

'/docs/IG-21-027.pdf'
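
The `.text` of the second `<div>` runs the report title and its number together. If separate fields are useful, the same `.find()` calls can pull them apart. Here is a sketch that reuses the row html printed above; the variable names `title` and `report_number` are illustrative.

```python
from bs4 import BeautifulSoup

# One row, copied from the output above
row_html = """
<li class="RowContainer">
<div><strong>September 8, 2021</strong></div>
<div><a href="/docs/IG-21-027.pdf" target="_blank">NASA's Construction of Facilities</a> <nobr class="nobr">(IG-21-027)</nobr></div>
</li>
"""
row = BeautifulSoup(row_html, "html.parser").find("li")
cells = row.find_all("div")

link = cells[1].find("a")
title = link.text            # text inside the <a> tag only
href = link.get("href")      # value of the href attribute
report_number = cells[1].find("nobr").text.strip("()")  # drop the parentheses

print(title)
print(href)
print(report_number)
```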

## Define a record 

Below we'll use a Python dictionary to define our record. Remember, it's a key-value data structure. We'll do this because it's very easy to convert lists of dictionaries into pandas dataframes.

In [41]:
rec = {
    'date': data[0].find('strong').text,
    'url_end': data[1].find('a').get('href'),
    'name': data[1].text
}

rec

{'date': 'September 8, 2021',
 'url_end': '/docs/IG-21-027.pdf',
 'name': "NASA's Construction of Facilities (IG-21-027)"}
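
The date in the record is still plain text. If you want it in a sortable form at this stage, Python's `datetime` can parse it. This is a minimal sketch using the record shown above; the extra `date_iso` key is a hypothetical addition, not part of the class scraper.

```python
from datetime import datetime

rec_example = {
    'date': 'September 8, 2021',
    'url_end': '/docs/IG-21-027.pdf',
    'name': "NASA's Construction of Facilities (IG-21-027)",
}

# '%B %d, %Y' means full month name, day of the month, four-digit year.
parsed = datetime.strptime(rec_example['date'], '%B %d, %Y').date()

# 'date_iso' is a hypothetical extra key holding the date as YYYY-MM-DD.
rec_example['date_iso'] = parsed.isoformat()

print(rec_example['date_iso'])
```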

## Construct the loop

Now, let's put it all together. Notice I've added a couple of things here. We loop over every year's section, not just the first one. And because the url contained in the html omits the domain, I've stored the domain as a string and concatenated the two together.

In [45]:
records = []

# The hrefs in the html omit the domain, so we prepend it to each link.
url_start = 'https://oig.nasa.gov'

# Each fiscal year's reports sit in their own div.
years = soup.find_all('div', attrs={'class': 'tab-pane fade'})

for year in years:
    rows = year.find_all('li')
    for row in rows:
        data = row.find_all('div')
        rec = {
            'date': data[0].find('strong').text,
            'url_end': url_start + data[1].find('a').get('href'),
            'name': data[1].text
        }
        records.append(rec)
    
print(len(records))

671
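
The loop above assumes every row has a date and a link. On messier pages, a row that breaks the pattern would stop the scraper with an error. One way to guard against that is sketched below, using a hypothetical `parse_row` helper and the standard library's `urljoin` in place of string concatenation. The first sample row is adapted from the page and the second is made up.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

PAGE_URL = 'https://oig.nasa.gov/audits/auditReports.html'


def parse_row(row, page_url=PAGE_URL):
    """Return a record dict for one <li> row, or None if pieces are missing."""
    cells = row.find_all('div')
    if len(cells) < 2:
        return None
    date_tag = cells[0].find('strong')
    link_tag = cells[1].find('a')
    if date_tag is None or link_tag is None or link_tag.get('href') is None:
        return None
    return {
        'date': date_tag.text,
        # urljoin resolves the relative href against the page's own url
        'url_end': urljoin(page_url, link_tag.get('href')),
        'name': cells[1].text,
    }


# Two sample rows: one complete, one with no link
sample_html = """
<ul>
<li><div><strong>September 8, 2021</strong></div>
<div><a href="/docs/IG-21-027.pdf">NASA's Construction of Facilities</a> <nobr>(IG-21-027)</nobr></div></li>
<li><div><strong>No report here</strong></div></li>
</ul>
"""
sample_rows = BeautifulSoup(sample_html, 'html.parser').find_all('li')
parsed_rows = [parse_row(row) for row in sample_rows]
good_records = [rec for rec in parsed_rows if rec is not None]

print(good_records)
```

Skipping bad rows silently can hide problems, so in practice you may prefer to count or print the rows that come back as `None`.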


## Create a `pandas` dataframe

We can pass our records list of dictionaries directly to pandas to create a dataframe.

In [46]:
df = pd.DataFrame(records)

print(f'There are {len(df.index)} rows in the dataframe.')

df.head()

There are 671 rows in the dataframe.


| | date | url_end | name |
|---|---|---|---|
| 0 | September 27, 2021 | https://oig.nasa.gov/docs/IG-21-028.pdf | Final Memorandum, Summary of Results of Incurr... |
| 1 | September 8, 2021 | https://oig.nasa.gov/docs/IG-21-027.pdf | NASA's Construction of Facilities (IG-21-027) |
| 2 | August 10, 2021 | https://oig.nasa.gov/docs/IG-21-025.pdf | NASA's Development of Next-Generation Spacesui... |
| 3 | August 9, 2021 | https://oig.nasa.gov/docs/IG-21-024.pdf | Review of Coronavirus Aid, Relief, and Economi... |
| 4 | July 14, 2021 | https://oig.nasa.gov/docs/IG-21-022.pdf | NASA's Management of USRA's Cooperative Agreem... |
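
The `date` column is still text at this point, so sorting it would be alphabetical. If you need real dates, `pd.to_datetime` can convert the column. Here is a small sketch on two records in the same shape as ours; `sample_records` and `sample_df` are illustrative names.

```python
import pandas as pd

# Two records in the same shape as the scraped ones
sample_records = [
    {'date': 'September 27, 2021',
     'url_end': 'https://oig.nasa.gov/docs/IG-21-028.pdf',
     'name': 'Final Memorandum, Summary of Results of Incurred Cost Audits (IG-21-028)'},
    {'date': 'September 8, 2021',
     'url_end': 'https://oig.nasa.gov/docs/IG-21-027.pdf',
     'name': "NASA's Construction of Facilities (IG-21-027)"},
]
sample_df = pd.DataFrame(sample_records)

# Parse the text dates so they sort chronologically instead of alphabetically.
sample_df['date'] = pd.to_datetime(sample_df['date'], format='%B %d, %Y')
sample_df = sample_df.sort_values('date').reset_index(drop=True)

print(sample_df.dtypes)
print(sample_df[['date', 'name']])
```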


## Output to a csv 

In [48]:
df.to_csv('./data/nasa_oig.csv', index=False)
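
Note that `to_csv` does not create missing folders, so the cell above fails if `./data` doesn't exist yet. One way around that is a small helper that creates the folder first. This is a sketch: `save_csv` is a hypothetical name, and the demo writes to a temporary folder so it leaves nothing behind.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_csv(frame, path):
    """Write a dataframe to csv, creating the parent folder if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    frame.to_csv(path, index=False)
    return path


# Demonstrate in a temporary folder, then read the file back to check it.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_df = pd.DataFrame([{'date': 'September 8, 2021', 'name': 'Example'}])
    written = save_csv(demo_df, Path(tmp_dir) / 'data' / 'nasa_oig.csv')
    round_trip = pd.read_csv(written)

print(round_trip)
```

With the real dataframe, the call would be `save_csv(df, './data/nasa_oig.csv')`.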