## Webscraping with BeautifulSoup

Last week, we used the Inspector tool in our browser to take a look at patterns in the HTML of some government websites. This week, we'll use Python and a package called BeautifulSoup to parse that HTML into structured data that we can use. We'll be using an example from one of your homework submissions today.

The notebook below is a skeleton of a generic scraper that we'll adapt to the structure of the website selected by the class. This is a pretty typical workflow for a scraper project because you'll often use older scrapers you've written as examples for how to scrape new websites.

In case you haven't already, let's install BeautifulSoup4 and lxml using pip3.

```
pip3 install beautifulsoup4

pip3 install lxml
```

Then we'll import the three open-source packages we'll be working with today: `BeautifulSoup` (aka `bs4`), `requests` and `pandas`. We'll also import Python's built-in `time` package.

In [None]:
import time
from bs4 import BeautifulSoup
import requests
import pandas as pd

## Ethical scraping: Set our header

To scrape a website, we'll use `requests` to send an HTTPS request and get back a response containing the html we want to parse. This is the same thing that happens when you type a url into your browser and press enter. To be ethical and up front about what we're doing, it's good practice to send a note in the `User-Agent` header so the site admins can see who we are and contact us if there's an issue.

In [None]:
header_string = ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 '
                 '(KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36 ' 
                 'Hey there, Chad Day here at The Wall Street Journal. '
                 'I am scraping some public data from your site. '
                 'You can reach me at chad.day@wsj.com.')

header = {'User-Agent': header_string}

print(header)

## Define the url

Copy and paste the url into a string and assign it to a variable, like we do with `url` below.

In [None]:
url = 'https://extapps2.oge.gov/FOIAStatus/FOIAResponse.nsf/2FC940AD2A3190BD8525811B004560CE'

## Send the request

Now we'll put them together and get back the html we want to parse. This is often called "making the soup." Below, `requests` returns the response from the site. The page html is stored in the `.text` attribute of the response. We pass that text to BeautifulSoup and tell it to parse it with `lxml`, a parser that turns the text into hierarchical, structured data we can navigate.

In [None]:
response = requests.get(url, headers=header)

soup = BeautifulSoup(response.text, "lxml")

soup
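
The request above assumes the site answers normally. Here is a minimal sketch of a more defensive fetch. It uses a hypothetical helper named `fetch_html`, which is not part of the original notebook. The helper pauses with `time.sleep` before each request so repeated calls don't overload the server, sets a timeout, and calls `raise_for_status()` so an error response stops the script instead of being parsed as if it were the page. The one-second pause and 30-second timeout are arbitrary choices, not recommendations from the site.

```python
import time

import requests


def fetch_html(url, headers, pause_seconds=1.0, timeout=30):
    """Wait briefly, request the page, and return its html as text."""
    # Pause first so that calling this in a loop spaces out our requests
    time.sleep(pause_seconds)
    response = requests.get(url, headers=headers, timeout=timeout)
    # Raises requests.HTTPError if the server answered with a 4xx or 5xx status
    response.raise_for_status()
    return response.text
```

With this helper, making the soup would look like `soup = BeautifulSoup(fetch_html(url, header), "lxml")`.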

## Inspect the html on the page and try out some searches

BeautifulSoup uses the tags, classes, ids and text of the html to locate the pieces you want. The most common methods are `.find()` and `.find_all()`. They do what they sound like: find pieces of the html that match your search criteria. `.find()` locates the first object that matches the criteria, while `.find_all()` locates all of them and returns a list of what it finds.
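
To see the difference between the two methods without touching a live site, here is a small sketch on a made-up snippet of html. The list items and class names are invented for illustration, and it uses Python's built-in `html.parser` instead of `lxml` so it runs even if `lxml` isn't installed.

```python
from bs4 import BeautifulSoup

# A made-up snippet of html, just for illustration
html = """
<ul id="agencies">
  <li class="agency">OGE</li>
  <li class="agency">GSA</li>
  <li class="office">OMB</li>
</ul>
"""

# "html.parser" ships with Python, so no extra install is needed here
snippet = BeautifulSoup(html, "html.parser")

first_item = snippet.find("li")                     # only the first <li>
all_items = snippet.find_all("li")                  # every <li>, in a list
agencies = snippet.find_all("li", class_="agency")  # filter by class
missing = snippet.find("table")                     # None when nothing matches

print(first_item.text)
print([item.text for item in all_items])
print(len(agencies), missing)
```

Note that `.find()` returns `None` when nothing matches, so calling a method on the result of a failed search raises an error.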

Let's see it in this example using an html table of documents released by the Office of Government Ethics under FOIA. You can find the site [here](https://extapps2.oge.gov/FOIAStatus/FOIAResponse.nsf/2FC940AD2A3190BD8525811B004560CE).

This site has a very basic html table in it that we want to extract. The pattern looks like the example below. First, we see a `<table>` tag followed by a `<tbody>` tag signifying the start of the body of the table. Then we see a series of `<tr>` tags, which signify the rows of the table. The first row has `<th>` tags for the headers, and the subsequent rows have `<td>` tags containing cells of data.

```
<table>
  <tbody>
    <tr>
      <th>Tracking Number and Date of Release</th>
      <th>Description of Records Sought</th>
      <th>Attachment</th>
    </tr>
    <tr>
      <td>FY 18 - 002 (07/19/2018)</td>
      <td>Description ... </td>
      <td>
        <a href="url...">Link text ... </a>
      </td>
    </tr>
    ...
  </tbody>
</table>
```

We'll leverage these patterns, or similar ones in our class example, to extract the data using a for loop and the Python list data structure. As a reminder, a for loop lets us do something to each item in a list, one item at a time.

## Find the section of the page containing our data

With our example, it's the `<table>` tag, but which one? There are multiple tables on the page. Let's use `find_all` to create a list and then select the table we want using its index in the list.

In [None]:
tables = soup.find_all('table')

print(f'There are {len(tables)} tables on this page.')

str(tables[1])[:200]
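
Picking a table by index works until the site adds or removes a table. One alternative, sketched below on made-up html, is to pick the table by the text of its headers. The helper name `find_table_by_header` and the sample markup are assumptions for illustration, and the sketch uses the built-in `html.parser` so it runs without `lxml`.

```python
from bs4 import BeautifulSoup

# Made-up html with two tables, standing in for a real page
html = """
<table><tr><td>Site navigation</td></tr></table>
<table>
  <tr><th>Tracking Number and Date of Release</th><th>Attachment</th></tr>
  <tr><td>FY 18 - 002 (07/19/2018)</td><td><a href="/doc1">Response</a></td></tr>
</table>
"""

page = BeautifulSoup(html, "html.parser")


def find_table_by_header(soup, header_text):
    """Return the first table with a <th> containing header_text, else None."""
    for table in soup.find_all("table"):
        headers = [th.get_text(strip=True) for th in table.find_all("th")]
        if any(header_text in header for header in headers):
            return table
    return None


target = find_table_by_header(page, "Tracking Number")
print(len(target.find_all("tr")))
```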

## Find the rows

In [None]:
rows = tables[1].find_all('tr')

rows[0]

## Find the data 

We'll skip the first row because it contains only our headers, and use the second row as our example.

In [None]:
data = rows[1].find_all('td')

data

## Define a record 

Below we'll use a Python dictionary to define our record. Remember, it's a key-value data structure. We'll do this because it's very easy to convert lists of dictionaries into pandas dataframes.

In [None]:
rec = {
    'number_and_date': data[0].text,
    'description': data[1].text,
    'url': data[2].find('a').get('href'),
}

rec

## Construct the loop

Now, let's put it all together.

In [None]:
records = []

# The hrefs in the table are joined to the site's base url to build full links
url_start = 'https://extapps2.oge.gov/'

for row in rows[1:]:  # skip the header row
    data = row.find_all('td')
    url_end = data[2].find('a').get('href')
    rec = {
        'number_and_date': data[0].text,
        'description': data[1].text,
        'url': url_start + url_end,
    }
    records.append(rec)
    
records[:3]
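
The loop above assumes every row has three cells and a link in the third one. A row without a link would raise an error, because `.find('a')` returns `None`. Below is a minimal sketch of a more forgiving version, run on made-up rows. The helper name `parse_row`, the sample html and the file path in it are assumptions for illustration. It also swaps in `urljoin` from Python's standard library, which avoids a doubled slash if an href already starts with `/`, and `get_text(strip=True)`, which trims stray whitespace from each cell.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://extapps2.oge.gov/"

# Made-up rows: a header row, a row with a link, and a row without one
html = """
<table>
  <tr><th>Number</th><th>Description</th><th>Attachment</th></tr>
  <tr><td>FY 18 - 002 (07/19/2018)</td><td> First request </td>
      <td><a href="/files/one.pdf">Response</a></td></tr>
  <tr><td>FY 18 - 003 (07/20/2018)</td><td>Second request</td><td></td></tr>
</table>
"""


def parse_row(row, base_url=BASE_URL):
    """Turn one <tr> into a record dict, or None if it has fewer than 3 cells."""
    cells = row.find_all("td")
    if len(cells) < 3:
        return None  # header rows have <th> cells, so they land here
    link = cells[2].find("a")
    href = link.get("href") if link else None
    return {
        "number_and_date": cells[0].get_text(strip=True),
        "description": cells[1].get_text(strip=True),
        # Keep None when there is no link instead of crashing
        "url": urljoin(base_url, href) if href else None,
    }


sample_rows = BeautifulSoup(html, "html.parser").find_all("tr")
sample_records = [record for record in map(parse_row, sample_rows) if record]
print(sample_records)
```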

## Create a `pandas` dataframe

We can pass our records list of dictionaries directly to pandas to create a dataframe.

In [None]:
df = pd.DataFrame(records)

print(f'There are {len(df.index)} rows in the dataframe.')

df.head()

## Output to a csv 

In [None]:
df.to_csv('./data/oge_foias.csv', index=False)
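
`to_csv` raises an error if the `data` folder doesn't exist yet. Here is a minimal sketch of one way around that, assuming a hypothetical helper named `save_csv` that creates any missing folders before writing. The demo writes a one-row stand-in dataframe to a temporary folder so nothing real gets overwritten.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_csv(df, path):
    """Write df to path as a csv, creating missing parent folders first."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(path, index=False)
    return path


# Stand-in data, not real scraped records
example_df = pd.DataFrame(
    [{"number_and_date": "FY 18 - 002 (07/19/2018)", "description": "Example"}]
)

# Demo in a throwaway folder that is deleted when the block ends
with tempfile.TemporaryDirectory() as tmp:
    saved = save_csv(example_df, Path(tmp) / "data" / "example.csv")
    round_trip = pd.read_csv(saved)

print(round_trip.shape)
```

In the notebook, the equivalent call would be `save_csv(df, './data/oge_foias.csv')`.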