# Web scraping with Python

This notebook demonstrates how you can use the Python programming language to scrape information from a web page.

The goal today: Scrape the main table on [the first page of FDA warning letters issued in 2019](https://www.fda.gov/ICECI/EnforcementActions/WarningLetters/2019/default.htm) and, if time, write the data to a CSV.

If you're relatively new to Python, it might be helpful to have [this Python syntax cheat sheet](Python%20syntax%20cheat%20sheet.ipynb) open in another tab as you work through this notebook.

### Table of contents

- [Using Jupyter notebooks](#Using-Jupyter-notebooks)
- [What _is_ a web page, anyway?](#What-is-a-web-page,-anyway?)
- [Inspect the source](#Inspect-the-source)
- [Import libraries](#Import-libraries)
- [Request the page](#Request-the-page)
- [Turn your HTML into soup](#Turn-your-HTML-into-soup)
- [Targeting and extracting data](#Targeting-and-extracting-data)
- [Write the results to file](#Write-the-results-to-file)

### Using Jupyter notebooks

There are several ways to write and run Python code on your computer. One way -- the method we're using today -- is to use [Jupyter notebooks](https://jupyter.org/), which run in your browser and allow you to intersperse documentation with your code. They're handy for bundling your code with a human-readable explanation of what's happening at each step. Check out some examples from the [L.A. Times](https://github.com/datadesk/notebooks) and [BuzzFeed News](https://github.com/BuzzFeedNews/everything#data-and-analyses).

**To add a new cell to your notebook**: Click the + button in the menu.

**To run a cell of code**: Select the cell and click the "Run" button in the menu, or you can press Shift+Enter.

**One common gotcha**: The notebook doesn't "know" about code you've written until you've _run_ the cell containing it. For example, if you define a variable called `my_name` in one cell and later get an error that says `NameError: name 'my_name' is not defined` when you try to access that variable in another cell, the most likely solution is to run (or re-run) the cell in which you defined `my_name`.
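
To see the gotcha in action, here is a minimal sketch that deliberately uses a variable before it has been defined. The value `'Your Name'` is just a placeholder, and the `try`/`except` wrapper is only there so the error message prints instead of halting the cell. (If you've already defined `my_name` elsewhere in the notebook, no error will show up.)

```python
message = 'no error'

# try to use the variable before the "cell" that defines it has been run
try:
    print(my_name)
except NameError as error:
    message = str(error)

print(message)

# "running the cell" that defines the variable fixes the problem
my_name = 'Your Name'  # placeholder value
print(my_name)
```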

### What _is_ a web page, anyway?

Basically, a collection of specifically formatted text files stored on a computer (a _server_) that's probably sitting on a rack in a giant data center somewhere.

Mostly you'll be dealing with `.html` (HyperText Markup Language) files that typically include references to `.css` (Cascading Style Sheet) files, which determine how the page looks, and/or `.js` (JavaScript) files, which add interactivity.

Today, we'll focus on the HTML, which provides structure to the page.

HTML elements are represented by a pair of tags -- an opening tag and a closing tag.

A table, for example, starts with `<table>` and ends with `</table>`. The first tag tells the browser: "Hey! I got a table here! Render it as a table." The closing tag (note the forward slash!) tells the browser: "Hey! I'm all done with that table, thanks." Inside the table are nested more HTML tags representing rows (`<tr>`) and cells (`<td>`).

HTML elements can have any number of attributes, such as classes --

`<table class="cool-table">`

-- styles --

`<table style="width:95%;">`

-- hyperlinks to other pages --

`<a href="https://ire.org">Click here to visit IRE's website</a>`

-- and IDs --

`<table id="cool-table">`

-- that will be useful to know about when we're scraping.
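
If you're curious how a program "sees" tags and attributes, here is a rough sketch that uses Python's built-in `html.parser` module (not the BeautifulSoup library we'll use later) to list each opening tag and its attributes. The `TagCollector` name and the HTML snippet are made up for illustration.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collect each opening tag along with its attributes."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.found.append((tag, dict(attrs)))


# made-up HTML for illustration
snippet = '''
<table id="cool-table" class="cool-table">
  <tr>
    <td><a href="https://ire.org">Click here to visit IRE's website</a></td>
  </tr>
</table>
'''

collector = TagCollector()
collector.feed(snippet)

for tag, attributes in collector.found:
    print(tag, attributes)
```

Notice that the `table` tag carries an `id` and a `class`, the `a` tag carries an `href`, and the `tr` and `td` tags have no attributes at all.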

### Inspect the source

You can look at the HTML that makes up a web page by _inspecting the source_ in a web browser. We like Chrome and Firefox for this; today, we'll use Chrome.

To "view source" in Chrome, hit `Ctrl+U` on a PC or `⌘+Opt+U` on a Mac. (It's also in the menu bar: View > Developer > View Page Source.)

You'll get a page showing you all of the HTML code that makes up that page. Ignore 99% of it and try to locate the element(s) that you want to target (use `Ctrl+F` on a PC or `⌘+F` on a Mac to find).

You can also inspect specific elements on the page by right-clicking on the page and selecting "Inspect" or "Inspect Element" from the context menu that pops up. Hover over elements in the "Elements" tab to highlight them on the page.

Open up a Chrome browser and inspect the table on the [first page of FDA's list of warning letters issued in 2019](https://www.fda.gov/ICECI/EnforcementActions/WarningLetters/2019/default.htm). Find the table we want to scrape.

Is it the only table on the page? If not, does it have any attributes that would allow you to target it?



### Import libraries

Step one is to _import_ two third-party Python libraries that will help us scrape this page:
- `requests` is the de facto standard for making HTTP requests, similar to what happens when you type a URL into a browser window and hit enter.
- `bs4`, or BeautifulSoup, is a popular library for parsing HTML into a data structure that Python can work with. In this case, though, we only need the `BeautifulSoup` class that comes with `bs4`, so we'll only import that. (Bonus: That's the convention you'll see when Googling for help.)

These libraries are installed separately from Python on a per-project basis ([read more about our recommendations for setting up Python projects here](https://docs.google.com/document/d/1cYmpfZEZ8r-09Q6Go917cKVcQk_d0P61gm0q8DAdIdg/edit#heading=h.od2v1nkge5t1)).

Run this cell (you'll only have to do this once):

In [None]:
import requests
from bs4 import BeautifulSoup

### Request the page

Next, we'll use the `get()` method of the `requests` library (which we just imported) to grab the web page.

While we're at it, we'll _assign_ all the stuff that comes back to a new variable using `=`.

The variable name is arbitrary, but it's usually good to pick something that describes whatever value it's pointing to.

Notice that the URL we're grabbing is wrapped in quotes, making it a _string_ that Python will interpret as text (as opposed to numbers, booleans, etc.). You can read up more on Python data types and variable assignment [here](Python%20syntax%20cheat%20sheet.ipynb).

Run this cell:

In [None]:
fda_page = requests.get('https://www.fda.gov/ICECI/EnforcementActions/WarningLetters/2019/default.htm')

Nothing appears to have happened, which is (usually) a good sign.

If you want to make sure that your request was successful, you can check the `status_code` attribute of the Python object that was returned:

In [None]:
fda_page.status_code

A `200` code means all is well. `404` means the page wasn't found, etc. ([Here's one of our favorite lists of HTTP status codes](https://http.cat/), [or here, if you prefer dogs](https://httpstatusdogs.com/).)
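
If you'd rather look up a code without leaving the notebook, Python's built-in `http` module knows the standard phrases. This is just a small sketch; the three codes below are examples picked for illustration.

```python
from http import HTTPStatus

phrases = {}

# look up the standard phrase for a few example status codes
for code in [200, 404, 500]:
    phrases[code] = HTTPStatus(code).phrase
    print(code, phrases[code])
```

The `requests` library also gives response objects a `raise_for_status()` method, which raises an error if the request came back with a 4xx or 5xx code.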

The object being stored as the `fda_page` variable came back with a lot of potentially useful information we could access. Today, we're mostly interested in the `.text` attribute -- the HTML that makes up the web page, same as if we'd viewed the page source. Let's take a look:

In [None]:
fda_page.text

### ✍️ Try it yourself

Use the code blocks below to experiment with requesting web pages and checking out the HTML that gets returned.

Some ideas to get you started:
- `'http://ire.org'`
- `'https://web.archive.org/web/20031202214318/http://www.tdcj.state.tx.us:80/stat/finalmeals.htm'`
- `'https://www.nrc.gov/reactors/operating/list-power-reactor-units.html'`

### Turn your HTML into soup

The HTML in the `.text` attribute of the response object is just a string -- a big ol' chunk of text.

Before we start targeting and extracting pieces of data in the HTML, we need to turn that chunk of text into a data structure that Python can work with. That's where the [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) (`bs4`) library comes in.

We'll create a new instance of the `BeautifulSoup` class we imported earlier, and we need to give it two things:
- The HTML we'd like to parse -- `fda_page.text`
- A string with the name of the type of parser to use -- `html.parser` is built into Python and usually fine, but [there are other options](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser)

We'll save the parsed HTML as a new variable, `soup`.

In [None]:
soup = BeautifulSoup(fda_page.text, 'html.parser')

Nothing happened, which is good! You can take a look at what `soup` is, but it looks pretty much like `fda_page.text`:

In [None]:
soup

If you want to be sure, you can use the Python function `type()` to check what sort of object you're dealing with:

In [None]:
# the `str` type means a string, or text
type(fda_page.text)

In [None]:
# the `bs4.BeautifulSoup` type means we successfully created the object
type(soup)

### ✍️ Try it yourself

Use the code blocks below to experiment with fetching HTML and turning it into soup (if you fetched some pages earlier and saved them as variables, that'd be a good start).

### Targeting and extracting data

Now that we have a BeautifulSoup object loaded up, we can go hunting for the specific HTML elements that contain the data we need. Our general strategy:
1. Find the main table with the data we want to grab
2. Get a list of rows (the `tr` element, which stands for "table row") in that table
3. Use a Python `for` loop to go through each table row and find the data inside it (`td`, or "table data")

To accomplish this, we'll use two `bs4` methods:
- [`find()`](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find), which returns the first element that matches whatever criteria you hand it
- [`find_all()`](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find-all), which returns a _list_ of elements that match the criteria. ([Here's how Python lists work](Python%20syntax%20cheat%20sheet.ipynb#Lists).)
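
To get a feel for the difference between the two methods before pointing them at the FDA page, here is a small sketch that runs both on a made-up table. The HTML and the `sample_` variable names are invented for illustration, so they won't overwrite anything else in the notebook.

```python
from bs4 import BeautifulSoup

# made-up HTML for illustration
sample_html = '''
<table>
  <tr><th>Name</th><th>City</th></tr>
  <tr><td>Ana</td><td>Austin</td></tr>
  <tr><td>Ben</td><td>Boston</td></tr>
</table>
'''

sample_soup = BeautifulSoup(sample_html, 'html.parser')

# find() returns only the first match
first_cell = sample_soup.find('td')

# find_all() returns a list of every match
all_cells = sample_soup.find_all('td')

print(first_cell)
print(len(all_cells))
```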

#### Find the table

To start with, we need to find the table. There are several ways to accomplish this, but because this is the only table on the page (view source and `Ctrl+F` to search for `<table` to confirm), we can simply say, "Look through the `soup` object and find the table tag."

Translated, the code is: `soup.find('table')`. While we're at it, save the results of that search to a new variable, `table`.

Run these cells:

In [None]:
table = soup.find('table')

In [None]:
table

#### Find the rows in the table

Next, use the `find_all()` method to drill down and get a list of rows in the table:

In [None]:
rows = table.find_all('tr')

In [None]:
rows

To see how many items are in this list -- in other words, how many rows are in the table -- you can use the `len()` function:

In [None]:
len(rows)

#### Loop through the rows and extract the data

Next, we can use a [`for` loop](Python%20syntax%20cheat%20sheet.ipynb#for-loops) to go through the list of rows and start grabbing data from each one.

Quick refresher on _for loop_ syntax: Start with the word `for` (lowercase), then a variable name to stand in for each item in the list that you're looping over, then the word `in` (lowercase), then the name of the list holding the items (`rows`, in our case), then a colon, then an indented block of code describing what we're doing to each item in the list.
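
As a quick illustration of that syntax, here is a sketch that loops over a made-up list of strings (not the real `rows` list) and does two things to each item inside the indented block.

```python
# made-up list for illustration
sample_items = ['date', 'company', 'subject']
collected = []

for item in sample_items:
    # everything indented under the `for` line runs once per item
    shout = item.upper()
    collected.append(shout)
    print(shout)
```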

Each piece of data in the row will be stored in a `td` tag, which stands for "table data." So inside the loop -- in the indented block -- we'll use the `find_all()` method to get a list of every `td` tag inside the row. And from there, we can access the content inside each tag.

Our goal is to end up with a _list_ of data for each row that we will eventually write out to a file. Typically you'd probably do the work of looping and inspecting the results, step by step, in one code cell. But to show the thinking of how you might approach this (and to practice the syntax), we'll start by just printing out each row and then build from there. (`print()` an empty line, too, to help you see what you're working with.)

In [None]:
for row in rows:
    print(row)
    print()

Notice that the first item that prints is the header row with the column labels, and that it contains `th` ("table header") tags instead of `td` ("table data") tags.

Because our next step in our `for` loop was going to be, "Find all of the `td` tags in this row," this is something we need to deal with. We could just tell the script "Find `td` _or_ `th` tags," or we could just skip the first row altogether. I usually just skip the first row using _list slicing_: adding square brackets with some instructions about which items in the list you want to select.

Here, the syntax would be: `rows[1:]`, which means, take everything in the `rows` list from the item in position 1 to the end. Like many programming languages, Python starts counting at 0, so the result will leave off the first item in the list -- i.e. the item in position 0, i.e. the headers.

In [None]:
for row in rows[1:]:
    print(row)
    print()
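
If the slicing syntax feels abstract, it may help to try it on a plain list first. This sketch uses a made-up list of strings standing in for the `rows` list.

```python
# made-up list standing in for the `rows` list
sample_rows = ['header row', 'first letter', 'second letter', 'third letter']

# position 0 is the first item
first_item = sample_rows[0]

# [1:] means "from position 1 to the end"
everything_else = sample_rows[1:]

print(first_item)
print(everything_else)
```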

Now we're cooking with gas. Let's start pulling out the data in each row. Start by using `find_all()` to grab a list of `td` tags:

In [None]:
for row in rows[1:]:
    cells = row.find_all('td')
    print(cells)
    print()

Now we have, for each row, a _list_ of `td` tags. Next step would be to look at the table and start grabbing specific values based on their position in the list and assigning them to human-readable variable names.

Quick refresher on list syntax: To access a specific item in a list, use square brackets `[]` and the index number of the item you'd like to access. For instance, to get the first item in the `cells` list, use `cells[0]`.

In [None]:
for row in rows[1:]:
    cells = row.find_all('td')
    date = cells[0]
    print(date)

This is returning the entire Tag object -- we just want the contents inside it. You can access the `.string` attribute of the tag to get the text inside:

In [None]:
for row in rows[1:]:
    cells = row.find_all('td')
    date = cells[0].string
    print(date)

In the next table cell (`cells[1]`), the `.string` attribute will give you the name of the company, which is something you'd probably want. But how to get the link to the actual letter? If you inspect the HTML, the URL is attached to the `href` attribute in the `a` tag inside the `td`.

The syntax to drill down to get the URL is: `cells[1].a['href']`.

In human words, this means:
- Go to the `cells` list (which contains all of the `td` tags in that `tr`) and grab the second `[1]` item: `cells[1]`
- Grab the `a` tag inside of that `td` tag: `cells[1].a`
- Access the `href` attribute attached to that `a` tag: `cells[1].a['href']`

Which gets us this far:

In [None]:
for row in rows[1:]:
    cells = row.find_all('td')
    date = cells[0].string
    company = cells[1].string
    link = cells[1].a['href']

    # create a list for your data to live in
    data_out = [date, company, link]
    print(data_out)
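
To see the drill-down in isolation, here is a sketch that runs the same steps on a single made-up table row. The date, company name and path are invented, and the real FDA markup may differ. One caveat worth knowing: `.string` returns `None` when a tag has more than one child (for example, a link plus some surrounding text), so the sketch also shows `get_text()`, which is one alternative for that case.

```python
from bs4 import BeautifulSoup

# made-up row for illustration; the real FDA markup may differ
sample_row_html = '''
<tr>
  <td>01/15/2019</td>
  <td><a href="/example/letter-123.htm">Example Company Inc.</a></td>
</tr>
'''

sample_row = BeautifulSoup(sample_row_html, 'html.parser').find('tr')
sample_cells = sample_row.find_all('td')

sample_date = sample_cells[0].string
sample_company = sample_cells[1].string
sample_link = sample_cells[1].a['href']

# get_text() also works when a tag has several children
sample_company_text = sample_cells[1].get_text(strip=True)

print([sample_date, sample_company, sample_link])
```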

One last thing on the URL: As it's stored, it's a _relative_ link -- it doesn't include the actual domain information (`https://www.fda.gov`). It'd be a good idea to prepend this to make it a _fully qualified_ URL before we do anything to it.

In Python, to concatenate two strings, just use a plus sign:

In [None]:
for row in rows[1:]:
    cells = row.find_all('td')
    date = cells[0].string
    company = cells[1].string
    link = 'https://www.fda.gov' + cells[1].a['href']

    # create a list for your data to live in
    data_out = [date, company, link]
    print(data_out)
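
The plus sign works fine here. As an alternative sketch, Python's built-in `urllib.parse.urljoin()` function can do the same job, and it leaves a link alone if it already includes the domain. The path below is made up for illustration.

```python
from urllib.parse import urljoin

base_url = 'https://www.fda.gov'
relative_link = '/example/letter-123.htm'  # made-up path

# plain concatenation, as in the cell above
concatenated = base_url + relative_link

# urljoin() gives the same result for a relative link
joined = urljoin(base_url, relative_link)

# and it doesn't double up the domain if the link is already fully qualified
already_full = urljoin(base_url, 'https://www.fda.gov/example/letter-123.htm')

print(concatenated)
print(joined)
print(already_full)
```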

### ✍️ Try it yourself

Now that you've gotten this far, see if you can isolate the other three pieces of data in each row.

In [None]:
for row in rows[1:]:
    cells = row.find_all('td')
    date = cells[0].string
    company = cells[1].string
    link = 'https://www.fda.gov' + cells[1].a['href']

    # issuing office
    
    # subject
    
    # close-out date
    
    # create a list for your data to live in
    # add your new items to this list
    data_out = [date, company, link]

    print(data_out)

### Write the results to file

Now that we've targeted our lists of data for each row, we can use Python's built-in [`csv`](https://docs.python.org/3/library/csv.html) module to write each list to a CSV file.

First, import the csv module.

In [None]:
import csv

Now define a list of headers to match the data (each column header will be a string) -- run this cell:

In [None]:
headers = ['letter_date', 'company', 'link', 'issuing_office', 'subject', 'close_out_date']

Now, using something called a `with` block, open a new CSV file to write to and write some code to do the following things:
- Create a `csv.writer` object
- Write out the list of headers using the `writerow()` method of the `csv.writer` object
- Drop in the `for` loop you just wrote and, instead of printing, use the `writerow()` method of the `csv.writer` object to write the list of data to file

In [None]:
# create a file called 'fda-data.csv' in write ('w') mode
# pass newline='' so the csv module controls line endings (this prevents blank lines between rows on a PC)
# and use the `as` keyword to name the open file object (the variable name is arbitrary)
with open('fda-data.csv', 'w', newline='') as outfile:
    # go to the csv module we imported and make a new .writer object attached to the open file
    # and save it to a variable
    writer = csv.writer(outfile)

    # write out the list of headers
    writer.writerow(headers)
    
    # paste in the for loop you wrote earlier here -- watch the indentation!
    # it should be at this indentation level, like this
    # for row in rows[1:]:
    #     cells = row.find_all('td')
    #     etc. ...
    # but at the end, instead of `print(data_out)`
    # make it `writer.writerow(data_out)`
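
If you'd like to see the whole pattern run end to end before wiring in your own loop, here is a sketch that writes two made-up rows to a throwaway file and reads them back. The sample data, the `sample_` names and the `sample-data.csv` file name are all invented for illustration; your real loop would build each `data_out` list from the table rows instead.

```python
import csv
import os
import tempfile

sample_headers = ['letter_date', 'company', 'link']

# made-up rows standing in for the `data_out` lists built in the loop
sample_data = [
    ['01/15/2019', 'Example Company Inc.', 'https://www.fda.gov/example/letter-123.htm'],
    ['01/16/2019', 'Another Example LLC', 'https://www.fda.gov/example/letter-456.htm'],
]

# write to a temporary folder so the sketch doesn't touch your real output file
sample_path = os.path.join(tempfile.mkdtemp(), 'sample-data.csv')

with open(sample_path, 'w', newline='') as outfile:
    writer = csv.writer(outfile)
    writer.writerow(sample_headers)
    for data_out in sample_data:
        writer.writerow(data_out)

# read the file back in to check what was written
with open(sample_path, newline='') as infile:
    written_rows = list(csv.reader(infile))

print(written_rows)
```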

If you look in the folder, you should see a new file: `fda-data.csv`. Hooray!

🎉 🎉 🎉

### ✍️ Try it yourself

Putting it all together:
- Find a website you'd like to scrape
- Use `requests` to fetch the HTML
- Use `bs4` to parse the HTML and isolate the data you're interested in
- Use `csv` to write the data to file