# Our first web scraper

### So, how will we scrape [this website](http://www.nrc.gov/reactors/operating/list-power-reactor-units.html)?

1. We will import some libraries that:
    - Act like an internet browser
    - Parse HTML code
    - Read and write CSV files
2. Grab the contents of the web page.
3. Parse the contents of the web page and target only the data table.
4. Open a blank CSV file to store the information in the data table.
5. Loop through each row in the online data table:
    - Extract each element (cell) and store it in a variable
    - Write those variables as a row into the CSV file
6. Close the CSV file.
7. Rejoice.

### Why will we scrape this way?

While code-free tools are handy in a pinch, scripts written in Python or another language are more flexible and adaptable. They can also run automatically in the background on a schedule. And you don't have to worry about a service or tool disappearing and rendering all your hard work for naught.

### 1. Import libraries to do the heavy lifting

We're going to bring in three modules to help us scrape this page. Two are third-party packages; **csv** ships with Python.

- **requests** will act like an internet browser and collect HTML
- **BeautifulSoup** will parse the HTML code and allow us to isolate a data table
- **csv** will allow us to write what we find to a nicely formatted file
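As a minimal sketch, the imports might look like the following. It assumes the `requests` and `beautifulsoup4` packages are already installed; the `bs4` import path is the one used by BeautifulSoup 4.

```python
import csv                      # standard library: read and write CSV files

import requests                 # third-party: fetches pages like a browser would
from bs4 import BeautifulSoup   # third-party: parses HTML
```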

### 2. Grab the contents of a web page

The page we want is located here: http://www.nrc.gov/reactors/operating/list-power-reactor-units.html

**requests** has a method called *get*, which is analogous to a browser like Firefox or Chrome fetching the HTML code for display.

We can quickly check whether we've gotten the expected raw HTML code by looking at the response's *text* attribute, which holds the HTML code as plain text.
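Here is one way the fetch could be sketched. The helper name `fetch_html` and the 30-second timeout are assumptions made for illustration, not part of the original walkthrough.

```python
import requests

URL = "http://www.nrc.gov/reactors/operating/list-power-reactor-units.html"


def fetch_html(url):
    """Fetch a page the way a browser would and return its HTML as text."""
    response = requests.get(url, timeout=30)  # timeout value is an assumption
    response.raise_for_status()               # stop early on a 404 or 500
    return response.text                      # the raw HTML as plain text
```

Calling `html = fetch_html(URL)` and then `print(html[:500])` would show the first few hundred characters, which is usually enough to confirm the right page came back.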

### 3. Parse the HTML and target the table

Now we can send our HTML code to **BeautifulSoup**, which is specifically designed to navigate the structural elements of the document, breaking off the pieces we choose. In this case, we are after the web page's only table -- it has all the data we need.

**BeautifulSoup** has methods called *find* and *find_all* designed to target HTML tags. While *find* picks up the first matching instance, *find_all* locates all matching instances and returns them as a kind of list. We will use this to our advantage in a moment.

Again, we can check to see if we've isolated the table.
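The sketch below shows *find* and *find_all* at work. To keep it runnable without a network connection, it parses a tiny hypothetical HTML snippet (the `sample_html` string and its contents are invented) rather than the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the fetched page.
sample_html = """
<html><body>
<h1>Hypothetical page</h1>
<table>
  <tr><th>Name</th><th>Type</th></tr>
  <tr><td>Example Unit 1</td><td>PWR</td></tr>
  <tr><td>Example Unit 2</td><td>BWR</td></tr>
</table>
</body></html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

table = soup.find("table")     # find: the first matching tag only
rows = table.find_all("tr")    # find_all: every matching tag, list-like

print(table.prettify())        # check that we isolated the table
print(len(rows))               # one header row plus two data rows
```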

### 4. Open a blank CSV file for data storage

We need a place for all this data to go once we start scraping it. We can open a new blank file and then use **csv**'s *writer* function to create an object (stay with me now) that we can order around with some basic commands, making it write data to the new file.

Let's write our inaugural row to the file: the header that specifies what all the different columns are. We'll use **csv**'s *writerow* to send a list of what we would like written to the file: `"NAME", "LINK", "DOCKET", "LICENSE_NUM", "TYPE", "LOCATION", "OWNER", "REGION"`
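A sketch of this step, assuming Python 3 (where `newline=""` is the recommended way to open a file for the **csv** module). The filename `reactors.csv` is a hypothetical choice.

```python
import csv

# Open a new blank file for writing; the filename is an assumption.
output_file = open("reactors.csv", "w", newline="")

# The writer object is what we "order around".
writer = csv.writer(output_file)

# The inaugural row: the column headers.
writer.writerow(["NAME", "LINK", "DOCKET", "LICENSE_NUM",
                 "TYPE", "LOCATION", "OWNER", "REGION"])
```

The file is deliberately left open here so more rows can be written to it later.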

### 5. Loop through each row in the table, extract data and write it to the file

Here comes the tricky part: we have to actually scrape the data out of the table we isolated.

To do that, we need to not only loop through every row in the table, but also each cell in every row.

Remember: to do something to each item in a list without having to retype it repeatedly, we use this basic syntax, shown here in pseudocode:

```
for [a list item] in [some list]:
    do a thing with [a list item]
```

That thing will then happen with the first list item, the second, the third, etc., until the end of the list is reached.
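As a concrete instance of that pattern, here is a small loop over a hypothetical list of reactor types (the values are invented for illustration):

```python
reactor_types = ["PWR", "BWR", "PWR"]  # hypothetical list

lowercased = []
for reactor_type in reactor_types:
    # This body runs once per item, in order.
    lowercased.append(reactor_type.lower())

print(lowercased)
```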

Let's grab one row from this table to see what we might have to do in order to extract the text from each cell into a variable.

Let's home in on that first cell, which contains a reactor name, a docket number and a partial URL leading to the reactor page.

How could we pick out the text components and the URL from the contents? By using `BeautifulSoup`'s `.text` attribute to isolate the text within HTML tags and `.get` to pull the URL out of the link's `href` attribute.
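The sketch below tries this on a hypothetical row. The markup is an assumption about the general shape of the first cell (a link, a line break, then the docket number), and the names and numbers in it are invented.

```python
from bs4 import BeautifulSoup

# Hypothetical row; the real page's markup may differ.
sample_row = """
<table><tr>
  <td><a href="/example/unit1.html">Example Unit 1</a><br/>05000001</td>
  <td>XX-1</td>
</tr></table>
"""

row = BeautifulSoup(sample_row, "html.parser").find("tr")
cell_list = row.find_all("td")

first_cell = cell_list[0]
link_tag = first_cell.find("a")

name = link_tag.text.strip()              # text inside the <a> tag
link = link_tag.get("href")               # value of the href attribute
docket = first_cell.contents[-1].strip()  # bare text after the <br/>

print(name, link, docket)
```

Picking the docket number with `contents[-1]` relies on it being the last thing in the cell, which holds for this sample but should be checked against the real HTML.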

Based on this, you should have an idea of what the process will be like for the other cells in this row. Instead of extracting information from `cell_list[0]`, we'll be going into `cell_list[1]`, `cell_list[2]`, etc.

So now we should be able to dive into the table with this long-ish list of things to grab and pass into variables. We'll make a list of HTML snippets wrapped in `<tr>` tags (the table rows), and then, within each row, a list of the actual data cells wrapped in `<td>` tags. We'll crawl through those, extracting data and passing it to variables. At the end of each iteration, we'll write the row to the output file; the loop will then start all over again with the next row.
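Putting those pieces together, the whole loop might be sketched like this. It runs on a small hypothetical table (invented values, assumed column order) and writes to an in-memory `io.StringIO` buffer rather than a file, so it leaves nothing behind; the real scraper would use the fetched page and the opened CSV file instead.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical table standing in for the real page.
SAMPLE_HTML = """
<table>
  <tr><th>Name</th><th>License</th><th>Type</th>
      <th>Location</th><th>Owner</th><th>Region</th></tr>
  <tr><td><a href="/example/unit1.html">Example Unit 1</a><br/>05000001</td>
      <td>XX-1</td><td>PWR</td><td>Sampletown, ST</td>
      <td>Example Power Co.</td><td>4</td></tr>
  <tr><td><a href="/example/unit2.html">Example Unit 2</a><br/>05000002</td>
      <td>XX-2</td><td>BWR</td><td>Otherville, ST</td>
      <td>Example Power Co.</td><td>1</td></tr>
</table>
"""

table = BeautifulSoup(SAMPLE_HTML, "html.parser").find("table")

output = io.StringIO()  # in-memory stand-in for the output file
writer = csv.writer(output)
writer.writerow(["NAME", "LINK", "DOCKET", "LICENSE_NUM",
                 "TYPE", "LOCATION", "OWNER", "REGION"])

for row in table.find_all("tr"):
    cell_list = row.find_all("td")
    if not cell_list:
        continue  # the header row has <th> cells, not <td>

    name = cell_list[0].find("a").text.strip()
    link = cell_list[0].find("a").get("href")
    docket = cell_list[0].contents[-1].strip()
    license_num = cell_list[1].text.strip()
    reactor_type = cell_list[2].text.strip()
    location = cell_list[3].text.strip()
    owner = cell_list[4].text.strip()
    region = cell_list[5].text.strip()

    writer.writerow([name, link, docket, license_num,
                     reactor_type, location, owner, region])

print(output.getvalue())
```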

**One point of weirdness**: This webpage is encoded in UTF-8, meaning it can contain characters that fall outside the western ASCII set. Python 2 doesn't like this. There are some characters that aren't part of ASCII in the location and owner columns, so we'll have to encode them before they are written to the CSV.
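To illustrate what encoding does, here is a small sketch using a hypothetical location string (the value is invented). Calling `.encode("utf-8")` turns text into UTF-8 bytes. Under Python 3 this manual step is normally unnecessary: the **csv** module accepts text directly, and the encoding is given to `open()` instead, for example `open(path, "w", newline="", encoding="utf-8")`.

```python
# Hypothetical value containing a non-ASCII character (e with an acute accent).
location = u"San Jos\u00e9, CA"

encoded = location.encode("utf-8")  # text -> UTF-8 bytes

print(repr(encoded))
```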

This loop has done all the work! Just one thing left to do:

### 6. Close the file

Until you close the file, some of the data just hangs out in the computer's memory; closing the file commits it all to disk.
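A minimal sketch of closing the file, assuming Python 3. The filename `reactors_demo.csv` and the two-column header are hypothetical, chosen only to keep the example short.

```python
import csv

output_file = open("reactors_demo.csv", "w", newline="")  # hypothetical filename
writer = csv.writer(output_file)
writer.writerow(["NAME", "REGION"])

# Closing flushes anything still buffered in memory out to disk.
output_file.close()

print(output_file.closed)
```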