# Beginning web scraping: The FDIC's list of failed banks

This notebook walks you through the basics of web scraping by extracting the Federal Deposit Insurance Corporation's [list of failed banks](https://www.fdic.gov/resources/resolutions/bank-failures/failed-bank-list/). For the purposes of the exercise, we are going to ignore the fact that the data can be downloaded directly.

## Import libraries

You will need three libraries to scrape the website: [csv](https://docs.python.org/3/library/csv.html), [Requests](https://requests.readthedocs.io/en/latest/) and [BeautifulSoup4](https://beautiful-soup-4.readthedocs.io/en/latest/).
- **csv**: This library handles the reading and writing of CSV files. It is part of the standard library, meaning it comes packaged with Python, unlike the other two libraries.
- **Requests**: Requests is what you will use to actually get the webpage from the Internet. It needs to be installed before you can use it - `pip install requests`.
- **BeautifulSoup4**: Also known as **bs4**, this library is used to parse HTML and extract data from it. It also needs to be installed before use - `pip install beautifulsoup4`.

If you have this notebook running in Jupyter Lab, you should have all the needed libraries installed. If not, refer to the [README](./README.md) in this repository.

In [1]:
import csv

from bs4 import BeautifulSoup
import requests

## Making a web request

The first step of every scrape is requesting the web page we want to extract information from. We do this by providing a url and using Requests to make either a **get** or **post** request.
> A **get** request is the most common type of request. It includes all the information a web server needs to return content to your browser in the url. Another common type of request is a **post** request. In this case additional information is sent to the web server in the body of the request rather than in the url. This is most often done through a form filled out by the user - a common example would be a login form.
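
To make the difference concrete, here is a minimal sketch that builds one request of each kind without sending anything. It uses Python's standard `urllib` rather than Requests, and the address and parameters are hypothetical, chosen only for illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical address and parameters, used only for illustration.
base_url = 'https://example.com/search'
params = {'q': 'failed banks', 'page': 2}

# GET: the parameters are encoded into the url itself.
get_request = Request(base_url + '?' + urlencode(params))

# POST: the same parameters travel in the body of the request instead.
post_request = Request(base_url, data=urlencode(params).encode('utf-8'))

# Nothing is sent over the network; we only inspect the prepared requests.
print(get_request.get_method(), get_request.full_url)
print(post_request.get_method(), post_request.full_url, post_request.data)
```

The url of the post request stays clean, while the get request carries everything after the `?`.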

To make the web request, store the url in a variable called `url`. Next, make the request using the `requests.get()` function and assign the result to a variable called `response`. Finally, check that there were no problems making the request using the `response.raise_for_status()` method, which raises an error if the server returned an error status code.

In [2]:
url = 'https://www.fdic.gov/resources/resolutions/bank-failures/failed-bank-list/'

response = requests.get(url)
response.raise_for_status()

The `response` variable now contains a bunch of information about the web request we just made, but for now we are only interested in the HTML we just downloaded.

## What is HTML

[HTML](https://developer.mozilla.org/en-US/docs/Learn/Getting_started_with_the_web/HTML_basics), or HyperText Markup Language, is what is used to display content on a web page. It consists of a series of **elements** or **tags** that describe how information should be structured on a page. These tags can be recognized by the opening `<` and closing `>` angle brackets. Tags can be nested within each other, creating a hierarchy or tree of the content. A very basic web page looks like this:
```html
<html>
  <head>
    <title>This is an example page</title>
  </head>
  <body>
    <p>This text is stored in a paragraph element.</p>
  </body>
</html>
```
In this example, tags operate in pairs. The `<html>` tag signifies the beginning of the HTML while the `</html>` tag signifies the end. Everything in between those two tags is considered part of the HTML and the web page. Everything between `<head>` and `</head>` contains metadata about the page, while everything in between `<body>` and `</body>` is the content actually displayed on the page by a web browser. In this case it is a single paragraph, denoted by the `<p>` tag.

Note the indentation. It indicates the hierarchy of content - the `<title>` tag belongs in the `<head>` of the page, but not the `<body>`. This hierarchy is also called the tree.
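
One way to see the tree is to have Python print each tag indented by how deeply it is nested. The sketch below uses the standard library's `html.parser` module; the `TreePrinter` class is a hypothetical helper written for this illustration, and it assumes every opening tag has a matching closing tag, which is true for the example page but not for all real HTML.

```python
from html.parser import HTMLParser


class TreePrinter(HTMLParser):
    """Collect each opening tag, indented by how deeply it is nested."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        # Record the tag at the current depth, then step one level deeper.
        self.lines.append('  ' * self.depth + tag)
        self.depth += 1

    def handle_endtag(self, tag):
        # A closing tag moves us back up one level.
        self.depth -= 1


# The example page from above, written on one line with no indentation.
page = (
    '<html><head><title>This is an example page</title></head>'
    '<body><p>This text is stored in a paragraph element.</p></body></html>'
)

printer = TreePrinter()
printer.feed(page)
print('\n'.join(printer.lines))
```

Even though the input has no indentation at all, the output recovers the same nesting shown in the example, because the hierarchy comes from the tags rather than the whitespace.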

## Parsing HTML

The HTML we just downloaded is stored in `response.text`. We need to load that into a parser so we can easily navigate it and extract the information we are looking for. To do this we load the data into BeautifulSoup and assign it to a variable called `soup`. We also name the parser explicitly, here Python's built-in `html.parser`, so the result does not depend on which parsers happen to be installed. We are also going to create two new variables and assign empty lists to them. The `fieldnames` variable will hold a list of column names and `results` will hold all of our parsed data after we extract it from the web page.

In [3]:
soup = BeautifulSoup(response.text, 'html.parser')

fieldnames = []
results = []

### Extracting the table

The information we want to extract is in a table on the web page. A table is often structured like this in HTML:

```html
<table>
    <thead>
        <tr> 
            <th>Pet Owner</th>
            <th>Pet Type</th>
            <th>Pet Name</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>Eric</td>
            <td>Dog</td>
            <td>Marco</td>
        </tr>
        <tr>
            <td>Joe</td>
            <td>Dog</td>
            <td>Leopold</td>
        </tr>
        <tr>
            <td>Cheryl</td>
            <td>Dog</td>
            <td>Tank</td>
        </tr>
    </tbody>
</table>
```
The table's column names are stored in `<thead>` while the actual data is stored in `<tbody>`. Each row of the table is signified by a `<tr>` tag. Fieldnames are enclosed in `<th>` tags while each individual data point is stored in `<td>` tags.

We can use this structure to start finding and extracting the data. Start by finding the table itself using the `soup.find()` method. There is only one `<table>` on this page, so we can use `soup.find()` safely. If there were more than one table on the page, `soup.find()` would return only the first `<table>` element it finds.

In [4]:
table = soup.find('table')
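
To get a feel for `find()` before working with the live page, here is a small sketch that runs on a cut-down version of the example pet table rather than the FDIC data. The variable names are chosen for illustration so they do not overwrite the notebook's own `soup` and `table`.

```python
from bs4 import BeautifulSoup

# A cut-down version of the example pet table, not the FDIC page.
html = '''
<table>
  <thead><tr><th>Pet Owner</th><th>Pet Name</th></tr></thead>
  <tbody>
    <tr><td>Eric</td><td>Marco</td></tr>
    <tr><td>Joe</td><td>Leopold</td></tr>
  </tbody>
</table>
'''

example_soup = BeautifulSoup(html, 'html.parser')

first_cell = example_soup.find('td')      # only the first match
all_cells = example_soup.find_all('td')   # every match, in page order
missing = example_soup.find('caption')    # no match returns None

print(first_cell.text)
print([td.text for td in all_cells])
print(missing)
```

Because `find()` returns `None` when nothing matches, it is worth checking the result before calling further methods on it when scraping a page you do not control.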

### Extracting field names
Next use the information stored in the `table` variable to find the `<thead>` element containing the column names.

In [5]:
thead = table.find('thead')

Here is a view of what the HTML in the table head looks like:
```html
<thead class="dataTables-content-header bg-blue">
    <tr>
        <th class="text-no-wrap text-left padding-left-2 padding-right-105 padding-top-2 padding-bottom-1">
            <p class="font-serif-xs text-light margin-0 padding-0 text-white">
                <span class="dtfullname">Bank Name</span>
                <span class="dtmobilename">Bank</span>
            </p>
        </th>
        <th class="text-no-wrap text-left padding-left-2 padding-right-105 padding-top-2 padding-bottom-1">
            <p class="font-serif-xs text-light margin-0 padding-0 text-white">
                <span class="dtfullname">City</span>
                <span class="dtmobilename">City</span>
            </p>
        </th>
        <th class="text-no-wrap text-left padding-left-2 desktop:padding-left-1 padding-right-105 padding-top-2 padding-bottom-1">
            <p class="font-serif-xs text-light margin-0 padding-0 text-white">
                <span class="dtfullname">State</span>
                <span class="dtmobilename">St</span>
            </p>
        </th>
        <th class="text-no-wrap text-left padding-left-2 padding-right-105 padding-top-2 padding-bottom-1">
            <p class="font-serif-xs text-light margin-0 padding-0 text-white">
                <span class="dtfullname">Cert</span>
                <span class="dtmobilename">Cert</span>
            </p>
        </th>
        <th class="text-no-wrap text-left padding-left-2 padding-right-105 padding-top-2 padding-bottom-1">
            <p class="font-serif-xs text-light margin-0 padding-0 text-white">
                <span class="dtfullname">Acquiring Institution</span>
                <span class="dtmobilename">AI</span>
            </p>
        </th>
        <th class="text-no-wrap text-left padding-left-2 padding-right-105 padding-top-2 padding-bottom-1">
            <p class="font-serif-xs text-light margin-0 padding-0 text-white">
                <span class="dtfullname">Closing Date</span>
                <span class="dtmobilename">Closing</span>
            </p>
        </th>
        <th class="text-border-right white text-no-wrap text-left padding-left-2 padding-right-105 padding-top-2 padding-bottom-1">
            <p class="font-serif-xs text-light margin-0 padding-0 text-white">
                <span class="dtfullname">Fund</span>
                <span class="dtmobilename">Fund</span>
            </p>
        </th>
    </tr>
</thead>
```

Each column name is stored within a `<th>` tag, but there are two versions - one for desktop and another for mobile. We need to find all `<th>` tags and loop through them, extracting only the column name meant for desktop display since it contains the most information. We use `find_all()` instead of `find()` since we want to capture every occurrence of the `<th>` tag and not just the first one. For each `<th>` element we navigate the HTML hierarchy by selecting the `<p>` tag and then the first `<span>` element - the one meant for desktop display. Then we extract the fieldname from the result and add it to our list of fieldnames. In BeautifulSoup we access the text inside an element using the `.text` attribute.

We are going to add one last column name to `fieldnames` - **url**. This column will be for the link we see for each bank in the data.

In [6]:
for th in thead.find_all('th'):
    fieldname = th.p.span.text
    fieldnames.append(fieldname)

fieldnames.append('url')
fieldnames

['Bank Name',
 'City',
 'State',
 'Cert',
 'Acquiring Institution',
 'Closing Date',
 'Fund',
 'url']
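
The dotted navigation `th.p.span` picks the first matching child at each step. As a minimal sketch, here is the same idea on a single header cell, simplified from the markup above with most class attributes trimmed. It also shows an alternative that selects the `<span>` by its class, which is an option rather than what this notebook uses.

```python
from bs4 import BeautifulSoup

# One header cell, simplified from the FDIC markup shown above.
cell_html = '''
<th>
  <p>
    <span class="dtfullname">State</span>
    <span class="dtmobilename">St</span>
  </p>
</th>
'''

example_th = BeautifulSoup(cell_html, 'html.parser').th

# example_th.p.span walks down to the first <span>, the desktop name.
desktop_name = example_th.p.span.text

# Selecting by class asks for the desktop or mobile name explicitly.
by_class = example_th.find('span', class_='dtfullname').text
mobile_name = example_th.find('span', class_='dtmobilename').text

print(desktop_name, by_class, mobile_name)
```

Selecting by class would keep working if the order of the two `<span>` elements ever changed, at the cost of depending on the site's class names.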

This list of field names will go into our `results` list so they can be written out to our CSV later.

In [7]:
results.append(fieldnames)

### Extracting the data

Next we are going to extract the data from the table. Start by isolating the `<tbody>` element and finding all rows (`<tr>` tags) within the table body. Notice again we are using `find` to find a single element and `find_all` to find multiple elements.

In [8]:
tbody = table.find('tbody')
trs = tbody.find_all('tr')

The `trs` variable is now a list of all the rows in the table's body. Here is what the first row looks like:

```html
<tr>
    <td>
        <a href="/resources/resolutions/bank-failures/failed-bank-list/citizensbank.html">Citizens Bank</a>
    </td>
    <td>Sac City</td>
    <td>IA</td>
    <td>8758</td>
    <td>Iowa Trust &amp; Savings Bank</td>
    <td>November 3, 2023</td>
    <td>10545</td>
</tr>
```

For each row we need to do the following:
- Create a variable `values` to hold the information we extract.
- Find all `<td>` tags, which contain the actual data values.
- Loop through the `<td>` tags to extract each one's value and store it in the `values` variable.
- Find the link - `<a>` tag - in the row and extract its **href**, the address of the link.
- Repair the link so it has the full url before adding it to our values.
- Finally, add the list of `values` to our `results` variable so it can be written out to a CSV.


To do this we will need a couple of `for` loops, one within the other.

In [9]:
for tr in trs:
    values = []
    tds = tr.find_all('td')

    for td in tds:
        values.append(td.text)

    bank_link = tr.find('a')
    href = bank_link['href']
    bank_url = 'https://www.fdic.gov' + href
    values.append(bank_url)

    results.append(values)
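
The loop above repairs each link by gluing the site address onto the front of the href, which works because every href on this page starts with `/`. As an alternative sketch, the standard library's `urljoin` handles that case and others. The second and third hrefs below are hypothetical variants, not taken from the FDIC page.

```python
from urllib.parse import urljoin

page_url = 'https://www.fdic.gov/resources/resolutions/bank-failures/failed-bank-list/'

# A root-relative href, like the one in the first row shown above.
root_relative = '/resources/resolutions/bank-failures/failed-bank-list/citizensbank.html'
# Hypothetical variants another site might use instead.
page_relative = 'citizensbank.html'
already_absolute = 'https://example.com/other.html'

print(urljoin(page_url, root_relative))
print(urljoin(page_url, page_relative))
print(urljoin(page_url, already_absolute))
```

The first two calls produce the same full url, and the third leaves the absolute link untouched, so the repair step does not need to know in advance which style of link a site uses.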

Do a couple of quick checks.

Count number of records:

In [10]:
len(results)

569

View the first five results:

In [11]:
results[:5]

[['Bank Name',
  'City',
  'State',
  'Cert',
  'Acquiring Institution',
  'Closing Date',
  'Fund',
  'url'],
 ['Citizens Bank',
  'Sac City',
  'IA',
  '8758',
  'Iowa Trust & Savings Bank',
  'November 3, 2023',
  '10545',
  'https://www.fdic.gov/resources/resolutions/bank-failures/failed-bank-list/citizensbank.html'],
 ['Heartland Tri-State Bank',
  'Elkhart',
  'KS',
  '25851',
  'Dream First Bank, N.A.',
  'July 28, 2023',
  '10544',
  'https://www.fdic.gov/resources/resolutions/bank-failures/failed-bank-list/heartlandtristate.html'],
 ['First Republic Bank',
  'San Francisco',
  'CA',
  '59017',
  'JPMorgan Chase Bank, N.A.',
  'May 1, 2023',
  '10543',
  'https://www.fdic.gov/resources/resolutions/bank-failures/failed-bank-list/first-republic.html'],
 ['Signature Bank',
  'New York',
  'NY',
  '57053',
  'Flagstar Bank, N.A.',
  'March 12, 2023',
  '10540',
  'https://www.fdic.gov/resources/resolutions/bank-failures/failed-bank-list/signature-ny.html']]
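
Keep in mind that `len(results)` counts the header row as well as the banks. One more check that can be useful is confirming every row has as many values as there are column names. The sketch below runs on a small stand-in list, including one made-up short row to show what a problem looks like. In the notebook the same logic could be pointed at `results`.

```python
# A small stand-in for the results list: header first, then data rows.
sample_results = [
    ['Bank Name', 'City', 'State'],
    ['Citizens Bank', 'Sac City', 'IA'],
    ['Signature Bank', 'New York', 'NY'],
    ['Hypothetical Bank', 'Nowhere'],  # made-up short row to trigger the check
]

header, rows = sample_results[0], sample_results[1:]

# Collect the position of any row whose length differs from the header's.
bad_rows = [i for i, row in enumerate(rows, start=1) if len(row) != len(header)]

print(len(rows), 'data rows;', 'problem rows at', bad_rows)
```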

## Write out the results

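Now we are ready to write our data out to a CSV file. To do this we open a file in write mode and use Python's **csv** library to write out the results. Passing `newline=''` to `open()` follows the recommendation in the csv documentation and prevents blank lines from appearing between rows on some platforms.

In [12]:
with open('./data/raw/fdic_failed_banks.csv', 'w', newline='') as outfile: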
    output = csv.writer(outfile)
    output.writerows(results)
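
Several bank names contain commas, such as "Dream First Bank, N.A.", which could look like a problem for a comma-separated file. The sketch below shows how the **csv** library deals with that. It writes to an in-memory buffer instead of a file on disk and uses just two columns, both choices made to keep the illustration small.

```python
import csv
import io

rows = [
    ['Bank Name', 'Acquiring Institution'],
    ['Heartland Tri-State Bank', 'Dream First Bank, N.A.'],
]

# An in-memory buffer stands in for the file on disk.
buffer = io.StringIO(newline='')
csv.writer(buffer).writerows(rows)
print(buffer.getvalue())

# Reading the text back recovers the original values, comma included.
buffer.seek(0)
records = list(csv.DictReader(buffer))
print(records[0]['Acquiring Institution'])
```

The writer wraps any value containing a comma in double quotes, and the reader removes them again, which is why it is safer to use the library than to join values with commas by hand.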

## Conclusion
Congratulations! You've written your first web scraper. Websites are not always this easy to scrape. If you are having difficulty scraping a particular site, please reach out to Big Local News and we can help answer any questions.