# Scrape FERC enforcement cases

By [Ben Welsh](mailto:ben.welsh@latimes.com)

This script was written as part of an analysis conducted by the Los Angeles Times. For more information, refer to the [final notebook](02_analyze.ipynb).

### Import Python tools

In [1]:
import requests
import pandas as pd
from bs4 import BeautifulSoup

### Download the penalties homepage

This contains a navigation bar with links to pages for individual years.

In [2]:
homepage_url = "https://www.ferc.gov/enforcement/civil-penalties/civil-penalty-action.asp"

In [3]:
homepage_html = requests.get(homepage_url).content

In [4]:
homepage_soup = BeautifulSoup(homepage_html, 'html5lib')

Zoom in on the table that holds the navigation bar with the list of years

In [5]:
homepage_table = homepage_soup.find("table", id="right_module")

In [6]:
homepage_years = homepage_table.find_all("li")

The first list item will be the year of the current page. The rest will be previous years with links.

In [7]:
current_year = int(homepage_years[0].contents[0].strip())

Pull out all the previous years

In [8]:
previous_years = {
    int(i.find("a").contents[0].strip()): i.find("a")['href']
    for i in homepage_years[1:]
}
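
To make the two steps above concrete, here is a minimal sketch that runs the same logic against a small, made-up HTML fragment. The markup and the `/example/...` links are assumptions for illustration, not the real FERC page, and the built-in `html.parser` is used here only so the sketch does not depend on `html5lib`.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment that mimics the shape described above:
# the first <li> is the current year as plain text, the rest are links.
sample_html = """
<table id="right_module">
  <tr><td>
    <ul>
      <li> 2019 </li>
      <li><a href="/example/2018.asp">2018</a></li>
      <li><a href="/example/2017.asp">2017</a></li>
    </ul>
  </td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")
sample_items = sample_soup.find("table", id="right_module").find_all("li")

# The first item carries no link, only the year of the page itself
sample_current_year = int(sample_items[0].get_text().strip())

# Every later item maps a year to the href of its page
sample_previous_years = {
    int(li.find("a").get_text().strip()): li.find("a")["href"]
    for li in sample_items[1:]
}
```

If the live page ever changes so that the first item also contains a link, or a list item holds something other than a year, the `int()` conversion is the place where this approach would fail.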

Combine the soup for all the years

In [9]:
soup_dict = {current_year: homepage_soup}

In [10]:
for year, url in previous_years.items():
    year_url = "https://www.ferc.gov{}".format(url)
    year_html = requests.get(year_url).content
    year_soup = BeautifulSoup(year_html, 'html5lib')
    soup_dict[year] = year_soup
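
The loop above builds each URL by gluing the href onto the domain, which works as long as every href starts with a slash. As a hedged alternative, `urllib.parse.urljoin` from the standard library resolves root-relative and already-absolute links alike. The helper name `build_year_url` and the sample hrefs below are hypothetical.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.ferc.gov"


def build_year_url(href, base=BASE_URL):
    """Resolve a link from the navigation bar against the site root."""
    return urljoin(base, href)


# Hypothetical hrefs, for illustration only
root_relative = build_year_url("/example/2018.asp")
already_absolute = build_year_url("https://www.ferc.gov/example/2017.asp")
```

With plain string formatting, the second href would come out with the domain repeated twice; `urljoin` leaves an absolute URL untouched.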

In [11]:
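def parse_table(year, soup):
    """Return one list of cleaned cell values per data row, prefixed with the year."""
    container = soup.find("div", attrs={"class": "container"})
    table = container.find("table")
    raw_rows = table.find_all("tr")
    clean_rows = []
    # Skip the first row, which holds the column headers
    for row in raw_rows[1:]:
        raw_cells = row.find_all("td")
        clean_cells = [year]
        for cell in raw_cells:
            # Collapse line breaks and runs of spaces into single spaces
            cleaned = " ".join(cell.get_text().strip().split())
            clean_cells.append(cleaned)
        clean_rows.append(clean_cells)
    return clean_rows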
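
The cleaning step inside `parse_table` is easy to miss, so here it is in isolation. This is a small sketch with a hypothetical helper name, `collapse_whitespace`, and made-up cell text; it shows how splitting on whitespace and rejoining with single spaces flattens line breaks, tabs and non-breaking spaces left over from the HTML.

```python
def collapse_whitespace(text):
    """Trim the text and squeeze every run of whitespace down to one space."""
    return " ".join(text.strip().split())


# Made-up cell text with the kind of stray whitespace HTML tables often carry
messy_cell = "  Example   Energy\n\t LLC\xa0(Docket) "
tidy_cell = collapse_whitespace(messy_cell)
```

Because `str.split()` with no arguments treats any Unicode whitespace as a separator, `tidy_cell` comes out as `"Example Energy LLC (Docket)"`.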

Parse and combine all the rows

In [12]:
row_list = []

In [13]:
for year, soup in soup_dict.items():
    year_rows = parse_table(year, soup)
    row_list.extend(year_rows)
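
Before handing the combined rows to pandas, it can help to confirm that every row has the year plus three cells, since the column names assigned later assume exactly four fields. The sketch below is one way to check; the helper name `summarize_row_widths` and the sample rows are hypothetical.

```python
from collections import Counter


def summarize_row_widths(rows):
    """Count how many rows have each number of fields."""
    return Counter(len(row) for row in rows)


# Made-up rows in the shape parse_table returns:
# [year, subject, sanctions, description]
sample_rows = [
    [2019, "Example Energy LLC", "$1,000 civil penalty", "Made-up description."],
    [2018, "Sample Power Co.", "$2,500 civil penalty", "Another made-up description."],
    [2018],  # what a table row with no <td> cells would produce
]

widths = summarize_row_widths(sample_rows)
unexpected = [row for row in sample_rows if len(row) != 4]
```

Any rows that land in `unexpected` are worth inspecting by hand, because they would otherwise be padded with missing values or trigger a length mismatch when the columns are named.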

Export it

In [14]:
df = pd.DataFrame(row_list)

In [15]:
df.columns = [
    'raw_year',
    'raw_subject',
    'raw_sanctions',
    'raw_description'
]

In [16]:
df.sort_values(["raw_year","raw_subject"], ascending=False).to_csv(
    "./input/scraped-ferc-civil-penalties.csv",
    index=False,
    encoding="utf-8"
)
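
The export assumes the `./input/` directory already exists, because `to_csv` does not create missing folders. A minimal sketch of a guard, using a hypothetical helper named `ensure_parent_dir` and a temporary directory so that nothing is written into the project:

```python
import tempfile
from pathlib import Path


def ensure_parent_dir(csv_path):
    """Create the folder that will hold csv_path if it is not there yet."""
    path = Path(csv_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    return path


# Demonstrate inside a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    target = ensure_parent_dir(Path(tmp) / "input" / "scraped-ferc-civil-penalties.csv")
    parent_created = target.parent.is_dir()
```

In the notebook itself, calling `ensure_parent_dir("./input/scraped-ferc-civil-penalties.csv")` just before the export would serve the same purpose.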