# Extract data from a PDF

**Problem**: You have data trapped in a PDF.

**Solution**: Use a Python library like [`pdfplumber`](https://github.com/jsvine/pdfplumber) to extract it.

The PDF we'll be working with today is [a list of licensed debt collectors in Colorado](https://coag.gov/sites/default/files/contentuploads/cp/ConsumerCreditUnit/InternetReports/carreport_0.pdf). The data start on page 2, and each page has headers.

Our steps:
1. Import dependencies
2. Download the PDF
3. Create an empty pandas data frame
4. Create a function that extracts data from the table on a single PDF page and returns a data frame
5. Loop over the pages, call the function on each page and append the resulting data frame to our empty data frame
6. Clean up the data a bit
7. Do some basic analysis

## 1. Import dependencies

In [None]:
import requests
import pdfplumber
import pandas as pd

## 2. Download the PDF

In [None]:
PDF_URL = ('https://coag.gov/sites/default/files/contentuploads/'
          'cp/ConsumerCreditUnit/InternetReports/carreport_2.pdf')

r = requests.get(PDF_URL, stream=True)

# stop here if the server returned an error instead of the file
r.raise_for_status()

with open('../data/collections.pdf', 'wb') as f:
    for block in r.iter_content(1024):
        f.write(block)
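
Servers sometimes return an HTML error page instead of the file you asked for, so a quick sanity check before parsing can save some confusion. Here is a minimal sketch, assuming a hypothetical helper called `looks_like_pdf` (it is not part of `requests` or `pdfplumber`) that checks for the `%PDF-` signature that PDF files start with. The demo runs on two throwaway files rather than the real download.

```python
import os
import tempfile


def looks_like_pdf(path):
    """Return True if the file at `path` starts with the PDF signature.

    Hypothetical helper for illustration; it only inspects the first
    five bytes and does not validate the rest of the file.
    """
    with open(path, 'rb') as f:
        return f.read(5) == b'%PDF-'


# demo on two throwaway files rather than the real download
tmp_dir = tempfile.mkdtemp()

good_path = os.path.join(tmp_dir, 'good.pdf')
with open(good_path, 'wb') as f:
    f.write(b'%PDF-1.7\n...')

bad_path = os.path.join(tmp_dir, 'bad.pdf')
with open(bad_path, 'wb') as f:
    f.write(b'<html>Not Found</html>')

print(looks_like_pdf(good_path))
print(looks_like_pdf(bad_path))
```

In this notebook, the equivalent call would be `looks_like_pdf('../data/collections.pdf')` right after the download.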

## 3. Create an empty data frame and define the columns

We're going to create an empty data frame. By looking at the source PDF, we can also define its column headers.

In [None]:
cols = ['bizname', 'license_loc', 'instate_loc', 'mailing_loc',
        'license_no', 'lic_date', 'status', 'cr_date', 'action']

df = pd.DataFrame(columns=cols)

## 4. Create a function to extract data from a single PDF page

This function will be called on every PDF page we hand it. Its job is simple: Take a `pdfplumber.Page` object, extract the table and return the data in a data frame with the same headers as the empty one we just created.

In [None]:
def page_to_df(page):
    
    # find the table on the page and extract the data
    table = page.extract_table()
    
    # grab all rows in the table except for the first one,
    # which is the header row
    lines = table[1:]
    
    # return the data in a data frame
    return pd.DataFrame(lines, columns=cols)
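
`extract_table()` hands back the table as a list of rows, each row a list of cell values, or `None` if it can't find a table on the page. The sketch below shows a slightly more defensive variant of the function that returns an empty data frame in that case. The names `FakePage` and `page_to_df_safe` are made up for illustration, and the sample row is invented, so this runs without a PDF.

```python
import pandas as pd

cols = ['bizname', 'license_loc', 'instate_loc', 'mailing_loc',
        'license_no', 'lic_date', 'status', 'cr_date', 'action']


class FakePage:
    """Hypothetical stand-in for a pdfplumber page (illustration only)."""

    def __init__(self, table):
        self._table = table

    def extract_table(self):
        return self._table


def page_to_df_safe(page):
    table = page.extract_table()

    # no table found on this page: return an empty frame with our headers
    if not table:
        return pd.DataFrame(columns=cols)

    # skip the header row, keep the data rows
    return pd.DataFrame(table[1:], columns=cols)


# made-up sample rows, not real records from the PDF
sample = [
    cols,  # header row
    ['Acme\nCollections', 'Denver, CO', '', 'PO Box 1',
     '000001', '01/15/2010', 'Active', '', 'No'],
]

with_table = page_to_df_safe(FakePage(sample))
without_table = page_to_df_safe(FakePage(None))

print(with_table.shape)
print(without_table.shape)
```

Whether the guard is worth having depends on the PDF: if every data page is known to contain a table, the simpler version is fine.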

## 5. Loop over the pages and call the function on each page

As we extract the data from each page, we'll append the data frame returned by our function to the empty data frame (`df`) that we created earlier. This code block takes a little while to run.

In [None]:
# open the PDF
with pdfplumber.open('../data/collections.pdf') as pdf:
    
    # skip the first page, which doesn't have a data table
    pages_with_data = pdf.pages[1:]
    
    # loop over the pages with data
    for page in pages_with_data:
        
        # call the extraction function to grab the data from this page
        df_to_append = page_to_df(page)
        
        # add it to our main data frame, renumbering the rows as we go
        # (DataFrame.append was removed in pandas 2.0, so use pd.concat)
        df = pd.concat([df, df_to_append], ignore_index=True)
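
An alternative pattern, if the loop feels slow: collect the per-page data frames in a list and combine them once at the end, instead of growing the main data frame on every pass. A minimal sketch with made-up stand-in frames and a shortened column list, so it runs without the PDF:

```python
import pandas as pd

cols = ['bizname', 'action']  # shortened column list for the demo

# made-up stand-ins for the frames page_to_df() would return
page_frames = [
    pd.DataFrame([['Alpha Recovery', 'No']], columns=cols),
    pd.DataFrame([['Beta Collections', 'Yes'],
                  ['Gamma Credit', 'No']], columns=cols),
]

# one concat at the end; ignore_index renumbers the rows 0..n-1
combined = pd.concat(page_frames, ignore_index=True)

print(len(combined))
print(list(combined.index))
```

In the real loop, the list would be built with something like `page_frames.append(page_to_df(page))`, which is the ordinary Python list method.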

Before we continue, let's take a look at what we've got using the pandas `head()` method.

In [None]:
df.head()

I notice two things:
- `\n` newline breaks are being interpreted literally as text -- let's globally replace those
- The license date is coming in as a string, not a date, and we might be interested in doing some date filtering later -- let's coerce those values to date objects

## 6. Clean up the data a bit

In [None]:
# kill line breaks
df.replace('\n', ' ', inplace=True, regex=True)

# coerce license date col to datetime and sort descending
df.lic_date = pd.to_datetime(df.lic_date)
df = df.sort_values('lic_date', ascending=False)
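
To see what those two cleanup steps do, here is a small sketch on made-up rows. It adds two things that are assumptions rather than part of the code above: an explicit `format='%m/%d/%Y'` (a guess at how the dates are written) and `errors='coerce'`, which turns values that can't be parsed into `NaT` instead of raising an error.

```python
import pandas as pd

# made-up rows that mimic the issues described above
toy = pd.DataFrame({
    'bizname': ['Acme\nCollections', 'Beta Recovery'],
    'lic_date': ['01/15/2010', 'not a date'],
})

# replace embedded line breaks with spaces in every string cell
toy = toy.replace('\n', ' ', regex=True)

# assumed format: month/day/year; unparseable values become NaT
toy['lic_date'] = pd.to_datetime(toy['lic_date'], format='%m/%d/%Y',
                                 errors='coerce')

print(toy)
print(toy['lic_date'].isna().sum())
```

Counting the `NaT` values afterward is a quick way to find out how many dates failed to parse.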

In [None]:
df.head()

## 7. Do some basic analysis

Let's get a feel for how many records there are and figure out how many debt collectors have been subject to some kind of "action."

According to the Colorado Attorney General (see page 1 of the PDF), the presence of "Yes" in the "action" column means that the company has been

> subject to legal or administrative action by this office or the licensee entered into a voluntary settlement with this office. If the entry is "yes," the licensee may have been subject to one or more letters of admonition, suspension of the license, a judgment or order against the licensee, or other action, including payments (fines, penalties, consumer refunds, or other monetary payments.) Additionally, "yes" may mean that the licensee's records include a voluntary settlement or stipulation with this office. If a licensee has been disciplined, it might still retain its license. Actions and settlements are matters of public record although research, copying, and mailing costs may apply. Contact this office for more information.

In [None]:
# how many records are there?
record_count = len(df)

In [None]:
# let's look just at collectors who have had some action taken against them
action = df[df.action == 'Yes']

story_sentence = ('Of {:,} licensed debt collectors in Colorado, {:,}'
                 ' ({:.2f}%) have been subject to some form of legal '
                 'or administrative action, according to an analysis of'
                 ' Colorado Attorney General data.')

print(story_sentence.format(record_count, len(action), (len(action) / record_count) * 100))
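
Another way to get the same kind of breakdown is `value_counts(normalize=True)`, which returns each value's share of the rows. A minimal sketch on made-up values standing in for the real "action" column:

```python
import pandas as pd

# made-up values standing in for the real "action" column
toy = pd.DataFrame({'action': ['Yes', 'No', 'No', 'No']})

# share of each value as a fraction of all rows
shares = toy['action'].value_counts(normalize=True)

# .get() falls back to 0 if no row has the value 'Yes'
pct_yes = shares.get('Yes', 0) * 100
print('{:.2f}% of toy records have an action'.format(pct_yes))
```

This also surfaces any unexpected values in the column (blanks, stray whitespace, different capitalization) that a strict `== 'Yes'` filter would silently skip.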

In [None]:
# what else?