# Extract data from a PDF - CO debt collectors

The PDF we'll be working with here is [a list of licensed debt collectors in Colorado](https://coag.gov/sites/default/files/contentuploads/cp/ConsumerCreditUnit/InternetReports/carreport_0.pdf). The data start on page 2, and the table on each page repeats the header row.

Our steps:
1. Import dependencies
2. Create an empty pandas data frame and define the columns
3. Create a function that extracts data from the table on a single PDF page and returns a data frame
4. Loop over the pages, call the function on each page and append the resulting data frame to our empty data frame
5. Clean up the data a bit
6. Do some basic analysis

### 1. Import dependencies

In [1]:
import pdfplumber
import pandas as pd

### 2. Create an empty data frame and define the columns

We're going to [create an empty data frame](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html). By looking at the source PDF, we can also define its column headers.

In [5]:
PDF_FILE = '../../data/collections.pdf'

cols = ['bizname', 'license_loc', 'instate_loc', 'mailing_loc',
        'license_no', 'lic_date', 'status', 'cr_date', 'action']

df = pd.DataFrame(columns=cols)

### 3. Create a function to extract data from a single PDF page

This function will be called on every PDF page we hand it. Its job is simple: Take a `pdfplumber.Page` object, extract the table and return the data in a data frame with the same headers as the empty one we just created.

👉 For more details on functions, [see this notebook](../../reference/Functions.ipynb).

In [6]:
def page_to_df(page):
    
    # find the table on the page and extract the data
    table = page.extract_table()
    
    # grab all rows in the table except for the first one,
    # which is the header row
    lines = table[1:]
    
    # return the data in a data frame
    return pd.DataFrame(lines, columns=cols)
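
To see what this function does without opening the PDF, here is a minimal sketch that applies the same slicing to a hand-built list of rows. The `sample_table` values and the three-column `sample_cols` list are made up for illustration, and `rows_to_df` is a hypothetical helper, not part of pdfplumber. The sketch also guards against a missing table, because `extract_table()` returns `None` when it finds no table on a page.

```python
import pandas as pd

# hypothetical column names and rows, shaped like the list of lists
# that page.extract_table() returns: one inner list per table row
sample_cols = ['bizname', 'license_no', 'action']
sample_table = [
    ['Business Name', 'License No.', 'Action'],   # header row
    ['EXAMPLE COLLECTIONS LLC', '000001', 'Yes'],
    ['SAMPLE RECOVERY INC', '000002', None],
]


def rows_to_df(table, columns):
    # no table found on the page: return an empty frame with the right columns
    if table is None:
        return pd.DataFrame(columns=columns)

    # skip the header row and label the remaining rows
    return pd.DataFrame(table[1:], columns=columns)


sample_df = rows_to_df(sample_table, sample_cols)
empty_df = rows_to_df(None, sample_cols)
```

Returning an empty frame rather than raising keeps a loop over many pages running if one page happens to have no table; whether that is the right behavior depends on how much you trust the PDF's layout.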

### 4. Loop over the pages and call the function on each page

As we extract the data from each page, we'll append the data frame returned by our function to the empty data frame (`df`) that we created earlier. The `DataFrame.append()` method was removed in pandas 2.0, so on current versions of pandas use [`pd.concat()`](https://pandas.pydata.org/docs/reference/api/pandas.concat.html) instead. This code block takes a little while to run.

👉 For more details on lists, slicing and _for loops_, [see this notebook](../../reference/Python%20data%20types%20and%20basic%20syntax.ipynb#for-loops).

In [7]:
# open the PDF
with pdfplumber.open(PDF_FILE) as pdf:
    
    # use list slicing to skip the first page, which doesn't have a data table
    pages_with_data = pdf.pages[1:]
    
    # loop over the pages with data
    for page in pages_with_data:
        
        # call the extraction function to grab the data from this page
        df_to_append = page_to_df(page)
        
        # add this page's rows to our main data frame and renumber the index;
        # pd.concat replaces DataFrame.append, which was removed in pandas 2.0
        df = pd.concat([df, df_to_append], ignore_index=True)
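
A common alternative to growing a data frame inside the loop is to collect the per-page frames in a list and combine them in one `pd.concat` call at the end, which avoids re-copying the accumulated rows on every pass. This is a sketch on made-up rows rather than real PDF pages; `fake_pages` and the two-column `cols` list are assumptions for illustration.

```python
import pandas as pd

cols = ['bizname', 'action']

# made-up stand-ins for the rows extracted from two PDF pages
fake_pages = [
    [['EXAMPLE COLLECTIONS LLC', 'Yes'], ['SAMPLE RECOVERY INC', None]],
    [['PLACEHOLDER CREDIT CORP', None]],
]

# build one data frame per page and keep them in a list
frames = [pd.DataFrame(rows, columns=cols) for rows in fake_pages]

# stack them in a single call; ignore_index=True renumbers the rows 0..n-1
combined = pd.concat(frames, ignore_index=True)
```

With the real PDF, the list comprehension would call `page_to_df(page)` for each page in `pdf.pages[1:]` instead of reading from `fake_pages`.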

Before we continue, let's take a look at what we've got using the pandas `head()` method.

In [8]:
df.head()

Unnamed: 0,bizname,license_loc,instate_loc,mailing_loc,license_no,lic_date,status,cr_date,action
0,1ST CREDIT OF AMERICA LLC,"300 N ELIZABETH ST STE 220-B\nCHICAGO, IL 60607","3025 S PARKER RD STE 711\nAURORA, CO 80014","300 N ELIZABETH ST STE 220-B\nCHICAGO, IL 60607",988412,2/20/2004,C,5/15/2007,Yes
1,1ST NATIONAL RECOVERY \nSOLUTIONS LLC,"5497 BROADWAY ST\nLANCASTER, NY 14086","600 17TH ST STE 800 NORTH\nDENVER, CO 80202","5497 BROADWAY ST\nLANCASTER, NY 14086",989708,8/15/2007,E,3/8/2010,
2,1ST NATIONWIDE \nCOLLECTION AGENCY INC,"3760 CALLE TECATE STE B\nCAMARILLO, CA 93012","3025 S PARKER RD STE 711\nAURORA, CO 80014","PO BOX 1418\nCAMARILLO, CA 93011-1418",989591,3/6/2007,C,11/12/2008,
3,21ST MORTGAGE \nCORPORATION,"620 MARKET ST\nKNOXVILLE, TN 37902","3455 W SERVICE RD\nEVANS, CO 80620","PO BOX 477\nKNOXVILLE, TN 37901-0477",991831,4/16/2013,A,Active,
4,24 ASSET MANAGEMENT \nCORP,"2020 CAMINO DEL RIO N STE 900\nSAN DIEGO, CA 9...","80 GARDEN CTR STE 3\nBROOMFIELD, CO 80020","2020 CAMINO DEL RIO N STE \n900\nSAN DIEGO, CA...",990402,11/13/2009,C,1/6/2016,


### 5. Clean up the data a bit

I notice two things:
- `\n` newline breaks are being interpreted literally as text -- let's globally [replace](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.replace.html) those
- The license date is coming in as a string, not a date, and we might want to filter by date later -- let's coerce those values to datetime objects with the [`to_datetime()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_datetime.html) function

In [16]:
# kill line breaks
df.replace('\n', ' ', inplace=True, regex=True)

# coerce license date col to datetime
df.lic_date = pd.to_datetime(df.lic_date, errors='coerce')
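
Here is a minimal sketch of both clean-up steps on a made-up two-row frame (the values are invented for illustration). It shows that `errors='coerce'` turns an unparseable value into `NaT` instead of raising an error. It also tries a slightly broader pattern, `\s*\n\s*`, which avoids leaving a double space when the PDF text already has a space next to the line break; that pattern is a suggestion, not what the original cell uses.

```python
import pandas as pd

toy = pd.DataFrame({
    'bizname': ['EXAMPLE RECOVERY \nSOLUTIONS LLC', 'SAMPLE CREDIT CORP'],
    'lic_date': ['2/20/2004', 'not a date'],
})

# collapse each line break, plus any spaces around it, into one space
toy = toy.replace(r'\s*\n\s*', ' ', regex=True)

# parse the dates; values that can't be parsed become NaT
toy['lic_date'] = pd.to_datetime(toy['lic_date'], errors='coerce')
```

Because `coerce` hides bad values silently, it can be worth counting the `NaT` entries afterward with `toy['lic_date'].isna().sum()` to see how many dates failed to parse.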

In [17]:
df.head()

Unnamed: 0,bizname,license_loc,instate_loc,mailing_loc,license_no,lic_date,status,cr_date,action
0,1ST CREDIT OF AMERICA LLC,"300 N ELIZABETH ST STE 220-B CHICAGO, IL 60607","3025 S PARKER RD STE 711 AURORA, CO 80014","300 N ELIZABETH ST STE 220-B CHICAGO, IL 60607",988412,2004-02-20,C,5/15/2007,Yes
1,1ST NATIONAL RECOVERY SOLUTIONS LLC,"5497 BROADWAY ST LANCASTER, NY 14086","600 17TH ST STE 800 NORTH DENVER, CO 80202","5497 BROADWAY ST LANCASTER, NY 14086",989708,2007-08-15,E,3/8/2010,
2,1ST NATIONWIDE COLLECTION AGENCY INC,"3760 CALLE TECATE STE B CAMARILLO, CA 93012","3025 S PARKER RD STE 711 AURORA, CO 80014","PO BOX 1418 CAMARILLO, CA 93011-1418",989591,2007-03-06,C,11/12/2008,
3,21ST MORTGAGE CORPORATION,"620 MARKET ST KNOXVILLE, TN 37902","3455 W SERVICE RD EVANS, CO 80620","PO BOX 477 KNOXVILLE, TN 37901-0477",991831,2013-04-16,A,Active,
4,24 ASSET MANAGEMENT CORP,"2020 CAMINO DEL RIO N STE 900 SAN DIEGO, CA 92108","80 GARDEN CTR STE 3 BROOMFIELD, CO 80020","2020 CAMINO DEL RIO N STE 900 SAN DIEGO, CA 9...",990402,2009-11-13,C,1/6/2016,


### 6. Do some basic analysis

Let's get a feel for how many records there are and figure out how many debt collectors have been subject to some kind of "action."

According to the Colorado Attorney General (see page 1 of the PDF), the presence of "Yes" in the "action" column means that the company has been

> subject to legal or administrative action by this office or the licensee entered into a voluntary settlement with this office. If the entry is "yes," the licensee may have been subject to one or more letters of admonition, suspension of the license, a judgment or order against the licensee, or other action, including payments (fines, penalties, consumer refunds, or other monetary payments.) Additionally, "yes" may mean that the licensee's records include a voluntary settlement or stipulation with this office. If a licensee has been disciplined, it might still retain its license. Actions and settlements are matters of public record although research, copying, and mailing costs may apply. Contact this office for more information.

👉 We're about to do some string formatting. For more details on string formatting, [check out this notebook](../../reference/String%20formatting.ipynb).

In [18]:
# how many records are there?
record_count = len(df)

In [24]:
# let's look just at collectors who have had some action taken against them
action = df[df.action == 'Yes']

# use string formatting to write a formatted sentence
# https://docs.python.org/3/library/string.html#format-examples
tpl = ('Of {colcount:,} licensed debt collectors in Colorado, {actioncount:,}'
       ' ({actionpct:.2f}%) have been subject to some form of legal '
       'or administrative action, according to an analysis of'
       ' Colorado Attorney General data.')

story_sentence = tpl.format(colcount=record_count,
                            actioncount=len(action),
                            actionpct=(len(action) / record_count) * 100)

print(story_sentence)

Of 2,402 licensed debt collectors in Colorado, 687 (28.60%) have been subject to some form of legal or administrative action, according to an analysis of Colorado Attorney General data.
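
The same share can be computed without building a filtered copy: comparing the column to `'Yes'` gives a Boolean series, and its mean is the fraction of `True` values. This is a sketch on a made-up four-row frame; the values are invented, and representing blank cells as `None` is an assumption about how they arrive from the PDF.

```python
import pandas as pd

toy = pd.DataFrame({'action': ['Yes', None, None, 'Yes']})

# True where an action is recorded; a missing value never equals 'Yes',
# so blank cells count as False
has_action = toy['action'] == 'Yes'

action_count = int(has_action.sum())
action_pct = has_action.mean() * 100

# an f-string version of the same number formatting used in the template
sentence = f'{action_count:,} of {len(toy):,} ({action_pct:.2f}%)'
```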


In [None]:
# what else?