# Extract data from a PDF - CO debt collectors

The PDF we'll be working with here is [a list of licensed debt collectors in Colorado](https://coag.gov/sites/default/files/contentuploads/cp/ConsumerCreditUnit/InternetReports/carreport_0.pdf). The data start on page 2, and each page has headers.

Our steps:
1. Import dependencies
2. Create an empty pandas data frame and define the columns
3. Create a function that extracts data from the table on a single PDF page and returns a data frame
4. Loop over the pages, call the function on each page and append the resulting data frame to our empty data frame
5. Clean up the data a bit
6. Do some basic analysis

### 1. Import dependencies

In [None]:
# import pdfplumber and pandas


### 2. Create an empty data frame and define the columns

We're going to [create an empty data frame](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html). By looking at the source PDF, we can also define its column headers.
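A minimal sketch of this step. The column names below are hypothetical placeholders; copy the real ones from the header row in the PDF.

```python
import pandas as pd

# Hypothetical column names -- replace these with the headers shown in the PDF
COLUMNS = ["name", "city", "state", "license_date", "action"]

# An empty data frame: the columns are defined, but there are no rows yet
df = pd.DataFrame(columns=COLUMNS)

print(df.shape)
```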

In [None]:
# path to the PDF we'll be parsing

# define the column names in a list

# create an empty dataframe


### 3. Create a function to extract data from a single PDF page

This function will be called on every PDF page we hand it. Its job is simple: Take a `pdfplumber.Page` object, extract the table and return the data in a data frame with the same headers as the empty one we just created.

👉 For more details on functions, [see this notebook](../../reference/Functions.ipynb).
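One way such a function might look, as a sketch rather than the definitive version. It relies on the `extract_table()` method of a `pdfplumber.Page`; the function name `extract_page_data` and the column names are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical column names -- use the headers from the PDF instead
COLUMNS = ["name", "city", "state", "license_date", "action"]


def extract_page_data(page, columns=COLUMNS):
    """Return the table on one PDF page as a data frame."""
    # `extract_table()` returns the page's table as a list of rows,
    # or None if no table is found on the page
    table = page.extract_table()
    if table is None:
        return pd.DataFrame(columns=columns)

    # skip the first row, which repeats the column headers
    rows = table[1:]

    # return the data in a data frame with our own column names
    return pd.DataFrame(rows, columns=columns)
```

Returning an empty data frame when no table is found keeps the caller from having to special-case pages without data.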

In [None]:
# define the function -- takes one page

    
    # find the table on the page and extract the data

    
    # grab all rows in the table except for the first one,
    # which is the header row

    
    # return the data in a data frame


### 4. Loop over the pages and call the function on each page

As we extract the data from each page, we'll [append](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.append.html) the data frame returned by our function to the empty data frame (`df`) that we created earlier. Note that `DataFrame.append()` was removed in pandas 2.0; on newer versions, use `pd.concat()` to do the same job. This code block takes a little while to run.

👉 For more details on lists, slicing and _for loops_, [see this notebook](../../reference/Python%20data%20types%20and%20basic%20syntax.ipynb#for-loops).
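The loop could be sketched as below. This version collects the per-page data frames in a list and stacks them with `pd.concat()`, which works on current pandas versions. The helper name `combine_pages` and its parameters are hypothetical; `extract` stands for whatever extraction function you wrote in step 3.

```python
import pandas as pd


def combine_pages(pages, extract, columns):
    """Run `extract` on every page after the first and stack the results."""
    frames = []

    # use list slicing to skip the first page, which has no data table
    for page in pages[1:]:
        frames.append(extract(page))

    # nothing to stack: return an empty data frame with the right columns
    if not frames:
        return pd.DataFrame(columns=columns)

    # ignore_index=True drops each page's own row labels and renumbers from 0
    return pd.concat(frames, ignore_index=True)
```

With pdfplumber, this might be called inside `with pdfplumber.open(pdf_path) as pdf:` as `combine_pages(pdf.pages, extract, columns)`, since `pdf.pages` is a list of page objects.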

In [None]:
# open the PDF

    
    # use list slicing to skip the first page, which doesn't have a data table

    
    # loop over the pages with data

        
        # call the extraction function to grab the data from this page

        
        # append this data to our main dataframe, chopping off the index column


Before we continue, let's take a look at what we've got using the pandas `head()` method.

In [None]:
# check the output with `head()`


### 5. Clean up the data a bit

I notice two things:
- `\n` newline characters are showing up in the text -- let's globally [replace](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.replace.html) those
- The license date is coming in as a string, not a date, and we might want to filter by date later -- let's coerce those values to datetime objects with the pandas [`to_datetime()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_datetime.html) function
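Both cleanup steps can be tried on a small example first. The rows and column names below are made up for illustration and do not come from the PDF.

```python
import pandas as pd

# Made-up sample rows standing in for the extracted data;
# the column names are hypothetical
df = pd.DataFrame({
    "name": ["Acme\nCollections", "Beta Recovery"],
    "license_date": ["01/15/2010", "not a date"],
})

# replace newline characters anywhere in the data frame with a space
df = df.replace(r"\n", " ", regex=True)

# convert the date strings to datetimes; values that can't be parsed become NaT
df["license_date"] = pd.to_datetime(df["license_date"], errors="coerce")

print(df.dtypes)
```

Passing `errors="coerce"` is a design choice: it keeps one malformed date from stopping the whole conversion, at the cost of silently turning bad values into `NaT`, so it's worth counting the missing values afterward.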

In [None]:
# kill line breaks


# coerce license date col to datetime


In [None]:
# check the output with `head()`


### 6. Do some basic analysis

Let's get a feel for how many records there are and figure out how many debt collectors have been subject to some kind of "action."

According to the Colorado Attorney General (see page 1 of the PDF), the presence of "Yes" in the "action" column means that the company has been

> subject to legal or administrative action by this office or the licensee entered into a voluntary settlement with this office. If the entry is "yes," the licensee may have been subject to one or more letters of admonition, suspension of the license, a judgment or order against the licensee, or other action, including payments (fines, penalties, consumer refunds, or other monetary payments.) Additionally, "yes" may mean that the licensee's records include a voluntary settlement or stipulation with this office. If a licensee has been disciplined, it might still retain its license. Actions and settlements are matters of public record although research, copying, and mailing costs may apply. Contact this office for more information.

👉 We're about to do some string formatting. For more details on string formatting, [check out this notebook](../../reference/String%20formatting.ipynb).
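A sketch of the counting, filtering and formatting steps on made-up data. The `action` column name and the sample rows are assumptions, so adjust them to match your data frame.

```python
import pandas as pd

# Made-up sample rows; the `action` column name is hypothetical
df = pd.DataFrame({
    "name": ["Acme Collections", "Beta Recovery", "Gamma Credit"],
    "action": ["Yes", "No", "Yes"],
})

# how many records are there?
total = len(df)

# keep only the collectors with "Yes" in the action column
actions = df[df["action"] == "Yes"]

# construct the sentence with str.format() ...
template = "{:,} of {:,} licensed collectors ({:.1%}) have had an action taken against them."
sentence = template.format(len(actions), total, len(actions) / total)

# ... and print it
print(sentence)
```

The `{:,}` spec adds thousands separators and `{:.1%}` renders a fraction as a percentage with one decimal place.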

In [None]:
# how many records are there? use `len()`


In [None]:
# let's filter to look just at collectors who have had some action taken against them


# use string formatting to write a formatted sentence
# https://docs.python.org/3.4/library/string.html#format-examples

# construct your sentence

# and print it


In [None]:
# what else?