# Batch PDF parsing

## Overview

ICE has a [page](https://www.ice.gov/facility-inspections) of PDFs describing detention facility inspections. Most inspections involve a "cover letter" that lists violated standards and their components, and a "summary review form" that tabulates incidents and provides narratives.

There are 132 inspections between July 2018 and November 2019, and we'd like to parse some statistics about violations and incidents from those reports.

In this notebook, I'll demonstrate parsing grievances filed from summary forms with pdfplumber.

## The Plan

1. Study the PDFs
    1. Single PDF
        1. which page has the table?
    2. All the PDFs
        1. create a meta table
2. Extract data
    1. Single PDF
    2. All the PDFs

### Study the PDFs

In [None]:
# examine the directory
!ls pdf/

In [None]:
import pdfplumber

In [None]:
pdf = pdfplumber.open(
    'pdf/jenaLaSalle_SIS_09-26-2019.pdf'
)

In [None]:
pdf.pages

In [None]:
len(pdf.pages)

In [None]:
page = pdf.pages[0]

In [None]:
page.to_image()

Where are the grievances?

In [None]:
page = pdf.pages[5]
page.to_image()

### Check a different PDF

In [None]:
pdf = pdfplumber.open(
    'pdf/allenParishDetFac_SIS_02-14-2019.pdf'
)

In [None]:
page = pdf.pages[5]

What happened?

In [None]:
len(pdf.pages)

### Collect metadata about our PDFs

How many pages does each PDF have?

In [None]:
from glob import glob

In [None]:
glob('pdf/*')

In [None]:
for file_name in glob('pdf/*'):
    pdf = pdfplumber.open(file_name)
    print(
        file_name,
        len(pdf.pages)
    )

In [None]:
import pandas as pd

In [None]:
payload = []

for file_name in glob('pdf/*'):
    pdf = pdfplumber.open(file_name)
    payload.append({
        'file_name': file_name,
        'pages': len(pdf.pages)
    })

payload

In [None]:
pdf_meta = pd.DataFrame(
    payload
)

pdf_meta

We know these records have different page counts, but where are the grievance tables?

Regular expressions can help find patterns in text.
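
Before pointing a pattern at real pages, it can help to try it on a few strings. The sketch below uses the same alternation this notebook relies on; the sample page text is made up for illustration and does not come from the reports.

```python
import re

# One alternative per report layout: 'Grievances:' or 'grievances'.
pattern_grievance_table = r'Grievances:|grievances'

# Hypothetical page text, invented for this example.
sample_pages = [
    'Facility Information',
    'Grievances: 12 4 7 9',
    'Number of detainee grievances 3 5 8',
    'GRIEVANCES',
]

# Keep the index of every page whose text contains a match.
matching_pages = [
    index
    for index, text in enumerate(sample_pages)
    if re.search(pattern_grievance_table, text)
]

print(matching_pages)
```

The pattern is case-sensitive, so the all-caps heading is skipped. Passing `flags=re.IGNORECASE` to `re.search` would catch it, at the risk of matching more pages than intended.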

In [None]:
import re

In [None]:
pattern_grievance_table = r'Grievances:|grievances'

In [None]:
payload = []


for file_name in glob('pdf/*'):
    pdf = pdfplumber.open(file_name)
    for page_index, page in enumerate(pdf.pages):
        if re.search(
                pattern_grievance_table,
                # some pdfplumber versions return None for pages without text
                page.extract_text() or ''
            ):
            payload.append({
                'file_name': file_name,
                'table_page': page_index
            })
            
pdf_meta = pd.DataFrame(payload).merge(
    pdf_meta,
    on='file_name'
)

pdf_meta

Now we know where the grievance data are on each document. Time to parse!

### Single PDF processing

In [None]:
pdf_row = pdf_meta.loc[
    lambda x: x['file_name'] == 'pdf/allenParishDetFac_SIS_02-14-2019.pdf'
].iloc[0]

pdf_row

In [None]:
pdf = pdfplumber.open(pdf_row['file_name'])

page = pdf.pages[
    pdf_row['table_page']
]

page.to_image()

In [None]:
page.to_image().debug_tablefinder()

The debugger looks messy because the PDF has inconsistent rows.

We could parse the whole page and clean it up as a second step, but we can also zoom in to a smaller area and extract it with more accuracy.

In this case, we just want the "Grievances Received" row. We can `crop` the page using pixel coordinates.

pdfplumber can crop pages with a bounding box, which its documentation describes as a `"4-tuple with the values (x0, top, x1, bottom)"`.

Another way to remember it: `(left, top, right, bottom)`
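
To make the order of the four numbers harder to mix up, here is a minimal sketch that gives them names. `BBox` and `clamp_to_page` are hypothetical helpers, not part of pdfplumber, and the page size is an assumption (US Letter); a real page reports its own `page.width` and `page.height`.

```python
from typing import NamedTuple


class BBox(NamedTuple):
    """Hypothetical helper that names the four bounding-box values."""
    x0: float      # left edge
    top: float     # upper edge, measured down from the top of the page
    x1: float      # right edge
    bottom: float  # lower edge, measured down from the top of the page

    @property
    def width(self):
        return self.x1 - self.x0

    @property
    def height(self):
        return self.bottom - self.top


def clamp_to_page(bbox, page_width, page_height):
    """Trim a box so it never extends past the page edges."""
    return BBox(
        max(bbox.x0, 0),
        max(bbox.top, 0),
        min(bbox.x1, page_width),
        min(bbox.bottom, page_height),
    )


# Assumed page size for illustration only.
page_width, page_height = 612, 792

# Everything from 300 units below the top edge down to the bottom of the page.
lower_part = BBox(0, 300, page_width, page_height)

# A box that spills over the edges gets trimmed back to the page.
trimmed = clamp_to_page(BBox(-10, 300, 700, 900), page_width, page_height)

print(lower_part.width, lower_part.height)
print(trimmed)
```

Because a `NamedTuple` is still a tuple, a `BBox` should be usable anywhere a plain 4-tuple is expected.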

In [None]:
crop_coordinates = (0, 300, page.width, page.height)

page_cropped = page.crop(crop_coordinates)

page_cropped.to_image().debug_tablefinder()

Getting the coordinates just right can be tricky.

I opened a PDF in Adobe Illustrator and used its interface to find coordinates, but we can also find coordinates programmatically.

pdfplumber provides the coordinates of `words`, which we can use for our crop coordinates.

In [None]:
page.extract_words()

In [None]:
coordinate_grievances = list(filter(
    lambda x: x['text'] == 'Grievances:',
    page.extract_words()
))[0]

coordinate_grievances

Trial and error!

In [None]:
coordinate_left = coordinate_grievances['x1'] + 100
coordinate_right = page.width
coordinate_top = coordinate_grievances['top'] - 5
coordinate_bottom = coordinate_grievances['top'] + 30

In [None]:
crop_coordinates = (coordinate_left, coordinate_top, coordinate_right, coordinate_bottom)

crop_coordinates
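
The offset arithmetic above can be wrapped in a small function so the same logic is reusable across pages. This is only a sketch: `bbox_right_of` is a hypothetical helper, its default padding mirrors the trial-and-error values in this notebook, and the sample word is invented (real words come from `page.extract_words()`).

```python
def bbox_right_of(word, page_width, pad_left=0, pad_top=5, pad_bottom=30):
    """Build (x0, top, x1, bottom) for the strip to the right of a word.

    `word` is a dict with at least the 'x1' and 'top' keys, shaped like
    an entry from page.extract_words().
    """
    # Start to the right of the word, but never past the page edge.
    left = min(word['x1'] + pad_left, page_width)
    # Pad upward a little, but never above the top of the page.
    top = max(word['top'] - pad_top, 0)
    bottom = word['top'] + pad_bottom
    return (left, top, page_width, bottom)


# Hypothetical word; the numbers are made up for illustration.
sample_word = {
    'text': 'Grievances:',
    'x0': 72.0,
    'x1': 130.0,
    'top': 410.0,
    'bottom': 421.0,
}

crop_box = bbox_right_of(sample_word, page_width=612, pad_left=100)

print(crop_box)
```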

In [None]:
page.within_bbox(
    crop_coordinates
).to_image().debug_tablefinder()

Much better. Let's get the table!

In [None]:
extracted_table = page.within_bbox(
    crop_coordinates
).extract_table()

extracted_table

The data are just one row, but let's make a pandas DataFrame out of it.

In [None]:
pd.DataFrame(
    extracted_table,
    columns=['q1', 'q2', 'q3', 'q4']
).astype(int).assign(
    total = lambda x: x.sum(axis=1),
    file=pdf_row['file_name']
)

Now let's extract grievances from a nine-page PDF.

In [None]:
pdf_row = pdf_meta.loc[
    lambda x: x['pages'] == 9
].iloc[0]

pdf = pdfplumber.open(pdf_row['file_name'])

page = pdf.pages[
    pdf_row['table_page']
]

page.to_image()

In [None]:
coordinate_grievances = list(filter(
    lambda x: x['text'] == 'grievances',
    page.extract_words()
))[0]

# I found these values after trial and error!
coordinate_left = coordinate_grievances['x1']
coordinate_right = page.width
coordinate_top = coordinate_grievances['top'] - 5
coordinate_bottom = coordinate_grievances['top'] + 30

crop_coordinates = (coordinate_left, coordinate_top, coordinate_right, coordinate_bottom)

page.within_bbox(
    crop_coordinates
).to_image(resolution=150).debug_tablefinder()

The crop looks close, but notice that the table finder is drawing more lines than we need. We can use one of the many pdfplumber [table settings](https://github.com/jsvine/pdfplumber#table-extraction-settings) to fine-tune the table. In this case, I found the `snap_tolerance` setting helped with the multiple lines.
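
To build some intuition for what a snapping tolerance does, here is a simplified illustration. It is not pdfplumber's implementation: `snap_positions` is a hypothetical function that groups nearby line positions and moves each group to its average, and the positions are made up.

```python
def snap_positions(positions, tolerance):
    """Merge positions that sit within `tolerance` of their neighbor.

    Each group of nearby positions collapses to the group's average.
    """
    clusters = []
    for position in sorted(positions):
        # Join the current cluster if close enough to its last member.
        if clusters and position - clusters[-1][-1] <= tolerance:
            clusters[-1].append(position)
        else:
            clusters.append([position])
    return [sum(cluster) / len(cluster) for cluster in clusters]


# Hypothetical x positions of detected vertical lines.
detected_lines = [100, 103, 106, 200, 300, 304]

# A small tolerance leaves every line separate.
strict = snap_positions(detected_lines, tolerance=1)

# A larger tolerance merges the near-duplicates into three lines.
relaxed = snap_positions(detected_lines, tolerance=8)

print(strict)
print(relaxed)
```

With the larger tolerance, six detected lines collapse into three, which is the kind of cleanup we want when a table border is drawn as several closely spaced strokes.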

In [None]:
pd.DataFrame(
    page.within_bbox(
        crop_coordinates
    ).extract_table({
        'snap_tolerance': 8
    }),
    columns = [
        'ice',
        'non_ice',
        'total'
    ]
).astype(int).assign(
    file=pdf_row['file_name']
)

We've extracted data from a 4-page PDF and a 9-page PDF, and we can apply these techniques to our entire batch.

> "Even just learning how to write a simple loop is very helpful to convert PDFs." - Todd Wallack, [NICAR 2017](https://www.ire.org/resource-center/audio/1223/)

In [None]:
def process_row(row):
    pdf = pdfplumber.open(row['file_name'])
    page = pdf.pages[row['table_page']]
    page_count = row['pages']
    
    coordinate_grievances = list(filter(
        lambda x: re.search(
            pattern_grievance_table, x['text']
        ),
        page.extract_words()
    ))[0]
    
    coordinate_left = coordinate_grievances['x1']
    coordinate_right = page.width
    
    if row['pages'] == 9:
        coordinate_top = coordinate_grievances['top'] - 5
        coordinate_bottom = coordinate_grievances['top'] + 30
        
    else:
        coordinate_left = coordinate_left + 100
        coordinate_top = coordinate_grievances['top'] - 5
        coordinate_bottom = coordinate_grievances['top'] + 30

    crop_coordinates = (coordinate_left, coordinate_top, coordinate_right, coordinate_bottom)
        
        
    page = page.within_bbox(crop_coordinates)
    
    if page_count == 9:
        return pd.DataFrame(
            page.extract_table({
                'snap_tolerance': 8
            }),
            columns = [
                'ice',
                'non_ice',
                'total'
            ]
        ).astype(int).assign(
            total=lambda x: x['ice'] + x['non_ice'],
            file=row['file_name']
        )

    else:
        return pd.DataFrame(
            page.extract_table(),
            columns=[
                'q1',
                'q2',
                'q3',
                'q4'
            ]
        ).astype(int).assign(
            total=lambda x: x.sum(axis=1),
            file=row['file_name']
        )


In [None]:
payload = []

for index, row in pdf_meta.iterrows():
    payload.append(process_row(row))

This first pass stumbles if a report records `N/A` instead of a number, because `astype(int)` cannot convert that string. The revised `process_row` below replaces `N/A` with 0 before casting.

In [None]:
def process_row(row):
    pdf = pdfplumber.open(row['file_name'])
    page = pdf.pages[row['table_page']]
    page_count = row['pages']
    
    coordinate_grievances = list(filter(
        lambda x: re.search(
            pattern_grievance_table, x['text']
        ),
        page.extract_words()
    ))[0]
    
    coordinate_left = coordinate_grievances['x1']
    coordinate_right = page.width
    
    if row['pages'] == 9:
        coordinate_top = coordinate_grievances['top'] - 5
        coordinate_bottom = coordinate_grievances['top'] + 30
        
    else:
        coordinate_left = coordinate_left + 100
        coordinate_top = coordinate_grievances['top'] - 5
        coordinate_bottom = coordinate_grievances['top'] + 30

    crop_coordinates = (coordinate_left, coordinate_top, coordinate_right, coordinate_bottom)
        
        
    page = page.within_bbox(crop_coordinates)
    
    if page_count == 9:
        return pd.DataFrame(
            page.extract_table({
                'snap_tolerance': 8
            }),
            columns = [
                'ice',
                'non_ice',
                'total'
            ]
        ).astype(int).assign(
            total=lambda x: x['ice'] + x['non_ice'],
            file=row['file_name']
        )

    else:
        return pd.DataFrame(
            page.extract_table(),
            columns=[
                'q1',
                'q2',
                'q3',
                'q4'
            ]
        ).replace('N/A', 0).astype(int).assign(
            total=lambda x: x.sum(axis=1),
            file=row['file_name']
        )

In [None]:
payload = []

for index, row in pdf_meta.iterrows():
    payload.append(process_row(row))

In [None]:
pd.concat(payload)[['file', 'total']]
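
One caveat about `replace('N/A', 0)`: it counts an unreported value as zero grievances. If that distinction matters, a sketch like the one below keeps missing cells separate from real zeros. `parse_count` is a hypothetical helper and the sample row is invented; it assumes cells arrive as strings or `None`.

```python
def parse_count(cell):
    """Return an int, or None when the cell holds no number."""
    text = (cell or '').strip()
    if text.upper() in ('', 'N/A'):
        return None
    return int(text)


# Hypothetical row, shaped like one row of extracted table cells.
sample_row = ['12', 'N/A', '7', None]

counts = [parse_count(cell) for cell in sample_row]

# Sum only the values that were actually reported.
total = sum(count for count in counts if count is not None)

# Track how many cells were missing so the total can be caveated.
missing = counts.count(None)

print(counts, total, missing)
```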

## Conclusion

pdfplumber helped collect data from the reports, but we also used [Overview](https://blog.overviewdocs.com/) for text search and OCR of non-machine-readable documents.

Look at all the PDFs we parsed! But processing actually involved quite a few things. We:

- created pandas data frames about our data
- wrote `for` loops
- used regular expressions
- filtered lists
- used trial-and-error to find the perfect PDF coordinates
- debugged errors