NICAR: PDF processing with OCR and command-line tools

A tutorial on extracting text from PDFs and optical character recognition using tesseract, ImageMagick and other open source tools. Updated for NICAR 2023.

Introduction

This class seeks to help you solve a common problem in journalism: data stored in a computer-generated PDF or, worse, an image PDF. We'll first walk through how to extract text from a computer-generated PDF using command-line tools. Then we'll step up to Optical Character Recognition, or OCR, to work on image files.

New for NICAR 2023

We had some extra time at the end of class to show how you can use pdfplumber to extract data from one of the searchable PDFs we created. You can find the notebook here.

Installation

First things first, we need to install the tools we'll be using. (NICAR attendees using lab laptops: IRE has already completed the install).

Xpdf is an open source toolkit to work with pdfs. We'll be using its tool, pdftotext.
tesseract is our OCR engine. It was first developed by HP but for the last decade or so it's been maintained by Google.
ImageMagick is an open source image processing and conversion power tool.
Ghostscript is an interpreter for PDFs and Adobe's PostScript language.

This class will be following the Mac install instructions, but you can find Windows and Linux instructions in the following documentation.

Xpdf documentation
tesseract documentation
ImageMagick documentation
Ghostscript documentation

For Mac, we'll be using the Homebrew package manager. You can install it here.

For tesseract, you will use the following command.

brew install tesseract

For Xpdf, you will use this.

brew install xpdf

We will also install libtiff, a dependency for ImageMagick that we will need.

brew install libtiff

Then we'll install Ghostscript.

brew install ghostscript

And for ImageMagick you will use this.

brew install imagemagick

To install all of them at once, you can run the following

brew install xpdf tesseract libtiff ghostscript imagemagick
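
Once the installs finish, a quick check that each command is on your PATH can save some confusion later. This is a minimal sketch: check_tools is a made-up helper name, and it assumes the usual binary names (pdftotext for Xpdf, gs for Ghostscript, convert for ImageMagick).

```shell
# Report whether each command named in the arguments is on the PATH.
# check_tools is a hypothetical helper, not part of any of these packages.
check_tools() {
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "found: $tool"
    else
      echo "missing: $tool"
    fi
  done
}

# Binary names assumed for the packages installed above.
check_tools pdftotext tesseract gs convert
```

Anything reported as missing can be installed again with brew.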

Some important updates for 2023

This class doesn't assume that you're working on a Mac running macOS 13 or later, but if you are, you can take advantage of some built-in OCR features. For example, some of the image PDFs in this repo will be automatically OCR'd and have selectable text when you open them on newer Macs.

You can also try out textra, an open-source project from The Washington Post's Dylan Freedman that uses those built-in Mac OCR tools to extract text from image PDFs.

All of the tools below are also available to you on Mac OS 13 or higher.

If you're not on a newer Mac, fear not! Hopefully some of the tools below will help you out.

How to think about this class

Text extraction from PDFs and OCRing image files is much more of an art than a science. The tools below are meant to give you a lot of different options to try to get the text out of an image PDF and into a format that you can work with. Sometimes they will work amazingly well. Sometimes they won't work at all. Oftentimes, it's somewhere in the middle where the data will require some clean up.

If you're comfortable in Python, I recommend pairing these tools with other great open-source projects such as Jeremy Singer-Vine's pdfplumber, which we used extensively to build several Python parsers for the Capital Assets project and many others at The Wall Street Journal.

Files

We'll be using a number of files for our examples. You can find them here.

Scenario 1: Analyzing a computer generated pdf with embedded text (searchable pdf)

We want to extract the text from a searchable pdf for analysis of some type. There are many GUI software programs you can use to do this. They all have strengths and weaknesses.

Cometdocs
Tabula (free and great for tabular data!)
Adobe Acrobat Pro ($$)
Abbyy Finereader ($$ but also very accurate)
PDFElement

For this tutorial, we're going to use an open source powertool from Xpdf called pdftotext. The construction of the command is pretty intuitive. You point it at a file and it outputs a text file.

I often use this tool to check for hidden text, particularly in documents that have redactions.

Our example is from 2019 when lawyers for Paul Manafort accidentally filed a document in court that wasn't properly redacted — even though the document contained blacked out sections, the text was still present in the document. You can read more about it here.

One way to get to this text is just to copy and paste the sections out. But this can be tedious if there are a lot of sections or you have a large document. A faster and easier method is Xpdf's pdftotext.

Our document has several sections like this.

But since we can tell that there's text underneath there, let's run it through pdftotext and see what comes out.

`pdftotext` command construction

pdftotext /path/to/my/file.pdf name-of-my-text-file.txt

So for our file it would look something like this within the files directory.

pdftotext Manafort_filing.pdf manafort_filing.txt

You can also run it from the repo parent directory.

pdftotext files/manafort/Manafort_filing.pdf files/manafort/manafort_filing.txt

Now, if we look in our newly created text file, we can find full extracted text — including the parts that are blacked out in the PDF.
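
If you only want to check whether a particular word or name survives under the redaction boxes, you can skip the intermediate text file and send the output straight to grep. A small sketch: pdf_grep is a hypothetical helper name and the search term in the usage line is a placeholder. It relies on pdftotext writing to standard output when the output file is given as -.

```shell
# Search the embedded text of a PDF for a pattern (case-insensitive),
# printing each matching line with its line number.
# pdf_grep is a hypothetical helper name.
pdf_grep() {
  pdf=$1
  pattern=$2
  # "-" as the output file tells pdftotext to write to standard output.
  pdftotext "$pdf" - | grep -n -i -- "$pattern"
}

# Example usage from the repo parent directory (placeholder search term):
# pdf_grep files/manafort/Manafort_filing.pdf "your search term"
```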

Let's take a look at another one of our files involving tabular data, found here. This is a salary roster of Trump White House employees. We'll be using a single image page of this file for a later example.

As mentioned before, Tabula is a great tool for getting tabular data out of pdf files, but I wanted to give you another option using pdftotext that works well with fixed-width data files like this White House salaries listing. It also has the added benefit of being easily scriptable.
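
As a sketch of what scripting it could look like, the loop below runs pdftotext with the -table flag (covered next) on every pdf in a folder. It assumes every file in the folder should get the same treatment; pdfs_to_text and txt_name are hypothetical helper names.

```shell
# Derive the output name: report.pdf -> report.txt
txt_name() {
  printf '%s\n' "${1%.pdf}.txt"
}

# Run pdftotext -table on every .pdf file in a directory.
pdfs_to_text() {
  dir=$1
  for pdf in "$dir"/*.pdf; do
    # Skip the unexpanded pattern when the folder holds no pdfs.
    [ -e "$pdf" ] || continue
    pdftotext -table "$pdf" "$(txt_name "$pdf")"
  done
}

# Example usage from the repo parent directory:
# pdfs_to_text files/tabular
```

Each text file lands next to the pdf it came from, which keeps the pairs easy to match up.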

pdftotext command for tables

pdftotext -table /path/to/my/file name-of-my-text-file.txt

We'll test it out on the file. You can cd to it in the /files/tabular directory or just use the path.

pdftotext -table 07012018-report-final.pdf tabular-test.txt

Or use this command from the repo parent directory

pdftotext -table files/tabular/07012018-report-final.pdf files/tabular/tabular-test.txt

You should get something like this:

For comparison, try using just pdftotext.

pdftotext files/tabular/07012018-report-final.pdf files/tabular/raw-test.txt

You should get something like this (very bad stuff):

Now that we've walked through one way to extract text from computer generated (nice) pdfs, let's move on to working with image pdfs.

Scenario 2: Basic text extraction from image files

Extracting text from image files is perhaps one of the most common problems reporters face when they get data from government agencies or are trying to build their own databases from scratch (paper records, the dreaded image pdf of an Excel spreadsheet, etc.). To do this, we use OCR and, in this example, tesseract.

Basics of `tesseract`

tesseract has many options. You can see them by typing:

tesseract -h

We're not going to go into detail on many of these options but you can read more here

The basic command structure looks like this:

tesseract imagename outputbase [-l lang] [--oem ocrenginemode] [--psm pagesegmode] [configfiles...]

Let's look at a single image file. In this case, that's the wh_salaries.png file in our imgs folder. This is the first page of our White House salaries PDF but notice that it is not searchable.

This is perhaps the simplest use of tesseract. We will feed in our image file and have it output a searchable pdf.

In /files/single_img directory, use the following command.

tesseract wh_salaries.png out pdf

You start with a file like this:

You should get a file named out.pdf and you can see that it's searchable.
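
The last argument picks the output type. If you leave it off, tesseract writes plain text, and you can also list more than one type. The sketch below asks for both a text file and a searchable pdf in one pass. ocr_image is a hypothetical helper name, and --psm 6 (treat the page as a single uniform block of text) is only an assumption to experiment with, not a recommended setting for every document.

```shell
# OCR one image and write both <base>.txt and <base>.pdf.
# ocr_image is a hypothetical helper name.
ocr_image() {
  image=$1
  # Strip the extension to get the output base: wh_salaries.png -> wh_salaries
  base=${image%.*}
  tesseract "$image" "$base" -l eng --psm 6 txt pdf
}

# Example usage in the files/single_img directory:
# ocr_image wh_salaries.png
```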

Scenario 3: Combining our skills to make a searchable pdf out of an image pdf.

Converting pdfs to images to prepare for OCR using ImageMagick

So far, we've covered extracting text from computer generated files and doing some basic OCR. Now, we'll turn to creating searchable pdfs out of image files. To do this, we'll be adding another command line tool called ImageMagick, an image editing and manipulation tool.

We will be using the convert tool from ImageMagick.

ImageMagick has some great documentation that explains its many options. You can find it here

convert [options ...] file [ [options ...] file ...] [options ...] file

If you're familiar with photography or document scanning, you know that the proper image resolution is essential for electronic imaging. When it comes to OCR, this is even more true.

The general standard for OCR is 300 dpi, or 300 dots per inch, though ABBYY recommends using 400-600 for font sizes smaller than 10 point. In ImageMagick, this is specified using the density flag. Below we are telling ImageMagick to take our pdf document and convert it to an image with 300 dpi.
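
To get a feel for what density means in pixels, multiply the page size in inches by the dpi. The sketch below assumes a US Letter page (8.5 by 11 inches), which may not match your document; pixels_at_dpi is a hypothetical helper name.

```shell
# Pixel dimensions of a page rendered at a given dpi.
# Arguments: width in inches, height in inches, dpi.
# pixels_at_dpi is a hypothetical helper name.
pixels_at_dpi() {
  awk -v w="$1" -v h="$2" -v dpi="$3" \
    'BEGIN { printf "%.0fx%.0f\n", w * dpi, h * dpi }'
}

pixels_at_dpi 8.5 11 72    # prints 612x792
pixels_at_dpi 8.5 11 300   # prints 2550x3300
```

The 300 dpi page has roughly 17 times as many pixels as the 72 dpi one, which is why the image files grow so quickly.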

Important

Before we go on from here, let's make sure we have the tiff delegate installed. You can check like this:

convert -list configure

Scroll down to DELEGATES and make sure it includes tiff

For example:

DELEGATES      bzlib mpeg freetype jng jpeg lzma png tiff xml zlib
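
If you'd rather not scroll, grep can do the looking for you. A small sketch; has_tiff_delegate is a hypothetical helper name.

```shell
# Succeeds (exit status 0) when ImageMagick lists tiff among its delegates.
# has_tiff_delegate is a hypothetical helper name.
has_tiff_delegate() {
  # Keep only the DELEGATES line(s), then look for tiff as a whole word.
  convert -list configure | grep -i '^DELEGATES' | grep -q -w 'tiff'
}

# Example usage:
# if has_tiff_delegate; then echo "tiff delegate found"; else echo "tiff delegate missing"; fi
```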

If you don't have tiff in the list, follow these steps:

First check to make sure that libtiff and ghostscript are installed. You can do this by running

brew list

If ghostscript is not in the list, then install it using brew.

brew install ghostscript

If libtiff is not in the list, then install it using brew.

brew install libtiff

Now check to make sure that imagemagick is recognizing libtiff is installed as a dependency.

brew info imagemagick

If you're good to go, it should look something like this:

==> Dependencies
Build: pkg-config ✔
Required: freetype ✔, jpeg ✔, libheif ✔, libomp ✔, libpng ✔, libtiff ✔, libtool ✔, little-cms2 ✔, openexr ✔, openjpeg ✔, webp ✔, xz ✔

Now that we've installed ghostscript and the tiff delegate, let's continue on with our example.

Example with the image file Russia findings document

First, we have to convert it to an image so we can run it through tesseract.

We'll use ImageMagick's convert tool.

convert russia_findings.pdf russia_findings.tiff

On a Mac, an easy way to find the dpi of an image is to use Preview. Open the image in Preview, go to Tools and click Show Inspector.

So let's take a look at our image we just created.

Open in Preview

Go to `Show Inspector`

Inspector pane 1

Inspector pane 2

So our dpi is 72, which likely is fine for this document but let's go ahead and up that using convert. This will increase the file size of the tiff we create (so warning about file bloat) but it's only a temporary file that we're using to get the best text recognition.
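
If you'd rather stay in the terminal, ImageMagick's identify tool can report the same information. A sketch, assuming identify was installed alongside convert; image_resolution is a hypothetical helper name. In the format string, %f is the file name, %w and %h are the pixel dimensions, %x and %y are the horizontal and vertical resolution, and %U is the resolution unit.

```shell
# Print the pixel dimensions and resolution of an image.
# image_resolution is a hypothetical helper name.
image_resolution() {
  identify -format '%f: %wx%h pixels, %x x %y %U\n' "$1"
}

# Example usage:
# image_resolution russia_findings.tiff
```

For a multi-page tiff, identify prints one line per page.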

Let's do this with our Russia document.

convert -density 300 russia_findings.pdf -depth 8 -strip -background white -alpha off russia_findings.tiff

So let's break this down.

convert - invokes ImageMagick's convert tool

-density 300 - ups the dpi of our image to 300

russia_findings.pdf - our file that we're converting to an image.

-depth 8 - "This the number of bits in a color sample within a pixel. Use this option to specify the depth of raw images whose depth is unknown such as GRAY, RGB, or CMYK, or to change the depth of any image after it has been read", according to ImageMagick documentation.

-strip - strips off any junk on the file (profiles, comments, etc.)

-background white - sets the background to white to help with contrasting our text

-alpha off - turns off the alpha channel, which is generally the transparency of the image. A great explanation here

Now we run this tiff through tesseract

tesseract russia_findings.tiff russia_findings_enh pdf

And you've got a searchable pdf!

Let's take a look at the underlying text now.

pdftotext russia_findings_enh.pdf russia_text.txt
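
Taken together, the three commands in this scenario form a small pipeline: render the pdf to a high-resolution tiff, OCR it, then pull the text back out. Here is a minimal sketch that chains them for any image pdf. ocr_pdf and the _enh and _text suffixes are hypothetical naming choices, and this is not the im_ocr.sh script from the repo.

```shell
# Turn an image pdf into a searchable pdf plus a text file.
# ocr_pdf is a hypothetical helper name.
ocr_pdf() {
  pdf=$1
  base=${pdf%.pdf}

  # 1. Render the pdf pages to a 300 dpi tiff.
  convert -density 300 "$pdf" -depth 8 -strip -background white -alpha off "$base.tiff" || return 1

  # 2. OCR the tiff into a searchable pdf named <base>_enh.pdf.
  tesseract "$base.tiff" "${base}_enh" pdf || return 1

  # 3. Extract the recognized text for a quick look.
  pdftotext "${base}_enh.pdf" "${base}_text.txt" || return 1

  # The tiff is only an intermediate file, so remove it to limit bloat.
  rm -f "$base.tiff"
}

# Example usage:
# ocr_pdf russia_findings.pdf
```

Stopping at the first failed step keeps a bad conversion from being passed silently on to tesseract.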

Where to go from here:

Scripting and Batch process

Now that we've walked through how some of these tools work, you can put them all together into bash scripts if you like. I've included an example script in this repo that seeks to hold down file bloat but it may require some tweaking for your specific use case. OCRing is not a perfect science and most of the time, it takes some trial and error to find the right settings for the documents you're working with.

Try it out on russia_findings.pdf in the image_pdfs folder. (You will likely need to run chmod u+x im_ocr.sh)

Output a searchable pdf

./im_ocr.sh /files/image_pdfs/russia_findings.pdf pdf

In the IRE mac lab, the terminals by default use zsh, so we'll want to invoke bash manually instead like below.

bash im_ocr.sh /files/image_pdfs/russia_findings.pdf pdf

Output a text file

./im_ocr.sh /files/image_pdfs/russia_findings.pdf txt

In the IRE mac lab, the terminals by default use zsh, so we'll want to invoke bash manually instead like below.

bash im_ocr.sh /files/image_pdfs/russia_findings.pdf txt

Judicial Public Financial Disclosure Example

Public financial disclosures of federal judges are multi-page documents but they are released as extremely long, single tiff files. You can find a similar test file here

And you'll notice that the pages need to be split.

The workflow below walks through one example of how to solve the problem using ImageMagick and Tesseract.

This blows up the image, sets the image resolution and nudges up the brightness to help bring out the text (-brightness-contrast 5x0 raises the brightness by 5 and leaves the contrast unchanged). It then outputs a grayscale version, set at 8-bit depth, named Walker16_enh.tiff.

convert -resize 400% -density 450 -brightness-contrast 5x0 Walker16.tiff -set colorspace Gray -separate -average -depth 8 -strip Walker16_enh.tiff

Next we use ImageMagick's crop to split it up into a multi-page tiff.

To find the dimensions, first use Preview's Inspector tool. You'll see the dimensions of the entire image file. (NOTE: This screenshot is from a different file since I added this later.)

The first value is the width and the second value is the height. To get the pixel height of each page, divide the total height by the number of pages you should have in the final file.

convert Walker16_enh.tiff -crop 3172x4200 Walker16_to_ocr.tiff
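
The height in the -crop geometry comes from simple division. Here is a sketch with made-up numbers: a scan 3172 pixels wide and 25200 pixels tall that should hold 6 pages. Only the 3172x4200 geometry appears in the command above; the total height and page count are hypothetical, and crop_geometry is a hypothetical helper name.

```shell
# Build a WIDTHxHEIGHT crop geometry for splitting one tall scan into pages.
# Arguments: image width, total image height, number of pages.
# crop_geometry is a hypothetical helper name.
crop_geometry() {
  width=$1
  total_height=$2
  pages=$3
  # Integer division; any remainder pixels end up in a short final tile.
  echo "${width}x$(( total_height / pages ))"
}

crop_geometry 3172 25200 6   # prints 3172x4200
```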

Then we convert that image into a searchable pdf.

tesseract Walker16_to_ocr.tiff -l eng Walker16 pdf

Exploring the various options and fine-tuning your skills with ImageMagick can help prepare you for the next big step: Batch processing of documents.

As mentioned above, you should definitely check out pdfplumber and Jeremy's tutorial on how to get started with it found here.

Sources and references

I originally created this tutorial for NICAR 2019. It was updated for NICAR 2022. It relies on many helpful open source resources that deserve credit. They are listed below. Thanks for sharing your work with the rest of the world.

Tesseract documentation

ImageMagick documentation

pdftotext documentation
