nicar_ocr

A tutorial on extracting text from PDFs and optical character recognition using tesseract, ImageMagick and other open source tools

Introduction

This class seeks to help you solve a common problem in journalism: data stored in a computer-generated PDF or, even worse, an image PDF. We'll first walk through some quick text extraction using a command-line tool. Then we'll step up to optical character recognition, or OCR, to work on image files.

Installation

First things first, we need to install the tools we'll be using.

  • Xpdf is an open source toolkit for working with PDFs. We'll be using its pdftotext tool.

  • tesseract is our OCR engine. It was first developed by HP but for the last decade or so it's been maintained by Google.

  • ImageMagick is an open source image processing and conversion power tool.

  • Ghostscript is an interpreter for PDFs and Adobe's PostScript language.

Since this is a Mac-based class, we'll be following Mac install instructions, but you can find Windows and Linux instructions in each tool's documentation.

For Mac, we'll be using the Homebrew package manager. You can install it here. So for tesseract, you will use the following command.

brew install tesseract

For Xpdf, you will use this.

brew install xpdf

We will also install libtiff, a dependency for ImageMagick that we will need.

brew install libtiff

Then we'll install Ghostscript.

brew install ghostscript

And for ImageMagick you will use this.

brew install imagemagick

Files

We'll be using a number of files for our examples. You can find them in here.

Scenario 1: Analyzing a computer-generated PDF with embedded text (searchable PDF)

This is probably the easiest PDF problem to solve: extracting the text from a searchable PDF for some kind of analysis.

There are many GUI software programs you can use to do this. They all have strengths and weaknesses.

For this tutorial, we're going to use an open source power tool from Xpdf called pdftotext. The construction of the command is pretty intuitive: you point it at a file and it outputs a text file.

I often use this tool to check for hidden text, particularly in documents that are redacted. Our example is from just a few months ago when lawyers for Paul Manafort accidentally filed a document that wasn't properly redacted. Reporters, including my colleague Michael Balsamo, quickly realized that even though the document contained blacked out sections, the text of those passages was still present. That text revealed Manafort had shared polling data with a Russian associate during the 2016 election.

One way to get to this text is just to copy and paste the sections out. But this can be tedious, particularly if there are a lot of sections or you have a large document. A faster method, with easier-to-read results, is what we're going to do with Xpdf's pdftotext.

Our document has several sections like this.

*[Screenshot: a redacted section of the Manafort filing]*

But since we can tell that there's text underneath there, let's run it through pdftotext and see what comes out.

pdftotext command construction

pdftotext /path/to/my/file.pdf name-of-my-text-file.txt

So for our file it would look something like this.

pdftotext Manafort_filing.pdf manafort_filing.txt

But that's just one limited use case. Extracted text can also be fed into databases or used for visualizations.
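For example, once the text is out, ordinary command-line tools can search and count it. A small sketch (the search term is just a hypothetical example):

```shell
# Count the lines that came out, then count case-insensitive matches
# for a term of interest in the extracted text.
if [ -f manafort_filing.txt ]; then
  wc -l manafort_filing.txt
  grep -i -c "polling" manafort_filing.txt
fi
```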

Let's take a look at another one of our files involving tabular data, found here. This is a salary roster of Trump White House employees. We'll be using a single image page of this file for a later example.

*[Screenshot: the White House salaries PDF]*

As mentioned before, Tabula is a great tool for getting tabular data out of pdf files, but I wanted to give you another option using pdftotext that works well with fixed-width data files like this White House salaries listing. It also has the added benefit of being easily scriptable.

pdftotext command for tables

pdftotext -table /path/to/my/file.pdf name-of-my-text-file.txt

We'll test it out on the file. You can cd to it in the /files/tabular directory.

pdftotext -table 07012018-report-final.pdf tabular-test.txt

You should get something like this:

*[Screenshot: the -table output, with columns intact]*

For comparison, try using just pdftotext.

pdftotext 07012018-report-final.pdf test.txt

You should get something like this (very bad stuff):

*[Screenshot: the plain pdftotext output, with columns jumbled together]*
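And because pdftotext runs from the command line, the -table workflow is easy to script. A minimal sketch that converts every PDF in the current directory, naming each text file after its source:

```shell
# Batch version of the command above. The ${f%.pdf} expansion strips
# the .pdf suffix, so report.pdf becomes report.txt.
for f in *.pdf; do
  [ -f "$f" ] || continue            # skip if the glob matched nothing
  pdftotext -table "$f" "${f%.pdf}.txt"
done
```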

Now that we've walked through the basics of text extraction with computer-generated (nice) PDFs, let's move on to the harder use cases.

Scenario 2: Basic text extraction from image files

Extracting text from image files is perhaps one of the most common problems reporters face when they get data from government agencies or are trying to build their own databases from scratch (paper records, the dreaded image pdf of an Excel spreadsheet, etc.) To do this, we use OCR and in this example, Tesseract.

Basics of tesseract

Tesseract has many options. You can see them by typing:

tesseract -h

We're not going to go into detail on many of these options, but you can read more here.

The basic command structure looks like this:

tesseract imagename outputbase [-l lang] [--oem ocrenginemode] [--psm pagesegmode] [configfiles...]
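The two options you're most likely to reach for are -l, which picks the language pack, and --psm, which sets the page segmentation mode. A hypothetical sketch (myimage.png stands in for any image file):

```shell
img="myimage.png"   # hypothetical input image
lang="eng"          # -l: language pack to use
# --psm 6 tells tesseract to assume a single uniform block of text,
# which can help on rosters and tables.
if command -v tesseract >/dev/null 2>&1; then
  tesseract "$img" out -l "$lang" --psm 6 txt
fi
```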

Let's look at a single image file. In this case, that's the wh_salaries.png file in our imgs folder. This is the first page of our White House salaries pdf but notice that it is not searchable.

This is perhaps the simplest use of tesseract. We will feed in our image file and have it output a searchable pdf.

In the /files/single_img directory, use the following command.

tesseract wh_salaries.png out pdf

You start with a file like this:

*[Screenshot: the wh_salaries.png image]*

You should get a file named out.pdf, and you can see that it's searchable.

*[Screenshot: out.pdf, now searchable]*

Scenario 3: Combining our skills to make a searchable pdf out of an image pdf.

Converting pdfs to images to prepare for OCR using ImageMagick

So far, we've covered extracting text from computer-generated files and doing some basic OCR. Now we'll turn to creating searchable PDFs out of image files. To do this, we'll be adding another command line tool called ImageMagick, an image editing and manipulation program.

We will be using the convert tool from ImageMagick.

ImageMagick has some great documentation that explains all of its many options. You can find it here

convert [options ...] file [ [options ...] file ...] [options ...] file

If you're familiar with photography or document scanning, you know that the proper image resolution is essential for electronic imaging. When it comes to OCR, this is even more true.

The general standard for OCR is 300 dpi, or 300 dots per inch, though ABBYY recommends 400-600 dpi for font sizes smaller than 10 point. In ImageMagick, this is specified using the -density flag. Below, we'll tell ImageMagick to take our PDF document and convert it to an image at 300 dpi.

Important

Before we go on from here, let's make sure we have the tiff delegate installed. You can check like this:

convert -list configure

Scroll down to DELEGATES and make sure it includes tiff

For example:

DELEGATES      bzlib mpeg freetype jng jpeg lzma png tiff xml zlib
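Rather than scrolling through the whole listing, you can filter it down to the delegates line with grep:

```shell
# Print only the DELEGATES line from the configure output
# (case-insensitive match, so DELEGATES or delegates both work).
if command -v convert >/dev/null 2>&1; then
  convert -list configure | grep -i "delegates"
fi
```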

If you don't have tiff in the list, follow these steps:

First check to make sure that libtiff and ghostscript are installed. You can do this by running

brew list

If ghostscript is not in the list, then install it using brew.

brew install ghostscript

If libtiff is not in the list, then install it using brew.

brew install libtiff

Now check to make sure that ImageMagick recognizes libtiff as an installed dependency.

brew info imagemagick

If you're good to go, it should look something like this:

==> Dependencies
Build: pkg-config ✔
Required: freetype ✔, jpeg ✔, libheif ✔, libomp ✔, libpng ✔, libtiff ✔, libtool ✔, little-cms2 ✔, openexr ✔, openjpeg ✔, webp ✔, xz ✔

Now that we've installed ghostscript and the tiff delegate, let's continue on with our example.

Example with the image file Russia findings document

*[Screenshot: the Russia findings document]*

First, we have to convert it to an image so we can run it through tesseract.

We'll use ImageMagick's convert tool.

convert russia_findings.pdf russia_findings.tiff

On a Mac, an easy way to find the dpi of an image is to use Preview. Open the image in Preview, go to Tools and click Show Inspector.

So let's take a look at our image we just created.

Open in Preview

*[Screenshot]*

Go to 'Show Inspector'

*[Screenshot]*

Inspector pane 1

*[Screenshot]*

Inspector pane 2

*[Screenshot]*
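If you'd rather skip the GUI, ImageMagick's identify tool can report resolution from the command line. A sketch, where %x and %y are the horizontal and vertical resolution:

```shell
fmt='%x x %y\n'   # identify format string: horizontal x vertical resolution
if command -v identify >/dev/null 2>&1; then
  identify -format "$fmt" russia_findings.tiff
fi
```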

So our dpi is 72, well short of our 300 dpi target, so let's up that using convert. This will increase the file size of the tiff we create (so be warned about file bloat), but it's only a temporary file that we're using to get the best text recognition.

Let's do this with our Russia document.

convert -density 300 russia_findings.pdf -depth 8 -strip -background white -alpha off russia_findings.tiff

So let's break this down.

convert - invokes ImageMagick's convert tool

-density 300 - ups the dpi of our image to 300. Note that this flag comes before the input file, because it controls how the PDF is rasterized when it's read.

russia_findings.pdf - the file that we're converting to an image.

-depth 8 - "This [is] the number of bits in a color sample within a pixel. Use this option to specify the depth of raw images whose depth is unknown such as GRAY, RGB, or CMYK, or to change the depth of any image after it has been read," according to ImageMagick documentation.

-strip - strips off any junk on the file (profiles, comments, etc.)

-background white - sets the background to white to help with contrasting our text

-alpha off - turns off the alpha channel, which controls the transparency of the image. A great explanation here

Now we run this tiff through tesseract

tesseract russia_findings.tiff -l eng russia_findings_enh pdf

And you've got a searchable pdf!

Let's take a look at the underlying text now.

pdftotext russia_findings_enh.pdf russia_text.txt

We also could have output directly to a text file like this.

tesseract russia_findings.tiff -l eng russia_findings_enh txt
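And since recent versions of tesseract accept more than one output config at a time, a single run can produce both files:

```shell
outbase="russia_findings_enh"
if command -v tesseract >/dev/null 2>&1; then
  # Writes russia_findings_enh.pdf and russia_findings_enh.txt in one pass.
  tesseract russia_findings.tiff "$outbase" -l eng pdf txt
fi
```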

Where to go from here:

OCRing is not a perfect science and most of the time, it isn't simple. One recent example: public financial disclosures of federal judges are multi-page documents but they are released as extremely long, single tiff files.

*[Screenshot: a financial disclosure released as one long, single TIFF]*

And you'll notice that the pages need to be split.

*[Screenshot: the page boundaries within the single TIFF]*

The workflow below walks through one example of how to solve the problem using ImageMagick and Tesseract.

This blows up the image, adjusts its resolution and ups the contrast to help bring out the text. It then outputs a grayscale version, set at 8-bit depth, named Walker16_enh.tiff.

convert -resize 400% -density 450 -brightness-contrast 5x0 Walker16.tiff -set colorspace Gray -separate -average -depth 8 -strip Walker16_enh.tiff

Next we use ImageMagick's crop to split it up into individual pages (a multi-page TIFF).

COMING SOON HERE: More details of how to find the dimensions.

convert Walker16_enh.tiff -crop 3172x4200 Walker16_to_ocr.tiff

Then we convert that image into a searchable pdf.

tesseract Walker16_to_ocr.tiff -l eng Walker16 pdf

Exploring the various options and fine-tuning your skills with ImageMagick can help prepare you for the next big step: Batch processing of documents, which you can hear more about here at NICAR.
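As a starting point, here's a hypothetical sketch of that kind of batch job: rasterize every PDF in a folder at 300 dpi, then OCR each one into a searchable PDF. The flags mirror the single-file commands from earlier in this tutorial.

```shell
for f in *.pdf; do
  [ -f "$f" ] || continue            # skip if the glob matched nothing
  base="${f%.pdf}"                   # scan.pdf -> scan
  if command -v convert >/dev/null 2>&1 && command -v tesseract >/dev/null 2>&1; then
    # Rasterize at 300 dpi, then OCR into a searchable PDF.
    convert -density 300 "$f" -depth 8 -strip -background white -alpha off "$base.tiff"
    tesseract "$base.tiff" "${base}_ocr" -l eng pdf
  fi
done
```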

Sources and references

I created this tutorial for NICAR 2019 but it relies on many helpful open source resources that deserve credit. They are listed below. Thanks for sharing your work with the rest of the world.

Tesseract documentation

ImageMagick documentation

pdftotext documentation
