# How To OCR with Python & Tesseract: The Basics

**<< Previous module: [What is OCR?](03-WhatIsOCR.ipynb) <<**

*1-2 hours*

<div class="alert alert-block alert-info">
    <strong>Learning Objectives:</strong>
    <p>By the end of this module, you should be able to</p>
    <ul>
        <li>describe the necessary input formats for OCR;</li>
        <li>explain the importance of performing adjustments (pre-processing) to inputs before running OCR;</li>
        <li>identify adjustments that will likely improve text inputs;</li>
        <li>perform pre-processing steps using Python;</li>
        <li>describe and implement basic OCR steps using Python and Tesseract in Jupyter Notebooks.</li>
    </ul>
</div>

## Table of Contents

- [Quick Review: OCR Inputs & Outputs](#quick-review)
- [The OCR Process](#ocr-process)
- [Preparing Texts for OCR (Pre-Processing)](#pre-processing)
- [Performing OCR](#performing-ocr)
- [Resources](#resources)

## Quick Review: OCR Inputs & Outputs <a class="anchor" id="quick-review"></a>

### Inputs

We've already discussed this in the [last module](03-WhatIsOCR.ipynb), but it bears repeating. In order to perform OCR on a text corpus, we need the following:

- A **single file folder** containing all of the corpus files. If the corpus is small enough (e.g. 1 book), this could be simply a single file.
- All corpus files should be of the **same file format**.
- The chosen file format should be **interoperable** (usable by many software and operating systems) and stable (changes rarely if ever).


- For our work with Python and Tesseract, the files should be **images**, which means that each file will correspond to 1 single-sided page (if in a book format) in the corpus. 

To keep all of these image files organized, we recommend creating a file structure that looks like the below: 1 file folder for the entire corpus, and 1 subfolder for each volume in the corpus containing an image file for every page in the volume.

<img src="images/08-ocr-01.jpeg" width="70%" style="padding-top:20px; box-shadow: 25px 25px 20px -30px rgba(0, 0, 0);" alt="Screenshot of a file structure for image files to be OCR'ed." title="Screenshot of a file structure for image files to be OCR'ed." />

Note that the file naming structure identifies *both* which volume the images are part of *and* which scanned page they correspond to, which helps us maintain the order of the volume. These numbers *may not* correspond to page numbers because the scanning included outer and inner covers as well as title pages, etc.

Note that we are working with .jpg files here. These are files that we [downloaded from the Internet Archive](02-GatheringACorpus.ipynb#how-to-download) and with which any computer should be able to work. The process we'll be using, though, can also be run with .png, .tiff, .jp2, and other common interoperable image formats.
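As a minimal sketch of how this folder layout might be read in Python, the function below walks a corpus folder and returns each volume's page images in scan order. The function name `list_corpus_pages` and the list of extensions are assumptions made for illustration; adjust them to match your own files.

```python
from pathlib import Path

# Image extensions to include; an assumed list based on the formats named above.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".tif", ".tiff", ".jp2"}


def list_corpus_pages(corpus_dir):
    """Map each volume subfolder name to a sorted list of its page image paths."""
    corpus = {}
    for volume_dir in sorted(Path(corpus_dir).iterdir()):
        if not volume_dir.is_dir():
            continue
        # Zero-padded file names (e.g. ..._0057.jpg) sort into scan order.
        pages = sorted(
            path for path in volume_dir.iterdir()
            if path.suffix.lower() in IMAGE_EXTENSIONS
        )
        corpus[volume_dir.name] = pages
    return corpus
```

Calling `list_corpus_pages("corpus")` on a folder laid out as described above (the folder name `corpus` is hypothetical) would give one entry per volume, which is a quick way to confirm that every volume contains the number of pages you expect.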

### Outputs

Here's what we'll produce through the OCR process in this module: **1 file folder containing 1 file per volume in the .txt (plain text) format.** The plain text format is interoperable, stable, and fully computer readable, meaning it will be ready for performing computational analysis in whatever tools you might choose to work with. We'll demonstrate some analysis tools in Python in the [final module in this series](06-ExploratoryAnalysis.ipynb).

<img src="images/08-ocr-02.jpeg" width="70%" style="padding-top:20px; box-shadow: 25px 25px 20px -30px rgba(0, 0, 0);" alt="Screenshot of the file structure for files after OCR." title="Screenshot of the file structure for files after OCR." />

## The OCR Process<a class="anchor" id="ocr-process"></a>

<img src="images/noun_arrow with loops_2073885.png" width="40%" style="float:right;" alt="arrow with loops by Kalinin Ilya from the Noun Project" title="arrow with loops by Kalinin Ilya from the Noun Project" />

**Producing OCR'ed text is an iterative, rather than a linear, process.** To get the best possible output involves multiple steps and in some instances repetition of steps. Here's an overview of what these steps can look like. 

*Keep in mind that this process can vary not only based on the complexity and legibility of your corpus but also based on the resources you have. Consider your expertise, whether you are working with a team or by yourself, how much time you have, and your ultimate research goal. These should all factor into how complex you make your OCR process.*

### Pre-Processing:
This phase is all about testing to figure out the best adjustments and OCR settings for your corpus:

1. **Create a folder of sample text from your corpus.** The size of the sample may depend on the corpus' size and homogeneity or heterogeneity, but it should be an amount that you and/or your team could review manually in a reasonably short period of time.
2. **Run OCR on your sample.**
3. **Review the output** to identify errors, **looking especially for error *patterns*** that could be addressed at a corpus level.
4. **Create a list of errors and [possible adjustments](#pre-processing)** that you might use to address the errors. **Order the list based on which errors should be solved first--which might address the largest number of errors.** For example, it would be more important to fix rotated or skewed pages across the sample/corpus before trying to use erosion or dilation to make specific pages more legible to Tesseract. 
5. **Make the first adjustment** on your list to the sample.
6. **Re-run OCR on your sample.**
7. **Review the output.** Has the output improved noticeably? Are there still errors and error patterns? 
8. **Repeat some or all of the above steps:** Depending on your findings, you might continue applying adjustments from your list, re-running OCR, and reviewing outputs, or you might be ready to move on to the next step. 

Depending on the *complexity* of your corpus, you may want to select a few different samples to complete this process with and then compare the adjustments you make across all samples to see which adjustments work best overall, and which might *introduce* errors to certain parts of the corpus. 
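One possible way to build a sample folder is sketched below. It assumes the pages are `.jpg` files in a single volume folder; the function name `copy_sample`, the default sample size, and the seed are all illustrative choices, not part of the *On The Books* workflow.

```python
import random
import shutil
from pathlib import Path


def copy_sample(volume_dir, sample_dir, sample_size=10, seed=0):
    """Copy a reproducible random sample of page images into sample_dir."""
    pages = sorted(Path(volume_dir).glob("*.jpg"))
    # A fixed seed means the same pages are chosen on every run,
    # so OCR output can be compared fairly before and after an adjustment.
    chosen = random.Random(seed).sample(pages, min(sample_size, len(pages)))
    sample_dir = Path(sample_dir)
    sample_dir.mkdir(parents=True, exist_ok=True)
    for page in chosen:
        # Copy rather than move so the original volume stays complete.
        shutil.copy2(page, sample_dir / page.name)
    return sorted(page.name for page in chosen)
```

Changing the `seed` value draws a different sample, which may be useful if you want to compare adjustments across several samples.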

If need be, you could consider running OCR on separate parts of your corpus. The *On The Books* team did this because marginalia (text printed in the margins) appear through a significant portion of the corpus but are no longer printed after a certain point. Removing marginalia wasn't necessary once it stopped appearing in the text, so that step could be skipped for the later portion of texts.

### Performing OCR:
Once you're satisfied with the pre-processing on your sample(s), here's where you run the actual OCR. This part of the process may also be iterative:

1. **Apply your chosen pre-processing adjustments** to your entire corpus.
2. **Run OCR** on your entire corpus.
3. **Pull samples from your output to review.** Do you notice any recurring or new errors? If so, you may need to return to pre-processing to assess and address these errors.
4. **Repeat steps 1-3 as needed.** If you have a very large corpus, you may consider running these steps in batches and iterating through each batch.
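The loop described in these steps could look something like the sketch below, which writes one `.txt` file per volume. The function names are hypothetical, and the sketch assumes `.jpg` pages and that Tesseract and PyTesseract are installed. The OCR step is passed in as a function so that it can be swapped out or tested on its own.

```python
from pathlib import Path


def ocr_with_tesseract(image_path):
    """OCR one page image with PyTesseract (imported here so it is only needed when used)."""
    from PIL import Image
    import pytesseract

    return pytesseract.image_to_string(Image.open(image_path), lang="eng")


def ocr_volume(volume_dir, output_dir, ocr_page=ocr_with_tesseract):
    """OCR every .jpg page in volume_dir and write the text to one .txt file."""
    volume_dir = Path(volume_dir)
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    output_path = output_dir / f"{volume_dir.name}.txt"
    with open(output_path, "w", encoding="utf-8") as output_file:
        # Sorting the zero-padded file names keeps pages in scan order.
        for image_path in sorted(volume_dir.glob("*.jpg")):
            output_file.write(ocr_page(image_path))
            output_file.write("\n")
    return output_path
```

For a large corpus, calling `ocr_volume` on one volume (or one batch) at a time makes it easier to review samples of the output before continuing.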

### "Cleaning" OCR:
**This part of the process is often best performed with a combination of manual (human) and automated (computer) steps.** This is where you may be addressing not only errors in the OCR itself but also issues with the original printing, as we describe below with regard to [hyphenated words at the end of lines](#hyphens). As with pre-processing, how complex you make iterations in this phase depends on your corpus and your resources:

1. **Use Python to check for and correct possible spelling errors.** This step should focus on common words and avoid proper nouns. As with any automated step, it's possible that new errors will be introduced here. 

    1. If there is a known and small quantity of proper nouns used in individual texts or across the corpus, and these are consistently "read" incorrectly by Tesseract, it may be possible to use Python to correct these.
    
2. If your corpus is small enough and/or you have a team that can help you, **read through the corpus** to manually check for and correct errors. This may be a moment to correct proper nouns. If you have a team, it may be advisable to have texts read and corrected by multiple team members. It will be important that these team members have access to both inputs and outputs, and perhaps even lists of proper nouns, to be able to compare the original scans with the computer-readable versions. You may even want to set up a process whereby reviewers can flag words they are not sure about so that another reviewer can provide an opinion, with you and/or another project manager making the final decision on uncertain words.
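For the case where a small, known set of words is consistently misread, a correction step might be sketched as follows. The misreadings in the dictionary are invented for illustration; a real list would come from reviewing your own output.

```python
import re


def apply_corrections(text, corrections):
    """Replace whole-word OCR misreadings using a {wrong: right} dictionary."""
    for wrong, right in corrections.items():
        # \b limits the match to whole words so that, for example,
        # correcting "tlie" does not alter a longer word that contains it.
        text = re.sub(rf"\b{re.escape(wrong)}\b", right, text)
    return text


# Hypothetical misreadings; build your own list from reviewing your output.
corrections = {"tlie": "the", "Nortli": "North"}
print(apply_corrections("In tlie State of Nortli Carolina", corrections))
```

Because every replacement is applied across the whole text, it is worth keeping the dictionary short and checking a sample of the results for newly introduced errors.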

The above process could be broken down further to address smaller issues incrementally and iteratively. It may also be useful to break your corpus into units of analysis before or during this process to assist with cleaning. We will discuss the splitting process when we structure our OCR'ed data in the [next module](05-StructuringOCRData.ipynb).

## Preparing Texts for OCR (Pre-Processing)<a class="anchor" id="pre-processing"></a>

Remember all of the questions we considered in the [last module](03-WhatIsOCR.ipynb)? The ones that asked us to think about text format (print/handwritten), format and orientation on the page, legibility, and so on? In this section, we'll walk through the steps that need to be done to address especially **format and orientation on a page and legibility** before performing OCR.

### The Files

#### Which files?

For this tutorial, we've given you only a sample of pages to work with from the 1955 NC session laws, beginning with the first chapter (page 1). You may have noticed that the file name ends with `0057` rather than `0001`. This is because scanned session law volumes often include other content, such as a scan of the outer covers, a table of contents, and the North Carolina state constitution, appearing before or after the laws themselves. Because the *On The Books* team is only interested in the laws themselves, we went through each volume and removed all images that we didn't need. The lesson here: **don't run OCR on content that you don't need.** 

<img src="images/08-ocr-03.jpeg" width="70%" style="padding-top:20px; box-shadow: 25px 25px 20px -30px rgba(0, 0, 0);" alt="Screenshot of the filenames for the 1955 volume, showing that the first page of laws is the file ending in 0057." title="Screenshot of the filenames for the 1955 volume, showing that the first page of laws is the file ending in 0057." />

When you are preparing a corpus for OCR, in addition to becoming familiar with potential issues you see on each page, remove all files that don't need to be OCR'ed. *If you want to keep a copy of the complete un-OCR'ed corpus, make a copy of it **before** deleting image files from the version that you will use to perform OCR.* If you find that there is a consistent number of pages that you want to remove at the beginning and/or end of each corpus, it's possible to use Python to do this. Use caution if you choose this route, though, as it can be hard to undo deletions performed programmatically. At the end of the day *you probably know better than your computer* which files you need and which you don't.
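If you do decide to remove a consistent number of pages programmatically, one cautious approach is to *move* them into a separate folder instead of deleting them. The sketch below assumes `.jpg` pages with zero-padded file names; the function name and parameters are hypothetical.

```python
import shutil
from pathlib import Path


def set_aside_pages(volume_dir, removed_dir, skip_start=0, skip_end=0):
    """Move the first skip_start and last skip_end page images into removed_dir."""
    pages = sorted(Path(volume_dir).glob("*.jpg"))
    to_move = pages[:skip_start]
    if skip_end:
        # Slice from skip_start so no page is selected twice.
        to_move += pages[skip_start:][-skip_end:]
    removed_dir = Path(removed_dir)
    removed_dir.mkdir(parents=True, exist_ok=True)
    for page in to_move:
        # Moving (not deleting) keeps the change easy to undo.
        shutil.move(str(page), str(removed_dir / page.name))
    return [page.name for page in to_move]
```

The returned list of file names can be saved as a record of what was set aside from each volume.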

#### Missing pages? Duplicate pages?

Something else to note about the 1955 laws in particular: The first two files in the scanned volume, ending `0000` and `0001`, happen to be scans of pages in the middle of the volume. These could be included at the beginning for a few reasons: 
- the archivists performing the scan accidentally skipped these pages during the initial scan; or
- the initial scan of these pages was in some way inadequate (perhaps the page was blurred, or the scanner malfunctioned).

<img src="images/08-ocr-04.jpeg" width="70%" style="padding-top:20px; box-shadow: 25px 25px 20px -30px rgba(0, 0, 0);" alt="Screenshot of first files in the 1955 volume, which are pages scanned from the middle of the volume." title="Screenshot of first files in the 1955 volume, which are pages scanned from the middle of the volume." />

Regardless of the reason, now is the right moment to **check whether these pages were indeed skipped** and make a note of this so that you can be sure they are included in the OCR and final dataset. As it happens, these pages were *not* skipped but were duplicates, so we would remove them. If you *don't* find pages like these out of order in your own corpus, either indicated by page numbers in the images themselves or by numbered file names skipping an integer, *it may not be worth your time to check every page* (particularly if you are dealing with thousands of pages). It's possible to identify skipped pages in the data structuring process and add them to the larger dataset. 
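Checking for numbered file names that skip an integer is straightforward to automate. The sketch below assumes file names that end in an underscore, a scan number, and an extension, as in our sample files; the shortened names in the example are made up.

```python
import re


def find_missing_numbers(file_names):
    """Return scan numbers absent from a list of names like 'volume_0057.jpg'."""
    numbers = set()
    for name in file_names:
        # Capture the digits that sit just before the file extension.
        match = re.search(r"_(\d+)\.\w+$", name)
        if match:
            numbers.add(int(match.group(1)))
    if not numbers:
        return []
    expected = range(min(numbers), max(numbers) + 1)
    return [n for n in expected if n not in numbers]


names = ["vol_0057.jpg", "vol_0058.jpg", "vol_0060.jpg"]
print(find_missing_numbers(names))  # [59]
```

Note that this only finds gaps in the *file numbering*. A page that was skipped during scanning without leaving a gap in the numbers would still need to be found by checking the printed page numbers.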

#### A Note on Cropping

Our duplicate pages also point to something else important: **whether you need to crop scanned pages.**

<img src="images/08-ocr-05.jpeg" width="70%" style="padding-top:20px; box-shadow: 25px 25px 20px -30px rgba(0, 0, 0);" alt="Screenshot of the duplicate images of page 592, the left image not cropped and the right image cropped." title="Screenshot of the duplicate images of page 592, the left image not cropped and the right image cropped." />

When documents are scanned, often there is more included in the image than just the document itself: the stand or supports for the document, color calibration targets, rulers, and anything else in close proximity to the document.  Archivists preparing scanned materials for the Internet Archive and other digital repositories may crop out all parts of a scanned image that are *not* part of the document, aiming to create image files of a relatively uniform size.

If your images have not been cropped already, **here are a few resources for learning how to batch crop images:**
- In Python: [this Jupyter Notebook explains how to prepare to crop](https://github.com/UNC-Libraries-data/OnTheBooks/blob/master/examples/marginalia_determination/marginalia_determination.ipynb), and [this Notebook implements the crop along with other adjustments we'll explore further here](https://github.com/UNC-Libraries-data/OnTheBooks/blob/master/examples/adjustment_recommendation/adjRec.ipynb)
- [In Photoshop](https://www.linkedin.com/learning-login/share?forceAccount=false&redirect=https%3A%2F%2Fwww.linkedin.com%2Flearning%2Flearning-photoshop-automation%3Ftrk%3Dshare_ent_url%26shareId%3D9wq0fJRcSEOBjgywFT9gOA%253D%253D&account=42563596) (UNC or LinkedIn Learning log in required)
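For the simplest case, where the same margins can be trimmed from every page, a batch crop with Pillow might be sketched like this. The function names and margin parameters are assumptions for illustration; the notebooks linked above take a more careful, per-page approach.

```python
from pathlib import Path


def crop_box(image_size, left=0, top=0, right=0, bottom=0):
    """Return the (left, upper, right, lower) box that remains after trimming each margin."""
    width, height = image_size
    return (left, top, width - right, height - bottom)


def batch_crop(input_dir, output_dir, **margins):
    """Crop the same margins from every .jpg in input_dir, saving copies to output_dir."""
    from PIL import Image  # imported here so crop_box can be used without Pillow

    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    for path in sorted(Path(input_dir).glob("*.jpg")):
        with Image.open(path) as image:
            cropped = image.crop(crop_box(image.size, **margins))
            # Save to a separate folder so the uncropped originals are kept.
            cropped.save(output_dir / path.name)
```

A call such as `batch_crop("volume", "volume_cropped", left=50, right=50)` would trim 50 pixels from the left and right of each page. Both folder names and the margin values here are placeholders.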

<div class="alert alert-block alert-success">
    <p><strong>If you're doing the scanning yourself or will be working with someone to newly digitize materials,</strong> it's a good idea to carefully plan your scanning process. Every step matters in terms of generating the best possible OCR results. Digital NC have posted their <a href="https://www.digitalnc.org/policies/digitization-guidelines/" alt="Digital NC digitization guidelines">digitization guidelines</a> along with <a href="https://www.digitalnc.org/about/what-we-use-to-digitize-materials/" alt="Digital NC scanning equipment">descriptions of their scanning equipment</a>. These can provide a helpful starting point if you will be beginning your project with undigitized materials.</p>
</div>

### The Text

Now let's take a closer look at the text we'll use to practice performing OCR:

<div class="alert alert-block alert-warning">
    <p>Drawing from <a href="https://onthebooks.lib.unc.edu/laws/all-laws/" alt="On the Books corpus"><em>On The Books</em> corpus</a>, we'll be working with the North Carolina session laws from 1955: <a href="https://archive.org/details/sessionlawsresol1955nort/" alt="1955 NC session laws on the Internet Archive">https://archive.org/details/sessionlawsresol1955nort/</a>.</p>
    <p>Open the 1955 volume in the Internet Archive and skim through, considering the following questions:</p>
    <ul>
        <li>How is the text formatted on the page?</li>
        <li>How is the text oriented on the page?</li>
        <li>Do you notice any pages or sections that might cause an error in the OCR? What are they? What kind of error do you think they might cause?</li>
    </ul>
</div>

There are a few possible considerations with this volume that we'll want to address with pre-processing:
- broken words
- possible noise (e.g. the visible shadow of text from other pages)
- slanted, or skewed, text
- crooked, or rotated, text
- occasional pencil marks on the text

Did you notice any other possible issues? 

### The Code

To better understand why we're going to spend significant time pre-processing, let's look at the very basic OCR process with PyTesseract.

First, we need to run the following cell to finish installing Tesseract. **Wait until the code finishes (it will take 1-2 minutes) and you no longer see a star beside the cell or an hourglass in your browser tab before you continue.**

In [None]:
# Install Tesseract and PyTesseract on Binder.
# The %conda magic runs a conda command from inside the notebook.
# This may take 1-2 minutes.
# Source: Nathan Kelber & JSTOR Labs Constellate team.
%conda install -c conda-forge -y tesseract
%conda install -c conda-forge -y pytesseract

Has the above code finished running? If so, you can now run this code to OCR 1 page of text:

In [None]:
# Import the Image module from the Pillow Library, which will help us access the image.
from PIL import Image

# Import the pytesseract library, which will run the OCR process.
import pytesseract

# Open a specific image file, convert the text in the image to computer-readable text (OCR),
# and then print the results for us to see here.
print(pytesseract.image_to_string(Image.open("sessionlawsresol1955nort_0057.jpg"), lang="eng"))

Looks pretty simple, right? If you haven't run it yet, go ahead. 

We used [this image file](sample/sessionlawsresol1955nort_0057.jpg) to test out the code. Open it up and compare it to the text we just printed. What do you notice about the layout, text format, and characters in the image and text versions?

<img src="images/sessionlawsresol1955nort_0057.jpg" width="30%" style="padding-top:20px; box-shadow: 25px 25px 20px -30px rgba(0, 0, 0);" alt="A test OCR page; the first page of the 1955 North Carolina session laws." title="A test OCR page; the first page of the 1955 North Carolina session laws." />
<br/>

### Pre-Processing

While the above looks promising--and we could go ahead and create a loop to run this code on every image file in the 1955 volume--we always run the risk of introducing errors into our plain text output that can be avoided by some pre-processing steps. So let's pause on the OCR for a bit and look at some of the steps you might need to take into consideration with your own materials. Although Tesseract does an overall good job addressing these issues when they are minor, it may be worth your while to fix any issues you notice *before* running Tesseract to avoid introducing errors into OCR'ed text in the first place:

#### Rescaling
**The higher quality the digitization, the better the OCR**--this is the general rule. "Quality" has a lot to do with the OCR requirements we've already covered as well as those we'll cover below. We can begin, though, with image resolution--that is, the number of pixels per *inch*. Remember that computers present images as a grid of pixels, usually squares but sometimes rectangles, each of which carries specific color information. Put hundreds, thousands, or millions of pixels together, and we have an image. 

<img src="images/07-ocr-01.jpeg" width="70%" style="padding-top:20px; box-shadow: 25px 25px 20px -30px rgba(0, 0, 0);" alt="Screenshot of text stored in an image format from a page of North Carolina laws" title="Screenshot of text stored in an image format from a page of North Carolina laws" />

A common way for computer programmers to measure image quality is by assessing the number of pixels per inch (ppi). This is important for many reasons: a photographer will want to keep their number of pixels high (perhaps 300 ppi) in preparation for printing, but a web designer will want a much lower number of pixels (72 ppi) to keep an image looking crisp while also keeping file sizes small to avoid slowing down webpage loading time. If you've ever opened a webpage and seen text but had to wait a few seconds for images to load, you've seen the difference between how long it takes for text vs. an image to load. The more pixels, the larger the file (in kilobytes, megabytes, or even gigabytes), and large files take longer to move from a server to your computer--add in low-bandwidth internet, and the load time grows even longer. 

So, what's the difference? Let's look:

<div class="row" style="padding-bottom:20px;">
    <div class="column">
<img src="images/08-ocr-06.jpeg" width="40%" style="float:left; padding-top:20px; box-shadow: 25px 25px 20px -30px rgba(0, 0, 0);" alt="An image of the letter S at 72 ppi." title="An image of the letter S at 72 ppi." />
    </div>
    <div class="column">  
<img src="images/08-ocr-07.jpeg" width="39%" style="float:right; padding-top:20px; margin-right:20px; box-shadow: 25px 25px 20px -30px rgba(0, 0, 0);" alt="An image of the letter S at 300 ppi." title="An image of the letter S at 300 ppi." />
    </div>
</div>

<br/>
The image on the left shows a scanned letter S at 72 ppi. The visible squares represent individual pixels. Note that each pixel represents one color from the page, and there is a transition between pixels representing ink and those representing paper. 

The image on the right is the same letter S rescaled to 300 ppi. The squares here appear smaller because there are far more of them. Note that instead of there being only a line 1-2 pixels wide making up the S shape, there are far more--far more for Tesseract to "read" and interpret.

<div class="alert alert-block alert-success">
    <p><a href="https://tesseract-ocr.github.io/tessdoc/ImproveQuality" target="blank">Per its documentation</a>, Tesseract works best with an image resolution of 300 ppi. The documentation actually uses "dpi", or <a href="https://en.wikipedia.org/wiki/Dots_per_inch" target="blank">"dots per inch"</a>. If you're beginning your project by scanning materials, this unit will be important when you set up your scanner, but once you move into image processing, we're dealing with <a href="https://en.wikipedia.org/wiki/Pixel_density" target="blank">pixels per inch</a>. These are not the same, but many people use dpi and ppi interchangeably.</p>
</div>

The images we downloaded from the Internet Archive are 72 ppi--optimized for web viewing. In an ideal world, we'd go back and use the .jp2 files, which are higher resolution. While it's <mark style="background-color:lightblue;">possible</mark> to rescale (increase the ppi of) our sample images, increasing ppi involves using algorithms that can change, in minute ways, an image file. Those minute changes could make a larger impact on OCR results depending on the original image. For this reason, **it's always better to start by creating high resolution image files in the scanning process.** For our purposes here, PyTesseract has already proven that it's working well with the .jpg 72 ppi versions of the Internet Archive files. If you're interested in learning more about rescaling images, though, try this [GeeksForGeeks tutorial](https://www.geeksforgeeks.org/python-pil-image-resize-method/) as a starting point.
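To make the arithmetic concrete, here is a minimal sketch of rescaling from 72 ppi to 300 ppi with Pillow. The function names are hypothetical, and the choice of the LANCZOS resampling filter is an assumption rather than a recommendation from the Tesseract documentation.

```python
def rescaled_size(size, current_ppi=72, target_ppi=300):
    """Return the pixel dimensions needed to move from current_ppi to target_ppi."""
    scale = target_ppi / current_ppi
    width, height = size
    return (round(width * scale), round(height * scale))


def rescale_image(input_path, output_path, current_ppi=72, target_ppi=300):
    """Save an enlarged copy of an image, tagged with the target resolution."""
    from PIL import Image  # imported here so rescaled_size works without Pillow

    with Image.open(input_path) as image:
        new_size = rescaled_size(image.size, current_ppi, target_ppi)
        # LANCZOS is one of Pillow's higher-quality resampling filters; the
        # added pixels are estimated, not recovered from the original page.
        enlarged = image.resize(new_size, Image.LANCZOS)
        enlarged.save(output_path, dpi=(target_ppi, target_ppi))
```

Going from 72 to 300 ppi multiplies each dimension by a little more than 4, so a page that is 720 pixels wide becomes 3,000 pixels wide. The enlarged file will also be considerably bigger on disk.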

#### Rotating & Deskewing
Sometimes, in spite of everyone's best efforts, a document is scanned at a slight angle, either rotated slightly on the scan bed or perhaps photographed at a slight angle, introducing a skew. The result can look like these images:

<div class="row"  style="padding-bottom:20px;">
    <div class="column">
<img src="images/sessionlawsresol1955nort_0057_rotated.jpg" width="40%" style="float:left; padding-top:20px; box-shadow: 25px 25px 20px -30px rgba(0, 0, 0);" alt="A test OCR page that has been rotated; the first page of the 1955 North Carolina session laws." title="A test OCR page that has been rotated; the first page of the 1955 North Carolina session laws." />
    </div>
    <div class="column">  
<img src="images/sessionlawsresol1955nort_0057_skewed.jpg" width="37%" style="padding-left:10px; padding-top:20px; float:right; margin-right: 40px; box-shadow: 25px 25px 20px -30px rgba(0, 0, 0);" alt="A test OCR page that has been skewed; the first page of the 1955 North Carolina session laws." title="A test OCR page that has been skewed; the first page of the 1955 North Carolina session laws." />
    </div>
</div>

**Left:** A rotated version of our sample page. **Right:** A skewed version of our sample page.

Let's run these through PyTesseract to see how it handles them:

In [None]:
# OCR the rotated page.
print(pytesseract.image_to_string(Image.open('images/sessionlawsresol1955nort_0057_rotated.jpg')))

How do the OCR results for the rotated page compare to our attempt with the original (unrotated) file? 

Now let's try the skewed version:

In [None]:
# OCR the skewed page.
print(pytesseract.image_to_string(Image.open('images/sessionlawsresol1955nort_0057_skewed.jpg')))

Look at this output carefully. The skewed page might appear to be largely similar to the original (non-skewed) page we tried first. Where are there errors, though? What kinds of problems might these errors cause for later analysis?

**Why do errors occur when reading a rotated or skewed text?** Tesseract has been programmed to expect to "read" a language in the same way a human would. We read English left to right and from the top of a page down. Although we are able to parse text even when viewing it at an angle (maybe you can even read text upside down), Tesseract doesn't do this well. It will still attempt to read a rotated line from left to right--it won't know to follow the text as it slants down or up. So it returns its interpretation of the letters that fall within its line of "sight." This is why, particularly with rotated texts, we may receive symbols and other unexpected characters.
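One simple way to estimate a small rotation is to try a range of angles and keep the one that makes the lines of text look most horizontal. The sketch below does this by counting dark pixels in each row of the image: when text is straight, rows alternate sharply between "lots of ink" and "almost none." This is a brute-force illustration with hypothetical function names and default values. It only addresses rotation, not skew, and it will be slow on large images.

```python
def profile_score(ink_per_row):
    """Score how sharply ink alternates between text lines and blank gaps."""
    return sum((a - b) ** 2 for a, b in zip(ink_per_row, ink_per_row[1:]))


def count_ink_per_row(image, dark_below=128):
    """Count dark pixels in each pixel row of a Pillow image."""
    gray = image.convert("L")
    width, height = gray.size
    data = gray.tobytes()  # one brightness byte (0-255) per pixel, row by row
    return [
        sum(1 for value in data[y * width:(y + 1) * width] if value < dark_below)
        for y in range(height)
    ]


def estimate_correction(image, max_angle=5.0, step=0.5):
    """Try small rotations and return the angle whose rows look straightest."""
    gray = image.convert("L")
    steps = int(max_angle / step)
    best_angle, best_score = 0.0, -1
    for i in range(-steps, steps + 1):
        angle = i * step
        # Fill the corners exposed by rotating with white (255), like paper.
        rotated = gray.rotate(angle, fillcolor=255)
        score = profile_score(count_ink_per_row(rotated))
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle
```

After opening a page with `Image.open`, the angle returned by `estimate_correction(image)` can be passed to `image.rotate(angle, fillcolor="white")` to produce a straightened copy, which you can then run through PyTesseract and compare with the earlier output.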

#### Removing Noise
Images can't produce sound, but they can still have *noise*. In an image, noise is a **random variation in brightness or color**. Let's look again at our S from earlier:

<img src="images/08-ocr-06.jpeg" width="51%" style="padding-top:20px; box-shadow: 25px 25px 20px -30px rgba(0, 0, 0);" alt="An image of the letter S at 72 ppi." title="An image of the letter S at 72 ppi." />

The pixels surrounding the S represent the color of the paper the S was printed on. The pixels are not all one color, though. That variation is noise. In these images, the noise has already been minimized in the scanning process: if you open one of the images and zoom in, you may notice that blank page space surrounding text appears to have many pixels that are close to the same color. 

**Tesseract removes noise on its own, but this process can also introduce errors in images that have a high amount of noise.** If you want to learn more about noise and removing it using Python, [here's a good place to start](https://opencv-python-tutroals.readthedocs.io/en/latest/py_tutorials/py_photo/py_non_local_means/py_non_local_means.html). 
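To put a rough number on "variation," you could measure how much the brightness values differ within a patch of the page that should be blank. The sketch below uses the standard deviation as that measure; the function names are hypothetical, and the patch coordinates are something you would choose by looking at your own image.

```python
import statistics


def noise_level(brightness_values):
    """Return the standard deviation of brightness values (0 means no variation)."""
    return statistics.pstdev(brightness_values)


def blank_patch_noise(image_path, box):
    """Measure noise in a region of the page that should be blank paper."""
    from PIL import Image  # imported here so noise_level works without Pillow

    with Image.open(image_path) as image:
        # box is (left, top, right, bottom) in pixels.
        patch = image.convert("L").crop(box)
        # tobytes() yields one brightness value (0-255) per pixel.
        return noise_level(list(patch.tobytes()))
```

A higher value suggests a noisier scan. If you want to experiment with smoothing, Pillow's `ImageFilter.MedianFilter` (applied with `image.filter(ImageFilter.MedianFilter(size=3))`) is one simple option, though it can also soften fine details in small text.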


#### Inverting & Binarizing
Early OCR programs required light text on dark backgrounds to operate correctly. In recent years, many OCR programs have moved to preferring dark text on light backgrounds. This means that **inversion** is typically not an issue historians need to worry about since most printed documents are dark text on light background. There might be some exceptions to this if you are working with, for example, images of microfiche. 

Just for fun, let's see [how Python handles image inversion](https://pillow.readthedocs.io/en/latest/reference/ImageOps.html#PIL.ImageOps.invert) (and you can use this if you ever do need to OCR microfiche):

In [None]:
# Import the modules we need from the PIL library.
from PIL import Image
from PIL import ImageOps

# Open the original image file.
file = Image.open("sessionlawsresol1955nort_0057.jpg")

# Use the ImageOps.invert function to invert the colors in the original file.
inverted_file = ImageOps.invert(file)

# Save the newly inverted image file.
inverted_file.save("sessionlawsresol1955nort_0057_inverted.jpg")

You'll find the result saved in the [same folder as this Notebook](sessionlawsresol1955nort_0057_inverted.jpg), or you can preview it below:

<img src="images/sessionlawsresol1955nort_0057_inverted.jpg" width="40%" style="padding-top:20px; margin-bottom:20px; box-shadow: 25px 25px 20px -30px rgba(0, 0, 0);" alt="Inverted version of page 57 from the 1955 North Carolina Session Laws." title="Inverted version of page 57 from the 1955 North Carolina Session Laws." />

Where inversion switches the colors in an image file, **binarization converts an image so that it shows image data in only two pixel "colors": black and white**. Tesseract does this as part of its OCR process, but it might be worthwhile to do this ahead of time if you're trying to reduce noise or see where, for example, a shadow on a page may introduce problems for Tesseract. A step in that direction that can help is converting images to grayscale:

In [None]:
# Import the modules we need from the PIL library.
from PIL import Image
from PIL import ImageOps

# Open the original image file.
file = Image.open("sessionlawsresol1955nort_0057.jpg")

# Use the ImageOps.grayscale function to convert the original file to grayscale.
grayscale_file = ImageOps.grayscale(file)

# Save the new grayscale image file.
grayscale_file.save("sessionlawsresol1955nort_0057_grayscale.jpg")

The result ([again saved with this Notebook](sessionlawsresol1955nort_0057_grayscale.jpg)) in our case is less exciting because our image already had dark ink on a light background. Nonetheless, it helps us see more clearly where there might be variations in background color that will interfere with OCR. In this case, a smudge in the top right corner, far from the text, has become much more visible, as has the shadow near the gutter (where the page meets the volume's binding). Gutter shadows can cause problems if there is little margin between the gutter and page text. Tesseract has an [example of a page](https://tesseract-ocr.github.io/tessdoc/ImproveQuality#binarisation) that demonstrates much more dramatically how binarization can reveal shadows on a page.

<img src="images/sessionlawsresol1955nort_0057_grayscale.jpg" width="40%" style="padding-top:20px; margin-bottom:20px; box-shadow: 25px 25px 20px -30px rgba(0, 0, 0);" alt="Grayscale version of page 57 from the 1955 North Carolina Session Laws." title="Grayscale version of page 57 from the 1955 North Carolina Session Laws." />
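Going one step further than grayscale, a basic binarization can be sketched with a fixed threshold: every pixel at or above the threshold becomes white, and every pixel below it becomes black. The threshold of 128 and the function names are assumptions for illustration. Tesseract's own binarization is more sophisticated than this.

```python
def black_or_white(value, threshold=128):
    """Map one grayscale value (0-255) to pure black (0) or pure white (255)."""
    return 255 if value >= threshold else 0


def binarize(image, threshold=128):
    """Return a two-tone copy of a Pillow image."""
    gray = image.convert("L")
    # point() applies the function to every possible brightness value
    # and uses the results to recolor each pixel.
    return gray.point(lambda value: black_or_white(value, threshold))
```

After opening a page with `Image.open`, you could save the result with something like `binarize(image).save("page_binarized.png")` (a hypothetical file name). Saving as .png avoids the in-between grays that .jpg compression can reintroduce. Try a few different thresholds: too low and faint text disappears, too high and shadows turn solid black.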

#### Dilation and erosion
Next, there is dilation and erosion. As we can see in our sample page, printers often varied font thickness when they set the text type. Bold might be used for headings while thinner fonts might be used for smaller text. Depending on the print quality, bolded text might have additional ink around it, while thinner text might not have enough ink. Variation in ink thickness can throw Tesseract off, so **eroding bolded text** (making it thinner) and **dilating very thin text** (adding thickness) can help address this issue.

Performing erosion and dilation in Python requires some additional understanding of image processing. We won't cover it here (and our samples don't need it!), but [this GeeksForGeeks tutorial](https://www.geeksforgeeks.org/erosion-dilation-images-using-opencv-python/) explains the basics and provides sample code.
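To get a feel for what these two operations do before turning to an image library, here is a toy illustration on a tiny grid of numbers, where 1 stands for ink and 0 for paper. This is not how you would process real page images (the tutorial linked above uses OpenCV for that); it only demonstrates the idea.

```python
def _neighborhood_filter(grid, pick):
    """Replace each cell with pick() of the 3x3 block of cells around it."""
    height, width = len(grid), len(grid[0])
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            block = [
                grid[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
            ]
            row.append(pick(block))
        result.append(row)
    return result


def dilate(grid):
    """Thicken ink: a cell becomes ink (1) if any neighbor is ink."""
    return _neighborhood_filter(grid, max)


def erode(grid):
    """Thin ink: a cell stays ink (1) only if every neighbor is ink."""
    return _neighborhood_filter(grid, min)


# A stroke one pixel wide, where 1 is ink and 0 is paper.
thin_stroke = [[0, 0, 1, 0, 0] for _ in range(5)]
for row in dilate(thin_stroke):
    print(row)  # each row prints [0, 1, 1, 1, 0]
```

Dilating the one-pixel stroke makes it three pixels wide, and eroding that result brings it back to one pixel. Eroding the original thin stroke, on the other hand, erases it entirely, which is why these adjustments need to be matched carefully to the text on the page.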

#### Identifying Layout & Text Order

There are many instances when we might be working with printed documents that have text arranged in a variety of ways--not just in a single column or orientation on the page.

While Tesseract does have tools for estimating a document's orientation, on its own it is not well equipped to identify text order, recognize images, or understand arrangement of text and images on a page--tasks that many refer to as ["document layout analysis"](https://en.wikipedia.org/wiki/Document_layout_analysis) or "page layout analysis." These analyses need to be performed before running Tesseract in order to provide it with the correct ordering and layout information--or, rather, in order to focus its attention on specific parts of a document in a specific sequence. Here is an overview of that workflow:

1. Identify the areas on a page that you want Tesseract to focus on. It may be that you want to include only *some* parts of a page and not others. Consider whether this area might be similar or different on different documents.

<img src="images/chronam_daybook_19151112_pellagra_full.jpg" width="40%" style="padding-top:20px; margin-bottom:20px; box-shadow: 25px 25px 20px -30px rgba(0, 0, 0);" alt="Newspaper page showing text in two columns with a vertical line separating them and an image in the middle of the right column. Source: Chronicling America" title="Newspaper page showing text in two columns with a vertical line separating them and an image in the middle of the right column. Source: Chronicling America" />


2. For each page, calculate the area that you want to *include* in the OCR. To do this, use pixel positions as Cartesian (x, y) coordinates to mark out an area's corners. The outlines created by these coordinates are referred to as "bounding boxes" and may include as much or as little text as needed. This may be automated in a variety of ways using Python but may need some human intervention.

<img src="images/chronam_daybook_19151112_pellagra_full_bboxes.png" width="40%" style="padding-top:20px; margin-bottom:20px; box-shadow: 25px 25px 20px -30px rgba(0, 0, 0);" alt="Newspaper page showing text in two columns with a vertical line separating them and an image in the middle of the right column. Each column is annotated with a pink bounding box to indicate specific, separate areas of text. Source: Chronicling America" title="Newspaper page showing text in two columns with a vertical line separating them and an image in the middle of the right column. Each column is annotated with a pink bounding box to indicate specific, separate areas of text. Source: Chronicling America" />


3. Create a dataset of all of the bounding boxes on each page. To do this, you may need to specify particular features about the document, such as whether columns are separated by a vertical line or blank space.


4. Feed these bounding boxes and the content within them in their "reading" order to Tesseract for OCRing.

Here is [an example of how *On The Books* did this to exclude marginalia from its OCR](https://github.com/UNC-Libraries-data/OnTheBooks/blob/master/examples/marginalia_determination/marginalia_determination.ipynb).

There are a variety of tools you might use to do this. *On The Books* used the [Pillow](https://pypi.org/project/Pillow/) and [NumPy](https://numpy.org/) Python libraries. [OpenCV](https://opencv.org/)'s computer vision tools can also be used for this, as can tools such as [Kraken](http://kraken.re/) and [OCRopus](https://ocropus.github.io/). The [Coursera course](https://www.coursera.org/learn/python-project) demonstrates how to do this with OpenCV and Kraken.
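
As a rough illustration of steps 3 and 4 in the workflow above, the sketch below puts a small set of bounding boxes into reading order for a two-column page. Everything in it is an assumption made for the example: the boxes are hypothetical `(left, top, right, bottom)` pixel tuples, the page is assumed to have exactly two columns split at its horizontal midpoint, and `reading_order` is our own helper, not part of Tesseract or any library.

```python
def reading_order(boxes, page_width):
    """Sort (left, top, right, bottom) boxes left column first, then top to bottom."""
    midpoint = page_width / 2

    def sort_key(box):
        left, top, right, bottom = box
        # Assign the box to column 0 (left) or 1 (right) by its horizontal center.
        column = 0 if (left + right) / 2 < midpoint else 1
        return (column, top)

    return sorted(boxes, key=sort_key)


# Hypothetical boxes for a 1000-pixel-wide page with an image in the right column.
boxes = [
    (520, 900, 980, 1400),  # right column, below the image
    (20, 100, 480, 1400),   # left column
    (520, 100, 980, 500),   # right column, above the image
]

ordered = reading_order(boxes, page_width=1000)
for box in ordered:
    print(box)
```

Each box in `ordered` could then be cut out of the page image (Pillow's `Image.crop` accepts a tuple in this same left, upper, right, lower order) and passed to Tesseract one at a time. Real pages often need more careful rules than a simple midpoint split.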

## Performing OCR<a class="anchor" id="performing-ocr"></a>

Now that we have an understanding of pre-processing steps and their role in reducing OCR errors, let's return to our original sample to get a sense of where we might need to make adjustments. This time, we'll run the code on our full ten-page sample:

In [None]:
# Import PyTesseract and PIL, an image processing library used by PyTesseract, to complete the OCR.
from PIL import Image
import pytesseract

# Import os, a module for file management.
import os

# Import re, a module that we can use to search text.
import re

# Import glob, a module that helps with file management.
import glob

# Open the file folder where our sample pages are stored.
# Look only for the files ending with the ".jpg" file extension.
sampleFilePath = glob.glob("sample/*.jpg")

# Name the output folder where our OCR'ed text files will be stored.
outDir = "sample_output"
newDir = os.path.normpath(outDir)

# If you're running this script a second or third time, the sample_output folder will already exist. 
# The following statement checks whether it already exists and then creates the
# sample_output folder only if it doesn't (i.e. if os.path.exists returns False).
if not os.path.exists(newDir):
    os.mkdir(newDir)

# Adding a "/" after newDir ("sample_output") makes it into a file path that
# we'll use to move our output file to the correct folder later in this script.
newDir = newDir + "/"
    
# For each file in the sample folder:
for file in sampleFilePath:
    
    # Open a file.
    with open(file, 'rb') as inputFile:
        
        # Read the file using PIL's Image module.
        img = Image.open(inputFile)
    
        # Run OCR on the open file.
        ocrText = pytesseract.image_to_string(img)
        
        # Get the file name -- without its folder or extension -- to use when we
        # name the output file. os.path.basename drops the folder ("sample/"), and
        # os.path.splitext separates the name from its ".jpg" extension.
        baseName = os.path.splitext(os.path.basename(file))[0]
        
        # We want to store our text output files in a different folder so that we can use 
        # them in future without altering the original image files. Joining the output 
        # folder (newDir, "sample_output/") with the base name sets the final 
        # destination for our new text file.
        fileName = os.path.join(newDir, baseName)

        # Create and open a text file, name it to match its input file,
        # and write the OCR'ed text to the file.
        with open(fileName + ".txt", "w") as outFile:
            outFile.write(ocrText)
        
        print(fileName + ".txt", "successfully created.")
    
    # Loop back to check for another image file, run OCR on that file, 
    # and write its OCR to a new output file. When no more files remain,
    # this loop will end, and the script will be finished.

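A detail worth checking in any script like this is how output file names are derived from input paths. The sketch below, which uses only Python's built-in `os` module, builds an output path from an image path and contrasts that with `str.strip`, which removes a *set of characters* from both ends of a string rather than a suffix. The helper `output_text_path` and the file name `map_pg.jpg` are hypothetical and chosen only to show the difference.

```python
import os


def output_text_path(image_path, output_folder):
    """Build the path of the .txt file that corresponds to an image file."""
    base_name = os.path.splitext(os.path.basename(image_path))[0]
    return os.path.join(output_folder, base_name + ".txt")


# A hypothetical file whose name ends in the letters "p" and "g" before its extension.
image_path = os.path.join("sample", "map_pg.jpg")

# splitext removes only the extension, so the full base name is kept.
print(output_text_path(image_path, "sample_output"))

# strip(".jpg") removes any of the characters ".", "j", "p", "g" from both ends,
# so it also eats the trailing "pg" of the name itself.
print(os.path.basename(image_path).strip(".jpg"))
```

The file names in our sample end in digits, so both approaches happen to give the same result there, but `os.path.splitext` is the safer habit for a larger corpus.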
### Review the OCR output.

In your Finder or File Explorer, locate the ["sample_output" folder](sample_output) accompanying this tutorial, and take a look at the text files it should now contain. (Note that we have included these files with the tutorial in case you run into trouble running the script above.) Compare them to the .jpg image files in the "sample" folder. What do you notice?

.

.

.

.

Did you look yet?

.

.

.

.

.

Now that you've looked at all of the files, look more closely at `sessionlawsresol1955nort_0057.jpg` and `sessionlawsresol1955nort_0057.txt`.

### Checking for misspellings <a class="anchor" id="hyphens"></a>

Although it appears that this page has been entirely correctly OCR'ed, there are two issues that show up in this text file that we want to address in all of our OCR'ed files:

1. The original printers **broke words at the end of some lines**. For example, `Dis-trict` and, at the very end of the page, `twenty-`. How do we deal with this without removing words that *should* be hyphenated?


2. **How would we know how accurate this simple script might be when applied to the entire volume, or to the entire corpus?** 

In addition to being hyphenated, `Dis-trict` may be misspelled as `Dis-triet` in our output -- is this just one instance, or does this error recur? If it's recurring, we can use Python to fix it across the corpus. This could be more efficient than having to read the entire OCR'ed corpus. A good starting point is to get a sense of just how accurate the OCR process has been, that is, to **check its readability**, before we start trying to identify and fix spelling errors.
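
To see why the first issue is tricky, consider this small sketch of the simplest fix: deleting every hyphen that is immediately followed by a line break. The sample string is made up for illustration, and `join_line_break_hyphens` is our own helper name.

```python
def join_line_break_hyphens(text):
    """Remove each hyphen that sits directly before a newline, joining the two word halves."""
    return text.replace("-\n", "")


# A made-up snippet with one word broken across lines ("Dis-trict")
# and one word that should keep its hyphen ("twenty-five").
sample = "the Dis-\ntrict of twenty-\nfive counties"

joined = join_line_break_hyphens(sample)
print(joined)
```

The broken word is repaired, but the legitimately hyphenated word loses its hyphen too. A simple replacement cannot tell the two cases apart, so it trades one kind of error for another.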

***Note:*** *The errors you see when you run these scripts may vary depending on the version of Tesseract you are using. At the time of writing, these modules rely on Tesseract version 4.1.1. If you see different errors, apply the concepts in this section to different errors and/or documents.*

**In the following script (broken into multiple chunks) we'll check for OCR accuracy by generating a readability score.** During this process, we'll remove the hyphens at the end of lines to help us with spellchecking, but we may find that we introduce new issues for the spellcheck:

In [None]:
# To begin, there are a number of modules and libraries we need to import
# to extend Python's functionality:

# Import PyTesseract and PIL, an image processing library used by PyTesseract, to complete the OCR.
from PIL import Image
import pytesseract

# Import os, a module for file management.
import os

# Import re, a module that we can use to search text.
import re

# Import glob, a module that helps with file management.
import glob

# Import the SpellChecker module, which we'll use to look for likely misspelled words.
from spellchecker import SpellChecker

# Import the word_tokenize function from the nltk ("Natural Language Toolkit") library.
# NLTK is a powerful toolset we can use to manipulate and analyze text data.
from nltk import word_tokenize

# We'll also need the pandas library, which is a powerful toolset for managing data.
# We'll learn more about pandas in the exploratory analysis modules.
import pandas

# This statement confirms that the above code was run without issue.
print("Modules & libraries imported. Ready for the next step.")

In [None]:
# Now we'll set up variables that we'll use to give Python 
# information and structure information that Python returns.
# These include the location of the original image files and the
# place we want to store our OCR'ed text, as well as a spellcheck
# dictionary and a dataframe (essentially, an empty table) we'll use 
# to structure readability information along with the OCR'ed text.

# Open the file folder where our sample pages are stored.
# Look only for files ending with the ".jpg" file extension.
sampleFilePath = glob.glob("sample/*.jpg")

# Before we loop through each page, we'll augment our spellchecker 
# dictionary to include place names specific to North Carolina. 
# Our script for gathering these place names is available here: 
# https://github.com/UNC-Libraries-data/OnTheBooks/blob/master/examples/adjustment_recommendation/geonames.py

# Load the spellchecker dictionary.
spell = SpellChecker()

# Add the place name words from the "geonames.txt" file to the 
# spellchecker dictionary.
spell.word_frequency.load_text_file("geonames.txt")

# We'll use Pandas to create a dataframe (an empty table--
# explained further in the next tutorial!) that can hold 
# information about an OCR'ed page and display it in a tabular format.
# This dataframe will start out empty with only its column headers 
# defined. We'll add information to it one page at a time. So each
# row will represent 1 page.
df = pandas.DataFrame(columns=["file_name","token_count","unknown_count","readability","unknown_words","text"])

# This statement confirms that the above code was run without issue.
print("Variables created. Ready for the next step.")

In [None]:
# Now we'll remove hyphens from the text and run the spellcheck script.

# For each file in the sample folder:
for file in sampleFilePath:
    
    # Open a file.
    with open(file, 'rb') as inputFile:
        
        # Get the file name--without its folder path-- 
        # to identify this page in our dataframe.
        fileName = os.path.split(file)[1]
        
        # Read the file using PIL's Image module.
        img = Image.open(inputFile)
    
        # Run OCR on the open file.
        ocrText = pytesseract.image_to_string(img)
        
        # Join hyphenated words that are split between lines by 
        # looking for a hyphen followed by a newline character: "-\n"
            # "\n" is an "escape character" and represents the 
            # "newline," a character that is usually invisible 
            # to human readers but that computers use to mark the 
            # end/beginning of a line. Each time you press the 
            # Enter/Return key on your keyboard, an invisible "\n" 
            # is created to mark the beginning of a new line.
        ocrText = ocrText.replace("-\n","")
        
        # Now we'll check spellings!
        
        # First, we'll use NLTK to "tokenize" text. 
            # "Tokenize" here means to take a page of our OCR'ed text,
            # which Python is currently reading as one big glob of data,
            # and separate each word out so that it can be read as an
            # individual piece of data within a larger data structure 
            # (a list). Punctuation marks become separate tokens.
        tokens = word_tokenize(ocrText)
        
        # Next, we'll keep only the tokens made up entirely of letters 
        # (which drops punctuation and numbers) and convert them to 
        # lowercase because the spellcheck dictionary is in all 
        # lowercase.
        tokens = [token.lower() for token in tokens if token.isalpha()]
        
        # We'll make sure that our text data complies with a universal 
        # text format so that all characters in the data and the 
        # spellchecker can be matched.
        tokens = [token.encode("utf-8", errors = "replace") for token in tokens]
        
        # Now we can get all of the words that don't match the 
        # spellchecker dictionary or our list of place names--
        # these are the potential spelling errors.
        unknown = spell.unknown(tokens)
        
        # Let's use a little math to find out how many potential 
        # spelling errors were identified. As part of this process, 
        # we'll create a "readability" score that will give us a 
        # percentage of how readable each file is--how much of the 
        # OCR'ed text is "correct."
        
        # If the number of unknown tokens (words) is greater than 0 
        # (i.e. if the set of unknown tokens is not empty):
        if len(unknown) != 0:
            
            # Following order of operations, here's what's happening 
            # in the readability variable below:
            # 1. Divide the number of unknown tokens (len(unknown)) 
            #    by the total number of tokens on the page
            #    (len(tokens)). Use "float" to specify that Python
            #    returns a decimal number:
            #        float(len(unknown))/float(len(tokens))
            # 2. Multiply the number from step 1 by 100.
            #        float(len(unknown))/float(len(tokens)) * 100
            # 3. Subtract the number from step 2 from 100.
            #        100 - (float(len(unknown))/float(len(tokens)) * 100)
            # 4. Round the number from step 3 to 2 decimal places.
            #        round(100 - (float(len(unknown))/float(len(tokens)) * 100), 2)
            readability = round(100 - (float(len(unknown))/float(len(tokens)) * 100), 2)
        
        # If there are no unknown tokens, then readability is 100!
        else:
            readability = 100
    
        # Let's create a record of the readability information 
        # for this page that we'll add to the dataframe. 
        # The following is a Python dictionary, another way of 
        # storing data. Each word or phrase to the left of the : is a
        # "key" -- think of it as a column header. Each piece of 
        # information to the right is a "value" -- information 
        # written in a single cell below each header. 
        # Altogether, this dictionary represents 1 row ("imgRecord") 
        # in a table (or dataframe).
        imgRecord = {
                "file_name" : fileName,
                "token_count" : len(tokens),
                "unknown_count" : len(unknown),
                "readability" : readability,
                "unknown_words" : list(unknown),
                "text" : ocrText
                }
        
        # Here's where we'll add all the information we gathered in 
        # imgRecord as a row in our dataframe.
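        # We wrap imgRecord in a one-row dataframe and join it to df 
        # with pandas.concat.
        df = pandas.concat([df, pandas.DataFrame([imgRecord])], ignore_index=True)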

        
        # This statement lets us know if a page has been successfully 
        # checked for readability.
        print(fileName, "checked for readability.")
    
# This time, instead of creating individual .txt files for each page,
# we're going to save all of the OCR'ed text and readability 
# information to a single .csv ("comma separated value") file. 
# We can view this file format as a table. Having everything stored 
# like this will help us with clean up and future analysis.
df.to_csv(r'sample_output/sample_output_spellchecked.csv', header=True, index=True, sep=',')

# We have the data stored in a file now, but we can also 
# preview it here:
df

<div class="alert alert-block alert-warning">
    <strong>Take a look at the data preview above.</strong>
    <ul>
        <li>Can you identify what each of the columns represents? Which columns are you unsure of?</li>
        <li>How do you interpret the readability column?</li>
        <li>What do you notice about the unknown words column?</li>
        <li>What do you notice about the text column?</li>
    </ul>
</div>

Open [sample_output_spellchecked.csv](sample_output/sample_output_spellchecked.csv) to view the full dataset. You'll find it in the sample_output folder. **Let's take a look at each column:**

- **file_name**: The name for the corresponding image file. For now, this is the only information in the table that identifies where the rest of the information in each row comes from (which page).
- **token_count**: The total number of tokens (words) found in each page.
- **unknown_count**: The number of unknown ("misspelled") words found in each page.
- **readability**: Think of this as the percentage of the page that was readable. It is calculated as 100 minus the unknown count expressed as a percentage of the token count.
- **unknown_words**: A list of tokens (words or in some cases characters) that were not listed in the spellchecker.
- **text**: The OCR'ed text output from each page. The output here includes all <a href="https://en.wikipedia.org/wiki/Escape_character#JavaScript" target="blank">escape characters</a>, so it may look as if a lot of erroneous characters have been added. In the [next tutorial](05-StructuringOCRData.ipynb), we'll see how including these in our OCR'ed text can be useful.
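
As a worked example of how the token count, unknown count, and readability columns relate, here is a minimal sketch that reproduces the calculation without Tesseract or the spellchecker. The `known_words` set is a tiny, made-up stand-in for the spellchecker dictionary, the token list is invented, and `readability_score` is our own helper name.

```python
def readability_score(tokens, known_words):
    """Return (score, unknown): 100 minus the unknown words as a percentage of all tokens."""
    # Collect the distinct tokens that are missing from the dictionary.
    unknown = {token for token in tokens if token not in known_words}
    if not unknown:
        return 100, unknown
    score = round(100 - (len(unknown) / len(tokens) * 100), 2)
    return score, unknown


# A made-up dictionary and a made-up page of ten lowercase tokens.
known_words = {"the", "general", "assembly", "of", "north", "carolina", "enact"}
tokens = ["the", "general", "assembly", "of", "north",
          "carolina", "ch", "ch", "distriet", "enact"]

score, unknown = readability_score(tokens, known_words)
print("token_count:", len(tokens))
print("unknown_count:", len(unknown))
print("readability:", score)
```

Here two distinct unknown words out of ten tokens give a readability of 80. Because the unknown words are collected into a set, the repeated `ch` counts only once, so a page with one error repeated many times can still score highly.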

**Let's consider what these columns can tell us:**
- The number of unknown words in each page is low (3 max.), and the readability score for each is near 100. This means that *on each page there are only a few errors that need to be addressed.*
- The list of unknown words shows that there are some errors that repeat on all or most pages. These include `b` and `ch`. A closer look at the text column shows that `b` is part of the escape character formatting that's been added. Meanwhile `ch` is likely the abbreviation for `chapter` that occurs frequently throughout the text. These "errors" can be ignored.
- What else do the columns tell you?
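
One way to check which unknown words recur across pages is to count how many pages each one appears on. The sketch below does this with Python's built-in `collections.Counter`. The lists in `unknown_words_by_page` are invented to resemble the unknown words column, and `pages_per_unknown_word` is our own helper name.

```python
from collections import Counter


def pages_per_unknown_word(unknown_words_by_page):
    """Count, for each unknown word, the number of pages on which it appears."""
    counts = Counter()
    for words in unknown_words_by_page:
        # set() ensures a word is counted at most once per page.
        counts.update(set(words))
    return counts


# Invented unknown-word lists for three pages.
unknown_words_by_page = [
    ["ch", "b", "triet"],
    ["ch", "b"],
    ["b", "ch", "distriet"],
]

counts = pages_per_unknown_word(unknown_words_by_page)

# Words that show up on every page are likely systematic rather than one-off errors.
recurring = sorted(word for word, n in counts.items() if n == len(unknown_words_by_page))
print(recurring)
```

Words that appear on every page are good candidates either for adding to the spellchecker dictionary or for a corpus-wide correction, while words that appear once may need a closer look at the page image.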

<div class="alert alert-block alert-warning">
    <p><strong>A note on checking spelling:</strong> Python and the spellchecker module use a list of words that <em>we</em> provide to check what is "correct" or "incorrect" spelling. So the spellcheck process is only as good as the information we provide. If you are working with text that includes abbreviations, non-English words, words written in dialect, or words that are in any other way not "standard," it's important to include them. This may mean taking time to put together a list of words to add to the spellchecker. In some cases, though, this may not be practical--particularly for texts with a lot of dialog written in dialect, or for non-standard spellings. In these cases, you may need to check spellings manually. Here is an example where it is not the <em>algorithm</em> but the <em>data</em> provided <em>to</em> the algorithm that can produce invalid assessments of a text's "readability."</p>
    <p><em>What does it mean that these spellcheck tools have been created to prioritize standardized modern language?</em></p>
</div>

As an example, let's look again at `ch` in the text column. Note that it appears at the beginning of most pages. In some cases, it's written `cu` instead of `ch`, but our spellchecker didn't recognize that as an error. Furthermore, the alternating upper and lowercase characters in `session` were ignored because we changed all words to lowercase for the spellcheck. This should tell us that, although overall readability is high from a computer's perspective, there may be some issues not identified in this process that we need to address in order to make this dataset more *human* readable. We'll look at some ways to better address *human readability* in the [next tutorial](05-StructuringOCRData.ipynb).
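
Irregular capitalization is one kind of issue that lowercasing hides. As a small illustration, the sketch below flags tokens that mix upper and lowercase letters mid-word, using only Python's built-in string methods. The token list is invented, and `has_irregular_case` is our own helper name, not part of the spellchecker.

```python
def has_irregular_case(token):
    """Return True for tokens such as "SeSsIoN" that are not all-lowercase,
    all-uppercase, or capitalized like a title."""
    return not (token.islower() or token.isupper() or token.istitle())


# Invented tokens resembling a page heading.
tokens = ["SESSION", "Laws", "of", "SeSsIoN", "chapter"]

flagged = [token for token in tokens if has_irregular_case(token)]
print(flagged)
```

A check like this would need to run on the tokens *before* they are converted to lowercase, and it would also flag legitimate mixed-case words (for example, some surnames), so its results are a starting point for review rather than a list of errors.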

<div class="alert alert-block alert-success">
    <strong>Review:</strong> 
    <p>In this module, we covered the basic steps for performing OCR, including how to prepare for OCR, how to run a basic OCR script, and how to assess the readability, or "accuracy," of your OCR output. But this is just the beginning in terms of producing a fully digitized text dataset that we can use for analysis. In the next tutorial, we'll take a look at a few techniques for "cleaning" our OCR'ed text in preparation for analysis.</p>
</div>

## Resources <a class="anchor" id="resources"></a>

### OCR Research & Evaluation

- [Improving the quality of the output.](https://tesseract-ocr.github.io/tessdoc/ImproveQuality) *Tesseract documentation.*
- Rebecca Bakker. ["OCR for Digital Collections."](https://digitalcommons.fiu.edu/cgi/viewcontent.cgi?article=1047&context=glworks) *FIU Digital Commons.*
- Ryan Baumann. ["Automatic evaluation of OCR quality."](https://ryanfb.github.io/etc/2015/03/16/automatic_evaluation_of_ocr_quality.html) */etc.*
- Brandon W. Hawk. ["OCR and Medieval Manuscripts: Establishing a Baseline."](https://brandonwhawk.net/2015/04/20/ocr-and-medieval-manuscripts-establishing-a-baseline/) *Brandon W. Hawk.* (This post is a comparison of ABBYY FineReader & Adobe Acrobat OCR technologies as applied to medieval texts.)
- Ray Smith. ["An Overview of the Tesseract OCR Engine."](https://research.google/pubs/pub33418/) *Google Research.*
- Simon Tanner. ["Deciding whether Optical Character Recognition is feasible."](https://www.kb.nl/sites/default/files/docs/OCRFeasibility_final.pdf) *King's Digital Consultancy Services.*

### Tutorials

*The following is a list of tutorials that include different scholars' approaches to OCR. Some also use Tesseract, but most use different scripting or programming languages. There is no single best way to do OCR, so if you have the time, they are worth trying to see which works best for your project.*

- Andrew Akhlaghi. ["OCR and Machine Translation."](http://programminghistorian.org/en/lessons/OCR-and-Machine-Translation) *The Programming Historian.* (Note that this tutorial uses Tesseract but works with the bash scripting language instead of Python.)
- Ryan Baumann. ["Command-Line OCR with Tesseract on Mac OS X."](https://ryanfb.github.io/etc/2014/11/13/command_line_ocr_on_mac_os_x.html) */etc.*
- Shawn Graham. "Extracting Text from PDFs; Doing OCR; all within R." *Electric Archaeology.* (This blog post describes a method for OCR using the R programming language.)
- Moritz Mähr. ["Working with batches of PDF files."](https://programminghistorian.org/en/lessons/working-with-batches-of-pdf-files) *The Programming Historian.* (Note that this tutorial uses Tesseract and works in the command line without Python.)
- Rebecca Tarnopol. ["How to OCR Documents for Free in Google Drive."](https://business.tutsplus.com/tutorials/how-to-ocr-documents-for-free-in-google-drive--cms-20460) *TutsPlus.*

### Further Reading

*In case you are interested in scholars' and practitioners' debates around OCR.*

- Karen Coyle. ["Digital Urtext."](https://kcoyle.blogspot.com/2012/04/digital-urtext.html) *Coyle's InFormation.*
- *Humanities Commons* [search for "OCR"](https://hcommons.org/?s=ocr).
- Ray Smith, Daria Antonova, and Dar-Shyang Lee. ["Adapting the Tesseract open source OCR engine for multilingual OCR."](https://dl.acm.org/doi/10.1145/1577802.1577804) *MOCR '09: Proceedings of the International Workshop on Multilingual OCR.*
- Ted Underwood. ["The challenges of digital work on early-19c collections."](https://tedunderwood.com/2011/10/07/the-challenges-of-digital-work-on-early-19c-collections/) *The Stone and the Shell.*
- [TranScriptorium's handwritten text recognition project results.](https://cordis.europa.eu/project/id/600707/results)
- ["How to Transcribe Documents with Transkribus - Introduction."](https://readcoop.eu/transkribus/howto/how-to-transcribe-documents-with-transkribus-introduction/) *Read Coop.*

**>> Next module: [Structuring OCR'ed Text as Data](05-StructuringOCRData.ipynb) >>**

*This module is licensed under the [GNU General Public License v3.0](https://github.com/UNC-Libraries-data/OnTheBooks/blob/master/LICENSE). Individual images and data files associated with this module may be subject to a different license. If so, we indicate this in the module text.*