Natural Language Processing Internship Project at PSP Investments
- Project Goals
- Business values
- Approaches
- Solution
- Drawbacks
- Tesseract OCR – developed by Google
- Pytesseract
- Program overview
- Preprocessing input PDFs
- Program pipeline
- Named Entity Recognition (NER)
- Entity extraction: extracting useful information (such as people, dates, locations, organizations, etc.) from the raw data stored in PDFs (see the sketch after this list).
- Create semantic mappings between related entities.
- Replace the tedious manual work of reading documents
- Provide comprehensive summarization
- Feature clustering
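As a quick illustration of entity extraction, a minimal sketch might look like the following; spaCy and its small English model are assumptions here, not necessarily the library used in the project:

# Minimal NER sketch; spaCy and its en_core_web_sm model are assumptions for illustration.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("John Smith met with Acme Corp. in Montreal on June 3, 2019.")

# Print each detected entity with its label (PERSON, ORG, GPE, DATE, ...).
for ent in doc.ents:
    print(ent.text, ent.label_)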
Before working on the NLP analysis, we first need to implement a Python script that can read in PDF files.
Originally, I attempted to use various PDF parsing libraries, such as:
1. PyPDF2:
A pure-Python library built as a PDF toolkit, capable of extracting document information (title, author, …), splitting documents page by page, merging documents page by page, cropping pages, etc.
2. Tika-python:
A Python port of the Apache Tika library that makes Tika available using the Tika REST Server.
However, none of the modules listed above suit our use case, because most of our input PDF files are generated from scanned documents rather than created electronically.
A solution is to treat our input PDF files not as traditional (electronically generated) PDFs but as images (JPEG, PNG, etc.), and perform text detection on them.
Yet there are limitations to this method: the accuracy of the text extraction varies depending on the quality of the images. To improve the performance of our program, we may need to train a custom machine learning model (a model tailored to the format of PSP's documents).
- Tesseract is an optical character recognition engine for various operating systems.
- It is free software, released under the Apache License, Version 2.0, and its development has been sponsored by Google since 2006.
- In 2006, Tesseract was considered one of the most accurate open-source OCR engines then available.
- At the moment, I am using a library called pytesseract.
- Python-tesseract is an optical character recognition (OCR) tool for Python. That is, it will recognize and “read” the text embedded in images.
- Python-tesseract is a wrapper for Google's Tesseract-OCR Engine.
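As a minimal sketch (the image path is a placeholder), extracting text from a single page image with pytesseract looks roughly like this:

# Minimal pytesseract sketch; assumes the Tesseract-OCR engine is installed.
from PIL import Image
import pytesseract

# If the tesseract binary is not on PATH, point pytesseract at it explicitly, e.g.:
# pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

page = Image.open("pdfs/converted_pdf_images/sample/page_0.png")  # placeholder path
text = pytesseract.image_to_string(page)
print(text)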
In the beginning, without any preprocessing of the input files, my program was able to read in PDFs and extract their content with approximately 65-70% accuracy.
Note: for electronically generated PDFs, the accuracy is ~100%.
THAT’S NOT GOOD ENOUGH! 😡
To improve the accuracy of our program, we need to do more preprocessing on the images before we extract the text from them.
I will give a brief overview of how my program works below.
- Execute the Python file ocr.py
- Pass in the PDF filename under the flag --pdf
- In this case, I am passing in a PDF file called sample.pdf, which is located in the directory called pdfs (this is optional)
python .\ocr.py --pdf .\pdfs\sample.pdf
- The program automatically creates a folder, named after the input PDF filename, under the “pdfs/converted_pdf_images” directory.
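The exact argument handling inside ocr.py is not reproduced here; a rough sketch of how the --pdf flag and the output folder could be set up is:

# Illustrative sketch of the command-line interface; the real ocr.py may differ.
import argparse
import os

parser = argparse.ArgumentParser(description="OCR a scanned PDF file.")
parser.add_argument("--pdf", required=True, help="path to the input PDF file")
args = parser.parse_args()

# Name the output folder after the input PDF, under pdfs/converted_pdf_images/.
pdf_name = os.path.splitext(os.path.basename(args.pdf))[0]
output_dir = os.path.join("pdfs", "converted_pdf_images", pdf_name)
os.makedirs(output_dir, exist_ok=True)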
To display the directory structure for clarification, use the commands below:
cd \PATH-TO-ocr.py\pdfs\converted_pdf_images
tree pdfs
- The program then converts each page of the input PDF into two .png files (original and preprocessed) and a corresponding .txt file.
The conversion is done with a library called Wand, a ctypes-based simple ImageMagick binding for Python.
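A minimal Wand sketch of the page-by-page conversion (paths and the resolution are illustrative, and ImageMagick plus Ghostscript must be installed for PDF input):

# Convert each PDF page to a PNG with Wand; paths and resolution are illustrative.
from wand.image import Image as WandImage

with WandImage(filename="pdfs/sample.pdf", resolution=300) as pdf:
    for i, page in enumerate(pdf.sequence):
        with WandImage(image=page) as img:
            img.format = "png"
            img.save(filename=f"pdfs/converted_pdf_images/sample/page_{i}_original.png")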
- Morphological transformations are simple operations based on the image shape.
- They are normally performed on binary images. They need two inputs: our original image and a structuring element (or kernel), which decides the nature of the operation.
- Depending on the font size, we might want to adjust the size of the kernel.
- Two basic morphological operators are Erosion and Dilation.
- The basic idea of erosion is just like soil erosion: it erodes away the boundaries of the foreground object (always try to keep the foreground in white).
- The kernel slides through the image (as in 2D convolution). A pixel in the original image (either 1 or 0) will be considered 1 only if all the pixels under the kernel are 1; otherwise it is eroded (made zero).
- In short, all the pixels near the boundary are discarded, depending on the size of the kernel, so the thickness of the foreground object decreases; in other words, the white region in the image shrinks.
- It is useful for removing small white noise, detaching two connected objects, etc.
- Dilation is the opposite of erosion.
- Here, a pixel element is ‘1’ if at least one pixel under the kernel is ‘1’.
- It increases the white region in the image, i.e., the size of the foreground object grows. Normally, in cases like noise removal, erosion is followed by dilation.
- It is useful for joining broken parts of an object.
- Opening is just another name for erosion followed by dilation. It is useful for removing noise, as mentioned previously.
- Closing is the opposite of Opening: dilation followed by erosion.
- It is useful for closing small holes inside the foreground objects, or small black points on the object.
- In our case, Closing yielded a better result than Opening (see the sketch below).
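A short sketch of the four operations (OpenCV itself and the 3x3 kernel size are assumptions; as noted above, the kernel may need adjusting for the font size):

# Morphological operations; OpenCV and the kernel size are illustrative choices.
import cv2
import numpy as np

img = cv2.imread("page_0_original.png", cv2.IMREAD_GRAYSCALE)  # placeholder path
kernel = np.ones((3, 3), np.uint8)

eroded = cv2.erode(img, kernel, iterations=1)            # shrinks white regions; removes small white noise
dilated = cv2.dilate(img, kernel, iterations=1)          # grows white regions; joins broken parts
opened = cv2.morphologyEx(img, cv2.MORPH_OPEN, kernel)   # erosion followed by dilation
closed = cv2.morphologyEx(img, cv2.MORPH_CLOSE, kernel)  # dilation followed by erosion (best for our documents)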
Image blurring is usually achieved by convolving the image with a low-pass filter kernel, in order to smooth the image and remove noise.
- Averaging convolves the image with a normalized box filter: it simply takes the average of all the pixels under the kernel area and replaces the central element with that average.
- Gaussian blurring works in a similar fashion to Averaging, but it uses a Gaussian kernel instead of a normalized box filter for the convolution.
- Here, the dimensions of the kernel and the standard deviations in both directions can be specified independently.
- Gaussian blurring is very useful for removing Gaussian noise from the image.
- On the other hand, Gaussian blurring does not preserve the edges in the input (see the sketch below).
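A sketch of both filters (OpenCV and the 5x5 kernel size are assumptions):

# Averaging vs. Gaussian blurring; OpenCV and the kernel sizes are illustrative choices.
import cv2

img = cv2.imread("page_0_original.png", cv2.IMREAD_GRAYSCALE)  # placeholder path

averaged = cv2.blur(img, (5, 5))             # normalized box filter
gaussian = cv2.GaussianBlur(img, (5, 5), 0)  # sigmaX=0 lets OpenCV derive sigma from the kernel size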
For a computer, all inputs eventually boil down to 1’s and 0’s. Thus, converting images to black and white immensely helps Tesseract recognize characters. However, this might fail if the input documents lack contrast or have a slightly darker background.
- This method works particularly well with bimodal images, i.e., images whose histogram has two peaks. In that case, we would want to pick a threshold value between these peaks.
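Picking the threshold between the two histogram peaks automatically is what Otsu's binarization does; a sketch (OpenCV is an assumption here):

# Black-and-white conversion with Otsu's threshold; OpenCV is an illustrative choice.
import cv2

img = cv2.imread("page_0_original.png", cv2.IMREAD_GRAYSCALE)  # placeholder path
blurred = cv2.GaussianBlur(img, (5, 5), 0)                     # light blur stabilizes the histogram
_, binary = cv2.threshold(blurred, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)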
(Figure: side-by-side comparison of the original image and the preprocessed image.)
In progress