
PDF Scanner Tesseract OCR

Natural Language Processing Internship Project at PSP Investments


Project Goals:

  • Entity extraction: extracting useful information (such as people, dates, locations, organizations, etc.) from raw data stored in PDFs.
  • Create semantic mappings between related entities.

Business values:

  • Replace the tedious manual work of reading documents
  • Provide comprehensive summarization
  • Enable feature clustering

Approaches:

Before working on the NLP analysis, we first need to implement a Python script that can read in PDF files.

Originally, I attempted to use various PDF parsing libraries, such as:

1. PyPDF2:

A pure-python library that is built as a PDF toolkit, capable of extracting document information (title, author, …), splitting documents page by page, merging documents page by page, cropping pages, etc.

2. Tika-Python:

A Python port of the Apache Tika library that makes Tika available using the Tika REST Server.

However, none of the modules listed above suit our use case, because most of our input PDF files are generated from scanned documents rather than created electronically.


Solution:

Instead of treating our input PDF files as traditional (electronically generated) PDFs, we can treat them as images (JPEG, PNG, etc.) and perform text detection on them.

[Image: ee_sample]

Drawbacks:

Yet there are some limitations to this method: the accuracy of the text extraction varies depending on the quality of the images. To improve the performance of our program, we may need to train a custom machine learning model (one tailored specifically to PSP’s documents).

Tesseract OCR

[Image: google_tesseract]

  • Tesseract is an optical character recognition engine for various operating systems.

  • It is free software, released under the Apache License, Version 2.0, and development has been sponsored by Google since 2006.

  • In 2006, Tesseract was considered one of the most accurate open-source OCR engines then available.

Pytesseract

[Image: pytesseract]

  • At the moment, I am using a library called pytesseract.

  • Python-tesseract is an optical character recognition (OCR) tool for Python. That is, it will recognize and “read” the text embedded in images.

  • Python-tesseract is a wrapper for Google's Tesseract-OCR Engine.
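
As a rough illustration (not the repository’s exact code), a pytesseract call on a converted page image might look like this; the file path and language option are assumptions:

from PIL import Image
import pytesseract

# Load a page that has already been converted from PDF to an image
page = Image.open("pdfs/converted_pdf_images/sample/sample-0.png")

# image_to_string runs the Tesseract engine and returns the recognized text
text = pytesseract.image_to_string(page, lang="eng")
print(text)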

Program overview:

In the beginning, without any preprocessing of the input files, my program is able to read in PDFs and extract their content with approximately 65-70% accuracy.

Note: for electronically generated PDFs, the accuracy is ~100%.

THAT’S NOT GOOD ENOUGH! 😡

[Image: trash_data]


Preprocessing input PDFs:

To improve the accuracy of our program, we need to apply more preprocessing to the images before we extract the text from them.

I will give a brief overview of how my program works below.

Program pipeline:

  1. Execute the Python file ocr.py
  2. Pass in the PDF filename under the flag --pdf
  3. In this case, I am passing in a PDF file called sample.pdf, which is located in the directory called pdfs (this is optional)
python .\ocr.py --pdf .\pdfs\sample.pdf
  4. The program will automatically create a folder, named after the input PDF file, under the “pdfs/converted_pdf_images” directory.

To display the directory structure for clarification, use the commands below:

cd \PATH-TO-ocr.py\pdfs\converted_pdf_images
tree pdfs

[Image: ee_sample]

  5. The program will then convert each page of the input PDF into two .png files (original and preprocessed) and a corresponding .txt file.

[Image: dir_sample]

PDF -> Image (.png, .jpeg, etc.)

The conversion is done with a library called Wand, a ctypes-based, simple ImageMagick binding for Python.
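
A minimal sketch of such a conversion with Wand is shown below; the file names, resolution, and output directory are illustrative assumptions, and reading PDFs also requires ImageMagick with Ghostscript installed:

from wand.image import Image

# Rasterize each page of the PDF at 300 DPI and save it as a separate PNG
with Image(filename="pdfs/sample.pdf", resolution=300) as pdf:
    for i, page in enumerate(pdf.sequence):
        with Image(image=page) as img:
            img.format = "png"
            img.save(filename=f"pdfs/converted_pdf_images/sample/sample-{i}.png")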


Noise removal

Preprocessing: Morphological Transformations

  • Simple operations based on the image shape.

  • It is normally performed on binary images. It needs two inputs: our original image and a structuring element or kernel, which decides the nature of the operation.

  • Depending on the font size, we might want to adjust the size of the kernel.

  • Two basic morphological operators are Erosion and Dilation.


Erosion:

  • The basic idea of erosion is just like soil erosion: it erodes away the boundaries of the foreground object (always try to keep the foreground in white).

  • The kernel slides through the image (as in 2D convolution). A pixel in the original image (either 1 or 0) will be considered 1 only if all the pixels under the kernel are 1; otherwise it is eroded (made zero).

  • In short, all the pixels near the boundary will be discarded, depending on the size of the kernel. So the thickness or size of the foreground object decreases, or simply the white region in the image shrinks.

  • It is useful for removing small white noise, detaching two connected objects, etc.

[Image: transform-erosion]


Dilation:

  • Opposite of erosion

  • Here, a pixel element is ‘1’ if at least one pixel under the kernel is ‘1’.

  • It increases the white region in the image, i.e., the size of the foreground object grows. Normally, in cases like noise removal, erosion is followed by dilation.

  • It is useful for joining broken parts of an object (see the OpenCV sketch of both operators below).

[Image: transform-dilation]
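
A minimal OpenCV sketch of both operators, assuming a grayscale page image; the 5x5 kernel size and iteration counts are illustrative, not the project’s exact settings:

import cv2
import numpy as np

# Load a converted page as a grayscale image
img = cv2.imread("pdfs/converted_pdf_images/sample/sample-0.png", cv2.IMREAD_GRAYSCALE)

kernel = np.ones((5, 5), np.uint8)               # structuring element
eroded = cv2.erode(img, kernel, iterations=1)    # shrinks white (foreground) regions
dilated = cv2.dilate(img, kernel, iterations=1)  # grows white (foreground) regions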


Opening:

  • Opening is just another name for erosion followed by dilation. It is useful for removing noise, as mentioned previously.

[Image: opening]


Closing:

  • Closing is the opposite of opening: dilation followed by erosion.

  • It is useful for closing small holes inside the foreground objects, or small black points on the object.

  • In our case, closing yielded a better result than opening (see the sketch below).

[Image: opening]
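
A sketch of both compound operations using cv2.morphologyEx; the kernel size is an assumption:

import cv2
import numpy as np

img = cv2.imread("pdfs/converted_pdf_images/sample/sample-0.png", cv2.IMREAD_GRAYSCALE)
kernel = np.ones((3, 3), np.uint8)

opened = cv2.morphologyEx(img, cv2.MORPH_OPEN, kernel)   # erosion followed by dilation
closed = cv2.morphologyEx(img, cv2.MORPH_CLOSE, kernel)  # dilation followed by erosion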


Edge-preserving smoothing

Image blurring is usually achieved by convolving the image with a low-pass filter kernel, in order to smooth the image and remove noise.

Averaging:

  • This convolves the image with a normalized box filter: it simply takes the average of all the pixels under the kernel area and replaces the central element with that average.

Gaussian blurring:

  • This works in a similar fashion to averaging, but it uses a Gaussian kernel, instead of a normalized box filter, for the convolution.

  • Here, the dimensions of the kernel and the standard deviations in both directions can be specified independently.

  • Gaussian blurring is very useful for removing Gaussian noise from the image.

  • However, Gaussian blurring does not preserve the edges in the input (a short sketch comparing both filters follows below).

[Image: G-Blur]
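
A short comparison of the two filters in OpenCV; the 5x5 kernel size is illustrative:

import cv2

img = cv2.imread("pdfs/converted_pdf_images/sample/sample-0.png", cv2.IMREAD_GRAYSCALE)

averaged = cv2.blur(img, (5, 5))             # normalized box filter (simple averaging)
gaussian = cv2.GaussianBlur(img, (5, 5), 0)  # Gaussian kernel; sigma derived from the kernel size when 0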


Binarization

For a computer, all inputs eventually boil down to 1’s and 0’s. Thus, converting images to black and white immensely helps Tesseract recognize characters. However, this might fail if the input documents lack contrast or have a slightly darker background.

Otsu’s Threshold:

  • This method works particularly well with bimodal images, i.e., images whose histogram has two peaks. In that case, we can pick a threshold value between these peaks (see the sketch below).

[Image: otsu]
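
A sketch of Otsu’s thresholding in OpenCV; the threshold value is picked automatically from the histogram, so the 0 passed below is ignored, and the preceding blur is an assumption that often sharpens the two histogram peaks:

import cv2

img = cv2.imread("pdfs/converted_pdf_images/sample/sample-0.png", cv2.IMREAD_GRAYSCALE)

# A light Gaussian blur before Otsu usually improves the separation of the two peaks
blurred = cv2.GaussianBlur(img, (5, 5), 0)
ret, binary = cv2.threshold(blurred, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)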


Output (sample.pdf)

[Image: output]
Original image (left), preprocessed image (right)

Named Entity Recognition (NER)

In progress
