Natural Language Processing Internship Project at PSP Investments
- Project Goals
- Business values
- Approaches
- Solution
- Drawbacks
- Tesseract OCR – developed by Google
- Pytesseract
- Program overview
- Preprocessing input PDFs
- Program pipeline
- Named Entity Recognition (NER)
- Entity extraction: extracting useful information (such as people, dates, locations, organizations, etc.) from the raw data stored in PDFs (see the sketch after this list).
- Create semantic mappings between related entities.
- Replace the tedious manual work of reading documents
- Provide comprehensive summarization
- Feature clustering
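As a quick illustration of entity extraction, a minimal sketch might look like the following; spaCy and its small English model are assumptions here, not necessarily the library used in the project:

# Minimal NER sketch; spaCy and its en_core_web_sm model are assumptions for illustration.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("John Smith met with Acme Corp. in Montreal on June 3, 2019.")

# Print each detected entity with its label (PERSON, ORG, GPE, DATE, ...).
for ent in doc.ents:
    print(ent.text, ent.label_)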
Before working on the NLP analysis, we first need to implement a Python script that can read in PDF files.
Originally, I attempted to use various PDF parsing libraries, such as:
1. PyPDF2:
A pure-Python library built as a PDF toolkit, capable of extracting document information (title, author, …), splitting documents page by page, merging documents page by page, cropping pages, etc.
2. Tika-python:
A Python port of the Apache Tika library that makes Tika available using the Tika REST Server.
However, none of the modules listed above suit our use case, because most of our input PDF files are generated from scanned documents rather than created electronically.
A solution is to treat our input PDF files not as traditional (electronically generated) PDFs but as images (JPEG, PNG, etc.), and perform text detection on them.
Yet there are limitations to this method: the accuracy of the text extraction varies depending on the quality of the images. To improve the performance of our program, we may need to train a custom machine learning model (a model tailored to the format of PSP's documents).
- Tesseract is an optical character recognition engine for various operating systems.
- It is free software, released under the Apache License, Version 2.0, and its development has been sponsored by Google since 2006.
- In 2006, Tesseract was considered one of the most accurate open-source OCR engines then available.
- At the moment, I am using a library called pytesseract.
- Python-tesseract is an optical character recognition (OCR) tool for Python. That is, it will recognize and “read” the text embedded in images.
- Python-tesseract is a wrapper for Google's Tesseract-OCR Engine.
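As a minimal sketch (the image path is a placeholder), extracting text from a single page image with pytesseract looks roughly like this:

# Minimal pytesseract sketch; assumes the Tesseract-OCR engine is installed.
from PIL import Image
import pytesseract

# If the tesseract binary is not on PATH, point pytesseract at it explicitly, e.g.:
# pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

page = Image.open("pdfs/converted_pdf_images/sample/page_0.png")  # placeholder path
text = pytesseract.image_to_string(page)
print(text)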
In the beginning, without any preprocessing of the input files, my program was able to read in PDFs and extract their content with approximately 65-70% accuracy.
Note: for electronically generated PDFs, the accuracy is ~100%.
THAT’S NOT GOOD ENOUGH! 😡
To improve the accuracy of our program, we need to do more preprocessing on the images before we extract the text from them.
I will give a brief overview of how my program works below.
- Execute the Python file ocr.py
- Pass in the PDF filename under the flag --pdf
- In this case, I am passing in a PDF file called sample.pdf, which is located in the directory called pdfs (this is optional)
python .\ocr.py --pdf .\pdfs\sample.pdf
- The program automatically creates a folder, named after the input PDF filename, under the “pdfs/converted_pdf_images” directory.
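The exact argument handling inside ocr.py is not reproduced here; a rough sketch of how the --pdf flag and the output folder could be set up is:

# Illustrative sketch of the command-line interface; the real ocr.py may differ.
import argparse
import os

parser = argparse.ArgumentParser(description="OCR a scanned PDF file.")
parser.add_argument("--pdf", required=True, help="path to the input PDF file")
args = parser.parse_args()

# Name the output folder after the input PDF, under pdfs/converted_pdf_images/.
pdf_name = os.path.splitext(os.path.basename(args.pdf))[0]
output_dir = os.path.join("pdfs", "converted_pdf_images", pdf_name)
os.makedirs(output_dir, exist_ok=True)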
To display the directory structure for clarification, use the commands below:
cd \PATH-TO-ocr.py\pdfs\converted_pdf_images
tree pdfs
- The program then converts each page of the input PDF into two .png files (original and preprocessed) and a corresponding .txt file.
The conversion is done with a library called Wand, a ctypes-based simple ImageMagick binding for Python.
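A minimal Wand sketch of the page-by-page conversion (paths and the resolution are illustrative, and ImageMagick plus Ghostscript must be installed for PDF input):

# Convert each PDF page to a PNG with Wand; paths and resolution are illustrative.
from wand.image import Image as WandImage

with WandImage(filename="pdfs/sample.pdf", resolution=300) as pdf:
    for i, page in enumerate(pdf.sequence):
        with WandImage(image=page) as img:
            img.format = "png"
            img.save(filename=f"pdfs/converted_pdf_images/sample/page_{i}_original.png")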
- Morphological transformations are simple operations based on the image shape.
- They are normally performed on binary images. They need two inputs: our original image and a structuring element (or kernel), which decides the nature of the operation.
- Depending on the font size, we might want to adjust the size of the kernel.
- Two basic morphological operators are Erosion and Dilation.
- The basic idea of erosion is just like soil erosion: it erodes away the boundaries of the foreground object (always try to keep the foreground in white).
- The kernel slides through the image (as in 2D convolution). A pixel in the original image (either 1 or 0) will be considered 1 only if all the pixels under the kernel are 1; otherwise it is eroded (made zero).
- In short, all the pixels near the boundary are discarded, depending on the size of the kernel, so the thickness of the foreground object decreases; in other words, the white region in the image shrinks.
- It is useful for removing small white noise, detaching two connected objects, etc.
- Dilation is the opposite of erosion.
- Here, a pixel element is ‘1’ if at least one pixel under the kernel is ‘1’.
- It increases the white region in the image, i.e., the size of the foreground object grows. Normally, in cases like noise removal, erosion is followed by dilation.
- It is useful for joining broken parts of an object.
- Opening is just another name for erosion followed by dilation. It is useful for removing noise, as mentioned previously.
- Closing is the opposite of Opening: dilation followed by erosion.
- It is useful for closing small holes inside the foreground objects, or small black points on the object.
- In our case, Closing yielded a better result than Opening (see the sketch below).
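A short sketch of the four operations (OpenCV itself and the 3x3 kernel size are assumptions; as noted above, the kernel may need adjusting for the font size):

# Morphological operations; OpenCV and the kernel size are illustrative choices.
import cv2
import numpy as np

img = cv2.imread("page_0_original.png", cv2.IMREAD_GRAYSCALE)  # placeholder path
kernel = np.ones((3, 3), np.uint8)

eroded = cv2.erode(img, kernel, iterations=1)            # shrinks white regions; removes small white noise
dilated = cv2.dilate(img, kernel, iterations=1)          # grows white regions; joins broken parts
opened = cv2.morphologyEx(img, cv2.MORPH_OPEN, kernel)   # erosion followed by dilation
closed = cv2.morphologyEx(img, cv2.MORPH_CLOSE, kernel)  # dilation followed by erosion (best for our documents)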
Image blurring is usually achieved by convolving the image with a low-pass filter kernel, in order to smooth the image and remove noise.
- Averaging convolves the image with a normalized box filter: it simply takes the average of all the pixels under the kernel area and replaces the central element with that average.
- Gaussian blurring works in a similar fashion to Averaging, but it uses a Gaussian kernel instead of a normalized box filter for the convolution.
- Here, the dimensions of the kernel and the standard deviations in both directions can be specified independently.
- Gaussian blurring is very useful for removing Gaussian noise from the image.
- On the other hand, Gaussian blurring does not preserve the edges in the input (see the sketch below).
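A sketch of both filters (OpenCV and the 5x5 kernel size are assumptions):

# Averaging vs. Gaussian blurring; OpenCV and the kernel sizes are illustrative choices.
import cv2

img = cv2.imread("page_0_original.png", cv2.IMREAD_GRAYSCALE)  # placeholder path

averaged = cv2.blur(img, (5, 5))             # normalized box filter
gaussian = cv2.GaussianBlur(img, (5, 5), 0)  # sigmaX=0 lets OpenCV derive sigma from the kernel size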
For a computer, all inputs eventually boil down to 1’s and 0’s. Thus, converting images to black and white immensely helps Tesseract recognize characters. However, this might fail if the input documents lack contrast or have a slightly darker background.
- This method works particularly well with bimodal images, i.e., images whose histogram has two peaks. In that case, we would want to pick a threshold value between these peaks.
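Picking the threshold between the two histogram peaks automatically is what Otsu's binarization does; a sketch (OpenCV is an assumption here):

# Black-and-white conversion with Otsu's threshold; OpenCV is an illustrative choice.
import cv2

img = cv2.imread("page_0_original.png", cv2.IMREAD_GRAYSCALE)  # placeholder path
blurred = cv2.GaussianBlur(img, (5, 5), 0)                     # light blur stabilizes the histogram
_, binary = cv2.threshold(blurred, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)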
(Figure: side-by-side comparison of the original image and the preprocessed image.)
In progress