# Searchable PDF: Use Tabula

Refer to https://medium.com/analytics-vidhya/how-to-extract-multiple-tables-from-a-pdf-through-python-and-tabula-py-6f642a9ee673

+ !pip uninstall tabula  # a different package that shares the module name with tabula-py
+ !pip3 install tabula-py  # older versions have issues with read_pdf
+ Install Java 8+ and add it to PATH

In [None]:
from tabula.io import read_pdf
import pandas as pd
from datetime import datetime
from pdf_utils import *

## Read PDF

In [None]:
# Define area to extract data
box = [3, 1, 27, 30]  # unit: cm [top, left, bottom, right]

# Convert cm to PDF points: 1 pt = 1/72 inch, 1 inch = 2.54 cm
fc = 1/2.54*72
box = [round(i*fc, 2) for i in box]
print(box)

# Read the PDF (tabula-py requires Java to be installed)
# area: region to analyze, given as (top, left, bottom, right) in points
# Use guess=False to ensure it captures all content in the page
df = read_pdf('citi_test1.pdf', pages=[2, 3], area=[box],
              output_format='json', stream=True, lattice=False, guess=False)
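
The unit conversion in the cell above can be wrapped in two small helpers, so a box measured in centimetres can be converted to points and checked back again. This is only a minimal sketch: the names `cm_to_pt` and `pt_to_cm` are hypothetical and not part of tabula-py.

```python
PT_PER_INCH = 72      # 1 pt = 1/72 inch
CM_PER_INCH = 2.54    # 1 inch = 2.54 cm


def cm_to_pt(value_cm, ndigits=2):
    """Convert a length in centimetres to PDF points (hypothetical helper)."""
    return round(value_cm / CM_PER_INCH * PT_PER_INCH, ndigits)


def pt_to_cm(value_pt, ndigits=2):
    """Convert a length in PDF points back to centimetres (hypothetical helper)."""
    return round(value_pt * CM_PER_INCH / PT_PER_INCH, ndigits)


# Same box as in the cell above, in cm: [top, left, bottom, right]
box_cm = [3, 1, 27, 30]
box_pt = [cm_to_pt(v) for v in box_cm]
print(box_pt)                            # [85.04, 28.35, 765.35, 850.39]
print([pt_to_cm(v) for v in box_pt])     # [3.0, 1.0, 27.0, 30.0]
```

The reverse helper can be handy for reading coordinates reported in points back on a ruler scale when tuning the extraction area.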

In [None]:
convert_df('citi_test1.pdf', page_range=[2,3])

## Exploration

In [None]:
for data in df[0]['data']:
    print('------------------------------')
    for item in data:
        print(item['left'], round(item['left']+item['width'],1), item['text'])

## Parse Data

df[page]['data'][row][column]['text']
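
As an illustration of that access pattern, the sketch below walks a small nested structure shaped like the JSON output explored above (a list of pages, each with a `data` list of rows, each row a list of cells). The `sample` data and the `rows_of_text` helper are hypothetical, made up for this example; they are not output from the actual PDFs and not part of tabula-py or `pdf_utils`.

```python
# Hypothetical sample mimicking the layout: pages -> 'data' -> rows -> cells
sample = [
    {
        'data': [
            [{'left': 28.3, 'width': 60.0, 'text': 'Date'},
             {'left': 90.0, 'width': 200.0, 'text': 'Description'},
             {'left': 400.0, 'width': 80.0, 'text': 'Amount'}],
            [{'left': 28.3, 'width': 60.0, 'text': '01 Jan'},
             {'left': 90.0, 'width': 200.0, 'text': 'Coffee'},
             {'left': 400.0, 'width': 80.0, 'text': '4.50'}],
        ]
    }
]


def rows_of_text(pages):
    """Flatten every page into a list of rows, keeping only each cell's text."""
    rows = []
    for page in pages:
        for row in page['data']:
            rows.append([cell['text'] for cell in row])
    return rows


rows = rows_of_text(sample)
print(rows[0])                          # header row of the sample
print(sample[0]['data'][1][2]['text'])  # page 0, row 1, column 2
```

A list of text rows like this can then be passed to `pd.DataFrame` for further cleaning.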

### DBS

In [None]:
tbl = convert_df('test2.pdf', [4, 5, 6, 7, 8])
tbl = post_process(tbl, bank='DBS')
tbl

### Citi

In [None]:
tbl = convert_df('citi_test2.pdf', page_range=[2, 3], bank='Citi')
tbl = post_process(tbl, bank='Citi')
tbl

# Searchable PDF: Use pdfminer

Good for extracting plain text, but not for tabular information.

In [None]:
from pdfminer.high_level import extract_text

# Open the file in a context manager so it is closed after extraction
with open('test1.pdf', 'rb') as pdf_file:
    text = extract_text(pdf_file, password='', page_numbers=None, maxpages=0,
                        caching=True, codec='utf-8', laparams=None)
print(text)

# Searchable PDF: Use PyPDF2

Same as pdfminer: good for text extraction, but not for tabular information.

In [None]:
# PyPDF2 >= 3.0 removed PdfFileReader, numPages, getPage and extractText;
# PdfReader, .pages and extract_text() are their replacements
from PyPDF2 import PdfReader

# Open the PDF file and create a reader object
with open('test1.pdf', 'rb') as pdf_object:
    pdf_reader = PdfReader(pdf_object)

    # Extract and concatenate each page's content
    text = ''
    for page_object in pdf_reader.pages:
        # extracting text from page
        text += page_object.extract_text()
print(text)

# Unsearchable PDF: Pytesseract

Tesseract OCR engine: https://github.com/UB-Mannheim/tesseract/wiki
+ Refer to installation manual: https://towardsdatascience.com/read-a-multi-column-pdf-with-pytesseract-in-python-1d99015f887a

Importing fitz may fail with an error such as "No such module":
+ Uninstall and then reinstall PyMuPDF to solve the issue

Refer to https://nanonets.com/blog/ocr-with-tesseract/#introduction

OCR stands for "Optical Character Recognition", which transforms a two-dimensional image of text (printed or handwritten) into machine-readable text. OCR generally consists of several subprocesses:
+ Preprocessing of the Image
+ Text Localization
+ Character Segmentation
+ Character Recognition
+ Post Processing

There are many OCR software packages, but only a few are free. Here is a brief summary of some of them:
+ Tesseract: an open-source OCR engine popular among OCR developers. It was originally developed at HP between 1985 and 1994, and HP released it as open-source software in 2005. From 2006 until 2018 it was developed by Google.
+ OCRopus: an open-source OCR system allowing easy evaluation and reuse of the OCR components by both researchers and companies.
+ Ocular: it works best on documents printed using a hand press, including those written in multiple languages.
+ SwiftOCR: a fast and simple OCR library that uses neural networks for image recognition. Its authors claim that the engine outperforms the well-known Tesseract library.

Tesseract OCR process:
+ <b>original image $\rightarrow$ adaptive binarization $\rightarrow$ binary image</b>
+ <b>binary image $\rightarrow$ component analysis $\rightarrow$ contour detection $\rightarrow$ detection of words, lines, and paragraphs</b>
+ <b>detection $\rightarrow$ word organization $\rightarrow$ two-pass recognition $\rightarrow$ editable document</b>

!pip install pytesseract
!pip install opencv-contrib-python
!pip install PyMuPDF  # provides the fitz module; installing the "fitz" package directly raises ModuleNotFoundError for "frontend" on import

In [None]:
# for manipulating the PDF
import fitz

# for OCR using PyTesseract
import cv2                              # pre-processing images
import pytesseract                      # extracting text from images
import numpy as np
import matplotlib.pyplot as plt         # displaying output images

from PIL import Image
import os

In [None]:
pdf_to_img("test_text.pdf", img_folder='img')

## Binarize image

In [None]:
original_image, threshold_image = to_binary_img('img/test_textpage-1.png')

In [None]:
# Convert the image to grayscale
gray_image = cv2.cvtColor(original_image, cv2.COLOR_BGR2GRAY) # convert image from one color space to another
plt.figure(figsize=(20, 15))
plt.imshow(gray_image, cmap='gray')
plt.show()

In [None]:
# Convert grayscale to black and white (Otsu threshold, inverted so dark text becomes white)
ret, threshold_image = cv2.threshold(gray_image, 0, 255, cv2.THRESH_OTSU | cv2.THRESH_BINARY_INV)
plt.figure(figsize=(20, 15))
plt.imshow(threshold_image, cmap='gray')
plt.show()
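
To give a feel for what `cv2.THRESH_OTSU` does in the cell above, here is a pure-Python sketch of Otsu's method: it tries every threshold from 0 to 255 and keeps the one that maximizes the between-class variance of the two pixel groups. This is an illustrative re-implementation under simplifying assumptions (a flat list of 8-bit pixel values, first best threshold wins on ties), not OpenCV's actual code, and the function names are hypothetical.

```python
def otsu_threshold(pixels):
    """Return the threshold t (0..255) maximizing between-class variance.

    Pixels <= t form one class and pixels > t the other.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(value * count for value, count in enumerate(hist))

    weight_low = 0      # number of pixels <= t
    sum_low = 0         # sum of intensities of pixels <= t
    best_t, best_var = 0, -1.0
    for t in range(256):
        weight_low += hist[t]
        sum_low += t * hist[t]
        if weight_low == 0:
            continue            # no pixels in the low class yet
        weight_high = total - weight_low
        if weight_high == 0:
            break               # all pixels are in the low class
        mean_low = sum_low / weight_low
        mean_high = (sum_all - sum_low) / weight_high
        between_var = weight_low * weight_high * (mean_low - mean_high) ** 2
        if between_var > best_var:
            best_var, best_t = between_var, t
    return best_t


def binarize_inverted(pixels, t):
    """Mimic THRESH_BINARY_INV: values above t become 0, the rest 255."""
    return [0 if p > t else 255 for p in pixels]


# Tiny made-up "image": three dark (text) pixels and three bright (paper) pixels
pixels = [10, 12, 11, 200, 205, 198]
t = otsu_threshold(pixels)
binary = binarize_inverted(pixels, t)
print(t, binary)
```

With the inverted mode, dark text pixels end up white (255) on a black background, which is the form expected by the contour-based text localization that follows.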

## Localize text

In [None]:
masked = OCR_text(original_image, threshold_image, text_file='img_text.txt', local_area=(66,66))

In [None]:
plt.figure(figsize=(20, 15))
plt.imshow(masked)
plt.show()