Here I try two OCR packages: Tesseract (via `pytesseract`) and docTR.


In [1]:
!python --version

Python 3.9.18


#### Set up the PaLM API to help with code documentation

Set the MakerSuite API key. The key is read from the `PALM_API_KEY` environment variable; the `get_api_key` helper import is commented out.

In [2]:
import os
#from utils import get_api_key
import google.generativeai as palm
from google.api_core import client_options as client_options_lib

palm.configure(
    api_key=os.getenv('PALM_API_KEY'),
    transport="rest",
    client_options=client_options_lib.ClientOptions(
        api_endpoint=os.getenv("GOOGLE_API_BASE"),
    )
)

Pick the model that generates text

In [None]:
models = [m for m in palm.list_models() if 'generateText' in m.supported_generation_methods]
model_bison = models[0]
model_bison

Helper function to call the PaLM API

In [4]:
from google.api_core import retry
@retry.Retry()
def generate_text(prompt, 
                  model=model_bison, 
                  temperature=0.0):
    return palm.generate_text(prompt=prompt,
                              model=model,
                              temperature=temperature)

#### Run the OCR functions with Tesseract

**Technical Documentation**

This code uses the Python libraries `cv2` (OpenCV) and `pytesseract` to read an image and extract its text. The image is first read into memory with `cv2.imread()`, and `pytesseract.image_to_string()` then converts it to text. The `custom_config` string is passed through the `config` argument as extra command-line options for Tesseract. Here, `--oem 3` selects the default OCR engine mode, in which Tesseract picks from whichever engines are available (typically the LSTM engine in Tesseract 4 and later), and `--psm 6` sets the page segmentation mode to treat the image as a single uniform block of text.

**Output**

The output of the code is the text that was extracted from the image.

In [None]:
# Import the necessary libraries
import cv2
import pytesseract

# Read the image
img = cv2.imread('test_graham_g.png')

# Adding custom options
custom_config = r'--oem 3 --psm 6'

# Convert the image to text
text = pytesseract.image_to_string(img, config=custom_config)

# Print the text
print(text)
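
The two options in `custom_config` can be varied independently. The sketch below builds the config string from an engine mode and a page segmentation mode; `build_config` is a hypothetical helper, not part of `pytesseract`, and the mode descriptions are paraphrased from Tesseract's `--help-oem` and `--help-psm` listings (only a subset of the segmentation modes is shown).

```python
# Engine modes (--oem), paraphrased from `tesseract --help-oem`
OEM_MODES = {
    0: "legacy engine only",
    1: "neural-net LSTM engine only",
    2: "legacy + LSTM engines",
    3: "default, based on what is available",
}

# A subset of the page segmentation modes (--psm accepts 0 to 13)
PSM_MODES = {
    3: "fully automatic page segmentation, no OSD (Tesseract's default)",
    4: "single column of text of variable sizes",
    6: "single uniform block of text",
    11: "sparse text, in no particular order",
}


def build_config(oem=3, psm=6):
    """Return a Tesseract option string such as '--oem 3 --psm 6'."""
    if oem not in OEM_MODES:
        raise ValueError(f"unknown engine mode: {oem}")
    if psm not in range(14):
        raise ValueError(f"unknown page segmentation mode: {psm}")
    return f"--oem {oem} --psm {psm}"


print(build_config())        # --oem 3 --psm 6
print(build_config(psm=11))  # --oem 3 --psm 11
```

The resulting string can be passed as `config=` to `pytesseract.image_to_string()`. Which mode works best depends on the page layout, so it may be worth trying a few.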

Note: If your PDF consists mostly of clear, high-resolution images of text, direct processing with Tesseract might work well. If you are dealing with complex layouts, vectorized text, or poor-quality scans, you might get better results by first converting the PDF pages to high-quality images with a tool like PyMuPDF (`fitz`) and then applying OCR to those images. Source: GPT-4.

A more complex example follows. It iterates through the pages of a scanned PDF file and extracts the text of each page. Converting PDF pages to images requires an additional library; PyMuPDF (also known as `fitz`) is commonly used for this purpose.

In [None]:
import fitz  # PyMuPDF
import numpy as np

# Path to your PDF file
pdf_file = 'Section 165 application.pdf'

# Open the PDF file
doc = fitz.open(pdf_file)

# Initialize an empty string to accumulate text
full_text = ""

# Frequency of progress updates
n = 10

# Total number of pages
total_pages = len(doc)

# Iterate through each page
for page_num in range(total_pages):
    # Print progress every n pages
    if (page_num + 1) % n == 0:
        print(f"Processing page {page_num + 1} of {total_pages}...")

    # Get the page
    page = doc.load_page(page_num)

    # Convert the page to an image (pix) object
    pix = page.get_pixmap()

    # Store the image in a format that OpenCV can read (in memory, without saving to disk)
    img = cv2.imdecode(np.frombuffer(pix.tobytes(), np.uint8), 1)

    # Adding custom options for PyTesseract
    custom_config = r'--oem 3 --psm 6'
    
    # Extract text using PyTesseract from the current page image
    text = pytesseract.image_to_string(img, config=custom_config)
    
    # Add the extracted text to the full text
    full_text += f"Text from page {page_num + 1}:\n{text}\n\n"

# Close the document
doc.close()

# Save the full text to a file
with open('section_165_ocr_output.txt', 'w') as file:
    file.write(full_text)
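
One thing to be aware of in the loop above: `page.get_pixmap()` with no arguments renders at 72 DPI, which is low for OCR. The sketch below shows one way to render at a higher resolution by scaling with a `fitz.Matrix`. The helper names `zoom_for_dpi` and `render_page` are hypothetical, and 300 DPI is an assumed target, not a value tested in this notebook.

```python
PDF_POINTS_PER_INCH = 72  # PDF user space: 1 point = 1/72 inch


def zoom_for_dpi(dpi):
    """Scale factor that renders a PDF page at the requested DPI."""
    if dpi <= 0:
        raise ValueError("dpi must be positive")
    return dpi / PDF_POINTS_PER_INCH


def render_page(page, dpi=300):
    """Render a PyMuPDF page at `dpi` (assumes PyMuPDF is installed)."""
    import fitz  # imported lazily so zoom_for_dpi() works without PyMuPDF

    zoom = zoom_for_dpi(dpi)
    # The matrix scales both axes, so the pixmap has zoom x zoom as many pixels
    return page.get_pixmap(matrix=fitz.Matrix(zoom, zoom))


print(zoom_for_dpi(300))  # about 4.17
```

In the loop, `pix = render_page(page)` would replace `pix = page.get_pixmap()`. Larger images take longer to OCR, so the DPI is a trade-off between speed and legibility.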


Define image preprocessing functions

The following code is a Python implementation of some basic image processing techniques.

**get_grayscale**

This function converts a BGR image (OpenCV's default channel order) to grayscale.

**remove_noise**

This function removes noise from an image using a median blur filter.

**thresholding**

This function converts a grayscale image to a binary image, using Otsu's method to choose the threshold automatically.

**dilate**

This function dilates an image, growing its bright (foreground) regions.

**erode**

This function erodes an image, shrinking its bright (foreground) regions.

**opening**

This function applies erosion followed by dilation to an image.

**canny**

This function applies the Canny edge detector to an image.

**deskew**

This function corrects the skew of an image.

In [None]:
# get grayscale image
def get_grayscale(image):
    return cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# noise removal
def remove_noise(image):
    return cv2.medianBlur(image,5)
 
#thresholding
def thresholding(image, threshold_value=128):
    # Expects a single-channel (grayscale) 8-bit image.
    # With cv2.THRESH_OTSU the threshold is chosen automatically from the image
    # histogram, so the threshold argument passed to cv2.threshold (0 below) is ignored.
    # threshold_value is only used by the fixed-threshold variant that is commented out.
    # 255 is maxVal: the value given to pixels that are above the threshold.
    #return cv2.threshold(image, threshold_value, 255, cv2.THRESH_BINARY)[1] # fixed binary threshold
    return cv2.threshold(image, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1] # Otsu's method


#dilation
def dilate(image):
    kernel = np.ones((5,5),np.uint8)
    return cv2.dilate(image, kernel, iterations = 1)
    
#erosion
def erode(image):
    kernel = np.ones((5,5),np.uint8)
    return cv2.erode(image, kernel, iterations = 1)

#opening - erosion followed by dilation
def opening(image):
    kernel = np.ones((5,5),np.uint8)
    return cv2.morphologyEx(image, cv2.MORPH_OPEN, kernel)

#canny edge detection
def canny(image):
    return cv2.Canny(image, 100, 200)

#skew correction
def deskew(image):
    coords = np.column_stack(np.where(image > 0))
    angle = cv2.minAreaRect(coords)[-1]
    if angle < -45:
        angle = -(90 + angle)
    else:
        angle = -angle
    (h, w) = image.shape[:2]
    center = (w // 2, h // 2)
    M = cv2.getRotationMatrix2D(center, angle, 1.0)
    rotated = cv2.warpAffine(image, M, (w, h), flags=cv2.INTER_CUBIC, borderMode=cv2.BORDER_REPLICATE)
    return rotated
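
To make the `thresholding` step less of a black box, here is a minimal pure-Python sketch of what Otsu's method computes: the threshold that maximises the variance between the dark and bright pixel classes. It works on a flat list of 8-bit values and is for illustration only; `otsu_threshold` and `binarize` are hypothetical names, and OpenCV's implementation is far faster and may break ties differently.

```python
def otsu_threshold(pixels):
    """Return the threshold t (0-255) that maximises between-class variance.

    Pixels <= t form one class, pixels > t the other.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels at or below t
    sum0 = 0  # sum of their intensities
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, so the variance is undefined
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        between_var = w0 * w1 * (mean0 - mean1) ** 2
        if between_var > best_var:
            best_t, best_var = t, between_var
    return best_t


def binarize(pixels, t):
    """Mimic cv2.THRESH_BINARY: 255 where the pixel is above t, else 0."""
    return [255 if p > t else 0 for p in pixels]


# Toy "image": half dark pixels, half bright pixels
pixels = [10] * 50 + [200] * 50
t = otsu_threshold(pixels)
print(binarize(pixels, t).count(255))  # 50
```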


Run the OCR again with preprocessing

In [None]:
import fitz  # PyMuPDF
import numpy as np

# Path to your PDF file
pdf_file = 'Section 165 application.pdf'

# Open the PDF file
doc = fitz.open(pdf_file)

# Initialize an empty string to accumulate text
full_text = ""

# Frequency of progress updates
n = 10

# Total number of pages
total_pages = len(doc)

# Iterate through each page
for page_num in range(total_pages):
    # Print progress every n pages
    if (page_num + 1) % n == 0:
        print(f"Processing page {page_num + 1} of {total_pages}...")

    # Get the page
    page = doc.load_page(page_num)

    # Convert the page to an image (pix) object
    pix = page.get_pixmap()

    # Store the image in a format that OpenCV can read (in memory, without saving to disk)
    img = cv2.imdecode(np.frombuffer(pix.tobytes(), np.uint8), 1)
    
    # Preprocess the images
    gray = get_grayscale(img)
    thresh = thresholding(gray)
    #opened = opening(thresh)
    #edged = canny(opened)
    #deskewed = deskew(edged)

    # Adding custom options for PyTesseract
    custom_config = r'--oem 3 --psm 6'
    
    # Extract text using PyTesseract from the current page image
    text = pytesseract.image_to_string(thresh, config=custom_config)
    
    # Add the extracted text to the full text
    full_text += f"Text from page {page_num + 1}:\n{text}\n\n"

# Close the document
doc.close()

# Save the full text to a file
with open('section_165_preprocess_ocr_output.txt', 'w') as file:
    file.write(full_text)

Observation: OCR with Tesseract does not work well on this document. Much of the extracted content is unintelligible, and preprocessing appears to degrade the output further.
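
The comparison above was made by eye. One way to put a rough number on it, sketched here under the assumption that a hand-typed reference transcript of a page is available (this notebook does not include one), is a character-level similarity ratio from the standard library's `difflib`. The function name `ocr_similarity` and the sample strings are illustrative only.

```python
import difflib


def ocr_similarity(reference, ocr_text):
    """Return a similarity ratio between 0 and 1 for a reference and an OCR output."""
    # Collapse whitespace so line-wrapping differences are not counted as errors
    ref = " ".join(reference.split())
    hyp = " ".join(ocr_text.split())
    return difflib.SequenceMatcher(None, ref, hyp, autojunk=False).ratio()


# Illustrative strings, not real output from this notebook
reference = "He didn't know what was going to happen"
noisy = "He didnt kn0w what was g0ing to hapen"

print(ocr_similarity(reference, reference))  # 1.0
print(ocr_similarity(reference, noisy))
```

Running the same function on the plain and the preprocessed output for one page would show which is closer to the reference.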

#### Run the OCR functions with docTR

Install the required packages

In [None]:
# Uncomment to install the required packages
# !pip install pymupdf       # converts PDF pages to images
# !pip install pillow        # image processing
# !pip install python-doctr  # used for OCR


In [9]:
# Experiment with one image
from doctr.io import DocumentFile
from doctr.models import ocr_predictor

# Load a document
doc = DocumentFile.from_images('harry.jpeg')

# Load the OCR model
model = ocr_predictor(pretrained=True)

# Perform OCR on the document
result = model(doc)

# The result is a structured document
for page in result.pages:
    for block in page.blocks:
        for line in block.lines:
            # Concatenate words in the line to form a sentence
            line_text = ' '.join(word.value for word in line.words)
            print(line_text)


fight back to Oslo. He didn't know what was going to
aRerrou that; the plan he had worked out with Johan Krohn went
affer
-
than this. in the rowlocks and started to row.
the oars
Peid him. Thought back to the time he used to row while
sat in front of him, smiling and giving Harry little bits
he should use his upper body and straighten his arms,
How stomach, not his biceps. That he should take it gently,never
his rhythm, that a boat gliding evenly through the watermoves
SS, findar less energy. That he should feel with his buttocks
Bster even was sitting in the middle of the bench. That it was all
ple sure he That he shouldn't look at the oars, but keep his eyes on
aur balance. the signs ofwhat had already happened showed you where
kemde heading. But, his grandfather had said, they told you surpris-
pus ere about what was going to happen. That was determined by
stroke of the oars. His grandfather took out his pocket watch
pisidt that when we get back on shore, we look back on our journe
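
The nested loop over pages, blocks, lines and words is repeated in the cells that follow, so it can be factored into a small generator. This is a sketch: `lines_from_result` is a hypothetical helper, and the `SimpleNamespace` objects below are stand-ins that only mimic the attributes used here (`pages`, `blocks`, `lines`, `words`, `value`), so the example runs without docTR installed.

```python
from types import SimpleNamespace


def lines_from_result(result):
    """Yield one string per detected line, with its words joined by spaces."""
    for page in result.pages:
        for block in page.blocks:
            for line in block.lines:
                yield " ".join(word.value for word in line.words)


def _line(*values):
    """Build a stand-in line object from word strings (illustration only)."""
    return SimpleNamespace(words=[SimpleNamespace(value=v) for v in values])


# Stand-in for a docTR result with one page, one block and two lines
fake_result = SimpleNamespace(
    pages=[
        SimpleNamespace(
            blocks=[
                SimpleNamespace(lines=[_line("fight", "back"), _line("to", "Oslo.")])
            ]
        )
    ]
)

print("\n".join(lines_from_result(fake_result)))
```

With a real docTR result, `"\n".join(lines_from_result(result))` would give the same text that the loop above prints line by line.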

In [7]:
# Convert an entire pdf

import fitz  # PyMuPDF
from PIL import Image
from io import BytesIO
from doctr.io import DocumentFile
from doctr.models import ocr_predictor

# Path to your PDF file
pdf_file = './georg/pdfs/Notice of Motion (Long Form) issued_LPC RAF 180 DAYS 2024.pdf'

# Output file path
output_file = 'Notice_Long_ocr_doctr.txt'

# Open the PDF file
doc = fitz.open(pdf_file)

# Initialize the OCR model
model = ocr_predictor(pretrained=True)

# Initialize a variable to store the full text
full_text = ""

# Frequency of progress updates
n = 5

# Total number of pages
total_pages = len(doc)

# Iterate through each page of the PDF
for page_num in range(len(doc)):
    # Print progress every n pages
    if (page_num + 1) % n == 0:
        print(f"Processing page {page_num + 1} of {total_pages}...")
        
    # Get the page
    page = doc.load_page(page_num)

    # Render the page to PNG bytes; docTR accepts raw image bytes directly,
    # so no round trip through PIL is needed
    pix = page.get_pixmap()
    image_bytes = pix.tobytes("png")

    # Perform OCR on the image
    result = model(DocumentFile.from_images([image_bytes]))

    # Process the result
    for doc_page in result.pages:
        for block in doc_page.blocks:
            for line in block.lines:
                line_text = ' '.join(word.value for word in line.words)
                full_text += line_text + '\n'

# Close the document
doc.close()

# Save the full text to a file
with open(output_file, 'w') as file:
    file.write(full_text)
    print("Preprocessing complete")


Processing page 5 of 9...
Preprocessing complete


#### Explain the code using the Google PaLM API

In [70]:
# Paste in the code block
# (a raw string, so escapes such as \n inside the pasted code stay literal)

CODE_BLOCK = r"""
# Convert an entire pdf

import fitz  # PyMuPDF
from PIL import Image
from io import BytesIO
from doctr.io import DocumentFile
from doctr.models import ocr_predictor

# Path to your PDF file
pdf_file = 'Section 165 application.pdf'

# Output file path
output_file = 'section_165_ocr_doctr.txt'

# Open the PDF file
doc = fitz.open(pdf_file)

# Initialize the OCR model
model = ocr_predictor(pretrained=True)

# Initialize a variable to store the full text
full_text = ""

# Frequency of progress updates
n = 10

# Total number of pages
total_pages = len(doc)

# Iterate through each page of the PDF
for page_num in range(len(doc)):
    # Print progress every n pages
    if (page_num + 1) % n == 0:
        print(f"Processing page {page_num + 1} of {total_pages}...")
        
    # Get the page
    page = doc.load_page(page_num)

    # Convert the page to a PIL image
    pix = page.get_pixmap()
    img_bytes = BytesIO(pix.tobytes("png"))
    image = Image.open(img_bytes)

    # Convert PIL image to byte array
    image_byte_array = BytesIO()
    image.save(image_byte_array, format='PNG')
    image_byte_array = image_byte_array.getvalue()

    # Perform OCR on the image
    result = model(DocumentFile.from_images([image_byte_array]))

    # Process the result
    for doc_page in result.pages:
        for block in doc_page.blocks:
            for line in block.lines:
                line_text = ' '.join(word.value for word in line.words)
                full_text += line_text + '\n'

# Close the document
doc.close()

# Save the full text to a file
with open(output_file, 'w') as file:
    file.write(full_text)
"""

In [73]:
# Set up the prompt

prompt_template = """
Can you please explain how this code works?

{question}

Use a lot of detail and make it as clear as possible.
Output the results in markdown
"""

In [74]:
# Run the completion

completion = generate_text(
    prompt = prompt_template.format(question=CODE_BLOCK)
)
print(completion.result)

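
A note on how the prompt in this section is assembled: `str.format` only interprets braces in the template itself, so braces inside the pasted code (for example in its f-strings) pass through untouched. The sketch below checks this with a one-line `snippet`, a hypothetical stand-in for `CODE_BLOCK`.

```python
prompt_template = """
Can you please explain how this code works?

{question}

Use a lot of detail and make it as clear as possible.
Output the results in markdown
"""

# Hypothetical one-line stand-in for CODE_BLOCK; note the braces in the f-string
snippet = 'print(f"Processing page {page_num + 1} of {total_pages}...")'

# Only the {question} placeholder in the template is substituted
prompt = prompt_template.format(question=snippet)
print(prompt)
```

The printed prompt contains the snippet verbatim, braces included, so no escaping of the pasted code is needed.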
