## Task

Choose one of the following tasks:
Task 1: Optimized Text Extraction from PDF Images Using Python OCR
Goal:
Develop a Python script for efficient extraction of specific fields from PDF images using various
OCR libraries.
Tasks:
Write a function that takes a PDF file, a keyword (e.g., "social security number"), and a direction
(e.g., "bottom") as input. The function should extract the value associated with the keyword
located in the specified direction from the keyword.
Maximize time and resource efficiency by utilizing GPU where possible.
We recommend using Google Colab for implementation as it provides an option for selecting
GPU.
Implementation Ideas:
Use an iterative approach for reading text: keep reading the text until the desired keyword and
its corresponding value are found.
Try optimizing the process using different approaches, such as reducing image size, parallel
reading of text from different parts of the image, etc.
Use any ideas you find effective.
Implementation:
You can use any OCR library you find suitable, including but not limited to Tesseract,
Pytesseract, EasyOCR, etc. Develop and test your solutions to ensure the best performance
and accuracy.
Evaluation:
Your code will be evaluated based on its efficiency, accuracy, and cleanliness. Pay attention to
how your code handles errors and unexpected situations. We will also assess your ability to
optimize performance and resource usage.

## Implementation

Importing libraries

In [2]:
from pdf2image import convert_from_path  # requires Poppler to be installed
from PIL import Image
import pytesseract
import cv2
import numpy as np

# Path to the Tesseract binary; this one is for a default Windows install, adjust it for your system
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

I'm going to use Pytesseract. It can't read PDF files directly, so I first convert the PDF to a PNG. I'm assuming that every input PDF looks like the one in the example (one page containing one image).

In [3]:
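def pdf_to_png(pdf):
    # Returns -2 if the PDF cannot be read and -1 if it has more than one page;
    # on success the page is saved as 'sample.png'
    try:
        image = convert_from_path(pdf)
    except Exception:
        return -2
    if len(image) > 1:
        return -1
    image[0].save('sample.png', 'PNG')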

Cleaning the image: convert it to grayscale, apply dilation and erosion to remove some noise, then apply a threshold to get a purely black-and-white image

In [4]:
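def cleaning_img(img):
    # Dilation followed by erosion; note that a 1x1 kernel leaves the image unchanged,
    # so a larger kernel (e.g. 2x2) is needed for this step to remove any noise
    kernel = np.ones((1, 1), np.uint8)
    img = cv2.dilate(img, kernel, iterations=1)
    img = cv2.erode(img, kernel, iterations=1)
    return img

def cleaning_img2(img):
    img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)  # convert to grayscale
    kernel = np.ones((1, 1), np.uint8)
    img = cv2.dilate(img, kernel, iterations=1)
    img = cv2.erode(img, kernel, iterations=1)
    # Adaptive Gaussian threshold (block size 31, constant 2) gives a black-and-white image
    img = cv2.adaptiveThreshold(img, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 31, 2)
    return img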

Splitting the image into 9 equal parts (the number can easily be changed)

In [5]:
def img_split(img):
    img = cleaning_img2(img)
    height, width = img.shape
    # Number of pieces horizontally
    W_SIZE = 3
    # Number of pieces vertically
    H_SIZE = 3
    h = height / H_SIZE
    w = width / W_SIZE
    for ih in range(H_SIZE):
        for iw in range(W_SIZE):
            x = w * iw
            y = h * ih
            img2 = img[int(y):int(y + h), int(x):int(x + w)]
            data = Image.fromarray(img2)
            # The file name encodes the position: 'splitted<row><column>.png'
            data.save('splitted' + str(ih) + str(iw) + '.png')
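
The slicing arithmetic in img_split can be checked without an image. A minimal sketch, where `tile_bounds` is a hypothetical helper (not part of the notebook) that returns the pixel range of each part using the same formulas:

```python
def tile_bounds(height, width, rows=3, cols=3):
    """Return {(row, col): (y0, y1, x0, x1)} using the same arithmetic as img_split."""
    h = height / rows
    w = width / cols
    bounds = {}
    for ih in range(rows):
        for iw in range(cols):
            y = h * ih
            x = w * iw
            bounds[(ih, iw)] = (int(y), int(y + h), int(x), int(x + w))
    return bounds


# An image 900 pixels high and 600 pixels wide splits into nine 300x200 parts.
bounds = tile_bounds(900, 600)
print(bounds[(0, 2)])  # top right part: (0, 300, 400, 600)
print(bounds[(2, 1)])  # bottom center part: (600, 900, 200, 400)
```

The keys match the two digits in the saved file names, so `(0, 2)` corresponds to 'splitted02.png'.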

Now let's create a function that finds the information associated with a given keyword.

The idea is the following: I split the selected region of the image into rectangles (blocks of text), search for the keyword in each rectangle, and return all the text around it.

The function crop_img takes an image and a keyword and returns all the text in the first rectangle where the keyword is found (the first one, because the first rectangles are smaller)

In [6]:
def crop_img(img, keyword):
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # Otsu threshold, inverted so that text becomes white on a black background
    ret, thresh1 = cv2.threshold(gray, 0, 255, cv2.THRESH_OTSU | cv2.THRESH_BINARY_INV)
    # Dilate with a large kernel so that neighbouring characters merge into text blocks
    rect_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (13, 13))
    dilation = cv2.dilate(thresh1, rect_kernel, iterations=1)
    contours, hierarchy = cv2.findContours(dilation, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
    im2 = img.copy()  # copy used only for drawing the debug grid
    for cnt in contours:
        x, y, w, h = cv2.boundingRect(cnt)
        rect = cv2.rectangle(im2, (x, y), (x + w, y + h), (0, 255, 0), 2)
        # Crop from the original image so the green outlines are not passed to OCR
        cropped = img[y:y + h, x:x + w]
        text = pytesseract.image_to_string(cleaning_img(cropped))
        if keyword in text:
            cv2.imwrite('cropping_grid.png', rect)
            return text
    return -1

This is what the cropping looks like. Unfortunately, it doesn't detect the name, address and other important fields, because they are inside special boxes.

![cropping_grid.png](attachment:cropping_grid.png)

To make the result more accurate, I take the characters adjacent to the keyword and treat them as the region assigned to the keyword. The function takes 25 characters before the keyword and 25 characters after it

In [7]:
def assigned_text(text, keyword, epsilon=25):
    # The window runs from epsilon characters before the first occurrence of the keyword
    # to epsilon characters after its last occurrence
    start_index = text.index(keyword) - epsilon
    end_index = text.rindex(keyword) + len(keyword) + epsilon
    # Clamp the window to the bounds of the text
    start_index = max(start_index, 0)
    end_index = min(end_index, len(text))
    return text[start_index:end_index]
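
To see what the window looks like, here is a small worked example on a made-up string. `context_window` is a hypothetical stand-in that repeats the logic of assigned_text so that the snippet runs on its own:

```python
def context_window(text, keyword, epsilon=25):
    """Same windowing as assigned_text: epsilon characters on each side of the keyword."""
    start = max(text.index(keyword) - epsilon, 0)
    end = min(text.rindex(keyword) + len(keyword) + epsilon, len(text))
    return text[start:end]


# Made-up sample text, not real OCR output.
sample = "abc KEY def"
print(context_window(sample, "KEY", epsilon=2))    # c KEY d
print(context_window(sample, "KEY", epsilon=100))  # abc KEY def (window clamped to the text)
```

Both `str.index` and `str.rindex` raise `ValueError` when the keyword is absent, so the function should only be called after the keyword has been found in the text.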

Let's put it all together. The final function handles the following errors: more than one image in the PDF, a wrong filename, a keyword that is not found, and an incorrect direction.

## Final function

In [8]:
def main_function(pdf, keyword, direction):
    status = pdf_to_png(pdf)  # convert once and reuse the result
    if status == -1:
        return('Error! Too many images')
    if status == -2:
        return('Error! File not found')
    img = cv2.imread('sample.png')
    img_split(img)
    if direction == 'top left':
        img_part = cv2.imread('splitted00.png')
    elif direction == 'top center':
        img_part = cv2.imread('splitted01.png')
    elif direction == 'top right':
        img_part = cv2.imread('splitted02.png')
    elif direction == 'middle left':
        img_part = cv2.imread('splitted10.png')
    elif direction == 'middle center':
        img_part = cv2.imread('splitted11.png')
    elif direction == 'middle right':
        img_part = cv2.imread('splitted12.png')
    elif direction == 'bottom left':
        img_part = cv2.imread('splitted20.png')
    elif direction == 'bottom center':
        img_part = cv2.imread('splitted21.png')
    elif direction == 'bottom right':
        img_part = cv2.imread('splitted22.png')
    else:
        return('Error! Direction incorrect')
    text = crop_img(img_part, keyword)
    if text == -1:
        return ('Error! Keyword not found. Try another direction or keyword')
    epsilon = 25
    result = assigned_text(text, keyword, epsilon)
    return result

Note that there are 9 ways to specify direction:

    top left
    top center
    top right
    middle left
    middle center
    middle right
    bottom left
    bottom center
    bottom right
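
The nine names follow a row and column pattern, so they can also be generated instead of listed by hand. A minimal sketch of an alternative to the if/elif chain; `DIRECTIONS` and `part_filename` are hypothetical names, not part of the notebook:

```python
# Map each direction name to the (row, column) of the part saved by img_split.
DIRECTIONS = {
    f"{row_name} {col_name}": (row, col)
    for row, row_name in enumerate(("top", "middle", "bottom"))
    for col, col_name in enumerate(("left", "center", "right"))
}


def part_filename(direction):
    """Return the file name of the part for a direction, e.g. 'splitted02.png'."""
    if direction not in DIRECTIONS:
        raise ValueError(f"Direction incorrect: {direction!r}")
    row, col = DIRECTIONS[direction]
    return f"splitted{row}{col}.png"


print(part_filename("top right"))      # splitted02.png
print(part_filename("bottom center"))  # splitted21.png
```

Keeping the names in one table means the accepted values and the file lookup cannot drift apart.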

In [14]:
print(main_function('sample.pdf','Social security number','top right'))

 tax from your pay.

(b) Social security number

111-11-1111

» Does you
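
If the keyword is not in the chosen part, the remaining parts can be tried one by one, which follows the iterative idea from the task description: stop reading as soon as the keyword is found. A minimal sketch, assuming a hypothetical `read_part` callable that returns the OCR text for one part (in the notebook it could wrap crop_img). Here it is backed by a dictionary of made-up strings so the example runs without Tesseract:

```python
def find_in_parts(part_names, read_part, keyword):
    """Read parts one at a time and stop at the first one that contains the keyword.

    Returns (part_name, text), or None if no part contains the keyword.
    """
    for name in part_names:
        text = read_part(name)
        if keyword in text:
            return name, text
    return None


# Made-up texts standing in for the OCR results of four parts.
fake_ocr = {
    "top left": "header text",
    "top center": "instructions",
    "top right": "(b) Social security number 111-11-1111",
    "middle left": "never read in this example",
}
calls = []


def read_part(name):
    calls.append(name)  # record which parts were actually read
    return fake_ocr[name]


print(find_in_parts(fake_ocr, read_part, "Social security number"))
print(calls)  # ['top left', 'top center', 'top right']
```

Since OCR is the expensive step, stopping at the first match avoids reading the parts that come after it.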
