# Tesseract example code

To use this code, you need to have the packages OpenCV (image reading and manipulation), Tesseract (OCR) and pytesseract (Python bindings for Tesseract) installed. You can install pytesseract by running the following command in your terminal:

```
pip3 install --user pytesseract
```

More information can be found at: https://pypi.org/project/pytesseract/

## Importing some modules

We start by importing some modules that will come in handy.

In [None]:
import cv2                      # Computer Vision
import numpy as np              # Vector math
import pytesseract              # OCR
import matplotlib.pyplot as plt # Plotting
# The next line is only needed in a Jupyter notebook
%matplotlib inline

## Reading text from a file and rendering it to an image

For testing tesseract it can be useful to be able to create an image with text and then convert the image back to text data. We can conceptualise this as a function $f: T \rightarrow I$ from the text domain to the image domain. This will be implemented using [OpenCV](https://opencv.org/). We then define a function $g: I \rightarrow T$, as an approximation of the inverse of $f$. Tesseract will stand in for the inverse function. After these steps, we can see if we got back what we started with. The character (or word) error rate can be measured as a way to quantify the quality of $g$. For testing the robustness of $g$, we will insert some image noise between $f$ and $g$. This is the qualitative part of the lab.
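
As a sketch of how these pieces fit together, the character error rate for a text $t \in T$ can be written as below. It assumes the error count is normalised by the longer of the two texts, which is what the code later in this notebook does. The symbols $d_{\mathrm{lev}}$ (Levenshtein edit distance) and $|\cdot|$ (number of characters) are notation introduced here for illustration.

$$
\mathrm{CER}(t) = \frac{d_{\mathrm{lev}}\big(t,\ g(f(t))\big)}{\max\big(|t|,\ |g(f(t))|\big)}
$$

The word error rate has the same form, with both texts split into words before the distance and the lengths are computed.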

First, we'll need to read some text.

In [None]:
# Read in some text
with open("Eisenhower.txt", 'r') as file:
    original_text = file.read()

print(original_text)

Next, we render the text to an image and crop the image to reduce the margins (our function $f$).

In [None]:
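def render_text(texttorender, scale=1.0):
    """Draw white text on a black background, one image row block per text line."""
    lines = texttorender.split('\n')
    height = int(len(lines)*40*scale)
    width = int(max(len(line) for line in lines)*20*scale)
    # 8-bit greyscale, so the image can be handed to other image libraries as-is
    img = np.zeros((height, width), dtype=np.uint8)
    for n, textline in enumerate(lines):
        img = cv2.putText(img, textline, (10, int((n+1)*40*scale)), cv2.FONT_HERSHEY_COMPLEX,
                          scale, (255, 255, 255), 2, cv2.LINE_AA)
    return img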

def autocrop(image):
    """Reducing the size of the image to only include the parts with foreground."""
    for axis in [0, 1]:
        s = np.sum(image>0, axis=axis)
        a = 0
        b = len(s)-1
        while s[a] == 0:
            a += 1
        a = max(a-5, 0)
        while s[b] == 0:
            b -= 1
        b = min(b+5, len(s)-1)
        if axis == 1:
            image = image[a:b+1, :]
        else:
            image = image[:, a:b+1]
    return image

def f(text_to_render, scale=1.0):
    return autocrop(render_text(text_to_render, scale=scale))

image_with_text = f(original_text)
plt.figure(figsize=(12, 12))
plt.imshow(image_with_text, cmap='gray')
plt.show()
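
The loop-based `autocrop` above assumes the image contains at least one foreground pixel. The following is a minimal sketch of the same bounding-box idea using NumPy index operations. The name `autocrop_vectorised` and its `margin` parameter are hypothetical and not part of the original notebook.

```python
import numpy as np

def autocrop_vectorised(image, margin=5):
    """Crop to the bounding box of non-zero pixels, keeping a small margin."""
    rows = np.flatnonzero(np.any(image > 0, axis=1))  # rows containing foreground
    cols = np.flatnonzero(np.any(image > 0, axis=0))  # columns containing foreground
    if rows.size == 0:
        return image  # blank image: nothing to crop
    top = max(rows[0] - margin, 0)
    bottom = min(rows[-1] + margin, image.shape[0] - 1)
    left = max(cols[0] - margin, 0)
    right = min(cols[-1] + margin, image.shape[1] - 1)
    return image[top:bottom + 1, left:right + 1]

# Synthetic test image: a white rectangle on a black background
demo = np.zeros((50, 60), dtype=np.uint8)
demo[20:30, 10:40] = 255
cropped = autocrop_vectorised(demo)
print(cropped.shape)  # (20, 40)
```

The rectangle covers 10 rows and 30 columns, and the margin of 5 pixels on each side gives the 20 by 40 result.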

Now for the function $g$ (this can take some time, depending on the available computing power).

In [None]:
extracted_text = pytesseract.image_to_string(image_with_text)
print(extracted_text)
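
The robustness test mentioned earlier, inserting image noise between $f$ and $g$, is not implemented above. A minimal sketch follows, assuming additive Gaussian noise on an 8-bit greyscale image. The helper `add_gaussian_noise` and its parameters `sigma` and `seed` are hypothetical, and other noise models (for example salt-and-pepper) would be equally valid choices.

```python
import numpy as np

def add_gaussian_noise(image, sigma=25.0, seed=0):
    """Return a noisy uint8 copy of a greyscale image (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    noisy = image.astype(np.float64) + rng.normal(0.0, sigma, size=image.shape)
    # Clip back to the valid 8-bit range before converting
    return np.clip(noisy, 0, 255).astype(np.uint8)

# Synthetic stand-in for a rendered text image
clean = np.zeros((40, 80), dtype=np.uint8)
clean[10:30, 20:60] = 255
noisy = add_gaussian_noise(clean, sigma=25.0)
print(noisy.shape, noisy.dtype)  # (40, 80) uint8
```

In this notebook the noisy image would then be passed to Tesseract in place of the clean one, for example `pytesseract.image_to_string(add_gaussian_noise(image_with_text))`, and the error rates compared for increasing `sigma`.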

## Testing the quality

A common quality metric for OCR is the word error rate (WER): the number of word-level errors (substitutions, insertions and deletions) in relation to the total number of words. The errors are counted by flexibly matching the original text with the text returned from the OCR, using the Levenshtein (edit) distance.

The `levenshtein` module imported in the next cell uses Cython for speed in the Levenshtein calculations. You must have the Cython package installed to use it, as the module needs to be compiled on your machine.

In [None]:
from levenshtein import wer, cer
print(wer.__doc__)
print(cer.__doc__)

In [None]:
print(original_text.split()[:20])
print(extracted_text.split()[:20])
we = wer(original_text, extracted_text)
print("Word errors:", we)
print("WER:", we/max(len(original_text.split()), len(extracted_text.split())))

In [None]:
print(list(original_text)[:20])
print(list(extracted_text)[:20])
ce = cer(original_text, extracted_text)
print("Character errors:", ce)
print("CER:", ce/max(len(original_text), len(extracted_text)))
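
For readers who cannot compile the Cython module, here is a pure-Python sketch of the edit-distance counting. It assumes that `wer` and `cer` return raw error counts, as their use above suggests. The implementation of the `levenshtein` module is not shown in this notebook, so the names `edit_distance`, `word_errors` and `char_errors` are hypothetical stand-ins and not that module's API.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_item != hyp_item)
            current.append(min(previous[j] + 1,      # deletion
                               current[j - 1] + 1,   # insertion
                               substitution))        # substitution or match
        previous = current
    return previous[-1]

def word_errors(reference, hypothesis):
    """Edit distance counted over whitespace-separated words."""
    return edit_distance(reference.split(), hypothesis.split())

def char_errors(reference, hypothesis):
    """Edit distance counted over characters."""
    return edit_distance(reference, hypothesis)

print(word_errors("the quick brown fox", "the quick brown box"))  # 1
print(char_errors("kitten", "sitting"))                           # 3
```

This version runs in time proportional to the product of the two lengths, so it will be noticeably slower than compiled code on long texts.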

## A real world example

This image is from an old encyclopedia.

In [None]:
with open("Gutenberg.txt", 'r', encoding='utf-8') as file:
    original_text = file.read()
print(original_text)

img = cv2.imread("Gutenberg.png", cv2.IMREAD_GRAYSCALE)
plt.figure(figsize=(4, 12))
plt.imshow(img, cmap='gray')
plt.show()

Now for extracting the text and calculating the error rates.

In [None]:
extracted_text = pytesseract.image_to_string(img)

we = wer(original_text, extracted_text)
print("Word errors:", we)
print("WER:", we/max(len(original_text.split()), len(extracted_text.split())))

ce = cer(original_text, extracted_text)
print("Character errors:", ce)
print("CER:", ce/max(len(original_text), len(extracted_text)))

OCR can incorporate knowledge of language in the recognition. Let's try OCR with a language model for Swedish. This requires the Swedish language data (`swe`) to be installed for Tesseract.

In [None]:
extracted_text = pytesseract.image_to_string(img, lang='swe')

we = wer(original_text, extracted_text)
print("Word errors:", we)
print("WER:", we/max(len(original_text.split()), len(extracted_text.split())))

ce = cer(original_text, extracted_text)
print("Character errors:", ce)
print("CER:", ce/max(len(original_text), len(extracted_text)))

Unsurprisingly, the word error rate is greatly improved.

Tesseract can also report a bounding box for each recognised character. We extract the boxes for the top part of the image and plot them on top of it.

In [None]:
img_cropped = img[:1000, :]
data = pytesseract.image_to_boxes(img_cropped)

In [None]:
plt.figure(figsize=(10, 10))
plt.imshow(img_cropped, cmap='gray')
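for entry in data.splitlines():
    fields = entry.split()
    if len(fields) != 6:
        continue  # Skip blank or malformed lines
    # Box format: character, left, bottom, right, top, page (origin at bottom-left)
    char, left, bottom, right, top = fields[0], *map(int, fields[1:5])
    # Flip the y axis, since image row 0 is at the top
    y_bottom = img_cropped.shape[0] - bottom
    y_top = img_cropped.shape[0] - top
    plt.plot([left, right, right, left, left], [y_bottom, y_bottom, y_top, y_top, y_bottom]) # Plot the box
    plt.text(right, y_bottom, char, color='m') # Plot the OCRed character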
plt.show()
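
The coordinate handling in the plotting cell can be isolated into a small function that is easy to check without running Tesseract. The sketch below assumes the box format `character left bottom right top page` with the origin in the bottom-left corner of the image. The function name `parse_boxes` and the sample string are made up for illustration.

```python
def parse_boxes(box_text, image_height):
    """Convert Tesseract box-format text to boxes with a top-left origin.

    Returns tuples of (char, x_left, y_top, x_right, y_bottom) in pixel
    coordinates as used by image arrays, where row 0 is at the top.
    """
    boxes = []
    for line in box_text.splitlines():
        fields = line.split()
        if len(fields) != 6:
            continue  # skip blank or malformed lines
        char = fields[0]
        left, bottom, right, top = (int(v) for v in fields[1:5])
        boxes.append((char, left, image_height - top, right, image_height - bottom))
    return boxes

# Made-up sample in box format, for an image that is 100 pixels high
sample = "T 10 80 20 95 0\ne 22 80 30 90 0\n"
print(parse_boxes(sample, image_height=100))
# [('T', 10, 5, 20, 20), ('e', 22, 10, 30, 20)]
```

With the real data this would be called as `parse_boxes(data, img_cropped.shape[0])`.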