## Combining OCR with Imaging

In this week of the course we have tiptoed into the world of optical character recognition. While we saw a really
nice example of OCR working well with clean text, the text in the more authentic poster image doesn't come through very
well. There are a few things we can do to improve our ability to recognize characters, and these techniques are
rooted in image manipulation.


In [None]:
# Let's bring in our image manipulation libraries
import PIL
from PIL import Image, ImageDraw
from IPython.display import display
import pytesseract


# Here is a helper function to display a marked up image with text boxes, adapted from
# the example in the previous lecture
def ocr_with_boxes(image):
    acopy = image.copy()
    draw = ImageDraw.Draw(acopy)
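    # image_to_data returns tab-separated rows, so skip the header row. Columns 6 through 10 hold
    # the left, top, width, height, and confidence score for each detected element
    for line in pytesseract.image_to_data(image).splitlines()[1:]:
        x, y, w, h, c = [float(p) for p in line.split("\t")[6:11]]
        # Rows with a confidence of -1 are layout elements rather than recognized words
        if c > 0:
            draw.rectangle((x, y, x + w, y + h), outline="red", width=2)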
    return acopy


# I want to display several different things
poster = Image.open("food.jpg")
# 1. Metadata about the image
print(poster)
# 2. The image itself
display(poster)
# 3. The image marked up with text boxes
display(ocr_with_boxes(poster))
# 4. The text extracted from the image
print(pytesseract.image_to_string(poster))

Now, this image recognition isn't really that bad. We see that the bounding boxes tend to be correct, and there are just
a few transcription issues with some of the lettering. However, there are a few different techniques you can use in PIL
to improve this, so let's go through them and see how they improve the quality of the OCR.

### Step 1: Removing Extraneous Information

Extraneous information -- pixels that we know are not going to contain text we want -- should be removed. In this image,
cropping around the food poster itself is probably what we want to do. Keep in mind, though, that tesseract was heavily
tuned to work well on scans of books, so keeping a margin around your text is a pretty reasonable thing to do.
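
Here's a minimal sketch of cropping while keeping a margin around the text. The helper name `crop_with_margin`, the 10
pixel margin, and the box coordinates are all assumptions made up for illustration, not values measured from the poster.

```python
from PIL import Image


def crop_with_margin(image, box, margin=10):
    """Crop `image` to `box` (left, upper, right, lower), padded by `margin` pixels.

    The padded box is clamped so it never extends past the image edges.
    """
    left, upper, right, lower = box
    padded = (
        max(left - margin, 0),
        max(upper - margin, 0),
        min(right + margin, image.width),
        min(lower + margin, image.height),
    )
    return image.crop(padded)


# A blank stand-in image, used only so the sketch runs without a file on disk
sample = Image.new("RGB", (515, 640), "white")
cropped = crop_with_margin(sample, (100, 100, 400, 500), margin=10)
print(cropped.size)  # (320, 420)
```

Clamping the padded box to the image bounds means a box that already touches an edge simply gets no extra margin on
that side.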

### Step 2: Resize the Image

The tesseract docs suggest that larger images will generally work better, so we can experiment a bit with scaling the
image to improve the results. I've found tesseract is actually much better with bigger images in practice, and this one
is pretty small, being only 515x640 pixels in size.
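
As a rough sketch, scaling up with PIL might look like the following. The helper name `upscale` and the choice of the
`LANCZOS` resampling filter are my assumptions here. Whether a given filter or scale factor actually helps tesseract is
something you'd have to test on your own image.

```python
from PIL import Image


def upscale(image, factor=4):
    """Return a copy of `image` enlarged by an integer `factor` in each dimension."""
    new_size = (image.width * factor, image.height * factor)
    # LANCZOS is a high-quality resampling filter; it is slower than
    # nearest-neighbour but tends to keep letter edges smoother
    return image.resize(new_size, resample=Image.LANCZOS)


# A blank stand-in with the same dimensions as the poster
sample = Image.new("RGB", (515, 640), "white")
bigger = upscale(sample, factor=4)
print(bigger.size)  # (2060, 2560)
```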

### Step 3: Converting to Greyscale

Generally, OCR tools do not work well on color images. In practice, tesseract and other OCR tools convert a color
image to a greyscale image before doing text detection. As we saw in last week's lecture though, the way in which you
convert a color image to greyscale can change what information is retained. If we convert the image to greyscale
ourselves before sending it to tesseract, we control which information is kept.
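
To make that concrete, here's a small sketch comparing two ways of getting a single-channel image out of the same
pixel. The pure red pixel is a made-up stand-in for colored lettering, not a value sampled from the poster.

```python
from PIL import Image

# A single pure-red pixel as a stand-in for colored lettering
red = Image.new("RGB", (1, 1), (255, 0, 0))

# Option A: PIL's built-in conversion, a weighted sum of the three channels
luma = red.convert("L")

# Option B: keep only one channel and discard the other two
red_channel, green_channel, blue_channel = red.split()

print(luma.getpixel((0, 0)))           # a dark-ish grey from the weighted sum
print(red_channel.getpixel((0, 0)))    # 255, the raw red intensity
print(green_channel.getpixel((0, 0)))  # 0, red lettering vanishes into black
```

The same red lettering comes out as dark grey, white, or black depending on the conversion, which is why the choice
can matter for the thresholding step that follows.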

### Step 4: Binarizing the Greyscale Image to Black and White

Text generally has high contrast with its background -- like black text on a white page -- otherwise it would be very
difficult for us humans to read the text. We can exaggerate this even more through a process called binarization, where
we change each grey pixel to either black or white based on a threshold. Now, what threshold value we should choose is
unclear, and that depends on your source material, but we represent this as a single byte of information -- a number
between 0 and 255 that indicates the cutoff between black and white.
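
One compact way to sketch this in PIL is with the `point` method, which applies a function to every pixel value. The
helper name `binarize` and the default threshold of 128 are assumptions for illustration, and the right cutoff will
depend on your image.

```python
from PIL import Image


def binarize(image, threshold=128):
    """Map every pixel of a greyscale image to 0 (black) or 255 (white)."""
    grey = image.convert("L")
    # Values below the threshold become black, everything else becomes white
    return grey.point(lambda value: 255 if value >= threshold else 0)


# A 4-pixel greyscale strip with values on both sides of the threshold
strip = Image.frombytes("L", (4, 1), bytes([0, 127, 128, 255]))

print(list(binarize(strip, threshold=128).tobytes()))  # [0, 0, 255, 255]
```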

### Step 5: Using Tesseract Specific Optimizations

Tesseract has many different configuration parameters you can pass to it which can improve the quality of the results,
including forcing a specific page segmentation algorithm, and setting a limit on the words or characters you want to
look for. We can play with these parameters a bit to experiment and see if it leads to an increase in quality.
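
As a sketch of what such options look like, the helper below assembles an option string. The helper name
`build_tesseract_config` is hypothetical, and the example values (page segmentation mode 6, a digits-only whitelist)
are arbitrary choices of mine. I'm also assuming the string gets split shell-style before it reaches tesseract, which
is why the whitelist is quoted.

```python
import shlex


def build_tesseract_config(psm=None, whitelist=None):
    """Assemble a command-line style option string for tesseract."""
    parts = []
    if psm is not None:
        # --psm selects the page segmentation mode, e.g. 6 treats the image
        # as a single uniform block of text
        parts.append(f"--psm {psm}")
    if whitelist is not None:
        # -c sets a configuration variable; quoting keeps spaces and
        # punctuation in the whitelist inside one argument
        parts.append("-c " + shlex.quote(f"tessedit_char_whitelist={whitelist}"))
    return " ".join(parts)


config = build_tesseract_config(psm=6, whitelist="0123456789 .")
print(config)
print(shlex.split(config))
```

The resulting string could then be passed along as `pytesseract.image_to_string(image, config=config)`.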

These are just five steps I use to improve my OCR results, and how well they work depends on the kind of image you are
dealing with and the level of quality you need to get out of it. Let's play with these in the notebook.


In [None]:
# Let's tackle the first three together, as we can use PIL for them all
image = poster.copy()  # I like to make a copy to preserve the original

# First up, let's crop out some of the extraneous information at the top and sides
# of the image. I found these numbers experimentally, and they work well.
# The box is (left, upper, right, lower) in pixels.
image = image.crop((85, 80, 425, 540))

# Now let's resize the image
image = image.resize((image.width * 4, image.height * 4))

# Now let's change it to a greyscale (single channel) image
image = image.convert("L")

# Ok, that's steps 1 through 3. Now we have to binarize (or threshold) the image. This means we will
# convert all pixels to either black or white. Remember that the image is now in L mode, so each pixel
# is a single byte. We can set a "threshold" value as to the cutoff, and I'm going to use 140 as that
# number. Feel free to experiment in the notebook and see how changing this threshold affects the image!
threshold = 140

# Remember that a bytes object in python is immutable, but we can create a new bytearray of the
# same size and copy the bytes into it, then do our processing there.
new_image = bytearray(image.tobytes())

for location, value in enumerate(new_image):
    if value < threshold:
        new_image[location] = 0
    else:
        new_image[location] = 255

# Create a new image from the bytearray
new_image = PIL.Image.frombytes(image.mode, image.size, bytes(new_image))

# Alright, we now have a bigger, cropped, binarized black and white image.
display(ocr_with_boxes(new_image))
print(pytesseract.image_to_string(new_image))

Ok, so not a huge difference between these results and the initial results right out of the gate. We get rid of a few
extraneous characters, but it's still not perfect. We could keep tweaking here, trying different sizes or thresholds,
but we can also consider what tesseract can do for us. If you look online at the tesseract documentation you'll find
there are a huge number of different parameters we can tweak by setting their values in a configuration file. This is
ok, but it doesn't really work well with jupyter notebooks, where we are experimenting in real time and writing python
directly. The pytesseract library allows us to set some options when we make a call though, so let's give that a try.


In [None]:
# There are lots of configuration items, I'm just going to show a few. The first one
# is to set the language. By default, tesseract assumes the text is English. You can set
# a single language, or several joined with a plus sign (for example "eng+fra").
config = "-l eng "

# I don't expect the language to do much, because the default is already English, but
# it's good to know this is an option, and tesseract actually has really nice support
# for handling multiple languages at once.

# Another handy configuration parameter for tesseract is the list of allowable characters
# that it should look for. That sounds handy because we're getting some odd symbols
# in here, like ¢, and this should stop that.
config = (
    config
    + '-c tessedit_char_whitelist=" .0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ\'"'
)
print(pytesseract.image_to_string(new_image, config=config))

Ok, well, that's a pretty moderate improvement. Setting the list of allowable characters does help us, but you have to
be careful when you do this. You must make sure you are inclusive of all characters -- when I first put this together I
didn't include the ampersand and whitespace characters, and it looked pretty funny as a result.
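
One way to catch this kind of gap is to compare the whitelist against text you expect to see. This is just a sketch:
the helper name `missing_from_whitelist` is hypothetical, and the sample phrase is made up rather than taken from the
poster.

```python
def missing_from_whitelist(expected_text, whitelist):
    """Return the sorted characters in `expected_text` that `whitelist` lacks."""
    return sorted(set(expected_text) - set(whitelist))


whitelist = " .0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'"

# A made-up phrase, not text taken from the poster
print(missing_from_whitelist("Fish & Chips", whitelist))  # ['&']
```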

I've included below a few more images for you to try and experiment with. I really encourage you to jump into this
notebook and see if you can detect the text in these images and, if not, how you might use your abilities with the
Python Imaging Library to improve text detection. Open-ended exploration and practice is key in becoming better at
programming, so give it a shot!


In [None]:
# For practice, can you detect the text in the following images?
# What differences, if any, do you see between these three images?

additional_images = ["ocr1.jpg", "ocr2.jpg", "ocr3.jpg"]
for filename in additional_images:
    image = Image.open(filename)
    image.thumbnail((400, 400))
    display(image)