## Introducing Optical Character Recognition

Optical Character Recognition, or OCR, is a very hard problem. The goal is simple -- take an image which has some
natural language in it and convert it to a digital string representation so we can process the text. Numerous challenges
arise, the first being that characters in an image might not be clear, or they might be clear but of different fonts,
sizes, languages, or orientations in the image. While the field is old -- I've been using OCR since the first scanner I
owned, which would have been in the late 1990s -- there have been remarkable breakthroughs which have come with
advancements in the field of deep learning. Beyond the techniques used in deep learning, transformer-based artificial
intelligence systems, such as ChatGPT, have made this even easier now that media inputs can be manipulated by AI as
well. But using large language models for image recognition and processing is a different course, so let's instead talk
about how to do OCR work more directly in python.


The first thing to note is that, just like with the python imaging library, using OCR in python requires the support of
external libraries. In this case, we will use the pytesseract library, which is a python wrapper for the Tesseract OCR
engine. Now this is a bit different from the case of PIL -- with images, we just installed the python library and
were done, but with pytesseract we first have to install the tesseract command line tool and then we have to install the
python wrapper library. This is a pretty common pattern with python libraries, where sometimes they leave the heavy
lifting of computation to a tool which is used in a variety of different contexts.
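
Because the wrapper depends on a separate executable, one quick sanity check is to ask whether that executable can be found on the PATH. Here is a minimal sketch using only the standard library. The helper name `tesseract_available` is hypothetical, and it assumes the command line tool is installed under the name `tesseract`.

```python
import shutil


def tesseract_available(binary_name="tesseract"):
    """Return True if an executable called binary_name is found on the PATH."""
    # shutil.which returns the full path to the executable, or None if it is absent
    return shutil.which(binary_name) is not None


if tesseract_available():
    print("Found the tesseract command line tool")
else:
    print("tesseract is not on the PATH; install it before using pytesseract")
```

This only tells you the tool can be found, not that the python wrapper is installed, so both steps are still needed.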

Of course, we've installed that here for you, so let's take a look at how we can start doing OCR in python.


In [None]:
# Pytesseract builds on and uses PIL, so let's bring in that module,
# as well as the new module pytesseract
import PIL
from PIL import Image
from IPython.display import display
import pytesseract

# The very first thing we probably want to do is just see if we can detect
# some text. We can do that with the image_to_string function. Note, there are
# no methods here, pytesseract is a simple module with a few functions.
newspaper_article = Image.open("two_col.png")
display(newspaper_article)

# Now, let's see if we can extract the text from this image
text = pytesseract.image_to_string(newspaper_article)
print(text)

Ok, so I was a bit tricky here in that I'm using a two column newspaper article and it's got a few different font
choices -- some bolding and italics at the top with the reporter's name, and some horizontal bars separating the header
from the article. It's also fully justified text, which means the spacing between words is a variable amount, and this
can be a challenge for some OCR systems too. However, to my read, this is almost a perfect replication of the text, with
the one error being in the last paragraph, which starts as "I did wait" but was interpreted as "IT did wait".


In [None]:
# Another handy function is image_to_data. This function will return the
# bounding box coordinates, as well as the text detected in each box. This
# can be useful for visualizing how the OCR layout engine is working.
# This also demonstrates some of the challenges in working with python wrappers
# for C libraries. The return value of the image_to_data function is a string
# which is formatted as tab-separated values because it's meant to be easily
# read by humans on the command line. Let's take a look.

boxes = pytesseract.image_to_data(newspaper_article)

# Now we print out the boxes text line by line...
for box in boxes.splitlines():
    print(box)

Normally I would try and format this using a library which can handle spreadsheets in python, like pandas. If you're
interested in that, or in doing data science with python more generally, the University of Michigan has courses here on
the Coursera platform that you might want to check out. But let's take a look at this table first. We see that tesseract
returns a level, page number, block number, paragraph number, and line and word number for each and every word it
detects, including whitespace. It also includes the top left corner, in pixels, and the width and height of that text.
Finally, it includes the text itself as a string, and a confidence in its correctness. That confidence is interesting --
if we scroll down we can see that the confidence in the incorrectly identified text "IT" is quite low, at 27, while the
confidence for most of the other words is well above 90.
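
To make the table a bit more concrete, here is a small sketch of turning this kind of tab separated text into rows we can filter by confidence, using only the standard library. The sample rows are made up for illustration and are not the real output for the newspaper image, and the helper name `parse_tsv` is hypothetical. The column names in the header follow the ones tesseract prints.

```python
import csv
import io

# Made-up sample in the same layout as the image_to_data output (an assumption
# for illustration only): one line-level row followed by two word-level rows
sample_tsv = (
    "level\tpage_num\tblock_num\tpar_num\tline_num\tword_num\t"
    "left\ttop\twidth\theight\tconf\ttext\n"
    "4\t1\t1\t1\t1\t0\t36\t92\t500\t24\t-1\t\n"
    "5\t1\t1\t1\t1\t1\t36\t92\t60\t24\t96\tThe\n"
    "5\t1\t1\t1\t1\t2\t102\t92\t30\t24\t27\tIT\n"
)


def parse_tsv(tsv_text):
    """Parse tab separated text into a list of dicts keyed by column name."""
    # QUOTE_NONE keeps quote characters in the recognized text from being
    # treated as CSV quoting
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    return list(reader)


rows = parse_tsv(sample_tsv)

# Keep words whose confidence is low; rows with a confidence of -1 are skipped
low_confidence = [r["text"] for r in rows if 0 <= float(r["conf"]) < 50]
print(low_confidence)  # ['IT']
```

The threshold of 50 is an arbitrary choice for this sketch; in practice you would pick a cutoff by looking at your own data.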


In [None]:
# Let's draw those bounding boxes on the image. This is good practice and then we can see how tesseract itself is working.

from PIL import ImageDraw

draw = ImageDraw.Draw(newspaper_article)

for line in boxes.splitlines()[1:]:
    x, y, w, h, c = [float(p) for p in line.split("\t")[6:11]]
    if c > 0:
        draw.rectangle((x, y, x + w, y + h), outline="red", width=2)

display(newspaper_article)

Now, I've deliberately written things in a very tight format here, and that's mostly for pedagogical reasons. By the
end of this course you should be able to read this style of python and understand what's going on. It might not be super
fast for you to read, but let's deconstruct this one line in particular:
`x, y, w, h, c = [float(p) for p in line.split("\t")[6:11]]`

On the left hand side you can see that I've got several different variables, x, y, w, h, and c. This alone says to
python that we're going to do tuple unpacking, so the right hand side of this assignment needs to be an iterable -- like a
list or tuple -- and needs to have five and only five values. If there are more or fewer than five values, python will
raise a `ValueError`.
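
Here is a tiny sketch of tuple unpacking on its own, first with the right number of values and then with too few. The numbers are made up for illustration.

```python
# Five values on the right, five names on the left: this works
x, y, w, h, c = [36.0, 92.0, 60.0, 24.0, 96.0]
print(x, y, w, h, c)

# Only four values for five names: python raises a ValueError
try:
    x, y, w, h, c = [36.0, 92.0, 60.0, 24.0]
except ValueError as error:
    message = str(error)
    print("Unpacking failed:", message)
```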

Now, looking at the right hand side, we see everything is wrapped in square brackets, the indexing operator or list
construction syntax in python. What this means is that we are defining a list of items, either directly, where they are
separated by commas, or through a list comprehension. It's actually a list comprehension that I'm using here, which
means that the right hand side is going to be broken up into five parts, the first being the function we want to apply to each
data element, then the word `for`, then the variable name we're going to use, and I just chose the letter `p` for
brevity, then the word `in` and then some sequence or iterable. The function I actually want to apply to everything is
just the function `float`, which is the builtin way of creating new floating point numbers, and when you pass in a
parameter like this it does type conversion to make the input a floating point number. This is often referred to as type
casting.
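
If the comprehension still feels dense, it can help to see it next to the equivalent for loop. This is a small sketch with made-up string values; the two versions build the same list.

```python
# Made-up values, as strings, like the pieces we get after splitting a line
fields = ["36", "92", "60", "24", "96.5"]

# List comprehension: apply float to each item p in fields
as_floats = [float(p) for p in fields]

# The same thing written as an ordinary for loop
loop_floats = []
for p in fields:
    loop_floats.append(float(p))

print(as_floats)
print(as_floats == loop_floats)
```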

Alright, so looking further on the right hand side we can see that `p` is defined in relation to the results of
`line.split`. `line` is a string value, and the split method just breaks it up by whatever string you provide as an
argument. I've used `"\t"` which means a tab character. But I actually only want five values, and the line has twelve, so
I've used the python slicing syntax with the indexing operator, `[6:11]`, indicating that I only want the items from
index 6 up to, but not including, index 11. This gives me exactly five items -- the left, top, width, height, and
confidence columns -- which are then all changed into floats and bound to the variables x, y, w, h, and c.
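
Putting the split and the slice together on a single row looks like the sketch below. The row is made up for illustration, laid out with the columns listed earlier: level, page, block, paragraph, line, and word numbers, then left, top, width, height, confidence, and the text.

```python
# A made-up row in the tab separated layout (an assumption for illustration)
line = "5\t1\t1\t1\t1\t2\t102\t92\t30\t24\t27\tIT"

# Break the row into its individual columns
fields = line.split("\t")
print(len(fields))    # 12

# The slice starts at index 6 and stops just before index 11
print(fields[6:11])   # ['102', '92', '30', '24', '27']

# Convert the five pieces to floats and unpack them into five names
x, y, w, h, c = [float(p) for p in fields[6:11]]
print(x, y, w, h, c)
```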

Now, that was a big discussion, but I don't want this code to confuse or overwhelm you. With python, just go line by
line, chunk by chunk, and you can reason out what's happening in other people's code. And this is a bit of a superpower,
because once you can read other people's code you can learn from what they've written, and the world is filled with
wonderful open source libraries you can explore to enhance your own work.

Actually, maybe now would be a good time to clean this code up. Why don't you take a stab at turning this into a
function, with documentation of course, which takes in an image and returns a new image with all of these bounding boxes
on it.


In [None]:
# Let's take a look at another image, this one a bit more free form
# and which comes from an old World War One poster from the Library of
# Congress.


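def ocr_with_boxes(image):
    """Return a copy of image with a red box around each word detected with confidence above zero.

    :param image: a PIL.Image to run OCR on; the original is left unmodified
    :return: a new PIL.Image with the bounding boxes drawn on it
    """
    from PIL import ImageDraw

    # Draw on a copy so the caller's image is not changed
    acopy = image.copy()
    draw = ImageDraw.Draw(acopy)
    # Skip the header row, then pull out left, top, width, height, and confidence
    for line in pytesseract.image_to_data(acopy).splitlines()[1:]:
        x, y, w, h, c = [float(p) for p in line.split("\t")[6:11]]
        # Rows with a confidence of -1 are layout entries (blocks, lines), not words
        if c > 0:
            draw.rectangle((x, y, x + w, y + h), outline="red", width=2)
    return acopy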


display(ocr_with_boxes(Image.open("food.jpg")))
print(pytesseract.image_to_string(Image.open("food.jpg")))

Ok, we see a lot of things going on here. The OCR engine does a passable job, but not good enough for a task like deep
searching or closed captioning. What's handy though is that we can combine OCR with our existing image processing
abilities from last week, and we'll do that in the next lecture.
