Welcome to the workshop on Optical Character Recognition (OCR)

Before getting started:

*   Go to File --> Save a copy in Drive. The copied notebook should open in a new tab.
*   Alternatively, go to your Google Drive, find the folder "Colab Notebooks", and open the notebook from there.

Next, download:

*   Sample images: https://uppsala.box.com/s/qzvg9741dx915a0atc6mydo8qxgf9erc
*   Language models for Swedish (swe), French (fra), Italian (ita) and German (deu): https://uppsala.box.com/s/ovcpsdzj2dlomyghtw7vk50ilvxs3uix

Copy the "sample_images" folder from your Downloads folder to your Google Drive.

In the left panel, click on the upward arrow icon and go to /usr/share/tesseract-ocr/4.00/tessdata. Click the three dots next to the folder, choose "Upload", and select the four language models to upload.

Close the left panel (X).
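
To confirm that the upload worked, one option is to list the model files in the tessdata folder. The sketch below assumes the four models named above are the ones wanted; `installed_languages` and `missing_languages` are illustrative helper names, not part of Tesseract or pytesseract.

```python
from pathlib import Path


def installed_languages(tessdata_dir):
    """Return the language codes that have a .traineddata file in tessdata_dir."""
    folder = Path(tessdata_dir)
    if not folder.is_dir():
        return []
    # "swe.traineddata" -> "swe"
    return sorted(p.stem for p in folder.glob("*.traineddata"))


def missing_languages(tessdata_dir, wanted=("swe", "fra", "ita", "deu")):
    """Return the wanted codes that are not installed, in the order given."""
    have = set(installed_languages(tessdata_dir))
    return [code for code in wanted if code not in have]


# In this notebook (path taken from the upload step above):
# print(missing_languages("/usr/share/tesseract-ocr/4.00/tessdata"))
```

If the list that comes back is not empty, the models it names were probably uploaded to a different folder.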

Now we are ready to get started! 

In [11]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [12]:
# Set up for tesseract
!sudo apt-get install tesseract-ocr
!pip install pytesseract==0.3.9

Reading package lists... Done
Building dependency tree       
Reading state information... Done
tesseract-ocr is already the newest version (4.00~git2288-10f4998a-2).
The following package was automatically installed and is no longer required:
  libnvidia-common-460
Use 'sudo apt autoremove' to remove it.
0 upgraded, 0 newly installed, 0 to remove and 20 not upgraded.
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


If you see a warning to "Restart runtime", click on RESTART RUNTIME.

In [13]:
# Import modules

import pytesseract
from pytesseract import Output
from PIL import Image
import cv2

In [14]:
# To check the location of pytesseract
#!pip show pytesseract

In [15]:
# List of available languages
print(pytesseract.get_languages(config=''))

['eng', 'osd']


The language codes used in this workshop are:

*   osd: Orientation and script detection module
*   ita: Italian
*   deu: German
*   fra: French
*   swe: Swedish
*   eng: English


Here you can find the list of all languages supported by Tesseract and the language codes to use:
https://tesseract-ocr.github.io/tessdoc/Data-Files-in-different-versions.html

To download other language models, go to https://github.com/tesseract-ocr/tessdata (download as a zip file).

In [16]:
# Read an input image (check left panel for the image name)
file = cv2.imread("/content/drive/My Drive/sample_images/test_eng.png")
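
`cv2.imread` returns `None` instead of raising an error when the path is wrong, so a typo in the file name only shows up later as a confusing OCR failure. A minimal sketch of a safer loader follows; `load_image_for_ocr` is a hypothetical helper name. It also converts OpenCV's BGR channel order to RGB, which is the order pytesseract assumes.

```python
import os


def load_image_for_ocr(path):
    """Read an image with OpenCV, fail loudly on problems, and return it as RGB."""
    if not os.path.isfile(path):
        raise FileNotFoundError(f"No image file at: {path}")

    # Imported here so the path check above runs even if OpenCV is not needed yet.
    import cv2

    image = cv2.imread(path)
    if image is None:
        # The file exists but OpenCV could not decode it.
        raise ValueError(f"Could not decode image: {path}")

    # OpenCV loads BGR; pytesseract assumes RGB.
    return cv2.cvtColor(image, cv2.COLOR_BGR2RGB)


# In this notebook:
# file = load_image_for_ocr("/content/drive/My Drive/sample_images/test_eng.png")
```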

In [17]:
# Configure parameters for pytesseract
custom_config = r'--oem 3 --psm 6'      # oem 3: default engine; psm 6: assume a single uniform block of text
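
The config string combines an OCR engine mode (`--oem`, 0 to 3) with a page segmentation mode (`--psm`, 0 to 13). The sketch below builds such a string and lists a few modes that may be worth trying on the sample images; `build_config` and `PSM_NOTES` are illustrative names, and the notes cover only a subset of the modes.

```python
# A few page segmentation modes, as described by `tesseract --help-psm`.
PSM_NOTES = {
    3: "Fully automatic page segmentation, but no OSD (Tesseract's default)",
    4: "Assume a single column of text of variable sizes",
    6: "Assume a single uniform block of text",
    7: "Treat the image as a single text line",
    11: "Sparse text: find as much text as possible in no particular order",
}


def build_config(oem=3, psm=6):
    """Return a Tesseract config string such as '--oem 3 --psm 6'."""
    if oem not in range(4):
        raise ValueError("oem must be between 0 and 3")
    if psm not in range(14):
        raise ValueError("psm must be between 0 and 13")
    return f"--oem {oem} --psm {psm}"


# Example: try a single-column layout instead of a uniform block.
# custom_config = build_config(psm=4)
```

Which mode works best depends on the page layout, so it is worth comparing the output of two or three modes on the same image.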

In [18]:
# Pytesseract OCR - text file output
result_txt = pytesseract.image_to_string(file, lang='eng', config=custom_config) 

If you see "KeyError: 'PNG'", go to the first cell and click RESTART RUNTIME.

Note: to run Tesseract for other languages, pass the language code in the "lang" argument. For example, lang='ita' for Italian.
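
Tesseract also accepts several languages at once, joined with `+` (for example `swe+eng`), which can help on pages that mix languages. The sketch below builds that string and checks it against the installed languages first; `combine_languages` is a hypothetical helper, and the `available` list is assumed to come from `pytesseract.get_languages()`.

```python
def combine_languages(codes, available):
    """Join language codes with '+' after checking that each one is installed."""
    missing = [code for code in codes if code not in available]
    if missing:
        raise ValueError(f"Language models not installed: {', '.join(missing)}")
    return "+".join(codes)


# In this notebook:
# available = pytesseract.get_languages(config='')
# lang = combine_languages(["swe", "eng"], available)
# text = pytesseract.image_to_string(file_sv, lang=lang, config=custom_config)
```

Checking first gives a clearer message than the error Tesseract raises when a `.traineddata` file is missing.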

In [19]:
print(result_txt)

MARIAN.
ACT I.

SCENE I.—A rural Scene; on the right band Sit
Henry Truman’s Park-Wall jujt appears,
with an iron Palifade—Gate balf open, and a

‘Stile near the Gate—-At the back of the Scene @
River ; beyond which is a Road winding up the
fide of a Hill—A fmall Houfe clofe,to the River,
with a Window to the Stage—Near the Houfe,
bending over the River, a Willow, to which the
Boat is faftened.—The Sun appears as juft rifen.

Party, Fanny, and Kitry appear, walking up -
to the Boatman’s Houfe, with Bajfkets of Fruit and
‘Flowers on their arms, as for the Market—
Tuomas and Witutiam following.

Patty. -
‘THY, Robin! Robin! boatman! He’s not
awake yet, as I live; though he know’d
"we fhou’d want to be ferry’d over early this morn-
ing.—Call him, Thomas.
(They all go up to the window.)
, AQ Sone
, -



In [20]:
# Swedish: Read an input image (check left panel for the image name)
file_sv = cv2.imread("/content/drive/My Drive/sample_images/test_swe1.jpg")

In [21]:
# Pytesseract OCR - text file output
result_txt_sv = pytesseract.image_to_string(file_sv, lang='swe', config=custom_config) 

In [22]:
print(result_txt_sv)

Jubileumsfest:
För att ge avdelningarna möjlighet fira förbundets
50 åriga tillvaro anslog kongressen medel härför. ; d
Vär avd. högtidlighöll minnet med en fest på Gillet
den 8 deéc. Jubileumstalet höls av förbundsordf.
Ivan Larsson samt medverkade, bland annat
Uppsala Arbetaresångkör.
Representation:
Avd. har genom valda ombud eller styrelsen, varit repre-
senterad å: Folkets Hus årsmöte,Folkets Parks årsmöte
och extra möte,. Kvinnokommittén för ökad kvinnorepre- |
sentations möte, vid avd. 68 kassör Britta Jonssons
jordfästning, vid Ungdomsyrkeskonferensen i Uppsala,
vid Folk och Försvarskonferensen i Uppsala, vid avd.68
jubileumsfest, vid avtalskonferensen rörande avä.funk-
, tionärerna , vid Uppsala Länskommitté för ,Finland: samman-
träden, samt vid Uppsala P. C. O, =s möten.
AE
Avä. har beslutat att ansluta sig till2 Uppsalavdelningenav
Arbetarnas Bilningsförbund, från den 1_jan. ESA40R
Studiearbetet:
Studiearbetet har, efter en intensifierad propaganda
lämnat rätt gott resultat

In [23]:
# To create a searchable pdf, uncomment this block
#result_pdf = pytesseract.image_to_pdf_or_hocr(file, lang='eng', config=custom_config) 
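
`image_to_pdf_or_hocr` returns the PDF as bytes, so it still has to be written to a file before it can be opened. A minimal sketch, where `write_pdf` is a hypothetical helper and the output path is only an example:

```python
from pathlib import Path


def write_pdf(pdf_bytes, out_path):
    """Write PDF bytes to out_path and return the path."""
    # PDF files start with the marker "%PDF"; anything else is probably hOCR or an error.
    if not pdf_bytes.startswith(b"%PDF"):
        raise ValueError("The data does not look like a PDF")
    out = Path(out_path)
    out.write_bytes(pdf_bytes)  # binary write, since the PDF is bytes and not text
    return out


# In this notebook:
# result_pdf = pytesseract.image_to_pdf_or_hocr(
#     file, lang='eng', config=custom_config, extension='pdf')
# write_pdf(result_pdf, "/content/test_eng.pdf")  # example path
```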

Insights on the data:

*   There are three Swedish documents from the 19th century, written on an old typewriter.
*   test_swe1.jpg is a good quality image.
*   test_swe2.jpg is a poor quality image, and OCR will be challenging.
*   test_swe3.jpg contains a photograph along with the text. Unfortunately, Tesseract does not perform well on text mixed with photographs. Does cropping the text before running Tesseract improve the results?
*   test_deu.png comes from the OCR4all project and was written in German between the 16th and 18th centuries.
*   test_eng.png is a sample from the play "Marian", printed in an older form of English. Note that the long s ("ſ") is often recognized as "f", as in "Houfe" for "House".
*   test_fra.jpg is challenging because the scan also includes the book edges. Try the cropped version (test_fra_cropped.jpg) and check whether the OCR results are better.
*   test_ita.png was obtained from an old Italian book online. Also try the cropped version (test_ita_cropped.png) and check whether the OCR results are better.
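
Images read with `cv2.imread` are NumPy arrays, so cropping is a slice of the form `image[top:bottom, left:right]`. The sketch below only computes a safe crop box that stays inside the image; `clamp_crop_box` is an illustrative helper, and the pixel values in the usage comment are made up and would have to be read off the actual image.

```python
def clamp_crop_box(left, top, width, height, image_width, image_height):
    """Return (x0, y0, x1, y1) for the requested box, clipped to the image bounds."""
    x0 = max(0, left)
    y0 = max(0, top)
    x1 = min(image_width, left + width)
    y1 = min(image_height, top + height)
    if x0 >= x1 or y0 >= y1:
        raise ValueError("The crop box lies outside the image")
    return x0, y0, x1, y1


# In this notebook (the box values are placeholders):
# file_sv3 = cv2.imread("/content/drive/My Drive/sample_images/test_swe3.jpg")
# image_height, image_width = file_sv3.shape[:2]
# x0, y0, x1, y1 = clamp_crop_box(50, 400, 900, 600, image_width, image_height)
# text_only = file_sv3[y0:y1, x0:x1]   # rows (y) first, then columns (x)
```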


---

Did you notice that the OCR output has errors and needs post-processing?

This is a common challenge with heritage data. Thanks to AI, there is now a way to reduce these errors: layout-based OCR!

You can also explore deep-learning-based layout detection, which is performed before OCR and can make the results much more accurate.
Here's the link to the Layout Parser: https://github.com/Layout-Parser/layout-parser
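
A lighter first step for post-processing, which does not use Layout Parser, is to ask Tesseract how confident it is about each word and review the weakest ones by hand. `pytesseract.image_to_data` with `output_type=Output.DICT` returns parallel lists that include `text` and `conf`. The sketch below filters them; `low_confidence_words` is a hypothetical helper, and the threshold of 60 is an arbitrary starting point.

```python
def low_confidence_words(data, threshold=60.0):
    """Return (word, confidence) pairs whose confidence is below the threshold."""
    flagged = []
    for text, conf in zip(data["text"], data["conf"]):
        score = float(conf)  # conf may be a string or a number depending on the version
        # Entries with conf -1 are layout boxes (page, block, line), not words.
        if text.strip() and 0 <= score < threshold:
            flagged.append((text, score))
    return flagged


# In this notebook:
# data = pytesseract.image_to_data(
#     file, lang='eng', config=custom_config, output_type=Output.DICT)
# print(low_confidence_words(data))
```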



Thank you!