Welcome to the workshop on Optical Character Recognition (OCR)

Before getting started:

*   Go to File --> Save a copy in Drive. The copied notebook should open in a new tab.
*   Alternatively, go to your Google Drive, find the folder "Colab Notebooks", and open the notebook from there.

Next, download:

*   Sample images: https://uppsala.box.com/s/qzvg9741dx915a0atc6mydo8qxgf9erc
*   Language models for Swedish (swe), French (fra), Italian (ita) and German (deu): https://uppsala.box.com/s/ovcpsdzj2dlomyghtw7vk50ilvxs3uix

In the left panel, click the folder icon ("Files"), then click the upload icon (upward arrow), select the images from the downloaded folder, and press "Upload". If a message dialog pops up, click OK.


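Once again in the left panel, click the upward arrow icon and go to /usr/share/tesseract-ocr/4.00/tessdata. Click the three dots next to the folder, choose "Upload", and select the 4 language models to upload. This folder exists only after Tesseract has been installed, so if you cannot find it, run the set-up cell below first and then return to this step.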

Close (X)

Now we are ready to get started! 

In [None]:
# Set up for tesseract
!sudo apt-get install tesseract-ocr
!pip install pytesseract==0.3.9

Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following package was automatically installed and is no longer required:
  libnvidia-common-460
Use 'sudo apt autoremove' to remove it.
The following additional packages will be installed:
  tesseract-ocr-eng tesseract-ocr-osd
The following NEW packages will be installed:
  tesseract-ocr tesseract-ocr-eng tesseract-ocr-osd
0 upgraded, 3 newly installed, 0 to remove and 12 not upgraded.
Need to get 4,795 kB of archives.
After this operation, 15.8 MB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu bionic/universe amd64 tesseract-ocr-eng all 4.00~git24-0e00fe6-1.2 [1,588 kB]
Get:2 http://archive.ubuntu.com/ubuntu bionic/universe amd64 tesseract-ocr-osd all 4.00~git24-0e00fe6-1.2 [2,989 kB]
Get:3 http://archive.ubuntu.com/ubuntu bionic/universe amd64 tesseract-ocr amd64 4.00~git2288-10f4998a-2 [218 kB]
Fetched 4,795 kB in 2s (2,741 kB/s)
debconf: unable to initi

If you see a warning to "Restart runtime", click on RESTART RUNTIME.

In [None]:
# Import modules

import pytesseract
from pytesseract import Output
from PIL import Image
import cv2

In [None]:
# To check the location of pytesseract
#!pip show pytesseract

In [None]:
# List of available languages
print(pytesseract.get_languages(config=''))

['osd', 'eng']


Some examples of language codes:

*   osd: Orientation and script detection module
*   ita: Italian
*   deu: German
*   fra: French
*   swe: Swedish
*   eng: English


Here you can find the list of all languages supported by tesseract and the language codes to use:
https://tesseract-ocr.github.io/tessdoc/Data-Files-in-different-versions.html

To download other language models, go to https://github.com/tesseract-ocr/tessdata (download as zip file).
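
After uploading the models, it can help to confirm that Tesseract actually sees them. The cell below is a minimal sketch and not part of the original notebook: `missing_languages` is a hypothetical helper, and the fallback list is simply the output printed above, used so the cell still runs where Tesseract is unavailable.

```python
def missing_languages(required, available):
    """Return the codes in `required` that are not in `available`, in order."""
    available_set = set(available)
    return [code for code in required if code not in available_set]


# Codes of the four uploaded models plus the default English model.
required = ["swe", "fra", "ita", "deu", "eng"]

try:
    import pytesseract
    available = pytesseract.get_languages(config="")
except Exception:
    # Assumption: fall back to the list shown earlier if pytesseract or the
    # Tesseract binary cannot be found in this environment.
    available = ["osd", "eng"]

print("Missing language models:", missing_languages(required, available))
```

If the printed list is not empty, the corresponding `.traineddata` files are probably not in the tessdata folder yet.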

In [None]:
# Read an input image (check left panel for the image name)
file = cv2.imread("/content/test_swe1.jpg")
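
Note that `cv2.imread` does not raise an error when the path is wrong; it returns `None`, and the problem only shows up later. As an alternative to checking the left panel, the sketch below lists the uploaded images from code. `list_images` is a hypothetical helper, and the `/content` folder is assumed to be the Colab upload location used in this notebook.

```python
from pathlib import Path


def list_images(folder, extensions=(".jpg", ".jpeg", ".png")):
    """Return the sorted names of image files directly inside `folder`."""
    folder = Path(folder)
    if not folder.is_dir():
        # A missing folder simply yields an empty list.
        return []
    return sorted(
        p.name
        for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() in extensions
    )


# Assumption: images were uploaded to /content, as in the cell above.
print(list_images("/content"))
```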

In [None]:
# Configure parameters for pytesseract
# oem 3: default OCR engine mode; psm 6: assume a single uniform block of text
custom_config = r'--oem 3 --psm 6'
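
The two flags select the OCR engine mode (`--oem`, values 0 to 3) and the page segmentation mode (`--psm`, values 0 to 13). The sketch below builds the same config string from numbers, which makes it easier to experiment with other modes. `build_config` and `PSM_EXAMPLES` are hypothetical names, and the dictionary lists only a few of the modes described in the Tesseract documentation.

```python
# A few page segmentation modes (not the complete list).
PSM_EXAMPLES = {
    3: "Fully automatic page segmentation, but no OSD (Tesseract's default)",
    4: "Assume a single column of text of variable sizes",
    6: "Assume a single uniform block of text",
    7: "Treat the image as a single text line",
    11: "Sparse text: find as much text as possible in no particular order",
}


def build_config(oem=3, psm=6):
    """Build a Tesseract config string such as '--oem 3 --psm 6'."""
    if not 0 <= oem <= 3:
        raise ValueError("oem must be between 0 and 3")
    if not 0 <= psm <= 13:
        raise ValueError("psm must be between 0 and 13")
    return f"--oem {oem} --psm {psm}"


print(build_config())        # same string as custom_config above
print(build_config(psm=4))   # try this for single-column pages
```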

In [None]:
# Pytesseract OCR - text file output
result_txt = pytesseract.image_to_string(file, lang='swe', config=custom_config) 

If you see "KeyError: 'PNG'", go to the first cell and RESTART RUNTIME.

Note: to run tesseract for other languages, pass the language code in "lang". For example, lang = 'ita' for Italian. 
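
Besides plain text, pytesseract can return per-word confidence values through `pytesseract.image_to_data(..., output_type=Output.DICT)`, which is what the imported `Output` is for. The sketch below shows one way such a dictionary could be filtered for doubtful words. `low_confidence_words` is a hypothetical helper, the threshold of 60 is an arbitrary assumption, and the sample dictionary is hand-made with invented confidence values, not real Tesseract output.

```python
def low_confidence_words(data, threshold=60.0):
    """Return (word, confidence) pairs whose confidence is below `threshold`.

    `data` is expected to have parallel 'text' and 'conf' lists, as in the
    dictionary returned by image_to_data with Output.DICT. Entries with a
    confidence of -1 (layout elements without a word) are skipped.
    """
    flagged = []
    for word, conf in zip(data["text"], data["conf"]):
        conf = float(conf)  # conf may arrive as int, float, or string
        if word.strip() and 0 <= conf < threshold:
            flagged.append((word, conf))
    return flagged


# Hand-made stand-in with invented confidence values (illustration only).
sample = {
    "text": ["", "Jubileumsfest:", "deéc.", "ESA40R"],
    "conf": [-1, 96, 41, 12],
}
print(low_confidence_words(sample))

# In the notebook, the real dictionary could be obtained with:
# data = pytesseract.image_to_data(file, lang='swe', config=custom_config,
#                                  output_type=Output.DICT)
```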

In [None]:
print(result_txt)

Jubileumsfest:
För att ge avdelningarna möjlighet fira förbundets
50 åriga tillvaro anslog kongressen medel härför. ; d
Vär avd. högtidlighöll minnet med en fest på Gillet
den 8 deéc. Jubileumstalet höls av förbundsordf.
Ivan Larsson samt medverkade, bland annat
Uppsala Arbetaresångkör.
Representation:
Avd. har genom valda ombud eller styrelsen, varit repre-
senterad å: Folkets Hus årsmöte,Folkets Parks årsmöte
och extra möte,. Kvinnokommittén för ökad kvinnorepre- |
sentations möte, vid avd. 68 kassör Britta Jonssons
jordfästning, vid Ungdomsyrkeskonferensen i Uppsala,
vid Folk och Försvarskonferensen i Uppsala, vid avd.68
jubileumsfest, vid avtalskonferensen rörande avä.funk-
, tionärerna , vid Uppsala Länskommitté för ,Finland: samman-
träden, samt vid Uppsala P. C. O, =s möten.
AE
Avä. har beslutat att ansluta sig till2 Uppsalavdelningenav
Arbetarnas Bilningsförbund, från den 1_jan. ESA40R
Studiearbetet:
Studiearbetet har, efter en intensifierad propaganda
lämnat rätt gott resultat

In [None]:
# To create a searchable pdf, uncomment this block
#result_pdf = pytesseract.image_to_pdf_or_hocr(file, lang='swe', config=custom_config)
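
`image_to_pdf_or_hocr` returns the PDF as bytes, so it still has to be written to a file in binary mode before it can be downloaded from the left panel. A minimal sketch follows; `save_bytes` and the file name are hypothetical, and the placeholder bytes stand in for `result_pdf` so the cell runs on its own (they are not a valid PDF).

```python
def save_bytes(data, path):
    """Write `data` (bytes) to `path` in binary mode and return the path."""
    with open(path, "wb") as f:
        f.write(data)
    return path


# Placeholder standing in for result_pdf (illustration only).
placeholder = b"%PDF-placeholder"
print(save_bytes(placeholder, "placeholder.pdf"))

# In the notebook, after uncommenting the line above:
# save_bytes(result_pdf, "/content/test_swe1.pdf")
```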

Insights on the data:

*   There are 3 Swedish documents from the 19th century written using an old typewriter.
*   test_swe1.jpg is a good quality image.
*   test_swe2.jpg is a poor quality image and OCR will be challenging.
*   test_swe3.jpg contains a photograph alongside the text. Unfortunately, Tesseract does not perform well on text mixed with photographs. Does cropping the text before running Tesseract improve the results?
*   test_deu.png is obtained from the OCR4ALL project and was written in German between the 16th and 18th centuries.
*   test_eng.png represents a sample text from Marian’s play written in Old English.
*   test_fra.jpg is challenging because the scan also includes the book edges. Try the cropped version (test_fra_cropped.jpg) and observe whether the OCR results are better.
*   test_ita.png was obtained from an old Italian book online. Also try the cropped version (test_ita_cropped.png) and observe whether the OCR results are better.
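
The cropping suggested above can also be done in code rather than in an image editor. The sketch below trims a fraction of the image from each side. `crop_box` and `crop_image` are hypothetical helpers, `crop_image` assumes a PIL image (for example from `Image.open`), and the margin values are made up and would need tuning per image.

```python
def crop_box(width, height, margins):
    """Turn fractional margins (left, top, right, bottom) into a pixel box.

    The result has the (left, upper, right, lower) form that PIL's
    Image.crop expects.
    """
    left, top, right, bottom = margins
    if min(margins) < 0 or left + right >= 1 or top + bottom >= 1:
        raise ValueError("margins must be non-negative and leave some image")
    return (
        round(width * left),
        round(height * top),
        round(width * (1 - right)),
        round(height * (1 - bottom)),
    )


def crop_image(img, margins):
    """Crop a PIL image by the given fractional margins."""
    return img.crop(crop_box(img.width, img.height, margins))


# Example: trim 10% from the left and right and 25% from the bottom.
print(crop_box(1000, 800, (0.1, 0.0, 0.1, 0.25)))

# In the notebook (margins are illustrative guesses):
# img = Image.open("/content/test_swe3.jpg")
# text = pytesseract.image_to_string(crop_image(img, (0.0, 0.4, 0.0, 0.0)),
#                                    lang='swe', config=custom_config)
```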


---

Did you notice that the OCR output has errors and needs post-processing?

This is a common challenge with heritage data and, thanks to AI, we now have a solution: layout-based OCR!

In the follow-up workshop, we will explore Deep Learning based Layout Detection, which is performed before OCR and makes the results much more accurate.
Here's the link to the Layout Parser: https://github.com/Layout-Parser/layout-parser
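
Until then, simple rule-based post-processing can already repair some of the errors seen in the Swedish output, such as words split by a hyphen at a line break ("repre-" / "senterad"). The sketch below is one possible approach, not a complete solution: `join_hyphenated_lines` is a hypothetical helper, and it will also merge genuine hyphenated compounds that happen to break at the end of a line.

```python
import re


def join_hyphenated_lines(text):
    """Join words that were split by a hyphen at the end of a line."""
    # A word character, a hyphen, optional spaces, a newline, optional
    # spaces, then another word character: drop everything in between.
    return re.sub(r"(\w)-[ \t]*\n[ \t]*(\w)", r"\1\2", text)


# Two lines taken from the OCR output above.
sample = (
    "Avd. har genom valda ombud eller styrelsen, varit repre-\n"
    "senterad å: Folkets Hus årsmöte,Folkets Parks årsmöte"
)
print(join_hyphenated_lines(sample))
```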



Thank you!