<a href="https://colab.research.google.com/github/gilbh/Applied_Digital_Research_in_SA_Langs/blob/main/2024_Summer_TAU/Materials_and_Homework/Lesson_02/Sanskrit_OCR.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Welcome!

In this notebook, we will practice:  

1) Performing OCR with Tesseract using Python code. The OCR is performed on an image file (showing a Sanskrit text) that you will upload.  

2) Calculating the character error rate (CER) of the Tesseract-produced text, and of two other pre-prepared digitizations of the same Sanskrit text (Transkribus and Google Vision), against a manually typed version. You will upload the pre-prepared texts as .txt files.


First, we upload the required files: the scan image and the manually typed text file. The Transkribus and Google Vision text files are optional.

Click on the "Files" icon on the left bar, then click the "Upload" icon and select the files.

Next, we tell the notebook the filenames:

In [None]:
# Here we provide information about the files we will use in this notebook:

# Enter the filename of the Sanskrit image you uploaded ("Files" icon on the left bar)
# It should end with '.png', '.jpg', or '.jpeg':
img_filename = 'PASTE HERE THE Copy path FROM THE THREE DOTS MENU OF THE IMAGE FILE'

# Likewise, enter below the three filenames of three text files you uploaded: manually-typed, Transkribus, and Google Vision
# (for any text file you did not upload, leave the string empty: ''):
manual_filename = 'PASTE HERE THE Copy path FROM THE THREE DOTS MENU OF THE MANUALLY-TYPED TEXT FILE'
tkbs_filename = ''
google_vision_filename = ''
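
Before moving on, it can help to confirm that the pasted paths point to real files. The sketch below is not part of the original notebook: `check_upload` is a hypothetical helper name, and the list of allowed image extensions is an assumption taken from the comment in the cell above.

```python
import os

# Assumed list of image extensions, based on the comment in the cell above.
IMAGE_EXTENSIONS = ('.png', '.jpg', '.jpeg')

# Hypothetical helper: reports whether a pasted path looks usable.
def check_upload(path, allowed_extensions=None):
    """Return a short status string for one uploaded file path."""
    if path == '':
        return 'skipped (empty filename)'
    if not os.path.isfile(path):
        return 'missing: ' + path
    if allowed_extensions and not path.lower().endswith(tuple(allowed_extensions)):
        return 'unexpected extension: ' + path
    return 'ok: ' + path
```

In the notebook you could then run, for example, `print(check_upload(img_filename, IMAGE_EXTENSIONS))` and `print(check_upload(manual_filename))` to catch a mistyped path early.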

Now we run all the installations required for processing OCR and related tasks:

In [None]:
# Install Tesseract, pytesseract, and the Sanskrit training data in your Google Colab environment.
# A leading exclamation mark runs the line as a shell command.
!apt-get install -y tesseract-ocr
!apt-get install -y libtesseract-dev
!pip install pytesseract
# Install jiwer to compute WER and CER
!pip install jiwer
from jiwer import wer, cer, mer, wil
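# Download the Tesseract trained data for Sanskrit and move it into the tessdata directory
# (the path below assumes a Tesseract 4.x layout; adjust it if your installation differs)
!wget https://github.com/tesseract-ocr/tessdata/raw/main/san.traineddata
!mv san.traineddata /usr/share/tesseract-ocr/4.00/tessdata/san.traineddata
# Install the indic_transliteration package, whose sanscript module converts between scripts
# (here, from Devanagari to a Roman transliteration)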
!pip install indic_transliteration
from indic_transliteration import sanscript
from indic_transliteration.sanscript import transliterate

Load and display the image file:

In [None]:
# Import the Image module from the Pillow Library, which will help us access the image.
from PIL import Image
from IPython.display import display
# Open an image file
img = Image.open(img_filename)

# Display the image
display(img)

Perform OCR on this image and display the results:

In [None]:
from PIL import Image

# Import the pytesseract library, which will run the OCR process.
import pytesseract
# Tell pytesseract where the Tesseract executable is installed.
pytesseract.pytesseract.tesseract_cmd = '/usr/bin/tesseract'
# Open the image file, convert the text in the image to computer-readable text (OCR)
# using the Sanskrit model ("san"), and then print the results for us to see here.
ocr_text_devanagari = pytesseract.image_to_string(Image.open(img_filename), lang="san")
print(ocr_text_devanagari)
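
Raw OCR output often carries trailing spaces, blank lines, and, in some Tesseract versions, a form-feed character at the end of the page. If you want to tidy the text before comparing it, something like the following sketch could be used. `tidy_ocr_output` is a hypothetical helper, not part of pytesseract, and the sample string is made up for illustration. Keep in mind that any cleaning changes the text and therefore the CER.

```python
# Hypothetical helper, not part of pytesseract.
def tidy_ocr_output(text):
    """Strip form feeds, trailing spaces, and empty lines from OCR text."""
    text = text.replace('\f', '')                    # page-break marker, if present
    lines = [line.rstrip() for line in text.splitlines()]
    lines = [line for line in lines if line != '']   # drop blank lines
    return '\n'.join(lines)

# Made-up sample standing in for ocr_text_devanagari
sample = 'धर्मक्षेत्रे कुरुक्षेत्रे  \n\nसमवेता युयुत्सवः\n\f'
print(tidy_ocr_output(sample))
```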

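Convert the script from Devanagari to Roman transliteration. The code below uses the `sanscript.KOLKATA` scheme, which is close to IAST but writes the long vowels e and o as ē and ō; if your reference text is in strict IAST, `sanscript.IAST` may be the better choice: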

In [None]:
ocr_text = transliterate(ocr_text_devanagari, sanscript.DEVANAGARI, sanscript.KOLKATA)
print(ocr_text)


Load the three pre-prepared text files:

In [None]:
# Read the text from a .txt file and return it as a variable:
def load_text_file(filename):
  if filename != '':
    with open(filename, "r", encoding="utf-8") as file:
      text = file.read()
    print(filename + " loaded successfully")
  else:
    text = ""
  return text


manual_text = load_text_file(manual_filename)
tkbs_text = load_text_file(tkbs_filename)
google_vision_text = load_text_file(google_vision_filename)
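
Romanized Sanskrit diacritics can be stored either as single precomposed characters (for example `ā`) or as a base letter followed by a combining mark. The two look identical on screen but count as different characters, which may inflate the CER when the files come from different tools. Below is a minimal sketch of one way to guard against this with Python's standard `unicodedata` module; `normalize_text` is a hypothetical helper, not part of the original notebook.

```python
import unicodedata

# Hypothetical helper, not part of the original notebook.
def normalize_text(text):
    """Return text in NFC form without a leading byte-order mark."""
    text = text.lstrip('\ufeff')              # BOM sometimes left by text editors
    return unicodedata.normalize('NFC', text)

precomposed = 'r\u0101ma'      # 'rāma' with a single character for ā
decomposed = 'ra\u0304ma'      # 'rāma' as 'a' + combining macron
print(precomposed == decomposed)                                   # False
print(normalize_text(precomposed) == normalize_text(decomposed))   # True
```

If you adopt this, apply it to every text before the comparison (for example `manual_text = normalize_text(manual_text)`), so that all versions are treated the same way.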


Now, having the four different text versions in hand (Tesseract, manually typed, Transkribus, and Google Vision), we can run CER comparisons between the manually typed text, which serves as the reference, and each of the other three.

In [None]:
# Compute and print the CER of compared_text, using manual_text as the reference.
# Empty texts (files that were not uploaded) are skipped.
def calculate_cer(compared_text, caption):
  if compared_text != '':
    character_error_rate = cer(manual_text, compared_text)
    print(f"CER for {caption}: {character_error_rate * 100:.2f}%")

# not used: wer (word error rate), mer (match error rate), wil (word information lost)

calculate_cer(ocr_text, 'Tesseract')
calculate_cer(tkbs_text, 'Transkribus')
calculate_cer(google_vision_text, 'Google Vision')

