# OCR
To do any sort of content analysis or enrichment, we need to be able to read said content. That's what optical character recognition (OCR) is for. Given an image, a PDF or even a PDF with embedded OCR, how do we get to the content?

## Install Tesseract
Tesseract is the state-of-the-art open-source OCR engine. It is quite [easy to install](https://tesseract-ocr.github.io/tessdoc/Installation.html) on most systems; only the Windows installation can be tricky.
### Windows
1. Download the Tesseract installer (.exe): https://github.com/tesseract-ocr/tesseract/releases/download/5.5.0/tesseract-ocr-w64-setup-5.5.0.20241111.exe
2. Click on the file in your Downloads folder and follow the installation wizard's instructions. The only thing you might want to change is to select additional languages when prompted for the language packages you need. **IMPORTANT** Write down where Tesseract is being installed! The path is shown in the step "Zielverzeichnis wählen" ("Choose destination folder"; the label depends on your installer language), and for me it looks like:
```shell
C:\Users\USERNAME\AppData\Local\Programs\Tesseract-OCR
```
Add this directory to your system `PATH`.

3. Let's say you did not add all the language packs you need during installation. A list of the languages Tesseract offers can be found [here](https://github.com/tesseract-ocr/tessdata), and the German package can be downloaded [here](https://github.com/tesseract-ocr/tessdata/blob/main/deu.traineddata).

4. Now all you need to do is move the file "deu.traineddata" into  
```shell
C:\Users\USERNAME\AppData\Local\Programs\Tesseract-OCR\tessdata
```
where eng.traineddata already is!

**NOTE** Depending on how Tesseract was installed, Python might not be able to find it. You can either add the installation directory to your system `PATH` (as described above) or, whenever you use pytesseract, add the line
```
pytesseract.pytesseract.tesseract_cmd = r'C:\Users\USERNAME\AppData\Local\Programs\Tesseract-OCR\tesseract.exe'
```
after `import pytesseract`. This tells pytesseract where to find the Tesseract executable.
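
A quick way to check which of the two options applies on your machine is to ask Python whether it can see the executable. The following is a minimal sketch using only the standard library; the helper name `find_tesseract` and the fallback location are assumptions for illustration, so adjust the path to wherever you installed Tesseract.

```python
import shutil
from pathlib import Path


def find_tesseract(fallback=None):
    """Return the path to the Tesseract executable, or None if it cannot be found."""
    # shutil.which searches the directories listed in PATH, like the shell does
    on_path = shutil.which("tesseract")
    if on_path is not None:
        return on_path
    # Otherwise try an explicit install location (hypothetical; adjust to your machine)
    if fallback is not None and Path(fallback).is_file():
        return str(fallback)
    return None


tesseract_path = find_tesseract(
    fallback=r"C:\Users\USERNAME\AppData\Local\Programs\Tesseract-OCR\tesseract.exe"
)
print(tesseract_path or "Tesseract not found: check PATH or the fallback location")
```

If a path is printed, it can be assigned to `pytesseract.pytesseract.tesseract_cmd` as shown above.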

### Multi-Language support
OCR is at heart a visual problem: it reads the pixels and tries to reconstruct the words from them. So why does Tesseract need separate language packs? Modern OCR tools also use context and dictionaries to improve their results. This lets them catch non-words such as "tne", which is almost certainly meant to be "the". If the language is English, "fur" is likely a misread "for", whereas in German the correction might be "für". On the other hand, "fur" is itself a valid English word, which is where context comes in. Together, these techniques have led to large performance improvements over the years.

In [None]:
import pytesseract #https://pypi.org/project/pytesseract/
pytesseract.pytesseract.tesseract_cmd = r'C:\Users\USERNAME\AppData\Local\Programs\Tesseract-OCR\tesseract.exe'
from jiwer import cer #https://pypi.org/project/jiwer/
import os
from PIL import Image
from nltk.metrics.distance import _edit_dist_step, _edit_dist_init, _last_left_t_init
import re
import fitz  # PyMuPDF: https://pypi.org/project/PyMuPDF/

## PDF **with embedded text**
This might seem misplaced in the OCR notebook, but it's important to not only be able to *create* OCR for files but also be able to *extract* it from files which already have it!

We show this on the example of PDFs from the [E-Periodica](https://www.e-periodica.ch/) archive.

In [None]:
filePath_list = ['data/grs-001_1921_13__298_d.pdf', 'data/grs-001_1921_13__393_d.pdf',
                 'data/grs-001_1922_14__563_d.pdf', 'data/grs-001_1923_15__447_d.pdf']
for filePath in filePath_list:

    pdf_doc = fitz.open(filePath)
    pages = pdf_doc[1:]  # skip the first page, which only holds the document's metadata
    pdf_text = "\n".join(page.get_text() for page in pages)  # join pages with newlines so words split across pages can be put back together
    pdf_text = re.sub("¬\n", "", pdf_text)  # remove the line-break hyphen "¬" together with the newline that follows it
    pdf_text = pdf_text.strip()
    pdf_text = re.sub("\n", " ", pdf_text)  # replace newlines with spaces
    pdf_text = re.sub(r'\s+', " ", pdf_text)  # collapse any run of whitespace into a single space
    pdf_text = re.sub(r'\\', "", pdf_text)  # remove all backslashes

    #save the pure text in a new file, we'll re-use this in the embedding_data notebook
    with open(filePath.replace(".pdf",".txt").replace("data/", "data/embedding_data/"), "w") as f:
        f.write(pdf_text)
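
Since the same clean-up steps are needed again for the Tesseract output further down, they could also be collected in one helper. This is only a sketch: the function name `normalize_ocr_text` and the sample string are made up for illustration, and the steps mirror the regular expressions in the cell above.

```python
import re


def normalize_ocr_text(text):
    """Join words split across lines and collapse whitespace."""
    text = re.sub(r"¬\n", "", text)    # drop the line-break hyphen together with the newline
    text = re.sub(r"\s+", " ", text)   # newlines, tabs and repeated spaces become one space
    text = text.replace("\\", "")      # remove stray backslashes
    return text.strip()


sample = "Die Ver¬\nsammlung   beschloss\neinstimmig.\n"  # made-up example text
print(normalize_ocr_text(sample))  # Die Versammlung beschloss einstimmig.
```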

### Pytesseract
If you have a file without embedded text, you need to run OCR on it first. Tesseract is currently the state of the art among open-source tools (and can even hold its own against proprietary solutions), so that's what we use here.

### PDF
Note that this PDF has embedded text, but we throw it away and run pytesseract on it for demonstration purposes.

In [None]:
filePath = 'data/grs-001_1921_13__298_d.pdf'
doc = fitz.open(filePath)

This time we print the data page by page as opposed to joining them so we can examine it.

In [None]:
for page_number in range(doc.page_count):
    page = doc.load_page(page_number)
    pix = page.get_pixmap()
    pix.save(f"aux-{page_number}.jpeg", "jpeg") # this is to avoid having to use pdf2image which is a nightmare for Windows
    txt = pytesseract.image_to_string(f"aux-{page_number}.jpeg", lang="deu")
    print(f"Page # {page_number} - {txt}")
    os.remove(f"aux-{page_number}.jpeg")

### Image
Images (usually) don't have embedded OCR, so here we don't have to throw anything away and simply run pytesseract on a jpg.

In [None]:
ocr_image_example_filepaths = ["data/ocr_data/grs-001_1921_013_0051.jpg", "data/ocr_data/grs-001_1921_013_0052.jpg"]
ocr_image_example_text = ""
for path in ocr_image_example_filepaths:
    text = pytesseract.image_to_string(Image.open(path), lang='deu')
    text = re.sub(r"¬\s+", "", text)  # remove the line-break hyphen "¬" together with any whitespace that follows it
    text = text.strip()
    text = re.sub("\n", " ", text)  # replace newlines with spaces
    text = re.sub(r'\s+', " ", text)  # collapse any run of whitespace into a single space
    text = re.sub(r'\\', "", text)  # remove all backslashes
    ocr_image_example_text += text

Let's see what it says:

In [None]:
ocr_image_example_text

Looks good! But how do we know whether the OCR is actually accurate? One evaluation metric is the so-called "character error rate" (CER). We don't have manually verified "ground truth" (GT) to compare against, so we'll treat the E-Periodica OCR as GT and compare the Tesseract result to it.

## Evaluation

Luckily we already extracted just the text from the PDF files on E-Periodica, so we just open them back up here:

In [None]:
with open("data/embedding_data/grs-001_1921_13__298_d.txt", "r") as f:
    pdf_doc = f.read()

In [None]:
error = cer(pdf_doc, ocr_image_example_text)
error
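
For intuition about what this number means: the character error rate is commonly defined as the character-level edit distance between reference and hypothesis, divided by the number of characters in the reference. The sketch below is a plain-Python illustration of that definition, not jiwer's actual implementation, and its function names are made up.

```python
def levenshtein(reference, hypothesis):
    """Minimum number of single-character insertions, deletions and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            current.append(min(
                previous[j] + 1,                            # delete ref_char
                current[j - 1] + 1,                         # insert hyp_char
                previous[j - 1] + (ref_char != hyp_char),   # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance divided by the number of characters in the reference."""
    if not reference:
        raise ValueError("reference must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


# One wrong character out of seven
print(character_error_rate("Chancen", "Ohancen"))
```

Because the denominator is the length of the reference, the order of the two arguments matters.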

A character error rate of 2% is already very low, but as you may have noticed while examining our examples, we can lower it even further. When a word is split across a line break, Tesseract outputs a regular hyphen "-" rather than the "¬" used in the E-Periodica text, so our pre-processing does not catch it. Let's change that:

In [None]:
ocr_image_example_text_fixed_bindestrich = ""
for path in ocr_image_example_filepaths:
    text = pytesseract.image_to_string(Image.open(path), lang='deu')
    text = re.sub(r"(?<=\w)-\n", "", text)  # this is the line we changed
    text = text.strip()
    text = re.sub("\n", " ", text)
    text = re.sub(r'\s+', " ", text)
    text = re.sub(r'\\', "", text)
    ocr_image_example_text_fixed_bindestrich += text

Note that we did not simply replace "¬" with "-" in the pattern, since the two are not used in the same way and there may be other, valid occurrences of "-" in the text. Instead, we check that the "-" is preceded by a word character and immediately followed by a line break, so that words such as "E-Mail" are left alone.
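
The behaviour of the pattern can be checked on a small, made-up sentence that contains all three kinds of hyphen:

```python
import re

# Same pattern as in the cell above: a hyphen that follows a word character
# and is immediately followed by a line break.
line_break_hyphen = re.compile(r"(?<=\w)-\n")

sample = "Die Ab-\nstimmung wurde per E-Mail\nangekündigt - endlich."
print(line_break_hyphen.sub("", sample))
```

Only "Ab-\nstimmung" is joined; "E-Mail" and the free-standing " - " are left untouched. One limitation to keep in mind: a genuine hyphen that happens to sit at the end of a line (for example "Nord-" followed by "und Südseite" on the next line) would be joined as well.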

In [None]:
error = cer(pdf_doc, ocr_image_example_text_fixed_bindestrich)
error

In [None]:
with open("data/embedding_data/grs-001_1921_13__298_d_tesseract.txt","w") as f:
    f.write(ocr_image_example_text_fixed_bindestrich)

In [None]:
with open("data/embedding_data/grs-001_1921_13__298_d_tesseract.txt","r") as f:
    ocr_image_example_text_fixed_bindestrich = f.read()

Here we go! A CER of 0.8%, as we would have expected for a combination of Tesseract, printed text and a very simple layout.

## Post-correction with edit distance
Can we do better? The Tesseract OCR is already of very high quality, but there are several seemingly "obvious" mistakes, such as "Ohancen" instead of "Chancen" or "eiustimmig" instead of "einstimmig". A naive way to address this is the "lexicon method", where you check whether the word you found through visual methods exists in the lexicon of that language at all.

Of course, there are several caveats to this, which are also the reason why Tesseract did not correct these words. (1) You cannot be sure that you have a complete vocabulary with all declensions, conjugations etc. (2) Names are usually not part of a vocabulary, and even if they are, see point (1). (3) Sometimes, especially in our Swiss dataset, words from other languages are used in a regular German sentence. You would then (correctly) flag "trottoir" as not being a valid German word and replace it with the most similar German word (in terms of edit distance) in your vocabulary, turning a correct word into a wrong one. In German this is called "verschlimmbessern": making something worse by trying to improve it.

Still, especially for names, this type of post-correction can be very rewarding. We address some of these concerns by setting the allowed edit distance very low and even making it depend on the word length. Let's try it on our current example and see if it makes our CER better or worse.
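
To make "depends on the word length" concrete, one possible rule is sketched below. The function name and the length thresholds (3 and 6) are assumptions chosen for illustration, not values tuned on this dataset.

```python
def allowed_edits(word, max_changes=2):
    """Return how many edits we tolerate for a word of this length (illustrative thresholds)."""
    if len(word) <= 3:
        return 0                     # too short: a single edit can turn it into a different word
    if len(word) <= 6:
        return min(1, max_changes)   # medium length: at most one edit
    return max_changes               # long words: the full budget


for example in ["für", "Wort", "einstimmig"]:
    print(example, allowed_edits(example))
```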

As a lexicon we downloaded all of German Wikipedia and because our dataset is Swiss we added all the Swiss person names from the Bundesamt für Statistik.

In [None]:
# edit_distance
import json
with open('data/word_count_dewiki_chnames.json', 'r', encoding='utf8') as file:
    german_word_set = json.load(file)
german_word_set = set(k for k, v in german_word_set.items() if v > 3 and len(k)>1)  # only keep words longer than one character which appear more than three times, to avoid typos

In [None]:
def lexicon_checking(text, lexicon):
    word_count = 0
    word_in_dict_count = 0
    words_not_in_dict = []
    patt = r'[a-zA-ZäöüÄÖÜß]+'
    for word in re.finditer(patt, text):
        word = word.group()
        word_count +=1
        if word in lexicon or word.lower() in lexicon:
            word_in_dict_count += 1
        else:
            words_not_in_dict.append(word)
    return word_in_dict_count/word_count*100, words_not_in_dict

In [None]:
lex_results_pdf = lexicon_checking(pdf_doc, german_word_set)
lex_results_ocr = lexicon_checking(ocr_image_example_text_fixed_bindestrich, german_word_set)

In [None]:
print(f'The percentage of words in dictionary for: \n pdf: {lex_results_pdf[0]:.2f} \n tesseract: {lex_results_ocr[0]:.2f}')

In [None]:
print(f'The words not found in the lexicon for: \n pdf: {lex_results_pdf[1]} \n tesseract {lex_results_ocr[1]}')

In [None]:
from tqdm import tqdm

In [None]:
german_word_set_list = list(german_word_set)

A small note about the edit distance: computing the full edit distance (including being able to track the edits and not just the distance itself) requires filling in the entire memoization table, which takes O(N*M) time, where N and M are the lengths of the two strings. This is slow. But we have an additional constraint: we are not interested in edit distances larger than k (in our case, k=2). This enables two shortcuts:

(1) Before computing anything, we check whether the lengths of the two strings differ by more than k. If so, we return immediately, because the edit distance is at least the difference in length.

(2) If every value in a row of the table is larger than k, we stop and report that the distance is too large, because the values in later rows cannot drop below the smallest value of that row.

This doesn't change the asymptotic complexity (for that you would need to do some more index tricks), but it reduces our actual compute time significantly. The function below is a method from nltk.metrics.distance I adapted as explained above.

In [None]:
def edit_distance(s1, s2, substitution_cost=1, transpositions=False, max_changes=2):
    """
    Calculate the Levenshtein edit-distance between two strings.
    The edit distance is the number of characters that need to be
    substituted, inserted, or deleted, to transform s1 into s2.  For
    example, transforming "rain" to "shine" requires three steps,
    consisting of two substitutions and one insertion:
    "rain" -> "sain" -> "shin" -> "shine".  These operations could have
    been done in other orders, but at least three steps are needed.

    Allows specifying the cost of substitution edits (e.g., "a" -> "b"),
    because sometimes it makes sense to assign greater penalties to
    substitutions.

    This also optionally allows transposition edits (e.g., "ab" -> "ba"),
    though this is disabled by default.

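    :param s1, s2: The strings to be analysed
    :param substitution_cost: The cost of a substitution edit
    :param transpositions: Whether to allow transposition edits
    :param max_changes: The largest distance of interest; the computation
        stops early and returns max_changes + 1 once the length difference
        or every entry of a table row exceeds it
    :type s1: str
    :type s2: str
    :type substitution_cost: int
    :type transpositions: bool
    :type max_changes: int
    :rtype: int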
    """
    # set up a 2-D array
    len1 = len(s1)
    len2 = len(s2)
    if abs(len1-len2) > max_changes:
        return max_changes+1
    lev = _edit_dist_init(len1 + 1, len2 + 1)

    # retrieve alphabet
    sigma = set()
    sigma.update(s1)
    sigma.update(s2)

    # set up table to remember positions of last seen occurrence in s1
    last_left_t = _last_left_t_init(sigma)

    # iterate over the array
    # i and j start from 1 and not 0 to stay close to the Wikipedia pseudo-code
    # see https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance
    for i in range(1, len1 + 1):
        last_right_buf = 0
        for j in range(1, len2 + 1):
            last_left = last_left_t[s2[j - 1]]
            last_right = last_right_buf
            if s1[i - 1] == s2[j - 1]:
                last_right_buf = j
            _edit_dist_step(
                lev,
                i,
                j,
                s1,
                s2,
                last_left,
                last_right,
                substitution_cost=substitution_cost,
                transpositions=transpositions,
            )
        last_left_t[s1[i - 1]] = i
        if min(lev[i]) > max_changes:  # max distance I allow
            return max_changes+1  # just a way of saying it's larger than allowed
    return lev[len1][len2]
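
The function above relies on private helpers from nltk (the names starting with an underscore), which may change between nltk versions. As an alternative, the two shortcuts can be sketched in plain Python. This version is an illustration under simplifying assumptions: plain Levenshtein distance with unit costs, no transpositions, and a made-up function name.

```python
def bounded_levenshtein(s1, s2, max_changes=2):
    """Levenshtein distance, or max_changes + 1 as soon as the limit is exceeded."""
    # Shortcut 1: a length difference above the limit already needs too many edits
    if abs(len(s1) - len(s2)) > max_changes:
        return max_changes + 1
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]
        for j, c2 in enumerate(s2, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (c1 != c2),  # substitution (free if equal)
            ))
        # Shortcut 2: row minima never decrease, so the result cannot get back under the limit
        if min(current) > max_changes:
            return max_changes + 1
        previous = current
    # Cap the result so that "too far" is always reported as max_changes + 1
    return min(previous[-1], max_changes + 1)


print(bounded_levenshtein("Ohancen", "Chancen"))
print(bounded_levenshtein("eiustimmig", "einstimmig"))
```

Only two rows of the table are kept in memory at any time, which is enough when we need the distance but not the edits themselves.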

Note that even with these changes, the code below is extremely slow: we still have to compare each unknown word in our example text to every word in our vocabulary, even though each individual edit-distance computation is now a bit faster.

If this is done many times, it would be better to build a data structure designed for this type of lookup, one that stores the vocabulary in such a way that similar strings can be found quickly. This description is vague and highly simplified; [this blog post](http://blog.notdot.net/2010/07/Damn-Cool-Algorithms-Levenshtein-Automata) on the topic is very illuminating if you would like to learn more and don't know where to start.
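
A far simpler idea than the automata described in that blog post is to group the vocabulary by word length once, so that each lookup only visits words whose length is within k of the query. The sketch below shows just this pruning step; the function names and the mini lexicon are made up, and every candidate it yields still has to be checked with the edit distance.

```python
from collections import defaultdict


def build_length_index(lexicon):
    """Group lexicon words by length so that lookups can skip most of the vocabulary."""
    index = defaultdict(list)
    for entry in lexicon:
        index[len(entry)].append(entry)
    return index


def candidates(word, index, max_changes=2):
    """Yield only the lexicon words whose length is within max_changes of the query."""
    shortest = max(1, len(word) - max_changes)
    longest = len(word) + max_changes
    for length in range(shortest, longest + 1):
        yield from index.get(length, [])


toy_lexicon = {"Chancen", "einstimmig", "Rat", "Bundesversammlung"}  # made-up mini lexicon
index = build_length_index(toy_lexicon)
print(sorted(candidates("Ohancen", index)))  # ['Chancen']
```

This applies the length check once per query instead of once per vocabulary word. How much time it saves depends on how the word lengths in the vocabulary are distributed.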

In [None]:
ocr_image_example_text_fixed_bindestrich_fixedvocab = ocr_image_example_text_fixed_bindestrich
patt = r'[a-zA-ZäöüÄÖÜß]+'
german_word_set_list = list(german_word_set)
num_characters_changed = 0
max_changes = 2
for word in tqdm(re.finditer(patt, ocr_image_example_text_fixed_bindestrich)): # around 2k words takes 16 minutes
    if word.group() not in german_word_set and word.group().lower() not in german_word_set:
        
        # don't correct short words
        if len(word.group()) <= 3:
            continue
        
        # don't correct words that appear like that in the gt pdf as well
        if word.group().lower() in pdf_doc.lower():
            continue
        
        # find the closest one
        closest_word = ""
        min_dist = max_changes+1
        for ger_word in german_word_set_list:
            d = edit_distance(word.group(), ger_word)
            if d < min_dist:
                min_dist = d
                closest_word = ger_word
                if min_dist == 1: #cannot get better than this
                    break

        if min_dist > max_changes:
            continue

        print(word.group())
        print(closest_word)

        ocr_image_example_text_fixed_bindestrich_fixedvocab =\
            ocr_image_example_text_fixed_bindestrich_fixedvocab[:word.start()+num_characters_changed]\
            + closest_word\
                + ocr_image_example_text_fixed_bindestrich_fixedvocab[word.end()+num_characters_changed:]
        num_characters_changed += len(closest_word)-len(word.group())

In [None]:
error = cer(pdf_doc, ocr_image_example_text_fixed_bindestrich_fixedvocab)
error

As you can see, that helped but barely. Considering how long it takes to compute, this is likely not a viable next step for most use-cases, but it's important to keep in mind for situations where small typos are unacceptable.

In [None]:
with open("data/embedding_data/grs-001_1921_13__298_d_tesseract_vocabulary_fixes.txt","w") as f:
    f.write(ocr_image_example_text_fixed_bindestrich_fixedvocab)

# Handwritten text recognition (HTR)
This is much trickier, and depends entirely on your data.

Here we begin with some botanical images, where only certain parts of the image contain nicely written labels.

For handwritten text recognition, the best out-of-the-box tool at the moment is probably [Transkribus](https://app.transkribus.org/). But we don't have access to those models and cannot fine-tune them, unless we pay for a premium account and do it through their own website. An alternative for handwritten text recognition is the [Kraken project](https://kraken.re/main/index.html), which gives us access to dozens of pre-trained models which we can download and use ourselves. You can install Kraken easily via `pip install kraken`.

NOTE: Unfortunately, as of December 2025 there is still no Kraken release for Python 3.13, so to run the following code cells you need to switch to Python 3.12 and restart the kernel. On Windows you can get Python 3.12 via:

`winget install python.python.3.12` 

then deactivate the current environment and simply create a new environment with this Python:
```
deactivate
py -3.12 -m venv "env_datastories_py312"
.\env_datastories_py312\Scripts\activate.bat
pip install wheel
pip install ipykernel
pip install kraken
```

Don't forget to select the new environment as the notebook's kernel (at the top right) as well!

Below we use the [fondue model](https://github.com/FoNDUE-HTR/), which was trained for handwritten text recognition. We use the German one ("_de"), the Latin one ("_la") and the general one. That is because the labels themselves are mostly in German, but since they're plant species, many of their names are in Latin. Let's compare.

First we download some Kraken models off of [Zenodo](https://zenodo.org/records/14399779):

In [None]:
!curl -L -o ./data/kraken_models/FoNDUE-GD_v2.mlmodel "https://zenodo.org/records/14399779/files/FoNDUE-GD_v2.mlmodel?download=1"
!curl -L -o ./data/kraken_models/FoNDUE-GD_v2_de.mlmodel "https://zenodo.org/records/14399779/files/FoNDUE-GD_v2_de.mlmodel?download=1"
!curl -L -o ./data/kraken_models/FoNDUE-GD_v2_la.mlmodel "https://zenodo.org/records/14399779/files/FoNDUE-GD_v2_la.mlmodel?download=1"
!curl -L -o ./data/kraken_models/McCATMuS_nfd_nofix_V1.mlmodel "https://zenodo.org/records/13788177/files/McCATMuS_nfd_nofix_V1.mlmodel?download=1"

Then we run the text recognition on the given pages:

In [None]:
!kraken -i data/ocr_data/Z-000033489.jpg data/ocr_data/output/Z-000099226_fondue_gd_v2_de.txt segment -bl ocr -m ./data/kraken_models/FoNDUE-GD_v2_de.mlmodel
!kraken -i data/ocr_data/Z-000033489.jpg data/ocr_data/output/Z-000099226_fondue_gd_v2_la.txt segment -bl ocr -m ./data/kraken_models/FoNDUE-GD_v2_la.mlmodel
!kraken -i data/ocr_data/Z-000033489.jpg data/ocr_data/output/Z-000099226_fondue_gd_v2.txt segment -bl ocr -m ./data/kraken_models/FoNDUE-GD_v2.mlmodel

And finally read (and print!) the output:

In [None]:
with open("data/ocr_data/output/Z-000099226_fondue_gd_v2_de.txt", encoding="utf8") as f:
    de_result = f.read()
    print(de_result)

with open("data/ocr_data/output/Z-000099226_fondue_gd_v2_la.txt", encoding="utf8") as f:
    la_result = f.read()
    print(la_result)

with open("data/ocr_data/output/Z-000099226_fondue_gd_v2.txt", encoding="utf8") as f:
    general_result = f.read()
    print(general_result)
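
To compare the three transcriptions a little more systematically than by eye, one option is a pairwise similarity score from the standard library's `difflib`. This is only a sketch: the helper name is made up, and the example strings are invented stand-ins for `de_result`, `la_result` and `general_result`.

```python
from difflib import SequenceMatcher
from itertools import combinations


def pairwise_similarity(outputs):
    """Return a similarity ratio in [0, 1] for every pair of transcriptions."""
    scores = {}
    for (name_a, text_a), (name_b, text_b) in combinations(outputs.items(), 2):
        scores[(name_a, name_b)] = SequenceMatcher(None, text_a, text_b).ratio()
    return scores


# Made-up strings standing in for de_result, la_result and general_result
example_outputs = {
    "de": "Gentiana lutea, gesammelt bei Zürich",
    "la": "Gentiana lutea, gesammelt bei Zurich",
    "general": "Gentiana lutea, gesamelt bei Zürich",
}
for pair, score in pairwise_similarity(example_outputs).items():
    print(pair, round(score, 3))
```

Note that this only measures how much the models agree with each other, not which one is right; for that you would need a manual transcription to compare against.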

That didn't work as well as the OCR of printed text, but the result is fairly readable.

On the other hand, here we have some notary pages from the Archief Amsterdam, in English. Although it may seem that this is hastily written, given the other documents in their archive this is actually fairly nice handwriting.

In [None]:
!kraken -i "data/ocr_data/d837ae03-b2c5-6b6d-e053-b784100acdee_en.jpg" "data/ocr_data/output/d837ae03-b2c5-6b6d-e053-b784100acdee_en_McCATMuS_nfd_nofix_V1.txt" segment -bl ocr -m ./data/kraken_models/McCATMuS_nfd_nofix_V1.mlmodel

In [None]:
with open("data/ocr_data/output/d837ae03-b2c5-6b6d-e053-b784100acdee_en_McCATMuS_nfd_nofix_V1.txt", encoding="utf8") as f:
    notary_res = f.read()
    print(notary_res)

This didn't work very well, in large part due to the fact that the model was not trained on this handwriting.

As you can see, handwritten text recognition is a much more difficult task. For most use cases, Transkribus will do just fine. But if you have a lot of data and would like to feed the results back into your own pipeline, it becomes necessary to use Kraken models (or something similar) and fine-tune them yourself. That is because Transkribus does not make it easy to programmatically (a) upload your own transcribed data and (b) download its transcriptions in a useful format from within your own pipeline.