# OCR

In [13]:
import pytesseract #https://pypi.org/project/pytesseract/
from jiwer import cer #https://pypi.org/project/jiwer/
from pdf2image import convert_from_path
import os
from PIL import Image
from nltk.metrics.distance import _edit_dist_step, _edit_dist_init, _last_left_t_init
import re
from xml.etree.ElementTree import Element, SubElement, ElementTree
from bs4 import BeautifulSoup
import spacy
import fitz

## Printed

### PDF **with embedded text**
This might seem out of place in an OCR notebook, but it is important to be able not only to create OCR for files, but also to extract the text from files that already contain it!

In [2]:
filePath_list = ['data/grs-001_1921_13__298_d.pdf', 'data/grs-001_1921_13__393_d.pdf',
                 'data/grs-001_1922_14__563_d.pdf', 'data/grs-001_1923_15__447_d.pdf']
for filePath in filePath_list:

    pdf_doc = fitz.open(filePath)
    page = pdf_doc[1:] #page 1 is just the metadata for the document
    pdf_text = " ".join([x.get_text() for x in page])
    pdf_text = re.sub(r"¬\s+", "", pdf_text)  # remove line-break hyphens ("¬") together with any whitespace that follows them
    pdf_text = pdf_text.strip()
    pdf_text = re.sub(r"\n", " ", pdf_text)  # replace newlines with spaces
    pdf_text = re.sub(r"\. ", ".\n", pdf_text)  # break the line after each period (undone again by the whitespace collapse below)
    pdf_text = re.sub(r"\s+", " ", pdf_text)  # collapse every run of whitespace, newlines included, into a single space
    pdf_text = re.sub(r"\\", "", pdf_text)  # remove any literal backslashes

    #save the pure text in a new file, we'll re-use this in the embedding_data notebook
    with open(filePath.replace(".pdf",".txt").replace("data/", "data/embedding_data/"), "w") as f:
        f.write(pdf_text)
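
To see what the clean-up does, here is a minimal, standalone sketch of the same idea applied to a made-up sample string. The function name `normalize_extracted_text` is ours and not part of the notebook, and the steps are condensed, so edge cases (for example literal backslashes next to whitespace) may behave slightly differently from the cell above.

```python
import re


def normalize_extracted_text(raw: str) -> str:
    """Join words split at a line end with "¬" and collapse all whitespace."""
    text = re.sub(r"¬\s+", "", raw)    # "Lohnab¬\nbau" -> "Lohnabbau"
    text = text.replace("\\", "")      # drop literal backslashes
    text = re.sub(r"\s+", " ", text)   # newlines, tabs and runs of spaces -> one space
    return text.strip()


# Made-up sample in the style of the extracted PDF text
sample = "Der Lohnab¬\nbau ist eine  internatio¬\n nale\nErscheinung. "
print(normalize_extracted_text(sample))
```

Wrapping the steps in one function also makes it easier to apply exactly the same normalization to every text source that is compared later on.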

### Pytesseract
If you have a file without embedded text, you need to run OCR on it first. Tesseract is a widely used open-source OCR engine, so that's what we use here (through the pytesseract wrapper).

#### PDF
Note that this PDF has embedded text, but we throw it away and run pytesseract on it for demonstration purposes.

In [108]:
filePath = 'data/grs-001_1921_13__298_d.pdf'
doc = convert_from_path(filePath)
path, fileName = os.path.split(filePath)
fileBaseName, fileExtension = os.path.splitext(fileName)

This time we print the text page by page instead of joining the pages, so that we can inspect it.

In [109]:
for page_number, page_data in enumerate(doc):
    txt = pytesseract.image_to_string(page_data, lang="deu")
    print("Page # {} - {}".format(str(page_number),txt))

Page # 0 - Zeitschrift: Gewerkschaftliche Rundschau für die Schweiz : Monatsschrift des
Schweizerischen Gewerkschaftsbundes

Herausgeber: Schweizerischer Gewerkschaftsbund

Band: 13 (1921)

Heft: 5

Artikel: Lohnabbau

Autor: [s.n.]

DOI: https://doi.org/10.5169/seals-351437

Nutzungsbedingungen

Die ETH-Bibliothek ist die Anbieterin der digitalisierten Zeitschriften. Sie besitzt keine Urheberrechte
an den Zeitschriften und ist nicht verantwortlich für deren Inhalte. Die Rechte liegen in der Regel bei
den Herausgebern beziehungsweise den externen Rechteinhabern. Siehe Rechtliche Hinweise.
Conditions d'utilisation

L'ETH Library est le fournisseur des revues numerisees. Elle ne d&tient aucun droit d'auteur sur les
revues et n'est pas responsable de leur contenu. En r&gle generale, les droits sont d&tenus par les
editeurs ou les d&tenteurs de droits externes. Voir Informations legales.

Terms of use

The ETH Library is the provider of the digitised journals. It does not own any copyright

#### Image
Images (usually) have no embedded text layer, so here there is nothing to throw away: we simply run pytesseract on the JPG files.

In [2]:
ocr_image_example_filepaths = ["data/ocr_data/grs-001_1921_013_0051.jpg", "data/ocr_data/grs-001_1921_013_0052.jpg"]
ocr_image_example_text = ""
for path in ocr_image_example_filepaths:
    text = pytesseract.image_to_string(Image.open(path), lang='deu')
    text = re.sub(r"¬\s+", "", text)  # remove line-break hyphens ("¬") together with any whitespace that follows them
    text = text.strip()
    text = re.sub(r"\n", " ", text)  # replace newlines with spaces
    text = re.sub(r"\. ", ".\n", text)  # break the line after each period (undone again by the whitespace collapse below)
    text = re.sub(r"\s+", " ", text)  # collapse every run of whitespace, newlines included, into a single space
    text = re.sub(r"\\", "", text)  # remove any literal backslashes
    ocr_image_example_text += text

Let's see what it says:

In [4]:
ocr_image_example_text

'GEWERKSCHAFTLICHE RUNDSCHAU 39 In den Mittelpunkt des Interesses ist der «Lohnab- bau» gerückt. Auch er ist wie die Krise eine internatio- rale Erscheinung. In Amerika, in England, ja sogar in den valutaschwachen Ländern Deutschland, Oesterreich, Tschechoslowakei, Italien usw., überall stehen wir vor der gleichen Erscheinung. Der « Preisabbau» hatte kaum Zeit, sich anzumelden, machten schon die Indu- striellen die grössten Anstrengungen, die «hohen » Kriegslöhne auf ein «erträgliches Mass» zurückzu- drücken. Leider ist die Situation diesem Vorhaben gün- stieg, denn die Konkurrenz der arbeitslosen Reserve- armee war noch nie so stark wie eben jetzt. Es lässt sich nachweisen, dass hauptsächlich in der Textilindu- strie die Löhne gesunken sind, ohne dass die Oeffent- lichkeit etwas davon merkte. In den Heimarbeitgebie- ten ist es nicht besser. Lohnreduktionen treten auf in der chemischen Industrie und in manchen Zweigen der Lebens- und Genussmittelindustrie. Unberührt davon sind bis jetz

Looks good! But how do we know whether the OCR is actually good? One evaluation metric is the so-called "character error rate" (CER). We don't have a true "ground truth" (GT) to compare against, so we'll use the E-Periodica OCR as GT and compare the tesseract result to it.

### Evaluation

Luckily we already extracted the plain text from the E-Periodica PDF files, so we just open the matching file again here:

In [3]:
with open("data/embedding_data/grs-001_1921_13__298_d.txt", "r") as f:
    pdf_doc = f.read()

In [15]:
error = cer(pdf_doc, ocr_image_example_text)
error

0.020342382350834468
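
For intuition, the character error rate is the character-level edit distance between reference and hypothesis, divided by the length of the reference. The following is a minimal pure-Python sketch of that definition. The function names are ours, and `jiwer` may apply its own preprocessing, so the numbers need not match `cer` exactly.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,               # delete r
                            curr[j - 1] + 1,           # insert h
                            prev[j - 1] + (r != h)))   # substitute, or match at no cost
        prev = curr
    return prev[-1]


def char_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance normalised by the number of reference characters."""
    if not reference:
        raise ValueError("reference must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


# One wrong character in a 14-character word
print(char_error_rate("internationale", "internatiorale"))
```

Because the denominator is the reference length, the value can exceed 1 when the hypothesis is much longer than the reference.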

A character error rate of 2% is already very low, but as you might have noticed in our examples, we can lower it even further. Tesseract marks words that were split across a line break with a regular hyphen "-" rather than with "¬", the character our pre-processing looks for, so these split words are not rejoined. Let's change that:

In [5]:
ocr_image_example_text_fixed_bindestrich = ""
for path in ocr_image_example_filepaths:
    text = pytesseract.image_to_string(Image.open(path), lang='deu')
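    text = re.sub(r"-\n", "", text)  # this is the line we changed: rejoin words hyphenated at a line end
    text = text.strip()
    text = re.sub(r"\n", " ", text)  # replace newlines with spaces
    text = re.sub(r"\. ", ".\n", text)  # break the line after each period (undone again by the whitespace collapse below)
    text = re.sub(r"\s+", " ", text)  # collapse every run of whitespace, newlines included, into a single space
    text = re.sub(r"\\", "", text)  # remove any literal backslashes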
    ocr_image_example_text_fixed_bindestrich += text

In [16]:
error = cer(pdf_doc, ocr_image_example_text_fixed_bindestrich)
error

0.007950719862474035

In [9]:
with open("data/embedding_data/tesseract.txt","w") as f:
    f.write(ocr_image_example_text_fixed_bindestrich)

In [4]:
with open("data/embedding_data/tesseract.txt","r") as f:
    ocr_image_example_text_fixed_bindestrich = f.read()

There we go! A CER of 0.8%, which is what we would expect from tesseract on printed text with a very simple layout.
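
One caveat of removing every `-` at a line end: a genuine hyphen that happens to fall at the end of a line, as in "Lebens- und Genussmittelindustrie", is removed as well. The sketch below shows one possible heuristic that keeps the hyphen when the next line starts with "und" or "oder". This is an assumption on our side, not what the cells above do, and the sample string is made up.

```python
import re


def join_hyphenated(text: str) -> str:
    """Rejoin words split at a line end, but keep "Lebens-\\nund"-style hyphens."""
    # Negative lookahead: do not touch the hyphen if the next line starts with und/oder
    return re.sub(r"-\n(?!(?:und|oder)\b)", "", text)


ocr_lines = "eine internatio-\nnale Erscheinung in der Lebens-\nund Genussmittelindustrie"

naive = re.sub(r"-\n", "", ocr_lines).replace("\n", " ")
careful = join_hyphenated(ocr_lines).replace("\n", " ")
print(naive)
print(careful)
```

Whether such a rule actually lowers the CER depends on how the reference text treats these cases, so it would have to be measured rather than assumed.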

## Post-correction with edit distance
Can we do better? Looking at the tesseract OCR, it's already very high quality, but there are several seemingly "obvious" mistakes, such as "Ohancen" instead of "Chancen" or "eiustimmig" instead of "einstimmig". Something simple that tesseract even does itself is a "vocabulary correction", where you check if the word you found through visual methods even exists in the vocabulary of that language.

Of course, there are several caveats to this, which are also the reason why tesseract did not correct these words. (1) You cannot be sure that you have a complete vocabulary with all declensions, conjugations etc. (2) Names are usually not part of a vocabulary, and even if they are, see point (1). (3) Sometimes, especially in our Swiss dataset, words from other languages are used in a regular German sentence. You would then (correctly) flag "trottoir" as not being a valid German word, look for the most similar German word in your vocabulary (in terms of edit distance) and replace it, which amounts to "verschlimmbessern": making things worse by trying to improve them.

Still, especially for names this type of post-correction can be very rewarding. We address some of these concerns by setting the allowed edit distance very low and even making it depend on the word length. Let's try it on our current example and see if it makes our CER better or worse.
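
The length-dependent threshold can be written down as a small helper. This is only a sketch: the function name `allowed_edits` is ours, and the cut-offs (3 and 6 characters) mirror the ones used in the correction loop of this notebook.

```python
def allowed_edits(word: str, max_changes: int = 2) -> int:
    """How many edits we are willing to accept when correcting `word`."""
    if len(word) <= 3:
        return 0                         # very short words are never corrected
    if len(word) < 6:
        return max(max_changes - 1, 0)   # somewhat short words: one edit fewer
    return max_changes


for example in ["und", "Enten", "Ohancen"]:
    print(example, allowed_edits(example))
```

The shorter the word, the more likely a one- or two-character change lands on a different valid word, which is why the budget shrinks with length.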

As a vocabulary we downloaded all of German Wikipedia and because our dataset is Swiss we added all the Swiss person names from the Bundesamt für Statistik.

In [141]:
# edit_distance
import json
with open('data/word_count_dewiki_chnames.json', 'r', encoding='utf8') as file:
    german_word_set = json.load(file)
german_word_set = set(k for k, v in german_word_set.items() if v > 3 and len(k)>1)  # only keep words that appear more than three times (to filter out typos) and are longer than one character

In [142]:
def lexicon_checking(text, lexicon):
    word_count = 0
    word_in_dict_count = 0
    words_not_in_dict = []
    patt = r'[a-zA-ZäöüÄÖÜß]+'
    for word in re.finditer(patt, text):
        word = word.group()
        word_count +=1
        if word in lexicon or word.lower() in lexicon:
            word_in_dict_count += 1
        else:
            words_not_in_dict.append(word)
    return word_in_dict_count/word_count*100, words_not_in_dict

In [143]:
lex_results_pdf = lexicon_checking(pdf_doc, german_word_set)
lex_results_ocr = lexicon_checking(ocr_image_example_text_fixed_bindestrich, german_word_set)

In [144]:
print(f'The percentage of words in dictionary for: \n pdf: {lex_results_pdf[0]:.2f} \n tesseract: {lex_results_ocr[0]:.2f}')

The percentage of words in dictionary for: 
 pdf: 97.26 
 tesseract: 97.46


In [145]:
print(f'The words not found in the lexicon for: \n pdf: {lex_results_pdf[1]} \n tesseract {lex_results_ocr[1]}')

The words not found in the lexicon for: 
 pdf: ['valutaschwachen', 'Preisabbau', 'machton', 'Kriegslöhne', 'Heimarbeitgebieten', 'Lohnreduktionen', 'Vorsloss', 'Uhrcnarbeiter', 'Maschincnindustric', 'Exportindustric', 'Preisabbaues', 'Preisabbau', 'valutaschwachen', 'Vergloichszahlen', 'J', 'z', 'B', 'V', 'S', 'K', 'torenbesoldung', 'Weltmarktkonkurrenz', 'Preisabbau', 'Preisabbau', 'Hochschutzzöllnern', 'Tabakzolls', 'Zollzuschläge', 'Sabotagemassnahmen', 'boschränkungen', 'seuchenpolizeilichen', 'Grenzschlachthöfen', 'Metzgergewerbe', 'Defizitwirtschaft', 'Intcressencliquen', 'Futterationen', 'Industrielatid', 'Verhältnisesn', 'Preisabbau', 'deutsehen', 'u', 'a', 'Zwangsmassregeln', 'Uebereinkon', 'Gowerkschaftskonferenz', 'machungen', 'Tnteressen', 'entgegenzustel', 'machungen', 'Kriegshyänen', 'Reparationsinstitut', 'Wiecloraufbauarbeit'] 
 tesseract ['internatiorale', 'valutaschwachen', 'Preisabbau', 'Kriegslöhne', 'günstieg', 'Heimarbeitgebieten', 'Lohnreduktionen', 'Entwieklung'

In [146]:
from tqdm import tqdm

In [147]:
german_word_set_list = list(german_word_set)

In [148]:
def edit_distance(s1, s2, substitution_cost=1, transpositions=False, max_changes=2):
    """
    Calculate the Levenshtein edit-distance between two strings.
    The edit distance is the number of characters that need to be
    substituted, inserted, or deleted, to transform s1 into s2.  For
    example, transforming "rain" to "shine" requires three steps,
    consisting of two substitutions and one insertion:
    "rain" -> "sain" -> "shin" -> "shine".  These operations could have
    been done in other orders, but at least three steps are needed.

    Allows specifying the cost of substitution edits (e.g., "a" -> "b"),
    because sometimes it makes sense to assign greater penalties to
    substitutions.

    This also optionally allows transposition edits (e.g., "ab" -> "ba"),
    though this is disabled by default.

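    :param s1, s2: The strings to be analysed
    :param substitution_cost: The cost of a substitution edit
    :param transpositions: Whether to allow transposition edits
    :param max_changes: Stop early and return max_changes + 1 as soon as
        the distance is known to exceed this value
    :type s1: str
    :type s2: str
    :type substitution_cost: int
    :type transpositions: bool
    :type max_changes: int
    :rtype: int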
    """
    # set up a 2-D array
    len1 = len(s1)
    len2 = len(s2)
    if abs(len1-len2) > max_changes:
        return max_changes+1
    lev = _edit_dist_init(len1 + 1, len2 + 1)

    # retrieve alphabet
    sigma = set()
    sigma.update(s1)
    sigma.update(s2)

    # set up table to remember positions of last seen occurrence in s1
    last_left_t = _last_left_t_init(sigma)

    # iterate over the array
    # i and j start from 1 and not 0 to stay close to the wikipedia pseudo-code
    # see https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance
    for i in range(1, len1 + 1):
        last_right_buf = 0
        for j in range(1, len2 + 1):
            last_left = last_left_t[s2[j - 1]]
            last_right = last_right_buf
            if s1[i - 1] == s2[j - 1]:
                last_right_buf = j
            _edit_dist_step(
                lev,
                i,
                j,
                s1,
                s2,
                last_left,
                last_right,
                substitution_cost=substitution_cost,
                transpositions=transpositions,
            )
        last_left_t[s1[i - 1]] = i
        if min(lev[i]) > max_changes:  # every entry in this row already exceeds the allowed maximum
            return max_changes+1  # just a way of saying it's larger than allowed
    return lev[len1][len2]

In [149]:
ocr_image_example_text_fixed_bindestrich_fixedvocab = ocr_image_example_text_fixed_bindestrich
patt = r'[a-zA-ZäöüÄÖÜß]+'
german_word_set_list = list(german_word_set)
num_characters_changed = 0
max_changes = 2
for word in tqdm(re.finditer(patt, ocr_image_example_text_fixed_bindestrich)): # around 2k words takes 16 minutes
    if word.group() not in german_word_set and word.group().lower() not in german_word_set:
        
        # don't correct short words
        if len(word.group()) <= 3:
            continue
        
        # don't correct words that appear like that in the gt pdf as well
        if word.group().lower() in pdf_doc.lower():
            continue
        
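        # find the closest one
        # note: max_changes is passed as the third argument of edit_distance, which is substitution_cost,
        # so a substitution costs max_changes here while the cutoff inside edit_distance keeps its default of 2
        distances = [edit_distance(word.group(), ger_word, substitution_cost=max_changes) for ger_word in german_word_set_list]
        # WARNING: this takes ages,
        # around 1.5 minutes per word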
        min_dist = min(distances)
        if min_dist > max_changes:
            continue
        if len(word.group()) < 6 and min_dist > max_changes-1: # if it's somewhat of a short word allow fewer changes
            continue
        closest_word = german_word_set_list[distances.index(min_dist)]
        print(word)
        print(closest_word)
        ocr_image_example_text_fixed_bindestrich_fixedvocab =\
            ocr_image_example_text_fixed_bindestrich_fixedvocab[:word.start()+num_characters_changed]\
            + closest_word\
                + ocr_image_example_text_fixed_bindestrich_fixedvocab[word.end()+num_characters_changed:]
        num_characters_changed += len(closest_word)-len(word.group())

19it [01:35,  5.03s/it]

<re.Match object; span=(125, 139), match='internatiorale'>
internationale


71it [02:43,  2.06s/it]

<re.Match object; span=(547, 555), match='günstieg'>
günstig


214it [03:58,  1.12it/s]

<re.Match object; span=(1604, 1615), match='Entwieklung'>
Entwiklung


264it [04:55,  1.04it/s]

<re.Match object; span=(1939, 1946), match='Ohancen'>
Chancen


828it [06:16,  3.55it/s]

<re.Match object; span=(6007, 6024), match='Indvstrieprodukte'>
Industrieprodukte


1289it [09:15,  2.88it/s]

<re.Match object; span=(9532, 9545), match='Gieichgewicht'>
Gelichgewicht


1386it [10:11,  2.55it/s]

<re.Match object; span=(10255, 10262), match='Entenie'>
Enten


1457it [11:07,  2.16it/s]

<re.Match object; span=(10824, 10831), match='möglieh'>
möglich


1473it [12:23,  1.45it/s]

<re.Match object; span=(10972, 10982), match='eiustimmig'>
einstimmig


1852it [13:33,  2.28it/s]


In [150]:
error = cer(pdf_doc, ocr_image_example_text_fixed_bindestrich_fixedvocab)
error

0.007664207434997493

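As you can see, that helped, but only barely: the CER drops from about 0.80% to about 0.77%. Considering how long it takes to compute, this is likely not a viable next step in most cases, but it is worth keeping in mind for situations where small typos are unacceptable. Some of the replacements are also wrong ("Entwieklung" became "Entwiklung", "Gieichgewicht" became "Gelichgewicht" and "Entenie" became "Enten"), so let's repeat the correction with `max_changes = 1`.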

In [151]:
ocr_image_example_text_fixed_bindestrich_fixedvocab = ocr_image_example_text_fixed_bindestrich
patt = r'[a-zA-ZäöüÄÖÜß]+'
german_word_set_list = list(german_word_set)
num_characters_changed = 0
max_changes = 1
for word in tqdm(re.finditer(patt, ocr_image_example_text_fixed_bindestrich)): # around 2k words takes 16 minutes
    if word.group() not in german_word_set and word.group().lower() not in german_word_set:
        
        # don't correct short words
        if len(word.group()) <= 3:
            continue
        
        # don't correct words that appear like that in the gt pdf as well
        if word.group().lower() in pdf_doc.lower():
            continue
        
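        # find the closest one
        # note: max_changes is passed as the third argument of edit_distance, which is substitution_cost,
        # so a substitution costs 1 here while the cutoff inside edit_distance keeps its default of 2;
        # the max_changes limit itself is enforced by the min_dist check below
        distances = [edit_distance(word.group(), ger_word, substitution_cost=max_changes) for ger_word in german_word_set_list]
        # WARNING: this takes ages,
        # around 1.5 minutes per word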
        min_dist = min(distances)
        if min_dist > max_changes:
            continue
        closest_word = german_word_set_list[distances.index(min_dist)]
        print(word)
        print(closest_word)
        ocr_image_example_text_fixed_bindestrich_fixedvocab =\
            ocr_image_example_text_fixed_bindestrich_fixedvocab[:word.start()+num_characters_changed]\
            + closest_word\
                + ocr_image_example_text_fixed_bindestrich_fixedvocab[word.end()+num_characters_changed:]
        num_characters_changed += len(closest_word)-len(word.group())

19it [01:07,  3.56s/it]

<re.Match object; span=(125, 139), match='internatiorale'>
internationale


71it [02:14,  1.75s/it]

<re.Match object; span=(547, 555), match='günstieg'>
günstig


214it [03:29,  1.22it/s]

<re.Match object; span=(1604, 1615), match='Entwieklung'>
Entwiklung


264it [04:29,  1.08it/s]

<re.Match object; span=(1939, 1946), match='Ohancen'>
Chancen


828it [06:08,  3.27it/s]

<re.Match object; span=(6007, 6024), match='Indvstrieprodukte'>
Industrieprodukte


1289it [10:32,  2.11it/s]

<re.Match object; span=(9532, 9545), match='Gieichgewicht'>
Gleichgewicht


1386it [11:53,  1.82it/s]

<re.Match object; span=(10255, 10262), match='Entenie'>
Entente


1457it [13:14,  1.54it/s]

<re.Match object; span=(10824, 10831), match='möglieh'>
möglich


1473it [15:06,  1.01it/s]

<re.Match object; span=(10972, 10982), match='eiustimmig'>
einstimmig


1852it [17:01,  1.81it/s]


In [152]:
error = cer(pdf_doc, ocr_image_example_text_fixed_bindestrich_fixedvocab)
error

0.007377695007520951

A little better still: the CER is now about 0.74%. Allowing zero edits would mean no correction at all, so this is as far as this vocabulary-based post-correction gets us on the tesseract output, given this GT.

In [154]:
with open("tesseract.txt","w") as f:
    f.write(ocr_image_example_text_fixed_bindestrich_fixedvocab)
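
Most of the run time above goes into comparing each unknown word against the entire vocabulary. Since a candidate whose length differs by more than the allowed number of edits can never match, one possible shortcut is to group the vocabulary by word length once and only look at the neighbouring groups. The sketch below uses a plain Levenshtein distance with an early exit instead of the NLTK helpers, a toy vocabulary, and function names that are ours; it was not used to produce the results above.

```python
from collections import defaultdict


def bounded_levenshtein(a: str, b: str, limit: int) -> int:
    """Levenshtein distance, or limit + 1 as soon as it is known to exceed limit."""
    if abs(len(a) - len(b)) > limit:
        return limit + 1
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        if min(curr) > limit:   # row minima never decrease, so we can stop here
            return limit + 1
        prev = curr
    return min(prev[-1], limit + 1)


def index_by_length(lexicon):
    """Group the vocabulary by word length (done once, up front)."""
    buckets = defaultdict(list)
    for entry in lexicon:
        buckets[len(entry)].append(entry)
    return buckets


def closest_word(word, buckets, limit):
    """Return (best candidate, distance); (None, limit + 1) if nothing is close enough."""
    best, best_dist = None, limit + 1
    for length in range(len(word) - limit, len(word) + limit + 1):
        for candidate in buckets.get(length, ()):
            dist = bounded_levenshtein(word, candidate, limit)
            if dist < best_dist:   # ties keep whichever candidate was seen first
                best, best_dist = candidate, dist
    return best, best_dist


toy_lexicon = {"Chancen", "einstimmig", "möglich", "Entente", "Enten"}
buckets = index_by_length(toy_lexicon)
print(closest_word("Ohancen", buckets, limit=1))
print(closest_word("eiustimmig", buckets, limit=1))
```

How much time this saves depends on how the word lengths are distributed in the vocabulary, so it is worth timing on the real data before relying on it.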

## Handwritten
This is much trickier, and depends entirely on your data.

Here we begin with some botanical images, where only certain parts of the image contain nicely written labels.

For handwritten text recognition, the best out-of-the-box tool at the moment is probably Transkribus. But we don't have access to those models and cannot fine-tune them, unless we pay for a premium account and do it through their own website. An alternative for handwritten text recognition is the kraken project, which gives us access to dozens of pre-trained models that we can download and use ourselves.

Below we use the FoNDUE models, which were trained for handwritten text recognition: the German one ("_de"), the Latin one ("_la") and the general one. The labels themselves are mostly in German, but since they describe plant species, many of the names are in Latin. Let's compare.

In [155]:
!kraken -i data/ocr_data/Z-000033489.jpg data/ocr_data/output/Z-000099226_fondue_gd_v2_de.txt segment -bl ocr -m FoNDUE-GD_v2_de.mlmodel
!kraken -i data/ocr_data/Z-000033489.jpg data/ocr_data/output/Z-000099226_fondue_gd_v2_la.txt segment -bl ocr -m FoNDUE-GD_v2_la.mlmodel
!kraken -i data/ocr_data/Z-000033489.jpg data/ocr_data/output/Z-000099226_fondue_gd_v2.txt segment -bl ocr -m FoNDUE-GD_v2.mlmodel

Loading ANN /home/genta/Documents/notebooks_cs/.venv/lib/python3.8/site-packages/kraken/blla.mlmodel ✓
Loading ANN FoNDUE-GD_v2_de.mlmodel ✓
Segmenting ✓
Processing ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 15/15 0:00:00 0:00:02
Writing recognition results for data/ocr_data/Z-000033489.jpg ✓
Loading ANN /home/genta/Documents/notebooks_cs/.venv/lib/python3.8/site-packages/kraken/blla.mlmodel ✓
Loading ANN FoNDUE-GD_v2_la.mlmodel ✓
Segmenting ✓
Processing ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 15/15 0:00:00 0:00:02
Writing recognition results for data/ocr_data/Z-000033489.jpg ✓
Loading ANN /home/genta/Documents/notebooks_cs/.venv/lib/python3.8/site-packages/kraken/blla.mlmodel ✓
Loading ANN FoNDUE-GD_v2.mlmodel

In [153]:
with open("data/ocr_data/output/Z-000099226_fondue_gd_v2_de.txt") as f:
    de_result = f.read()
    print(de_result)

with open("data/ocr_data/output/Z-000099226_fondue_gd_v2_la.txt") as f:
    la_result = f.read()
    print(la_result)

with open("data/ocr_data/output/Z-000099226_fondue_gd_v2.txt") as f:
    general_result = f.read()
    print(general_result)

nuhnneilnnknlalnelnnhm
EMLSDUO ANSIOAUAOU
I. A.1307. 1. 1.I.
107. .. X. 14854
711 I X404
Jorpliera ceberoises, Mart.
Schleckter 6716 Dot. 7. Mrasol Pedersen Je
G. deemnber, Sac, Sablimatisirt
Plantar Schlechterianae,
Reg Natal
Gonshrena glotesa d
In gram, tr ollorityburg
16. I. 1895
3000
No 6746 leg. R. Geblechter
uulimluntiluntuntuntmtuti
inipuo husionum oui
DOTAHISCHER BEMTEI.
BOTMMIEeNs MVEEVI
muIUt eMIS EMEN
somplena celtocornes, Mart.
Schlechter 6FVce Dot. T. Mondol Podorson fc
ccm sa Sublimatioirt.
Plantar Schlichterianae.
Reque Nutat
Gephrena olotasas
nam.  Moretgbures.
16.  189
3ooo
N646 leg. B, detilcchtet
mtimtimtintimtimtmntimtimtunt
443 pUD AISIOAUNOII
DOTAIISCHEN CANTEN
POTAMISENE MMETUE
VRMEENT MIEN
Comphoera celvevises, hart.
Sekleckter 6716 Oot 7. Myndot Podorsen Je
9. decumbens srcq Sublimatisirt
Plantar Schlechterianac,
Reg Natal
Gomphrena globosa s
In gram, tt. Moritzburg
16. 7 1895
3000
N=o 6746 beg. R. Schlechter


That didn't work as well as the OCR of the printed text, but the result is still fairly readable.
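
Each of the three outputs above has 15 lines, so the models can be compared line by line. Below is a small sketch of a side-by-side view; the helper `side_by_side` is ours, and the sample only contains three lines copied from the outputs above. In the notebook you would pass `de_result`, `la_result` and `general_result` instead.

```python
from itertools import zip_longest


def side_by_side(columns, width=26):
    """Lay out several multi-line strings next to each other, one column per string."""
    names = list(columns)
    rows = zip_longest(*(columns[name].splitlines() for name in names), fillvalue="")
    lines = [" | ".join(name.ljust(width) for name in names).rstrip()]
    for row in rows:
        lines.append(" | ".join(cell.ljust(width) for cell in row).rstrip())
    return "\n".join(lines)


# Three lines per model, copied from the recognition results above
sample = {
    "de": "Plantar Schlechterianae,\nReg Natal\nIn gram, tr ollorityburg",
    "la": "Plantar Schlichterianae.\nReque Nutat\nnam.  Moretgbures.",
    "general": "Plantar Schlechterianac,\nReg Natal\nIn gram, tt. Moritzburg",
}
print(side_by_side(sample))
```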

On the other hand, here we have some notary pages from the Archief Amsterdam, in English. Although this may look hastily written, it is actually quite neat handwriting compared with the other documents in their archive.

In [160]:
!kraken -i "data/ocr_data/d837ae03-b2c5-6b6d-e053-b784100acdee_en.jpg" "data/ocr_data/output/d837ae03-b2c5-6b6d-e053-b784100acdee_en_McCATMuS_nfd_nofix_V1.txt" segment -bl ocr -m McCATMuS_nfd_nofix_V1.mlmodel

Loading ANN /home/genta/Documents/notebooks_cs/.venv/lib/python3.8/site-packages/kraken/blla.mlmodel ✓
Loading ANN McCATMuS_nfd_nofix_V1.mlmodel ✓
Segmenting ✓
Processing ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 61/61 0:00:00 0:00:21
Writing recognition results for data/ocr_data/d837ae03-b2c5-6b6d-e053-b784100acdee_en.jpg ✓


In [161]:
with open("data/ocr_data/output/d837ae03-b2c5-6b6d-e053-b784100acdee_en_McCATMuS_nfd_nofix_V1.txt") as f:
    notary_res = f.read()
    print(notary_res)

Anoall Men bychese jor esente
Phat me Braunsberg Tluppel, sacsch
&s somp ? of the fity o Amsterdaine, in
Nolland Merchants, de hercoy spatee
constitute and appoinits So hn sheophilus
Daubuz of Londoir 60q^r our true and.
larfull Attoney for tis and in our Names
and bchalf to absign and Eranster unto
anq Person et Persons erhosoerer all
oxanq Fart of sen Thousand Dollars
six per sentum Atocte contairded in
Me
thé sne pollon inq Certificales, Virxt
D
Se tour Landsandselein dndesdayn
aforesaid thie Smenty fourth doin of
Oune, wthe Gear ofour Lord One
Thousand Vereuf Hbundred and
Nincty seren,
Séales and decivéred,
exjoresence of
43
FtByl-Py.
25
1
1
Lelroosesd
Vhaunsoerg hupeshiuesatin
(Eenter tothe Notary K. Mglens
6
Bert Enowre, shat onthe Frventy fourths
dayof suné One shousand Séron
vandred and sinety Leren, before me
Anthong Rglnes, Nctary publie and
tranelater bn larfult authorite du ly
sommisfioned appointed and srrorn
Résiding aud practisinq in the fity o
Amsterdai, same thed

This didn't work very well, in large part due to the fact that the model was not trained on this handwriting.

As you can see, handwritten text recognition is a much more difficult task. For most use cases, Transkribus will do just fine. But if you have a lot of data and want to feed the results back into your own pipeline, it becomes necessary to use kraken models and fine-tune them yourself, because Transkribus does not make it easy to (a) upload your own transcribed data or (b) download its transcriptions in a format that is useful within your own pipeline.

## Text to TEI XML with spaCy NER
Humanities research often works with XML files, and NER output fits nicely into a typical XML structure. Here we take a text file, convert it to a TEI XML file, run NER on it and save the tagged result as a second TEI XML file.

In [None]:
#!curl -o output.xml -F upload=@grs-002_1984_76__277_d.txt https://teigarage.tei-c.org/ege-webservice/Conversions/txt%3Atext%3Aplain/odt%3Aapplication%3Avnd.oasis.opendocument.text/TEI%3Atext%3Axml/conversion

In [182]:
txt_file = "data/embedding_data/grs-001_1921_13__298_d.txt"
output_file = "data/ocr_data/output/grs-001_1921_13__298_d_tei.xml"
output_file_ner = "data/ocr_data/output/grs-001_1921_13__298_d_tei_ner.xml"

In [None]:
nlp = spacy.load("de_core_news_lg")

In [183]:
def create_tei_from_txt(txt_file, output_file, paragraph_delimiter="\n", page_delimiter="\n\n"):
    #TODO add line breaks?
    with open(txt_file, 'r', encoding='utf-8') as f:
        text = f.read()
    
    pages = text.split(page_delimiter)
    paragraphs = [x.split(paragraph_delimiter) for x in pages]

    tei = Element('TEI')  # root element of a TEI document
    SubElement(tei, 'teiHeader')  # metadata would go here; left empty in this example
    text_section = SubElement(tei, 'text')
    body = SubElement(text_section, 'body')
    
    for page in paragraphs:
        SubElement(body, 'pb')  # <pb/> is an empty milestone that marks the start of a page
        for paragraph in page:
            p_para = SubElement(body, 'p')  # paragraph element, a sibling of the page break
            p_para.text = paragraph
    
    # Generate the output XML file
    tree = ElementTree(tei)
    tree.write(output_file, encoding='utf-8', xml_declaration=True)
    
    print(f"TEI file created: {output_file}")

In [187]:
create_tei_from_txt(txt_file, output_file)

TEI file created: data/ocr_data/output/grs-001_1921_13__298_d_tei.xml


In [241]:
def create_ner_tei_from_tei(input_file, output_file):
    with open(input_file, 'r', encoding='utf-8') as f:
        xml_doc = f.read()

    soup = BeautifulSoup(xml_doc, "xml")

    paragraphs = soup.find_all(string=True)
    for entry in paragraphs:

        doc = nlp(entry.text)
        newtext = entry
        last_tag = ""
        running_total = 0
        for i,ent in enumerate(doc.ents):
            start = ent.start_char + running_total
            end = ent.end_char + running_total
            entity_text = ent.text
            entity_label = ent.label_

            if entity_label == "PER":
                tag = "persName"  # TEI element for personal names
            elif entity_label == "ORG":
                tag = "orgName"
            elif entity_label == "GPE" or entity_label == "LOC":
                tag = "placeName"
            else:
                tag = entity_label
            
            newtext = newtext[:start] + "<"+tag+">"+entity_text+"</"+tag+">" + newtext[end:]
            last_tag = tag
            running_total += (5+2*len(last_tag))
        
        entry.replace_with(BeautifulSoup(newtext, features="html.parser"))
    
    with open(output_file, 'w') as f:
        f.write(soup.prettify())

In [242]:
create_ner_tei_from_tei(output_file, output_file_ner)

I know the results look somewhat disappointing here; for instance, "Zwangsmassregeln" being tagged as a person is less than ideal.
The NER notebook elaborates on why that might be happening and what to do about it. For now, we simply take the results at face value.
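
The function above splices tags into the text as raw strings and re-parses them with `html.parser`. Python's HTML parser lower-cases tag names, and characters such as `&` are not escaped along the way. An alternative is to build the elements directly, as in the sketch below. It uses only `xml.etree.ElementTree`, takes entity spans as plain `(start, end, label)` tuples (which you could build from `ent.start_char`, `ent.end_char` and `ent.label_`), and falls back to the generic TEI `<name>` element for other labels. The function name, the fallback and the example sentence are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

LABEL_TO_TEI = {"PER": "persName", "ORG": "orgName", "LOC": "placeName", "GPE": "placeName"}


def tag_entities(paragraph, spans):
    """Wrap non-overlapping character spans of paragraph.text in TEI name elements."""
    text = paragraph.text or ""
    cursor = 0
    previous = None
    for start, end, label in sorted(spans):
        between = text[cursor:start]
        if previous is None:
            paragraph.text = between   # text before the first entity
        else:
            previous.tail = between    # text between two entities
        previous = ET.SubElement(paragraph, LABEL_TO_TEI.get(label, "name"))
        previous.text = text[start:end]
        cursor = end
    if previous is not None:
        previous.tail = text[cursor:]  # text after the last entity
    return paragraph


# Made-up example sentence; the spans stand in for spaCy's doc.ents
sentence = "Der Schweizerische Gewerkschaftsbund tagte in Bern."


def span_of(needle, label):
    start = sentence.find(needle)
    return (start, start + len(needle), label)


p = ET.Element("p")
p.text = sentence
tag_entities(p, [span_of("Schweizerische Gewerkschaftsbund", "ORG"), span_of("Bern", "LOC")])
print(ET.tostring(p, encoding="unicode"))
```

Because the text is assigned to `.text` and `.tail` rather than concatenated into markup, the serializer takes care of escaping and the tag names keep their case.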