# Speed up GT generation
[OCR4all](https://github.com/OCR4all) is a great, easy-to-use tool for high-quality OCR of old books. To get good results (over 99% correctness is possible), ground truth (GT) must be generated. Within the OCR4all workflow, this is done with LAREX. LAREX allows for very detailed annotation, but as of now it doesn't provide any batch editing features.

The following script offers a very simple way to select and correct particularly interesting lines. Lines are selected with a regular expression, which is useful for focusing on certain patterns during GT creation. For example, if your models perform poorly on numbers, you can select only lines containing numbers (`regex = r"\d+"`); if you notice that 'ſſ' is often transcribed as 'ſf', you can filter these lines with `regex = r"(fſ|ſf)"`, and so on.
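
As a minimal sketch of how such a filter behaves, the snippet below applies both example patterns to a few made-up OCR lines. The sample strings and the helper name `select_lines` are invented for illustration and are not part of the script.

```python
import re

def select_lines(lines, regex):
    """Return the lines in which the pattern occurs at least once."""
    return [line for line in lines if re.search(regex, line)]

# Made-up OCR output, for illustration only
sample_lines = [
    "Anno 1648 den 24. October",
    "daſſelbe Waſſer",
    "daſfelbe Waſſer",
    "ein gemeines Wort",
]

numbers = select_lines(sample_lines, r"\d+")
long_s_errors = select_lines(sample_lines, r"(fſ|ſf)")
print(numbers)
print(long_s_errors)
```

Because `re.search` looks for the pattern anywhere in the line, a single match is enough for the line to be selected.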

In [None]:
# Import necessary modules
from glob import glob
from lxml import etree
from matplotlib import pyplot as plt
import re
from shapely.geometry import Polygon
import matplotlib.image as mpimg
import numpy as np
import shutil
import os
import pyperclip

In [None]:
# Provide the path to the OCR4all project directory (contains folders `input` and `processing`):
project_directory = ""

In [None]:
pagexml = glob(project_directory+"/processing/*.xml")
pagexml.sort()

In [None]:
regex = r"\d+"

The script shows each line snippet together with its OCR transcription. The OCR transcription is also copied to the clipboard so it can be pasted easily (Ctrl+V). This is a pragmatic workaround, as Python's `input()` function doesn't allow prefill. If you don't want to alter the OCR (and consequently generate no GT), hit Enter. If you want to delete the line, type a space followed by `d`.
If changes are made, the original PageXML is overwritten, but a backup is written to the `backup_processing` directory. It is, of course, wise to also track changes with a tool such as git.
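
The input conventions described above can be sketched as a small decision function. The name `classify_input` and the three action labels are hypothetical and only serve to illustrate the logic. As an assumption, this sketch accepts the delete marker both with and without a trailing space.

```python
def classify_input(new_gt):
    """Map the typed text to one of three actions: 'delete', 'skip' or 'accept'."""
    # Delete marker: a space followed by "d" (a trailing space is tolerated here)
    if new_gt in (" d", " d "):
        return "delete"
    # Empty input (just Enter) or a single character leaves the line untouched
    if len(new_gt) < 2:
        return "skip"
    # Anything else is taken as the corrected transcription
    return "accept"

for typed in ["", " d", "Anno 1648"]:
    print(repr(typed), "->", classify_input(typed))
```

Using a leading space in the delete marker keeps it distinguishable from a real transcription, which would not normally begin with whitespace.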

In [None]:
for page in pagexml[12:]:  # the slice skips the first 12 pages; use `pagexml` to process all pages
    print(page)
    img_basename = page.split('/')
    tree = etree.parse(page)
    NSMAP = {'xmlns' : tree.xpath('namespace-uri(.)')}
    page_elem = tree.find('xmlns:Page', NSMAP)
    # Read the attribute by name instead of relying on attribute order
    imageFilename = page_elem.get('imageFilename')
    img = mpimg.imread(project_directory+"/input/"+imageFilename)
    lines = tree.findall('.//xmlns:TextLine', NSMAP)
    changed = False # Switch
    for l in lines:
        namespace = l.xpath('namespace-uri(.)')
        # GT is read from TextEquiv index 0, the OCR output from index 1;
        # findtext returns None when the element is missing
        gt = l.findtext('xmlns:TextEquiv[@index="0"]/xmlns:Unicode', namespaces=NSMAP)
        ocr = l.findtext('xmlns:TextEquiv[@index="1"]/xmlns:Unicode', namespaces=NSMAP) or ""
        line_coords = l.find('xmlns:Coords', NSMAP).attrib['points'].split(' ')
        vertices = [list(map(int,vertex.split(','))) for vertex in line_coords]
        vertices.append(vertices[0])
        xs,ys = zip(*vertices)
        polygon = Polygon(vertices)
        bbox = polygon.bounds
        # Treat a missing OCR text as an empty string; an invalid regex now raises instead of failing silently
        match = re.search(regex, ocr or "")
        if not gt and match:
            print(f"{page}\t{imageFilename}")
            print(bbox)
            try:
                line_snippet = img[int(bbox[1]):int(bbox[3]),int(bbox[0]):int(bbox[2])]
                plt.imshow(line_snippet)
                plt.show()
                pyperclip.copy(ocr)
                print(ocr)
                new_gt = input()
                if new_gt in (" d", " d "):  # delete marker: space + d, optionally followed by a space
                    l.getparent().remove(l)
                    changed = True
                elif len(new_gt) < 2:
                    pass
                else:
                    gt_elem = etree.SubElement(l,'{'+namespace+'}'+'TextEquiv', index="0")
                    gt_unicode = etree.SubElement(gt_elem,'{'+namespace+'}'+'Unicode')
                    gt_unicode.text = new_gt
                    changed = True
            except Exception as e:
                print(e)
            plt.clf()
    if changed:
        # Back up the PageXML before overwriting it
        new_dir = project_directory+"/backup_processing/"
        os.makedirs(new_dir, exist_ok=True)
        shutil.copy(page,new_dir+img_basename[-1])
        # Write the modified tree back to the original PageXML file
        with open(page, 'wb') as f:
            tree.write(f, pretty_print=False)