# Automatic Correction of ALTO XML Manuscripts from Aligned Printed Sources

Carmen Carrasco Lujan, Université Libre de Bruxelles  
carmen.carrasco@ulb.be

This notebook aims to develop a complete pipeline for the automatic correction of a manuscript in ALTO format (which has been previously segmented and transcribed automatically via eScriptorium), based on the printed version of the same manuscript (also in ALTO format).

The process has four main steps:

**I. Data preparation**  
**II. Tokenization**  
**III. Alignment**  
**IV. Correction of the manuscript in ALTO format**

In [1]:
# First of all, we install the necessary packages
!pip install collatex
!pip install python-Levenshtein
!pip install pandas

# Install Graphviz from: https://graphviz.org/download/



## [Diagram]

**I.  Data preparation**  
printversion --> printversion_normalized --> printversion_cleaned

manuscript --> 
footnotes_manuscript ; nofootnotes_manuscript  
printversion_cleaned -->  
footnotes_print ; nofootnotes_print  

**II. Tokenization**  

footnotes_manuscript_tokenized   ---> footnotes_manuscript_tokenized_joined  
nofootnotes_manuscript_tokenized  ---> nofootnotes_manuscript_tokenized_joined  
footnotes_print_tokenized
nofootnotes_print_tokenized

**III. Alignment**   
nofootnotes_collation.csv --> nofootnotes_collation_filtered.csv  
footnotes_collation.csv --> footnotes_collation_filtered.csv  

**IV. Correction of the manuscript in ALTO format**

Correction of the joined tokenized version in: manuscript_nofootnotes_joined_corrected ;  manuscript_footnotes_joined_corrected   
Re-hyphenation: manuscript_nofootnotes_json_final, manuscript_footnotes_json_final  
Re-creation of ALTOs: nofootnotes_manuscript_final, footnotes_manuscript_final  
Put together in: manuscript_final



## I. Data preparation
### 1. Automatic correction of the print

If the printed version comes from eScriptorium or from a PDF source, it may contain errors or undesired characters that we must clean up. For our printed documents, we focus on two main types of correction:

**1.1 Character normalization**  
Since our printed text was OCR-processed with NFKD normalization (e.g. using the Catmus-PRINT large model), accented letters may be split into a base character plus a combining mark. We’ll re-normalize the entire text to NFC so that each accent is represented as a single composed code point.
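
As a minimal illustration of what this re-normalization does, the sketch below composes a decomposed string with the standard `unicodedata` module. The sample word is an assumption chosen for the example, not taken from the corpus. It also shows that NFC only recomposes canonical sequences: a ligature already expanded by NFKD stays expanded.

```python
import unicodedata

# "été" written as base letters followed by combining acute accents (U+0301)
decomposed = "e\u0301te\u0301"
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), len(composed))   # code point counts before and after
print(composed == "\u00e9t\u00e9")      # True: each accent is now one code point

# NFKD also expands compatibility characters such as the "fi" ligature (U+FB01);
# NFC does not restore them afterwards.
expanded = unicodedata.normalize("NFKD", "\ufb01")
restored = unicodedata.normalize("NFC", expanded)
print(expanded, restored)
```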

**1.2 Line-end hyphenation**  
The OCR model we used inserts a special character, the not sign "¬" (U+00AC), at the end of a line to indicate that a word has been split across lines. These hyphenation marks should be removed and the broken words recombined. 
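
The recombination can be sketched on plain strings before applying it to ALTO `String` elements. The helper name `merge_hyphenated` and the sample lines are assumptions made for this illustration only.

```python
def merge_hyphenated(lines, marker="¬"):
    """Move the first word of the following line up to a line ending with `marker`."""
    merged = list(lines)
    for i in range(len(merged) - 1):
        if merged[i].endswith(marker):
            # Split the next line into its first word and the remainder
            head, _, rest = merged[i + 1].lstrip().partition(" ")
            merged[i] = merged[i][:-len(marker)] + head
            merged[i + 1] = rest.lstrip()
    return merged

result = merge_hyphenated(["vers le mi¬", "lieu du siècle"])
print(result)
```

The word is completed on the line where it starts, so the following line becomes shorter; this is the same design choice as in the cell below.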


In [99]:
# 1.1. Normalize the ALTO files of the printed version and the manuscript version:

import os
import unicodedata
import xml.etree.ElementTree as ET

folders = [
    ("printversion", "printversion_normalized"),
    ("manuscript", "manuscript_normalized"),
]

for input_folder, output_folder in folders:
    os.makedirs(output_folder, exist_ok=True)

    for filename in os.listdir(input_folder):
        if filename.endswith(".xml"):
            input_path = os.path.join(input_folder, filename)
            output_path = os.path.join(output_folder, filename)

            tree = ET.parse(input_path)
            root = tree.getroot()

            for string_elem in root.iter():
                if string_elem.tag.endswith("String") and "CONTENT" in string_elem.attrib:
                    original = string_elem.attrib["CONTENT"]
                    normalized = unicodedata.normalize("NFC", original)
                    string_elem.set("CONTENT", normalized)

            tree.write(output_path, encoding="utf-8", xml_declaration=True)

    print(f"All ALTO files have been normalized and saved in: {output_folder}")

All ALTO files have been normalized and saved in: printversion_normalized
All ALTO files have been normalized and saved in: manuscript_normalized


In [100]:
# 1.2. Fix hyphenated words at the end of lines in the print version

import re

input_folder = "printversion_normalized"
output_folder = "printversion_cleaned"
os.makedirs(output_folder, exist_ok=True)

# Namespace definition 
namespace = {'ns': 'http://www.loc.gov/standards/alto/ns-v4#'}

for filename in os.listdir(input_folder):
    if filename.endswith(".xml"):
        input_path = os.path.join(input_folder, filename)
        output_path = os.path.join(output_folder, filename)

        tree = ET.parse(input_path)
        root = tree.getroot()

        # Find all <String> elements in document order
        string_elements = root.findall(".//ns:String", namespace)

        i = 0
        while i < len(string_elements) - 1:
            current_elem = string_elements[i]
            current_text = current_elem.attrib.get("CONTENT", "")

            # Check if line ends with '¬'
            if current_text.endswith("¬"):
                # Remove the ¬ 
                current_text = current_text[:-1]

                # Get the next element
                next_elem = string_elements[i + 1]
                next_text = next_elem.attrib.get("CONTENT", "")

                # Split the next line into its first word and the remainder
                match = re.match(r"(\S+)(\s?.*)", next_text)
                if match:
                    prefix = match.group(1)
                    suffix = match.group(2)

                    # Merge 
                    current_elem.set("CONTENT", current_text + prefix)

                    # Remove the element from the next line
                    next_elem.set("CONTENT", suffix.lstrip())
                else:
                    # Next line is empty or starts with whitespace: append it whole and empty it
                    current_elem.set("CONTENT", current_text + next_text)
                    next_elem.set("CONTENT", "")

                i += 1  # move to the next line (don't skip further)
            else:
                i += 1  # when no hyphenation

        # Save XML files
        tree.write(output_path, encoding="utf-8", xml_declaration=True)

print("Hyphenations corrected in:", output_folder)

Hyphenations corrected in: printversion_cleaned


### 2. Extraction of "InterlinearLine", "MarginText:lower" and "NumberingZone"



In [101]:
# Extraction of interlinear lines

from glob import glob

input_dir = "manuscript_normalized"
interlinear_dir = "interlinear_manuscript"
nointerlinear_dir = "nointerlinear_manuscript"

tag_ids = {"LT50", "LT51"}

# Namespace ALTO
ns = {'alto': 'http://www.loc.gov/standards/alto/ns-v4#'}
ET.register_namespace('', ns['alto'])

def process_alto_files(input_folder, output_interlinear, output_nointerlinear):
    os.makedirs(output_interlinear, exist_ok=True)
    os.makedirs(output_nointerlinear, exist_ok=True)
    alto_files = sorted(glob(os.path.join(input_folder, "*.xml")))

    for file_path in alto_files:
        tree = ET.parse(file_path)
        root = tree.getroot()

        # only interlinear
        interlinear_tree = ET.ElementTree(ET.fromstring(ET.tostring(root)))
        for textblock in interlinear_tree.findall('.//alto:TextBlock', ns):
            for line in list(textblock.findall('alto:TextLine', ns)):
                if not any(tag_id in line.attrib.get("TAGREFS", "") for tag_id in tag_ids):
                    textblock.remove(line)
        interlinear_out = os.path.join(output_interlinear, os.path.basename(file_path))
        interlinear_tree.write(interlinear_out, encoding="utf-8", xml_declaration=True)

        # without interlinear
        nointerlinear_tree = ET.ElementTree(ET.fromstring(ET.tostring(root)))
        for textblock in nointerlinear_tree.findall('.//alto:TextBlock', ns):
            for line in list(textblock.findall('alto:TextLine', ns)):
                if any(tag_id in line.attrib.get("TAGREFS", "") for tag_id in tag_ids):
                    textblock.remove(line)
        nointerlinear_out = os.path.join(output_nointerlinear, os.path.basename(file_path))
        nointerlinear_tree.write(nointerlinear_out, encoding="utf-8", xml_declaration=True)

    print(f"Only interlinear lines saved in: {output_interlinear}")
    print(f"Files without interlinear lines saved in: {output_nointerlinear}")

# execute
process_alto_files(
    input_folder=input_dir,
    output_interlinear=interlinear_dir,
    output_nointerlinear=nointerlinear_dir
)


Only interlinear lines saved in: interlinear_manuscript
Files without interlinear lines saved in: nointerlinear_manuscript
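
One caveat of the substring test on `TAGREFS` used above: an ID such as `LT50` is also contained in `LT500`. The sketch below, run on a small in-memory ALTO fragment invented for the example, compares the substring test with an exact match on the whitespace-separated IDs. The helper `has_tag` is hypothetical and not part of the pipeline.

```python
import xml.etree.ElementTree as ET

ALTO_NS = {"alto": "http://www.loc.gov/standards/alto/ns-v4#"}

# Invented sample: l1 is interlinear (LT50), l2 carries a different tag (LT500)
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#">
  <Layout><Page><PrintSpace>
    <TextBlock ID="b1">
      <TextLine ID="l1" TAGREFS="LT50"><String CONTENT="interlinear addition"/></TextLine>
      <TextLine ID="l2" TAGREFS="LT500"><String CONTENT="main text"/></TextLine>
      <TextLine ID="l3"><String CONTENT="untagged line"/></TextLine>
    </TextBlock>
  </PrintSpace></Page></Layout>
</alto>"""

def has_tag(elem, wanted_ids):
    """True if any whitespace-separated ID in TAGREFS is in wanted_ids."""
    return bool(set(elem.attrib.get("TAGREFS", "").split()) & set(wanted_ids))

root = ET.fromstring(SAMPLE)
lines = root.findall(".//alto:TextLine", ALTO_NS)

exact = [l.get("ID") for l in lines if has_tag(l, {"LT50", "LT51"})]
substring = [l.get("ID") for l in lines
             if any(t in l.get("TAGREFS", "") for t in ("LT50", "LT51"))]

print(exact)      # only the line tagged exactly LT50
print(substring)  # also picks up LT500
```

If the tag IDs in a given export never share a prefix, both tests give the same result.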


In [102]:
# Extraction of numbering zones (TAGREFS="BT85")

input_dir = "nointerlinear_manuscript"
numbering_dir = "numbering_manuscript"
nonumbering_dir = "nonumbering_manuscript"
tagref = "BT85"

# Namespace
ns = {'alto': 'http://www.loc.gov/standards/alto/ns-v4#'}
ET.register_namespace('', ns['alto'])

def process_alto_files(input_folder, tagref, output_numbering, output_nonumbering):
    os.makedirs(output_numbering, exist_ok=True)
    os.makedirs(output_nonumbering, exist_ok=True)
    alto_files = sorted(glob(os.path.join(input_folder, "*.xml")))

    for file_path in alto_files:
        tree = ET.parse(file_path)
        root = tree.getroot()
        page = root.find('.//alto:Page', ns)
        printspace = page.find('alto:PrintSpace', ns) if page is not None else None

        # only numbering version
        numbering_tree = ET.ElementTree(ET.fromstring(ET.tostring(root)))
        numbering_ps = numbering_tree.find('.//alto:PrintSpace', ns)
        for elem in list(numbering_ps):
            if not (elem.tag.endswith("TextBlock") and tagref in elem.attrib.get("TAGREFS", "")):
                numbering_ps.remove(elem)
        numbering_out = os.path.join(output_numbering, os.path.basename(file_path))
        numbering_tree.write(numbering_out, encoding="utf-8", xml_declaration=True)

        # without numbering
        nonumbering_tree = ET.ElementTree(ET.fromstring(ET.tostring(root)))
        nonumbering_ps = nonumbering_tree.find('.//alto:PrintSpace', ns)
        for elem in list(nonumbering_ps):
            if elem.tag.endswith("TextBlock") and tagref in elem.attrib.get("TAGREFS", ""):
                nonumbering_ps.remove(elem)
        nonumbering_out = os.path.join(output_nonumbering, os.path.basename(file_path))
        nonumbering_tree.write(nonumbering_out, encoding="utf-8", xml_declaration=True)

    print(f"Numbering zones saved in: {output_numbering}")
    print(f"ALTOs without numbering saved in: {output_nonumbering}")


# Execute
process_alto_files(
    input_folder=input_dir,
    tagref=tagref,
    output_numbering=numbering_dir,
    output_nonumbering=nonumbering_dir
)


Numbering zones saved in: numbering_manuscript
ALTOs without numbering saved in: nonumbering_manuscript


In [103]:
# Extraction of ALTO "MarginText:lower" zones

input_dirs = {
    "manuscript": "nonumbering_manuscript",
    "print": "printversion_cleaned"
}
footnote_dirs = {
    "manuscript": "footnotes_manuscript",
    "print": "footnotes_print"
}
nofootnote_dirs = {
    "manuscript": "nofootnotes_manuscript",
    "print": "nofootnotes_print"
}
tagrefs = {
    "manuscript": "BT88",   # every manuscript or print can have a different reference !
    "print": "BT80"
}

# Namespace
ns = {'alto': 'http://www.loc.gov/standards/alto/ns-v4#'}
ET.register_namespace('', ns['alto'])


def process_alto_files(input_folder, tagref, output_footnotes, output_nofootnotes):
    os.makedirs(output_footnotes, exist_ok=True)
    os.makedirs(output_nofootnotes, exist_ok=True)
    alto_files = sorted(glob(os.path.join(input_folder, "*.xml")))

    for file_path in alto_files:
        tree = ET.parse(file_path)
        root = tree.getroot()
        page = root.find('.//alto:Page', ns)
        printspace = page.find('alto:PrintSpace', ns) if page is not None else None

        # Create footnote-only version 
        footnote_tree = ET.ElementTree(ET.fromstring(ET.tostring(root)))
        footnote_ps = footnote_tree.find('.//alto:PrintSpace', ns)
        for elem in list(footnote_ps):
            if not (elem.tag.endswith("TextBlock") and tagref in elem.attrib.get("TAGREFS", "")):
                footnote_ps.remove(elem)
        footnote_out = os.path.join(output_footnotes, os.path.basename(file_path))
        footnote_tree.write(footnote_out, encoding="utf-8", xml_declaration=True)

        # Create no-footnote version 
        nofoot_tree = ET.ElementTree(ET.fromstring(ET.tostring(root)))
        nofoot_ps = nofoot_tree.find('.//alto:PrintSpace', ns)
        for elem in list(nofoot_ps):
            if elem.tag.endswith("TextBlock") and tagref in elem.attrib.get("TAGREFS", ""):
                nofoot_ps.remove(elem)
        nofoot_out = os.path.join(output_nofootnotes, os.path.basename(file_path))
        nofoot_tree.write(nofoot_out, encoding="utf-8", xml_declaration=True)

    print(f"Footnotes saved in: {output_footnotes}")
    print(f"ALTOs without footnotes saved in: {output_nofootnotes}")

# Execute 
for version in ["manuscript", "print"]:
    process_alto_files(
        input_folder=input_dirs[version],
        tagref=tagrefs[version],
        output_footnotes=footnote_dirs[version],
        output_nofootnotes=nofootnote_dirs[version]
    )


Footnotes saved in: footnotes_manuscript
ALTOs without footnotes saved in: nofootnotes_manuscript
Footnotes saved in: footnotes_print
ALTOs without footnotes saved in: nofootnotes_print
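
Since the tag ID can differ from one document to another, it may be convenient to look it up by label instead of hardcoding it. The sketch below assumes that the export declares its region and line types in a `<Tags>` section whose entries carry `ID` and `LABEL` attributes, as ALTO 4 allows; the sample fragment, its labels and the helper `tag_ids_by_label` are assumptions made for illustration and should be checked against the actual files.

```python
import xml.etree.ElementTree as ET

# Invented sample of a <Tags> section; real IDs and labels depend on the export
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#">
  <Tags>
    <OtherTag ID="BT85" LABEL="NumberingZone"/>
    <OtherTag ID="BT88" LABEL="MarginText:lower"/>
    <OtherTag ID="LT50" LABEL="InterlinearLine"/>
  </Tags>
</alto>"""

def tag_ids_by_label(root):
    """Map each LABEL declared in a *Tag element to its ID."""
    mapping = {}
    for elem in root.iter():
        # Matches OtherTag, LayoutTag, etc., but not the <Tags> container itself
        if elem.tag.endswith("Tag") and "LABEL" in elem.attrib and "ID" in elem.attrib:
            mapping[elem.attrib["LABEL"]] = elem.attrib["ID"]
    return mapping

labels = tag_ids_by_label(ET.fromstring(SAMPLE))
print(labels["MarginText:lower"])
```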


## II. Tokenization 

1. Tokenization of all 4 folders
2. Creation of a version of the tokenized manuscript (JSON) without hyphenations 
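
To give an idea of the token classes involved, here is a deliberately simplified sketch of the tokenization. It is not the tokenizer used in this notebook: it ignores dot clusters and only distinguishes bracketed spans, one-letter abbreviations such as "H.", words, and single symbols. The names `TOKEN_RE` and `simple_tokenize` are assumptions for the example.

```python
import re

# Order matters: bracketed span, one-character abbreviation, word, any other symbol
TOKEN_RE = re.compile(r"<[^<>]*>|\w\.|\w+|\S")

def simple_tokenize(text):
    """Split a line into tokens; whitespace is dropped."""
    return TOKEN_RE.findall(text)

print(simple_tokenize("Voy. H. PIRENNE, Histoire de Belgique"))
print(simple_tokenize("le mi¬"))
print(simple_tokenize("du <XI> siècle"))
```

Keeping "¬" as its own token is what later allows the hyphenated halves to be joined at token level.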

In [104]:
# tokenization in JSON files

import json

input_folders = [
    "footnotes_manuscript",
    "nofootnotes_manuscript",
    "footnotes_print",
    "nofootnotes_print"
]

import re

def tokenizer(text):
    tokens = []
    i, n = 0, len(text)

    # dots
    dot_chars = '.…·'
    DOTS = re.escape(dot_chars)

    segA = rf'(?:[<>]|\w+)?[{DOTS}]{{2,}}(?:\w+)?(?:[<>])?'
    
    abbr = r'\w\.'

    cluster_re = re.compile(rf'^(?:{segA}|{abbr})(?:\s+(?:{segA}|{abbr}))+', re.UNICODE)
  
    segA_re = re.compile(rf'^{segA}', re.UNICODE)
    abbr_re = re.compile(rf'^{abbr}', re.UNICODE)

    while i < n:
        if text[i].isspace():
            i += 1
            continue

        m = cluster_re.match(text[i:])
        if m:
            tokens.append(m.group(0))
            i += len(m.group(0))
            continue

        ch = text[i]

        if ch == '<':
            next_gt = text.find('>', i + 1)
            next_lt = text.find('<', i + 1)
            if next_gt != -1 and (next_lt == -1 or next_gt < next_lt):
                tokens.append(text[i:next_gt + 1])   # includes the angle brackets
                i = next_gt + 1
                continue

        m = segA_re.match(text[i:])
        if m:
            tokens.append(m.group(0))
            i += len(m.group(0))
            continue

        m = abbr_re.match(text[i:])
        if m:
            tokens.append(m.group(0))
            i += len(m.group(0))
            continue

        if ch == '¬':
            tokens.append('¬')
            i += 1
            continue

        m = re.match(r'^\w+', text[i:], re.UNICODE)
        if m:
            tokens.append(m.group(0))
            i += len(m.group(0))
            continue

        tokens.append(ch)
        i += 1

    return tokens


for alto_folder in input_folders:
    output_folder = f"{alto_folder}_tokenized"
    os.makedirs(output_folder, exist_ok=True)
    alto_files = sorted(glob(os.path.join(alto_folder, "*.xml")))

    for alto_path in alto_files:
        tree = ET.parse(alto_path)
        root = tree.getroot()
        ns = {'a': 'http://www.loc.gov/standards/alto/ns-v4#'}
        ET.register_namespace('', ns['a'])

        strings = root.findall(".//a:String", ns)
        line_data = []

        for i, elem in enumerate(strings):
            content = elem.attrib.get("CONTENT", "").strip()
            tokens = tokenizer(content)
            line_data.append({
                "line": i + 1,
                "original": content,
                "tokens": tokens
            })

        json_filename = os.path.basename(alto_path).replace(".xml", ".json")
        json_path = os.path.join(output_folder, json_filename)

        with open(json_path, "w", encoding="utf-8") as f:
            json.dump(line_data, f, ensure_ascii=False, indent=2)

    print(f"Tokenization saved in {output_folder}")

Tokenization saved in footnotes_manuscript_tokenized
Tokenization saved in nofootnotes_manuscript_tokenized
Tokenization saved in footnotes_print_tokenized
Tokenization saved in nofootnotes_print_tokenized


In [105]:
# 2. Creation of a version of the tokenized manuscript (JSON) without hyphenations 


input_folders = ["footnotes_manuscript_tokenized", "nofootnotes_manuscript_tokenized"]

for input_folder in input_folders:
    output_folder = f"{input_folder}_joined"
    os.makedirs(output_folder, exist_ok=True)

    for file_path in sorted(glob(os.path.join(input_folder, "*.json"))):
        with open(file_path, encoding="utf-8") as f:
            lines = json.load(f)

        i = 0
        while i < len(lines) - 1:
            current = lines[i]
            next_line = lines[i + 1]
            tokens = current["tokens"]

            # If current line ends in ¬ and next line has tokens
            if len(tokens) > 1 and tokens[-1] == "¬" and next_line["tokens"]:
                # Remove ¬ and merge with first token of next line
                tokens = tokens[:-1]
                tokens[-1] += next_line["tokens"][0]
                current["tokens"] = tokens
                current["original"] = " ".join(tokens)

                # Modify next line
                if len(next_line["tokens"]) == 1:
                    next_line["tokens"] = []
                    next_line["original"] = ""
                else:
                    next_tokens = next_line["tokens"][1:]
                    next_line["tokens"] = next_tokens
                    next_line["original"] = " ".join(next_tokens)

            i += 1

        # Save corrected file
        output_path = os.path.join(output_folder, os.path.basename(file_path))
        with open(output_path, "w", encoding="utf-8") as f:
            json.dump(lines, f, ensure_ascii=False, indent=2)

    print(f"Hyphenation correction saved in: {output_folder}")

Hyphenation correction saved in: footnotes_manuscript_tokenized_joined
Hyphenation correction saved in: nofootnotes_manuscript_tokenized_joined


## III. Alignment

### No footnotes collation

In [106]:
from collatex import Collation, collate

def extract_page_number(filename):
    match = re.search(r"_(\d+)\.json$", filename)
    return int(match.group(1)) if match else -1

def load_tokens_from_json_folder(folder_path):
    files = [f for f in os.listdir(folder_path) if f.endswith('.json')]
    sorted_files = sorted(files, key=extract_page_number)

    tokens = []
    for filename in sorted_files:
        file_path = os.path.join(folder_path, filename)
        with open(file_path, encoding='utf-8') as f:
            lines = json.load(f)
            for line in lines:
                tokens.extend(line["tokens"])
    return tokens

folder_print = 'nofootnotes_print_tokenized'
folder_manuscript = 'nofootnotes_manuscript_tokenized_joined'

tokens_print = load_tokens_from_json_folder(folder_print)
tokens_manuscript = load_tokens_from_json_folder(folder_manuscript)

collation = Collation()
collation.add_witness({"id": "Print", "tokens": [{"t": t} for t in tokens_print]})
collation.add_witness({"id": "Manuscript", "tokens": [{"t": t} for t in tokens_manuscript]})

alignment_table = collate(collation, layout='vertical', near_match=True, segmentation=False)
print(alignment_table)


+------------------+----------------------+
|      Print       |      Manuscript      |
+------------------+----------------------+
|     Chapitre     |       Chapitre       |
+------------------+----------------------+
|        II        |          "           |
+------------------+----------------------+
|      Jusque      |          1           |
+------------------+----------------------+
|       vers       |          7           |
+------------------+----------------------+
|        le        |       pesique        |
+------------------+----------------------+
|      milieu      |         ver          |
+------------------+----------------------+
|        du        |          la          |
+------------------+----------------------+
|        XI        |          Vi          |
+------------------+----------------------+
|        ^         |          li          |
+------------------+----------------------+
|        e         |          en          |
+------------------+------------

In [107]:
# Once the result is satisfactory, export it to a CSV file

csv_result = collate(collation, near_match=True, segmentation=False, output="csv")

with open("nofootnotes_collation.csv", "w", encoding="utf-8") as f:
    f.write(csv_result)
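
For readers who want to see the idea of a two-witness alignment without installing CollateX, the sketch below uses the standard library's `difflib.SequenceMatcher` as a stand-in. It is not the CollateX algorithm: it only anchors on identical tokens and has no near-matching, so it illustrates the shape of the result (paired tokens with gaps) rather than reproducing the tables above. The function `align` and the sample tokens are assumptions for the example.

```python
from difflib import SequenceMatcher

def align(print_tokens, manuscript_tokens):
    """Return (print, manuscript) pairs; an empty string marks a gap."""
    rows = []
    matcher = SequenceMatcher(a=print_tokens, b=manuscript_tokens, autojunk=False)
    for _, i1, i2, j1, j2 in matcher.get_opcodes():
        left = print_tokens[i1:i2]
        right = manuscript_tokens[j1:j2]
        # Pad the shorter side so both columns have the same length
        width = max(len(left), len(right))
        left += [""] * (width - len(left))
        right += [""] * (width - len(right))
        rows.extend(zip(left, right))
    return rows

pairs = align(["Histoire", "de", "Belgique", "."], ["Hutoire", "de", "Belyrque"])
for p, m in pairs:
    print(f"{p:10} | {m}")
```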


### Footnotes collation

In [108]:


def extract_page_number(filename):
    match = re.search(r"_(\d+)\.json$", filename)
    return int(match.group(1)) if match else -1

def load_tokens_from_json_folder(folder_path):
    files = [f for f in os.listdir(folder_path) if f.endswith('.json')]
    sorted_files = sorted(files, key=extract_page_number)

    tokens = []
    for filename in sorted_files:
        file_path = os.path.join(folder_path, filename)
        with open(file_path, encoding='utf-8') as f:
            lines = json.load(f)
            for line in lines:
                tokens.extend(line["tokens"])
    return tokens

folder_print = 'footnotes_print_tokenized'
folder_manuscript = 'footnotes_manuscript_tokenized_joined'

tokens_print = load_tokens_from_json_folder(folder_print)
tokens_manuscript = load_tokens_from_json_folder(folder_manuscript)

collation = Collation()
collation.add_witness({"id": "Print", "tokens": [{"t": t} for t in tokens_print]})
collation.add_witness({"id": "Manuscript", "tokens": [{"t": t} for t in tokens_manuscript]})

alignment_table = collate(collation, layout='vertical', near_match=True, segmentation=False)
print(alignment_table)

+----------------------+----------------------+
|        Print         |      Manuscript      |
+----------------------+----------------------+
|          1.          |          (           |
+----------------------+----------------------+
|         Voy          |         IVoy         |
+----------------------+----------------------+
|          .           |          .           |
+----------------------+----------------------+
|          H.          |          W.          |
+----------------------+----------------------+
|       PIRENNE        |       Sisenme        |
+----------------------+----------------------+
|          .           |          ,           |
+----------------------+----------------------+
|       Histoire       |       Hutoire        |
+----------------------+----------------------+
|          de          |          de          |
+----------------------+----------------------+
|       Belgique       |       Belyrque       |
+----------------------+----------------

In [109]:
# Once the result is satisfactory, export it to a CSV file

csv_result = collate(collation, near_match=True, segmentation=False, output="csv")

with open("footnotes_collation.csv", "w", encoding="utf-8") as f:
    f.write(csv_result)


### CSV Correction

In [142]:
# no footnotes correction
# (1st version)

import csv
import re

input_files = ["nofootnotes_collation.csv"]

def is_punctuation(token):
    return token in [".", ",", ";", ":", "!", "?", "«", "»", "(", ")", "’", "'", "\"", "“", "”", "¬", "^"]

def is_bracket_token(token):
    return bool(re.fullmatch(r"<[^<>]*>", token.strip())) or token.strip() in ["<", ">"]

def levenshtein_distance(s1, s2):
    if len(s1) < len(s2):
        return levenshtein_distance(s2, s1)
    if len(s2) == 0:
        return len(s1)
    previous_row = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1):
        current_row = [i + 1]
        for j, c2 in enumerate(s2):
            insertions = previous_row[j + 1] + 1    
            deletions = current_row[j] + 1          
            substitutions = previous_row[j] + (c1 != c2)
            current_row.append(min(insertions, deletions, substitutions))
        previous_row = current_row
    return previous_row[-1]


def has_small_internal_difference(one, two):
    dist = levenshtein_distance(one, two)
    if dist <= 4:
        if one.startswith(two) or one.endswith(two) or two.startswith(one) or two.endswith(one):
            if abs(len(one) - len(two)) >= 3:  # reject if the lengths differ by 3 or more characters
                return False
        return True
    return False

def process_csv(input_csv):
    output_csv = input_csv.replace(".csv", "_filtered.csv")

    with open(input_csv, newline='', encoding="utf-8") as f:
        reader = csv.reader(f)
        rows = list(reader)

    assert len(rows) >= 2

    print_row = [t.strip() for t in rows[0]]
    manuscript_row = [t.strip() for t in rows[1]]
    filtered_print_row = []

    for p_token, m_token in zip(print_row, manuscript_row):
        if not m_token:
            filtered_print_row.append("")
        elif p_token == m_token:
            filtered_print_row.append("")
        elif is_bracket_token(m_token) or is_punctuation(m_token):
            filtered_print_row.append("")
        elif abs(len(p_token) - len(m_token)) >= 3:
            filtered_print_row.append("")
        elif has_small_internal_difference(m_token, p_token):
            filtered_print_row.append(p_token)
        else:
            filtered_print_row.append("")

    with open(output_csv, "w", newline='', encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(filtered_print_row)
        writer.writerow(manuscript_row)

    print(f"CSV filtered saved in: {output_csv}")

for input_csv in input_files:
    process_csv(input_csv)


CSV filtered saved in: nofootnotes_collation_filtered.csv
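
The effect of the two thresholds (edit distance and length difference) is easier to see on individual token pairs. The sketch below is a simplified version of the filter, without the prefix and suffix rule, applied to three pairs taken from the alignment tables above. The helper `keep_print_token` and its default values are assumptions chosen for the illustration.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                # deletion
                               current[j - 1] + 1,             # insertion
                               previous[j - 1] + (ca != cb)))  # substitution
        previous = current
    return previous[-1]

def keep_print_token(print_tok, ms_tok, max_dist=4, max_len_diff=2):
    """Hypothetical helper: accept the printed token only if it is close enough."""
    if abs(len(print_tok) - len(ms_tok)) > max_len_diff:
        return False
    return levenshtein(print_tok, ms_tok) <= max_dist

examples = [("Histoire", "Hutoire"), ("Belgique", "Belyrque"), ("PIRENNE", "Sisenme")]
for p, m in examples:
    print(p, m, levenshtein(p, m), keep_print_token(p, m))
```

The comparison is case-sensitive, so "PIRENNE" and "Sisenme" share no character and are only accepted once the maximum distance reaches 7.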


In [145]:
# no footnotes correction
# (2nd version: ignoring prefixes, distance 5)


input_files = ["nofootnotes_collation.csv"]

def is_punctuation(token):
    return token in [".", ",", ";", ":", "!", "?", "«", "»", "(", ")", "’", "'", "\"", "“", "”", "¬", "^"]

def is_bracket_token(token):
    return bool(re.fullmatch(r"<[^<>]*>", token.strip())) or token.strip() in ["<", ">"]

def levenshtein_distance(s1, s2):
    if len(s1) < len(s2):
        return levenshtein_distance(s2, s1)
    if len(s2) == 0:
        return len(s1)
    previous_row = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1):
        current_row = [i + 1]
        for j, c2 in enumerate(s2):
            insertions = previous_row[j + 1] + 1    
            deletions = current_row[j] + 1          
            substitutions = previous_row[j] + (c1 != c2)
            current_row.append(min(insertions, deletions, substitutions))
        previous_row = current_row
    return previous_row[-1]

def has_small_internal_difference(one, two):
    dist = levenshtein_distance(one, two)
    if abs(len(one) - len(two)) >= 3:
        return False
    return dist <= 5

def process_csv(input_csv):
    output_csv = input_csv.replace(".csv", "_filtered.csv")

    with open(input_csv, newline='', encoding="utf-8") as f:
        reader = csv.reader(f)
        rows = list(reader)

    assert len(rows) >= 2

    print_row = [t.strip() for t in rows[0]]
    manuscript_row = [t.strip() for t in rows[1]]
    filtered_print_row = []

    for p_token, m_token in zip(print_row, manuscript_row):
        if not m_token:
            filtered_print_row.append("")
        elif p_token == m_token:
            filtered_print_row.append("")
        elif is_bracket_token(m_token) or is_punctuation(m_token):
            filtered_print_row.append("")
        elif abs(len(p_token) - len(m_token)) >= 3:
            filtered_print_row.append("")
        elif has_small_internal_difference(m_token, p_token):
            filtered_print_row.append(p_token)
        else:
            filtered_print_row.append("")

    with open(output_csv, "w", newline='', encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(filtered_print_row)
        writer.writerow(manuscript_row)

    print(f"CSV filtered saved in: {output_csv}")

for input_csv in input_files:
    process_csv(input_csv)


CSV filtered saved in: nofootnotes_collation_filtered.csv


In [157]:
# no footnotes correction
# (3rd version: ignoring prefixes, distance 6)


input_files = ["nofootnotes_collation.csv"]

def is_punctuation(token):
    return token in [".", ",", ";", ":", "!", "?", "«", "»", "(", ")", "’", "'", "\"", "“", "”", "¬", "^"]

def is_bracket_token(token):
    return bool(re.fullmatch(r"<[^<>]*>", token.strip())) or token.strip() in ["<", ">"]

def levenshtein_distance(s1, s2):
    if len(s1) < len(s2):
        return levenshtein_distance(s2, s1)
    if len(s2) == 0:
        return len(s1)
    previous_row = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1):
        current_row = [i + 1]
        for j, c2 in enumerate(s2):
            insertions = previous_row[j + 1] + 1    
            deletions = current_row[j] + 1          
            substitutions = previous_row[j] + (c1 != c2)
            current_row.append(min(insertions, deletions, substitutions))
        previous_row = current_row
    return previous_row[-1]

def has_small_internal_difference(one, two):
    dist = levenshtein_distance(one, two)
    if abs(len(one) - len(two)) >= 3:
        return False
    return dist <= 6

def process_csv(input_csv):
    output_csv = input_csv.replace(".csv", "_filtered.csv")

    with open(input_csv, newline='', encoding="utf-8") as f:
        reader = csv.reader(f)
        rows = list(reader)

    assert len(rows) >= 2

    print_row = [t.strip() for t in rows[0]]
    manuscript_row = [t.strip() for t in rows[1]]
    filtered_print_row = []

    for p_token, m_token in zip(print_row, manuscript_row):
        if not m_token:
            filtered_print_row.append("")
        elif p_token == m_token:
            filtered_print_row.append("")
        elif is_bracket_token(m_token) or is_punctuation(m_token):
            filtered_print_row.append("")
        elif abs(len(p_token) - len(m_token)) >= 3:
            filtered_print_row.append("")
        elif has_small_internal_difference(m_token, p_token):
            filtered_print_row.append(p_token)
        else:
            filtered_print_row.append("")

    with open(output_csv, "w", newline='', encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(filtered_print_row)
        writer.writerow(manuscript_row)

    print(f"CSV filtered saved in: {output_csv}")

for input_csv in input_files:
    process_csv(input_csv)


CSV filtered saved in: nofootnotes_collation_filtered.csv


In [165]:
# no footnotes correction
# (4th version: ignoring prefixes, distance 7, character-count difference: 4)


input_files = ["nofootnotes_collation.csv"]

def is_punctuation(token):
    return token in [".", ",", ";", ":", "!", "?", "«", "»", "(", ")", "’", "'", "\"", "“", "”", "¬", "^"]

def is_bracket_token(token):
    return bool(re.fullmatch(r"<[^<>]*>", token.strip())) or token.strip() in ["<", ">"]

def levenshtein_distance(s1, s2):
    if len(s1) < len(s2):
        return levenshtein_distance(s2, s1)
    if len(s2) == 0:
        return len(s1)
    previous_row = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1):
        current_row = [i + 1]
        for j, c2 in enumerate(s2):
            insertions = previous_row[j + 1] + 1    
            deletions = current_row[j] + 1          
            substitutions = previous_row[j] + (c1 != c2)
            current_row.append(min(insertions, deletions, substitutions))
        previous_row = current_row
    return previous_row[-1]

def has_small_internal_difference(one, two):
    dist = levenshtein_distance(one, two)
    if abs(len(one) - len(two)) >= 4:
        return False
    return dist <= 7

def process_csv(input_csv):
    output_csv = input_csv.replace(".csv", "_filtered.csv")

    with open(input_csv, newline='', encoding="utf-8") as f:
        reader = csv.reader(f)
        rows = list(reader)

    assert len(rows) >= 2

    print_row = [t.strip() for t in rows[0]]
    manuscript_row = [t.strip() for t in rows[1]]
    filtered_print_row = []

    for p_token, m_token in zip(print_row, manuscript_row):
        if not m_token:
            filtered_print_row.append("")
        elif p_token == m_token:
            filtered_print_row.append("")
        elif is_bracket_token(m_token) or is_punctuation(m_token):
            filtered_print_row.append("")
        elif abs(len(p_token) - len(m_token)) >= 3:
            filtered_print_row.append("")
        elif has_small_internal_difference(m_token, p_token):
            filtered_print_row.append(p_token)
        else:
            filtered_print_row.append("")

    with open(output_csv, "w", newline='', encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(filtered_print_row)
        writer.writerow(manuscript_row)

    print(f"CSV filtered saved in: {output_csv}")

for input_csv in input_files:
    process_csv(input_csv)


CSV filtered saved in: nofootnotes_collation_filtered.csv


In [111]:
# footnotes correction
# ((( do not correct words that were split by the digitization )))

input_files = ["footnotes_collation.csv"]

def is_punctuation(token):
    return token in [".", ",", ";", ":", "!", "?", "«", "»", "(", ")", "’", "'", "\"", "“", "”", "¬", "^"]

def is_bracket_token(token):
    return bool(re.fullmatch(r"<[^<>]*>", token.strip())) or token.strip() in ["<", ">"]

def levenshtein_distance(s1, s2):
    if len(s1) < len(s2):
        return levenshtein_distance(s2, s1)
    if len(s2) == 0:
        return len(s1)
    previous_row = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1):
        current_row = [i + 1]
        for j, c2 in enumerate(s2):
            insertions = previous_row[j + 1] + 1    
            deletions = current_row[j] + 1          
            substitutions = previous_row[j] + (c1 != c2)
            current_row.append(min(insertions, deletions, substitutions))
        previous_row = current_row
    return previous_row[-1]

def has_small_internal_difference(one, two):
    dist = levenshtein_distance(one, two)
    if dist <= 2:
        if one.startswith(two) or one.endswith(two) or two.startswith(one) or two.endswith(one):
            if abs(len(one) - len(two)) >= 2:  # the lengths differ by 2 or more characters
                return False
        return True
    return False

def process_csv(input_csv):
    output_csv = input_csv.replace(".csv", "_filtered.csv")

    with open(input_csv, newline='', encoding="utf-8") as f:
        reader = csv.reader(f)
        rows = list(reader)

    assert len(rows) >= 2

    print_row = [t.strip() for t in rows[0]]
    manuscript_row = [t.strip() for t in rows[1]]
    filtered_print_row = []

    for p_token, m_token in zip(print_row, manuscript_row):
        if not m_token:
            filtered_print_row.append("")
        elif p_token == m_token:
            filtered_print_row.append("")
        elif is_bracket_token(m_token) or is_punctuation(m_token):
            filtered_print_row.append("")
        elif abs(len(p_token) - len(m_token)) >= 3:
            filtered_print_row.append("")
        elif has_small_internal_difference(m_token, p_token):
            filtered_print_row.append(p_token)
        else:
            filtered_print_row.append("")

    with open(output_csv, "w", newline='', encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(filtered_print_row)
        writer.writerow(manuscript_row)

    print(f"CSV filtered saved in: {output_csv}")

for input_csv in input_files:
    process_csv(input_csv)


CSV filtered saved in: footnotes_collation_filtered.csv
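
To make the footnote filtering rule concrete, the sketch below reapplies it to a few hand-made token pairs. The sample tokens are invented for illustration and do not come from the corpus; `levenshtein` and `small_difference_footnotes` are compact restatements of the functions in the cell above, renamed so that they do not overwrite them.

```python
def levenshtein(s1, s2):
    """Dynamic-programming edit distance between two strings."""
    if len(s1) < len(s2):
        s1, s2 = s2, s1
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1):
        current = [i + 1]
        for j, c2 in enumerate(s2):
            current.append(min(previous[j + 1] + 1,        # delete c1
                               current[j] + 1,             # insert c2
                               previous[j] + (c1 != c2)))  # substitute
        previous = current
    return previous[-1]


def small_difference_footnotes(one, two):
    """Footnote rule: distance <= 2, but reject pairs where one token is a
    prefix or suffix of the other and the lengths differ by 2 or more."""
    if levenshtein(one, two) > 2:
        return False
    affix = (one.startswith(two) or one.endswith(two)
             or two.startswith(one) or two.endswith(one))
    return not (affix and abs(len(one) - len(two)) >= 2)


# Hypothetical (manuscript, print) pairs, invented for this example
pairs = [("maifon", "maison"), ("rati", "ration"), ("chat", "chien")]
decisions = {m: small_difference_footnotes(m, p) for m, p in pairs}
print(decisions)
```

Only the first pair would be kept as a correction: `rati`/`ration` looks like a word cut at a line break (prefix with two missing characters), and `chat`/`chien` is too far apart.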


## III. Correction of the manuscript

1. Correction of the joined tokenized version from the CSV
2. Correction of the JSON: re-hyphenation
3. Correction of the ALTO files from the JSON files
4. Merging into a final version
5. Final corrections

In [166]:
# 1. Correction of the joined tokenized version from the CSV

tasks = [
    {
        "json_folder": "nofootnotes_manuscript_tokenized_joined",
        "csv_path": "nofootnotes_collation_filtered.csv",
        "output_folder": "manuscript_nofootnotes_joined_corrected"
    },
    {
        "json_folder": "footnotes_manuscript_tokenized_joined",
        "csv_path": "footnotes_collation_filtered.csv",
        "output_folder": "manuscript_footnotes_joined_corrected"
    }
]


for task in tasks:
    os.makedirs(task["output_folder"], exist_ok=True)

    # Load CSV tokens
    with open(task["csv_path"], encoding="utf-8") as f:
        reader = csv.reader(f)
        csv_rows = list(reader)
        print_tokens = [t.strip() for t in csv_rows[0]]
        manuscript_tokens = [t.strip() for t in csv_rows[1]]

    # Generate a filtered list of (index, m_token, p_token) only for non-empty manuscript tokens
    csv_pairs = [(i, m, print_tokens[i]) for i, m in enumerate(manuscript_tokens) if m.strip()]

    json_files = sorted(glob(os.path.join(task["json_folder"], "*.json")))
    csv_pos = 0  

    for json_file in json_files:
        with open(json_file, encoding="utf-8") as f:
            lines = json.load(f)

        corrected_lines = []

        for line in lines:
            tokens = line["tokens"]
            new_tokens = []

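            for token in tokens:
                # Look ahead from the cursor for the next CSV pair matching this token
                match_pos = next(
                    (k for k in range(csv_pos, len(csv_pairs)) if csv_pairs[k][1] == token),
                    None,
                )
                if match_pos is None:
                    # No match: keep the token unchanged and leave the cursor where it is,
                    # so that one unmatched token cannot exhaust the CSV
                    new_tokens.append(token)
                    continue

                p_tok = csv_pairs[match_pos][2]
                new_tokens.append(p_tok if p_tok else token)
                csv_pos = match_pos + 1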

            corrected_lines.append({
                "line": line["line"],
                "original": " ".join(new_tokens),
                "tokens": new_tokens
            })

        output_path = os.path.join(task["output_folder"], os.path.basename(json_file))
        with open(output_path, "w", encoding="utf-8") as f:
            json.dump(corrected_lines, f, ensure_ascii=False, indent=2)

    print(f"Correction finished in: {task['output_folder']}")

Correction finished in: manuscript_nofootnotes_joined_corrected
Correction finished in: manuscript_footnotes_joined_corrected
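
The cursor-based correction can be tried on in-memory data before running it on the folders. The following is a minimal sketch, not the notebook's exact cell: `apply_corrections` is a hypothetical helper, the tokens are invented, and it assumes that a token with no match in the CSV should be kept as transcribed.

```python
def apply_corrections(lines, manuscript_row, print_row):
    """Walk the manuscript tokens in reading order and swap in the print
    token whenever the filtered CSV proposes one (non-empty cell)."""
    pairs = [(m, p) for m, p in zip(manuscript_row, print_row) if m]
    pos = 0
    corrected = []
    for line in lines:
        new_tokens = []
        for token in line["tokens"]:
            # Search forward from the cursor for the matching CSV column
            k = next((i for i in range(pos, len(pairs)) if pairs[i][0] == token), None)
            if k is None:
                new_tokens.append(token)  # unmatched: keep as transcribed
            else:
                new_tokens.append(pairs[k][1] or token)
                pos = k + 1
        corrected.append({"line": line["line"],
                          "original": " ".join(new_tokens),
                          "tokens": new_tokens})
    return corrected


# Hypothetical filtered CSV rows (print row holds corrections only)
manuscript_row = ["la", "maifon", "", "eft", "grande"]
print_row = ["", "maison", "", "est", ""]
lines = [{"line": 1, "original": "la maifon", "tokens": ["la", "maifon"]},
         {"line": 2, "original": "eft grande", "tokens": ["eft", "grande"]}]

result = apply_corrections(lines, manuscript_row, print_row)
print([l["original"] for l in result])
```

Empty cells in the print row mean "no correction", so `la` and `grande` pass through untouched.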


In [167]:
# 2. Correction of the JSON: re-hyphenation


tasks = [
    {
        "original_folder": "nofootnotes_manuscript_tokenized",
        "corrected_folder": "manuscript_nofootnotes_joined_corrected",
        "output_folder": "manuscript_nofootnotes_json_final"
    },
    {
        "original_folder": "footnotes_manuscript_tokenized",
        "corrected_folder": "manuscript_footnotes_joined_corrected",
        "output_folder": "manuscript_footnotes_json_final"
    }
]

for task in tasks:
    os.makedirs(task["output_folder"], exist_ok=True)

    original_files = sorted(glob(os.path.join(task["original_folder"], "*.json")))

    for filepath in original_files:
        filename = os.path.basename(filepath)
        corrected_path = os.path.join(task["corrected_folder"], filename)
        output_path = os.path.join(task["output_folder"], filename)

        with open(filepath, encoding="utf-8") as f:
            original_lines = json.load(f)

        with open(corrected_path, encoding="utf-8") as f:
            corrected_lines = json.load(f)

        output_lines = []
        i = 0
        while i < len(original_lines):
            orig_tokens = original_lines[i]["tokens"]
            corr_tokens = corrected_lines[i]["tokens"] if i < len(corrected_lines) else []

            if "¬" in orig_tokens:
                # Process the run of consecutive lines that contain the hyphenation marker ¬
                start = i
                buffer_lines = []

                while start < len(original_lines) and "¬" in original_lines[start]["tokens"]:
                    orig_toks = original_lines[start]["tokens"]
                    corr_toks = corrected_lines[start]["tokens"] if start < len(corrected_lines) else []

                    cesura_idx = orig_toks.index("¬")
                    part1 = orig_toks[cesura_idx - 1] if cesura_idx > 0 else ""
                    len1 = len(part1)

                    # Part 2 (first word of next line)
                    part2 = ""
                    if start + 1 < len(original_lines):
                        next_orig_toks = original_lines[start + 1]["tokens"]
                        part2 = next_orig_toks[0] if next_orig_toks else ""

                    corrected_word = corr_toks[-1] if corr_toks else (part1 + part2)

                    if corrected_word and len(corrected_word) >= len1:
                        first_part = corrected_word[:len1]
                        second_part = corrected_word[len1:]
                    else:
                        first_part = part1
                        second_part = part2

                    # Build the tokens for the current line
                    if corr_toks:
                        new_line_tokens = corr_toks[:-1] + [first_part, "¬"]
                    else:
                        prefix = orig_toks[:max(0, cesura_idx - 1)]
                        new_line_tokens = prefix + [first_part, "¬"]

                    buffer_lines.append({
                        "line": corrected_lines[start]["line"] if start < len(corrected_lines) else original_lines[start]["line"],
                        "original": " ".join(new_line_tokens),
                        "tokens": new_line_tokens
                    })

                    # Insert second_part at the beginning of the next line
                    if start + 1 < len(corrected_lines):
                        next_corr_toks = corrected_lines[start + 1]["tokens"].copy()
                    elif start + 1 < len(original_lines):
                        next_corr_toks = original_lines[start + 1]["tokens"].copy()
                    else:
                        next_corr_toks = []

                    # Insert second_part only if it is not already at the beginning
                    if second_part and (len(next_corr_toks) == 0 or next_corr_toks[0] != second_part):
                        next_corr_toks = [second_part] + next_corr_toks

                    # Update the corrected next line for the next iteration
                    if start + 1 < len(corrected_lines):
                        corrected_lines[start + 1]["tokens"] = next_corr_toks
                        corrected_lines[start + 1]["original"] = " ".join(next_corr_toks)
                    elif start + 1 < len(original_lines):
                        original_lines[start + 1]["tokens"] = next_corr_toks
                        original_lines[start + 1]["original"] = " ".join(next_corr_toks)

                    start += 1

                # Add processed lines and update i
                output_lines.extend(buffer_lines)
                i = start
            else:
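                # Line without hyphenation: keep the corrected line (or the original if it is missing)
                output_lines.append(corrected_lines[i] if i < len(corrected_lines) else original_lines[i])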
                i += 1

        with open(output_path, "w", encoding="utf-8") as f:
            json.dump(output_lines, f, ensure_ascii=False, indent=2)

    print(f"Re-hyphenation finished in: {task['output_folder']}")

Re-hyphenation finished in: manuscript_nofootnotes_json_final
Re-hyphenation finished in: manuscript_footnotes_json_final
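
The core of the re-hyphenation step is the split of a corrected word at the position of the original line break. A minimal sketch of that rule, with a hypothetical helper name (`split_at_cesura`) and invented tokens:

```python
def split_at_cesura(corrected_word, first_half, second_half):
    """Split a corrected word into two halves, cutting at the length of the
    original first half (the part that stood before the ¬ marker)."""
    cut = len(first_half)
    if corrected_word and len(corrected_word) >= cut:
        return corrected_word[:cut], corrected_word[cut:]
    # Corrected word shorter than the original first half: keep the original halves
    return first_half, second_half


# Hypothetical case: the manuscript had "rafi ¬" / "on", joined and corrected to "ration"
first, second = split_at_cesura("ration", "rafi", "on")
line_1 = ["la", first, "¬"]
line_2 = [second, "est", "bonne"]
print(line_1, line_2)
```

Cutting at the original length keeps each half on the line where the scribe wrote it, so the line geometry stored in the ALTO file still matches the text.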


In [168]:
# 3. Correction of the ALTO files from the JSON

tasks = [
    {
        "alto_folder": "footnotes_manuscript",
        "json_folder": "manuscript_footnotes_json_final",
        "output_folder": "footnotes_manuscript_final"
    },
    {
        "alto_folder": "nofootnotes_manuscript",
        "json_folder": "manuscript_nofootnotes_json_final",
        "output_folder": "nofootnotes_manuscript_final"
    }
]

for task in tasks:
    os.makedirs(task["output_folder"], exist_ok=True)

    alto_files = sorted(glob(os.path.join(task["alto_folder"], "*.xml")))

    for alto_file in alto_files:
        filename = os.path.basename(alto_file)
        json_path = os.path.join(task["json_folder"], filename.replace(".xml", ".json"))
        output_path = os.path.join(task["output_folder"], filename)

        if not os.path.exists(json_path):
            print(f"JSON file not found for {filename}")
            continue

        with open(json_path, encoding="utf-8") as f:
            json_lines = json.load(f)

        tree = ET.parse(alto_file)
        root = tree.getroot()
        ns = {'a': 'http://www.loc.gov/standards/alto/ns-v4#'}
        ET.register_namespace('', ns['a'])

        string_elems = root.findall(".//a:String", ns)

        # Replace content
        for elem, line in zip(string_elems, json_lines):
            corrected_text = " ".join(line["tokens"]).strip()
            elem.set("CONTENT", corrected_text)

        tree.write(output_path, encoding="utf-8", xml_declaration=True)

    print(f"Alto's files corrected in: {task['output_folder']}")

Alto's files corrected in: footnotes_manuscript_final
Alto's files corrected in: nofootnotes_manuscript_final
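
The replacement of `CONTENT` can be checked on a small in-memory document. This sketch uses an invented ALTO fragment reduced to the elements involved, and assumes one `String` element per `TextLine`, as the cell above implicitly does. The length check is an addition of this example: `zip` stops silently at the shorter sequence, so a mismatch would otherwise go unnoticed.

```python
import xml.etree.ElementTree as ET

NS = {"a": "http://www.loc.gov/standards/alto/ns-v4#"}
ET.register_namespace("", NS["a"])

# Hypothetical one-line ALTO fragment (invented for this example)
alto_xml = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#">
  <Layout><Page><PrintSpace><TextBlock ID="b1">
    <TextLine ID="l1"><String CONTENT="la maifon eft grande"/></TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""

json_lines = [{"line": 1, "tokens": ["la", "maison", "est", "grande"]}]

root = ET.fromstring(alto_xml)
strings = root.findall(".//a:String", NS)

# Report a mismatch instead of letting zip() truncate silently
if len(strings) != len(json_lines):
    print(f"Warning: {len(strings)} String elements vs {len(json_lines)} JSON lines")

for elem, line in zip(strings, json_lines):
    elem.set("CONTENT", " ".join(line["tokens"]).strip())

updated = [s.get("CONTENT") for s in strings]
print(updated)
```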


In [169]:
# 4. Put together (footnotes and nofootnotes) in a final version

main_folder = "nofootnotes_manuscript_final"
footnotes_folder = "footnotes_manuscript_final"
output_folder = "footnotes_and_nofootnotes"
os.makedirs(output_folder, exist_ok=True)

ns = {'a': 'http://www.loc.gov/standards/alto/ns-v4#'}
ET.register_namespace('', ns['a'])

for filename in os.listdir(main_folder):
    if not filename.endswith(".xml"):
        continue

    main_path = os.path.join(main_folder, filename)
    footnote_path = os.path.join(footnotes_folder, filename)
    output_path = os.path.join(output_folder, filename)

    # Load main file
    tree_main = ET.parse(main_path)
    root_main = tree_main.getroot()

    # find PrintSpace
    printspace = root_main.find(".//a:PrintSpace", ns)
    if printspace is None:
        continue

    # If footnote file exists, append its blocks
    if os.path.exists(footnote_path):
        tree_foot = ET.parse(footnote_path)
        root_foot = tree_foot.getroot()
        footnote_blocks = root_foot.findall(".//a:PrintSpace/a:TextBlock", ns)

        for block in footnote_blocks:
            printspace.append(block)

    # Save
    tree_main.write(output_path, encoding="utf-8", xml_declaration=True)

print(f"Files saved in: {output_folder}")


Files saved in: footnotes_and_nofootnotes


In [170]:
# Add NumberingPage

manuscript_final = "footnotes_and_nofootnotes"
numbering_dir = "numbering_manuscript"
output_dir = "manuscript_final_1"

# Namespace
ns = {'alto': 'http://www.loc.gov/standards/alto/ns-v4#'}
ET.register_namespace('', ns['alto'])

os.makedirs(output_dir, exist_ok=True)

def reintegrate_numbering(final_folder, numbering_folder, output_folder):
    final_files = sorted(glob(os.path.join(final_folder, "*.xml")))
    
    for file_path in final_files:
        filename = os.path.basename(file_path)
        numbering_path = os.path.join(numbering_folder, filename)

        if not os.path.exists(numbering_path):
            continue

        # Parse both trees
        final_tree = ET.parse(file_path)
        final_root = final_tree.getroot()
        final_ps = final_root.find('.//alto:PrintSpace', ns)

        numbering_tree = ET.parse(numbering_path)
        numbering_root = numbering_tree.getroot()
        numbering_ps = numbering_root.find('.//alto:PrintSpace', ns)

        # Insert numbering blocks back at the beginning, keeping their original order
        for idx, elem in enumerate(list(numbering_ps)):
            final_ps.insert(idx, elem)

        # Save output
        out_path = os.path.join(output_folder, filename)
        final_tree.write(out_path, encoding="utf-8", xml_declaration=True)

# Run
reintegrate_numbering(manuscript_final, numbering_dir, output_dir)

print(f"Files saved in: {output_dir}")


Files saved in: manuscript_final_1
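
When several blocks are put back at the start of a `PrintSpace`, the insertion index decides their final order. A small sketch with invented element IDs (namespaces omitted for brevity) shows that inserting at increasing indices keeps the source order:

```python
import xml.etree.ElementTree as ET

# Hypothetical elements: one main block, two numbering blocks to prepend
target = ET.fromstring("<PrintSpace><TextBlock ID='main'/></PrintSpace>")
source = ET.fromstring(
    "<PrintSpace><TextBlock ID='n1'/><TextBlock ID='n2'/></PrintSpace>"
)

# Insert at 0, 1, 2... so that the prepended blocks keep their relative order
for idx, block in enumerate(list(source)):
    target.insert(idx, block)

order = [b.get("ID") for b in target]
print(order)
```

Inserting every block at index 0 instead would put the last block first, which only matters when a page carries more than one numbering block.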


In [176]:
# Add interlinear

import statistics

nointerlinear_dir = "manuscript_final_1"
interlinear_dir = "interlinear_manuscript"
output_dir = "manuscript_final"

ns = {'alto': 'http://www.loc.gov/standards/alto/ns-v4#'}
ET.register_namespace('', ns['alto'])


def get_y_ref(line):
    baseline = line.attrib.get("BASELINE")
    if baseline:
        try:
            coords = list(map(float, baseline.strip().split()))
            y_vals = coords[1::2]  # Y values only
            if y_vals:
                return statistics.mean(y_vals)
        except Exception:
            pass
    vpos = float(line.attrib.get("VPOS", "0"))
    height = float(line.attrib.get("HEIGHT", "0"))
    return vpos + height / 2


def merge_interlinear(nointer_file, interlinear_file, output_file):
    nointer_tree = ET.parse(nointer_file)
    inter_tree = ET.parse(interlinear_file)

    nointer_root = nointer_tree.getroot()
    inter_root = inter_tree.getroot()

    no_blocks = nointer_root.findall('.//alto:TextBlock', ns)
    inter_blocks = inter_root.findall('.//alto:TextBlock', ns)

    for nb, ib in zip(no_blocks, inter_blocks):
        inter_lines = ib.findall('alto:TextLine', ns)

        for il in inter_lines:
            line_id = il.attrib.get("ID")
            if line_id and nb.find(f'alto:TextLine[@ID="{line_id}"]', ns) is None:
                nb.append(il) 

        all_lines = nb.findall('alto:TextLine', ns)

        def sort_key(line):
            y_ref = get_y_ref(line)
            hpos = float(line.attrib.get("HPOS", "0"))
            return (y_ref, hpos)

        all_lines_sorted = sorted(all_lines, key=sort_key)

        for child in list(nb):
            if child.tag.endswith("TextLine"):
                nb.remove(child)
        for l in all_lines_sorted:
            nb.append(l)

    nointer_tree.write(output_file, encoding="utf-8", xml_declaration=True)


def reconstruct_all(nointerlinear_folder, interlinear_folder, output_folder):
    os.makedirs(output_folder, exist_ok=True)

    no_files = sorted(glob(os.path.join(nointerlinear_folder, "*.xml")))
    inter_files = sorted(glob(os.path.join(interlinear_folder, "*.xml")))

    for no_file, inter_file in zip(no_files, inter_files):
        out_file = os.path.join(output_folder, os.path.basename(no_file))
        merge_interlinear(no_file, inter_file, out_file)

    print(f"Interlinear lines added to: {output_folder}")


# Run
reconstruct_all(nointerlinear_dir, interlinear_dir, output_dir)


Interlinear lines added to: manuscript_final
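
The reading order after the merge depends on the vertical reference computed for each line. The sketch below restates that computation under the same assumptions as the cell above (`BASELINE` as space-separated `x y` values, fallback to the centre of the bounding box) and sorts three invented lines, one of them an interlinear addition without a baseline:

```python
import statistics
import xml.etree.ElementTree as ET


def y_reference(line):
    """Mean Y of the BASELINE points, falling back to the vertical centre
    of the bounding box when BASELINE is missing or malformed."""
    baseline = line.get("BASELINE")
    if baseline:
        try:
            coords = [float(v) for v in baseline.split()]
            ys = coords[1::2]  # odd positions hold the Y values
            if ys:
                return statistics.mean(ys)
        except ValueError:
            pass
    return float(line.get("VPOS", "0")) + float(line.get("HEIGHT", "0")) / 2


# Hypothetical block (coordinates invented, namespaces omitted for brevity)
block = ET.fromstring(
    "<TextBlock>"
    "<TextLine ID='main_1' HPOS='100' BASELINE='100 200 900 204'/>"
    "<TextLine ID='main_2' HPOS='100' BASELINE='100 300 900 302'/>"
    "<TextLine ID='inter_1' HPOS='400' VPOS='240' HEIGHT='20'/>"
    "</TextBlock>"
)

ordered = sorted(block, key=lambda l: (y_reference(l), float(l.get("HPOS", "0"))))
reading_order = [l.get("ID") for l in ordered]
print(reading_order)
```

The interlinear line lands between the two main lines because its box centre (250) falls between their mean baselines (202 and 301).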


In [172]:
# 5. Final corrections

folder = "manuscript_final"

# Patterns
apostrophe_pattern = re.compile(r"\s*(['’])\s*")          # qu ' ils → qu'ils
punctuation_pattern = re.compile(r"\s*([.,;!?])")         # text , → text,
hyphen_pattern = re.compile(r"\b(\w+)\s*-\s*(\w+)\b")     # mot - cle → mot-cle
pre_hyphen_marker_pat = re.compile(r"\s+¬")               # rati ¬ → rati¬
caret_pattern = re.compile(r"\s*\^\s*")                   # X ^ e → X^e
paren_open_pattern = re.compile(r"\(\s+")                 # ( 1 → (1
paren_close_pattern = re.compile(r"\s+\)")                # 1 ) → 1)
quote_token_pattern = re.compile(r'"\s*([^"]*?)\s*"')     # " abc " → "abc"
colon_pattern = re.compile(r"\s+:")                       # mot : → mot:
spaces_before_hyphen = re.compile(r"\s+-")                # remove spaces before -
double_spaces_pattern = re.compile(r" {2,}")              # collapse repeated spaces
accent_composition_pattern = re.compile(r"([aAeEiIoOuUyYcC])\s+([`´])")  # a ` → a` (removes the space only; the accent is not composed)

# Process all Alto files
for filename in os.listdir(folder):
    if not filename.endswith(".xml"):
        continue

    path  = os.path.join(folder, filename)
    tree  = ET.parse(path)
    root  = tree.getroot()

    for string in root.iter():
        if 'CONTENT' not in string.attrib:
            continue
        txt = string.attrib['CONTENT']
        txt = apostrophe_pattern.sub(r"\1", txt)
        txt = punctuation_pattern.sub(r"\1", txt)
        txt = hyphen_pattern.sub(r"\1-\2", txt)
        txt = pre_hyphen_marker_pat.sub("¬", txt)
        txt = caret_pattern.sub("^", txt)
        txt = paren_open_pattern.sub("(", txt)
        txt = paren_close_pattern.sub(")", txt)
        txt = quote_token_pattern.sub(r'"\1"', txt)
        txt = colon_pattern.sub(":", txt)
        txt = accent_composition_pattern.sub(r"\1\2", txt)
        txt = spaces_before_hyphen.sub("-", txt)
        txt = double_spaces_pattern.sub(" ", txt)
        string.attrib['CONTENT'] = txt

    tree.write(path, encoding="UTF-8", xml_declaration=True)

print("Final corrections finished")


Final corrections finished
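
The effect of these rules is easiest to see on a single string. The sketch below applies a subset of the patterns, in the same order, to an invented sample; `RULES` and `detokenize` are hypothetical names used only for this illustration.

```python
import re

# Subset of the patterns above, as (pattern, replacement) pairs in application order
RULES = [
    (re.compile(r"\s*(['’])\s*"), r"\1"),  # qu ' ils -> qu'ils
    (re.compile(r"\s*([.,;!?])"), r"\1"),  # texte , -> texte,
    (re.compile(r"\s+¬"), "¬"),            # rati ¬ -> rati¬
    (re.compile(r"\(\s+"), "("),           # ( 1 -> (1
    (re.compile(r"\s+\)"), ")"),           # 1 ) -> 1)
    (re.compile(r" {2,}"), " "),           # collapse repeated spaces
]


def detokenize(text):
    """Apply every rule once, in order, and return the cleaned string."""
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text


sample = "qu ' ils disent , ( 1 ) la rati ¬"
cleaned = detokenize(sample)
print(cleaned)
```

Note that the apostrophe rule removes spaces on both sides, so it would also glue a closing single quote to the following word; that is acceptable only if single quotes are not used as quotation marks in the text.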


## Evaluation without MarginText

In [175]:
# Removal of the ALTO "MarginText:lower" zones (footnotes)

input_dirs = "manuscript_final"
nofootnote_dirs = "manuscript_final_nofootnotes"
tagrefs = "BT88"   # each manuscript or print may use a different tag reference!

# Namespace
ns = {'alto': 'http://www.loc.gov/standards/alto/ns-v4#'}
ET.register_namespace('', ns['alto'])


def process_alto_files(input_folder, tagref, output_nofootnotes):
    os.makedirs(output_nofootnotes, exist_ok=True)
    alto_files = sorted(glob(os.path.join(input_folder, "*.xml")))

    for file_path in alto_files:
        tree = ET.parse(file_path)
        root = tree.getroot()
        page = root.find('.//alto:Page', ns)
        printspace = page.find('alto:PrintSpace', ns) if page is not None else None

        # Create no-footnote version 
        nofoot_tree = ET.ElementTree(ET.fromstring(ET.tostring(root)))
        nofoot_ps = nofoot_tree.find('.//alto:PrintSpace', ns)
        for elem in list(nofoot_ps):
            if elem.tag.endswith("TextBlock") and tagref in elem.attrib.get("TAGREFS", ""):
                nofoot_ps.remove(elem)
        nofoot_out = os.path.join(output_nofootnotes, os.path.basename(file_path))
        nofoot_tree.write(nofoot_out, encoding="utf-8", xml_declaration=True)

    print(f"ALTOs without footnotes saved in: {output_nofootnotes}")

# Execute 
process_alto_files(
    input_folder=input_dirs,
    tagref=tagrefs,
    output_nofootnotes=nofootnote_dirs
)


ALTOs without footnotes saved in: manuscript_final_nofootnotes
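
The block removal can be verified on a small invented document before it is run on the whole folder. In this sketch `drop_blocks` is a hypothetical helper and `BT1` is an invented tag; unlike the substring test used above, it compares whole `TAGREFS` entries, so that a reference such as `BT8` cannot accidentally match `BT88`.

```python
import xml.etree.ElementTree as ET

NS = {"alto": "http://www.loc.gov/standards/alto/ns-v4#"}

# Hypothetical page with one main block and one footnote block
alto_xml = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#">
  <Layout><Page><PrintSpace>
    <TextBlock ID="main" TAGREFS="BT1"/>
    <TextBlock ID="notes" TAGREFS="BT88"/>
  </PrintSpace></Page></Layout>
</alto>"""


def drop_blocks(root, tagref):
    """Remove every TextBlock of the PrintSpace whose TAGREFS lists tagref."""
    printspace = root.find(".//alto:PrintSpace", NS)
    if printspace is None:
        return root
    for block in list(printspace.findall("alto:TextBlock", NS)):
        if tagref in block.get("TAGREFS", "").split():
            printspace.remove(block)
    return root


root = drop_blocks(ET.fromstring(alto_xml), "BT88")
remaining = [b.get("ID") for b in root.findall(".//alto:TextBlock", NS)]
print(remaining)
```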
