# Notebook: Latin lemmatization benchmark on an HTR corpus (XML PAGE)

- Author: Lucas Terriel <lucas.terriel@chartes.psl.eu>
- Date: 10/01/2023

## Motivation

This notebook runs a benchmark of several lemmatizers on a corpus acquired by HTR (XML PAGE). Each tagger produces a pre-lemmatized corpus that can be analyzed and, if needed, loaded into a tool for manual correction of the lemmatization (such as [Pyrrha](https://github.com/hipster-philology/pyrrha)). The corrected data can then be used to train a new lemmatization model. This benchmark was set up as part of the e-NDP project.

## Overall process schema

![schema_du_notebook](./docs/benchmark_lemmatization.jpg)


## File tree description
```
 lemmatization-post-benchmark/
|
├── treetagger/ (Do not remove, Specific dependencies for the treetagger, available when launching the notebook)
│    
├── out_lemmatized_corpus/ (Contains the lemmatized corpora, organized by tagger name; this directory is created during the notebook execution)
│    ├── cltk/
│    ├── nlp_pie_lasla/
│    ├── omnia/
│    ├── ...
│   
├── lib/ (Do not remove, Contains notebook-specific dependencies)
|
├── data_in/ (Contains  input data for the notebook)
│    ├── ground-truth-htr/ (Cf. Notebook part 1. for details)
│    ├── ground-truth-layout_list.txt (Cf. Notebook part 1. for details)
│   
├── data_filtered/ (Contains filtered input data according to the layout, this directory is created during the notebook execution) 
|
├── lemmatize_corpus_main.ipynb (This notebook)
|
├── gt_filtered_struct_to_lemmatize.json (Lists the text blocks, selected according to the layout, that will be passed to the taggers; this file is created during the notebook execution)
|
|
└── environment.yml
```


## Lemmatizers available

- [CLTK](http://cltk.org/) (Classical Latin)
- OMNIA via [treetaggerwrapper](https://treetaggerwrapper.readthedocs.io/en/latest/)
- [Lasla model](https://github.com/chartes/deucalion-model-lasla) via [nlp-pie-taggers](https://github.com/hipster-philology/nlp-pie-taggers)
- [ud_pipe](https://lindat.mff.cuni.cz/services/udpipe/) (models ittb, perseus, proiel) via the spaCy extension [spacy_udpipe](https://github.com/TakeLab/spacy-udpipe)

In [1]:
import warnings
warnings.filterwarnings('ignore')

### 1. Download e-NDP HTR resources

Create a `data_in/` directory that contains:

* HTR ground truth from e-NDP_HTR (`./data_in/ground-truth-htr/`) with XML ALTO and images;
* List of GT files with layout (`./data_in/ground-truth-layout_list.txt`).

> These data are available on https://github.com/chartes/e-NDP_HTR 

In [2]:
# Local path to HTR data
gt_files_dir_path = "./data_in/ground-truth-htr/"
gt_files_segmented_path_list = "./data_in/ground-truth-layout_list.txt"
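
Before going further, it can help to confirm that both inputs are where the notebook expects them. The following is a minimal sketch, not part of the notebook's `lib/` package: `check_inputs` is a hypothetical helper, and it assumes the ground-truth files sit directly in the directory with an `.xml` extension.

```python
from pathlib import Path


def check_inputs(gt_dir, layout_list):
    """Return a list of human-readable problems; an empty list means the inputs look usable."""
    problems = []
    gt_dir, layout_list = Path(gt_dir), Path(layout_list)
    if not gt_dir.is_dir():
        problems.append(f"missing directory: {gt_dir}")
    elif not any(gt_dir.glob("*.xml")):
        # Assumption: ground-truth files are stored flat, with an .xml extension.
        problems.append(f"no XML files found in: {gt_dir}")
    if not layout_list.is_file():
        problems.append(f"missing layout list: {layout_list}")
    return problems


# Same paths as in the cell above; nothing is printed when everything is in place.
for problem in check_inputs("./data_in/ground-truth-htr/",
                            "./data_in/ground-truth-layout_list.txt"):
    print(problem)
```

Returning the problems as a list, rather than raising on the first one, makes it possible to report a missing directory and a missing list file in a single pass.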

### 2. Keep only GT files with layout, then prepare and inspect the data before lemmatization

In [3]:
from lib.data_preprocessor import endpDataLoader
        
        
data_loader = endpDataLoader(filtered_layout_gt_list_path=gt_files_segmented_path_list, 
               in_gt_path=gt_files_dir_path,
               out_gt_filtered_path="./data_filtered/", 
               copy_images=False)

# Inspect the GT text blocks to lemmatize
# data_loader.gt_structure_to_lemmatize

The directory: ./data_filtered/ already exists, this action recreate your directory [Y/n] Y


New directory: ./data_filtered/ is re-created.
✔️ data filtered and available: ./data_filtered/ and ready to lemmatize.
* Inspect your texts blocks to lemmatize in file: gt_filtered_struct_to_lemmatize.json
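
The schema of `gt_filtered_struct_to_lemmatize.json` is not described here, so the sketch below only reports the top-level shape of the file rather than assuming any particular keys. `summarize_json` is a hypothetical helper written for illustration.

```python
import json
from pathlib import Path


def summarize_json(path):
    """Describe the top-level shape of a JSON file without assuming its schema."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if isinstance(data, dict):
        return {"type": "dict", "size": len(data), "sample_keys": list(data)[:5]}
    if isinstance(data, list):
        return {"type": "list", "size": len(data)}
    return {"type": type(data).__name__}


# File name taken from the message above; skipped silently if the file is absent.
struct_path = Path("gt_filtered_struct_to_lemmatize.json")
if struct_path.is_file():
    print(summarize_json(struct_path))
```

Once the actual structure is known, the summary can be replaced by a more targeted check, for example counting text blocks per document.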


### 3. Set up taggers (this step can be skipped on later runs)

In [4]:
from lib.utils import init_taggers

init_taggers(arch="osx")

fatal: not a git repository (or any of the parent directories): .git
It seems like you downloaded `pie` instead of git-cloning it or installing it with pip.
We won't be able to check compatibility between pretrained models and `pie` version.



* Initialize Omnia tagger for arch: osx
* Omnia taggger is ready ✔️
* Initialize Lasla model for NLP Pie tagger

### 4. Run the full lemmatization benchmark

In [5]:
%%time

from lib.nlp import LemmatizerBenchmarkEngine

# If this is the first time you run the engine, set the `setup_taggers` parameter to True
lemmatizer_engine = LemmatizerBenchmarkEngine(
    data=data_loader.gt_structure_to_lemmatize, 
    output_tagged_corpus_path="./out_lemmatized_corpus",
)

# 1a) Run every lemmatizer for the full benchmark (~13 min)
lemmatizer_engine.run_benchmark()

# 1b) ... or run a single lemmatizer
# lemmatizer_engine.omnia_process()
# lemmatizer_engine.ud_pipe_process()
# lemmatizer_engine.cltk_process()
# ...



‎𐤀 CLTK version '1.1.6'.
Pipeline for language 'Latin' (ISO: 'lat'): `LatinNormalizeProcess`, `LatinStanzaProcess`.
[INFO] *> Directory: ./out_lemmatized_corpus is already created.
[INFO] *> Run all lemmatization benchmark...
[INFO] *> Initialize cltk tagger...
[INFO] *> Directory: ./out_lemmatized_corpus/cltk is already created.


CLTK tagger in progress with doc: ./data_filtered/FRAN_0393_12690_L.xml: 100%|██████████| 363/363 [04:42<00:00,  1.29it/s]


[INFO] *> Initialize omnia tagger...
[INFO] *> Directory: ./out_lemmatized_corpus/omnia is already created.


OMNIA tagger in progress with doc: ./data_filtered/FRAN_0393_12690_L.xml: 100%|██████████| 363/363 [00:08<00:00, 45.35it/s]


[INFO] *> Initialize nlp_pie_lasla tagger...
[INFO] *> Directory: ./out_lemmatized_corpus/nlp_pie_lasla is already created.


NLP_PIE_LASLA tagger in progress with doc: ./data_filtered/FRAN_0393_12690_L.xml: 100%|██████████| 363/363 [06:37<00:00,  1.09s/it]


[INFO] *> Initialize ud_pipe tagger...
[INFO] *> Directory: ./out_lemmatized_corpus/ud_pipe_nlp_la_ittb is already created.


UD_PIPE_NLP_LA_ITTB tagger in progress with doc: ./data_filtered/FRAN_0393_12690_L.xml: 100%|██████████| 363/363 [00:44<00:00,  8.21it/s]


[INFO] *> Directory: ./out_lemmatized_corpus/ud_pipe_la-proiel is already created.


UD_PIPE_LA-PROIEL tagger in progress with doc: ./data_filtered/FRAN_0393_12690_L.xml: 100%|██████████| 363/363 [00:51<00:00,  6.99it/s]


[INFO] *> Directory: ./out_lemmatized_corpus/ud_pipe_la-perseus is already created.


UD_PIPE_LA-PERSEUS tagger in progress with doc: ./data_filtered/FRAN_0393_12690_L.xml: 100%|██████████| 363/363 [00:35<00:00, 10.29it/s]

[INFO] *> Benchmark is over ✔️
CPU times: user 14min 56s, sys: 4min 37s, total: 19min 33s
Wall time: 13min 45s
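
To compare the taggers on speed, the elapsed times shown in the progress bars above can be turned into throughput and share of total tagging time. This is a small sketch that only reuses the figures printed in this run (363 documents per tagger); the times are copied by hand, not parsed from the log.

```python
def to_seconds(mmss):
    """Convert an 'MM:SS' string, as printed by the progress bars, to seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


N_DOCS = 363  # documents processed by each tagger in the run above

# Elapsed times copied from the progress bars of this run.
elapsed = {
    "cltk": "04:42",
    "omnia": "00:08",
    "nlp_pie_lasla": "06:37",
    "ud_pipe_nlp_la_ittb": "00:44",
    "ud_pipe_la-proiel": "00:51",
    "ud_pipe_la-perseus": "00:35",
}

seconds = {name: to_seconds(t) for name, t in elapsed.items()}
total = sum(seconds.values())

# Fastest tagger first.
for name, secs in sorted(seconds.items(), key=lambda kv: kv[1]):
    print(f"{name:<22} {secs:>4d} s  {N_DOCS / secs:6.2f} docs/s  {100 * secs / total:5.1f} %")
print(f"total tagging time: {total // 60} min {total % 60} s")
```

The six runs add up to 817 s (13 min 37 s), which is consistent with the reported wall time of 13 min 45 s. CLTK and the Lasla model together account for about 83 % of that time.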



