Dulaurier dataset

Ground-truth of the Dulaurier project (HTR of Armenian manuscripts).

The dataset has been collated within the scope of the Dulaurier project (Valorisation numérique du fonds Dulaurier) led by Calfa and sponsored by the BnF Datalab (call for projects 2022-2023), in partnership with GREgORI (UCLouvain, Belgium; CIOL Institut orientaliste).

The project involves the automatic transcription of 14 manuscripts from the Dulaurier collection at the BnF (around 3,600 images), as well as the automatic tagging of these texts (identification of proper nouns, lemmatisation, POS tagging). The manuscripts are available on Gallica. The transcribed texts (available and searchable online) are part of the Armenian corpus of the GREgORI Project. The texts focus on medieval Armenian historiography (e.g. Łevond, Movsēs Kalankatuac'i, Kirakos Ganjakec'i, etc.).

Dataset composition

The dataset contains 42 images, for a total of:

148 TextRegions
1,467 TextLines
8,864 words
54,294 characters

The images are scanned microfilms. Four hands are represented (mainly Dulaurier's hand and those of his students).

ms. Arm. 226 (p. 55), ms. Arm. 231 (p. 126), and ms. Arm. 217 (p. 81) — Gallica, BnF

Images

The BnF images are available through the BnF IIIF server. For the list of IDs (images and documents), see the list-images.tsv file. To request an image, use the following URL template:

https://gallica.bnf.fr/iiif/ark:/12148/{document_ID}/f{image_ID}/{region}/{size}/{rotation}/{quality}.{format}
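
The template above can be filled in programmatically. The following is a minimal sketch, not an official client: the function name `build_iiif_url`, the placeholder document ID, and the default values (`full`, `0`, `native`, `jpg`) are assumptions chosen for illustration. Take the real IDs from list-images.tsv and adjust the parameters to what the server accepts.

```python
# Sketch: fill in the Gallica IIIF URL template given in this README.
# All default parameter values below are assumptions, not documented settings.
TEMPLATE = (
    "https://gallica.bnf.fr/iiif/ark:/12148/"
    "{document_id}/f{image_id}/{region}/{size}/{rotation}/{quality}.{fmt}"
)


def build_iiif_url(document_id, image_id, region="full", size="full",
                   rotation=0, quality="native", fmt="jpg"):
    """Return the image URL for one page of one document."""
    return TEMPLATE.format(
        document_id=document_id,
        image_id=image_id,
        region=region,
        size=size,
        rotation=rotation,
        quality=quality,
        fmt=fmt,
    )


if __name__ == "__main__":
    # "DOCUMENT_ID" is a hypothetical placeholder; use an ID from list-images.tsv.
    print(build_iiif_url("DOCUMENT_ID", 55))
```

Keeping the template as a single constant makes it easy to compare against the one documented here if the server layout changes.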

Ground-truth specifications

Information levels

For each image, we provide a PAGE XML file containing three levels of information:

TextRegion localisation, with a semantic tag (e.g. MainText), following the SegmOnto ontology;
Baseline localisation and surrounding polygon of the line;
Text.

    <TextRegion id="80735" custom="structure {type:MainText;}">
      <Coords points="2656,438 2664,3362 410,3472 403,3095 403,2903 363,2705 312,402 2656,438"/>
      <TextLine id="879019">
        <Coords points="322,485 320,520 526,524 579,504 616,518 722,506 757,524 802,516 845,532 937,512 1039,512 1078,530 1114,516 1141,528 1174,518 1245,540 1353,520 1443,520 1555,547 1753,551 1817,540 1858,553 1960,542 2154,555 2207,536 2401,553 2433,540 2658,542 2662,522 2658,473 2578,473 2456,444 2327,467 2239,440 2194,449 2150,426 2062,453 1992,424 1804,457 1692,457 1660,440 1555,451 1441,412 1247,408 1151,424 1098,420 1049,438 961,416 882,438 845,428 779,436 645,400 575,432 475,432 322,389 322,485"/>
        <Baseline points="323,486 814,477 1764,520 2664,524"/>
        <TextEquiv>
          <Unicode>չգալ կենարարին մերոյ ք[րիստոս]ի և յետ դարձի Յիսուբա</Unicode>
        </TextEquiv>
      </TextLine>
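
The three levels can be read back with the Python standard library. The sketch below makes some assumptions: the `SAMPLE` document is a shortened, hypothetical version of the excerpt above (polygons truncated, wrapper elements added so that it is well-formed), and the helper names are illustrative. Tags are matched by local name so that the code works whether or not the files declare an XML namespace.

```python
import re
import xml.etree.ElementTree as ET

# Shortened, hypothetical sample modelled on the excerpt above.
SAMPLE = """<PcGts><Page>
  <TextRegion id="80735" custom="structure {type:MainText;}">
    <Coords points="2656,438 2664,3362 410,3472 312,402"/>
    <TextLine id="879019">
      <Coords points="322,485 2658,542 2658,473 322,389"/>
      <Baseline points="323,486 814,477 1764,520 2664,524"/>
      <TextEquiv>
        <Unicode>չգալ կենարարին մերոյ ք[րիստոս]ի և յետ դարձի Յիսուբա</Unicode>
      </TextEquiv>
    </TextLine>
  </TextRegion>
</Page></PcGts>"""


def local(tag):
    """Strip an optional '{namespace}' prefix from a tag name."""
    return tag.rsplit("}", 1)[-1]


def parse_points(points):
    """Turn 'x1,y1 x2,y2 ...' into a list of (x, y) integer tuples."""
    return [tuple(int(v) for v in pair.split(",")) for pair in points.split()]


def read_line(line):
    """Collect polygon, baseline and text of one TextLine element."""
    record = {"id": line.get("id"), "polygon": [], "baseline": [], "text": ""}
    for child in line:
        name = local(child.tag)
        if name == "Coords":
            record["polygon"] = parse_points(child.get("points", ""))
        elif name == "Baseline":
            record["baseline"] = parse_points(child.get("points", ""))
        elif name == "TextEquiv":
            for sub in child:
                if local(sub.tag) == "Unicode":
                    record["text"] = sub.text or ""
    return record


def read_page(xml_text):
    """Return one dict per TextRegion: id, semantic type, and its lines."""
    regions = []
    for element in ET.fromstring(xml_text).iter():
        if local(element.tag) != "TextRegion":
            continue
        match = re.search(r"type:([^;}]+)", element.get("custom", ""))
        lines = [read_line(c) for c in element if local(c.tag) == "TextLine"]
        regions.append({
            "id": element.get("id"),
            "type": match.group(1) if match else None,
            "lines": lines,
        })
    return regions


if __name__ == "__main__":
    for region in read_page(SAMPLE):
        print(region["type"], len(region["lines"]), region["lines"][0]["text"])
```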

Annotations were made on the Calfa Vision platform, a free web-based annotation tool for documents and images designed for Oriental scripts. The transcription is faithful to the text present in the image, including misspelled words. We do not follow the scriptio continua of Armenian manuscripts: we restore the separation between words. We expand the abbreviations and ideograms used in the manuscripts, marking the restored letters with brackets, so that the text supports keyword search and the abbreviations remain understandable. For instance:

քի will be transcribed ք[րիստոս]ի
~ն will be transcribed [աշխարհ]ն
քղք will be transcribed ք[ա]ղ[ա]ք etc.
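
Because expansions are marked with brackets, both the expanded reading and the abbreviated form can be derived from a transcription with a simple substitution. This is a minimal sketch; the function names are illustrative and not part of any Calfa tooling. For ideograms such as ~ն, removing the bracketed part yields only the letters written around the sign (ն), not the sign itself.

```python
import re

# Matches one bracketed expansion, e.g. "[րիստոս]" in "ք[րիստոս]ի".
BRACKETED = re.compile(r"\[([^\[\]]*)\]")


def expanded(text):
    """Keep the restored letters and drop the brackets (for keyword search)."""
    return BRACKETED.sub(r"\1", text)


def abbreviated(text):
    """Drop the restored letters, keeping only what is written in the image."""
    return BRACKETED.sub("", text)


def expansions(text):
    """List the restored letter sequences found in a transcription."""
    return BRACKETED.findall(text)


if __name__ == "__main__":
    for word in ["ք[րիստոս]ի", "ք[ա]ղ[ա]ք", "[աշխարհ]ն"]:
        print(word, expanded(word), abbreviated(word), expansions(word))
```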

Some results

Paper in progress. Preliminary results were presented at the Bibliothèque nationale de France in January 2024. With this dataset, we reach a mean recognition accuracy of 98.56% (and 92.9% for the correct reading of abbreviations).
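
The exact metric behind these figures is not specified here. A common convention in HTR evaluation is character accuracy, defined as 1 minus the character error rate (edit distance divided by reference length). The sketch below illustrates that convention only, as an assumption; it is not the project's evaluation code, and the function names are illustrative.

```python
def levenshtein(reference, hypothesis):
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, 1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, 1):
            current.append(min(
                previous[j] + 1,                           # deletion
                current[j - 1] + 1,                        # insertion
                previous[j - 1] + (ref_char != hyp_char),  # substitution or match
            ))
        previous = current
    return previous[-1]


def char_accuracy(reference, hypothesis):
    """Assumed metric: 1 - CER, where CER = edit distance / reference length."""
    if not reference:
        return 1.0 if not hypothesis else 0.0
    return 1 - levenshtein(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # One substituted character out of four.
    print(char_accuracy("abcd", "abxd"))
```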

Accuracy per manuscript with models trained with this dataset

For now, processed texts have been analyzed (lemmatisation and POS-tagging) and are searchable on GREgORI interfaces (Ancient Armenian Corpus). Results are linked to Gallica and Calfa dictionaries. Proofreading will be carried out in 2024.

Related dataset

A dataset of stamps and seals, produced within the scope of this project and using layout analysis models trained with this dataset, has been released on Zenodo.

Acknowledgments

This work was carried out with the support of the BnF Datalab, in partnership with the GREgORI project and ANR DALiH.
