
OCR17+ - Layout analysis and text recognition for 17th c. French prints

This repo contains training data and models for layout analysis and text recognition for 17th c. French prints.

This repo is an updated version of the OCR17 repo. It uses XML files rather than .png/.txt pairs.

The old repo is still available here.

How to use

Training data is organised per print:

  • Balzac1624_Lettres_btv1b86262420_corrected
  • Boyer1697_Meduse_cb30152139c_corrected

To train a model, all the data needs to be gathered in a single directory before it is split into training, validation and test sets. To do so:

  1. git clone https://github.com/e-ditiones/OCR17plus
  2. cd OCR17plus
  3. Run bash build_train_alto_Seg17.sh, which creates a trainingDataSeg17 directory.
  4. Run python train_val_prep.py ./trainingDataSeg17/*.xml, which creates two new files: train.txt (training data) and val.txt (validation data).
  5. If you have kraken installed, run ketos segtrain -t train.txt -e val.txt -o model -d cuda -f alto -q early -bl to train a model for layout analysis.
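As an illustration of step 4, the split into train.txt and val.txt could look roughly like the sketch below. It is not the code of train_val_prep.py: the function names, the 10% validation ratio and the fixed seed are assumptions chosen for the example. It only shows the general idea of shuffling the XML paths reproducibly and writing one path per line, which is the kind of list that ketos accepts through -t and -e.

```python
import random
from pathlib import Path


def split_train_val(xml_paths, val_ratio=0.1, seed=42):
    """Shuffle the paths reproducibly and return (train, val) lists.

    val_ratio and seed are illustrative defaults, not values taken
    from the repository.
    """
    # Sort first so that the seed alone determines the final order.
    paths = sorted(str(p) for p in xml_paths)
    random.Random(seed).shuffle(paths)
    # Keep at least one validation file as soon as there are two files.
    n_val = max(1, round(len(paths) * val_ratio)) if len(paths) > 1 else 0
    return paths[n_val:], paths[:n_val]


def write_manifest(paths, target):
    """Write one path per line to the target text file."""
    Path(target).write_text("".join(f"{p}\n" for p in paths), encoding="utf-8")
```

A call such as split_train_val(Path("trainingDataSeg17").glob("*.xml")) followed by write_manifest(train, "train.txt") and write_manifest(val, "val.txt") would then produce the two files used in step 5.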

The test.txt file is already prepared, so that tests are reproducible and improvement can be evaluated over time. It was created with 3 title pages, 14 pages containing damage, 2 pages with margin, 14 with decoration, 19 with rubric or signatures (or both), 1 with a running title at the bottom of the page, 3 pages with decorated drop capitals, 7 with basic drop capitals and 28 basic pages. This test file can also be used to evaluate HTR training.
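Because the test set is fixed, it can be useful to check that none of its pages end up in the training or validation lists. The sketch below assumes that test.txt lists one file path per line and that pages can be matched by file name; both points are assumptions, and the helper names are hypothetical. The repository scripts may already handle this in their own way.

```python
from pathlib import Path


def read_manifest(manifest):
    """Return the non-empty lines of a manifest such as test.txt."""
    lines = Path(manifest).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def exclude_test_pages(candidates, test_entries):
    """Drop every candidate whose file name also appears in the test set."""
    # Compare on file names only, since directories may differ between lists.
    test_names = {Path(entry).name for entry in test_entries}
    return [c for c in candidates if Path(c).name not in test_names]
```

For example, exclude_test_pages(all_xml_paths, read_manifest("test.txt")) returns the paths that are safe to split into training and validation data.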

Structure

The structure of the repo is the following:

├── Data
│     ├── Print_1
│     │  ├── alto4eScriptorium
│     │  ├── pageXmlTranskribus
│     │  ├── pagexmlTranskribusCorrected
│     │  └── png
│     ├── Print_2
│     │  ├── alto4eScriptorium
│     │  ├── pageXmlTranskribus
│     │  ├── pagexmlTranskribusCorrected
│     │  └── png
│     └── …
├── Models
│     ├── HTR
│     │  ├── bleu.mlmodel
│     │  ├── cheddar.mlmodel
│     │  ├── dentduchat.mlmodel
│     │  └── README.md
│     └── Segment
│        ├── appenzeller.mlmodel
│        └── README.md
├── build_train_alto_Seg17.sh
├── files_informations.csv
├── parts_dataset.csv
├── train_val_prep.py
├── test.txt
├── segmontoAltoValidator.xsd
├── validator_alto.py
└── README.md

The Data directory contains excerpts of 17th century books, i.e. scans of selected pages and their encoding in

  1. PageXML
  2. ALTO-4 files.

For the differences between these directories, see the Data description section below.

Prints were selected from the OCR17 repo, and each one is described in its own folder.
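To get a quick overview of what each print folder contains, the layout described above can be walked with a few lines of Python. This is a minimal sketch that assumes the Data/Print/sub-directory layout shown in the tree; the function name inventory is hypothetical and not part of the repository.

```python
from collections import Counter
from pathlib import Path


def inventory(data_dir):
    """Map each print folder to a Counter of files per sub-directory."""
    report = {}
    for print_dir in sorted(Path(data_dir).iterdir()):
        if not print_dir.is_dir():
            continue  # skip stray files at the top level
        counts = Counter()
        for sub in print_dir.iterdir():
            if sub.is_dir():
                # Count regular files only (e.g. XML exports or PNG scans).
                counts[sub.name] = sum(1 for f in sub.iterdir() if f.is_file())
        report[print_dir.name] = counts
    return report
```

Calling inventory("Data") returns a dictionary keyed by print name, which makes it easy to spot a print whose png and alto4eScriptorium folders do not hold the same number of files.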

The Models directory contains models for:

  1. HTR
  2. Layout analysis. The layout analysis is based on the SegmOnto vocabulary.

The XML data pushed to the repository is validated with segmontoAltoValidator.xsd and validator_alto.py. Both come from the HTR-United/cremma-medieval repository.
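As a lightweight complement to the validation described above, the labels declared in an ALTO file can be listed with the standard library alone. The sketch below is not validator_alto.py and does not perform XSD validation. It assumes that zone and line types are declared as OtherTag elements with a LABEL attribute inside the ALTO Tags section, and the set of allowed labels is passed in by the caller because the exact vocabulary used in this repo is not restated here.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def declared_labels(alto_xml):
    """Return the set of LABEL values found on OtherTag elements."""
    root = ET.fromstring(alto_xml)
    return {
        el.get("LABEL")
        for el in root.iter()
        if local_name(el.tag) == "OtherTag" and el.get("LABEL")
    }


def unknown_labels(alto_xml, allowed):
    """Return the declared labels that are not in the allowed vocabulary."""
    return declared_labels(alto_xml) - set(allowed)
```

For instance, unknown_labels(Path(p).read_text(encoding="utf-8"), allowed) could be run over every file of an alto4eScriptorium folder, where allowed is whichever list of SegmOnto labels the project has settled on; an empty result means no unexpected label was declared.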

Data description

Some of the data come from the OCR17 repo, which was originally built with Transkribus; its exports need to be adapted for eScriptorium. Therefore, for each print exported from Transkribus, we provide:

  1. The exported file (pageXmlTranskribus)
  2. The exported file, adapted for eScriptorium (pagexmlTranskribusCorrected)
  3. The version exported from eScriptorium (alto4eScriptorium)

Credits

Data prepared and models trained by Claire Jahan with the help of Simon Gabay, as part of the E-ditiones project.

Contact

Claire Jahan : claire.jahan[at]chartes.psl.eu

Simon Gabay : Simon.Gabay[at]unige.ch

Cite this dataset

Claire Jahan and Simon Gabay, OCR17+ - Layout analysis and text recognition for 17th c. French prints, 2021, Paris/Genève: ENS Paris/UniGE, https://github.com/e-ditiones/OCR17plus.

Licence

Data is CC-BY, except images which come from Gallica (cf. conditions d'utilisation).

