Extractor - From ALTO to TEI

Python script to transform ALTO4 files into XML-TEI files.

Warning: for now, this pipeline only works with:

files annotated with the SegMonto controlled vocabulary;
ALTO files created with eScriptorium (data and models for eScriptorium can be found in the OCR17plus repo);
prints found on Gallica, together with their IIIF manifest.

Structure of the repo

├── functions
│     ├── cleaned_file.py
│     ├── count_illustration.py
│     ├── count_words.py
│     ├── creation_intermediaire.py
│     ├── extraction_img.py
│     ├── geonames.json
│     ├── licences.json
│     ├── recuperation_donnees_SPARQL.py
│     ├── récupération_données_manifest.py
│     ├── sorted.py
│     ├── Transkribus_ABBYY_native.py
│     └── __init__.py
│ 
├── ODD
│     ├── ODD.xml
│     └── out
│           └── ODD.rng
├── example
│     ├── xml
│     │     ├── ALTOs
│     │     │      ├──author_title_date_ID_folio.xml
│     │     │      ├──…
│     │     │      └── author_title_date_ID_folio.xml
│     │     └── TEI
│     │            ├── author_title_date_ID.xml
│     │            ├── author_title_date_ID_decorations.xml
│     │            └── extraction_img.py
│     ├── img
│     │     ├──author_title_date_ID_folio.jpg
│     │     ├──…
│     │     └── author_title_date_ID_folio.jpg
│     └── README.md
├── alto4_into_TEI.py
├── strings_checking.py
└── README.md

With alto4_into_TEI.py it is possible to transform XML ALTO4 files from eScriptorium into XML-TEI files.

The functions directory contains several Python files used in alto4_into_TEI.py. Each of them is a separate step of the transformation.

strings_checking.py is a script that helps to correct segmentation mistakes.

The ODD directory contains an ODD based on the work of Alexandre Bartz and Simon Gabay, and more specifically on the first of the three levels of transcription in XML-TEI (i.e. E-ditiones/ODD17).

The example directory contains an example of the output that this pipeline creates.

How to

Install

git clone https://github.com/e-ditiones/extractor
cd extractor
pip install virtualenv
virtualenv env
source env/bin/activate
pip install -r requirements.txt

Run

Please note that you need two separate folders:

one with all the images (if you need to create it, use cp PATH_TO_FOLDER/*png images in the terminal)
one with all the XML files (if you need to create it, use cp PATH_TO_FOLDER/*xml altos in the terminal)
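The same separation can also be done from Python. The following is a minimal sketch, assuming an export folder that mixes images and XML files; the function name split_export and the default folder names images and altos are illustrative and are not part of this repository.

```python
import shutil
from pathlib import Path


def split_export(source, images_dir="images", altos_dir="altos",
                 image_suffixes=(".png", ".jpg", ".jpeg")):
    """Copy images and XML files from `source` into two separate folders.

    Returns a tuple (number of images copied, number of XML files copied).
    """
    source = Path(source)
    images_dir, altos_dir = Path(images_dir), Path(altos_dir)
    images_dir.mkdir(parents=True, exist_ok=True)
    altos_dir.mkdir(parents=True, exist_ok=True)

    n_images = n_xml = 0
    for path in sorted(source.iterdir()):
        if not path.is_file():
            continue  # sub-directories are ignored
        suffix = path.suffix.lower()
        if suffix in image_suffixes:
            shutil.copy2(path, images_dir / path.name)
            n_images += 1
        elif suffix == ".xml":
            shutil.copy2(path, altos_dir / path.name)
            n_xml += 1
    return n_images, n_xml
```

Files with any other extension are left where they are, so the original export folder stays untouched.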

Import, annotate, transcribe and correct data on eScriptorium. Download them as ALTO v4 files with the images.
Check the consistency of the files with strings_checking.py:

python strings_checking.py PATH_TO_THE_ALTO4_DIRECTORY

If the script reports a problem, correct the errors in the lines or zones concerned.
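To give an idea of what a consistency check on lines and zones can look like, here is a minimal sketch. It is not the logic of strings_checking.py: it only assumes that the ALTO export stores the type of each zone (TextBlock) and line (TextLine) in a TAGREFS attribute, and it lists the elements where that attribute is missing. The function names find_untagged and check_directory are hypothetical.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def find_untagged(alto_path):
    """Return (element name, ID) pairs for zones and lines without TAGREFS."""
    root = ET.parse(alto_path).getroot()
    untagged = []
    # "{*}" matches any namespace (Python 3.8+), so the ALTO namespace
    # URI does not need to be hard-coded here.
    for name in ("TextBlock", "TextLine"):
        for element in root.findall(f".//{{*}}{name}"):
            if not element.get("TAGREFS"):
                untagged.append((name, element.get("ID")))
    return untagged


def check_directory(directory):
    """Map each ALTO file name in `directory` to its untagged elements."""
    report = {}
    for path in sorted(Path(directory).glob("*.xml")):
        problems = find_untagged(path)
        if problems:
            report[path.name] = problems
    return report
```

Files without any untagged element do not appear in the report, so an empty dictionary means nothing was flagged by this particular check.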

Transform the data with alto4_into_TEI.py

python alto4_into_TEI.py 'IIIF_GALLICA_ARK' 'NAME_SURNAME_ORCID' 'PUBLISHER' 'LINK_TO_PUBLISHER_INFO' -a 'AVAILABILITY' -e

IIIF_GALLICA_ARK: provide the ARK qualifier of the document (btv1b86262420 in ark:/12148/btv1b86262420).
'NAME_SURNAME_ORCID' must be written with underscores instead of blanks to be processed correctly. If you do not have an ORCID, use 'NAME_SURNAME_'.
'PUBLISHER' is the name of the project publishing the document.
'LINK_TO_PUBLISHER_INFO' is the URL of the project.
-a 'AVAILABILITY' is a mandatory argument that only accepts the following values: 'cc by', 'cc by-sa', 'cc by-nd', 'cc by-nc', 'cc by-nc-sa' or 'cc by-nc-nd' (cf. Creative Commons licences).
-e is an option that produces an extra XML file listing all "Decoration", "Figure" and "DropCapital" zones with their IIIF links.
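The arguments described above can be modelled with argparse. This is only a sketch of such a command line, not the code of alto4_into_TEI.py: the long option names --availability and --extra, the argument names, and the helper split_editor are assumptions made for illustration. LICENCES holds the six Creative Commons licence codes, and split_editor assumes that neither the name nor the surname contains an underscore.

```python
import argparse

# The six Creative Commons licence codes.
LICENCES = ["cc by", "cc by-sa", "cc by-nd", "cc by-nc", "cc by-nc-sa", "cc by-nc-nd"]


def build_parser():
    """Build a parser mirroring the arguments described in this README."""
    parser = argparse.ArgumentParser(description="Sketch of the ALTO-to-TEI command line.")
    parser.add_argument("ark", help="Gallica ARK qualifier, e.g. btv1b86262420")
    parser.add_argument("editor", help="NAME_SURNAME_ORCID, with underscores instead of blanks")
    parser.add_argument("publisher", help="name of the project publishing the document")
    parser.add_argument("publisher_url", help="URL of the project")
    parser.add_argument("-a", "--availability", required=True, choices=LICENCES,
                        help="licence of the resulting TEI file")
    parser.add_argument("-e", "--extra", action="store_true",
                        help="also produce the file listing the decoration zones")
    return parser


def split_editor(value):
    """Split 'NAME_SURNAME_ORCID' into (name, surname, orcid).

    The ORCID is returned as None when the value ends with an underscore.
    """
    name, surname, orcid = value.split("_", 2)
    return name, surname, orcid or None
```

Declaring the licence values as choices makes argparse reject any other value with an explicit error message before the transformation starts.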

For example:

python alto4_into_TEI.py 'bpt6k73945k' 'Simon_Gabay_0000-0001-9094-4475' 'E-ditiones' 'github.com/e-ditiones' -a 'cc by' -e
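The ARK qualifier is what links a document to its IIIF resources on Gallica. The sketch below shows how such URLs can be built, assuming Gallica's public IIIF URL pattern; it makes no network request, and the function names manifest_url and region_url are hypothetical, not functions of this repository.

```python
GALLICA_IIIF = "https://gallica.bnf.fr/iiif/ark:/12148"


def manifest_url(qualifier):
    """Return the IIIF manifest URL for a Gallica ARK qualifier."""
    return f"{GALLICA_IIIF}/{qualifier}/manifest.json"


def region_url(qualifier, folio, x, y, width, height):
    """Return the IIIF image URL of a rectangular region on a given folio.

    `folio` is the 1-based view number; the region is given in pixels.
    """
    region = f"{x},{y},{width},{height}"
    return f"{GALLICA_IIIF}/{qualifier}/f{folio}/{region}/full/0/native.jpg"


print(manifest_url("bpt6k73945k"))
print(region_url("bpt6k73945k", 12, 100, 200, 300, 400))
```

A region URL of this kind is the sort of link that can point to a single "Decoration", "Figure" or "DropCapital" zone on a page.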

The script will ask for the path to the ALTO4 files and the images directory: provide a path.
The script returns a directory with the following structure:

├── xml
│     ├── ALTOs
│     │      ├──author_title_date_ID_folio.xml
│     │      ├──…
│     │      └── author_title_date_ID_folio.xml
│     └── TEI
│            ├── author_title_date_ID.xml
│            ├── author_title_date_ID_decorations.xml
│            └── extraction_img.py
└── img
      ├──author_title_date_ID_folio.jpg
      ├──…
      └── author_title_date_ID_folio.jpg

ALTO files (one per page of the original document) are cleaned and renamed.
Two TEI files are created:

one complete file
one with just the decorations, only if the -e option has been used

Images are renamed so that they share the names of the ALTO files.
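Since ALTO files and images are expected to share the same names, the output can be checked quickly. The following is a minimal sketch, assuming the directory layout shown above (xml/ALTOs and img) and .jpg images; the function name unmatched_pages is hypothetical.

```python
from pathlib import Path


def unmatched_pages(output_dir):
    """Compare ALTO and image file names (without extension).

    Returns two sorted lists: ALTO files without an image,
    and images without an ALTO file.
    """
    output_dir = Path(output_dir)
    altos = {p.stem for p in (output_dir / "xml" / "ALTOs").glob("*.xml")}
    images = {p.stem for p in (output_dir / "img").glob("*.jpg")}
    return sorted(altos - images), sorted(images - altos)
```

Two empty lists mean that every page has both its ALTO file and its image.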

Credits

Scripts prepared by Claire Jahan with the help of Simon Gabay, as part of the E-ditiones project.

Contact

Claire Jahan : claire.jahan[at]chartes.psl.eu

Simon Gabay : Simon.Gabay[at]unige.ch

Cite this dataset

Claire Jahan and Simon Gabay, Extractor - From ALTO to TEI, 2021, Paris/Geneva: ENS Paris/UniGE, https://github.com/e-ditiones/extractor.

Thanks to

Thanks to Simon Gabay, Juliette ❤️ Janes and Alexandre ❤️ Bartz for their help and work.

Licence

Data is CC-BY, except for the images, which come from Gallica (cf. Gallica's terms of use, "conditions d'utilisation").
