Scripts for text and index extraction from TEI files for the Clariah+ VOC use case.
The scripts pretei_gm_extracter.py and vandam_extracter.py extract text from preTEI files for the Generale Missiven corpus and from TEI files for the Van Dam corpus, respectively.
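As a rough illustration of what text extraction from a TEI file involves, here is a minimal sketch using only the Python standard library. It is not the code in pretei_gm_extracter.py or vandam_extracter.py. The function name extract_paragraphs, the sample document, and the assumption that the text sits in <p> elements under <body> in the standard TEI namespace are all hypothetical; the actual corpus files may be structured differently.

```python
import xml.etree.ElementTree as ET
from typing import List

# Assumption: the files use the standard TEI namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def extract_paragraphs(tei_xml: str) -> List[str]:
    """Return the plain text of every non-empty <p> inside the TEI <body>.

    Hypothetical helper for illustration; not part of the repository scripts.
    """
    root = ET.fromstring(tei_xml)
    paragraphs = []
    for p in root.iterfind(".//tei:body//tei:p", TEI_NS):
        # itertext() also yields text nested in child elements such as <name>;
        # split/join collapses line breaks and runs of whitespace.
        text = " ".join("".join(p.itertext()).split())
        if text:
            paragraphs.append(text)
    return paragraphs


# Made-up sample document, not taken from either corpus.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><titleStmt><title>Sample</title></titleStmt></fileDesc></teiHeader>
  <text><body>
    <p>Het schip <name>Batavia</name> vertrok
       uit Texel.</p>
    <p>  </p>
  </body></text>
</TEI>"""

if __name__ == "__main__":
    for paragraph in extract_paragraphs(SAMPLE):
        print(paragraph)
```

Text in the TEI header is skipped because the search is restricted to <body>, and paragraphs containing only whitespace are dropped.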
The script gm_indices.py extracts a lexicon of ship names, person names, locations and miscellaneous entries from the Generale Missiven.
Two scripts are provided for extracting indices from the Van Dam corpus:
- html_vandam_indices.py extracts indices for person names, locations and miscellaneous entries from HTML-derived files;
- xml_vandam_indices.py extracts indices from an XML index of person names.
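
To give an idea of what extracting an index of person names from XML can look like, the sketch below builds a mapping from each name to the sorted page numbers on which it occurs. It is not the code in xml_vandam_indices.py. The element names (index, entry, name, page), the sample data, and the function build_person_index are hypothetical and chosen only for illustration; the real Van Dam index may use a different schema.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict
from typing import Dict, List


def build_person_index(index_xml: str) -> Dict[str, List[int]]:
    """Map each person name to a sorted list of page numbers.

    Hypothetical helper; the <entry>/<name>/<page> layout is an assumption.
    """
    root = ET.fromstring(index_xml)
    pages_by_name = defaultdict(set)
    for entry in root.iter("entry"):
        # Normalise whitespace so spelling-identical names are merged.
        name = " ".join((entry.findtext("name") or "").split())
        if not name:
            continue
        for page in entry.iter("page"):
            value = (page.text or "").strip()
            if value.isdigit():
                pages_by_name[name].add(int(value))
    return {name: sorted(pages) for name, pages in pages_by_name.items()}


# Made-up sample index, not taken from the Van Dam corpus.
SAMPLE_INDEX = """<index>
  <entry><name>Coen, Jan Pietersz.</name><page>12</page><page>45</page></entry>
  <entry><name>Coen,  Jan Pietersz.</name><page>12</page><page>7</page></entry>
  <entry><name>Diemen, Antonio van</name><page>103</page></entry>
</index>"""

if __name__ == "__main__":
    for name, pages in build_person_index(SAMPLE_INDEX).items():
        print(name, pages)
```

Using a set per name removes duplicate page references when the same person appears in more than one entry.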