
cltl/clariah-voc-scripts


Clariah+ VOC scripts

Scripts for text and index extraction from TEI files for the Clariah+ VOC use case.

Text extraction

The scripts pretei_gm_extracter.py and vandam_extracter.py extract text from preTEI files for the Generale Missiven corpus and from TEI files for the Van Dam corpus, respectively.
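
As a rough illustration of what extracting text from a TEI file involves, the sketch below collects paragraph text from a TEI document using Python's standard library. It is not the code of either script: the function name extract_paragraphs and the sample document are hypothetical, and it assumes the text sits in `<p>` elements under `<body>` in the standard TEI namespace, which may not match the layout of the actual corpus files.

```python
import xml.etree.ElementTree as ET

# Standard TEI namespace; assumed here, the corpus files may differ.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def extract_paragraphs(tei_xml: str) -> list[str]:
    """Return the text of each <p> under <body>, with whitespace normalised.

    Hypothetical helper for illustration only.
    """
    root = ET.fromstring(tei_xml)
    paragraphs = []
    for p in root.iterfind(".//tei:body//tei:p", TEI_NS):
        # itertext() also yields the text of inline child elements such as <hi>.
        text = " ".join("".join(p.itertext()).split())
        if text:
            paragraphs.append(text)
    return paragraphs


# Invented sample document, not taken from either corpus.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><titleStmt><title>Sample</title></titleStmt></fileDesc></teiHeader>
  <text><body>
    <p>First   paragraph with <hi>inline</hi> markup.</p>
    <p>Second
       paragraph.</p>
  </body></text>
</TEI>"""

if __name__ == "__main__":
    for paragraph in extract_paragraphs(SAMPLE):
        print(paragraph)
```

Restricting the search to `<body>` keeps header metadata such as the title out of the extracted text.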

Index extraction

The script gm_indices.py extracts a lexicon of ship names, person names, locations and miscellaneous entries from the Generale Missiven.

Two scripts are provided for extracting indices from the Van Dam corpus:

  • html_vandam_indices.py extracts indices from HTML-derived files for person names, locations and miscellaneous entries;
  • xml_vandam_indices.py extracts indices from an XML index of person names.
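
To give an idea of how index entries could be turned into a lexicon grouped by category, here is a minimal sketch. It does not reproduce any of the scripts above: the function build_lexicon, the `<index>` and `<entry>` elements and the `type` attribute are all invented for illustration, since the real index formats are not described here.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def build_lexicon(index_xml: str) -> dict[str, list[str]]:
    """Group index entries by category and return sorted, de-duplicated names.

    Hypothetical helper; the <entry type="..."> layout is an assumption.
    """
    root = ET.fromstring(index_xml)
    lexicon = defaultdict(set)
    for entry in root.iter("entry"):
        # Entries without a type attribute fall into a catch-all category.
        category = entry.get("type", "misc")
        name = " ".join("".join(entry.itertext()).split())
        if name:
            lexicon[category].add(name)
    return {cat: sorted(names) for cat, names in sorted(lexicon.items())}


# Invented placeholder entries, not taken from the Van Dam indices.
SAMPLE_INDEX = """<index>
  <entry type="person">Example Person</entry>
  <entry type="location">Example Place</entry>
  <entry type="person">Another  Person</entry>
  <entry type="person">Example Person</entry>
  <entry>Uncategorised item</entry>
</index>"""

if __name__ == "__main__":
    for category, names in build_lexicon(SAMPLE_INDEX).items():
        print(category, names)
```

Using a set per category removes entries that occur more than once in the index, and sorting makes the output stable between runs.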
