A general-purpose processing framework for corpora of scientific documents
- Safe and speedy Rust implementation
- Distributed processing and streaming data transfers via ZeroMQ
- Backend support for Document (via FileSystem), Annotation (via ?) and Task (via PostgreSQL ≥9.5) provenance.
- Representation-aware and -independent (TeX, HTML+RDFa, ePub, TEI, JATS, ...)
- Automatic dependency management of registered Services (TODO)
- Powerful workflow management and development support through the CorTeX web interface
- Supports multi-corpora multi-service installations
- Centralized storage with distributed computing, designed to enable collaboration across institutional and national borders.
- Routinely tested on 1 million scientific TeX papers from arXiv.org
- Originally motivated by the desire to process any Cor-pus of TeX documents.
- Rust reimplementation of the original Perl CorTeX stack.
- Builds on the expertise developed during the arXMLiv project at Jacobs University.
- In particular, CorTeX is a successor to the build system originally developed by Heinrich Stamerjohanns.
- The tiered architecture, geared towards generic processing with conversion, analysis and aggregation services, was motivated by the LLaMaPUn project at Jacobs University.
- The messaging conventions are motivated by work on standardizing LaTeXML's log reports with Bruce Miller.
Disclaimer: This repository has recently undergone its first stability runs. We have converted ~1 million articles from arXiv.org with this implementation, and consider the CorTeX job manager largely stable. The backend could still benefit from using an ORM such as diesel.rs, and setting up the various framework tasks still requires manual intervention that is imperfectly documented, so we would not advise deploying the repository for third-party use just yet. However, bug reports and pull requests with enhancements are most welcome and encouraged!