A general purpose processing framework for corpora of scientific documents
Nightly Rust required: minimal supported version currently
- Safe and speedy Rust implementation
- Distributed processing and streaming data transfers via ZeroMQ
- Backend support for Document (via FileSystem), Annotation (via ?) and Task (via PostgreSQL ≥9.5) provenance.
- Representation-aware and -independent (TeX, HTML+RDFa, ePub, TEI, JATS, ...)
- Automatic dependency management of registered Services (TODO)
- Powerful workflow management and development support through the CorTeX web interface
- Supports multi-corpora multi-service installations
- Centralized storage with distributed computing, designed to enable collaboration across institutional and national borders.
- Routinely tested on 1 million scientific TeX papers from arXiv.org
- Minimal dashboard frontend written in Rocket
- Originally motivated by the desire to process any corpus of TeX documents, hence the name (Cor-pus of TeX).
- Rust reimplementation of the original Perl CorTeX stack.
- Builds on the expertise developed during the arXMLiv project at Jacobs University.
- In particular, CorTeX is a successor to the build system originally developed by Heinrich Stamerjohanns.
- The tiered architecture, geared towards generic processing with conversion, analysis and aggregation services, was motivated by the LLaMaPUn project at Jacobs University.
- The messaging conventions are motivated by work on standardizing LaTeXML's log reports with Bruce Miller.
- The CorTeX framework regularly converts more than 1 million articles from arXiv.org.
- We consider the CorTeX job manager largely stable.
- The backend has recently been rewritten in diesel.rs, and the frontend has recently been rewritten in rocket.rs. Both are being retested in production in the last days of 2017.
- The setup of the various framework tasks still requires (imperfectly documented) manual intervention, so I would not advise deploying the repository for third-party use just yet.
- However, both bug reports and pull requests with enhancements are most welcome and encouraged!
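
To illustrate the messaging conventions mentioned above, here is a minimal sketch of parsing one log line in the LaTeXML style. It assumes a `Severity:category:what details` layout and the four severity names shown; the `Severity`, `LogMessage` and `parse_log_line` names are hypothetical and are not taken from the CorTeX codebase.

```rust
/// Severity levels, assumed to mirror LaTeXML's log vocabulary.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Severity {
    Info,
    Warning,
    Error,
    Fatal,
}

/// One parsed log message (hypothetical structure, for illustration only).
#[derive(Debug, PartialEq, Eq)]
pub struct LogMessage {
    pub severity: Severity,
    pub category: String,
    pub what: String,
    pub details: String,
}

/// Parses a line of the assumed form `Severity:category:what details`.
/// Returns `None` for lines that do not follow that layout.
pub fn parse_log_line(line: &str) -> Option<LogMessage> {
    // Split into at most three fields so colons inside the tail are preserved.
    let mut parts = line.trim().splitn(3, ':');
    let severity = match parts.next()? {
        "Info" => Severity::Info,
        "Warning" => Severity::Warning,
        "Error" => Severity::Error,
        "Fatal" => Severity::Fatal,
        _ => return None,
    };
    let category = parts.next()?.trim();
    if category.is_empty() {
        return None;
    }
    // The first whitespace-delimited token is the `what`; the rest is free text.
    let rest = parts.next()?.trim_start();
    let (what, details) = match rest.split_once(char::is_whitespace) {
        Some((w, d)) => (w, d.trim()),
        None => (rest, ""),
    };
    Some(LogMessage {
        severity,
        category: category.to_string(),
        what: what.to_string(),
        details: details.to_string(),
    })
}
```

Under these assumptions, a line such as `Warning:missing_file:foo.sty Can't find package foo` would yield the severity `Warning`, the category `missing_file`, the `what` value `foo.sty`, and the remaining text as details, while a line without a recognized severity prefix is skipped.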