CorTeX Framework

A general purpose processing framework for corpora of scientific documents

NEWS: The first datasets produced by CorTeX are now available for reuse via the SIGMathLing interest group; see the resource section.

Nightly Rust required: the minimal supported version is currently 1.30.0-nightly (2018-08-26)

Features:

  • Safe and speedy Rust implementation
  • Distributed processing and streaming data transfers via ZeroMQ
  • Backend support for Document (via FileSystem), Annotation (via ?) and Task (via PostgreSQL ≥9.5) provenance.
  • Representation-aware and -independent (TeX, HTML+RDFa, ePub, TEI, JATS, ...)
  • Automatic dependency management of registered Services (TODO)
  • Powerful workflow management and development support through the CorTeX web interface
  • Supports multi-corpora multi-service installations
  • Centralized storage with distributed computing, designed to enable collaboration across institutional and national borders
  • Routinely tested on 1 million scientific TeX papers from arXiv.org
  • Minimal dashboard frontend written in Rocket

History:

  • Originally motivated by the desire to process any Cor-pus of TeX documents.
  • Rust reimplementation of the original Perl CorTeX stack.
  • Builds on the expertise developed during the arXMLiv project at Jacobs University.
  • In particular, CorTeX is a successor to the build system originally developed by Heinrich Stamerjohanns.
  • The tiered architecture, aimed at generic processing with conversion, analysis and aggregation services, was motivated by the LLaMaPUn project at Jacobs University.
  • The messaging conventions are motivated by work on standardizing LaTeXML's log reports with Bruce Miller.

For more details, consult the Installation instructions and the Manual.


Disclaimer:

  • The CorTeX framework regularly converts more than 1 million articles from arXiv.org.
  • We consider the CorTeX job manager largely stable.
  • The backend has recently been rewritten in diesel.rs, and the frontend has recently been rewritten in rocket.rs. Both are being retested in production in the last days of 2017.
  • Setting up the various framework tasks still requires manual intervention that is imperfectly documented, so I would not advise deploying the repository for third-party use just yet.
  • However, both bug reports and pull requests with enhancements are most welcome and encouraged!