
CorTeX Framework

A general purpose processing framework for corpora of scientific documents


Features:

  • Safe and speedy Rust implementation
  • Distributed processing and streaming data transfers via ZeroMQ
  • Backend support for Document (via FileSystem), Annotation (via ?) and Task (via PostgreSQL ≥9.5) provenance.
  • Representation-aware and -independent (TeX, HTML+RDFa, ePub, TEI, JATS, ...)
  • Automatic dependency management of registered Services (TODO)
  • Powerful workflow management and development support through the CorTeX web interface
  • Supports multi-corpora multi-service installations
  • Centralized storage with distributed computing, designed to enable collaboration across institutional and national borders
  • Routinely tested on 1 million scientific TeX papers from arXiv.org

History:

  • Originally motivated by the desire to process any Cor-pus of TeX documents.
  • Rust reimplementation of the original Perl CorTeX stack.
  • Builds on the expertise developed during the arXMLiv project at Jacobs University.
  • In particular, CorTeX is a successor to the build system originally developed by Heinrich Stamerjohanns.
  • The tiered architecture, which supports generic processing through conversion, analysis and aggregation services, was motivated by the LLaMaPUn project at Jacobs University.
  • The messaging conventions are motivated by work on standardizing LaTeXML's log reports with Bruce Miller.

For more details, consult the Installation instructions and the Manual.


Disclaimer: This repository has recently undergone its first stability runs. We have converted ~1 million articles from arXiv.org with this implementation and consider the CorTeX job manager largely stable. The backend could still benefit from an ORM such as diesel.rs, and setting up the various framework tasks still requires (imperfectly documented) manual intervention, so we do not advise deploying the repository for third-party use just yet. However, bug reports and pull requests with enhancements are most welcome and encouraged!
