Arvados


Running CWL on Arvados

See Running Common Workflow Language (CWL) workflows on Arvados

About Arvados

The Arvados Project is an open-source bioinformatics data analysis platform, funded by the Boston-based startup [Curoverse](https://curoverse.com/). It is designed to be scalable and runs on both local and cloud-based systems.

The Arvados core is a platform for production data science with very large data sets. It is made up of two major systems and a number of related services and components including APIs, SDKs, and visual tools.

Keep

Keep is a content-addressable storage system for managing and storing large collections of files with durable, cryptographically verifiable references and high-throughput processing. Keep works on a wide range of underlying file systems.
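
To make the content addressing concrete, here is a minimal Python sketch of the scheme (an illustration, not Arvados SDK code). Because a block's name is derived from its contents, any reader can re-hash the data and verify the reference; Keep's block locators follow roughly this MD5-plus-size pattern.

```python
import hashlib

def block_locator(data: bytes) -> str:
    """Derive a Keep-style locator: MD5 of the content plus its size.

    The name is a function of the content, so the reference is
    verifiable: re-hash the data and compare against the locator.
    """
    return f"{hashlib.md5(data).hexdigest()}+{len(data)}"

def verify(data: bytes, locator: str) -> bool:
    """Check that a block's contents still match its locator."""
    return block_locator(data) == locator

if __name__ == "__main__":
    block = b"ACGTACGT\n"
    loc = block_locator(block)
    print(loc)            # "<md5 hex digest>+9"
    assert verify(block, loc)
```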

Crunch

Crunch is a containerized workflow engine for running complex, multi-part pipelines or workflows in a way that is flexible, scalable, and supports versioning, reproducibility, and provenance. Crunch runs in virtualized computing environments.

The "core platform" has genomic specific projects such as Lightning (real-time queries and machine learning with population genomic datasets) and Tapestry & GET-Evidence (web apps for managing open science research studies, in particular those collecting genomic data, for personal genomics projects worldwide).

Goals

via Arvados Dev wiki: Computation and Pipeline Processing

The Arvados dev pages list "notable goals and features" of the project as:

  • Make use of multiple cores and nodes to produce results faster
  • Integrate with Keep and git repositories to maintain provenance
  • Use off-the-shelf software tools in distributed computations
  • Efficient over a wide range of problem sizes
  • Maximum flexibility of programming language choice
  • Maximum flexibility of execution environment
  • Tools for building reusable pipelines
  • Lower entry barrier for users

Pipeline processing

Whereas Hadoop was initially designed to work with web data, Arvados uses "a purpose-built MapReduce engine that is optimized for analysis of biomedical data" and "approaches MapReduce from the perspective of a bioinformatician". [Bio]informatics problems are typically carried out as sequences of analysis and data transformation steps.

Each Arvados MapReduce job contains sets of job tasks which can be computed independently of one another, and can therefore be scheduled asynchronously according to available compute resources. Typically, jobs split input data into chunks which can be processed independently, and run a task for each chunk. This works well for genomic data.
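
As an illustration of that chunk-per-task pattern, here is a generic Python sketch (not the Arvados SDK; the splitting logic and the toy "analysis" are stand-ins). Each chunk becomes an independent task, so the pool is free to schedule them as resources allow:

```python
from concurrent.futures import ProcessPoolExecutor

def split_into_chunks(records, chunk_size):
    """Split input data into independently processable chunks."""
    for i in range(0, len(records), chunk_size):
        yield records[i:i + chunk_size]

def process_chunk(chunk):
    """One task: no communication with other tasks, so the scheduler
    can run it whenever a compute slot is available."""
    return [r.upper() for r in chunk]  # stand-in for real analysis

if __name__ == "__main__":
    reads = ["acgt", "ttga", "ccat", "gggc"]
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(process_chunk, split_into_chunks(reads, 2)))
    print(results)  # [['ACGT', 'TTGA'], ['CCAT', 'GGGC']]
```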

Arvados does not make a distinction between “map” and “reduce” tasks or provide synchronous communication paths between tasks. However, a job can establish sequencing constraints to achieve a similar result (i.e., ensure that all map tasks have completed before a reduce task starts). In practice, the “reduce” stages of genomic analyses tend to be so simple that there is little to gain by introducing the complexity of scheduling and real-time communication between map and reduce tasks.
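
One way to picture the sequencing constraint (again a generic sketch, not Crunch itself): tag each task with a sequence number and release level n+1 only when every task at level n has finished. This reproduces the map-before-reduce barrier without any dedicated map/reduce machinery or real-time communication.

```python
from collections import defaultdict

def run_with_sequencing(tasks):
    """tasks: list of (sequence, callable) pairs. All tasks at sequence n
    complete before any task at sequence n+1 starts."""
    by_level = defaultdict(list)
    for seq, fn in tasks:
        by_level[seq].append(fn)
    results = []
    for seq in sorted(by_level):
        # Tasks within one level are independent and could run in parallel;
        # the barrier exists only between levels.
        results.extend(fn() for fn in by_level[seq])
    return results

# Four "map"-like tasks at sequence 0, one "reduce"-like task at sequence 1.
partials = []

def make_square_task(i):
    def task():
        partials.append(i * i)
        return i * i
    return task

tasks = [(0, make_square_task(i)) for i in range(4)] + [(1, lambda: sum(partials))]
print(run_with_sequencing(tasks))  # [0, 1, 4, 9, 14]
```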

  • Pipeline - set of related MapReduce jobs (related in terms of I/O transfer, e.g. BWA to GATK)
  • Pipeline template - JSON description of job relationships (e.g. job A output = job B input), analogous to a Makefile (see the sketch after this list)
    • constructed as a set of pipeline components, each of which designates name of a job script, data inputs and parameters
    • data inputs are specified as Keep content addresses
    • job scripts are stored in a Git repo, referenced using commit hash/tag
    • parameters and inputs can be defined in template, or upon invoking pipeline (a "pipeline instance")
  • There is no constraint/verification of pipeline logic - nothing prevents use of a different pipeline manager or a manual build
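
A hypothetical template, written here as a Python dict for readability (real templates are JSON; the field names below follow the description above rather than the exact Arvados schema):

```python
import json

# Illustrative shape of a pipeline template: named components, each pointing
# at a job script in a Git repo (pinned by commit hash) with its inputs.
pipeline_template = {
    "name": "bwa-then-gatk",
    "components": {
        "align": {
            "repository": "example/pipeline-scripts",  # Git repo holding the job script
            "script": "run-bwa.py",
            "script_version": "a1b2c3d",               # commit hash pins the exact code
            "script_parameters": {
                # Data inputs are Keep content addresses (illustrative value).
                "reads": "d41d8cd98f00b204e9800998ecf8427e+0",
            },
        },
        "call-variants": {
            "repository": "example/pipeline-scripts",
            "script": "run-gatk.py",
            "script_version": "a1b2c3d",
            "script_parameters": {
                # Output of "align" becomes the input here, Makefile-style.
                "alignments": {"output_of": "align"},
            },
        },
    },
}

print(json.dumps(pipeline_template, indent=2))
```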

The pipeline runner:

  • determines whether dependencies are satisfied (like Make does)
  • submits new jobs
  • waits for jobs to finish
    • may repeat the above until successful at each pipeline component

If an identical/equivalent job has already run, the pipeline uses the output of the existing job rather than submitting it anew - a faster run and more efficient use of compute resources. This behaviour can be overridden by the client when repetition is desirable.
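
A compressed sketch of that runner loop, including the job-reuse check (the helper functions are hypothetical stand-ins for the Arvados API, not its real interface):

```python
def run_pipeline(components, find_equivalent_job, submit_job, wait_for):
    """Resolve dependencies Make-style, reusing equivalent prior jobs.

    components maps names to descriptions that may list prerequisites
    under "depends_on"; the three callables are hypothetical hooks.
    """
    outputs = {}
    pending = dict(components)
    while pending:
        progressed = False
        for name, comp in list(pending.items()):
            if not all(dep in outputs for dep in comp.get("depends_on", [])):
                continue  # dependencies not yet satisfied; retry later
            # Reuse the output of an identical/equivalent earlier job
            # instead of submitting anew (the client could override this).
            job = find_equivalent_job(comp) or submit_job(comp)
            outputs[name] = wait_for(job)  # block until the job finishes
            del pending[name]
            progressed = True
        if not progressed:
            raise RuntimeError("circular or unsatisfiable dependencies")
    return outputs
```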

The Arvados job dispatcher processes submitted jobs:

  • executes each task
  • enforces task sequence order and resource constraints [as dictated by the job]
  • checks process exit codes and other failure indicators
    • re-attempts failed tasks when needed
  • stores status updates in the Arvados system database as the job progresses
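
The per-task behaviour can be pictured with a simplified sketch (hypothetical names; not the real dispatcher code): run the task, check the exit code, retry on failure, and record status as it goes.

```python
import subprocess

def dispatch_task(command, record_status, max_attempts=3):
    """Run one task, checking its exit code and retrying on failure,
    while recording status updates (stand-in for the system database)."""
    for attempt in range(1, max_attempts + 1):
        record_status(f"attempt {attempt}: running {command}")
        result = subprocess.run(command)
        if result.returncode == 0:
            record_status("complete")
            return True
        record_status(f"attempt {attempt}: failed (exit {result.returncode})")
    record_status("failed permanently")
    return False

dispatch_task(["true"], print)  # trivial example: exits 0 on first attempt
```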

Each pipeline instance (and job) is recorded in the system database to preserve provenance and to aid in reproducing jobs even long after the initial run:

  • the job manager records runtime details (e.g. commit hash for the job script, compute node OS version, ...) for reproducibility "as a rule rather than an exception"
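
For instance, a recorded runtime-details entry might look like this (illustrative fields and values only, not the exact schema):

```python
# Hypothetical provenance record for one job run.
provenance_record = {
    "job_uuid": "zzzzz-8i9sb-0123456789abcde",          # made-up identifier
    "script_version": "a1b2c3d",                        # exact commit of the job script
    "node_os_version": "Debian GNU/Linux 8",
    "inputs": ["d41d8cd98f00b204e9800998ecf8427e+0"],   # Keep content addresses
    "started_at": "2015-11-12T00:00:00Z",
}
```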

Arvados is "language and tool neutral":

  • suitable for everything "from binary-only tools or in-house C programs", using universal APIs such as HTTP and UNIX pipes
  • job scripts can be written in any language and run in a normal UNIX environment

Suggested benefits

  • Efficient processing of small tasks - Arvados MapReduce has very low task latency, making it practical to use for even very short single-task jobs. This makes it feasible for users and applications to routinely do all computations in MapReduce and thereby achieve the benefits of complete provenance, reproducibility, and scalability.
  • Node-level resource allocation - Arvados MapReduce uses a node as the basic computing resource unit: a compute node runs multiple asynchronous tasks, but only accepts tasks from one job at a time. This gives each job the flexibility to allocate CPU and RAM resources among its tasks to best suit the work being done, and avoids interference and resource competition between unrelated job tasks.
