bionode-watermill


Watermill: A Streaming Workflow Engine


Watermill lets you orchestrate tasks using operators like join, junction, and fork. Each task has a lifecycle:

  1. Input glob patterns are resolved to absolute file paths (e.g. *.bam to reads.bam); a resolution sketch follows this list.
  2. The operation is run, passed the resolved input, params, and other props.
  3. The operation completes.
  4. Output glob patterns are resolved to absolute file paths.
  5. Validators are run over the output. By default these check for non-null files; custom validators can also be passed in.
  6. Post-validations are run, and the task and its output are added to the DAG.
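To make step 1 concrete, here is a minimal sketch of glob resolution using the npm glob package. This is an illustration only, not watermill's internal resolver (which also searches the outputs of upstream tasks), and the directory path is made up:

const glob = require('glob')

// Resolve an input pattern such as '*.bam' to absolute paths,
// e.g. [ '/data/run1/reads.bam' ] (hypothetical directory)
const resolved = glob.sync('*.bam', { cwd: '/data/run1', absolute: true })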

CWL?

Coming soon.

What is a task?

A task is the fundamental unit from which pipelines are built. For more details, see Task. At a glance, a task is created by passing in props and an operationCreator, which will later be called with the resolved input. Consider this task, which takes a "lowercase" file and creates an "uppercase" one:

const { task } = require('bionode-watermill')
const fs = require('fs')
const through = require('through2')

const uppercase = task({
  input: '*.lowercase',
  output: '*.uppercase'
}, function (resolvedProps) {
  const input = resolvedProps.input

  return fs.createReadStream(input)
    .pipe(through(function (chunk, enc, next) {
      // Uppercase each chunk as it streams through
      next(null, chunk.toString().toUpperCase())
    }))
    .pipe(fs.createWriteStream(input.replace(/lowercase$/, 'uppercase')))
})

A "task declaration" like above will not immediately run the task. Instead, the task declaration returns an "invocable task" that can either be called directly or used with an orchestration operator. Tasks can also be created to run shell programs:

const fastqDump = task({
  input: '**/*.sra',
  output: [1, 2].map(n => `*_${n}.fastq.gz`),
  name: 'fastq-dump **/*.sra'
}, ({ input }) => `fastq-dump --split-files --skip-technical --gzip ${input}` )
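Because a task declaration returns a plain function, it can be invoked on its own; a minimal sketch (what the call returns, a stream or a promise, is defined by watermill, so nothing is done with the return value here):

// Invoke the task directly; its lifecycle (resolve input, run
// fastq-dump, resolve and validate output) starts immediately.
fastqDump()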

What are orchestrators?

Orchestrators are functions that take tasks as parameters, letting you compose your pipeline from a high-level view. This separates task order from task declaration. For more details, see Orchestration. At a glance, here is a complex usage of join, junction, and fork:

const pipeline = join(
  junction(
    join(getReference, bwaIndex),
    join(getSamples, fastqDump)
  ),
  trim, mergeTrimEnds,
  decompressReference, // only b/c mpileup did not like fna.gz
  join(
    fork(filterKMC, filterKHMER),
    alignAndSort, samtoolsIndex, mpileupAndCall // 2 instances each of these
  )
)
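The composed pipeline is itself an invocable task and is run the same way; a minimal sketch:

// Run the whole pipeline. The fork above duplicates the downstream
// tasks, so alignAndSort, samtoolsIndex, and mpileupAndCall each run
// once per filter branch.
pipeline()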

Check out the bionode-watermill tutorial!

Example pipelines

Why bionode-watermill?

This blog post compares the tools available for handling NGS workflows, explaining the advantages of each, including bionode-watermill.

Who is this tool for?

Bionode-watermill is for programmers who want an efficient and easy-to-write way to develop complex, dynamic data pipelines that parallelize as much as possible. It is an npm module, accessible to anyone willing to learn a little JavaScript. This is in contrast to other tools that introduce their own DSL (domain-specific language), which is not useful outside the tool. By leveraging the npm ecosystem and JavaScript on the client, bionode-watermill can be built upon for inclusion in web APIs, modern web applications, and native applications through Electron. Look forward to seeing Galaxy-like applications backed by a completely configurable Node API.

Bionode-watermill is for biologists who understand the importance of experimenting with sample data, parameter values, and tools. Compared to other workflow systems, swapping parameters and tools is much easier, letting you iteratively compare results and construct more confident inferences. Consider the ability to build your own Teaser for your data with a simple syntax, while getting the utmost performance out of the box.
