Workflows

Naohisa Goto edited this page Jun 14, 2016 · 1 revision

Scientific Workflow

Bioinformatics tasks often require combining several different programs. The most basic tool for building such a scientific workflow is the shell script. But now we have Ruby. Ruby-based workflow systems offer well-organized workflow descriptions, easier maintenance, and better performance, including parallel execution.

Workflow Systems

Pwrake

https://github.com/masa16/Pwrake/

Pwrake is developed by Masahiro Tanaka at the University of Tsukuba. He is also the author of "NArray", a very fast matrix calculation engine for Ruby. Pwrake uses the same syntax as regular Rake, but it automatically detects workflow steps that can be run in parallel.
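
As a rough illustration of that compatibility, the sketch below is an ordinary Rakefile-style script. The sample names, the file extensions, and the use of `wc` and `cat` as stand-ins for real analysis tools are assumptions made for this example and are not taken from Pwrake's documentation. The per-sample tasks do not depend on each other, which is the kind of independence a parallel runner can exploit.

```ruby
require "rake" # the Rake DSL; a real Rakefile gets this automatically

# Hypothetical sample names; each is expected to exist as <name>.fastq.
samples = %w[sample_a sample_b sample_c]
counts  = samples.map { |name| "#{name}.count" }

# One rule covers every sample. The generated tasks do not depend on each
# other, so they are candidates for running at the same time.
rule ".count" => ".fastq" do |t|
  # `wc -l` is only a stand-in for an external tool such as an aligner.
  sh "wc -l < #{t.source} > #{t.name}"
end

# The merge step lists every per-sample result as a prerequisite, so it can
# start only after all of them have finished.
file "merged.txt" => counts do |t|
  sh "cat #{t.prerequisites.join(' ')} > #{t.name}"
end

task default: "merged.txt"
```

Saved as a Rakefile (the `require` line is then unnecessary), this runs serially under `rake`. Run under `pwrake` instead, the three counting steps could be dispatched to separate workers while the merge step still waits for all of them.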

Pwrake invokes processes via ssh and supports the Gfarm large-scale distributed filesystem. Of course, it works well on a multi-processor Linux box.

Although Pwrake was developed for astronomy, its goals are shared by bioinformatics.

These suggestions come from Hiroyuki MISHIMA: "I think that some helper methods may simplify Rakefiles for bioinformatics, and such helper methods are good for a BioRuby plugin. Pwrake's parallelization model is process based. Because I am just a user of bioinformatics packages (like BWA/GATK/DINDEL etc..), it is what I need."
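
One possible shape for such a helper method is sketched below. `conversion_step` is a hypothetical name invented for this example; it is not part of Rake, Pwrake, or BioRuby, and `wc -l` only stands in for a real package. The idea is that each external command becomes a one-line declaration in the Rakefile.

```ruby
require "rake"
require "shellwords"

# Hypothetical helper (not part of Rake, Pwrake or BioRuby): declares one
# file-conversion step as a Rake rule. The block receives shell-escaped
# source and target paths and returns the command line to run.
def conversion_step(from_ext, to_ext, &command)
  rule to_ext => from_ext do |t|
    sh command.call(Shellwords.escape(t.source), Shellwords.escape(t.name))
  end
end

# `wc -l` stands in for a real package such as BWA or GATK.
conversion_step(".fastq", ".linecount") do |source, target|
  "wc -l < #{source} > #{target}"
end
```

Because every step is still an external process started through `sh`, a helper like this would fit the process-based model mentioned in the quote.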

  • Presentation at RubyConfX[1]
  • Presentation at PRAGMA18[2]

Parallel

https://github.com/grosser/parallel

Run any code in parallel processes (to use all CPUs) or threads (to speed up blocking operations). Best suited for map-reduce or, for example, parallel downloads/uploads.

Processes:

  • Speedup through multiple CPUs
  • Speedup for blocking operations
  • Protects global data
  • Extra memory used (very low on REE through copy_on_write_friendly)
  • Child processes are killed when your main process is killed through Ctrl+c or kill -2

Threads:

  • Speedup for blocking operations
  • Global data can be modified
  • No extra memory used

Processes and threads are workers: each one grabs the next piece of work when it finishes the previous one.
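
The worker model described above can be sketched with Ruby's standard library alone. This is not the Parallel gem's implementation, and `map_with_workers` is a name made up for the example. With the gem installed, the comparable call would be `Parallel.map(items, in_threads: 2) { ... }`, or `in_processes:` to use processes instead.

```ruby
# Stdlib-only sketch of a worker pool; `map_with_workers` is a hypothetical
# name and this is not how the Parallel gem is implemented.
def map_with_workers(items, worker_count, &work)
  queue = Queue.new
  items.each_with_index { |item, index| queue << [item, index] }
  queue.close # no more jobs: pop returns nil once the queue is drained

  results = Array.new(items.size)
  workers = Array.new(worker_count) do
    Thread.new do
      # Each worker grabs the next job as soon as it finishes the previous one.
      while (job = queue.pop)
        item, index = job
        results[index] = work.call(item) # threads share `results` with the caller
      end
    end
  end
  workers.each(&:join)
  results
end

lengths = map_with_workers(%w[ACGT GATTACA TTAGGG], 2) { |seq| seq.length }
# lengths => [4, 7, 6]
```

The shared `results` array shows the "global data can be modified" property of threads. Workers running as separate processes would each see their own copy, so results would have to be sent back to the parent explicitly.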

Thor

https://github.com/wycats/thor (see also the author's blog post: http://yehudakatz.com/2008/05/12/by-thors-hammer/)

Thor is "A scripting framework that replaces rake and sake."

Michael Barton pointed out on the BioRuby mailing list that "I believe this is also aimed at being a more modular rake-like tool. This is developed by Yehuda Katz and I think is used for the basis of few mainstream ruby command line tools (possibly the rails3 CLI? I'm not 100% about this.). I think you could expect Thor to be more mature and likely to be continually developed."

Boson

Michael Barton commented in a BioRuby mailing list post that "Boson commands, similar to rake tasks, are more modular and can be installed from the web into a ~/.boson directory. This has obvious advantages over a single rake file. Boson tasks can be chained together where the data is passed around in YAML format." See the GitHub repository: https://github.com/cldwalker/boson

Job-schedulers

Some workflow systems offer parallel execution directly. However, systems that specialize in job scheduling scale better because they manage resources more effectively. See also the articles on Wikipedia and Softpanorama.

GridEngine

Also known as SGE (Sun Grid Engine) or OGE (Oracle Grid Engine). See http://en.wikipedia.org/wiki/Oracle_Grid_Engine

Univa now appears to be continuing development of GridEngine (see the HPCwire article).

PBS

Portable Batch System. See http://en.wikipedia.org/wiki/Portable_Batch_System

GNUBatch

A simple batch scheduler written in standard C. See http://www.gnu.org/software/gnubatch/

Ruby Queue

Pjotr Prins introduced Ruby Queue (RQ) on the BioRuby mailing list: "rq - which can parallelize calculations with zero-administration (all that is needed is a shared dir)".

See also Pjotr's GitHub repository https://github.com/pjotrp/rq and an article in Linux Journal.