decorators_compendium.rst


Pipeline topologies and a compendium of Ruffus decorators

  • Manual Table of Contents <new_manual.table_of_contents>
  • decorators <decorators>

Overview

Computational pipelines transform your data in stages until the final result is produced.

You can visualise your pipeline data flowing like water down a system of pipes. Ruffus has many ways of joining up your pipes to create different topologies.

Note

The best way to design a pipeline is to:

  • Write down the file names of the data as it flows across your pipeline.
  • Draw lines between the file names to show how they should be connected together.

@transform <decorators.transform>

So far, our data files have been flowing through our pipelines independently in lockstep.

image

If we drew a graph of the data files moving through the pipeline, all of our flowcharts would look something like this.

The @transform <decorators.transform> decorator connects up your data files in 1 to 1 operations, ensuring that for every Input, a corresponding Output is generated, ready to go into the next pipeline stage. If we start with three sets of starting data, we end up with three final sets of results.
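
The 1 to 1 topology can be pictured in plain Python. This is only a sketch of the shape of the data flow, not Ruffus code: the file names, the `.start` and `.result` suffixes and the helper `one_to_one` are all invented for illustration.

```python
# Hypothetical starting files; any names would do.
inputs = ["a.start", "b.start", "c.start"]


def one_to_one(input_names, old_suffix, new_suffix):
    """Pair every Input with exactly one Output by swapping its suffix."""
    return [
        (name, name[: -len(old_suffix)] + new_suffix)
        for name in input_names
        if name.endswith(old_suffix)
    ]


jobs = one_to_one(inputs, ".start", ".result")
for input_name, output_name in jobs:
    print(input_name, "->", output_name)
```

Three Input names go in and three Output names come out, one job per pair.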

A bestiary of Ruffus decorators

Very often, we would like to transform our data in more complex ways. This is where the other Ruffus decorators come in.

image

@originate <decorators.originate>

  • Introduced in More on @transform-ing data and @originate <new_manual.transform_in_parallel>, @originate <decorators.originate> generates Output files from scratch, without the benefit of any Input files.

@merge <decorators.merge>

  • A many to one operator.
  • The last decorator at the far right of the figure, @merge <decorators.merge> merges multiple Input into one Output.

@split <decorators.split>

  • A one to many operator.
  • @split <decorators.split> is the evil twin of @merge <decorators.merge>. It takes a single set of Input and splits it into multiple smaller pieces.
  • The best part of @split <decorators.split> is that we don't necessarily have to decide ahead of time how many smaller pieces it should produce. If we encounter a larger file, we might need to split it up into more fragments for greater parallelism.
  • Since @split <decorators.split> is a one to many operator, if you pass it many inputs (e.g. via @transform <decorators.transform>), it performs an implicit @merge <decorators.merge> step to make one set of Input that you can redistribute into a different number of pieces. If you are looking to split each Input into further smaller fragments, then you need @subdivide <decorators.subdivide>.
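
The two properties described above (an implicit merge of many inputs, and a number of pieces that is only known at run time) can be sketched in plain Python. This is not the Ruffus API: the function `split_pooled` and its `max_per_piece` parameter are hypothetical names used only to show the topology.

```python
def split_pooled(input_sets, max_per_piece):
    """Pool every Input into one set, then cut it into as many pieces as needed."""
    # Implicit merge: all Inputs are combined before splitting.
    pooled = [record for input_set in input_sets for record in input_set]
    # The number of pieces depends on how much data turned up.
    return [
        pooled[start:start + max_per_piece]
        for start in range(0, len(pooled), max_per_piece)
    ]


# Three Inputs of different sizes are redistributed into pieces of at most 4.
pieces = split_pooled([[1, 2, 3], [4, 5], [6, 7, 8, 9]], max_per_piece=4)
print(len(pieces), "pieces:", pieces)
```

Note that the piece boundaries no longer line up with the original Inputs, which is exactly what distinguishes this from subdividing each Input separately.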

@subdivide <decorators.subdivide>

  • A many to even more operator.
  • It takes each of multiple Input, and further subdivides them.
  • Uses suffix() <decorators.suffix>, formatter() <decorators.formatter> or regex() <decorators.regex> to generate Output names from its Input files but, like @split <decorators.split>, we don't have to decide ahead of time how many smaller pieces each Input should be further divided into. For example, one large Input file might be subdivided into 7 pieces while the next job might split its Input into just 4 pieces.
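
As a rough plain-Python picture of this many to even more topology (again, not Ruffus code), each Input is divided on its own and the piece count varies with its size. The helper `subdivide_each`, the Input names and the record counts are assumptions chosen to reproduce the 7 and 4 pieces mentioned above.

```python
def subdivide_each(input_sets, max_per_piece):
    """Divide each Input separately; pieces never mix data from different Inputs."""
    return {
        name: [
            records[start:start + max_per_piece]
            for start in range(0, len(records), max_per_piece)
        ]
        for name, records in input_sets.items()
    }


# Hypothetical Inputs: one with 14 records, one with 8.
fragments = subdivide_each(
    {"large": list(range(14)), "small": list(range(8))},
    max_per_piece=2,
)
for name, parts in fragments.items():
    print(name, "->", len(parts), "pieces")
```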

@collate <decorators.collate>

  • A many to fewer operator.
  • @collate <decorators.collate> is the opposite twin of @subdivide <decorators.subdivide>: it takes multiple Input and groups or collates them into bundles of Output.
  • @collate <decorators.collate> uses formatter() <decorators.formatter> or regex() <decorators.regex> to generate Output names.
  • All Input files which map to the same Output are grouped together into one job (one task function call) which produces one Output.
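
The grouping rule in the last bullet can be illustrated with the standard `re` module. This is a sketch of the idea only, not how Ruffus implements it: the file names, the regular expression and the helper `collate_names` are invented for the example.

```python
import re
from collections import defaultdict

# Hypothetical Input files, named <animal>.<class>.txt
inputs = [
    "lion.mammal.txt",
    "tiger.mammal.txt",
    "eagle.bird.txt",
    "owl.bird.txt",
    "seal.mammal.txt",
]


def collate_names(input_names, pattern, output_template):
    """Group Inputs by the Output name that the pattern maps them to."""
    groups = defaultdict(list)
    for name in input_names:
        match = re.search(pattern, name)
        if match:
            # Inputs producing the same Output name land in the same job.
            groups[match.expand(output_template)].append(name)
    return dict(groups)


jobs = collate_names(inputs, r".+\.(\w+)\.txt$", r"\1.summary")
for output_name, grouped_inputs in jobs.items():
    print(output_name, "<-", grouped_inputs)
```

Five Inputs collapse into two jobs, one per distinct Output name.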

Combinatorics

More rarely, we need to generate a set of Output based on a combination or permutation or product of the Input.

For example, in bioinformatics, we might need to look for all instances of a set of genes in the genomes of a number of different species. In other words, we need to find the @product <decorators.product> of XXX genes x YYY species.

Ruffus provides decorators modelled on the "Combinatoric generators" in the itertools module of the Python standard library.

To use combinatoric decorators, you need to explicitly include them from Ruffus:

import ruffus
from ruffus import *
from ruffus.combinatorics import *

image

@product <decorators.product>

  • Given several sets of Input, it generates all versus all Output. For example, if there are four sets of Input files, @product <decorators.product> will generate WWW x XXX x YYY x ZZZ Output.
  • Uses formatter <decorators.transform> to generate unique Output names from components parsed from any parts of any specified files in all Input sets. In the above example, this allows the generation of WWW x XXX x YYY x ZZZ unique names.
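
Since these decorators are modelled on itertools, the all versus all pairing can be previewed with `itertools.product`. The gene and species names and the Output naming scheme below are placeholders, and this shows only which combinations arise, not how Ruffus builds its Output names.

```python
from itertools import product

# Hypothetical Input sets.
genes = ["geneA", "geneB"]
species = ["human", "mouse", "zebrafish"]

# One Output per (gene, species) pair: 2 x 3 = 6 in total.
outputs = [f"{gene}.vs.{sp}.result" for gene, sp in product(genes, species)]
for name in outputs:
    print(name)
```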

@combinations <decorators.combinations>

  • Given one set of Input, it generates the combinations of r-length tuples among them.
  • Uses formatter <decorators.transform> to generate unique Output names from components parsed from any parts of any specified files in all Input sets.
  • For example, given Input called A, B and C, it will generate: A-B, A-C, B-C
  • The order of Input items is ignored, so either A-B or B-A will be included, not both.
  • Self-vs-self combinations (A-A) are excluded.
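
The same pairs can be checked against the standard library function that this decorator is modelled on, here with r = 2. Only `itertools.combinations` is used; the `-` joined labels are just a display convention.

```python
from itertools import combinations

# All 2-length combinations of A, B and C: order ignored, no self-pairs.
pairs = ["-".join(pair) for pair in combinations("ABC", 2)]
print(pairs)  # ['A-B', 'A-C', 'B-C']
```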

@combinations_with_replacement <decorators.combinations_with_replacement>

  • Given one set of Input, it generates the combinations of r-length tuples among them but includes self-vs-self combinations.
  • Uses formatter <decorators.transform> to generate unique Output names from components parsed from any parts of any specified files in all Input sets.
  • For example, given Input called A, B and C, it will generate: A-A, A-B, A-C, B-B, B-C, C-C
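
The equivalent check with the standard library, again for r = 2, uses `itertools.combinations_with_replacement`.

```python
from itertools import combinations_with_replacement

# Order is still ignored, but an item may now be paired with itself.
pairs = ["-".join(pair) for pair in combinations_with_replacement("ABC", 2)]
print(pairs)  # ['A-A', 'A-B', 'A-C', 'B-B', 'B-C', 'C-C']
```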

@permutations <decorators.permutations>

  • Given one set of Input, it generates the permutations of r-length tuples among them. This excludes self-vs-self combinations but includes all orderings (A-B and B-A).
  • Uses formatter <decorators.transform> to generate unique Output names from components parsed from any parts of any specified files in all Input sets.
  • For example, given Input called A, B and C, it will generate: A-B, A-C, B-A, B-C, C-A, C-B
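
For r = 2, the standard library function `itertools.permutations` shows which pairs to expect: every ordering is present, and no item is paired with itself.

```python
from itertools import permutations

# All ordered 2-length arrangements of A, B and C, without self-pairs.
pairs = ["-".join(pair) for pair in permutations("ABC", 2)]
print(pairs)  # ['A-B', 'A-C', 'B-A', 'B-C', 'C-A', 'C-B']
```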