When writing a bioinformatics pipeline, new functionality or new parameters are frequently needed. This repository contains an example of how one might use the Python pipeline framework ruffus to iteratively develop a pipeline with minimal refactoring.
Python package requirements:
- PyYAML
- ruffus
- HTSeq
Other requirements (must be on the path):
- FASTX-toolkit
- bowtie
- samtools
- BEDTools
All example data, including fastq files and bowtie indexes, are in the test directory.
There are 3 pipelines of increasing complexity. They are separated into different directories, pipeline-1, pipeline-2, and pipeline-3, so they can be easily diffed to see what changes have been made between them. All pipelines share the original data in the test/ directory.
Each directory contains:
- a ruffus pipeline (pipeline.py)
- a config file in YAML format (config.yaml)
- a module of tasks (tasks.py)
- a module of helper functions and classes (helpers.py)
Each pipeline is run with:
./pipeline.py --config config.yaml -vv
A flowchart can be created with:
./pipeline.py --config config.yaml --flowchart flowchart.png
All results data are placed in the run1 subdirectory (this location is configured in each config.yaml file).
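As a rough illustration, the config file is just YAML loaded into a plain dictionary with PyYAML and handed to the tasks. The key names in the sketch below (output_dir, adapter, filter) are hypothetical stand-ins, not the repository's actual keys.

```python
# Hedged sketch: load a YAML config into the dict that gets passed to tasks.
# The key names below are illustrative guesses, not the real config.yaml keys.
import yaml

example_config = """
output_dir: run1                 # where result files are written
adapter: CTGTAGGCACCATCAAT       # hypothetical adapter sequence for clipping
filter: filter.bed               # hypothetical BED file of regions to mask out
"""

options = yaml.safe_load(example_config)
print(options["output_dir"])     # -> run1
```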
pipeline-1 has 2 tasks: mapping, then counting.
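In outline, a two-task ruffus chain of this shape looks roughly like the sketch below. The file names and task bodies are placeholders (the real tasks delegate to bowtie and HTSeq through tasks.py); only the decorator chaining reflects how pipeline-1 is wired.

```python
# Minimal sketch of a map-then-count chain in ruffus. Placeholder file names
# and bodies; the real pipeline-1 tasks call tasks.py, which runs bowtie/HTSeq.
from ruffus import files, transform, suffix, pipeline_run

open('sample.fastq', 'w').close()          # stand-in input so the sketch runs

@files('sample.fastq', 'sample.sam')
def map_reads(input_file, output_file):
    open(output_file, 'w').close()         # real mapping happens in tasks.py

@transform(map_reads, suffix('.sam'), '.counts')
def count(input_file, output_file):
    open(output_file, 'w').close()         # real counting happens in tasks.py

if __name__ == '__main__':
    pipeline_run([count], verbose=2)
```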
pipeline-2 adds an adapter clipping step before mapping. The changes from pipeline-1 are as follows:
- An "adapter" parameter was added to the config file.
- The clip() function was added to pipeline-2.py, and the downstream map() task in that same file was made to depend on the new clip() by moving the input FASTQ files decorator to clip() and adding a @transform decorator to map() (see the sketch after this list).
- The actual work gets done in tasks.clip(); since this function gets the options object, it will see the adapter that was added to the config file and will clip it.
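A rough sketch of that decorator change follows; names, suffixes, and the placeholder bodies are illustrative, and the real tasks delegate to tasks.py.

```python
# Sketch of the pipeline-2 rewiring: the input-files decorator now sits on
# clip(), and map_reads() becomes a @transform that depends on clip().
from ruffus import files, transform, suffix

@files('sample.fastq', 'sample.clipped.fastq')
def clip(input_file, output_file):
    # real work: tasks.clip(), which reads the adapter from the config dict
    open(output_file, 'w').close()

@transform(clip, suffix('.clipped.fastq'), '.sam')
def map_reads(input_file, output_file):
    # unchanged apart from its decorator, which now points at clip()
    open(output_file, 'w').close()
```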
pipeline-3 adds the ability to toggle an optional filtering step right after mapping so that (for example) reads in repetitive regions will be masked out when counting. The changes from pipeline-2 are as follows:
- A "filter bed" parameter was added to the config file.
- Conditional code is added to the pipeline that reads the config file and sets up the task flow as needed (see the sketch after this list). Specifically:
  - A new task is created that only runs if the "filter" parameter evaluates to True.
  - The "parent task" is assigned so that the downstream count() task will know which one to depend on.
  - The parent and result suffixes are assigned so that result filenames encode whether or not filtering occurred.
- The arguments to count()'s @transform decorator are modified to use the new variables for parent task, parent suffix, and result suffix.
- The actual work gets done in tasks.filter(). It deals with the awkwardness of converting SAM to BAM so that intersectBed can work with it, and then back again from BAM to SAM so that the pipeline can work the same whether or not filtering was requested.
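The conditional wiring can be sketched roughly as follows. The variable names (parent_task, parent_suffix), the config key, and the placeholder bodies are illustrative, not copied from pipeline-3/pipeline.py.

```python
# Sketch of toggling an optional filter task from the config. Names are
# illustrative; the real logic lives in pipeline-3/pipeline.py and tasks.py.
from ruffus import files, transform, suffix

options = {'filter': 'filter.bed'}   # hypothetical config dict; falsy value = no filtering

@files('sample.fastq', 'sample.sam')
def map_reads(input_file, output_file):
    open(output_file, 'w').close()   # real mapping in tasks.py

if options.get('filter'):
    @transform(map_reads, suffix('.sam'), '.filtered.sam')
    def filter_reads(input_file, output_file):
        # real work: tasks.filter() does SAM -> BAM, intersectBed, BAM -> SAM
        open(output_file, 'w').close()
    parent_task, parent_suffix = filter_reads, '.filtered.sam'
else:
    parent_task, parent_suffix = map_reads, '.sam'

@transform(parent_task, suffix(parent_suffix), '.counts')
def count(input_file, output_file):
    # count() itself never changes; only its decorator arguments vary
    open(output_file, 'w').close()
```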
Note in particular that the count task never changes, despite the new upstream clipping and filtering tasks. Even though things are happening upstream (including renaming of file extensions), the ruffus framework takes care of the dependencies and provides count with the right files. All count needs is a SAM file.
helpers.py contains useful functions and classes. One of these, the Result class, is used to pass information about calls that have been made. It holds input and output files, stdout and stderr, failure codes, the actual commands, and a log file. All of these are optional, since not all tasks will need them.
- The Result.report method prints a nice report for each task, including stdout/stderr/commands for debugging if the task failed.
- Functions in tasks.py return Result objects that encapsulate all the calling information.
- The @helpers.timeit decorator is used to attach the elapsed time of each task to the Result object it returns. This can be very helpful when benchmarking your pipeline (see the sketch after this list).
- Each function typically takes a config object -- in this case, a dictionary created from the config.yaml file. This simplifies calls from the pipeline to the tasks module, since you don't have to specify args and kwargs differently for each task.
- Each task defined in the pipeline.py file does minimal work -- typically it just passes the input/output files, plus a config object, to the relevant tasks.py task.
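To make the idea concrete, here is a very rough sketch of a Result-style class and a timeit decorator. The attribute names, the report layout, and the example clip task are assumptions; consult helpers.py and tasks.py for the real versions.

```python
# Hedged sketch of the Result/timeit ideas; attribute names and the report
# format are guesses, not the actual helpers.py implementation.
import time

class Result(object):
    def __init__(self, infiles=None, outfiles=None, cmds=None,
                 stdout=None, stderr=None, failed=False, elapsed=None):
        self.infiles, self.outfiles, self.cmds = infiles, outfiles, cmds
        self.stdout, self.stderr, self.failed = stdout, stderr, failed
        self.elapsed = elapsed

    def report(self):
        # short per-task summary; print debugging details only on failure
        status = 'FAILED' if self.failed else 'ok'
        print('%s -> %s [%s, %.2fs]' % (self.infiles, self.outfiles,
                                        status, self.elapsed or 0))
        if self.failed:
            print('cmds: %s\nstdout: %s\nstderr: %s'
                  % (self.cmds, self.stdout, self.stderr))

def timeit(func):
    # attach wall-clock elapsed time to the Result a task returns
    def wrapper(*args, **kwargs):
        t0 = time.time()
        result = func(*args, **kwargs)
        result.elapsed = time.time() - t0
        return result
    return wrapper

@timeit
def clip(input_file, output_file, config):
    # a real task would run fastx_clipper here using config['adapter']
    return Result(infiles=input_file, outfiles=output_file,
                  cmds='fastx_clipper ...')
```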