Merge pull request #24 from epigen/dev
Dev
nsheff committed Jan 23, 2017
2 parents a26a02c + 4c79bd8 commit 418776b
Showing 12 changed files with 453 additions and 250 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -2,7 +2,7 @@

[![Documentation Status](https://readthedocs.org/projects/pypiper/badge/?version=latest)](http://pypiper.readthedocs.org/en/latest/?badge=latest)

-A lightweight python toolkit for gluing together restartable, robust command line pipelines.
+A lightweight python toolkit for gluing together restartable, robust command line pipelines. Pypiper works with [Looper](http://github.com/epigen/looper) to form a complete bioinformatics pipeline framework.

# Links

6 changes: 4 additions & 2 deletions doc/source/advanced.rst
@@ -10,9 +10,9 @@ Pypiper includes optional "toolkits" (right now just one) -- suites of commonly

The ``follow`` argument
***********************
-*Follow functions* let you couple functions to run commands; the functions will only be run if the command is run. This lets you avoid running unnecessary processes repeatedly in the event that you restart your pipeline multiple times (for instance, while debugging later steps in the pipeline).
+*Follow functions* let you couple follow-up functions to commands; a follow function is run if and only if its command is run. This lets you avoid running unnecessary processes repeatedly in the event that you restart your pipeline multiple times (for instance, while debugging later steps in the pipeline).

-One of the really useful things in pypiper is that you can pass a python function (we call a "follow function") along in your call to ``run``, and python will call this *follow function* if and only if it runs your command. This is useful for data checks to make sure processes did what you expect. A good example is file conversions; after you convert from one format into another, it's good practice to make sure the conversion completed and you have the same number of lines, or whatever, in the final file. Pypiper's command locking mechanism solves most of this problem, but there are still lots of times you'd like to run a function but only once, right after the command that produced it runs. For example, just counting the number of lines in a file after producing it. You don't need to re-run the counting every time you restart the pipeline, which will skip that step since it is complete.
+This is useful for data QC checks to make sure processes did what you expect. For example, you might count the number of lines in a file after producing it, or count the number of reads that aligned right after an alignment step. You want the counting process coupled to the alignment process, and you don't need to re-run the counting every time you restart the pipeline.
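For illustration, a minimal sketch of a follow function that QCs a file conversion (the file names, the ``samtools`` command, and the stat key are hypothetical; ``follow`` is the ``run`` argument described above):

```python
import pypiper

pm = pypiper.PipelineManager(name="convert_demo", outfolder="pipeline_output/")

target = "pipeline_output/sample.sam"

def count_lines():
    # QC check: record how many lines the conversion produced.
    with open(target) as f:
        pm.report_result("sam_lines", sum(1 for _ in f))

# count_lines runs if and only if the conversion command itself runs; on a
# restart where the target already exists, both are skipped together.
pm.run("samtools view -h pipeline_output/sample.bam > " + target,
       target, follow=count_lines)

pm.stop_pipeline()
```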



@@ -47,6 +47,8 @@ Additional stuff (to be explained more thoroughly soon): You can also add some ot

The most significant of these special keywords is the ``config_file`` argument, which leads us to the concept of ``pipeline config files``:

+.. _pipeline_config_files:
+
Pipeline config files
*********************
Optionally, you may choose to conform to our standard for parameterizing pipelines, which enables you to use some powerful features of the pypiper system.
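For illustration, a sketch of how a pipeline might read such a config (the keys are hypothetical; the yaml content is shown in comments because Pypiper loads it for you and exposes it on the manager as ``pm.config``):

```python
# Hypothetical contents of my_pipeline.yaml, kept next to the pipeline script:
#
#   tools:
#     aligner: bowtie2
#   parameters:
#     min_quality: 30

import pypiper

pm = pypiper.PipelineManager(name="my_pipeline", outfolder="pipeline_output/")

# The parsed yaml is exposed as a nested AttributeDict on the manager:
print(pm.config.tools.aligner)           # "bowtie2"
print(pm.config.parameters.min_quality)  # 30

pm.stop_pipeline()
```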
13 changes: 7 additions & 6 deletions doc/source/best-practices.rst
@@ -4,14 +4,15 @@ Best practices

Here are some guidelines for how you can design the most effective pipelines.

-* **Parameterize your pipeline** (from the beginning). Pypiper makes it painfully easy to use a config file to make your pipeline configurable. Start from the very beginning by making a ``yaml`` pipeline config file.

-* **Use flexible inputs**. Right now you may always have the same input type (fastq, for example), but later you may want your pipeline to be able to work from ``bam`` files. We've already written simple functions to handle single or multiple bam or fastq inputs; just use this infrastructure (in NGSTk) instead of writing your own, and you'll save yourself future headaches.
+* **Compartmentalize output into folders**. Keep pipeline steps separate by organizing output into subfolders.

-* **Use looper args**. Even if you're not using looper at first, use ``looper_args`` and your pipeline will be looper-ready when it comes time to run 500 samples.
+* **Use git for versioning**. If you develop your pipeline in a git repository, Pypiper will automatically record the commit you run, making it easy to figure out which code version produced a result.

-* **Compartmentalize with folders**. In your output, keep pipeline steps separate by organizing output into folders.
+* **Record stats as you go**. In other words, don't do all your stats (``report_result()``) and QC at the end; do it along the way (see the sketch after this list). This makes it easy for you to monitor pipeline performance, and couples stats with how far the pipeline makes it, so you can make use of a partially completed (or even ultimately failed) pipeline run.

-* **Use git for versioning**. If you develop your pipeline in a git repository, Pypiper will automatically record what commit you run, making it easy to figure out what code version you ran.
+* **Use looper args**. Even if you're not using looper at first, use ``looper_args`` and your pipeline will be looper-ready when it comes time to run 500 samples.

+* **Use NGSTk early on**. NGSTk has lots of useful functions that you will probably need, and we've worked hard to make them robust and universal. For example, using NGSTk you can easily make your pipeline take flexible input formats (fastq or bam). Right now you may always have the same input type, but later you may want your pipeline to work from ``bam`` files. We've already written simple functions to handle single or multiple bam or fastq inputs; use this infrastructure instead of writing your own, and you'll save yourself future headaches.

-* **Record stats as you go**. In other words, don't do all your stats and QC at the end, do it along the way. This makes it easy for you to monitor pipeline performance, and couples stats with how far the pipeline makes it, so you could make use of a partially completed (or even ultimately failed) pipelines.
+* **Put important parameters in the pipeline config instead of hardcoding them** (from the beginning). Pypiper makes it painfully easy to use a config file to make your pipeline configurable. Typically you'll start by hard-coding parameters into your pipeline steps, but you can select the important ones and make them customizable in the pipeline config. Start from the very beginning by making a ``yaml`` pipeline config file; see the example :ref:`pipeline config file <pipeline_config_files>`.
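As a sketch of the record-stats-as-you-go advice (the input file and stat key are hypothetical):

```python
import pypiper

pm = pypiper.PipelineManager(name="stats_demo", outfolder="out/")

target = "out/line_count.txt"
pm.run("wc -l < input.txt > " + target, target)

# Report the stat right after the step that produced it, so even a
# partially completed run leaves useful numbers behind.
with open(target) as f:
    pm.report_result("input_lines", int(f.read().strip()))

pm.stop_pipeline()
```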
16 changes: 12 additions & 4 deletions doc/source/commands.rst
@@ -1,8 +1,8 @@

-Commands
+Pypiper Functions
=========================

-The key things you need to know that the ``PipelineManager`` can do are:
+Pypiper is simple, but powerful. You only need three functions to get started; ``PipelineManager`` provides:

.. currentmodule:: pypiper.pypiper.PipelineManager
.. autosummary::
@@ -11,6 +11,14 @@ The key things you need to know that the ``PipelineManager`` can do are:
   stop_pipeline


-With that you can create a simple pipeline.
+With that you can create a simple pipeline. You can click on each function to view its in-depth documentation. There are quite a few optional parameters to the ``run`` function, which is where most of Pypiper's power comes from.

-Complete specifications for these and the more advanced functions can be found in the :doc:`API <api>`.
+When you've mastered the basics and are ready for more power, add in a few (optional) commands that make debugging and development easier:
+
+.. autosummary::
+   timestamp
+   clean_add
+   report_result
+   get_stat
+
+The complete documentation for these functions can be found in the :doc:`API <api>`.
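For illustration, a minimal sketch of that three-function skeleton (in recent versions ``start_pipeline`` is invoked automatically by the ``PipelineManager`` constructor, so it does not appear as an explicit call here):

```python
#!/usr/bin/env python
import pypiper

# Creating the manager starts the pipeline: output folder, log, and
# process tracking are set up for you.
pm = pypiper.PipelineManager(name="hello_pypiper", outfolder="hello_results/")

# run() pairs a shell command with the target file it produces; if the
# target already exists on a restart, the command is skipped.
target = "hello_results/output.txt"
pm.run("echo 'Hello, Pypiper!' > " + target, target)

pm.stop_pipeline()
```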
6 changes: 6 additions & 0 deletions doc/source/faq.rst
@@ -0,0 +1,6 @@
+
+FAQ
+=========================
+
+- **How can I run my pipeline on more than 1 sample?** Pypiper only handles individual-sample pipelines. To run it on multiple samples, you need `Looper <http://looper.readthedocs.io/>`_. Dividing multi-sample handling from individual-sample handling is a conceptual advantage that lets us write a nice, universal, generic sample-handler that you only have to learn once.
+- **What cluster resources can Pypiper use?** Pypiper is compute-agnostic: you run it wherever you want. If you want a nice way to submit pipelines for your samples to any cluster manager, check out `Looper <http://looper.readthedocs.io/>`_.
25 changes: 25 additions & 0 deletions doc/source/functions.rst
@@ -0,0 +1,25 @@
+
+Pypiper Functions
+=========================
+
+Pypiper is simple, but powerful. You only need three functions to get started; ``PipelineManager`` provides:
+
+.. currentmodule:: pypiper.pypiper.PipelineManager
+.. autosummary::
+   start_pipeline
+   run
+   stop_pipeline
+
+
+With that you can create a simple pipeline. You can click on each function to view its in-depth documentation. There are quite a few optional parameters to the ``run`` function, which is where most of Pypiper's power comes from.
+
+When you've mastered the basics and are ready for more power, add in a few (optional) commands that make debugging and development easier:
+
+.. autosummary::
+   clean_add
+   report_result
+   get_stat
+   timestamp
+
+
+The complete documentation for these functions can be found in the :doc:`API <api>`.
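And a sketch layering the optional functions above onto a pipeline (file names and the reported value are hypothetical):

```python
import pypiper

pm = pypiper.PipelineManager(name="advanced_demo", outfolder="out/")

pm.timestamp("### Decompress reads")  # mark a labeled checkpoint in the log
pm.run("gzip -cd out/reads.fq.gz > out/reads.fq", "out/reads.fq")

pm.report_result("reads_files", 1)  # record a key/value stat in the stats file
pm.clean_add("out/reads.fq")        # flag the intermediate file for cleanup
print(pm.get_stat("reads_files"))   # read back a previously reported stat

pm.stop_pipeline()
```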
4 changes: 3 additions & 1 deletion doc/source/index.rst
@@ -28,14 +28,16 @@ Contents
   features.rst
   tutorials.rst
   outputs.rst
-   commands.rst
+   functions.rst
   advanced.rst
   best-practices.rst
   ngstk.rst
   api.rst
+   faq.rst
   support.rst



Indices and tables
==================

2 changes: 1 addition & 1 deletion doc/source/tutorials-basic.rst
@@ -1,7 +1,7 @@
Basic tutorial
*****************

-Now, download `basic.py <https://github.com/epigen/pypiper/blob/master/example_pipelines/basic.py>`_ and run it with `python basic.py` (or, better yet, make it executable (`chmod 755 basic.py`) and then run it directly with `./basic.py`)
+Now, download `basic.py <https://github.com/epigen/pypiper/blob/master/example_pipelines/basic.py>`_ and run it with `python basic.py` (or, better yet, make it executable with `chmod 755 basic.py` and then run it directly with `./basic.py`). This example is a documented vignette; just read it and run it to get an idea of how things work.

.. literalinclude:: ../../example_pipelines/basic.py

2 changes: 1 addition & 1 deletion doc/source/tutorials-first.rst
@@ -33,4 +33,4 @@ Now string together whatever commands your pipeline requires! At the end, termin
That's it! By running commands through ``run()`` instead of directly in bash, you get a robust, logged, restartable pipeline manager for free!

-To see an example of a simple pipline, look in the `example_pipelines` folder in this respository (also listed here under tutorials), which are thoroughly commented to act as vignettes. This is the best way to learn how to use Pyipiper.
+Go to the next page (:doc:`basic tutorial <tutorials-basic>`) to see a more complicated example.
11 changes: 10 additions & 1 deletion pypiper/AttributeDict.py
@@ -1,4 +1,6 @@

+import os
+
class AttributeDict(object):
"""
A class to convert a nested Dictionary into an object with key-values
@@ -22,7 +24,14 @@ def add_entries(self, entries, default=False):
            if type(value) is dict:
                self.__dict__[key] = AttributeDict(value, default)
            else:
-                self.__dict__[key] = value
+                try:
+                    # try expandvars() to allow the yaml to use
+                    # shell variables.
+                    self.__dict__[key] = os.path.expandvars(value)
+                except TypeError:
+                    # if value is an int, expandvars() will fail; if
+                    # expandvars() fails, just use the raw value
+                    self.__dict__[key] = value

    def __getitem__(self, key):
        """
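A small sketch of the new behavior (assuming ``AttributeDict`` is imported directly from this module):

```python
import os
from pypiper.AttributeDict import AttributeDict

os.environ["DATA"] = "/scratch/data"

config = AttributeDict({
    "genome_dir": "$DATA/genomes",  # string: the shell variable is expanded
    "threads": 8,                   # int: expandvars() raises TypeError, raw value kept
})

print(config.genome_dir)  # /scratch/data/genomes
print(config.threads)     # 8
```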
