Merge pull request #24 from epigen/dev
Dev
nsheff committed Jan 23, 2017
2 parents a26a02c + 4c79bd8 commit 418776b
Showing 12 changed files with 453 additions and 250 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -2,7 +2,7 @@

[![Documentation Status](https://readthedocs.org/projects/pypiper/badge/?version=latest)](http://pypiper.readthedocs.org/en/latest/?badge=latest)

-A lightweight python toolkit for gluing together restartable, robust command line pipelines.
+A lightweight python toolkit for gluing together restartable, robust command line pipelines. Pypiper works with [Looper](http://github.com/epigen/looper) to form a complete bioinformatics pipeline framework.

# Links

6 changes: 4 additions & 2 deletions doc/source/advanced.rst
@@ -10,9 +10,9 @@ Pypiper includes optional "toolkits" (right now just one) -- suites of commonly

The ``follow`` argument
***********************
-*Follow functions* let you couple functions to run commands; the functions will only be run if the command is run. This lets you avoid running unnecessary processes repeatedly in the event that you restart your pipeline multiple times (for instance, while debugging later steps in the pipeline).
+*Follow functions* let you couple follow-up functions to commands; a follow function is run if and only if its command is run. This lets you avoid running unnecessary processes repeatedly in the event that you restart your pipeline multiple times (for instance, while debugging later steps in the pipeline).

-One of the really useful things in pypiper is that you can pass a python function (we call a "follow function") along in your call to ``run``, and python will call this *follow function* if and only if it runs your command. This is useful for data checks to make sure processes did what you expect. A good example is file conversions; after you convert from one format into another, it's good practice to make sure the conversion completed and you have the same number of lines, or whatever, in the final file. Pypiper's command locking mechanism solves most of this problem, but there are still lots of times you'd like to run a function but only once, right after the command that produced it runs. For example, just counting the number of lines in a file after producing it. You don't need to re-run the counting every time you restart the pipeline, which will skip that step since it is complete.
+This is useful for data QC checks to make sure processes did what you expect. For example, you might count the number of lines in a file after producing it, or count the number of reads that aligned right after an alignment step. You want the counting process coupled to the alignment process, and you don't need to re-run the counting every time you restart the pipeline.
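For illustration, a minimal sketch of a follow function that QCs a file conversion (the file names, the ``samtools`` command, and the stat key are hypothetical; ``follow`` is the ``run`` argument described above):

```python
import pypiper

pm = pypiper.PipelineManager(name="convert_demo", outfolder="pipeline_output/")

target = "pipeline_output/sample.sam"

def count_lines():
    # QC check: record how many lines the conversion produced.
    with open(target) as f:
        pm.report_result("sam_lines", sum(1 for _ in f))

# count_lines runs if and only if the conversion command itself runs; on a
# restart where the target already exists, both are skipped together.
pm.run("samtools view -h pipeline_output/sample.bam > " + target,
       target, follow=count_lines)

pm.stop_pipeline()
```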



@@ -47,6 +47,8 @@ Additional stuff (to be explained more thoroughly soon): You can also add some ot

The most significant of these special keywords is the ``config_file`` argument, which leads us to the concept of ``pipeline config files``:

+.. _pipeline_config_files:
+
Pipeline config files
*********************
Optionally, you may choose to conform to our standard for parameterizing pipelines, which enables you to use some powerful features of the pypiper system.
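For illustration, a sketch of how a pipeline might read such a config (the keys are hypothetical; the yaml content is shown in comments because Pypiper loads it for you and exposes it on the manager as ``pm.config``):

```python
# Hypothetical contents of my_pipeline.yaml, kept next to the pipeline script:
#
#   tools:
#     aligner: bowtie2
#   parameters:
#     min_quality: 30

import pypiper

pm = pypiper.PipelineManager(name="my_pipeline", outfolder="pipeline_output/")

# The parsed yaml is exposed as a nested AttributeDict on the manager:
print(pm.config.tools.aligner)           # "bowtie2"
print(pm.config.parameters.min_quality)  # 30

pm.stop_pipeline()
```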
13 changes: 7 additions & 6 deletions doc/source/best-practices.rst
@@ -4,14 +4,15 @@ Best practices

Here are some guidelines for how you can design the most effective pipelines.

-* **Parameterize your pipeline** (from the beginning). Pypiper makes it painfully easy to use a config file to make your pipeline configurable. Start from the very beginning by making a ``yaml`` pipeline config file.

-* **Use flexible inputs**. Right now you may always have the same input type (fastq, for example), but later you may want your pipeline to be able to work from ``bam`` files. We've already written simple functions to handle single or multiple bam or fastq inputs; just use this infrastructure (in NGSTk) instead of writing your own, and you'll save yourself future headaches.
+* **Compartmentalize output into folders**. Keep pipeline steps separate by organizing output into subfolders.

-* **Use looper args**. Even if you're not using looper at first, use ``looper_args`` and your pipeline will be looper-ready when it comes time to run 500 samples.
+* **Use git for versioning**. If you develop your pipeline in a git repository, Pypiper will automatically record the commit you run, making it easy to figure out which code version produced a result.

-* **Compartmentalize with folders**. In your output, keep pipeline steps separate by organizing output into folders.
+* **Record stats as you go**. In other words, don't do all your stats (``report_result()``) and QC at the end; do it along the way (see the sketch after this list). This makes it easy for you to monitor pipeline performance, and couples stats with how far the pipeline makes it, so you can make use of a partially completed (or even ultimately failed) pipeline run.

-* **Use git for versioning**. If you develop your pipeline in a git repository, Pypiper will automatically record what commit you run, making it easy to figure out what code version you ran.
+* **Use looper args**. Even if you're not using looper at first, use ``looper_args`` and your pipeline will be looper-ready when it comes time to run 500 samples.

+* **Use NGSTk early on**. NGSTk has lots of useful functions that you will probably need, and we've worked hard to make them robust and universal. For example, using NGSTk you can easily make your pipeline take flexible input formats (fastq or bam). Right now you may always have the same input type, but later you may want your pipeline to work from ``bam`` files. We've already written simple functions to handle single or multiple bam or fastq inputs; use this infrastructure instead of writing your own, and you'll save yourself future headaches.

-* **Record stats as you go**. In other words, don't do all your stats and QC at the end, do it along the way. This makes it easy for you to monitor pipeline performance, and couples stats with how far the pipeline makes it, so you could make use of a partially completed (or even ultimately failed) pipelines.
+* **Put important parameters in the pipeline config instead of hardcoding them** (from the beginning). Pypiper makes it painfully easy to use a config file to make your pipeline configurable. Typically you'll start by hard-coding parameters into your pipeline steps, but you can select the important ones and make them customizable in the pipeline config. Start from the very beginning by making a ``yaml`` pipeline config file; see the example :ref:`pipeline config file <pipeline_config_files>`.
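As a sketch of the record-stats-as-you-go advice (the input file and stat key are hypothetical):

```python
import pypiper

pm = pypiper.PipelineManager(name="stats_demo", outfolder="out/")

target = "out/line_count.txt"
pm.run("wc -l < input.txt > " + target, target)

# Report the stat right after the step that produced it, so even a
# partially completed run leaves useful numbers behind.
with open(target) as f:
    pm.report_result("input_lines", int(f.read().strip()))

pm.stop_pipeline()
```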
16 changes: 12 additions & 4 deletions doc/source/commands.rst
@@ -1,8 +1,8 @@

-Commands
+Pypiper Functions
=========================

-The key things you need to know that the ``PipelineManager`` can do are:
+Pypiper is simple, but powerful. You only need three functions to get started; ``PipelineManager`` provides:

.. currentmodule:: pypiper.pypiper.PipelineManager
.. autosummary::
@@ -11,6 +11,14 @@ The key things you need to know that the ``PipelineManager`` can do are:
   stop_pipeline


-With that you can create a simple pipeline.
+With that you can create a simple pipeline. You can click on each function to view its in-depth documentation. There are quite a few optional parameters to the ``run`` function, which is where most of Pypiper's power comes from.

-Complete specifications for these and the more advanced functions can be found in the :doc:`API <api>`.
+When you've mastered the basics and are ready for more power, add in a few (optional) commands that make debugging and development easier:
+
+.. autosummary::
+   timestamp
+   clean_add
+   report_result
+   get_stat
+
+The complete documentation for these functions can be found in the :doc:`API <api>`.
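For illustration, a minimal sketch of that three-function skeleton (in recent versions ``start_pipeline`` is invoked automatically by the ``PipelineManager`` constructor, so it does not appear as an explicit call here):

```python
#!/usr/bin/env python
import pypiper

# Creating the manager starts the pipeline: output folder, log, and
# process tracking are set up for you.
pm = pypiper.PipelineManager(name="hello_pypiper", outfolder="hello_results/")

# run() pairs a shell command with the target file it produces; if the
# target already exists on a restart, the command is skipped.
target = "hello_results/output.txt"
pm.run("echo 'Hello, Pypiper!' > " + target, target)

pm.stop_pipeline()
```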
6 changes: 6 additions & 0 deletions doc/source/faq.rst
@@ -0,0 +1,6 @@
+
+FAQ
+=========================
+
+- **How can I run my pipeline on more than 1 sample?** Pypiper only handles individual-sample pipelines. To run it on multiple samples, you need `Looper <http://looper.readthedocs.io/>`_. Dividing multi-sample handling from individual-sample handling is a conceptual advantage that lets us write a nice, universal, generic sample-handler that you only have to learn once.
+- **What cluster resources can Pypiper use?** Pypiper is compute-agnostic: you run it wherever you want. If you want a nice way to submit pipelines for your samples to any cluster manager, check out `Looper <http://looper.readthedocs.io/>`_.
25 changes: 25 additions & 0 deletions doc/source/functions.rst
@@ -0,0 +1,25 @@
+
+Pypiper Functions
+=========================
+
+Pypiper is simple, but powerful. You only need three functions to get started; ``PipelineManager`` provides:
+
+.. currentmodule:: pypiper.pypiper.PipelineManager
+.. autosummary::
+   start_pipeline
+   run
+   stop_pipeline
+
+
+With that you can create a simple pipeline. You can click on each function to view its in-depth documentation. There are quite a few optional parameters to the ``run`` function, which is where most of Pypiper's power comes from.
+
+When you've mastered the basics and are ready for more power, add in a few (optional) commands that make debugging and development easier:
+
+.. autosummary::
+   clean_add
+   report_result
+   get_stat
+   timestamp
+
+
+The complete documentation for these functions can be found in the :doc:`API <api>`.
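And a sketch layering the optional functions above onto a pipeline (file names and the reported value are hypothetical):

```python
import pypiper

pm = pypiper.PipelineManager(name="advanced_demo", outfolder="out/")

pm.timestamp("### Decompress reads")  # mark a labeled checkpoint in the log
pm.run("gzip -cd out/reads.fq.gz > out/reads.fq", "out/reads.fq")

pm.report_result("reads_files", 1)  # record a key/value stat in the stats file
pm.clean_add("out/reads.fq")        # flag the intermediate file for cleanup
print(pm.get_stat("reads_files"))   # read back a previously reported stat

pm.stop_pipeline()
```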
4 changes: 3 additions & 1 deletion doc/source/index.rst
@@ -28,14 +28,16 @@ Contents
   features.rst
   tutorials.rst
   outputs.rst
-   commands.rst
+   functions.rst
   advanced.rst
   best-practices.rst
   ngstk.rst
   api.rst
+   faq.rst
   support.rst



Indices and tables
==================

2 changes: 1 addition & 1 deletion doc/source/tutorials-basic.rst
@@ -1,7 +1,7 @@
Basic tutorial
*****************

-Now, download `basic.py <https://github.com/epigen/pypiper/blob/master/example_pipelines/basic.py>`_ and run it with `python basic.py` (or, better yet, make it executable (`chmod 755 basic.py`) and then run it directly with `./basic.py`)
+Now, download `basic.py <https://github.com/epigen/pypiper/blob/master/example_pipelines/basic.py>`_ and run it with `python basic.py` (or, better yet, make it executable with `chmod 755 basic.py` and then run it directly with `./basic.py`). This example is a documented vignette; just read it and run it to get an idea of how things work.

.. literalinclude:: ../../example_pipelines/basic.py

2 changes: 1 addition & 1 deletion doc/source/tutorials-first.rst
@@ -33,4 +33,4 @@ Now string together whatever commands your pipeline requires! At the end, termin
That's it! By running commands through ``run()`` instead of directly in bash, you get a robust, logged, restartable pipeline manager for free!

-To see an example of a simple pipline, look in the `example_pipelines` folder in this respository (also listed here under tutorials), which are thoroughly commented to act as vignettes. This is the best way to learn how to use Pyipiper.
+Go to the next page (:doc:`basic tutorial <tutorials-basic>`) to see a more complicated example.
11 changes: 10 additions & 1 deletion pypiper/AttributeDict.py
@@ -1,4 +1,6 @@

+import os
+
class AttributeDict(object):
"""
A class to convert a nested Dictionary into an object with key-values
@@ -22,7 +24,14 @@ def add_entries(self, entries, default=False):
            if type(value) is dict:
                self.__dict__[key] = AttributeDict(value, default)
            else:
-                self.__dict__[key] = value
+                try:
+                    # try expandvars() to allow the yaml to use
+                    # shell variables.
+                    self.__dict__[key] = os.path.expandvars(value)
+                except TypeError:
+                    # if value is an int, expandvars() will fail; if
+                    # expandvars() fails, just use the raw value
+                    self.__dict__[key] = value

    def __getitem__(self, key):
        """
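A small sketch of the new behavior (assuming ``AttributeDict`` is imported directly from this module):

```python
import os
from pypiper.AttributeDict import AttributeDict

os.environ["DATA"] = "/scratch/data"

config = AttributeDict({
    "genome_dir": "$DATA/genomes",  # string: the shell variable is expanded
    "threads": 8,                   # int: expandvars() raises TypeError, raw value kept
})

print(config.genome_dir)  # /scratch/data/genomes
print(config.threads)     # 8
```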
