Merge branch 'master' of github.com:epigen/pypiper
Charles Dietz committed Jul 6, 2016
2 parents a4616f8 + 7f7e122 commit 0e5e538
Showing 10 changed files with 68 additions and 42 deletions.
7 changes: 7 additions & 0 deletions doc/source/advanced.rst
@@ -1,6 +1,13 @@
Advanced
=========================


Toolkits
*************

Pypiper includes optional "toolkits" (right now just one) -- suites of commonly used code snippets that simplify tasks for a pipeline author. For example, the next-gen sequencing toolkit, NGSTk, provides convenient helper functions to create common shell commands, like converting between file formats (*e.g.* ``bam_to_fastq()``), merging files (*e.g.* ``merge_bams()``), or counting reads. These make it faster to design bioinformatics pipelines in Pypiper, but are entirely optional. Contributions of additional toolkits, or of functions to an existing toolkit, are welcome.


The ``follow`` argument
*************
*Follow functions* let you couple a python function to a ``run()`` command; the function is executed only if its command is run. This lets you avoid repeating unnecessary processing when you restart your pipeline multiple times (for instance, while debugging later steps in the pipeline).
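The coupling described above can be sketched in plain python. This is a toy re-implementation of the semantics only, not pypiper's actual code; the names ``run``, ``count_lines``, and the file names are invented for illustration:

```python
import os
import subprocess

def run(cmd, target=None, follow=None):
    # Toy model of the semantics described above (not pypiper's code):
    # skip the command if its target already exists; when the command
    # does run, invoke the coupled follow function afterwards.
    if target is not None and os.path.exists(target):
        print("Target exists, skipping:", cmd)
        return
    subprocess.check_call(cmd, shell=True)
    if follow is not None:
        follow()  # fires only because the command actually ran

def count_lines():
    # A typical follow function: compute QC on the freshly created file.
    with open("numbers.txt") as f:
        print(sum(1 for _ in f), "lines")

run("seq 1 100 > numbers.txt", target="numbers.txt", follow=count_lines)
# On a restart the target exists, so neither the command nor the
# follow function runs again:
run("seq 1 100 > numbers.txt", target="numbers.txt", follow=count_lines)
```

On the first call the command runs and ``count_lines`` reports the line count; on the second call both are skipped, which is how follow functions avoid repeating QC work across pipeline restarts.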
2 changes: 1 addition & 1 deletion doc/source/features.rst
@@ -1,5 +1,5 @@

Features
Features at-a-glance
=========================
Pypiper provides the following benefits:

8 changes: 2 additions & 6 deletions doc/source/index.rst
@@ -2,12 +2,7 @@
Welcome
^^^^^^^^

Making robust pipelines just got easier. Pypiper helps you take your current pipeline and make it better with minimal effort on your part.

Pypiper is a lightweight python toolkit for gluing together restartable, robust
command line pipelines. With Pypiper, simplicity is paramount. A new user can start building pipelines using Pypiper in under 15 minutes. Learning all the :doc:`features and benefits <features>` of Pypiper takes just an hour or two. At the same time, Pypiper provides immediate and significant advantages over a simple shell script.

To get started, proceed with the :doc:`Introduction <intro>` or use the table of contents below to navigate the docs. When you have a feel for what Pypiper is, the best way to learn is by looking at the :doc:`Tutorials <tutorials>`. That should get you started building your first pipeline. But before you get too deep, make sure you get the most out of Pypiper by checking out the :doc:`Advanced Functions <advanced>` and :doc:`Best Practices <best-practices>` for building pipelines.
Making robust pipelines just got easier. Pypiper helps you take your current pipeline and make it better with minimal effort on your part. Pypiper is a lightweight toolkit for writing slick pipelines in python. You'll be running in minutes. Interested? Proceed to the :doc:`Introduction <intro>` or jump straight to the :doc:`Tutorials <tutorials>`.

Links
^^^^^^^^
@@ -38,6 +33,7 @@ Contents
best-practices.rst
ngstk.rst
api.rst
support.rst


Indices and tables
30 changes: 13 additions & 17 deletions doc/source/intro.rst
@@ -5,9 +5,18 @@ Introduction
Overview
*************

The target user of Pypiper is a computational scientist comfortable on the command line, who has something like a bash script that would benefit from a layer of "handling code". Pypiper helps you convert that set of shell commands into a production-scale workflow, automatically handling the annoying details to make your pipeline robust and restartable, with minimal learning curve.
Pypiper is a lightweight python toolkit for gluing together restartable, robust
command line pipelines. With Pypiper, simplicity is paramount. A new user can start building pipelines using Pypiper in under 15 minutes. Learning all the :doc:`features and benefits <features>` of Pypiper takes just an hour or two. At the same time, Pypiper provides immediate and significant advantages over a simple shell script.

The target user of Pypiper is a computational scientist comfortable on the command line, who has something like a bash script that would benefit from a layer of "handling code". Pypiper helps you convert that set of shell commands into a production-scale workflow, automatically handling the annoying details (restartability, file integrity, logging) to make your pipeline robust and restartable, with minimal learning curve.

Pypiper does not handle any sort of cluster job submission, resource requesting, or parallel dependency management (other than node-threaded parallelism inherent in your commands) -- we use `Looper <http://looper.readthedocs.io/>`_ for that (but you can use whatever you want). Pypiper just handles a one-sample, sequential pipeline, but now it's robust, restartable, and logged. When coupled with `Looper <http://looper.readthedocs.io/>`_ you get a complete pipeline management system.

Power in simplicity
*********************

Pypiper does not include a new language to learn. You **write your pipeline in python**. Pypiper does not assume you want a complex dependency structure. You write a simple **ordered sequence of commands**, just like a shell script. Pypiper tries to exploit the `Pareto principle <https://en.wikipedia.org/wiki/Pareto_principle>`_ -- you'll get 80% of the features with only 20% of the work of other pipeline management systems.

Pypiper does not handle any sort of cluster job submission, resource requesting, or parallel dependency management (other than node-threaded parallelism inherent in your commands). You can use your current setup for those things, and use Pypiper just to produce a robust, restartable, and logged procedural pipeline.

Installing
*************
@@ -26,16 +35,11 @@ Update with:
pip install --user --upgrade https://github.com/epigen/pypiper/zipball/master
Toolkits
*************

Pypiper includes optional "toolkits" -- suites of commonly used code snippets that simplify tasks for a pipeline author. For example, the next-gen sequencing toolkit, NGSTk, provides convenient helper functions to create common shell commands, like converting between file formats (*e.g.* ``bam_to_fastq()``), merging files (*e.g.* ``merge_bams()``), or counting reads. These make it faster to design bioinformatics pipelines in Pypiper, but are entirely optional. Contributions of additional toolkits, or of functions to an existing toolkit, are welcome.

Motivation
*************

As I began to put together production-scale pipelines, I found a lot of relevant pipelining systems, but was universally disappointed. For my needs, they were all overly complex. I wanted something simple enough to quickly write and maintain a pipeline without having to learn a lot of new functions and conventions, but robust enough to handle requirements like restartability and memory usage monitoring. Everything related was either a pre-packaged pipeline for a defined purpose, or a heavy-duty development environment that was overkill for a simple pipeline. Both of these seemed to be targeted toward less experienced developers who sought structure, and neither fit my needs: I had a set of commands already in mind; I just needed a wrapper that could take that code and make it automatically restartable, logged, robust to crashing, easy to debug, and so forth.
As I began to put together production-scale pipelines, I found a lot of relevant pipelining systems, but was universally disappointed. For my needs, they were all overly complex. I wanted something simple enough to **quickly write and maintain** a pipeline without having to learn a lot of new functions and conventions, but robust enough to handle requirements like restartability and memory usage monitoring. Everything related was either a pre-packaged pipeline for a defined purpose, or a heavy-duty development environment that was overkill for a simple pipeline. Both of these seemed to be targeted toward ultra-efficient uses, and neither fit my needs: I had a set of commands already in mind -- I just needed a wrapper that could take that code and make it automatically restartable, logged, robust to crashing, easy to debug, and so forth.

If you need a full-blown, datacenter-scale environment that can do everything, look elsewhere. Pypiper's strength is its simplicity. If all you want is a shell-like script, but now with the power of python, and restartability, then Pypiper is for you.

@@ -47,13 +51,5 @@ You can also generate docs locally using `sphinx <http://www.sphinx-doc.org/en/s

Testing
*************
You can test pypiper by running ``python test_pypiper.py``, which has some unit tests.

License
*************
Pypiper ___ licensed source code is available at http://github.com/epigen/pypiper/ .

Contributing
*************
We welcome contributions in the form of pull requests; or, if you find a bug or want to request a feature, open an issue in https://github.com/epigen/pypiper/issues.
You can test pypiper by cloning it and running the included unit tests: ``python test_pypiper.py``.

15 changes: 11 additions & 4 deletions doc/source/ngstk.rst
@@ -8,11 +8,18 @@ Example:

.. code-block:: python
from pypiper.ngstk import NGSTk
import pypiper
pm = pypiper.PipelineManager(..., args = args)
tk = NGSTk()
tk.index_bam("sample.bam")
# Create a ngstk object (pass the PipelineManager as an argument)
ngstk = pypiper.NGSTk(pm = pm)
# Now you can use ngstk functions
ngstk.index_bam("sample.bam")
A list of available functions can be found in the :doc:`API <api>` or in the source code for `NGSTk`_.

Contributions of additional toolkits or functions in an existing toolkit are welcome.

.. _NGSTk: https://github.com/epigen/pypiper/pypiper/ngstk.py
.. _NGSTk: https://github.com/epigen/pypiper/blob/master/pypiper/ngstk.py
11 changes: 11 additions & 0 deletions doc/source/support.rst
@@ -0,0 +1,11 @@

Support
=========================

Bug Reports
*************
If you find a bug or want to request a feature, open an issue at https://github.com/epigen/pypiper/issues.

Contributing
*************
We welcome contributions in the form of pull requests.
2 changes: 2 additions & 0 deletions doc/source/tutorials-basic.rst
@@ -1,6 +1,8 @@
Basic tutorial
*****************

Now, download `basic.py <https://github.com/epigen/pypiper/blob/master/example_pipelines/basic.py>`_ and run it with ``python basic.py`` (or, better yet, make it executable with ``chmod 755 basic.py`` and run it directly with ``./basic.py``).

.. literalinclude:: ../../example_pipelines/basic.py


13 changes: 7 additions & 6 deletions doc/source/tutorials-first.rst
@@ -2,13 +2,14 @@
Your first pipeline
***************************

Using pypiper is simple. First, import pypiper, specify an output folder, and create a new ``PipelineManager`` object:
Using pypiper is simple. Your pipeline is a python script, say `pipeline.py`. First, import pypiper, specify an output folder, and create a new ``PipelineManager`` object:

.. code-block:: python
#!/usr/bin/env python
import pypiper, os
outfolder = "pipeline_output/" # Choose a folder for your results
pipeline = pypiper.PipelineManager(name="my_pipeline", outfolder=outfolder)
pm = pypiper.PipelineManager(name="my_pipeline", outfolder=outfolder)
This creates your ``outfolder`` and places a flag called ``my_pipeline_running.flag`` in the folder. It also initializes the log file (``my_pipeline_log.md``) with statistics such as time of starting, compute node, software versions, command-line parameters, etc.

@@ -18,17 +19,17 @@ Now, the workhorse of ``PipelineManager`` is the ``run()`` function. Essentially
# our command will produce this output file
target = os.path.join(outfolder, "outfile.txt")
cmd = "shuf -i 1-500000000 -n 10000000 > " + target
pipeline.run(command, target)
command = "shuf -i 1-500000000 -n 10000000 > " + target
pm.run(command, target)
The command (``cmd``) is the only required argument to ``run()``. You can leave ``target`` empty (pass ``None``). If you **do** specify a target, the command will only be run if the target file does not already exist. If you **do not** specify a target, the command will be run every time the pipeline is run.
The command (``command``) is the only required argument to ``run()``. You can leave ``target`` empty (pass ``None``). If you **do** specify a target, the command will only be run if the target file does not already exist. If you **do not** specify a target, the command will be run every time the pipeline is run.
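The target-gating contract just described can be sketched in a few lines. This is a toy model for illustration only (``run_sketch`` is an invented name; pypiper's real ``run()`` also handles locking, logging, and memory monitoring):

```python
import os
import subprocess

def run_sketch(cmd, target=None):
    # Toy model of the contract described above (not pypiper's code):
    # with a target, the command is skipped once that file exists;
    # with target=None, the command runs on every pipeline invocation.
    if target is not None and os.path.exists(target):
        return "skipped"
    subprocess.check_call(cmd, shell=True)
    return "ran"

print(run_sketch("touch outfile.txt", target="outfile.txt"))  # first run
print(run_sketch("touch outfile.txt", target="outfile.txt"))  # restart: skipped
print(run_sketch("echo no target"))  # no target: runs every time
```

This skip-if-target-exists rule is what makes a pipeline restartable: on a rerun, completed steps whose outputs survive are passed over, and execution resumes at the first step whose target is missing.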

Now string together whatever commands your pipeline requires! At the end, terminate the pipeline so it gets flagged as successfully completed:

.. code-block:: python
pipeline.stop_pipeline()
pm.stop_pipeline()
That's it! By running commands through ``run()`` instead of directly in bash, you get a robust, logged, restartable pipeline manager for free!

6 changes: 6 additions & 0 deletions doc/source/tutorials.rst
@@ -1,6 +1,10 @@


Tutorials
=========================

Start with the 'your first pipeline' tutorial to get a quick overview of a simple pipeline, then look through the other examples for more advanced concepts.

.. toctree::
:maxdepth: 2

@@ -11,3 +15,5 @@ Tutorials
* `example_pipelines/advanced.py` (under construction) - A tutorial demonstrating some more advanced features.
* `example_pipelines/bioinformatics.py` (under construction) - A tutorial showing some bioinformatics use cases.

Finally, check out some real-world pipelines in our `open_pipelines repository <https://github.com/epigen/open_pipelines>`_.

16 changes: 8 additions & 8 deletions example_pipelines/basic.py
@@ -12,7 +12,7 @@
# Create a PipelineManager instance (don't forget to name it!)
# This starts the pipeline.

mypiper = pypiper.PipelineManager(name="BASIC",
pm = pypiper.PipelineManager(name="BASIC",
outfolder="pipeline_output/")

# Now just build shell command strings, and use the run function
@@ -28,33 +28,33 @@
cmd = "shuf -i 1-500000000 -n 10000000 > " + tgt

# and run with run().
mypiper.run(cmd, target=tgt)
pm.run(cmd, target=tgt)

# Now copy the data into a new file.
# first specify target file and build command:
tgt = "pipeline_output/copied.out"
cmd = "cp pipeline_output/test.out " + tgt
mypiper.run(cmd, target=tgt)
pm.run(cmd, target=tgt)

# You can also string multiple commands together, which will execute
# in order as a group to create the final target.
cmd1 = "sleep 5"
cmd2 = "touch pipeline_output/touched.out"
mypiper.run([cmd1, cmd2], target="pipeline_output/touched.out")
pm.run([cmd1, cmd2], target="pipeline_output/touched.out")

# A command without a target will run every time.
# Find the biggest line
cmd = "awk 'n < $0 {n=$0} END{print n}' pipeline_output/test.out"
mypiper.run(cmd, "lock.max")
pm.run(cmd, "lock.max")

# Use checkprint() to get the results of a command, and then use
# report_result() to print and log key-value pairs in the stats file:
last_entry = mypiper.checkprint("tail -n 1 pipeline_output/copied.out")
mypiper.report_result("last_entry", last_entry)
last_entry = pm.checkprint("tail -n 1 pipeline_output/copied.out")
pm.report_result("last_entry", last_entry)


# Now, stop the pipeline to complete gracefully.
mypiper.stop_pipeline()
pm.stop_pipeline()

# Observe your outputs in the pipeline_output folder
# to see what you've created.
