Restructure docs, add hello, world! example
nsheff committed Mar 2, 2017
1 parent fa2c12d commit 134b3cb
Showing 10 changed files with 168 additions and 62 deletions.
2 changes: 1 addition & 1 deletion doc/source/advanced.rst
@@ -8,7 +8,7 @@ Toolkits
Pypiper includes optional "toolkits" (right now just one) -- suites of commonly used code snippets that simplify tasks for a pipeline author. For example, the next-gen sequencing toolkit, NGSTk, provides convenient helper functions to build common shell commands, like converting between file formats (*e.g.* ``bam_to_fastq()``), merging files (*e.g.* ``merge_bams()``), or counting reads. These make it faster to design bioinformatics pipelines in Pypiper, but are entirely optional. Contributions of additional toolkits, or of functions to an existing toolkit, are welcome.
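
For example, here is a minimal sketch of building a command with NGSTk (the constructor and helper signatures shown are assumptions; check the API docs for the exact interface):

.. code-block:: python

   import pypiper

   pm = pypiper.PipelineManager(name="ngs_example", outfolder="results")
   ngstk = pypiper.NGSTk(pm=pm)  # assumed constructor; it may also accept a config file

   # Toolkit helpers return shell command strings, which you pass to run():
   cmd = ngstk.merge_bams(["a.bam", "b.bam"], "merged.bam")  # assumed signature
   pm.run(cmd, target="merged.bam")
   pm.stop_pipeline()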


The follow argument
*******************
*Follow functions* let you couple follow-up functions to ``run()`` commands; a follow function runs if and only if its command runs. This lets you avoid repeatedly running unnecessary processes when you restart your pipeline multiple times (for instance, while debugging later steps in the pipeline).
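
A minimal sketch of coupling a follow function to a command (the file names and stat key here are illustrative):

.. code-block:: python

   import os
   import pypiper

   pm = pypiper.PipelineManager(name="follow_demo", outfolder="results")

   target = "results/output.txt"
   cmd = "echo 'data' > " + target

   def validate():
       # Runs if and only if cmd itself runs, so the stat isn't
       # re-reported needlessly when the pipeline restarts.
       pm.report_result("target_size", os.path.getsize(target))

   # Couple the follow-up function to the command via the follow argument:
   pm.run(cmd, target, follow=validate)
   pm.stop_pipeline()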

7 changes: 5 additions & 2 deletions doc/source/faq.rst
@@ -2,5 +2,8 @@
FAQ
=========================

- **How can I run my pipeline on more than 1 sample?**
  Pypiper only handles individual-sample pipelines. To run it on multiple samples, write a loop, or use `Looper <http://looper.readthedocs.io/>`_. Separating multi-sample handling from individual-sample handling is a conceptual advantage that lets us write a nice, universal, generic sample-handler that you only have to learn once.

- **What cluster resources can pypiper use?**
  Pypiper is compute-agnostic. You run it wherever you want; if you want a nice way to submit pipelines for samples to any cluster manager, check out `Looper <http://looper.readthedocs.io/>`_.
24 changes: 16 additions & 8 deletions doc/source/features.rst
@@ -5,14 +5,22 @@ Pypiper provides the following benefits:

.. image:: _static/pypiper.svg

- **Restartability:**
  Commands check for their targets and only run if the target needs to be created, much like a `makefile`, so the pipeline picks up where it left off if it needs to be restarted or extended (see the sketch after this list).
- **Pipeline integrity protection:**
  Pypiper uses file locking to ensure that multiple pipeline runs will not interfere with one another -- even if the steps are identical and produce the same files. One run will seamlessly wait for the other, making it possible to share steps across pipelines.
- **Memory use monitoring:**
  Processes are polled for high water mark memory use, allowing you to more accurately gauge your future memory requirements.
- **Easy job status monitoring:**
  Pypiper uses status flag files to make it possible to assess the current state (`running`, `failed`, or `completed`) of hundreds of jobs simultaneously.
- **Robust error handling:**
  Pypiper closes pipelines gracefully on interrupt or termination signals, converting the status to `failed`. By default, a process that returns a nonzero value halts the pipeline, unlike in bash, where by default the pipeline would continue using an incomplete or failed result. This behavior can be overridden as desired with a single parameter.
- **Logging:**
  Pypiper automatically records the output of your pipeline and its subprocesses, provides copious information on pipeline initiation, and makes timestamping easy.
- **Easy result reports:**
  Pypiper provides functions to put key-value pairs into an easy-to-parse stats file, making it easy to summarize your pipeline results.
- **Simplicity:**
  It should only take you 15 minutes to run your first pipeline. The basic documentation is just a few pages long. The codebase itself is only a few thousand lines of code, making it very lightweight.
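
For instance, target-checking is what makes restarts cheap. A minimal sketch (assuming a ``PipelineManager`` named ``pm``, as in the hello-world example):

.. code-block:: python

   # If counts.txt already exists, run() skips the command entirely,
   # so a restarted pipeline picks up where it left off.
   pm.run("wc -l input.txt > counts.txt", target="counts.txt")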


Furthermore, Pypiper includes suites of commonly used code (toolkits) that you can use to build pipelines.
4 changes: 2 additions & 2 deletions doc/source/functions.rst
@@ -16,10 +16,10 @@ With that you can create a simple pipeline. You can click on each function to view
When you've mastered the basics and are ready to get more powerful, add in a few new (optional) commands that make debugging and development easier:

.. autosummary::
   report_result
   clean_add
   get_stat
   timestamp
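
Here is a hedged sketch of these helpers in context (assuming a ``PipelineManager`` named ``pm``; the keys and values are illustrative):

.. code-block:: python

   pm.timestamp("Aligning reads")            # mark a pipeline stage in the log
   pm.report_result("alignment_rate", 0.97)  # save a key-value pair to the stats file
   pm.clean_add("results/*.tmp")             # register intermediate files for cleanup
   rate = pm.get_stat("alignment_rate")      # retrieve a previously reported stat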


The complete documentation for these functions can be found in the :doc:`API <api>`.
86 changes: 86 additions & 0 deletions doc/source/hello-world.rst
@@ -0,0 +1,86 @@
Installing and Hello World
==============================

Release versions are posted on the GitHub `releases page <https://github.com/epigen/pypiper/releases>`_. You can install the latest version directly from GitHub using pip:

.. code-block:: bash

   pip install --user https://github.com/epigen/pypiper/zipball/master

Update with:

.. code-block:: bash

   pip install --user --upgrade https://github.com/epigen/pypiper/zipball/master

Now, to test pypiper, follow the commands in the "Hello, Pypiper!" tutorial: just run these 3 lines of code and you're running your first pypiper pipeline!

.. code:: bash

   # Install the latest version of pypiper:
   pip install --user https://github.com/epigen/pypiper/zipball/master

   # Download hello_pypiper.py:
   wget https://raw.githubusercontent.com/databio/pypiper/master/example_pipelines/hello_pypiper.py

   # Run it:
   python hello_pypiper.py


You should see output like this printed to your screen:

.. code::

----------------------------------------
##### [Pipeline run code and environment:]
* Command: `hello_pypiper.py`
* Compute host: puma
* Working dir: /home/nsheff/code/pypiper/example_pipelines
* Outfolder: hello_pypiper_results/
* Pipeline started at: (03-02 14:58:20) elapsed:0:00:00 _TIME_

##### [Version log:]
* Python version: 2.7.12
* Pypiper dir: `/home/nsheff/.local/lib/python2.7/site-packages/pypiper`
* Pipeline dir: `/home/nsheff/code/pypiper/example_pipelines`
* Pipeline version: fa2c12d8fba0a9b50a678e7682f39a7a756e0ead
* Pipeline branch: * dev
* Pipeline date: 2017-01-27 14:05:15 -0500

##### [Arguments passed to pipeline:]

----------------------------------------


Change status from initializing to running
Hello! (03-02 14:58:20) elapsed:0:00:00 _TIME_

Target to produce: `hello_pypiper_results/output.txt`
> `echo 'Hello, Pypiper!' > hello_pypiper_results/output.txt`

<pre>
</pre>
Process 22693 returned: (0). Elapsed: 0:00:00.

Change status from running to completed
> `Time` 0:00:00 hello_pypiper _RES_
> `Success` 03-02 14:58:20 hello_pypiper _RES_

##### [Epilogue:]
* Total elapsed time: 0:00:00
* Peak memory used: 0.0 GB
* Pipeline completed at: (03-02 14:58:20) elapsed:0:00:00 _TIME_

Pypiper terminating spawned child process 22679

Now observe your results in the folder ``hello_pypiper_results``:

* hello_pypiper_commands.sh
* hello_pypiper_completed.flag
* hello_pypiper_log.md
* hello_pypiper_profile.tsv
* output.txt
* stats.tsv

These files are explained in more detail in the :doc:`Outputs <outputs>` section.
22 changes: 20 additions & 2 deletions doc/source/index.rst
@@ -22,16 +22,34 @@ Contents
^^^^^^^^

.. toctree::
   :caption: Getting Started
   :maxdepth: 1

   intro.rst
   hello-world.rst
   usage.rst
   features.rst
   outputs.rst

.. toctree::
   :caption: Developing Pipelines
   :maxdepth: 2

   tutorials.rst
   functions.rst
   advanced.rst
   best-practices.rst

.. toctree::
   :caption: Toolkits
   :maxdepth: 2

   ngstk.rst

.. toctree::
   :caption: Further Reading
   :maxdepth: 2

   api.rst
   faq.rst
   support.rst
42 changes: 2 additions & 40 deletions doc/source/intro.rst
@@ -2,54 +2,16 @@
Introduction
=========================


Pypiper is a lightweight python toolkit for gluing together restartable, robust
command line pipelines. With Pypiper, simplicity is paramount. A new user can start building pipelines using Pypiper in under 15 minutes. Learning all the :doc:`features and benefits <features>` of Pypiper takes just an hour or two. At the same time, Pypiper provides immediate and significant advantages over a simple shell script.

The target user of Pypiper is a computational scientist comfortable on the command line, who has something like a bash script that would benefit from a layer of "handling code". Pypiper helps you convert that set of shell commands into a production-scale workflow, automatically handling the annoying details (restartability, file integrity, logging) to make your pipeline robust and restartable, with a minimal learning curve.

Pypiper does not handle any sort of cluster job submission, resource requesting, or parallel dependency management (other than node-threaded parallelism inherent in your commands) -- we use `Looper <http://looper.readthedocs.io/>`_ for that (but you can use whatever you want). Pypiper just handles a one-sample, sequential pipeline, but now it's robust, restartable, and logged. When coupled with `Looper <http://looper.readthedocs.io/>`_ you get a complete pipeline management system.

**Power in simplicity**. Pypiper does not include a new language to learn. You **write your pipeline in python**. Pypiper does not assume you want a complex dependency structure. You write a simple **ordered sequence of commands**, just like a shell script. Pypiper tries to exploit the `Pareto principle <https://en.wikipedia.org/wiki/Pareto_principle>`_ -- you'll get 80% of the features with only 20% of the work of other pipeline management systems.

**Motivation**. As I began to put together production-scale pipelines, I found a lot of relevant pipelining systems, but was universally disappointed. For my needs, they were all overly complex. I wanted something simple enough to **quickly write and maintain** a pipeline without having to learn a lot of new functions and conventions, but robust enough to handle requirements like restartability and memory usage monitoring. Everything related was either a pre-packaged pipeline for a defined purpose, or a heavy-duty development environment that was overkill for a simple pipeline. Both of these seemed to be targeted toward ultra-efficient uses, and neither fit my needs: I had a set of commands already in mind -- I just needed a wrapper that could take that code and make it automatically restartable, logged, robust to crashing, easy to debug, and so forth.

If you need a full-blown, datacenter-scale environment that can do everything, look elsewhere. Pypiper's strength is its simplicity. If all you want is a shell-like script, but now with the power of python, and restartability, then Pypiper is for you.


19 changes: 12 additions & 7 deletions doc/source/outputs.rst
@@ -2,18 +2,23 @@
Outputs
=========================

Assume you are using a pypiper pipeline named `PIPE` (it passes ``name="PIPE"`` to the ``PipelineManager`` constructor). By default, your ``PipelineManager`` will produce the following outputs automatically (in addition to any output created by the actual pipeline commands you run):

* **PIPE_log.md**
  The log starts with a bunch of useful information about your run: a starting timestamp, version numbers of the pipeline and pypiper, a declaration of all arguments passed to the pipeline, the compute host, etc. Then, all output sent to screen is automatically logged to this file, providing a complete record of your run.

* **PIPE_status.flag**
  As the pipeline runs, it produces a flag in the output directory, which can be either ``PIPE_running.flag``, ``PIPE_failed.flag``, or ``PIPE_completed.flag``. These flags make it easy to assess the current state of running pipelines for individual samples, and for many samples in a project simultaneously.

* **stats.tsv**
  Any results reported by the pipeline are saved as key-value pairs in this file, for easy parsing (see the example after this list).

* **PIPE_profile.md**
  A profile log file that provides, for every process run by the pipeline, 3 items: 1) the process name; 2) the clock time taken by the process; and 3) the memory high water mark used by the process. This file makes it easy to profile pipelines for memory and time resources.

* **PIPE_commands.md**
  Pypiper produces a log file containing all the commands run by the pipeline, verbatim. These are also included in the main log.
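
For example, a single call inside a pipeline adds a row to the stats file (the key and value here are illustrative):

.. code-block:: python

   pm.report_result("reads", 1000000)  # appended to the stats file as a key-value pair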

Multiple pipelines can easily be run on the same sample, using the same output folder (and possibly sharing intermediate files), as the outputs will be identifiable by their ``PIPE_`` prefix.

These files are `markdown <https://daringfireball.net/projects/markdown/>`_, making them easy to read as text or to convert quickly into a pretty format like HTML.
12 changes: 12 additions & 0 deletions doc/source/usage.rst
@@ -0,0 +1,12 @@

Using Pipelines
=========================

Pypiper pipelines are python scripts. There is no special requirement or syntax. To run a pypiper pipeline, run it as you would any basic script; usage will vary from pipeline to pipeline, since pipeline authors determine what command-line arguments their pipeline will recognize. You can usually figure these out by passing the ``--help`` argument, like so: ``script.py --help``. Check the individual pipeline script you're using to see what it does.

Many pipelines employ default pypiper arguments which are detailed below:

- ``-R, --recover``
  Recover mode: overwrite locks. This argument tells pypiper to recover from a failed previous run. Pypiper will execute commands until it encounters a locked file, at which point it will re-execute the failed command and continue from there.
- ``-F, --follow``
  Force run follow functions; see :doc:`the follow argument <the-follow-argument>`.
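
For example, to restart a hypothetical pipeline that previously failed partway through:

.. code-block:: bash

   # Re-run, overwriting stale lock files left by the failed run
   python my_pipeline.py --recover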
12 changes: 12 additions & 0 deletions example_pipelines/hello_pypiper.py
@@ -0,0 +1,12 @@
#!/usr/bin/env python
import pypiper

outfolder = "hello_pypiper_results"  # Choose a folder for your results
pm = pypiper.PipelineManager(name="hello_pypiper", outfolder=outfolder)

pm.timestamp("Hello!")  # Mark a pipeline stage; timestamps appear in the log
target_file = "hello_pypiper_results/output.txt"
cmd = "echo 'Hello, Pypiper!' > " + target_file
pm.run(cmd, target_file)  # Run the command only if target_file doesn't already exist

pm.stop_pipeline()  # Flag the run as completed
