Merge branch 'master' of github.com:epigen/pypiper
Charles Dietz committed Jul 6, 2016
2 parents a4616f8 + 7f7e122 commit 0e5e538
Showing 10 changed files with 68 additions and 42 deletions.
7 changes: 7 additions & 0 deletions doc/source/advanced.rst
@@ -1,6 +1,13 @@
Advanced
=========================


Toolkits
*************

Pypiper includes optional "toolkits" (right now just one) -- suites of commonly used code snippets that simplify tasks for a pipeline author. For example, the next-gen sequencing toolkit, NGSTk, provides convenient helper functions to create common shell commands, like converting between file formats (*e.g.* ``bam_to_fastq()``), merging files (*e.g.* ``merge_bams()``), or counting reads. These make it faster to design bioinformatics pipelines in Pypiper, but are entirely optional. Contributions of additional toolkits, or of functions to an existing toolkit, are welcome.


The ``follow`` argument
*************
*Follow functions* let you couple a python function to a ``run()`` command; the function is executed only if its command is run. This lets you avoid repeating unnecessary processing when you restart your pipeline multiple times (for instance, while debugging later steps in the pipeline).
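The coupling described above can be sketched in plain python. This is a toy re-implementation of the semantics only, not pypiper's actual code; the names ``run``, ``count_lines``, and the file names are invented for illustration:

```python
import os
import subprocess

def run(cmd, target=None, follow=None):
    # Toy model of the semantics described above (not pypiper's code):
    # skip the command if its target already exists; when the command
    # does run, invoke the coupled follow function afterwards.
    if target is not None and os.path.exists(target):
        print("Target exists, skipping:", cmd)
        return
    subprocess.check_call(cmd, shell=True)
    if follow is not None:
        follow()  # fires only because the command actually ran

def count_lines():
    # A typical follow function: compute QC on the freshly created file.
    with open("numbers.txt") as f:
        print(sum(1 for _ in f), "lines")

run("seq 1 100 > numbers.txt", target="numbers.txt", follow=count_lines)
# On a restart the target exists, so neither the command nor the
# follow function runs again:
run("seq 1 100 > numbers.txt", target="numbers.txt", follow=count_lines)
```

On the first call the command runs and ``count_lines`` reports the line count; on the second call both are skipped, which is how follow functions avoid repeating QC work across pipeline restarts.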
2 changes: 1 addition & 1 deletion doc/source/features.rst
@@ -1,5 +1,5 @@

Features
Features at-a-glance
=========================
Pypiper provides the following benefits:

8 changes: 2 additions & 6 deletions doc/source/index.rst
@@ -2,12 +2,7 @@
Welcome
^^^^^^^^

Making robust pipelines just got easier. Pypiper helps you take your current pipeline and make it better with minimal effort on your part.

Pypiper is a lightweight python toolkit for gluing together restartable, robust
command line pipelines. With Pypiper, simplicity is paramount. A new user can start building pipelines using Pypiper in under 15 minutes. Learning all the :doc:`features and benefits <features>` of Pypiper takes just an hour or two. At the same time, Pypiper provides immediate and significant advantages over a simple shell script.

To get started, proceed with the :doc:`Introduction <intro>` or use the table of contents below to navigate the docs. When you have a feel for what Pypiper is, the best way to learn is by looking at the :doc:`Tutorials <tutorials>`. That should get you started building your first pipeline. But before you get too deep, make sure you get the most out of Pypiper by checking out the :doc:`Advanced Functions <advanced>` and :doc:`Best Practices <best-practices>` for building pipelines.
Making robust pipelines just got easier. Pypiper helps you take your current pipeline and make it better with minimal effort on your part. Pypiper is a lightweight toolkit for writing slick pipelines in python. You'll be running in minutes. Interested? Proceed to the :doc:`Introduction <intro>` or jump straight to the :doc:`Tutorials <tutorials>`.

Links
^^^^^^^^
@@ -38,6 +33,7 @@ Contents
best-practices.rst
ngstk.rst
api.rst
support.rst


Indices and tables
30 changes: 13 additions & 17 deletions doc/source/intro.rst
@@ -5,9 +5,18 @@ Introduction
Overview
*************

The target user of Pypiper is a computational scientist comfortable on the command line, who has something like a bash script that would benefit from a layer of "handling code". Pypiper helps you convert that set of shell commands into a production-scale workflow, automatically handling the annoying details to make your pipeline robust and restartable, with minimal learning curve.
Pypiper is a lightweight python toolkit for gluing together restartable, robust
command line pipelines. With Pypiper, simplicity is paramount. A new user can start building pipelines using Pypiper in under 15 minutes. Learning all the :doc:`features and benefits <features>` of Pypiper takes just an hour or two. At the same time, Pypiper provides immediate and significant advantages over a simple shell script.

The target user of Pypiper is a computational scientist comfortable on the command line, who has something like a bash script that would benefit from a layer of "handling code". Pypiper helps you convert that set of shell commands into a production-scale workflow, automatically handling the annoying details (restartability, file integrity, logging) to make your pipeline robust and restartable, with minimal learning curve.

Pypiper does not handle any sort of cluster job submission, resource requesting, or parallel dependency management (other than node-threaded parallelism inherent in your commands) -- we use `Looper <http://looper.readthedocs.io/>`_ for that (but you can use whatever you want). Pypiper just handles a one-sample, sequential pipeline, but now it's robust, restartable, and logged. When coupled with `Looper <http://looper.readthedocs.io/>`_ you get a complete pipeline management system.

Power in simplicity
*********************

Pypiper does not include a new language to learn. You **write your pipeline in python**. Pypiper does not assume you want a complex dependency structure. You write a simple **ordered sequence of commands**, just like a shell script. Pypiper tries to exploit the `Pareto principle <https://en.wikipedia.org/wiki/Pareto_principle>`_ -- you'll get 80% of the features with only 20% of the work of other pipeline management systems.

Pypiper does not handle any sort of cluster job submission, resource requesting, or parallel dependency management (other than node-threaded parallelism inherent in your commands). You can use your current setup for those things, and use Pypiper just to produce a robust, restartable, and logged procedural pipeline.

Installing
*************
@@ -26,16 +35,11 @@ Update with:
pip install --user --upgrade https://github.com/epigen/pypiper/zipball/master
Toolkits
*************

Pypiper includes optional "toolkits" -- suites of commonly used code snippets that simplify tasks for a pipeline author. For example, the next-gen sequencing toolkit, NGSTk, provides convenient helper functions to create common shell commands, like converting between file formats (*e.g.* ``bam_to_fastq()``), merging files (*e.g.* ``merge_bams()``), or counting reads. These make it faster to design bioinformatics pipelines in Pypiper, but are entirely optional. Contributions of additional toolkits, or of functions to an existing toolkit, are welcome.

Motivation
*************

As I began to put together production-scale pipelines, I found a lot of relevant pipelining systems, but was universally disappointed. For my needs, they were all overly complex. I wanted something simple enough to quickly write and maintain a pipeline without having to learn a lot of new functions and conventions, but robust enough to handle requirements like restartability and memory usage monitoring. Everything related was either a pre-packaged pipeline for a defined purpose, or a heavy-duty development environment that was overkill for a simple pipeline. Both of these seemed to be targeted toward less experienced developers who sought structure, and neither fit my needs: I had a set of commands already in mind; I just needed a wrapper that could take that code and make it automatically restartable, logged, robust to crashing, easy to debug, and so forth.
As I began to put together production-scale pipelines, I found a lot of relevant pipelining systems, but was universally disappointed. For my needs, they were all overly complex. I wanted something simple enough to **quickly write and maintain** a pipeline without having to learn a lot of new functions and conventions, but robust enough to handle requirements like restartability and memory usage monitoring. Everything related was either a pre-packaged pipeline for a defined purpose, or a heavy-duty development environment that was overkill for a simple pipeline. Both of these seemed to be targeted toward ultra-efficient uses, and neither fit my needs: I had a set of commands already in mind -- I just needed a wrapper that could take that code and make it automatically restartable, logged, robust to crashing, easy to debug, and so forth.

If you need a full-blown, datacenter-scale environment that can do everything, look elsewhere. Pypiper's strength is its simplicity. If all you want is a shell-like script, but now with the power of python, and restartability, then Pypiper is for you.

@@ -47,13 +51,5 @@ You can also generate docs locally using `sphinx <http://www.sphinx-doc.org/en/s

Testing
*************
You can test pypiper by running ``python test_pypiper.py``, which has some unit tests.

License
*************
Pypiper ___ licensed source code is available at http://github.com/epigen/pypiper/ .

Contributing
*************
We welcome contributions in the form of pull requests; or, if you find a bug or want to request a feature, open an issue in https://github.com/epigen/pypiper/issues.
You can test pypiper by cloning it and running the included unit tests: ``python test_pypiper.py``.

15 changes: 11 additions & 4 deletions doc/source/ngstk.rst
@@ -8,11 +8,18 @@ Example:

.. code-block:: python
from pypiper.ngstk import NGSTk
import pypiper
pm = pypiper.PipelineManager(..., args = args)
tk = NGSTk()
tk.index_bam("sample.bam")
# Create a ngstk object (pass the PipelineManager as an argument)
ngstk = pypiper.NGSTk(pm = pm)
# Now you can use ngstk functions
ngstk.index_bam("sample.bam")
A list of available functions can be found in the :doc:`API <api>` or in the source code for `NGSTk`_.

Contributions of additional toolkits or functions in an existing toolkit are welcome.

.. _NGSTk: https://github.com/epigen/pypiper/pypiper/ngstk.py
.. _NGSTk: https://github.com/epigen/pypiper/blob/master/pypiper/ngstk.py
11 changes: 11 additions & 0 deletions doc/source/support.rst
@@ -0,0 +1,11 @@

Support
=========================

Bug Reports
*************
If you find a bug or want to request a feature, open an issue at https://github.com/epigen/pypiper/issues.

Contributing
*************
We welcome contributions in the form of pull requests.
2 changes: 2 additions & 0 deletions doc/source/tutorials-basic.rst
@@ -1,6 +1,8 @@
Basic tutorial
*****************

Now, download `basic.py <https://github.com/epigen/pypiper/blob/master/example_pipelines/basic.py>`_ and run it with ``python basic.py`` (or, better yet, make it executable with ``chmod 755 basic.py`` and run it directly with ``./basic.py``).

.. literalinclude:: ../../example_pipelines/basic.py


13 changes: 7 additions & 6 deletions doc/source/tutorials-first.rst
@@ -2,13 +2,14 @@
Your first pipeline
***************************

Using pypiper is simple. First, import pypiper, specify an output folder, and create a new ``PipelineManager`` object:
Using pypiper is simple. Your pipeline is a python script, say `pipeline.py`. First, import pypiper, specify an output folder, and create a new ``PipelineManager`` object:

.. code-block:: python
#!/usr/bin/env python
import pypiper, os
outfolder = "pipeline_output/" # Choose a folder for your results
pipeline = pypiper.PipelineManager(name="my_pipeline", outfolder=outfolder)
pm = pypiper.PipelineManager(name="my_pipeline", outfolder=outfolder)
This creates your ``outfolder`` and places a flag called ``my_pipeline_running.flag`` in the folder. It also initializes the log file (``my_pipeline_log.md``) with statistics such as time of starting, compute node, software versions, command-line parameters, etc.

@@ -18,17 +19,17 @@ Now, the workhorse of ``PipelineManager`` is the ``run()`` function. Essentially
# our command will produce this output file
target = os.path.join(outfolder, "outfile.txt")
cmd = "shuf -i 1-500000000 -n 10000000 > " + target
pipeline.run(command, target)
command = "shuf -i 1-500000000 -n 10000000 > " + target
pm.run(command, target)
The command (``cmd``) is the only required argument to ``run()``. You can leave ``target`` empty (pass ``None``). If you **do** specify a target, the command will only be run if the target file does not already exist. If you **do not** specify a target, the command will be run every time the pipeline is run.
The command (``command``) is the only required argument to ``run()``. You can leave ``target`` empty (pass ``None``). If you **do** specify a target, the command will only be run if the target file does not already exist. If you **do not** specify a target, the command will be run every time the pipeline is run.
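The target-gating contract just described can be sketched in a few lines. This is a toy model for illustration only (``run_sketch`` is an invented name; pypiper's real ``run()`` also handles locking, logging, and memory monitoring):

```python
import os
import subprocess

def run_sketch(cmd, target=None):
    # Toy model of the contract described above (not pypiper's code):
    # with a target, the command is skipped once that file exists;
    # with target=None, the command runs on every pipeline invocation.
    if target is not None and os.path.exists(target):
        return "skipped"
    subprocess.check_call(cmd, shell=True)
    return "ran"

print(run_sketch("touch outfile.txt", target="outfile.txt"))  # first run
print(run_sketch("touch outfile.txt", target="outfile.txt"))  # restart: skipped
print(run_sketch("echo no target"))  # no target: runs every time
```

This skip-if-target-exists rule is what makes a pipeline restartable: on a rerun, completed steps whose outputs survive are passed over, and execution resumes at the first step whose target is missing.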

Now string together whatever commands your pipeline requires! At the end, terminate the pipeline so it gets flagged as successfully completed:

.. code-block:: python
pipeline.stop_pipeline()
pm.stop_pipeline()
That's it! By running commands through ``run()`` instead of directly in bash, you get a robust, logged, restartable pipeline manager for free!

6 changes: 6 additions & 0 deletions doc/source/tutorials.rst
@@ -1,6 +1,10 @@


Tutorials
=========================

Start with the 'your first pipeline' tutorial to get a quick overview of a simple pipeline, then look through the other examples for more advanced concepts.

.. toctree::
:maxdepth: 2

@@ -11,3 +15,5 @@ Tutorials
* `example_pipelines/advanced.py` (under construction) - A tutorial demonstrating some more advanced features.
* `example_pipelines/bioinformatics.py` (under construction) - A tutorial showing some bioinformatics use cases.

Finally, check out some real-world pipelines in our `open_pipelines repository <https://github.com/epigen/open_pipelines>`_.

16 changes: 8 additions & 8 deletions example_pipelines/basic.py
@@ -12,7 +12,7 @@
# Create a PipelineManager instance (don't forget to name it!)
# This starts the pipeline.

mypiper = pypiper.PipelineManager(name="BASIC",
pm = pypiper.PipelineManager(name="BASIC",
outfolder="pipeline_output/")

# Now just build shell command strings, and use the run function
@@ -28,33 +28,33 @@
cmd = "shuf -i 1-500000000 -n 10000000 > " + tgt

# and run with run().
mypiper.run(cmd, target=tgt)
pm.run(cmd, target=tgt)

# Now copy the data into a new file.
# first specify target file and build command:
tgt = "pipeline_output/copied.out"
cmd = "cp pipeline_output/test.out " + tgt
mypiper.run(cmd, target=tgt)
pm.run(cmd, target=tgt)

# You can also string multiple commands together, which will execute
# in order as a group to create the final target.
cmd1 = "sleep 5"
cmd2 = "touch pipeline_output/touched.out"
mypiper.run([cmd1, cmd2], target="pipeline_output/touched.out")
pm.run([cmd1, cmd2], target="pipeline_output/touched.out")

# A command without a target will run every time.
# Find the biggest line
cmd = "awk 'n < $0 {n=$0} END{print n}' pipeline_output/test.out"
mypiper.run(cmd, "lock.max")
pm.run(cmd, "lock.max")

# Use checkprint() to get the results of a command, and then use
# report_result() to print and log key-value pairs in the stats file:
last_entry = mypiper.checkprint("tail -n 1 pipeline_output/copied.out")
mypiper.report_result("last_entry", last_entry)
last_entry = pm.checkprint("tail -n 1 pipeline_output/copied.out")
pm.report_result("last_entry", last_entry)


# Now, stop the pipeline to complete gracefully.
mypiper.stop_pipeline()
pm.stop_pipeline()

# Observe your outputs in the pipeline_output folder
# to see what you've created.
