Merge pull request #91 from dib-lab/docs/overhaul

Documentation overhaul
kevlar-dev · Jul 19, 2017 · 1a9dfa0 · 1a9dfa0
2 parents 7668a05 + c01cc64
commit 1a9dfa0

Showing 11 changed files with 244 additions and 41 deletions.
diff --git a/.travis.yml b/.travis.yml
@@ -20,7 +20,7 @@ install:
 script:
     - make testall
     - make style
-    - PYTHONPATH=$(pwd) make doc
+    - make doc
 after_success:
     - make loc
     - bash <(curl -s https://codecov.io/bash)
diff --git a/Makefile b/Makefile
@@ -1,7 +1,7 @@
 SHELL=bash
 
 devenv:
-	pip install --upgrade pip setuptools pytest pytest-cov pep8 cython sphinx
+	pip install --upgrade pip setuptools pytest pytest-cov pep8 cython sphinx sphinx-argparse
 
 style:
 	pep8 kevlar/*.py kevlar/*/*.py kevlar/*/*/*.py
@@ -13,7 +13,7 @@ testall:
 	py.test -v --cov=kevlar kevlar/tests/*.py
 
 doc:
-	make -C docs/ html
+	PYTHONPATH=$$(pwd) make -C docs/ html
 
 loc:
 	@- echo -e "\n\n===== Core kevlar ====="

diff --git a/docs/cli.rst b/docs/cli.rst
@@ -1,10 +1,78 @@
-Command-line interface
-======================
+Comprehensive command-line interface reference
+==============================================
 
 The **kevlar** command-line interface is designed around a single command :code:`kevlar`.
 From this one command, a variety of tasks and procedures can be invoked using several *subcommands*.
 
 Once **kevlar** is installed, available subcommands can be listed by executing :code:`kevlar -h`.
 To see instructions for running a specific subcommand, execute :code:`kevlar <subcommand> -h` (of course replacing :code:`subcommand` with the actual name of the subcommand).
 
-More information will be posted here soon!
+kevlar dump
+-----------
+
+.. argparse::
+   :module: kevlar.cli
+   :func: parser
+   :nodefault:
+   :prog: kevlar
+   :path: dump
+
+kevlar count
+------------
+
+.. argparse::
+   :module: kevlar.cli
+   :func: parser
+   :nodefault:
+   :prog: kevlar
+   :path: count
+
+kevlar novel
+------------
+
+.. argparse::
+   :module: kevlar.cli
+   :func: parser
+   :nodefault:
+   :prog: kevlar
+   :path: novel
+
+kevlar filter
+-------------
+
+.. argparse::
+   :module: kevlar.cli
+   :func: parser
+   :nodefault:
+   :prog: kevlar
+   :path: filter
+
+kevlar assemble
+---------------
+
+.. argparse::
+   :module: kevlar.cli
+   :func: parser
+   :nodefault:
+   :prog: kevlar
+   :path: assemble
+
+kevlar localize
+---------------
+
+.. argparse::
+   :module: kevlar.cli
+   :func: parser
+   :nodefault:
+   :prog: kevlar
+   :path: localize
+
+kevlar mutate
+-------------
+
+.. argparse::
+   :module: kevlar.cli
+   :func: parser
+   :nodefault:
+   :prog: kevlar
+   :path: mutate
diff --git a/docs/conf.py b/docs/conf.py
@@ -33,6 +33,7 @@
     'sphinx.ext.autodoc',
     'sphinx.ext.doctest',
     'sphinx.ext.coverage',
+    'sphinxarg.ext',
 ]
 
 # Add any paths that contain templates here, relative to this directory.

diff --git a/docs/formats.rst b/docs/formats.rst
@@ -0,0 +1,49 @@
+File formats in **kevlar**
+==========================
+
+Although **kevlar** performs many operations on *k*-mers, read sequences are the primary currency of exchange between different stages of the analysis workflow.
+**kevlar** supports reading from and writing to Fasta and Fastq files, and treats these identically since it does not use any base call quality information.
+In most cases, **kevlar** can automatically detect whether an input file is gzip-compressed and handle it accordingly (bzip2 compression is not supported).
+
+Augmented sequences
+-------------------
+
+"Interesting *k*-mers" are putatively novel *k*-mers that are highly abundant in the proband/case sample(s) and effectively absent from control samples.
+To facilitate reading and writing these "interesting *k*-mers" along with the reads to which they belong, **kevlar** uses an *augmented* version of the Fasta and Fastq formats.
+Here is an example of an augmented Fastq file.
+
+.. code::
+
+   @read1
+   TTTTACCCGATGGGCGAGGTGAAATACTATGCCGATTTATTCTTACACAATTAAATTGCTAGTCCGGTTAGGGTTAGTTTGCGGCCTTCGTTCCAGCGCCGTGTT
+   +
+   BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
+        CCCGATGGGCGAGGTGAAA          18 1 0#
+                                                                        AGGGTTAGTTTGCGGCCTT          11 0 0#
+   @read2
+   AAGAGATTGTCGCTTGCCCCGTAAAGGAATTAGACCGGGCGACCAGAGCCTATTAGTAGCCCGCGCCTGTAGCACAAACGACTTTCGTACTATTATTAGACGTCG
+   +
+   BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
+    AGAGATTGTCGCTTGCCCC          14 0 1#
+     GAGATTGTCGCTTGCCCCG          12 0 0#
+      AGATTGTCGCTTGCCCCGT          14 0 0#
+   @read3
+   GAGACCATAAACCAGCTCTTGGTACCGAAAGAACACCTATGAATAACCGTGAGTGCATGATTCCTGTGAAGAGATTGTCGCTTGCCCCGTAAAGGAATTAGACCG
+   +
+   BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
+                  CTCTTGGTACCGAAAGAAC          19 1 0#
+                                                                        AGAGATTGTCGCTTGCCCC          14 0 1#
+                                                                         GAGATTGTCGCTTGCCCCG          12 0 0#
+                                                                          AGATTGTCGCTTGCCCCGT          14 0 0#
+   @read4
+   TCCGGTTAGGGTTAGTTTGCGGCCTTCGTTCCAGCGCCGTGTTGTTGCAATTTAATCCCGAGAAACCTCATGTAGCGGCTACTGGACCGCTGGGTAAGCTCAGAC
+   +
+   BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
+          AGGGTTAGTTTGCGGCCTT          11 0 0#
+
+As with a normal Fastq file, each record contains 4 lines to declare the read sequence and qualities.
+However, these 4 lines are followed by one or more lines indicating the "interesting *k*-mers", showing their sequence followed by their abundance in each sample (case first, then controls), with a ``#`` as the final character.
+Augmented Fastq files are easily converted to normal Fastq files by invoking a command like ``grep -v '#$' reads.augfastq > reads.fastq`` (same for augmented Fasta files).
+
+The functions ``kevlar.parse_augmented_fastx`` and ``kevlar.print_augmented_fastx`` are used internally to read and write augmented Fastq/Fasta files.
+However, these functions can easily be imported and called from third-party Python scripts as well.
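The augmented Fastq layout described in ``docs/formats.rst`` can be illustrated with a small standalone reader. This is only a sketch and is not the ``kevlar.parse_augmented_fastx`` implementation: the function name ``parse_augmented_fastq`` and the returned tuple layout are hypothetical, and it assumes (as the example above suggests) that the leading spaces on each k-mer line give the k-mer's offset within the read.

```python
import io


def parse_augmented_fastq(handle):
    """Yield (name, sequence, quality, kmers) for each augmented Fastq record.

    Each entry of ``kmers`` is a tuple (offset, kmer, abundances), where
    ``abundances`` lists the case abundance first, then the controls.
    Hypothetical helper for illustration; not part of the kevlar API.
    """
    # Drop blank lines so indices refer to non-blank lines only.
    lines = [line.rstrip("\n") for line in handle if line.strip()]
    i = 0
    while i < len(lines):
        if len(lines) - i < 4:
            raise ValueError("truncated record at non-blank line %d" % (i + 1))
        header, sequence, plus, quality = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed record at non-blank line %d" % (i + 1))
        i += 4

        # Annotation lines follow the 4 Fastq lines and end with '#'.
        # Reading the quality line first means a quality string that
        # happens to end with '#' is never mistaken for an annotation.
        kmers = []
        while (i < len(lines) and lines[i].endswith("#")
               and not lines[i].startswith("@")):
            body = lines[i][:-1]
            # Assumption: leading spaces encode the k-mer's offset in the read.
            offset = len(body) - len(body.lstrip(" "))
            kmer, *abundances = body.split()
            kmers.append((offset, kmer, [int(a) for a in abundances]))
            i += 1
        yield header[1:], sequence, quality, kmers


# A shortened version of the first record from the example above.
demo = io.StringIO(
    "@read1\n"
    "TTTTACCCGATGGGCGAGGTGAAATACT\n"
    "+\n"
    + "B" * 28 + "\n"
    "     CCCGATGGGCGAGGTGAAA          18 1 0#\n"
)
records = list(parse_augmented_fastq(demo))
```

Reading the four Fastq lines positionally, rather than filtering on a trailing ``#`` alone, avoids misclassifying a quality line whose last character is ``#``.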
diff --git a/docs/index.rst b/docs/index.rst
@@ -2,27 +2,23 @@
 
     <img src="_static/kevlar-logo.png" alt="kevlar logo" style="height: 150px; display: block" />
 
-The **kevlar** software is a testbed for developing reference-free variant discovery methods for genomics.
-The initial focus of development is novel germline variant discovery in human trio / quad experimental designs.
-However, the method lends itself easily to more general experimental designs, which will get more attention and support in the near future.
-
-Although a reference genome is not required, it can be utilized to reduce data volume at an early stage in the workflow and reduce the computational demands of subsequent steps.
-
-**kevlar** is currently under heavy development and is not yet stable.
-That said, the core features of the software are reasonbly well tested, and leverage software components from `the khmer library <https://khmer.readthedocs.io>`_ which are very well tested.
+Documentation for **kevlar**
+============================
 
 .. toctree::
    :maxdepth: 1
 
-   conduct
+   intro
    install
+   running
    quick-start
+   formats
+   sim
    cli
+   conduct
 
+Links
+-----
 
-Indices and tables
-==================
-
-* :ref:`genindex`
-* :ref:`modindex`
-* :ref:`search`
+- `Github repository <https://github.com/dib-lab/kevlar>`_
+- `License <https://github.com/dib-lab/kevlar/blob/master/LICENSE>`_
diff --git a/docs/install.rst b/docs/install.rst
@@ -4,6 +4,9 @@ Installing **kevlar**
 For the impatient
 -----------------
 
+If you have installed Python software before, the following 4 commands should be sufficient to install **kevlar** in the majority of cases.
+Otherwise, we suggest reading through the entire installation instructions before beginning.
+
 .. code::
 
     virtualenv kevlar-env
@@ -58,7 +61,7 @@ Development environment
 
 If you'd like to contribute to **kevlar**'s development or simply poke around, the source code can be cloned from Github.
 In addition to the dependencies listed above, a few additional dependencies are required for a complete development environment.
-These can be installed with `make` for your convenience.
+These can be installed with ``make`` for your convenience.
 
 .. code::
 

diff --git a/docs/intro.rst b/docs/intro.rst
@@ -0,0 +1,11 @@
+Introduction to **kevlar**
+==========================
+
+The **kevlar** software is a testbed for developing reference-free variant discovery methods for genomics.
+The initial focus of development is novel germline variant discovery in human trio / quad experimental designs.
+However, the method lends itself easily to more general experimental designs, which will get more attention and support in the near future.
+
+Although a reference genome is not required, it can be utilized to reduce data volume at an early stage in the workflow and reduce the computational demands of subsequent steps.
+
+**kevlar** is currently under heavy development and is not yet stable.
+That said, the core features of the software are reasonably well tested, and leverage software components from `the khmer library <https://khmer.readthedocs.io>`_ that are very well tested.
diff --git a/docs/quick-start.rst b/docs/quick-start.rst
@@ -1,9 +1,17 @@
 Quick start
 ===========
 
+Currently, **kevlar** development is focused heavily on trio and quad experimental designs.
+This document gives a bare-bones walkthrough of the **kevlar** workflow, from raw data to contig assembly.
+Final variant calling is coming soon.
+
+Eventually this workflow should be condensed into a shorter sequence of commands, but for now things are changing frequently enough that thorough documentation isn't yet feasible.
+
+----------
+
 If you have not already done so, install **kevlar** using :doc:`the following instructions <install>`.
 
-A complete listing of all available configuration options for each script can be shown by executing ``kevlar <subcommand> -h`` in the terminal.
+A complete listing of all available configuration options for each script can be found in :doc:`the CLI documentation <cli>`, or by executing ``kevlar <subcommand> -h`` in the terminal.
 
 ----------
 
@@ -12,42 +20,48 @@ A complete listing of all available configuration options for each script can be
    .. code:: bash
 
        kevlar count \
-           --case kevlar/tests/data/trio1/case1.fq \
-           --controls kevlar/tests/data/trio1/ctrl[1,2].fq \
+           --case case.counttable case-1.fq.gz case-2.fq.gz \
+           --control control1.counttable control-a-1.fq.gz control-a-2.fq.gz \
+           --control control2.counttable control-b-1.fq.gz control-b-2.fq.gz \
            --ksize 21 \
-           --ctrl_max 0
+           --ctrl-max 0
 
 #. Find "interesting" (potentially novel) k-mers
 
    .. code:: bash
 
        kevlar novel \
-           --cases kevlar/tests/data/trio1/case1.counttable \
-           --case_min 8 \
-           --controls kevlar/tests/data/trio1/ctrl[1,2].counttable \
-           --ctrl_max 0 \
+           --case-counts case-1.counttable \
+           --case-min 8 \
+           --controls control-1.counttable control-2.counttable \
+           --ctrl-max 0 \
            --ksize 21 \
-           --out case1.novel.unfiltered.augfastq.gz
-           kevlar/tests/data/trio1/case1.fq
+           --out case-1.novel.unfiltered.augfastq.gz \
+           case-1.fq.gz
 
-#. Recompute k-mer abundances to discard false positives, partition reads by shared novel k-mers
+#. Recompute k-mer abundances to discard false positives
 
    .. code:: bash
 
        kevlar filter \
-           --refr kevlar/tests/data/bogus-genome/refr.fa \
-           --contam kevlar/tests/data/bogus-genome/contam1.fa \
+           --refr refr.fa.gz \
+           --contam contaminants.fa \
            --min-abund 8 \
            --ksize 21 \
-           --aug-out case1.novel.filtered.augfastq.gz \
-           --out case1.novel.filtered.fq.gz \
-           --cc-prefix case1 \
-           case1.novel.unfiltered.augfastq.gz
+           --aug-out case-1.novel.filtered.augfastq.gz \
+           --out case-1.novel.filtered.fq.gz \
+           case-1.novel.unfiltered.augfastq.gz
+
+#. Partition reads by shared novel k-mers
+
+   .. code:: bash
 
-   Currently partitioning is done by ``kevlar filter``, but this will soon be handled by a dedicated ``kevlar partition`` command.
+       kevlar partition case-1-partition case-1.novel.filtered.augfastq.gz
 
 #. Assemble partitioned reads
 
    .. code:: bash
 
-       kevlar assemble --out case1.cc0.augfasta case1.cc0.augfastq.gz
+       kevlar assemble \
+           --out case-1-partition.cc0.augfasta \
+           case-1-partition.cc0.augfastq.gz
diff --git a/docs/running.rst b/docs/running.rst
@@ -0,0 +1,36 @@
+Running **kevlar**
+==================
+
+The **kevlar** software implements a Python library for genetic sequence and variant analysis.
+**kevlar** is primarily invoked from the command line, but it is also designed so that it can be imported into third-party Python programs.
+
+
+Command line interface
+----------------------
+
+Once installed, the **kevlar** software can be invoked from the shell using the ``kevlar`` command.
+The **kevlar** command line interface (CLI) uses the *subcommand* pattern, in which a single master command supports several different operations by defining multiple subcommands (such as ``kevlar novel`` and ``kevlar partition``).
+Comprehensive documentation of the **kevlar** CLI is available :doc:`here <cli>`.
+
+Starting with version 1.0, the CLI will be under `semantic versioning <http://semver.org/>`_.
+
+
+Python interface
+----------------
+
+As a result of **kevlar**'s design to facilitate internal testing, the "main method" of each **kevlar** subcommand can easily be executed programmatically.
+The following example shows how to execute ``kevlar reaugment`` from a standalone Python program.
+
+.. code:: python
+
+   import kevlar
+
+   # Declare arguments just like you would on the command line
+   arglist = ['reaugment', '-o', 'new.augfastq', 'old.augfastq', 'new.fastq']
+
+   args = kevlar.cli.parser().parse_args(arglist)
+   kevlar.reaugment.main(args)
+
+Other units of code in the **kevlar** package may also be amenable to importing and executing programmatically.
+However, the code internals are not under semantic versioning and by necessity will be less stable and have poorer documentation.
+Have fun and knock yourself out, but be prepared for changes in internal behavior in subsequent releases!
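The subcommand pattern described in ``docs/running.rst``, with one "main method" per subcommand that accepts parsed arguments, can be sketched with the standard ``argparse`` module. Everything named here (``toytool``, ``count_main``, ``dump_main``, and their options) is hypothetical and chosen for illustration; it is not kevlar's actual parser.

```python
import argparse


def count_main(args):
    # The "main method" of a hypothetical `count` subcommand.
    return "counting k-mers of size {} in {}".format(args.ksize, args.infile)


def dump_main(args):
    # The "main method" of a hypothetical `dump` subcommand.
    return "dumping {}".format(args.infile)


def parser():
    """Build a single top-level parser with one sub-parser per subcommand."""
    top = argparse.ArgumentParser(prog="toytool")
    subparsers = top.add_subparsers(dest="cmd")
    subparsers.required = True  # a subcommand must always be given

    count = subparsers.add_parser("count")
    count.add_argument("--ksize", type=int, default=21)
    count.add_argument("infile")
    count.set_defaults(func=count_main)

    dump = subparsers.add_parser("dump")
    dump.add_argument("infile")
    dump.set_defaults(func=dump_main)
    return top


# Declare arguments as a list, exactly as they would appear on the command line.
args = parser().parse_args(["count", "--ksize", "31", "reads.fq"])
result = args.func(args)
```

Because each main function takes an ``argparse.Namespace`` rather than reading ``sys.argv`` directly, it can be called from tests or from third-party scripts without spawning a shell.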
diff --git a/docs/sim.rst b/docs/sim.rst
@@ -0,0 +1,25 @@
+Simulating variants with **kevlar**
+===================================
+
+To facilitate testing, **kevlar** implements a simple command to apply simulated "mutations" to a reference sequence.
+We have used this internally to simulate data sets for testing, to verify that **kevlar** can recover the simulated "mutation" or variant.
+
+The command-line interface for ``kevlar mutate`` is very simple (for full details see `the CLI documentation <cli.html#kevlar-mutate>`_).
+
+The "mutation file" format is described here by way of example.
+
+.. code::
+
+   seq1	2345915	del	141
+   seq1	1022305	snv	2
+   seq1	2062327	inv	429
+   seq1	1234310	del	32
+   seq1	388954	ins	TGTTTCCTTTCATACCCCACCAC
+   seq1	2460047	snv	2
+
+The mutation file is a plain-text tabular file with four fields separated by spaces or tabs.
+
+- sequence ID
+- variant starting position (0-based)
+- variant type (currently supported types: ``snv``, ``ins``, ``del``, and ``inv`` for single-nucleotide variants, insertions, deletions, and inversions)
+- value; represents lexicographic offset for SNVs, variant length for deletions and inversions, and inserted sequence for insertions.
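The four-field mutation file described in ``docs/sim.rst`` is simple enough to read with a few lines of Python. The sketch below is not taken from the kevlar source: the ``Mutation`` tuple and ``parse_mutations`` function are hypothetical names, and the only behavior assumed is what the field list above states (whitespace-separated fields, integer values for every type except ``ins``).

```python
from collections import namedtuple

# Hypothetical record type for illustration; not part of the kevlar API.
Mutation = namedtuple("Mutation", ["seqid", "position", "kind", "value"])

SUPPORTED_TYPES = {"snv", "ins", "del", "inv"}


def parse_mutations(lines):
    """Yield one Mutation per non-blank line of a mutation file."""
    for lineno, line in enumerate(lines, 1):
        fields = line.split()  # splits on spaces or tabs
        if not fields:
            continue
        if len(fields) != 4:
            raise ValueError("line %d: expected 4 fields, found %d"
                             % (lineno, len(fields)))
        seqid, position, kind, value = fields
        if kind not in SUPPORTED_TYPES:
            raise ValueError("line %d: unsupported type %r" % (lineno, kind))
        # Insertions carry the inserted sequence; every other type
        # carries an integer (an offset for SNVs, a length otherwise).
        if kind != "ins":
            value = int(value)
        yield Mutation(seqid, int(position), kind, value)


demo_lines = [
    "seq1\t2345915\tdel\t141",
    "seq1\t1022305\tsnv\t2",
    "seq1\t388954\tins\tTGTTTCCTTTCATACCCCACCAC",
]
mutations = list(parse_mutations(demo_lines))
```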