Merge pull request #91 from dib-lab/docs/overhaul
Documentation overhaul
standage committed Jul 19, 2017
2 parents 7668a05 + c01cc64 commit 1a9dfa0
Showing 11 changed files with 244 additions and 41 deletions.
2 changes: 1 addition & 1 deletion .travis.yml
@@ -20,7 +20,7 @@ install:
script:
- make testall
- make style
- PYTHONPATH=$(pwd) make doc
- make doc
after_success:
- make loc
- bash <(curl -s https://codecov.io/bash)
4 changes: 2 additions & 2 deletions Makefile
@@ -1,7 +1,7 @@
SHELL=bash

devenv:
pip install --upgrade pip setuptools pytest pytest-cov pep8 cython sphinx
pip install --upgrade pip setuptools pytest pytest-cov pep8 cython sphinx sphinx-argparse

style:
pep8 kevlar/*.py kevlar/*/*.py kevlar/*/*/*.py
@@ -13,7 +13,7 @@ testall:
py.test -v --cov=kevlar kevlar/tests/*.py

doc:
make -C docs/ html
PYTHONPATH=$$(pwd) make -C docs/ html

loc:
@- echo -e "\n\n===== Core kevlar ====="
74 changes: 71 additions & 3 deletions docs/cli.rst
@@ -1,10 +1,78 @@
Command-line interface
======================
Comprehensive command-line interface reference
==============================================

The **kevlar** command-line interface is designed around a single command :code:`kevlar`.
From this one command, a variety of tasks and procedures can be invoked using several *subcommands*.

Once **kevlar** is installed, available subcommands can be listed by executing :code:`kevlar -h`.
To see instructions for running a specific subcommand, execute :code:`kevlar <subcommand> -h` (of course replacing :code:`subcommand` with the actual name of the subcommand).

More information will be posted here soon!
kevlar dump
-----------

.. argparse::
:module: kevlar.cli
:func: parser
:nodefault:
:prog: kevlar
:path: dump

kevlar count
------------

.. argparse::
:module: kevlar.cli
:func: parser
:nodefault:
:prog: kevlar
:path: count

kevlar novel
------------

.. argparse::
:module: kevlar.cli
:func: parser
:nodefault:
:prog: kevlar
:path: novel

kevlar filter
-------------

.. argparse::
:module: kevlar.cli
:func: parser
:nodefault:
:prog: kevlar
:path: filter

kevlar assemble
---------------

.. argparse::
:module: kevlar.cli
:func: parser
:nodefault:
:prog: kevlar
:path: assemble

kevlar localize
---------------

.. argparse::
:module: kevlar.cli
:func: parser
:nodefault:
:prog: kevlar
:path: localize

kevlar mutate
-------------

.. argparse::
:module: kevlar.cli
:func: parser
:nodefault:
:prog: kevlar
:path: mutate
1 change: 1 addition & 0 deletions docs/conf.py
@@ -33,6 +33,7 @@
'sphinx.ext.autodoc',
'sphinx.ext.doctest',
'sphinx.ext.coverage',
'sphinxarg.ext',
]

# Add any paths that contain templates here, relative to this directory.
49 changes: 49 additions & 0 deletions docs/formats.rst
@@ -0,0 +1,49 @@
File formats in **kevlar**
==========================

Although **kevlar** performs many operations on *k*-mers, read sequences are the primary currency of exchange between different stages of the analysis workflow.
**kevlar** supports reading from and writing to Fasta and Fastq files, and treats these identically since it does not use any base call quality information.
In most cases, **kevlar** should also be able to automatically detect whether an input file is gzip-compressed or not and handle it accordingly (no bzip2 support).

Augmented sequences
-------------------

"Interesting *k*-mers" are putatively novel *k*-mers that are highly abundant in the proband/case sample(s) and effectively absent from control samples.
To facilitate reading and writing these "interesting *k*-mers" along with the reads to which they belong, **kevlar** uses an *augmented* version of the Fasta and Fastq formats.
Here is an example of an augmented Fastq file.

.. code::
@read1
TTTTACCCGATGGGCGAGGTGAAATACTATGCCGATTTATTCTTACACAATTAAATTGCTAGTCCGGTTAGGGTTAGTTTGCGGCCTTCGTTCCAGCGCCGTGTT
+
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
CCCGATGGGCGAGGTGAAA 18 1 0#
AGGGTTAGTTTGCGGCCTT 11 0 0#
@read2
AAGAGATTGTCGCTTGCCCCGTAAAGGAATTAGACCGGGCGACCAGAGCCTATTAGTAGCCCGCGCCTGTAGCACAAACGACTTTCGTACTATTATTAGACGTCG
+
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
AGAGATTGTCGCTTGCCCC 14 0 1#
GAGATTGTCGCTTGCCCCG 12 0 0#
AGATTGTCGCTTGCCCCGT 14 0 0#
@read3
GAGACCATAAACCAGCTCTTGGTACCGAAAGAACACCTATGAATAACCGTGAGTGCATGATTCCTGTGAAGAGATTGTCGCTTGCCCCGTAAAGGAATTAGACCG
+
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
CTCTTGGTACCGAAAGAAC 19 1 0#
AGAGATTGTCGCTTGCCCC 14 0 1#
GAGATTGTCGCTTGCCCCG 12 0 0#
AGATTGTCGCTTGCCCCGT 14 0 0#
@read4
TCCGGTTAGGGTTAGTTTGCGGCCTTCGTTCCAGCGCCGTGTTGTTGCAATTTAATCCCGAGAAACCTCATGTAGCGGCTACTGGACCGCTGGGTAAGCTCAGAC
+
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
AGGGTTAGTTTGCGGCCTT 11 0 0#
As with a normal Fastq file, each record contains 4 lines to declare the read sequence and qualities.
However, these 4 lines are followed by one or more lines indicating the "interesting *k*-mers", showing their sequence followed by their abundance in each sample (case first, then controls), with a ``#`` as the final character.
Augmented Fastq files are easily converted to normal Fastq files by invoking a command like ``grep -v '#$' reads.augfastq > reads.fastq`` (same for augmented Fasta files).

The functions ``kevlar.parse_augmented_fastx`` and ``kevlar.print_augmented_fastx`` are used internally to read and write augmented Fastq/Fasta files.
However, these functions can easily be imported and called from third-party Python scripts as well.
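
The record layout described above can be sketched in a few lines of standalone Python.
This is an illustration only, not the implementation of ``kevlar.parse_augmented_fastx``: the function names ``parse_augmented_fastq`` and ``is_interesting`` are hypothetical, the parser assumes well-formed 4-line Fastq records, and the thresholds (case abundance of at least 8, control abundance of at most 0) are borrowed from the quick start purely as an example.

```python
import io


def parse_augmented_fastq(handle):
    """Yield (name, sequence, quality, kmers) for each augmented Fastq record.

    kmers is a list of (kmer, abundances) pairs, with abundances ordered
    case first, then controls. Hypothetical helper, not part of kevlar.
    """
    lines = [line.rstrip() for line in handle if line.strip()]
    i = 0
    while i < len(lines):
        # The first four lines of a record are ordinary Fastq.
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith('@') or not plus.startswith('+'):
            raise ValueError('malformed Fastq record near line: ' + header)
        i += 4
        kmers = []
        # Annotation lines end with '#'; the next record header starts with '@'.
        while i < len(lines) and lines[i].endswith('#') \
                and not lines[i].startswith('@'):
            fields = lines[i][:-1].split()
            kmers.append((fields[0], [int(x) for x in fields[1:]]))
            i += 1
        yield header[1:], seq, qual, kmers


def is_interesting(abundances, case_min=8, ctrl_max=0):
    """Assumed thresholds: abundant in the case, absent from every control."""
    case, controls = abundances[0], abundances[1:]
    return case >= case_min and all(c <= ctrl_max for c in controls)


# A shortened record modeled on read2 in the example above.
SAMPLE = """@read2
AAGAGATTGTCGCTTGCCCCGT
+
BBBBBBBBBBBBBBBBBBBBBB
AGAGATTGTCGCTTGCCCC 14 0 1#
GAGATTGTCGCTTGCCCCG 12 0 0#
"""

for name, seq, qual, kmers in parse_augmented_fastq(io.StringIO(SAMPLE)):
    for kmer, abundances in kmers:
        print(name, kmer, abundances, is_interesting(abundances))
```

Parsing by position (four Fastq lines, then annotations) rather than by the trailing ``#`` alone avoids misreading a quality string that happens to end in ``#``.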
26 changes: 11 additions & 15 deletions docs/index.rst
@@ -2,27 +2,23 @@

<img src="_static/kevlar-logo.png" alt="kevlar logo" style="height: 150px; display: block" />

The **kevlar** software is a testbed for developing reference-free variant discovery methods for genomics.
The initial focus of development is novel germline variant discovery in human trio / quad experimental designs.
However, the method lends itself easily to more general experimental designs, which will get more attention and support in the near future.

Although a reference genome is not required, it can be utilized to reduce data volume at an early stage in the workflow and reduce the computational demands of subsequent steps.

**kevlar** is currently under heavy development and is not yet stable.
That said, the core features of the software are reasonably well tested and leverage software components from `the khmer library <https://khmer.readthedocs.io>`_, which are very well tested.
Documentation for **kevlar**
============================

.. toctree::
:maxdepth: 1

conduct
intro
install
running
quick-start
formats
sim
cli
conduct

Links
-----

Indices and tables
==================

* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`
- `Github repository <https://github.com/dib-lab/kevlar>`_
- `License <https://github.com/dib-lab/kevlar/blob/master/LICENSE>`_
5 changes: 4 additions & 1 deletion docs/install.rst
@@ -4,6 +4,9 @@ Installing **kevlar**
For the impatient
-----------------

If this isn't your first time in the wing, the following 4 commands should be sufficient to install **kevlar** in the majority of cases.
Otherwise, we suggest reading through the entire installation instructions before beginning.

.. code::
virtualenv kevlar-env
@@ -58,7 +61,7 @@ Development environment

If you'd like to contribute to **kevlar**'s development or simply poke around, the source code can be cloned from Github.
In addition to the dependencies listed above, a few additional dependencies are required for a complete development environment.
These can be installed with `make` for your convenience.
These can be installed with ``make`` for your convenience.

.. code::
11 changes: 11 additions & 0 deletions docs/intro.rst
@@ -0,0 +1,11 @@
Introduction to **kevlar**
==========================

The **kevlar** software is a testbed for developing reference-free variant discovery methods for genomics.
The initial focus of development is novel germline variant discovery in human trio / quad experimental designs.
However, the method lends itself easily to more general experimental designs, which will get more attention and support in the near future.

Although a reference genome is not required, it can be utilized to reduce data volume at an early stage in the workflow and reduce the computational demands of subsequent steps.

**kevlar** is currently under heavy development and is not yet stable.
That said, the core features of the software are reasonably well tested and leverage software components from `the khmer library <https://khmer.readthedocs.io>`_, which are very well tested.
52 changes: 33 additions & 19 deletions docs/quick-start.rst
@@ -1,9 +1,17 @@
Quick start
===========

Currently, **kevlar** development is focused heavily on trio and quad experimental designs.
This document gives a bare-bones walkthrough of the **kevlar** workflow, from raw data to contig assembly.
Final variant calling is coming soon.

Eventually this workflow should be condensed into a shorter sequence of commands, but for now things are changing too frequently for thorough documentation to be feasible.

----------

If you have not already done so, install **kevlar** using :doc:`the following instructions <install>`.

A complete listing of all available configuration options for each script can be shown by executing ``kevlar <subcommand> -h`` in the terminal.
A complete listing of all available configuration options for each script can be found in :doc:`the CLI documentation <cli>`, or by executing ``kevlar <subcommand> -h`` in the terminal.

----------

@@ -12,42 +20,48 @@ A complete listing of all available configuration options for each script can be
.. code:: bash
kevlar count \
--case kevlar/tests/data/trio1/case1.fq \
--controls kevlar/tests/data/trio1/ctrl[1,2].fq \
--case case.counttable case-1.fq.gz case-2.fq.gz \
--control control1.counttable control-a-1.fq.gz control-a-2.fq.gz \
--control control2.counttable control-b-1.fq.gz control-b-2.fq.gz \
--ksize 21 \
--ctrl_max 0
--ctrl-max 0
#. Find "interesting" (potentially novel) k-mers

.. code:: bash
kevlar novel \
--cases kevlar/tests/data/trio1/case1.counttable \
--case_min 8 \
--controls kevlar/tests/data/trio1/ctrl[1,2].counttable \
--ctrl_max 0 \
--case-counts case.counttable \
--case-min 8 \
--controls control1.counttable control2.counttable \
--ctrl-max 0 \
--ksize 21 \
--out case1.novel.unfiltered.augfastq.gz
kevlar/tests/data/trio1/case1.fq
--out case-1.novel.unfiltered.augfastq.gz
case-1.fq.gz
#. Recompute k-mer abundances to discard false positives, partition reads by shared novel k-mers
#. Recompute k-mer abundances to discard false positives

.. code:: bash
kevlar filter \
--refr kevlar/tests/data/bogus-genome/refr.fa \
--contam kevlar/tests/data/bogus-genome/contam1.fa \
--refr refr.fa.gz \
--contam contaminants.fa \
--min-abund 8 \
--ksize 21 \
--aug-out case1.novel.filtered.augfastq.gz \
--out case1.novel.filtered.fq.gz \
--cc-prefix case1 \
case1.novel.unfiltered.augfastq.gz
--aug-out case-1.novel.filtered.augfastq.gz \
--out case-1.novel.filtered.fq.gz \
case-1.novel.unfiltered.augfastq.gz
#. Partition reads by shared novel k-mers

.. code:: bash
Currently partitioning is done by ``kevlar filter``, but this will soon be handled by a dedicated ``kevlar partition`` command.
kevlar partition case-1-partition case-1.novel.filtered.augfastq.gz
#. Assemble partitioned reads

.. code:: bash
kevlar assemble --out case1.cc0.augfasta case1.cc0.augfastq.gz
kevlar assemble \
--out case-1-partition.cc0.augfasta \
case-1-partition.cc0.augfastq.gz
36 changes: 36 additions & 0 deletions docs/running.rst
@@ -0,0 +1,36 @@
Running **kevlar**
==================

The **kevlar** software implements a Python library for genetic sequence and variant analysis.
The primary interface to **kevlar** is the command line, but the software is also designed so that it can be integrated into third-party Python programs.


Command line interface
----------------------

Once installed, the **kevlar** software can be invoked from the shell using the ``kevlar`` command.
The **kevlar** command line interface (CLI) uses the *subcommand* pattern, in which a single master command supports several different operations by defining multiple subcommands (such as ``kevlar novel`` and ``kevlar partition``).
Comprehensive documentation of the **kevlar** CLI is available :doc:`here <cli>`.

Starting with version 1.0, the CLI will be under `semantic versioning <http://semver.org/>`_.
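
For readers unfamiliar with the subcommand pattern, the following is a minimal sketch using Python's standard ``argparse`` module.
It is not **kevlar**'s actual parser: the program name ``tool``, the two subcommands, their options, and the ``run`` helper are all hypothetical and chosen only to show how a single master command dispatches to per-subcommand "main" functions.

```python
import argparse


def count_main(args):
    # Hypothetical "main method" for the count subcommand.
    return 'counting {}-mers in {} file(s)'.format(args.ksize, len(args.reads))


def dump_main(args):
    # Hypothetical "main method" for the dump subcommand.
    return 'dumping {}'.format(args.infile)


def build_parser():
    parser = argparse.ArgumentParser(prog='tool')
    subparsers = parser.add_subparsers(dest='cmd', metavar='cmd')
    subparsers.required = True  # fail with a usage message if no subcommand

    count = subparsers.add_parser('count', help='count k-mers (illustrative)')
    count.add_argument('--ksize', type=int, default=21)
    count.add_argument('reads', nargs='+')
    count.set_defaults(main=count_main)

    dump = subparsers.add_parser('dump', help='dump records (illustrative)')
    dump.add_argument('infile')
    dump.set_defaults(main=dump_main)
    return parser


def run(arglist):
    """Parse an argument list and dispatch to the chosen subcommand."""
    args = build_parser().parse_args(arglist)
    return args.main(args)


print(run(['count', '--ksize', '31', 'a.fq', 'b.fq']))
print(run(['dump', 'reads.fq']))
```

Because each subcommand's entry point takes a parsed ``args`` object, the same function can be called from the shell wrapper or directly from another Python program.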


Python interface
----------------

As a result of **kevlar**'s design to facilitate internal testing, the "main method" of each **kevlar** subcommand can easily be executed programmatically.
The following example shows how to execute ``kevlar reaugment`` from a standalone Python program.

.. code:: python

    import kevlar

    # Declare arguments just like you would on the command line
    arglist = ['reaugment', '-o', 'new.augfastq', 'old.augfastq', 'new.fastq']
    args = kevlar.cli.parser().parse_args(arglist)
    kevlar.reaugment.main(args)

Other units of code in the **kevlar** package may also be amenable to importing and executing programmatically.
However, the code internals are not under semantic versioning and by necessity will be less stable and have poorer documentation.
Have fun and knock yourself out, but be prepared for changes in internal behavior in subsequent releases!
25 changes: 25 additions & 0 deletions docs/sim.rst
@@ -0,0 +1,25 @@
Simulating variants with **kevlar**
===================================

To facilitate testing, **kevlar** implements a simple command to apply simulated "mutations" to a reference sequence.
We have used this internally to simulate data sets for testing, to verify that **kevlar** can recover the simulated "mutation" or variant.

The command-line interface for ``kevlar mutate`` is very simple (for full details see `the CLI documentation <cli.html#kevlar-mutate>`_).

The "mutation file" format is described here by way of example.

.. code::
seq1 2345915 del 141
seq1 1022305 snv 2
seq1 2062327 inv 429
seq1 1234310 del 32
seq1 388954 ins TGTTTCCTTTCATACCCCACCAC
seq1 2460047 snv 2
The mutation file is a plain-text tabular file with four fields separated by spaces or tabs.

- sequence ID
- variant starting position (0-based)
- variant type (currently supported types: ``snv``, ``ins``, ``del``, and ``inv`` for single-nucleotide variants, insertions, deletions, and inversions)
- value; represents lexicographic offset for SNVs, variant length for deletions and inversions, and inserted sequence for insertions.
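
As a rough illustration of the four-field layout, the sketch below reads a mutation file into simple records.
It is not the parser used by ``kevlar mutate``: the names ``Mutation`` and ``parse_mutations`` are hypothetical, and the only interpretation applied is the one listed above (an integer value for ``snv``, ``del``, and ``inv``; a sequence string for ``ins``).

```python
import io
from collections import namedtuple

# Hypothetical record type; field names are chosen for this sketch only.
Mutation = namedtuple('Mutation', 'seqid position kind value')
SUPPORTED_KINDS = {'snv', 'ins', 'del', 'inv'}


def parse_mutations(handle):
    """Yield one Mutation per non-blank line of a mutation file."""
    for lineno, line in enumerate(handle, 1):
        fields = line.split()  # splits on spaces or tabs
        if not fields:
            continue
        if len(fields) != 4:
            raise ValueError('line {}: expected 4 fields'.format(lineno))
        seqid, position, kind, value = fields
        if kind not in SUPPORTED_KINDS:
            raise ValueError('line {}: unknown type {}'.format(lineno, kind))
        # Insertions carry the inserted sequence; all other types carry an integer.
        if kind != 'ins':
            value = int(value)
        yield Mutation(seqid, int(position), kind, value)


EXAMPLE = """seq1 2345915 del 141
seq1 1022305 snv 2
seq1\t388954\tins\tTGTTTCCTTTCATACCCCACCAC
"""

for mutation in parse_mutations(io.StringIO(EXAMPLE)):
    print(mutation)
```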
