Update change log and docs

kevlar-dev · Mar 13, 2018 · e299def
1 parent 66e8582
commit e299def

Showing 3 changed files with 55 additions and 2 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -7,18 +7,27 @@ This project adheres to [Semantic Versioning](http://semver.org/).
 ### Added
 - New `kevlar gentrio` command for a more realistic simulation of trios for testing and evaluation (#171).
 - New filter for `kevlar alac` for discarding partitions with a small number of interesting k-mers (#189).
+- New `kevlar split` subcommand for splitting a partitioned augfastq file into N chunks (see #206).
+- New `-p/--part-id` flag in `kevlar alac` for processing a single partition in a partitioned augfastq file (see #206).
+- New reader/parser for partitioned augfastx files (see #206).
+- New strategy for discriminating between variants and off-target calls using pairing information (see #210).
+- New "fallback" assembly strategy: if fermi-lite fails, try our homegrown greedy assembly algorithm (see #214).
 
 ### Changed
 - Replaced `pep8` with `pycodestyle` for enforcing code style in development (see #167).
 - The `--refr` argument of the `kevlar dump` command is now optional, and when no reference is explicitly specified `kevlar dump` acts primarily as a BAM to Fastq converter (see #170).
 - Split the functionality of the `count` subcommand: simple single-sample k-mer counting was kept in `count` with a much simplified interface, while the memory efficient multi-sample "masked counting" strategy was split out to a new subcommand `effcount` (see #185).
 - Replaced `kevlar reaugment` with a more generalizable `kevlar augment` subcommand (see #188).
+- Replaced `--ksize` with `--seed-size` in `kevlar localize` so that `kevlar alac` can now support different values for k-mers and localizing seeds/anchors (see #198).
+- Improved variant sorting, scoring, and reporting strategy (see #199).
+- The augmented Fastx format now permits annotation of 1 or more mate sequences (see #210).
 
 ### Fixed
 - Incorrect file names in the quick start documentation page (see 9f6bec06d4).
 - The `kevlar alac` procedure now accepts a stream of read partitions (instead of a stream of reads) at the Python API level, and correctly handles a single partition-labeled sequence file at the CLI level (see #165).
 - CIGARs that begin with I blocks (alternate allele contig is longer than reference locus) are now handled properly (see #191).
 - Bug with how `kevlar alac` handles "no reference match" scenarios resolved (see #192).
+- Bug with `kevlar count` when reading from multiple input files (see #202).
 
 ## [0.3.0] - 2017-11-03
 

diff --git a/README.rst b/README.rst
@@ -25,8 +25,8 @@ How do I use kevlar?
 - Quick start guide: http://kevlar.readthedocs.io/en/latest/quick-start.html
 - Tutorial: http://kevlar.readthedocs.io/en/latest/tutorial.html
 
-**Note**: kevlar is currently focused almost entirely on finding novel germline variants in simplex pedigrees.
-We hope to support a wider range of experimental designs soon.
+**Note**: kevlar is currently focused almost entirely on finding novel germline variants in related individuals.
+We hope to benchmark kevlar on a wider range of experimental designs soon.
 
 Contributing
 ------------

diff --git a/docs/formats.rst b/docs/formats.rst
@@ -5,13 +5,26 @@ Although kevlar performs many operations on *k*-mers, read sequences are the pri
 kevlar supports reading from and writing to Fasta and Fastq files, and treats these identically since it does not use any base call quality information.
 In most cases, kevlar should also be able to automatically detect whether an input file is gzip-compressed or not and handle it accordingly (no bzip2 support).
 
+Broken-paired Fasta / Fastq files
+---------------------------------
+
+While kevlar does not require pairing information for variant discovery, it can be helpful in the final stages of variant calling.
+The bioinformatics community uses two common conventions for encoding pairing information: paired files and interleaved files.
+In paired files, the first record in file1 is paired with the first record in file2, the second record in file1 is paired with the second record in file2, and so on.
+In an interleaved file, the first record is paired with the second record, the third record is paired with the fourth record, and so on.
+
+kevlar can read pairing information only from interleaved files, so use this format if you want to retain and use pairing information.
+kevlar also supports "broken paired" files, where single-end/orphaned reads are occasionally scattered in between paired reads in an interleaved file.
+
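To make the interleaved and "broken paired" conventions concrete, here is a minimal standalone sketch of a reader that pairs adjacent records and reports orphaned reads on their own. It is not kevlar's implementation: the function names are hypothetical, and it assumes mates are named with the common ``/1`` and ``/2`` suffixes.

```python
def read_fastq(lines):
    """Yield (name, sequence, quality) tuples from an iterable of Fastq lines."""
    it = iter(lines)
    for header in it:
        header = header.strip()
        if not header:
            continue  # tolerate blank lines between records
        if not header.startswith('@'):
            raise ValueError('expected a Fastq header, got: ' + header)
        sequence = next(it, '').strip()
        next(it, None)  # skip the '+' separator line
        quality = next(it, '').strip()
        yield header[1:], sequence, quality


def is_pair(rec1, rec2):
    """Assumption: mates share a name and end in /1 and /2 respectively."""
    name1 = rec1[0].split()[0]
    name2 = rec2[0].split()[0]
    return (name1.endswith('/1') and name2.endswith('/2')
            and name1[:-2] == name2[:-2])


def broken_paired_reader(records):
    """Yield (read, mate) tuples; mate is None for single-end/orphaned reads."""
    prev = None
    for rec in records:
        if prev is None:
            prev = rec
        elif is_pair(prev, rec):
            yield prev, rec
            prev = None
        else:
            yield prev, None  # prev had no mate next to it
            prev = rec
    if prev is not None:
        yield prev, None
```

Calling ``broken_paired_reader(read_fastq(open('reads.fq')))`` on an interleaved file (the file name is only an example) yields one tuple per pair or orphan, in file order.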
 Augmented sequences
 -------------------
 
 "Interesting *k*-mers" are putatively novel *k*-mers that are highly abundant in the proband/case sample(s) and effectively absent from control samples.
 To facilitate reading and writing these "interesting *k*-mers" along with the reads to which they belong, kevlar uses an *augmented* version of the Fasta and Fastq formats.
 Here is an example of an augmented Fastq file.
 
+.. highlight:: none
+
 .. code::
 
    @read1
@@ -47,3 +60,34 @@ Augmented Fastq files are easily converted to normal Fastq files by invoking a c
 
 The functions ``kevlar.parse_augmented_fastx`` and ``kevlar.print_augmented_fastx`` are used internally to read and write augmented Fastq/Fasta files.
 However, these functions can easily be imported and called from third-party Python scripts as well.
+
+Mate sequences
+--------------
+
+Although kevlar does not require pairing information, it can use this information to improve variant calling when available.
+The augmented Fastq/Fasta format also allows mate sequences to be associated with each record.
+If a contig assembled from novel reads maps to multiple regions of the reference genome with the same score, this pairing information can be used to predict the most likely true variant.
+
+The ``mateseq`` annotation should be placed after the first 4 lines of the record, as shown in the two records below.
+
+.. code::
+
+    @DraconisOccidentalisRead12/1
+    CAGGTAGTGTGATGCCTCCAGCTTTGTTCTTTTGGCTTAGGATTGACTTGGCAATTCGGGCTCTTTTTTGGTTCCATATGAACTTTAAAGTAGTTTTTTC
+    +
+    8888888888888888888888888888888888888888888888888888888888888888888888888888888888888888888888888888
+                              TTCTTTTGGCTTAGGATTGACTTGGCAATTC          7 0 1#
+                               TCTTTTGGCTTAGGATTGACTTGGCAATTCG          5 0 1#
+                                CTTTTGGCTTAGGATTGACTTGGCAATTCGG          5 0 1#
+                                 TTTTGGCTTAGGATTGACTTGGCAATTCGGG          5 1 1#
+                                  TTTGGCTTAGGATTGACTTGGCAATTCGGGC          5 0 1#
+    #mateseq=CTGATAAGCAACTTCAGCAAAGTCTCAGGATACAAAATCAATGTACGAAAATCACAAGCGTTCTTATACACCAACAACAGACAAACAGAGAGCCAAATCA#
+    @DraconisOccidentalisRead56/1
+    TCTTGAATTCCCATGTGTTGTGGGAGGGACCCATTGGGAGGTAATTGAATCATGGGGGCACGTCTTTCCCATGCTGTTCTCATGATAGAGACTAAGTCTC
+    +
+    8888888888888888888888888888888888888888888888888888888888888888888888888888888888888888888888888888
+         AATTCCCATGTGTTGTGGGAGGGACCCATTG          5 0 1#
+          ATTCCCATGTGTTGTGGGAGGGACCCATTGG          5 0 1#
+    #mateseq=ATTAGAAAAAAAAAGTGCATTCGTAAATGTCATAACAATAAAATTATACTCCAAGACTTTGTACAAGATGAAAGTAATATGAAGAAGGGGCTACAGGAAA#
+           TTCCCATGTGTTGTGGGAGGGACCCATTGGG          5 0 1#
+            TCCCATGTGTTGTGGGAGGGACCCATTGGGA          5 0 1#
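The record layout shown above can be read with a few lines of Python. The following is a rough standalone sketch, not kevlar's own ``kevlar.parse_augmented_fastx`` (whose signature is not shown here). The function name and the dictionary keys are hypothetical, and the sketch assumes every annotation line ends with ``#`` while no header line does.

```python
def parse_augmented_fastq(lines):
    """Yield one dict per record with its k-mer and mate annotations."""
    it = (line.rstrip('\r\n') for line in lines)
    record = None
    for line in it:
        if not line.strip():
            continue
        if line.endswith('#') and record is not None:
            if line.startswith('#mateseq='):
                # Mate sequence: the text between '#mateseq=' and the final '#'
                record['mateseqs'].append(line[len('#mateseq='):-1])
            else:
                # k-mer line: leading spaces give the offset within the read,
                # followed by the k-mer and one abundance per sample
                offset = len(line) - len(line.lstrip())
                kmer, *abundances = line[:-1].split()
                record['ikmers'].append(
                    (offset, kmer, [int(a) for a in abundances])
                )
            continue
        if not line.startswith('@'):
            raise ValueError('expected a Fastq header, got: ' + line)
        if record is not None:
            yield record
        sequence = next(it, '')
        next(it, None)  # skip the '+' separator line
        quality = next(it, '')
        record = {
            'name': line[1:], 'sequence': sequence, 'quality': quality,
            'ikmers': [], 'mateseqs': [],
        }
    if record is not None:
        yield record
```

Because the quality line is consumed together with its header, a quality string that happens to start with ``@`` or end with ``#`` is not mistaken for a header or an annotation. Mate sequences are collected in a list since the format permits more than one per record.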