Documentation updated; requirements for RTD removed

compmetagen · Feb 27, 2017 · 17141e6
1 parent cd71f30
commit 17141e6

Showing 23 changed files with 1,026 additions and 37 deletions.
diff --git a/.gitignore b/.gitignore
@@ -8,4 +8,5 @@ dist/
 MANIFEST
 .#*
 micca.egg-info/
-.vscode
+.vscode
+micca/thirdparty_bin/
diff --git a/CHANGES.rst b/CHANGES.rst
@@ -1,4 +1,4 @@
-CHANGES
+Changes
 =======
 
 Version 1.6.0
@@ -7,7 +7,7 @@ Version 1.6.0
 * Now the mergepairs command allows merging staggered reads by default.
   With the new option ``-n/--nostagger`` the command produces the same 
   results of the previous versions (<=1.5.0);
-* classify and tabletotax commands  now strip the 'D_X__' prefix from the Silva 
+* classify and tabletotax commands  now strip the ``D_X__`` prefix from the Silva 
   taxonomy files;
 * Documentation updated;
 * Fix: remove duplicate file closing in micca.api.merge().

diff --git a/doc/source/commands/classify.rst b/doc/source/commands/classify.rst
@@ -1,4 +1,136 @@
 classify
 ========
 
-.. command-output:: micca classify --help
+.. code-block:: console
+
+    usage: micca classify [-h] -i FILE -o FILE [-m {cons,rdp,otuid}] [-r FILE]
+                        [-x FILE] [--cons-id CONS_ID]
+                        [--cons-maxhits CONS_MAXHITS]
+                        [--cons-minfrac CONS_MINFRAC]
+                        [--cons-mincov CONS_MINCOV] [--cons-strand {both,plus}]
+                        [--cons-threads THREADS]
+                        [--rdp-gene {16srrna,fungallsu,fungalits_warcup,fungalits_unite}]
+                        [--rdp-maxmem GB] [--rdp-minconf RDP_MINCONF]
+
+    micca classify assigns taxonomy for each sequence in the input file
+    and provides three methods for classification:
+
+    * VSEARCH-based consensus classifier (cons): input sequences are
+    searched in the reference database with VSEARCH
+    (https://github.com/torognes/vsearch). For each query sequence the
+    method retrieves up to 'cons-maxhits' hits (i.e. identity >=
+    'cons-id'). Then, the most specific taxonomic label that is
+    associated with at least 'cons-minfrac' of the hits is
+    assigned. The method is similar to the UCLUST-based consensus
+    taxonomy assigner presented in doi: 10.7287/peerj.preprints.934v2
+    and available in QIIME.
+
+    * RDP classifier (rdp): only RDP classifier version >= 2.8 is
+    supported (doi:10.1128/AEM.00062-07). In order to use this
+    classifier RDP must be installed (download at
+    http://sourceforge.net/projects/rdp-classifier/files/rdp-classifier/)
+    and the RDPPATH environment variable set. The available
+    databases (--rdp-gene) are:
+
+    - 16S (16srrna)
+    - Fungal LSU (28S) (fungallsu)
+    - Warcup ITS (fungalits_warcup, doi: 10.3852/14-293)
+    - UNITE ITS (fungalits_unite)
+
+    For more information about the RDP classifier go to
+    http://rdp.cme.msu.edu/classifier/classifier.jsp
+
+    * OTU ID classifier (otuid): simply perform a sequence ID matching
+    with the reference taxonomy file. Recommended strategy when the
+    closed reference clustering (--method closedref in micca-otu) was
+    performed. OTU ID classifier requires a tab-delimited file where
+    the first column contains the current OTU ids and the second column
+    the reference taxonomy ids (see otuids.txt in micca-otu), e.g.:
+
+    REF1[TAB]1110191
+    REF2[TAB]1104777
+    REF3[TAB]1078527
+    ...
+
+    The input reference taxonomy file (--ref-tax) should be a
+    tab-delimited file where rows are either in the form:
+
+    1. SEQID[TAB]k__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridiales;f__;g__;
+    2. SEQID[TAB]Bacteria;Firmicutes;Clostridia;Clostridiales;;;
+    3. SEQID[TAB]Bacteria;Firmicutes;Clostridia;Clostridiales
+    4. SEQID[TAB]D_0__Bacteria;D_1__Firmicutes;D_2__Clostridia;D_3__Clostridiales;D_4__;D_5__;
+
+    Compatible reference databases are Greengenes
+    (http://greengenes.secondgenome.com/downloads), QIIME-formatted SILVA
+    (https://www.arb-silva.de/download/archive/qiime/) and UNITE
+    (https://unite.ut.ee/repository.php).
+
+    The output file is a tab-delimited file where each row is in the
+    format:
+
+    SEQID[TAB]Bacteria;Firmicutes;Clostridia;Clostridiales
+
+    optional arguments:
+    -h, --help            show this help message and exit
+
+    arguments:
+    -i FILE, --input FILE
+                            input FASTA file (for 'cons' and 'rdp') or a tab-
+                            delimited OTU ids file (for 'otuid') (required).
+    -o FILE, --output FILE
+                            output taxonomy file (required).
+    -m {cons,rdp,otuid}, --method {cons,rdp,otuid}
+                            classification method (default cons)
+    -r FILE, --ref FILE   reference sequences in FASTA format, required for
+                            'cons' classifier.
+    -x FILE, --ref-tax FILE
+                            tab-separated reference taxonomy file, required for
+                            'cons' and 'otuid' classifiers.
+
+    VSEARCH-based consensus classifier specific options:
+    --cons-id CONS_ID     sequence identity threshold (0.0 to 1.0, default 0.9).
+    --cons-maxhits CONS_MAXHITS
+                            number of hits to consider (>=1, default 3).
+    --cons-minfrac CONS_MINFRAC
+                            for each taxonomic rank, a specific taxon will be
+                            assigned if it is present in at least MINFRAC of the
+                            hits (0.0 to 1.0, default 0.5).
+    --cons-mincov CONS_MINCOV
+                            reject sequence if the fraction of alignment to the
+                            reference sequence is lower than MINCOV. This
+                            parameter prevents low-coverage alignments at the end
+                            of the sequences (default 0.75).
+    --cons-strand {both,plus}
+                            search both strands or the plus strand only (default
+                            both).
+    --cons-threads THREADS
+                            number of threads to use (1 to 256, default 1).
+
+    RDP Classifier/Database specific options:
+    --rdp-gene {16srrna,fungallsu,fungalits_warcup,fungalits_unite}
+                            marker gene/region
+    --rdp-maxmem GB       maximum memory size for the java virtual machine in GB
+                            (default 2)
+    --rdp-minconf RDP_MINCONF
+                            minimum confidence value to assign taxonomy to a
+                            sequence (default 0.8)
+
+    Examples
+
+    Classification of 16S sequences using the consensus classifier and
+    Greengenes:
+
+        micca classify -m cons -i input.fasta -o tax.txt \
+        --ref greengenes_2013_05/rep_set/97_otus.fasta \
+        --ref-tax greengenes_2013_05/taxonomy/97_otu_taxonomy.txt
+
+    Classification of ITS sequences using the RDP classifier and the
+    UNITE database:
+
+        micca classify -m rdp --rdp-gene fungalits_unite -i input.fasta \
+        -o tax.txt
+
+    OTU ID matching after the closed reference OTU picking protocol:
+
+        micca classify -m otuid -i otuids.txt -o tax.txt \
+        --ref-tax greengenes_2013_05/taxonomy/97_otu_taxonomy.txt
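The consensus rule described in the classify help text (assign the most specific label shared by at least 'cons-minfrac' of the hits) can be sketched in Python. This is only an illustration under stated assumptions, not micca's actual implementation: the function name `consensus_taxonomy` and the list-of-lists input are hypothetical, empty rank labels are treated as unassigned, and ties are broken by first occurrence.

```python
from collections import Counter


def consensus_taxonomy(hit_taxonomies, minfrac=0.5):
    """Return the most specific lineage supported by >= minfrac of the hits.

    hit_taxonomies: one list of rank labels per hit, e.g.
    ["Bacteria", "Firmicutes", "Clostridia"] (hypothetical input layout).
    """
    n_hits = len(hit_taxonomies)
    if n_hits == 0:
        return []

    consensus = []
    candidates = list(hit_taxonomies)
    depth = max(len(tax) for tax in hit_taxonomies)

    for rank in range(depth):
        # Collect non-empty labels at this rank from hits that still
        # agree with the consensus built so far.
        labels = [tax[rank] for tax in candidates if len(tax) > rank and tax[rank]]
        if not labels:
            break
        label, count = Counter(labels).most_common(1)[0]
        # The fraction is computed over all hits, not only the candidates.
        if count / n_hits < minfrac:
            break
        consensus.append(label)
        candidates = [
            tax for tax in candidates if len(tax) > rank and tax[rank] == label
        ]
    return consensus


hits = [
    ["Bacteria", "Firmicutes", "Clostridia"],
    ["Bacteria", "Firmicutes", "Bacilli"],
    ["Bacteria", "Proteobacteria", "Gammaproteobacteria"],
]
print(";".join(consensus_taxonomy(hits, minfrac=0.5)))
```

With three hits and `minfrac=0.5`, the phylum is kept because two of three hits agree, while the class is dropped because no class label reaches half of the hits.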
diff --git a/doc/source/commands/convert.rst b/doc/source/commands/convert.rst
@@ -1,4 +1,49 @@
 convert
 =======
 
-.. command-output:: micca convert --help
+.. code-block:: console
+
+    usage: micca convert [-h] -i FILE -o FILE [-q FILE] [-d DEFAULTQ]
+                        [-f INPUT_FORMAT] [-F OUTPUT_FORMAT]
+
+    micca convert converts between sequence file formats. See
+    http://biopython.org/wiki/SeqIO#File_Formats for a comprehensive list
+    of the supported file formats.
+
+    Supported input formats:
+    abi, abi-trim, ace, embl, embl-cds, fasta, fasta-qual, fastq, fastq-illumina, 
+    fastq-sanger, fastq-solexa, gb, genbank, genbank-cds, ig, imgt, pdb-atom, 
+    pdb-seqres, phd, pir, qual, seqxml, sff, sff-trim, swiss, tab, uniprot-xml
+
+    Supported output formats:
+    embl, fasta, fastq, fastq-illumina, fastq-sanger, fastq-solexa, gb, genbank,
+    imgt, phd, qual, seqxml, sff, tab
+
+    optional arguments:
+    -h, --help            show this help message and exit
+
+    arguments:
+    -i FILE, --input FILE
+                            input sequence file (required).
+    -o FILE, --output FILE
+                            output sequence file (required).
+    -q FILE, --qual FILE  input quality file (required for 'fasta-qual' input
+                            format).
+    -d DEFAULTQ, --defaultq DEFAULTQ
+                            default phred quality score for format-without-quality
+                            to format-with-quality conversion (default 40).
+    -f INPUT_FORMAT, --input-format INPUT_FORMAT
+                            input file format (default fastq).
+    -F OUTPUT_FORMAT, --output-format OUTPUT_FORMAT
+                            output file format (default fasta).
+
+    Examples
+
+    Convert FASTA+QUAL files into a FASTQ (Sanger/Illumina 1.8+) file:
+
+        micca convert -i input.fasta -q input.qual -o output.fastq \
+        -f fasta-qual -F fastq
+
+    Convert a SFF file into a FASTQ (Sanger/Illumina 1.8+) file:
+
+        micca convert -i input.sff -o output.fastq -f sff -F fastq
diff --git a/doc/source/commands/filter.rst b/doc/source/commands/filter.rst
@@ -1,4 +1,73 @@
 filter
 ======
 
-.. command-output:: micca filter --help
+.. code-block:: console
+
+    usage: micca filter [-h] -i FILE -o FILE [-e MAXEERATE] [-m MINLEN] [-t]
+                        [-n MAXNS] [-f {fastq,fasta}]
+
+    micca filter filters sequences according to the maximum allowed
+    expected error (EE) rate %. Optionally, you can:
+
+    * discard sequences that are shorter than the specified length
+    (suggested for Illumina overlapping paired-end (already merged)
+    reads) (option --minlen MINLEN);
+
+    * discard sequences that are shorter than the specified length AND
+    truncate sequences that are longer (suggested for Illumina and 454
+    unpaired reads) (options --minlen MINLEN --trunc);
+
+    * discard sequences that contain more than a specified number of Ns
+    (--maxns).
+
+    Sequences are first shortened and then filtered. Overlapping paired
+    reads should be merged first (using micca-mergepairs) and then
+    filtered.
+
+    The expected error (EE) rate % in a sequence of length L is defined
+    as (doi: 10.1093/bioinformatics/btv401):
+
+                    sum(error probabilities)
+        EE rate % = ------------------------ * 100
+                                L
+
+    Before filtering, run 'micca filterstats' to see how many reads will
+    pass the filter at different minimum lengths with or without
+    truncation, given a maximum allowed expected error rate % and maximum
+    allowed number of Ns.
+
+    micca-filter is based on VSEARCH (https://github.com/torognes/vsearch).
+
+    optional arguments:
+    -h, --help            show this help message and exit
+
+    arguments:
+    -i FILE, --input FILE
+                            input FASTQ file, Sanger/Illumina 1.8+ format
+                            (phred+33) (required).
+    -o FILE, --output FILE
+                            output FASTA/FASTQ file (required).
+    -e MAXEERATE, --maxeerate MAXEERATE
+                            discard sequences with more than the specified expected
+                            error rate % (values <=1%, i.e. less than or equal to
+                            one error per 100 bases, are highly recommended).
+                            Sequences are discarded after truncation (if enabled)
+                            (default 1).
+    -m MINLEN, --minlen MINLEN
+                            discard sequences that are shorter than MINLEN
+                            (default 1).
+    -t, --trunc           truncate sequences that are longer than MINLEN
+                            (disabled by default).
+    -n MAXNS, --maxns MAXNS
+                            discard sequences with more than the specified number
+                            of Ns. Sequences are discarded after truncation
+                            (disabled by default).
+    -f {fastq,fasta}, --output-format {fastq,fasta}
+                            file format (default fasta).
+
+    Examples
+
+    Truncate reads at 300 bp, discard low quality sequences
+    (with EE rate > 0.5%) and write a FASTA file:
+
+        micca filter -i reads.fastq -o filtered.fasta -m 300 -t -e 0.5
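The EE rate formula in the filter help text can be worked through with a short Python sketch. It is an illustration only, not the code used by micca or VSEARCH: the function name `expected_error_rate` is hypothetical, and it assumes phred+33 quality strings (as the help text requires for input) with the standard conversion p = 10^(-Q/10).

```python
def expected_error_rate(quality_string, phred_offset=33):
    """Return the expected error (EE) rate % of one read.

    quality_string: the FASTQ quality line, Sanger/Illumina 1.8+ encoding.
    """
    if not quality_string:
        raise ValueError("empty quality string")
    # Convert each quality character to its error probability.
    error_probs = [
        10 ** (-(ord(char) - phred_offset) / 10) for char in quality_string
    ]
    # EE rate % = sum(error probabilities) / L * 100
    return sum(error_probs) / len(quality_string) * 100


# 'I' encodes Q40 (p = 0.0001) and '5' encodes Q20 (p = 0.01) in phred+33.
print(expected_error_rate("IIII"))
print(expected_error_rate("5555"))
```

A read made entirely of Q20 bases has an EE rate of about 1%, which is the upper bound the help text recommends for `-e/--maxeerate`.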
diff --git a/doc/source/commands/filterstats.rst b/doc/source/commands/filterstats.rst
@@ -1,4 +1,54 @@
 filterstats
 ===========
 
-.. command-output:: micca filterstats --help
+.. code-block:: console
+
+    usage: micca filterstats [-h] -i FILE [-o DIR] [-t TOPN]
+                            [-e MAXEERATES [MAXEERATES ...]] [-n MAXNS]
+
+    micca filterstats reports the fraction of reads that would pass for each
+    specified maximum expected error (EE) rate % and the maximum number of
+    allowed Ns after:
+
+    * discarding sequences that are shorter than the specified length
+    (suggested for Illumina overlapping paired-end (already merged)
+    reads);
+
+    * discarding sequences that are shorter than the specified length AND
+    truncating sequences that are longer (suggested for Illumina and 454
+    unpaired reads);
+
+    Parameters for the 'micca filter' command should be chosen for each
+    sequencing run using this tool.
+
+    micca filterstats writes 3 files to the output directory:
+
+    * filterstats_minlen.txt: fraction of reads that would pass the filter after
+    the minimum length filtering;
+    * filterstats_trunclen.txt: fraction of reads that would pass the filter after
+    the minimum length filtering + truncation;
+    * filterstats_plot.png: plot in PNG format.
+
+    optional arguments:
+    -h, --help            show this help message and exit
+
+    arguments:
+    -i FILE, --input FILE
+                            input FASTQ file, Sanger/Illumina 1.8+ format
+                            (phred+33) (required).
+    -o DIR, --output DIR  output directory (default .).
+    -t TOPN, --topn TOPN  perform statistics on the first TOPN sequences
+                            (disabled by default)
+    -e MAXEERATES [MAXEERATES ...], --maxeerates MAXEERATES [MAXEERATES ...]
+                            max expected error rates (%). (default [0.25, 0.5,
+                            0.75, 1, 1.25, 1.5])
+    -n MAXNS, --maxns MAXNS
+                            max number of Ns. (disabled by default).
+
+    Examples
+
+    Compute filter statistics on the top 10000 sequences, predicting
+    the fraction of reads that would pass for each maximum EE
+    rate (default values):
+
+        micca filterstats -i input.fastq -o stats -t 10000
diff --git a/doc/source/commands/merge.rst b/doc/source/commands/merge.rst
@@ -1,4 +1,36 @@
 merge
 =====
 
-.. command-output:: micca merge --help
+.. code-block:: console
+
+    usage: micca merge [-h] -i FILE [FILE ...] -o FILE [-s SEP] [-f {fastq,fasta}]
+
+    micca merge merges several FASTQ or FASTA files in a single file.
+    Different samples will be merged in a single file and sample names
+    will be appended to the sequence identifier
+    (e.g. >SEQID;sample=SAMPLENAME). Sample names are defined as the
+    leftmost part of the file name split at the first occurrence of '.'
+    (-s/--sep option). Whitespace characters in names will be replaced
+    with a single character underscore ('_').
+
+    optional arguments:
+    -h, --help            show this help message and exit
+
+    arguments:
+    -i FILE [FILE ...], --input FILE [FILE ...]
+                            input FASTQ/FASTA file(s) (required).
+    -o FILE, --output FILE
+                            output FASTQ/FASTA file (required).
+    -s SEP, --sep SEP     Sample names are defined as the leftmost part of the
+                            file name split at the first occurrence of 'SEP'
+                            (default .)
+    -f {fastq,fasta}, --format {fastq,fasta}
+                            file format (default fastq).
+
+    Examples
+
+    Merge files in FASTA format:
+
+        micca merge -i in1.fasta in2.fasta in3.fasta -o merged.fasta \
+        -f fasta
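The sample naming rule in the merge help text can be sketched in Python as follows. This is a rough illustration, not micca's code: the helper names `sample_name` and `tag_identifier` are hypothetical, and it assumes that the directory part of the path is ignored and that each whitespace character is replaced individually.

```python
import os
import re


def sample_name(path, sep="."):
    """Derive a sample name from a file name, as described for -s/--sep."""
    base = os.path.basename(path)      # assumption: directories are ignored
    name = base.split(sep, 1)[0]       # leftmost part before the first SEP
    return re.sub(r"\s", "_", name)    # whitespace -> underscore


def tag_identifier(seq_id, sample):
    """Append the sample name to a sequence identifier."""
    return "{};sample={}".format(seq_id, sample)


name = sample_name("reads/sample A.trimmed.fastq")
print(tag_identifier("SEQID", name))
```

Passing a different `sep` (for example `"_"`) mirrors the effect of the `-s/--sep` option on file names such as `S1_L001.fastq`.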
+