Documentation updated; requirements for RTD removed
Davide Albanese committed Feb 27, 2017
1 parent cd71f30 commit 17141e6
Showing 23 changed files with 1,026 additions and 37 deletions.
3 changes: 2 additions & 1 deletion .gitignore
@@ -8,4 +8,5 @@ dist/
MANIFEST
.#*
micca.egg-info/
.vscode
.vscode
micca/thirdparty_bin/
4 changes: 2 additions & 2 deletions CHANGES.rst
@@ -1,4 +1,4 @@
CHANGES
Changes
=======

Version 1.6.0
@@ -7,7 +7,7 @@ Version 1.6.0
* Now the mergepairs command allows merging staggered reads by default.
With the new option ``-n/--nostagger`` the command produces the same
results as previous versions (<=1.5.0);
* classify and tabletotax commands now strip the 'D_X__' prefix from the Silva
* classify and tabletotax commands now strip the ``D_X__`` prefix from the Silva
taxonomy files;
* Documentation updated;
* Fix: remove duplicate file closing in micca.api.merge().
134 changes: 133 additions & 1 deletion doc/source/commands/classify.rst
@@ -1,4 +1,136 @@
classify
========

.. command-output:: micca classify --help
.. code-block:: console
usage: micca classify [-h] -i FILE -o FILE [-m {cons,rdp,otuid}] [-r FILE]
[-x FILE] [--cons-id CONS_ID]
[--cons-maxhits CONS_MAXHITS]
[--cons-minfrac CONS_MINFRAC]
[--cons-mincov CONS_MINCOV] [--cons-strand {both,plus}]
[--cons-threads THREADS]
[--rdp-gene {16srrna,fungallsu,fungalits_warcup,fungalits_unite}]
[--rdp-maxmem GB] [--rdp-minconf RDP_MINCONF]
micca classify assigns taxonomy for each sequence in the input file
and provides three methods for classification:
* VSEARCH-based consensus classifier (cons): input sequences are
searched in the reference database with VSEARCH
(https://github.com/torognes/vsearch). For each query sequence the
method retrieves up to 'cons-maxhits' hits (i.e. identity >=
'cons-id'). Then, the most specific taxonomic label that is
associated with at least 'cons-minfrac' of the hits is
assigned. The method is similar to the UCLUST-based consensus
taxonomy assigner presented in doi: 10.7287/peerj.preprints.934v2
and available in QIIME.
* RDP classifier (rdp): only RDP classifier version >= 2.8 is
supported (doi:10.1128/AEM.00062-07). In order to use this
classifier RDP must be installed (download at
http://sourceforge.net/projects/rdp-classifier/files/rdp-classifier/)
and the RDPPATH environment variable set. The available
databases (--rdp-gene) are:
- 16S (16srrna)
- Fungal LSU (28S) (fungallsu)
- Warcup ITS (fungalits_warcup, doi: 10.3852/14-293)
- UNITE ITS (fungalits_unite)
For more information about the RDP classifier go to
http://rdp.cme.msu.edu/classifier/classifier.jsp
* OTU ID classifier (otuid): simply perform a sequence ID matching
with the reference taxonomy file. Recommended strategy when the
closed reference clustering (--method closedref in micca-otu) was
performed. OTU ID classifier requires a tab-delimited file where
the first column contains the current OTU ids and the second column
the reference taxonomy ids (see otuids.txt in micca-otu), e.g.:
REF1[TAB]1110191
REF2[TAB]1104777
REF3[TAB]1078527
...
The input reference taxonomy file (--ref-tax) should be a
tab-delimited file where each row is in one of the following forms:
1. SEQID[TAB]k__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridiales;f__;g__;
2. SEQID[TAB]Bacteria;Firmicutes;Clostridia;Clostridiales;;;
3. SEQID[TAB]Bacteria;Firmicutes;Clostridia;Clostridiales
4. SEQID[TAB]D_0__Bacteria;D_1__Firmicutes;D_2__Clostridia;D_3__Clostridiales;D_4__;D_5__;
Compatible reference databases are Greengenes
(http://greengenes.secondgenome.com/downloads), QIIME-formatted SILVA
(https://www.arb-silva.de/download/archive/qiime/) and UNITE
(https://unite.ut.ee/repository.php).
The output file is a tab-delimited file where each row is in the
format:
SEQID[TAB]Bacteria;Firmicutes;Clostridia;Clostridiales
optional arguments:
-h, --help show this help message and exit
arguments:
-i FILE, --input FILE
input FASTA file (for 'cons' and 'rdp') or a tab-
delimited OTU ids file (for 'otuid') (required).
-o FILE, --output FILE
output taxonomy file (required).
-m {cons,rdp,otuid}, --method {cons,rdp,otuid}
classification method (default cons)
-r FILE, --ref FILE reference sequences in FASTA format, required for
'cons' classifier.
-x FILE, --ref-tax FILE
tab-separated reference taxonomy file, required for
'cons' and 'otuid' classifiers.
VSEARCH-based consensus classifier specific options:
--cons-id CONS_ID sequence identity threshold (0.0 to 1.0, default 0.9).
--cons-maxhits CONS_MAXHITS
number of hits to consider (>=1, default 3).
--cons-minfrac CONS_MINFRAC
for each taxonomic rank, a specific taxon will be
assigned if it is present in at least MINFRAC of the
hits (0.0 to 1.0, default 0.5).
--cons-mincov CONS_MINCOV
reject sequence if the fraction of alignment to the
reference sequence is lower than MINCOV. This
parameter prevents low-coverage alignments at the end
of the sequences (default 0.75).
--cons-strand {both,plus}
search both strands or the plus strand only (default
both).
--cons-threads THREADS
number of threads to use (1 to 256, default 1).
RDP Classifier/Database specific options:
--rdp-gene {16srrna,fungallsu,fungalits_warcup,fungalits_unite}
marker gene/region
--rdp-maxmem GB maximum memory size for the java virtual machine in GB
(default 2)
--rdp-minconf RDP_MINCONF
minimum confidence value to assign taxonomy to a
sequence (default 0.8)
Examples
Classification of 16S sequences using the consensus classifier and
Greengenes:
micca classify -m cons -i input.fasta -o tax.txt \
--ref greengenes_2013_05/rep_set/97_otus.fasta \
--ref-tax greengenes_2013_05/taxonomy/97_otu_taxonomy.txt
Classification of ITS sequences using the RDP classifier and the
UNITE database:
micca classify -m rdp --rdp-gene fungalits_unite -i input.fasta \
-o tax.txt
OTU ID matching after the closed reference OTU picking protocol:
micca classify -m otuid -i otuids.txt -o tax.txt \
--ref-tax greengenes_2013_05/taxonomy/97_otu_taxonomy.txt
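The consensus rule and the reference taxonomy formats described in the help text above can be illustrated with a short Python sketch. This is not micca's implementation: ``parse_lineage`` and ``consensus_lineage`` are hypothetical names, the VSEARCH search step is assumed to have already produced the lineages of the hits, and the tie-breaking and empty-rank handling are assumptions.

```python
import re
from collections import Counter

# Matches Greengenes-style ("k__") and SILVA-style ("D_0__") rank prefixes.
_RANK_PREFIX = re.compile(r"^(?:[a-z]__|D_\d+__)")


def parse_lineage(field):
    """Turn one taxonomy field (the text after SEQID[TAB]) into a list of names."""
    names = [_RANK_PREFIX.sub("", name.strip()) for name in field.split(";")]
    # Drop trailing empty ranks such as "f__;g__;".
    while names and not names[-1]:
        names.pop()
    return names


def consensus_lineage(hit_lineages, minfrac=0.5):
    """Return the deepest lineage supported by at least `minfrac` of the hits."""
    n_hits = len(hit_lineages)
    consensus = []
    if n_hits == 0:
        return consensus
    max_depth = max(len(lineage) for lineage in hit_lineages)
    for depth in range(1, max_depth + 1):
        # Only hits that have a label at this rank and agree with the
        # consensus found so far are allowed to vote.
        votes = Counter(
            lineage[depth - 1]
            for lineage in hit_lineages
            if len(lineage) >= depth
            and lineage[depth - 1]
            and lineage[:depth - 1] == consensus
        )
        if not votes:
            break
        name, count = votes.most_common(1)[0]
        # The fraction is computed over all hits, not only the voters.
        if count / n_hits < minfrac:
            break
        consensus.append(name)
    return consensus


hits = [
    parse_lineage("k__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridiales;f__;g__;"),
    parse_lineage("D_0__Bacteria;D_1__Firmicutes;D_2__Bacilli"),
    parse_lineage("Bacteria;Firmicutes;Clostridia"),
]
print(";".join(consensus_lineage(hits, minfrac=0.5)))
```

With three hits and ``minfrac=0.5``, the class-level label is kept because two of the three hits agree on it, while the order-level label is supported by only one hit and is dropped.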
47 changes: 46 additions & 1 deletion doc/source/commands/convert.rst
@@ -1,4 +1,49 @@
convert
=======

.. command-output:: micca convert --help
.. code-block:: console
usage: micca convert [-h] -i FILE -o FILE [-q FILE] [-d DEFAULTQ]
[-f INPUT_FORMAT] [-F OUTPUT_FORMAT]
micca convert converts between sequence file formats. See
http://biopython.org/wiki/SeqIO#File_Formats for a comprehensive list
of the supported file formats.
Supported input formats:
abi, abi-trim, ace, embl, embl-cds, fasta, fasta-qual, fastq, fastq-illumina,
fastq-sanger, fastq-solexa, gb, genbank, genbank-cds, ig, imgt, pdb-atom,
pdb-seqres, phd, pir, qual, seqxml, sff, sff-trim, swiss, tab, uniprot-xml
Supported output formats:
embl, fasta, fastq, fastq-illumina, fastq-sanger, fastq-solexa, gb, genbank,
imgt, phd, qual, seqxml, sff, tab
optional arguments:
-h, --help show this help message and exit
arguments:
-i FILE, --input FILE
input sequence file (required).
-o FILE, --output FILE
output sequence file (required).
-q FILE, --qual FILE input quality file (required for 'fasta-qual' input
format).
-d DEFAULTQ, --defaultq DEFAULTQ
default phred quality score for format-without-quality
to format-with-quality conversion (default 40).
-f INPUT_FORMAT, --input-format INPUT_FORMAT
input file format (default fastq).
-F OUTPUT_FORMAT, --output-format OUTPUT_FORMAT
output file format (default fasta).
Examples
Convert FASTA+QUAL files into a FASTQ (Sanger/Illumina 1.8+) file:
micca convert -i input.fasta -q input.qual -o output.fastq \
-f fasta-qual -F fastq
Convert a SFF file into a FASTQ (Sanger/Illumina 1.8+) file:
micca convert -i input.sff -o output.fastq -f sff -F fastq
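The ``-d/--defaultq`` option applies when a format without qualities is converted to a format with qualities. The following Python sketch shows what that implies for a FASTA to FASTQ (phred+33) conversion. It is a standard-library illustration only; ``fasta_to_fastq`` is a hypothetical helper, not the code path micca uses.

```python
def fasta_to_fastq(fasta_text, defaultq=40):
    """Convert FASTA text to FASTQ text, giving every base the quality `defaultq`."""
    records = []
    header, chunks = None, []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))

    # phred+33 encoding: quality Q is stored as the character chr(Q + 33).
    qual_char = chr(defaultq + 33)
    return "".join(
        "@{}\n{}\n+\n{}\n".format(name, seq, qual_char * len(seq))
        for name, seq in records
    )


print(fasta_to_fastq(">read1\nACGT\n"), end="")
```

With the default quality of 40, every base is written as the character ``I`` in the quality line.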
71 changes: 70 additions & 1 deletion doc/source/commands/filter.rst
@@ -1,4 +1,73 @@
filter
======

.. command-output:: micca filter --help
.. code-block:: console
usage: micca filter [-h] -i FILE -o FILE [-e MAXEERATE] [-m MINLEN] [-t]
[-n MAXNS] [-f {fastq,fasta}]
micca filter filters sequences according to the maximum allowed
expected error (EE) rate %. Optionally, you can:
* discard sequences that are shorter than the specified length
(suggested for Illumina overlapping paired-end (already merged)
reads) (option --minlen MINLEN);
* discard sequences that are shorter than the specified length AND
truncate sequences that are longer (suggested for Illumina and 454
unpaired reads) (options --minlen MINLEN --trunc);
* discard sequences that contain more than a specified number of Ns
(--maxns).
Sequences are first shortened and then filtered. Overlapping paired
reads should be merged first (using micca-mergepairs) and then
filtered.
The expected error (EE) rate % in a sequence of length L is defined
as (doi: 10.1093/bioinformatics/btv401):
EE rate % = sum(error probabilities) / L * 100
Before filtering, run 'micca filterstats' to see how many reads will
pass the filter at different minimum lengths with or without
truncation, given a maximum allowed expected error rate % and maximum
allowed number of Ns.
micca-filter is based on VSEARCH (https://github.com/torognes/vsearch).
optional arguments:
-h, --help show this help message and exit
arguments:
-i FILE, --input FILE
input FASTQ file, Sanger/Illumina 1.8+ format
(phred+33) (required).
-o FILE, --output FILE
output FASTA/FASTQ file (required).
-e MAXEERATE, --maxeerate MAXEERATE
discard sequences with more than the specified expected
error rate % (values <=1%, i.e. less than or equal to one
error per 100 bases, are highly recommended).
Sequences are discarded after truncation (if enabled)
(default 1).
-m MINLEN, --minlen MINLEN
discard sequences that are shorter than MINLEN
(default 1).
-t, --trunc truncate sequences that are longer than MINLEN
(disabled by default).
-n MAXNS, --maxns MAXNS
discard sequences with more than the specified number
of Ns. Sequences are discarded after truncation
(disabled by default).
-f {fastq,fasta}, --output-format {fastq,fasta}
file format (default fasta).
Examples
Truncate reads at 300 bp, discard low quality sequences
(with EE rate > 0.5%) and write a FASTA file:
micca filter -i reads.fastq -o filtered.fasta -m 300 -t -e 0.5
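The EE rate definition and the order of operations described above (shorten first, then filter) can be sketched in Python. This is an illustration under stated assumptions, not micca's or VSEARCH's code: ``ee_rate_percent`` and ``filter_read`` are hypothetical names, qualities are assumed to be phred+33 encoded, and each error probability is taken as 10^(-Q/10).

```python
def ee_rate_percent(quality):
    """Expected error rate (%) of a phred+33 quality string."""
    # Error probability of a base with phred score Q is 10 ** (-Q / 10).
    probs = [10 ** (-(ord(char) - 33) / 10) for char in quality]
    return sum(probs) / len(quality) * 100


def filter_read(seq, quality, maxeerate=1.0, minlen=1, trunc=False, maxns=None):
    """Return the (possibly truncated) read if it passes, otherwise None."""
    # Reads shorter than minlen are always discarded.
    if len(seq) < minlen:
        return None
    # Truncation happens before the quality checks.
    if trunc:
        seq, quality = seq[:minlen], quality[:minlen]
    if maxns is not None and seq.upper().count("N") > maxns:
        return None
    if ee_rate_percent(quality) > maxeerate:
        return None
    return seq, quality


# Four good bases (Q40) followed by four poor ones (Q10).
print(filter_read("ACGTACGT", "IIII++++", maxeerate=1.0, minlen=4, trunc=True))
print(filter_read("ACGTACGT", "IIII++++", maxeerate=1.0, minlen=4, trunc=False))
```

In this example the read passes only when truncation removes its low-quality tail, which is why truncating before filtering matters for unpaired reads.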
52 changes: 51 additions & 1 deletion doc/source/commands/filterstats.rst
@@ -1,4 +1,54 @@
filterstats
===========

.. command-output:: micca filterstats --help
.. code-block:: console
usage: micca filterstats [-h] -i FILE [-o DIR] [-t TOPN]
[-e MAXEERATES [MAXEERATES ...]] [-n MAXNS]
micca filterstats reports the fraction of reads that would pass for each
specified maximum expected error (EE) rate % and the maximum number of
allowed Ns after:
* discarding sequences that are shorter than the specified length
(suggested for Illumina overlapping paired-end (already merged)
reads);
* discarding sequences that are shorter than the specified length AND
truncating sequences that are longer (suggested for Illumina and 454
unpaired reads).
Parameters for the 'micca filter' command should be chosen for each
sequencing run using this tool.
micca filterstats writes 3 files to the output directory:
* filterstats_minlen.txt: fraction of reads that would pass the filter after
the minimum length filtering;
* filterstats_trunclen.txt: fraction of reads that would pass the filter after
the minimum length filtering + truncation;
* filterstats_plot.png: plot in PNG format.
optional arguments:
-h, --help show this help message and exit
arguments:
-i FILE, --input FILE
input FASTQ file, Sanger/Illumina 1.8+ format
(phred+33) (required).
-o DIR, --output DIR output directory (default .).
-t TOPN, --topn TOPN perform statistics on the first TOPN sequences
(disabled by default)
-e MAXEERATES [MAXEERATES ...], --maxeerates MAXEERATES [MAXEERATES ...]
max expected error rates (%). (default [0.25, 0.5,
0.75, 1, 1.25, 1.5])
-n MAXNS, --maxns MAXNS
max number of Ns. (disabled by default).
Examples
Compute filter statistics on the top 10000 sequences, predicting
the fraction of reads that would pass for each maximum EE
rate (default values):
micca filterstats -i input.fastq -o stats -t 10000
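The kind of table this command reports can be approximated with a few lines of Python. The sketch below is only an assumption about the calculation, not the code behind ``filterstats_minlen.txt`` or ``filterstats_trunclen.txt``: ``pass_fractions`` is a hypothetical helper that takes phred+33 quality strings, a minimum length of at least 1, and the default EE rate thresholds listed in the help text.

```python
def pass_fractions(qualities, minlen, maxeerates=(0.25, 0.5, 0.75, 1, 1.25, 1.5),
                   trunc=False):
    """Fraction of reads passing each max EE rate (%) at a given minimum length."""

    def ee_rate_percent(quality):
        # EE rate % = sum(error probabilities) / L * 100, phred+33 encoding.
        return sum(10 ** (-(ord(c) - 33) / 10) for c in quality) / len(quality) * 100

    # Length filter first, with optional truncation to exactly minlen bases.
    kept = [q[:minlen] if trunc else q for q in qualities if len(q) >= minlen]
    rates = [ee_rate_percent(q) for q in kept]
    total = len(qualities)
    return {
        maxee: (sum(rate <= maxee for rate in rates) / total if total else 0.0)
        for maxee in maxeerates
    }


quals = ["IIII", "II++", "I"]
print(pass_fractions(quals, minlen=2, trunc=False))
print(pass_fractions(quals, minlen=2, trunc=True))
```

Comparing the two calls shows how truncation can rescue reads whose errors are concentrated at the 3' end.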
34 changes: 33 additions & 1 deletion doc/source/commands/merge.rst
@@ -1,4 +1,36 @@
merge
=====

.. command-output:: micca merge --help
.. code-block:: console
usage: micca merge [-h] -i FILE [FILE ...] -o FILE [-s SEP] [-f {fastq,fasta}]
micca merge merges several FASTQ or FASTA files into a single file.
Different samples will be merged into a single file and sample names
will be appended to the sequence identifier
(e.g. >SEQID;sample=SAMPLENAME). Sample names are defined as the
leftmost part of the file name split at the first occurrence of '.'
(-s/--sep option). Whitespace characters in names will be replaced
with a single character underscore ('_').
optional arguments:
-h, --help show this help message and exit
arguments:
-i FILE [FILE ...], --input FILE [FILE ...]
input FASTQ/FASTA file(s) (required).
-o FILE, --output FILE
output FASTQ/FASTA file (required).
-s SEP, --sep SEP Sample names are defined as the leftmost part of the
file name split at the first occurrence of 'SEP'
(default .)
-f {fastq,fasta}, --format {fastq,fasta}
file format (default fastq).
Examples
Merge files in FASTA format:
micca merge -i in1.fasta in2.fasta in3.fasta -o merged.fasta \
-f fasta
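The sample naming rule described above can be sketched as follows. ``sample_name`` and ``tag_identifier`` are hypothetical helpers, not functions from ``micca.api``, and replacing each whitespace character individually is an assumption about how the help text should be read.

```python
import os
import re


def sample_name(path, sep="."):
    """Leftmost part of the file name, split at the first occurrence of `sep`."""
    base = os.path.basename(path)
    name = base.split(sep, 1)[0]
    # Assumption: every whitespace character becomes one underscore.
    return re.sub(r"\s", "_", name)


def tag_identifier(seq_id, sample):
    """Append the sample name to a sequence identifier."""
    return "{};sample={}".format(seq_id, sample)


name = sample_name("reads/sampleA.trimmed.fastq")
print(tag_identifier("SEQ1", name))
```

Because the split happens at the first separator, a file named ``sampleA.trimmed.fastq`` yields the sample name ``sampleA``, so file names should start with a unique sample label.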
