Documentation updated; requirements for RTD removed
Davide Albanese committed Feb 27, 2017
1 parent cd71f30, commit 17141e6
Showing 23 changed files with 1,026 additions and 37 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -8,4 +8,5 @@ dist/ | |
MANIFEST | ||
.#* | ||
micca.egg-info/ | ||
.vscode | ||
.vscode | ||
micca/thirdparty_bin/ |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,4 +1,136 @@ | ||
classify | ||
======== | ||
|
||
.. command-output:: micca classify --help | ||
.. code-block:: console | ||
usage: micca classify [-h] -i FILE -o FILE [-m {cons,rdp,otuid}] [-r FILE] | ||
[-x FILE] [--cons-id CONS_ID] | ||
[--cons-maxhits CONS_MAXHITS] | ||
[--cons-minfrac CONS_MINFRAC] | ||
[--cons-mincov CONS_MINCOV] [--cons-strand {both,plus}] | ||
[--cons-threads THREADS] | ||
[--rdp-gene {16srrna,fungallsu,fungalits_warcup,fungalits_unite}] | ||
[--rdp-maxmem GB] [--rdp-minconf RDP_MINCONF] | ||
micca classify assigns taxonomy for each sequence in the input file | ||
and provides three methods for classification: | ||
* VSEARCH-based consensus classifier (cons): input sequences are | ||
searched in the reference database with VSEARCH | ||
(https://github.com/torognes/vsearch). For each query sequence the | ||
method retrieves up to 'cons-maxhits' hits (i.e. identity >= | |
'cons-id'). Then, the most specific taxonomic label that is | ||
associated with at least 'cons-minfrac' of the hits is | ||
assigned. The method is similar to the UCLUST-based consensus | ||
taxonomy assigner presented in doi: 10.7287/peerj.preprints.934v2 | ||
and available in QIIME. | ||
* RDP classifier (rdp): only RDP classifier version >= 2.8 is | ||
supported (doi:10.1128/AEM.00062-07). In order to use this | ||
classifier RDP must be installed (download at | ||
http://sourceforge.net/projects/rdp-classifier/files/rdp-classifier/) | ||
and the RDPPATH environment variable set. The available | |
databases (--rdp-gene) are: | ||
- 16S (16srrna) | ||
- Fungal LSU (28S) (fungallsu) | ||
- Warcup ITS (fungalits_warcup, doi: 10.3852/14-293) | ||
- UNITE ITS (fungalits_unite) | ||
For more information about the RDP classifier go to | ||
http://rdp.cme.msu.edu/classifier/classifier.jsp | ||
* OTU ID classifier (otuid): simply perform a sequence ID matching | ||
with the reference taxonomy file. Recommended strategy when the | ||
closed reference clustering (--method closedref in micca-otu) was | ||
performed. OTU ID classifier requires a tab-delimited file where | ||
the first column contains the current OTU ids and the second column | ||
the reference taxonomy ids (see otuids.txt in micca-otu), e.g.: | ||
REF1[TAB]1110191 | ||
REF2[TAB]1104777 | ||
REF3[TAB]1078527 | ||
... | ||
The input reference taxonomy file (--ref-tax) should be a | ||
tab-delimited file where rows are either in the form: | ||
1. SEQID[TAB]k__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridiales;f__;g__; | ||
2. SEQID[TAB]Bacteria;Firmicutes;Clostridia;Clostridiales;;; | ||
3. SEQID[TAB]Bacteria;Firmicutes;Clostridia;Clostridiales | ||
4. SEQID[TAB]D_0__Bacteria;D_1__Firmicutes;D_2__Clostridia;D_3__Clostridiales;D_4__;D_5__; | ||
Compatible reference databases are Greengenes | |
(http://greengenes.secondgenome.com/downloads), QIIME-formatted SILVA | ||
(https://www.arb-silva.de/download/archive/qiime/) and UNITE | ||
(https://unite.ut.ee/repository.php). | ||
The output file is a tab-delimited file where each row is in the | ||
format: | ||
SEQID[TAB]Bacteria;Firmicutes;Clostridia;Clostridiales | ||
optional arguments: | ||
-h, --help show this help message and exit | ||
arguments: | ||
-i FILE, --input FILE | ||
input FASTA file (for 'cons' and 'rdp') or a tab- | ||
delimited OTU ids file (for 'otuid') (required). | ||
-o FILE, --output FILE | ||
output taxonomy file (required). | ||
-m {cons,rdp,otuid}, --method {cons,rdp,otuid} | ||
classification method (default cons) | ||
-r FILE, --ref FILE reference sequences in FASTA format, required for | ||
'cons' classifier. | ||
-x FILE, --ref-tax FILE | ||
tab-separated reference taxonomy file, required for | ||
'cons' and 'otuid' classifiers. | ||
VSEARCH-based consensus classifier specific options: | |
--cons-id CONS_ID sequence identity threshold (0.0 to 1.0, default 0.9). | ||
--cons-maxhits CONS_MAXHITS | ||
number of hits to consider (>=1, default 3). | ||
--cons-minfrac CONS_MINFRAC | ||
for each taxonomic rank, a specific taxon will be | |
assigned if it is present in at least MINFRAC of the | ||
hits (0.0 to 1.0, default 0.5). | ||
--cons-mincov CONS_MINCOV | ||
reject sequence if the fraction of alignment to the | ||
reference sequence is lower than MINCOV. This | ||
parameter prevents low-coverage alignments at the end | ||
of the sequences (default 0.75). | ||
--cons-strand {both,plus} | ||
search both strands or the plus strand only (default | ||
both). | ||
--cons-threads THREADS | ||
number of threads to use (1 to 256, default 1). | ||
RDP Classifier/Database specific options: | ||
--rdp-gene {16srrna,fungallsu,fungalits_warcup,fungalits_unite} | ||
marker gene/region | ||
--rdp-maxmem GB maximum memory size for the java virtual machine in GB | ||
(default 2) | ||
--rdp-minconf RDP_MINCONF | ||
minimum confidence value to assign taxonomy to a | ||
sequence (default 0.8) | ||
Examples | ||
Classification of 16S sequences using the consensus classifier and | ||
Greengenes: | ||
micca classify -m cons -i input.fasta -o tax.txt \ | ||
--ref greengenes_2013_05/rep_set/97_otus.fasta \ | ||
--ref-tax greengenes_2013_05/taxonomy/97_otu_taxonomy.txt | ||
Classification of ITS sequences using the RDP classifier and the | ||
UNITE database: | ||
micca classify -m rdp --rdp-gene fungalits_unite -i input.fasta \ | ||
-o tax.txt | ||
OTU ID matching after the closed reference OTU picking protocol: | ||
micca classify -m otuid -i otuids.txt -o tax.txt \ | ||
--ref-tax greengenes_2013_05/taxonomy/97_otu_taxonomy.txt |
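The consensus rule described in the help text above (assign the most specific label supported by at least `cons-minfrac` of the hits) can be sketched in a few lines of Python. This is an illustrative sketch only, not micca's actual implementation: the function name `consensus_taxonomy` is hypothetical, the hits are assumed to be already retrieved by VSEARCH, and the lineages are assumed to be in the unprefixed `Bacteria;Firmicutes;...` form with empty ranks left blank. Ties between equally frequent labels are broken arbitrarily here.

```python
from collections import Counter


def consensus_taxonomy(hit_taxonomies, minfrac=0.5):
    """Return the deepest lineage supported by at least `minfrac` of the hits.

    hit_taxonomies: list of ';'-separated lineage strings, one per hit
    (hypothetical input; micca obtains the hits through VSEARCH).
    """
    lineages = [tax.split(";") for tax in hit_taxonomies]
    n_hits = len(lineages)
    consensus = []
    depth = 0
    while lineages:
        # Count the non-empty labels at this rank among the hits that
        # still agree with the consensus built so far.
        labels = Counter(
            lin[depth] for lin in lineages if len(lin) > depth and lin[depth]
        )
        if not labels:
            break
        label, count = labels.most_common(1)[0]
        # The fraction is always computed over the total number of hits.
        if count / n_hits < minfrac:
            break
        consensus.append(label)
        lineages = [
            lin for lin in lineages if len(lin) > depth and lin[depth] == label
        ]
        depth += 1
    return ";".join(consensus)


hits = [
    "Bacteria;Firmicutes;Clostridia;Clostridiales",
    "Bacteria;Firmicutes;Clostridia;Clostridiales",
    "Bacteria;Firmicutes;Bacilli;Lactobacillales",
]
print(consensus_taxonomy(hits, minfrac=0.5))
print(consensus_taxonomy(hits, minfrac=0.7))
```

With three hits, two of which agree down to the order level, a threshold of 0.5 keeps the full lineage, while a threshold of 0.7 stops at the phylum, where all three hits still agree.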
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,4 +1,49 @@ | ||
convert | ||
======= | ||
|
||
.. command-output:: micca convert --help | ||
.. code-block:: console | ||
usage: micca convert [-h] -i FILE -o FILE [-q FILE] [-d DEFAULTQ] | ||
[-f INPUT_FORMAT] [-F OUTPUT_FORMAT] | ||
micca convert converts between sequence file formats. See | ||
http://biopython.org/wiki/SeqIO#File_Formats for a comprehensive list | |
of the supported file formats. | ||
Supported input formats: | ||
abi, abi-trim, ace, embl, embl-cds, fasta, fasta-qual, fastq, fastq-illumina, | ||
fastq-sanger, fastq-solexa, gb, genbank, genbank-cds, ig, imgt, pdb-atom, | ||
pdb-seqres, phd, pir, qual, seqxml, sff, sff-trim, swiss, tab, uniprot-xml | ||
Supported output formats: | ||
embl, fasta, fastq, fastq-illumina, fastq-sanger, fastq-solexa, gb, genbank, | ||
imgt, phd, qual, seqxml, sff, tab | ||
optional arguments: | ||
-h, --help show this help message and exit | ||
arguments: | ||
-i FILE, --input FILE | ||
input sequence file (required). | ||
-o FILE, --output FILE | ||
output sequence file (required). | ||
-q FILE, --qual FILE input quality file (required for 'fasta-qual' input | ||
format). | |
-d DEFAULTQ, --defaultq DEFAULTQ | ||
default phred quality score for format-without-quality | ||
to format-with-quality conversion (default 40). | ||
-f INPUT_FORMAT, --input-format INPUT_FORMAT | ||
input file format (default fastq). | ||
-F OUTPUT_FORMAT, --output-format OUTPUT_FORMAT | ||
output file format (default fasta). | |
Examples | ||
Convert FASTA+QUAL files into a FASTQ (Sanger/Illumina 1.8+) file: | ||
micca convert -i input.fasta -q input.qual -o output.fastq \ | ||
-f fasta-qual -F fastq | ||
Convert a SFF file into a FASTQ (Sanger/Illumina 1.8+) file: | ||
micca convert -i input.sff -o output.fastq -f sff -F fastq |
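The `-d/--defaultq` option covers conversions from a format without qualities to one with qualities. The following Python sketch shows what that amounts to for FASTA to FASTQ (Sanger/Illumina 1.8+, phred+33). It is a simplified illustration, not micca's code (micca relies on Biopython's SeqIO for conversion), and the function name `fasta_to_fastq` is hypothetical.

```python
def fasta_to_fastq(fasta_text, defaultq=40):
    """Convert FASTA text to FASTQ text, giving every base quality `defaultq`."""
    # Phred+33 encoding: the quality character is chr(Q + 33).
    quality_char = chr(defaultq + 33)
    records = []
    name, seq = None, []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if name is not None:
                records.append((name, "".join(seq)))
            name, seq = line[1:], []
        else:
            seq.append(line)
    if name is not None:
        records.append((name, "".join(seq)))
    return "".join(
        f"@{n}\n{s}\n+\n{quality_char * len(s)}\n" for n, s in records
    )


print(fasta_to_fastq(">read1\nACGT\n"), end="")
```

With the default score of 40, each base receives the character `I` (ASCII 73 = 40 + 33).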
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,4 +1,73 @@ | ||
filter | ||
====== | ||
|
||
.. command-output:: micca filter --help | ||
.. code-block:: console | ||
usage: micca filter [-h] -i FILE -o FILE [-e MAXEERATE] [-m MINLEN] [-t] | ||
[-n MAXNS] [-f {fastq,fasta}] | ||
micca filter filters sequences according to the maximum allowed | ||
expected error (EE) rate %. Optionally, you can: | |
* discard sequences that are shorter than the specified length | ||
(suggested for Illumina overlapping paired-end (already merged) | ||
reads) (option --minlen MINLEN); | ||
* discard sequences that are shorter than the specified length AND | ||
truncate sequences that are longer (suggested for Illumina and 454 | ||
unpaired reads) (options --minlen MINLEN --trunc); | ||
* discard sequences that contain more than a specified number of Ns | ||
(--maxns). | ||
Sequences are first shortened and then filtered. Overlapping paired | ||
reads should be merged first (using micca-mergepairs) and then | |
filtered. | ||
The expected error (EE) rate % in a sequence of length L is defined | |
as (doi: 10.1093/bioinformatics/btv401): | |
EE rate % = sum(error probabilities) / L * 100 | |
Before filtering, run 'micca filterstats' to see how many reads will | ||
pass the filter at different minimum lengths with or without | ||
truncation, given a maximum allowed expected error rate % and maximum | |
allowed number of Ns. | ||
micca-filter is based on VSEARCH (https://github.com/torognes/vsearch). | ||
optional arguments: | ||
-h, --help show this help message and exit | ||
arguments: | ||
-i FILE, --input FILE | ||
input FASTQ file, Sanger/Illumina 1.8+ format | ||
(phred+33) (required). | ||
-o FILE, --output FILE | ||
output FASTA/FASTQ file (required). | ||
-e MAXEERATE, --maxeerate MAXEERATE | ||
discard sequences with more than the specified expected | |
error rate % (values <=1%, i.e. less than or equal to one | |
error per 100 bases, are highly recommended). | |
Sequences are discarded after truncation (if enabled) | ||
(default 1). | ||
-m MINLEN, --minlen MINLEN | ||
discard sequences that are shorter than MINLEN | ||
(default 1). | ||
-t, --trunc truncate sequences that are longer than MINLEN | ||
(disabled by default). | ||
-n MAXNS, --maxns MAXNS | ||
discard sequences with more than the specified number | ||
of Ns. Sequences are discarded after truncation | ||
(disabled by default). | ||
-f {fastq,fasta}, --output-format {fastq,fasta} | ||
file format (default fasta). | ||
Examples | ||
Truncate reads at 300 bp, discard low quality sequences | ||
(with EE rate > 0.5%) and write a FASTA file: | |
micca filter -i reads.fastq -o filtered.fasta -m 300 -t -e 0.5 |
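The expected error rate defined in the help text above can be computed directly from a phred+33 quality string, since a quality score Q corresponds to an error probability of 10^(-Q/10). The Python sketch below follows the order stated in the help text (sequences are shortened first, then filtered). It is only an illustration under those assumptions: micca delegates the real work to VSEARCH, and the names `ee_rate_percent` and `filter_read` are hypothetical.

```python
def ee_rate_percent(quality, offset=33):
    """Expected error rate (%) of a read, given its phred+33 quality string."""
    # Each quality character encodes Q = ord(char) - offset,
    # and the error probability for that base is 10 ** (-Q / 10).
    probs = [10 ** (-(ord(char) - offset) / 10) for char in quality]
    return sum(probs) / len(probs) * 100


def filter_read(seq, quality, maxeerate=1.0, minlen=1, trunc=False, maxns=None):
    """Return (seq, quality) if the read passes the filter, otherwise None."""
    if len(seq) < minlen:
        return None
    if trunc:
        # Truncation happens before the EE rate and N checks.
        seq, quality = seq[:minlen], quality[:minlen]
    if maxns is not None and seq.count("N") > maxns:
        return None
    if ee_rate_percent(quality) > maxeerate:
        return None
    return seq, quality


# A read whose second half has low quality ('+' is Q10 in phred+33).
read = ("ACGTACGT", "IIII++++")
print(filter_read(*read, maxeerate=0.5))                        # whole read
print(filter_read(*read, maxeerate=0.5, minlen=4, trunc=True))  # truncated
```

The untruncated read is rejected because its low-quality tail dominates the error sum, while the same read truncated to its first four bases passes.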
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,4 +1,54 @@ | ||
filterstats | ||
=========== | ||
|
||
.. command-output:: micca filterstats --help | ||
.. code-block:: console | ||
usage: micca filterstats [-h] -i FILE [-o DIR] [-t TOPN] | ||
[-e MAXEERATES [MAXEERATES ...]] [-n MAXNS] | ||
micca filterstats reports the fraction of reads that would pass for each | ||
specified maximum expected error (EE) rate % and the maximum number of | |
allowed Ns after: | ||
* discarding sequences that are shorter than the specified length | ||
(suggested for Illumina overlapping paired-end (already merged) | ||
reads); | ||
* discarding sequences that are shorter than the specified length AND | ||
truncating sequences that are longer (suggested for Illumina and 454 | ||
unpaired reads); | ||
Parameters for the 'micca filter' command should be chosen for each | ||
sequencing run using this tool. | ||
micca filterstats writes 3 files to the output directory: | |
* filterstats_minlen.txt: fraction of reads that would pass the filter after | ||
the minimum length filtering; | ||
* filterstats_trunclen.txt: fraction of reads that would pass the filter after | ||
the minimum length filtering + truncation; | ||
* filterstats_plot.png: plot in PNG format. | ||
optional arguments: | ||
-h, --help show this help message and exit | ||
arguments: | ||
-i FILE, --input FILE | ||
input FASTQ file, Sanger/Illumina 1.8+ format | ||
(phred+33) (required). | ||
-o DIR, --output DIR output directory (default .). | ||
-t TOPN, --topn TOPN perform statistics on the first TOPN sequences | ||
(disabled by default) | ||
-e MAXEERATES [MAXEERATES ...], --maxeerates MAXEERATES [MAXEERATES ...] | ||
max expected error rates (%). (default [0.25, 0.5, | ||
0.75, 1, 1.25, 1.5]) | ||
-n MAXNS, --maxns MAXNS | ||
max number of Ns. (disabled by default). | ||
Examples | ||
Compute filter statistics on the top 10000 sequences, predicting | ||
the fraction of reads that would pass for each maximum EE rate | |
(default values): | |
micca filterstats -i input.fastq -o stats -t 10000 |
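The statistic reported by filterstats can be sketched as follows for a single minimum length: for each maximum EE rate, count the reads that survive the length filter (optionally with truncation) and whose EE rate does not exceed the threshold. This is a hedged illustration rather than micca's implementation; the function names are hypothetical, the input is assumed to be a list of phred+33 quality strings, and the check on the number of Ns is left out for brevity.

```python
def ee_rate_percent(quality, offset=33):
    """Expected error rate (%) of a read, given its phred+33 quality string."""
    probs = [10 ** (-(ord(char) - offset) / 10) for char in quality]
    return sum(probs) / len(probs) * 100


def pass_fractions(qualities, minlen,
                   maxeerates=(0.25, 0.5, 0.75, 1, 1.25, 1.5), trunc=False):
    """Fraction of reads passing each max EE rate at a given minimum length."""
    fractions = {}
    for maxee in maxeerates:
        passed = 0
        for quality in qualities:
            if len(quality) < minlen:
                continue  # too short: discarded by the length filter
            if trunc:
                quality = quality[:minlen]
            if ee_rate_percent(quality) <= maxee:
                passed += 1
        # Fractions are relative to all input reads, not only the long ones.
        fractions[maxee] = passed / len(qualities)
    return fractions


reads = ["IIII", "II++", "I"]
print(pass_fractions(reads, minlen=2))              # minimum length only
print(pass_fractions(reads, minlen=2, trunc=True))  # minimum length + truncation
```

Comparing the two results for a range of minimum lengths is what lets you choose `--minlen`, `--trunc` and `--maxeerate` for 'micca filter'.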
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,4 +1,36 @@ | ||
merge | ||
===== | ||
|
||
.. command-output:: micca merge --help | ||
.. code-block:: console | ||
usage: micca merge [-h] -i FILE [FILE ...] -o FILE [-s SEP] [-f {fastq,fasta}] | ||
micca merge merges several FASTQ or FASTA files in a single file. | ||
Different samples will be merged in a single file and sample names | ||
will be appended to the sequence identifier | ||
(e.g. >SEQID;sample=SAMPLENAME). Sample names are defined as the | ||
leftmost part of the file name split at the first occurrence of '.' | |
(-s/--sep option). Whitespace characters in names will be replaced | ||
with a single character underscore ('_'). | ||
optional arguments: | ||
-h, --help show this help message and exit | ||
arguments: | ||
-i FILE [FILE ...], --input FILE [FILE ...] | ||
input FASTQ/FASTA file(s) (required). | ||
-o FILE, --output FILE | ||
output FASTQ/FASTA file (required). | ||
-s SEP, --sep SEP Sample names are defined as the leftmost part of the | ||
file name split at the first occurrence of 'SEP' | |
(default .) | ||
-f {fastq,fasta}, --format {fastq,fasta} | ||
file format (default fastq). | ||
Examples | ||
Merge files in FASTA format: | ||
micca merge -i in1.fasta in2.fasta in3.fasta -o merged.fasta \ | ||
-f fasta | ||
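The sample naming rule described above can be sketched in Python. The helper names `sample_name` and `tag_header` are hypothetical and this is not micca's code; it also assumes that each whitespace character is replaced by one underscore, which is one reading of the help text.

```python
import os
import re


def sample_name(path, sep="."):
    """Sample name: leftmost part of the file name, split at the first `sep`."""
    filename = os.path.basename(path)
    name = filename.split(sep, 1)[0]
    # Assumption: every whitespace character becomes one underscore.
    return re.sub(r"\s", "_", name)


def tag_header(seqid, sample):
    """Append the sample name to a sequence identifier."""
    return f"{seqid};sample={sample}"


for path in ["/data/Mw_01.trimmed.fastq", "my sample.fasta"]:
    print(tag_header("SEQID", sample_name(path)))
```

Passing a different separator changes where the name is cut: `sample_name("Mw_01.fastq", sep="_")` gives `Mw`.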