To characterize the taxonomic structure of the samples, the sequences are organized into Operational Taxonomic Units (OTUs) at varying levels of identity. An identity of 97% represents the common working definition of bacterial species. The commands/otu command assigns similar sequences (marker genes such as 16S rRNA and the fungal ITS region) to OTUs. It wraps VSEARCH for the low-level clustering, chimera detection and searching operations.
The commands/otu command writes five files to a single output directory:
- otutable.txt
TAB-delimited file containing the number of times each OTU is found in each sample (OTU x sample, see formats):
OTU      Mw_01  Mw_02  Mw_03  ...
DENOVO1  151    178    177    ...
DENOVO2  339    181    142    ...
DENOVO3  533    305    63     ...
...      ...    ...    ...    ...
- otus.fasta
FASTA file containing the representative sequences (OTUs):
>DENOVO1
GACGAACGCTGGCGGCGTGCCTAACACATGCAAGTCGAACGGGG...
>DENOVO2
GATGAACGCTAGCTACAGGCTTAACACATGCAAGTCGAGGGGCA...
>DENOVO3
AGTGAACGCTGGCGACGTGGTTAAGACATGCAAGTCGAGCGGTA...
...
- otuids.txt
TAB-delimited file that maps the OTU ids to the original sequence ids:
DENOVO1  IS0AYJS04JQKIS;sample=Mw_01
DENOVO2  IS0AYJS04JL6RS;sample=Mw_01
DENOVO3  IS0AYJS04H4XNN;sample=Mw_01
...
- hits.txt
TAB-separated, three-column file containing the matching sequence, the representative (seed) and the identity (if available), see otu-definition_identity:
IS0AYJS04JE658;sample=Mw_01;  IS0AYJS04I4XYN;sample=Mw_01  99.4
IS0AYJS04JPH34;sample=Mw_01;  IS0AYJS04JVUBC;sample=Mw_01  98.0
IS0AYJS04I67XN;sample=Mw_01;  IS0AYJS04JVUBC;sample=Mw_01  99.7
...
- otuschim.fasta
(only for the 'denovo_greedy', 'denovo_swarm' and 'open_ref' methods, when -c/--rmchim is specified) FASTA file containing the chimeric OTUs.
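The OTU table is plain TAB-delimited text, so it can be inspected with a few lines of Python. The sketch below is only an illustration: the function names `read_otutable` and `sample_totals` are hypothetical and not part of micca, and the toy data reproduces the first three OTUs of the otutable.txt excerpt above.

```python
import csv
import io


def read_otutable(handle):
    """Parse a TAB-delimited OTU x sample table into {otu: {sample: count}}."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    samples = header[1:]  # the first column holds the OTU ids
    table = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        table[row[0]] = {s: int(c) for s, c in zip(samples, row[1:])}
    return samples, table


def sample_totals(samples, table):
    """Total number of reads assigned to OTUs in each sample."""
    return {s: sum(counts[s] for counts in table.values()) for s in samples}


# Toy data: the first three OTUs of the excerpt, as an in-memory file.
example = io.StringIO(
    "OTU\tMw_01\tMw_02\tMw_03\n"
    "DENOVO1\t151\t178\t177\n"
    "DENOVO2\t339\t181\t142\n"
    "DENOVO3\t533\t305\t63\n"
)
samples, table = read_otutable(example)
totals = sample_totals(samples, table)
```

With a real run, the in-memory file would be replaced by `open("otutable.txt", newline="")` inside the output directory.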
Warning
Trimming the sequences to a fixed position before clustering is strongly recommended when they cover partial amplicons or when quality deteriorates towards the end (common with long amplicons and single-end sequencing), see singleend-quality_filtering.
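As a rough illustration of what trimming to a fixed position means, the sketch below discards reads shorter than a chosen length and cuts the remaining ones at that position, so that all sequences entering the clustering step span the same region. The function name `trim_to_length` and the toy reads are hypothetical; in micca this step belongs to the quality filtering stage (commands/filter), not to this snippet.

```python
def trim_to_length(records, length):
    """Keep reads with at least `length` bases and cut them at that position."""
    for seqid, seq in records:
        if len(seq) >= length:
            yield seqid, seq[:length]


# Toy reads of different lengths (ids and sequences are made up).
reads = [("r1", "ACGTACGTAC"), ("r2", "ACGTAC"), ("r3", "ACGTACGTACGT")]
trimmed = list(trim_to_length(reads, 8))
```

Here `r2` is dropped because it is shorter than 8 bases, while `r1` and `r3` are both cut to their first 8 bases.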
Note
De novo OTUs are renamed to DENOVO[N] and reference OTUs to REF[N].
commands/otu implements several state-of-the-art clustering strategies:
In de novo greedy clustering (parameter --method denovo_greedy), sequences are clustered without relying on an external reference database, using an approach similar to the UPARSE pipeline (https://doi.org/10.1038/nmeth.2604) and tested in https://doi.org/10.7287/peerj.preprints.1466v1. commands/otu performs dereplication, clustering and chimera filtering in a single command:
- Dereplication. Estimate the abundance of each unique sequence by dereplication, sort by decreasing abundance and discard sequences with an abundance smaller than DEREP_MINSIZE (option --derep-minsize, recommended value 2);
- Greedy clustering. Distance-based (DGC) and abundance-based (AGC) strategies are supported (option --greedy, see https://doi.org/10.1186/s40168-015-0081-x and https://doi.org/10.7287/peerj.preprints.1466v1). This step yields the candidate representative sequences;
- Chimera filtering (optional). Remove chimeric sequences from the representatives by de novo chimera detection (option --rmchim, recommended);
- Map sequences. Map sequences to the representatives.
Example (requires singleend-quality_filtering in singleend to be done):
micca otu -m denovo_greedy -i filtered.fasta -o denovo_greedy_otus -d 0.97 -c -t 4
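To make the dereplication and greedy clustering steps more concrete, here is a toy Python sketch. It is not how micca or VSEARCH are implemented: the `identity` function is a crude position-by-position stand-in for an alignment-based identity, all function names are hypothetical, and chimera filtering and the final mapping step are omitted. It assumes the usual reading of the two strategies: DGC assigns a sequence to the closest centroid within the threshold, AGC to the most abundant one.

```python
from collections import Counter


def identity(a, b):
    """Toy identity: fraction of matching positions over the shorter length.
    A stand-in for the alignment-based identity computed by VSEARCH."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / n


def dereplicate(seqs, minsize=2):
    """Collapse identical sequences, sort by decreasing abundance and
    drop unique sequences seen fewer than `minsize` times."""
    counts = Counter(seqs)
    return [(s, n) for s, n in counts.most_common() if n >= minsize]


def greedy_cluster(uniques, threshold=0.97, strategy="dgc"):
    """Greedy clustering of abundance-sorted unique sequences.

    "dgc": assign to the closest centroid within the threshold.
    "agc": assign to the most abundant centroid within the threshold.
    """
    centroids = []  # kept in decreasing abundance order
    members = {}    # centroid -> list of member sequences
    for seq, _size in uniques:
        hits = [(identity(seq, c), c) for c in centroids]
        hits = [(i, c) for i, c in hits if i >= threshold]
        if not hits:
            centroids.append(seq)  # no match: seq seeds a new OTU
            members[seq] = [seq]
        elif strategy == "agc":
            members[hits[0][1]].append(seq)  # first hit = most abundant
        else:
            best = max(hits, key=lambda h: h[0])[1]
            members[best].append(seq)
    return members


# Toy reads: three unique sequences plus one singleton that is discarded.
seqs = (["AAAAAAAAAA"] * 5 + ["AAAAAAACCC"] * 3
        + ["AAAAAAAACC"] * 2 + ["CCCCCCCCCC"])
uniques = dereplicate(seqs, minsize=2)
dgc = greedy_cluster(uniques, threshold=0.8, strategy="dgc")
agc = greedy_cluster(uniques, threshold=0.8, strategy="agc")
```

In this toy data the third sequence is within the threshold of both centroids, so the two strategies place it in different OTUs: DGC assigns it to the closer, less abundant centroid, AGC to the most abundant one.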
Closed-reference clustering (closed_ref): sequences are clustered against an external reference database and reads that could not be matched are discarded. Example (requires singleend-quality_filtering in singleend to be done):
Download the reference database (Greengenes), clustered at 97% identity:
wget ftp://ftp.fmach.it/metagenomics/micca/dbs/gg_2013_05.tar.gz
tar -zxvf gg_2013_05.tar.gz
Run the closed-reference protocol:
micca otu -m closed_ref -i filtered.fasta -o closed_ref_otus -r 97_otus.fasta -d 0.97 -t 4
Assign the taxonomy by simply matching the sequence IDs against the reference taxonomy file (see commands/classify):
cd closed_ref_otus
micca classify -m otuid -i otuids.txt -o taxa.txt -x ../97_otu_taxonomy.txt
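The ID matching step amounts to a dictionary lookup. The sketch below assumes that both otuids.txt and the taxonomy file are two-column TAB-delimited files (OTU id and reference sequence id; reference sequence id and taxonomy string). The function names, the reference ids, the taxonomy strings and the "Unclassified" placeholder are all hypothetical and only serve the illustration.

```python
def read_two_columns(lines):
    """Parse TAB-delimited 'key<TAB>value' lines into a dict."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines if line.strip())
    return {key: value for key, value in pairs}


def classify_by_id(otuids, taxonomy, missing="Unclassified"):
    """Look up the reference sequence id of each OTU in the taxonomy mapping."""
    return {otu: taxonomy.get(refid, missing) for otu, refid in otuids.items()}


# Hypothetical file contents (ids and lineages are made up).
otuids = read_two_columns(["REF1\t1001\n", "REF2\t1002\n", "REF3\t9999\n"])
taxonomy = read_two_columns([
    "1001\tk__Bacteria; p__Firmicutes\n",
    "1002\tk__Bacteria; p__Proteobacteria\n",
])
taxa = classify_by_id(otuids, taxonomy)
```

This only works for closed-reference OTUs, because their representatives are reference sequences whose ids appear in the taxonomy file.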
Open-reference clustering (open_ref): sequences are clustered against an external reference database (as in otu-closed_reference) and reads that could not be matched are clustered with the otu-de_novo_greedy protocol. Example (requires singleend-quality_filtering in singleend to be done):
Download the reference database (Greengenes), clustered at 97% identity:
wget ftp://ftp.fmach.it/metagenomics/micca/dbs/gg_2013_05.tar.gz
tar -zxvf gg_2013_05.tar.gz
Run the open-reference protocol:
micca otu -m open_ref -i filtered.fasta -o open_ref_otus -r 97_otus.fasta -d 0.97 -t 7 -c
Run the VSEARCH-based consensus classifier or the RDP classifier (see commands/classify):
cd open_ref_otus
micca classify -m cons -i otus.fasta -o taxa.txt -r ../97_otus.fasta -x ../97_otu_taxonomy.txt -t 4
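The difference between the closed-reference and open-reference protocols can be sketched in a few lines. This is a simplified illustration, not micca's implementation: `identity` is a toy position-by-position measure, the de novo step is a bare greedy pass without dereplication or chimera filtering, and all names and sequences are hypothetical. The OTU labels follow the REF[N] and DENOVO[N] convention described in the note above.

```python
def identity(a, b):
    """Toy identity: fraction of matching positions over the shorter length."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / n


def open_reference(reads, reference, threshold=0.97):
    """Assign reads to their best reference hit, then cluster the
    unmatched reads de novo. Returns {otu_label: [reads]}."""
    ref_hits = {}   # reference id -> matched reads
    unmatched = []
    for read in reads:
        best_id, best = None, 0.0
        for refid, refseq in reference.items():
            i = identity(read, refseq)
            if i >= threshold and i > best:
                best_id, best = refid, i
        if best_id is None:
            unmatched.append(read)  # closed-reference would discard these
        else:
            ref_hits.setdefault(best_id, []).append(read)

    denovo = []  # list of (centroid, members) pairs
    for read in unmatched:
        for centroid, members in denovo:
            if identity(read, centroid) >= threshold:
                members.append(read)
                break
        else:
            denovo.append((read, [read]))  # read seeds a new de novo OTU

    otus = {}
    for n, members in enumerate(ref_hits.values(), start=1):
        otus[f"REF{n}"] = members
    for n, (_centroid, members) in enumerate(denovo, start=1):
        otus[f"DENOVO{n}"] = members
    return otus


# Hypothetical reference and reads.
reference = {"ref_a": "AAAAAAAAAA", "ref_b": "CCCCCCCCCC"}
reads = ["AAAAAAAAAA", "AAAAAAAAAT", "GGGGGGGGGG", "GGGGGGGGGT", "CCCCCCCCCC"]
otus = open_reference(reads, reference, threshold=0.9)
```

Skipping the de novo pass over `unmatched` gives the closed-reference behaviour, where reads without a reference hit are simply dropped.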
In de novo swarm clustering (doi: 10.7717/peerj.593, doi: 10.7717/peerj.1420, https://github.com/torognes/swarm, parameter --method denovo_swarm), sequences are clustered without relying on an external reference database. From https://github.com/torognes/swarm:
The purpose of swarm is to provide a novel clustering algorithm that handles massive sets of amplicons. Results of traditional clustering algorithms are strongly input-order dependent, and rely on an arbitrary global clustering threshold. swarm results are resilient to input-order changes and rely on a small local linking threshold d, representing the maximum number of differences between two amplicons. swarm forms stable, high-resolution clusters, with a high yield of biological information.
commands/otu performs dereplication, clustering and de novo chimera filtering in a single command:
- Dereplication. Estimate the abundance of each unique sequence by dereplication, sort by decreasing abundance and discard sequences with an abundance smaller than DEREP_MINSIZE (option --derep-minsize, recommended value is 1, i.e. no filtering);
- Swarm clustering. One difference and the fastidious option are recommended (--swarm-differences 1 --swarm-fastidious);
- Chimera filtering (optional). Remove chimeric sequences from the representatives by de novo chimera detection (option --rmchim).
Warning
Removing ambiguous nucleotides (N) (with the option --maxns 0 in commands/filter) is mandatory if you use the de novo swarm clustering method.
Example (requires singleend-primer_trimming in singleend to be done):
micca filter -i trimmed.fastq -o filtered.fasta -e 0.5 -m 350 -t --maxns 0
micca otu -m denovo_swarm -i filtered.fasta -o otus_denovo_swarm -c --minsize 1 --swarm-fastidious -t 4
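The idea of a local linking threshold d can be illustrated with a small Python sketch. It is a deliberately simplified, quadratic-time toy and not the swarm algorithm itself: it only grows clusters from abundant seeds by repeatedly linking amplicons that are at most d differences away from any current member, and it leaves out the fastidious option and swarm's other refinements. The function names and sequences are hypothetical.

```python
from collections import Counter, deque


def differences(a, b):
    """Number of differences between two sequences (Levenshtein distance:
    substitutions, insertions and deletions)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # match or substitution
        prev = cur
    return prev[-1]


def swarm_like(seqs, d=1):
    """Grow clusters from the most abundant unassigned amplicon by
    iteratively linking amplicons within d differences of any member."""
    counts = Counter(seqs)
    pending = [s for s, _ in counts.most_common()]  # decreasing abundance
    clusters = []
    while pending:
        seed = pending.pop(0)
        cluster, queue = [seed], deque([seed])
        while queue:
            current = queue.popleft()
            linked = [s for s in pending if differences(current, s) <= d]
            for s in linked:
                pending.remove(s)
                cluster.append(s)
                queue.append(s)
        clusters.append(cluster)
    return clusters


# Toy amplicons: the third one is 2 differences away from the seed but
# only 1 difference away from the second, so it joins the same cluster.
seqs = (["ACGTACGT"] * 10 + ["ACGTACGA"] * 4
        + ["ACGTACAA"] * 2 + ["TTTTTTTT"] * 3)
clusters = swarm_like(seqs, d=1)
```

Because links are local, a cluster can contain amplicons that differ from the seed by more than d differences, as long as they are connected through intermediate amplicons. This is what distinguishes the approach from a single global identity threshold.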
In micca, the pairwise identity (except for de novo swarm) is defined as the edit distance excluding terminal gaps (same as in USEARCH and BLAST):