Updates 9 Mar 2016:
- QIIME command is now pick_closed_reference_otus.py.
- See this paper for more information on the pros and cons of open and closed reference OTU picking.
Input sequences and Phred scores are provided to split_libraries_fastq.py, and demultiplexed/quality filtered using its default parameters. On a per-read basis, quality filtering works as follows:
- Identify the first quality score below Q3 and truncate the read just prior to that position.
- Determine if the truncated sequence is at least 75% of the length of the input sequence: if yes, retain the truncated read; if no, discard the read.
- Determine if the truncated sequence has any N (i.e., ambiguous base call) characters: if yes, discard the read; if no, retain the read.
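The three per-read steps above can be sketched in a few lines of Python. This is an illustration of the rules as written in this document, not the code QIIME runs: the function name `quality_filter` and its parameters `min_q` and `min_fraction` are hypothetical, and the quality scores are assumed to be already decoded to integer Phred values.

```python
def quality_filter(seq, quals, min_q=3, min_fraction=0.75):
    """Apply the three filtering steps to one read.

    Returns the truncated read, or None if the read is discarded.
    Hypothetical helper for illustration; not part of QIIME.
    """
    # Step 1: find the first quality score below the threshold and
    # truncate the read just before that position.
    cut = len(seq)
    for i, q in enumerate(quals):
        if q < min_q:
            cut = i
            break
    truncated = seq[:cut]

    # Step 2: discard reads shorter than 75% of the input length.
    if len(seq) == 0 or len(truncated) < min_fraction * len(seq):
        return None

    # Step 3: discard reads that still contain an ambiguous base call.
    if "N" in truncated.upper():
        return None

    return truncated
```

For example, an 8-base read whose seventh score falls below the threshold is truncated to 6 bases and retained (6 is exactly 75% of 8), while the same read with a low score at the fourth position is discarded.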
97% OTUs are picked in the initial EMP analyses using a closed-reference
OTU picking protocol against the Greengenes database pre-clustered at
97% identity (this Greengenes reference collection build is available
here:
http://greengenes.lbl.gov/Download/Sequence_Data/Fasta_data_files/Caporaso_Reference_OTUs/gg_otus_4feb2011.tgz).
This is done using pick_closed_reference_otus.py. This process works
as follows. Reads are pre-sorted by abundance in QIIME so the most
frequently occurring sequences will be chosen as OTU centroid sequences.
Each read is then searched against the Greengenes reference sequences
using reference-based uclust version 1.2.22. The call to uclust issued
by QIIME looks like:
uclust --id 0.97 --w 12 --stepwords 20 --usersort --maxaccepts 20 --libonly --stable_sort --maxrejects 500
Reads which hit a sequence in the reference collection at greater than
or equal to 97% identity are assigned to an OTU defined by the reference
sequence they match. Reads which fail to hit any reference sequence at
97% identity or greater are discarded. Taxonomy is assigned to each OTU based
on the reference sequence defining that OTU, and the Greengenes
reference tree (also provided in the reference collection build linked
above) can then be used for phylogenetic diversity analyses.
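The closed-reference logic described above (sort by abundance, assign each read to a reference sequence at 97% identity or greater, discard the rest) can be sketched as follows. This is a toy model under stated assumptions, not the QIIME or uclust implementation: `identity` is a hypothetical stand-in that compares equal-length sequences position by position rather than aligning them, and the sketch takes the best-scoring reference, whereas uclust stops after a limited number of accepts and rejects.

```python
from collections import Counter


def identity(a, b):
    """Fraction of matching positions between two equal-length sequences.

    Toy stand-in for an alignment-based identity (assumption).
    """
    if len(a) != len(b) or not a:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def pick_closed_reference(reads, reference, min_identity=0.97):
    """Assign reads to reference-defined OTUs.

    reads: iterable of read sequences.
    reference: dict mapping reference ID -> reference sequence.
    Returns (otu_counts, n_discarded). Hypothetical helper.
    """
    counts = Counter(reads)
    otu_counts = {}
    n_discarded = 0

    # Process unique sequences from most to least abundant.
    for seq, n in counts.most_common():
        best_id, best_score = None, 0.0
        for ref_id, ref_seq in reference.items():
            score = identity(seq, ref_seq)
            if score > best_score:
                best_id, best_score = ref_id, score

        if best_id is not None and best_score >= min_identity:
            # The OTU is defined by the reference sequence that was hit.
            otu_counts[best_id] = otu_counts.get(best_id, 0) + n
        else:
            # Reads with no hit at the threshold are dropped.
            n_discarded += n

    return otu_counts, n_discarded
```

Because each read is compared only against the fixed reference set and never against other reads, the input can be split into chunks and processed independently, which is what makes the approach straightforward to parallelize.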
Closed-reference OTU picking has pros and cons, and future EMP runs will
apply the open-reference OTU picking process described here:
http://qiime.org/tutorials/open_reference_illumina_processing.html#option-2-subsampled-open-reference-otu-picking
The pros of closed-reference OTU picking are that it is fully
parallelizable, which is important for data sets of this scale, and that
the OTUs are defined by trusted reference sequences. It additionally
serves as a quality control filter: erroneous reads will likely be
discarded as not hitting the reference data set. The primary con of
closed-reference OTU picking is that sequences that are not already
known (i.e., represented in the reference data set) will be excluded.