DECIPHER classification module compared to dada2 classification #683

vhertzb · 2019-02-20T17:54:44Z

I'm attaching two versions of the bar plot that is generated at the end of the dada2 tutorial using the tutorial dataset.

I used the DECIPHER R chunk in the tutorial to get the first plot; the second plot was generated using the dada2 R chunks that identify taxa with the Silva 132 reference files.

As you can see, there is a great deal of difference between the two plots. Any ideas on how to troubleshoot? It could be a problem with the R chunk in the tutorial, with the .RData file and the way it was created, or with DECIPHER.

If you run this using the latest versions of dada2 and phyloseq (I'm running those, and those are newer than the tutorial), can you replicate the same difference?

If it is not the R chunk in the tutorial, then it is either the .RData file or something inherent in DECIPHER. If that is the case, can I copy you on my correspondence with the DECIPHER people?

vhertzb · 2019-02-20T17:56:27Z

...I used the DECIPHER R chunk in the tutorial to get the first set of bar plots...

benjjneb · 2019-02-20T18:33:55Z

From glancing between them, I would say the difference is big but also narrow: lots of sequences are being assigned to Muribaculaceae by dada2::assignTaxonomy, but are being left unassigned by DECIPHER::IdTaxa. Otherwise things look consistent.

I'm not sure why that would be, but @digitalwright might have some insight. It is true that, in general, DECIPHER will be a bit more conservative about assigning taxonomy than default assignTaxonomy (which implements the naive Bayesian classifier method with a bootstrap cutoff of 50%).

vhertzb · 2019-02-20T19:30:25Z

Note that DECIPHER is also assigning some sequences to Muribaculaceae, just a whole lot less than assignTaxonomy does. But I will take it up with @digitalwright as well.
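
One way to check how narrow the difference is would be to cross-tabulate the family-level calls from the two methods for the same sequences. The following is a minimal base R sketch, not code from the tutorial: the vectors `tax_dada2` and `tax_idtaxa` and their six ASV labels are hypothetical toy data standing in for the family column of each taxonomy table.

```r
# Hypothetical toy data: family-level calls for the same six ASVs from each
# method. NA means the sequence was left unassigned at the family rank.
tax_dada2 <- c(ASV1 = "Muribaculaceae", ASV2 = "Muribaculaceae",
               ASV3 = "Muribaculaceae", ASV4 = "Lachnospiraceae",
               ASV5 = "Ruminococcaceae", ASV6 = NA)
tax_idtaxa <- c(ASV1 = "Muribaculaceae", ASV2 = NA,
                ASV3 = NA, ASV4 = "Lachnospiraceae",
                ASV5 = "Ruminococcaceae", ASV6 = NA)

# Align the two sets of calls on the ASVs they have in common.
asvs <- intersect(names(tax_dada2), names(tax_idtaxa))
cmp <- data.frame(dada2 = unname(tax_dada2[asvs]),
                  idtaxa = unname(tax_idtaxa[asvs]),
                  row.names = asvs,
                  stringsAsFactors = FALSE)

# Cross-tabulate the calls, keeping NA (unassigned) as its own category.
xtab <- table(dada2 = cmp$dada2, idtaxa = cmp$idtaxa, useNA = "ifany")
print(xtab)

# ASVs that one method assigned to a family and the other left unassigned.
dada2_only <- rownames(cmp)[!is.na(cmp$dada2) & is.na(cmp$idtaxa)]
print(dada2_only)
```

With real data, the two input vectors would presumably come from the family column of each taxonomy matrix (for example `taxa[, "Family"]` for the assignTaxonomy output), with names set to the sequences or ASV labels so that the two can be aligned.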

apcamargo · 2019-02-20T20:32:33Z

IDTAXA tends to leave more sequences unclassified at the root level. You can read about that in the section "IDTAXA’s classifications change the interpretation of microbiome data" of their paper. Figure 4 illustrates this behaviour.

https://microbiomejournal.biomedcentral.com/articles/10.1186/s40168-018-0521-5

digitalwright · 2019-02-20T23:45:26Z

A few more points:

1. The RDP Classifier is far more permissive than DECIPHER, especially when comparing RDP at 50% confidence to IDTAXA at 60% confidence. The RDP Classifier is more permissive even when it assigns 100% confidence (Figure 4). The RDP Classifier is also wrong more often than IDTAXA at the same level of permissiveness (Figure 1). See Figure 2 for why that is the case.
2. You can lower IDTAXA's confidence to 50% or 40% if you would like to classify more sequences at the expense of some accuracy. Personally, I prefer to be more right than have a higher percentage classified. The error rate is very low at 60% confidence with IDTAXA. Many of the errors are actually due to the reference taxonomy (Figure 5).
3. None of these algorithms are very good for partial reads of the 16S rRNA gene, even IDTAXA (lines in Figure 1d). There simply isn't enough information to work with in short sequences. The difference is that IDTAXA will maintain its low error rates on short sequences, whereas the error rate of other algorithms skyrockets (points in Figure 1d versus points in Figure 1a). See Figure S5 for why that is the case.
4. Your sequences need to be in the same orientation as the training set if you use strand="top". If you are not sure, then try classifying again with strand="both". Note that strand="top" is faster than strand="both". If you use the wrong strand, then sequences will be left unclassified at the Root level.
5. There was a bug in strand="both" that was fixed in DECIPHER v2.10.2. It affected classifications in rare cases, but it is worth updating if you are using strand="both".
6. IDTAXA may be assigning reads at higher rank levels than the family rank that was used in your example plot. You can use plot() on the object output by IdTaxa() to see if that is the case.

Points 1 - 3 above are made in the paper that @apcamargo mentioned.
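
The suggestions above about lowering the confidence threshold, checking both strands, and plotting the IdTaxa() output can be tried together. The following is a minimal sketch rather than the tutorial's chunk: it assumes DECIPHER is installed and uses the small example training set and sequences bundled with the package only to stay self-contained. With the tutorial data, `dna` and the training set would be replaced by the tutorial's own sequences and the Silva training set loaded from the .RData file. The helper `root_unclassified` is hypothetical, and it assumes that sequences left unclassified at the root carry the label "unclassified_Root".

```r
library(DECIPHER)

# Bundled example data, used here only so the sketch is self-contained.
data("TrainingSet_16S")
fas <- system.file("extdata", "Bacteria_175seqs.fas", package = "DECIPHER")
dna <- RemoveGaps(readDNAStringSet(fas))

# Default 60% confidence, assuming the queries are in the same orientation
# as the training set.
ids60 <- IdTaxa(dna, TrainingSet_16S, strand = "top",
                threshold = 60, verbose = FALSE)

# More permissive 50% confidence, checking both strands (slower).
ids50 <- IdTaxa(dna, TrainingSet_16S, strand = "both",
                threshold = 50, verbose = FALSE)

# Hypothetical helper: fraction of sequences left unclassified at the root.
# Assumes such sequences are labelled "unclassified_Root" in x$taxon.
root_unclassified <- function(ids) {
  mean(sapply(ids, function(x) "unclassified_Root" %in% x$taxon))
}
root_unclassified(ids60)
root_unclassified(ids50)

# Plot the classifications to see which ranks and groups were assigned.
plot(ids60, TrainingSet_16S)
```

If the fraction left at the root drops sharply with strand = "both", orientation is the likely cause; if it only changes with the threshold, the difference is down to how conservative the classifier is.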

I hope that helps.

vhertzb · 2019-02-21T02:00:44Z

Thanks to all, your comments have been so helpful.

benjjneb closed this as completed Mar 13, 2019
