DECIPHER classification module compared to dada2 classification #683

Closed
vhertzb opened this issue Feb 20, 2019 · 6 comments

Comments

@vhertzb

vhertzb commented Feb 20, 2019

I'm attaching two versions of the bar plot that is generated at the end of the dada2 tutorial using the tutorial dataset.
[Image: ps barplot based on decipher]
[Image: ps barplot based on dada2 with silva 132]

I used the DECIPHER R chunk in the tutorial to get the first plot; the second plot was generated using the dada2 R chunks, which identify taxa using the Silva 132 reference files.

As you can see there is a great deal of difference between the two plots. Any ideas on how to troubleshoot? It could be a problem with the R chunk in the tutorial, it could be a problem with the .RData file and the way it was created, it could be a problem with DECIPHER.

If you run this using the latest versions of dada2 and phyloseq (I'm running those, and those are newer than the tutorial), can you replicate the same difference?

If it is not the R chunk in the tutorial, then it is either the .RData file or something inherent in DECIPHER. If that is the case, can I copy you on my correspondence with the DECIPHER people?

@vhertzb
Author

vhertzb commented Feb 20, 2019

...I used the DECIPHER R chunk in the tutorial to get the first set of bar plots...

@benjjneb
Owner

From glancing between them, I would say the difference is big but also narrow: lots of sequences are being assigned to Muribaculaceae by dada2::assignTaxonomy, but are being left unassigned by DECIPHER::IdTaxa. Otherwise things look consistent.

I'm not sure why that would be, but @digitalwright might have some insight. It is true that in general, DECIPHER will be a bit more conservative about assigning taxonomy than the default assignTaxonomy (which implements the naive Bayesian classifier method with a bootstrap cutoff of 50%).
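
To make the effect of a confidence cutoff concrete, here is a minimal base R sketch of how a lineage gets truncated at the first rank whose support falls below the cutoff. The support values and the helper `truncate_lineage()` are invented for illustration; they are not dada2 or DECIPHER internals, and the two tools compute confidence differently, so a 50% cutoff in one is not directly comparable to 60% in the other.

```r
# Hypothetical per-rank support (percent) for one sequence; numbers are made up.
ranks   <- c("Kingdom", "Phylum", "Class", "Order", "Family", "Genus")
lineage <- c("Bacteria", "Bacteroidetes", "Bacteroidia", "Bacteroidales",
             "Muribaculaceae", NA)
support <- c(100, 98, 97, 95, 55, 20)

# Illustrative helper (not a library function): keep ranks until the first
# one whose support is below the cutoff, and set everything after it to NA.
truncate_lineage <- function(lineage, support, cutoff) {
  keep <- cumsum(support < cutoff) == 0  # TRUE until the first failing rank
  out <- lineage
  out[!keep] <- NA
  out
}

at50 <- setNames(truncate_lineage(lineage, support, 50), ranks)
at60 <- setNames(truncate_lineage(lineage, support, 60), ranks)
at50  # Family is kept at a cutoff of 50
at60  # Family becomes NA at a cutoff of 60
```

In this toy case the same sequence lands in Muribaculaceae at one cutoff and is unassigned at the family rank at the other, which is the kind of shift that would show up in a family-level bar plot.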

@vhertzb
Copy link
Author

vhertzb commented Feb 20, 2019

Note that DECIPHER is also assigning some sequences to Muribaculaceae, just a whole lot less than assignTaxonomy does. But I will take it up with @digitalwright as well.
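
One way to quantify "a whole lot less" would be to cross-tabulate the family-level calls from the two classifiers for the same sequences. The sketch below uses made-up vectors (`fam_dada2`, `fam_decipher`) standing in for the family columns of the two taxonomy tables; with real data they would need to be in the same sequence order.

```r
# Made-up family-level calls for the same five ASVs from each classifier.
fam_dada2    <- c("Muribaculaceae", "Muribaculaceae", "Lachnospiraceae",
                  "Muribaculaceae", NA)
fam_decipher <- c("Muribaculaceae", NA, "Lachnospiraceae", NA, NA)

# Rows: dada2 call, columns: DECIPHER call; NA marks "unassigned at family".
xtab <- table(dada2 = fam_dada2, DECIPHER = fam_decipher, useNA = "ifany")
xtab

# Indices of ASVs that got a family from dada2 but none from DECIPHER.
only_dada2 <- which(!is.na(fam_dada2) & is.na(fam_decipher))
only_dada2
```

Disagreements between two named families would appear as off-diagonal counts, while the pattern described in this thread would show up in the NA column.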

@apcamargo

IDTAXA tends to leave more sequences unclassified at the root level. You can read about that in the section "IDTAXA’s classifications change the interpretation of microbiome data" of their paper. Figure 4 illustrates this behaviour.

https://microbiomejournal.biomedcentral.com/articles/10.1186/s40168-018-0521-5

@digitalwright

A few more points:

  1. The RDP Classifier is far more permissive than DECIPHER, especially when comparing RDP at 50% confidence to IDTAXA at 60% confidence. The RDP Classifier is more permissive even when it assigns 100% confidence (Figure 4). The RDP Classifier is also wrong more often than IDTAXA at the same level of permissiveness (Figure 1). See Figure 2 for why that is the case.
  2. You can lower IDTAXA's confidence to 50% or 40% if you would like to classify more sequences at the expense of some accuracy. Personally, I prefer to be more right than have a higher percentage classified. The error rate is very low at 60% confidence with IDTAXA. Many of the errors are actually due to the reference taxonomy (Figure 5).
  3. None of these algorithms are very good for partial reads of the 16S rRNA gene, even IDTAXA (lines in Figure 1d). There simply isn't enough information to work with in short length sequences. The difference is that IDTAXA will maintain its low error rates on short sequences, whereas the error rate of other algorithms skyrockets (points in Figure 1d versus points in Figure 1a). See Figure S5 for why that is the case.
  4. Your sequences need to be in the same orientation as the training set if you use strand="top". If you are not sure then try classifying again with strand="both". Note that strand="top" is faster than strand="both". If you use the wrong strand then sequences will be left unclassified at the Root level.
  5. There was a bug in strand="both" that was fixed in DECIPHER v2.10.2. It affected classifications only in rare cases, but it is worth updating if you are using strand="both".
  6. IDTAXA may be assigning reads at higher rank levels than the family rank that was used in your example plot. You can use plot() on the object output by IdTaxa() to see if that is the case.
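
As a rough illustration of point 6, the base R sketch below tallies the deepest taxon each sequence was assigned to. `ids_mock` is a hand-built stand-in with invented values, assuming each element carries a `taxon` vector that starts at Root and marks the truncation point with an `unclassified_` prefix; real IdTaxa() output should be inspected before relying on that shape. `deepest_assigned()` is a hypothetical helper, not part of DECIPHER.

```r
# Hand-built stand-in for classification results; values are invented.
ids_mock <- list(
  seq1 = list(taxon = c("Root", "Bacteria", "Bacteroidetes", "Bacteroidia",
                        "Bacteroidales", "Muribaculaceae")),
  seq2 = list(taxon = c("Root", "Bacteria", "Bacteroidetes", "Bacteroidia",
                        "Bacteroidales", "unclassified_Bacteroidales")),
  seq3 = list(taxon = c("Root", "unclassified_Root"))
)

# Hypothetical helper: last taxon in the lineage that is not "unclassified_*".
deepest_assigned <- function(x) {
  assigned <- x$taxon[!startsWith(x$taxon, "unclassified_")]
  assigned[length(assigned)]
}

deepest <- vapply(ids_mock, deepest_assigned, character(1))
table(deepest)
```

In this mock, seq2 would be missing from a family-level bar plot even though it was classified down to the order Bacteroidales, while seq3 was left at Root, as can happen with the wrong strand (point 4).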

Points 1 - 3 above are made in the paper that @apcamargo mentioned.

I hope that helps.

@vhertzb
Author

vhertzb commented Feb 21, 2019

Thanks to all, your comments have been so helpful.
