Wrong classification results for some ASVs (assignTaxonomy function) #1441

Closed
SWittouck opened this issue Oct 27, 2021 · 2 comments

@SWittouck

Hi @benjjneb and fellow DADA2 fans,

I noticed recently that for a very small number of ASVs, the DADA2 assignTaxonomy function is returning a classification that is obviously wrong.

I'll give two examples. In the first, an ASV is classified to the genus Lentilactobacillus by DADA2, while a BLASTN search against the exact same reference database results in a perfect match with a Lactiplantibacillus 16S sequence. In the second example, an ASV is classified to the species Apilactobacillus apinorum, while the best BLASTN hit against the same reference database is to the genus Lacticaseibacillus (with only 92% identity, which makes sense because according to EzBioCloud, the ASV is a Weissella, which isn't present in the reference database).

Here is the code for the examples. Any help is greatly appreciated.

Cheers,
Stijn

@benjjneb
Owner

benjjneb commented Oct 29, 2021

I ran the following:

library(dada2); packageVersion("dada2")
setwd("~/Desktop/wrong_classifications/")

fin_refdb_current <- "refdbs/SSUrRNA_GTDB05-lactobacillaceae-all_DADA2.fna"
fin_refdb_previous <- "refdbs/bac120_ssu_r89_dada2_oldlactobacillus_noprefix.fna"

sq1 <- "GCAAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGCGCAGGCGGTTTTTTAAGTCTGATGTGAAAGCCTTCGGCTTAACCGGAGAAGTGCATCGGAAACTGGAAAACTTGAGTGCAGAAGAGGACAGTGGAACTCCATGTGTAGCGGTGAAATGCGTAGATATATGGAAGAACACCAGTGGCGAAGGCGGCTGTCTGGTCTGTAACTGACGCTGAGGCTCGAAAGTATGGGT"
sq2 <- "CCAAGCGTTATCCGGATTTATTGGGCGTAAAGCGAGCGCAGACGGTTATTTAAGTCTGAAGTGAAAGCCCTCAGCTCAACTGAGGAATTGCTTTGGAAACTGGATGACTTGAGTGCAGTAGAGGAAAGTGGAACTCCATGTGTAGCGGTGAAATGCGTAGATATATGGAAGAACACCAGTGGCGAAGGCGGCTTTCTGGACTGTAACTGACGTTGAGGCTCGAAAGTGTGGGT"

nchar(sq1); nchar(sq2)

tt <- dada2::assignTaxonomy(c(sq1, sq2), refFasta = fin_refdb_current)
unname(tt)

[1,] "Bacteria" "Firmicutes" "Bacilli" "Lactobacillales" "Lactobacillaceae" "Lentilactobacillus" NA
[2,] "Bacteria" "Firmicutes" "Bacilli" "Lactobacillales" "Lactobacillaceae" "Weissella" NA

So I can reproduce what you are observing.

I then added the check for reverse-complement orientation:

tt2 <- dada2::assignTaxonomy(c(sq1, sq2), refFasta = fin_refdb_current, tryRC=TRUE)
unname(tt2)

[1,] "Bacteria" "Firmicutes" "Bacilli" "Lactobacillales" "Lactobacillaceae" "Lactiplantibacillus" "Lactiplantibacillus sp005405125"
[2,] "Bacteria" "Firmicutes" "Bacilli" "Lactobacillales" "Lactobacillaceae" "Weissella" "Weissella confusa_B"

So, it looks like the first misclassification occurs because the first query sequence is in the reverse-complement orientation relative to the reference database. It's probably best to always use tryRC=TRUE to catch these sorts of issues, especially with custom references.
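
To make the orientation issue concrete, here is a rough base-R sketch of how one could check which orientation of a query shares more k-mers with a reference. It is only an illustration of the idea, not how assignTaxonomy implements tryRC internally; the helper names (rev_comp, kmers, orientation_check) and the toy sequences are made up for this example.

```r
# Reverse complement of a DNA string (A, C, G, T only; ambiguity codes are not handled).
rev_comp <- function(s) {
  comp <- chartr("ACGT", "TGCA", s)
  paste(rev(strsplit(comp, "")[[1]]), collapse = "")
}

# All distinct k-mers of a sequence.
kmers <- function(s, k = 8) {
  n <- nchar(s)
  if (n < k) return(character(0))
  unique(substring(s, 1:(n - k + 1), k:n))
}

# Number of distinct k-mers shared with a reference, in each orientation.
orientation_check <- function(query, ref, k = 8) {
  ref_k <- kmers(ref, k)
  c(forward = sum(kmers(query, k) %in% ref_k),
    revcomp = sum(kmers(rev_comp(query), k) %in% ref_k))
}

# Toy example (made-up sequences): the query is the reverse complement
# of a stretch of the reference.
ref   <- "ACGTTGCAAGGCTTAACCGGATCGATTACGGCATGCAAGT"
query <- rev_comp(substr(ref, 5, 30))
res   <- orientation_check(query, ref)
res
```

If the revcomp count is clearly higher than the forward count for a query against its expected reference sequence, the query is probably in the opposite orientation to the database.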

The second, I think, simply reflects a shortcoming of the naive Bayesian classifier method (which is what assignTaxonomy implements). If there is no good match in the reference db, sequences can sometimes be over-confidently assigned to the closest taxon that does exist, even if the match is significantly below the similarity level that would be expected for members of the same taxon (e.g. the same genus).
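
One possible way to flag assignments like this is to look at the bootstrap support behind each rank and apply a stricter cutoff than the default. assignTaxonomy has minBoot and outputBootstraps arguments for this; the sketch below uses toy matrices in place of a real result, on the assumption that outputBootstraps=TRUE returns a list with matching tax and boot matrices (worth checking against your installed version). The mask_low_support helper and all values are hypothetical, and a stricter cutoff may reduce over-classification without guaranteeing it catches every case.

```r
# With dada2 installed, the real call would look something like:
#   res  <- dada2::assignTaxonomy(seqs, refFasta = ref_path, tryRC = TRUE,
#                                 outputBootstraps = TRUE)
#   tax  <- res$tax
#   boot <- res$boot
# Toy stand-ins (made-up values, not results from this issue):
tax <- rbind(c("Lactobacillaceae", "GenusA", "GenusA sp1"),
             c("Lactobacillaceae", "GenusB", "GenusB sp2"))
boot <- rbind(c(100, 98, 91),
              c(100, 62, 55))
colnames(tax)  <- c("Family", "Genus", "Species")
colnames(boot) <- c("Family", "Genus", "Species")

# Hypothetical helper: set ranks below a stricter bootstrap threshold to NA.
mask_low_support <- function(tax, boot, min_boot = 80) {
  stopifnot(identical(dim(tax), dim(boot)))
  tax[boot < min_boot] <- NA
  tax
}

masked <- mask_low_support(tax, boot, min_boot = 80)
masked
```

Passing a higher minBoot directly to assignTaxonomy should have a similar effect; keeping the bootstrap matrix around just makes it easier to see how close each call was to the cutoff.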

@SWittouck
Author

Hi Benjamin,

The reference database I'm using is a subset of the 16S sequences extracted from whole genomes by the GTDB, so I'm confused as to how it can contain reverse complemented sequences. But I'm glad that there is such a simple fix to the problem!

Thank you so much for your help!

Best wishes,
Stijn
