Richness estimators reliant on singleton counts #103
Hi Nick,

Short answer: Yes, it is defensible to use richness estimators, on a sample-wise basis, if you have pooled your samples using dada2.

Long answer: I still wouldn't do it. The issue I have with richness estimation in this context is its dependence on the rarest variants, and the lack of any accounting for the misclassification uncertainty in those rare variants. Error bars that don't include the biggest source of error are worse than no error bars!

What a richness estimator is doing is attempting to estimate the number of variants that aren't observed, and essentially all unobserved variants will be rare. Thus, the information about the unobserved class comes almost entirely from the observed singletons and doubletons -- the observed rare variants -- from which we extrapolate out. However, it is exactly those variants for which misclassification error is so problematic in this context. Even if just 1 of every 2000 reads is misclassified as a spurious OTU, that is 50 spurious singletons in a 100k-read sample, which will typically overwhelm the legitimate singletons and drive a major overestimation by, say, Chao1. DADA2 helps with this, but nothing is perfect!

This concern is not new, and efforts such as Breakaway were developed in part to reduce the reliance of richness estimators on the low-frequency classes, but they come with their own issues: higher-frequency classes are simply much less informative about the rare unobserved variants, and estimates become more sensitive to the model assumptions. Chao (of Chao1 fame) also has a recent paper on new richness estimators that don't use singletons, but those still rely on perfect classification of doubletons.

Finally, and this is straying into unnecessary-purity territory, the other thing that bothers me is that the very procedure of methods like DADA2 or UPARSE is not the "right" one for performing richness estimation.
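[Editor's note: the inflation described above can be sketched numerically with the classic Chao1 formula, S_obs + F1²/(2·F2). The sample sizes below (500 observed variants, 10 true singletons, 20 doubletons) are hypothetical; the 50 spurious singletons come from the 1-in-2000 misclassification rate in a 100k-read sample mentioned in the comment.]

```python
def chao1(s_obs, f1, f2):
    """Classic Chao1 richness estimate: observed richness plus a
    correction extrapolated from singletons (f1) and doubletons (f2)."""
    return s_obs + (f1 * f1) / (2 * f2)

# Hypothetical clean sample: 500 observed variants, 10 singletons, 20 doubletons.
clean = chao1(500, 10, 20)            # 502.5

# Add 50 spurious singletons (1 misclassified read per 2000 in a 100k-read
# sample): they raise both the observed count and f1, and f1 enters squared.
noisy = chao1(500 + 50, 10 + 50, 20)  # 640.0
```

Because F1 is squared in the correction term, the 50 spurious singletons inflate the estimate far beyond the 50 extra observed variants themselves.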
A richness estimator depends on the total number of singletons, but that is not what DADA2 attempts to estimate. Rather, DADA2 is attempting to individually call all the rare variants that it thinks are true. A procedure to estimate the total number of singletons should not enforce individual validity -- if there are 100 candidate singletons, each of which has a 70% chance of being real, then the estimate of total singletons is 70, but an individual-variant method may output none of them, as each is uncertain alone!
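[Editor's note: a minimal numerical sketch of the distinction above, using the 100-candidate, 70%-probability figures from the comment; the 0.95 call threshold is a hypothetical stand-in for an individual-variant method's confidence cutoff.]

```python
# 100 candidate singletons, each with a 70% chance of being a real variant.
posteriors = [0.70] * 100

# Estimating the *total* number of true singletons: sum the probabilities.
expected_true_singletons = sum(posteriors)        # 70.0

# Calling variants *individually* at a strict confidence threshold:
called = [p for p in posteriors if p >= 0.95]     # empty list

print(expected_true_singletons, len(called))      # 70.0 candidates expected real, 0 called
```

Both computations use the same inputs, yet one says "about 70 true singletons" while the other outputs none, which is exactly why a per-variant caller is the wrong tool for estimating the singleton count.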
Hi Ben,

Thanks so much for the "short" and the "long" answer. I think I now understand your concern regarding how the difficulty in calling true low-frequency classes drives overestimation of richness when using these estimators (in this context). Your last comment, that DADA2 or UPARSE are not exactly the "right" approaches for obtaining the total number of singletons, is extremely helpful and is, in part, what prompted my question. Thus, I appreciate your straying!
Hello,
I just read issue #92 on the treatment of singletons and was hoping I might be able to follow up on that conversation to ask your thoughts: are richness estimators that utilize singleton counts, such as CatchAll, Breakaway, Chao1, etc., appropriate to use given how singletons are treated when processing reads with dada2?
If per-sample singletons can be called when pooling sample inference across multiple samples, and dada2 greatly reduces the number of spurious OTUs/RSVs, might such richness estimators be expected to return reasonable estimates? Perhaps inflated to the extent that artifacts fail to be culled, but more reasonable than for other clustering methods? Or am I missing an important point, and these approaches should not even be considered?
I tend to think that, in general terms, these approaches could yield better estimates (albeit with likely wide standard errors), but given the inherent challenge of calling rare taxa, perhaps it is better to just treat the number of observed RSVs as a lower bound on the number of taxa at a given sequencing depth and go with that as the most reasonable estimate?
If it helps frame the conversation, our group usually works with human and murine stool, and our primary interest in estimating richness is to compare it across experimental conditions.
Thanks in advance and for all the work on the package, tutorials, and informative forum.