
Richness estimators reliant on singleton counts #103

Closed
Nick243 opened this issue Aug 8, 2016 · 2 comments
Nick243 commented Aug 8, 2016

Hello,

I just read issue #92 on the treatment of singletons and was hoping to follow up on that conversation. Are richness estimators that utilize information on singleton counts, such as CatchAll, Breakaway, Chao1, etc., appropriate to use given how singletons are treated when processing reads with dada2?

If per-sample singletons can be called when pooling sample inference across multiple samples, and dada2 greatly reduces the number of spurious OTUs/RSVs, would such richness estimators be expected to return reasonable estimates? Perhaps inflated to the extent that artifacts fail to be culled, but more reasonable than under other clustering methods? Or am I missing an important point, and should these approaches not even be considered?

I tend to think that, in general terms, these approaches could yield better estimates (albeit likely with wide standard errors). But given the inherent challenge of calling rare taxa, perhaps it is better to assume that the number of observed RSVs represents a lower bound on the number of taxa at a given sequencing depth, and to go with that as the most reasonable estimate?

If it helps frame the conversation: our group usually works with human and murine stool, and our primary interest in estimating richness is to compare it across experimental conditions.

Thanks in advance and for all the work on the package, tutorials, and informative forum.

@benjjneb (Owner)

Hi Nick,

Short answer: Yes, it is defensible to use richness estimators, on a sample-wise basis, if you have pooled your samples using dada2.

Long answer: I still wouldn't do it. The issue I have with richness estimation in this context is its dependence on the rarest variants, and the lack of any accounting for the misclassification uncertainty in those rare variants. Error bars that don't include the biggest source of error are worse than no error bars!

What a richness estimator is doing is attempting to estimate the number of variants that aren't observed, and essentially all unobserved variants will be rare. Thus, the information about the unobserved class comes almost entirely from the observed singletons and doubletons -- the observed rare variants -- from which we extrapolate out. However, it is exactly those variants for which misclassification error is so problematic in this context. Even if just 1 of every 2000 reads is misclassified as a spurious OTU, that is 50 spurious singletons in a 100k read sample, which will typically overwhelm the legitimate singletons and drive a major overestimation by, say, Chao1. DADA2 helps with this, but nothing is perfect!
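To make that arithmetic concrete, here is a minimal sketch (in Python, with made-up frequency counts; the 50 spurious singletons follow the 1-in-2000 misclassification rate on a 100k-read sample mentioned above) of how spurious singletons inflate the classic Chao1 estimate:

```python
def chao1(f1, f2, s_obs):
    """Classic Chao1 richness estimate: S_obs + F1^2 / (2 * F2).

    f1: number of singletons, f2: number of doubletons,
    s_obs: total number of observed variants. Assumes f2 > 0.
    """
    return s_obs + f1 ** 2 / (2 * f2)

# Hypothetical sample: 200 observed variants, 20 true singletons, 15 doubletons.
clean = chao1(f1=20, f2=15, s_obs=200)

# Same sample plus 50 spurious singletons from misclassified reads
# (1 in 2000 reads of a 100k-read sample), each also an "observed" variant.
noisy = chao1(f1=20 + 50, f2=15, s_obs=200 + 50)

print(clean)  # ~213.3
print(noisy)  # ~413.3
```

The spurious singletons roughly double the richness estimate, even though only 50 of 250 observed variants are artifacts; the error bars reported by the estimator know nothing about this.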

This concern is not new, and efforts such as Breakaway were developed in part to reduce the reliance of richness estimators on the low-frequency classes. But they come with their own issues: higher-frequency classes are simply much less informative about the rare unobserved variants, and estimates become more sensitive to the model assumptions. Chao (of Chao1 fame) also has a recent paper on new richness estimators that don't use singletons, but those still rely on perfect classification of doubletons.

Finally, and this is straying into unnecessary purity territory, the other thing that bothers me is that the very procedure of methods like DADA2 or UPARSE is not the "right" one for performing richness estimations.

A richness estimator depends on the total number of singletons, but that is not what DADA2 attempts to estimate. Rather, DADA2 attempts to call, individually, each rare variant that it thinks is true. A procedure for estimating the total number of singletons should not enforce individual validity: if there are 100 candidate singletons, each with a 70% chance of being real, then the expected number of true singletons is 70, but an individual-variant method may output none of them, because each is uncertain on its own!
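That distinction can be sketched numerically (Python, with a made-up 0.95 confidence threshold; DADA2's actual per-variant test works differently, this just illustrates the estimate-vs-call gap):

```python
# Hypothetical: 100 candidate singletons, each with a 70% chance of being real.
probs = [0.70] * 100

# Estimating the *total* number of true singletons sums the probabilities:
expected_total = sum(probs)  # ~70

# An individual-validity procedure keeps only candidates that clear a
# confidence threshold on their own (threshold chosen for illustration):
threshold = 0.95
called = [p for p in probs if p >= threshold]

print(round(expected_total))  # 70
print(len(called))            # 0 -- every candidate is individually uncertain
```

Both procedures see the same data; they just answer different questions, which is why the output of a per-variant caller is not the singleton count a richness estimator wants.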


Nick243 commented Aug 10, 2016

Hi Ben,

Thanks so much for the "short" and the "long" answer. I think I now understand your concern regarding how the difficulty in calling true low frequency classes drives overestimation of richness when using these estimators (in this context).

Your last comment, that DADA2 or UPARSE are not exactly the "right" approaches for obtaining the total number of singletons, is extremely helpful and is, in part, what prompted my question. Thus, I appreciate your straying!
