Join GitHub today
GitHub is home to over 40 million developers working together to host and review code, manage projects, and build software together.Sign up
In dada2 default data (i.e. pooled = FALSE) richness correlates with total amplicons but rarefaction curves plateau early because the data usually contains almost no singletons. How to correctly correct for this effect? #317
We would like to discuss how best to correct richness/alpha diversity for differences in sequencing depth (total amplicons) when using the default dada2 pipeline, i.e. when using dada(, pooled = FALSE) (https://benjjneb.github.io/dada2/pool.html).
We generally observe in our dada2 generated count tables (seqtab) a clear trend of increasing richness with increasing number of total amplicons per sample. This is not surprising or unexpected in itself. However, the dada2-generated count tables usually contain virtually no singletons. In consequence rarefaction curves plateau at a low sequencing depth that is usually well below the lowest number of total amplicons of any sample in the dataset. This would indicate that there should be no correlation between richness and total amplicons within the range of sequencing depths we are working with. Yet the trend is clearly present. In other words, there is a correlation between richness and total amplicons, but rarefaction on total amplicons does not correct for it.
To illustrate, here are plots from a study that compared the gut microbiome of two groups of mice. The groups are not necessary to illustrate the general point we would like to discuss, but they hopefully help illustrating differences between the discussed methods.
With dada(, pooled = FALSE) we have a clear trend of increasing richness with increasing number of total amplicons. Group 2 (grp2) samples have higher richness, but have also been sequenced deeper.
Rarefaction curves plateau very early indicating little or no increase in richness above 10,000 total amplicons (NB: the sample with lowest sequencing depth had 27648 total amplicons).
As a comparison, the same samples were run with dada(, pooled = TRUE). Here are singletons in the count table and rarefaction curves do not saturate quickly.
Since rarefaction on total amplicons does not seem to correct for the possible confounding of sequencing depth when working with dada(,pooled = FASLE) data, we would like to discuss options to do so.
To exemplify the second suggestion, rarefaction on filtered reads and then running dada(, pooled = FALSE) resulted in rarefaction curves that saturated much slower than those from running dada(, pooled = FALSE) on all filtered reads and subsequent rarefaction on total amplicons.
Thanks in advance for any thoughts and/or explanations.
Short answer on how to calculate richness: Don't.
Long answer (with why): The problem of calculating richness comes down to the problem of estimating the number of types (ASVs here) that were observed zero times in the data. For some background, see the two lectures by Amy Willis on alpha-diversity at STAMPS this year: https://stamps.mbl.edu/index.php/Schedule
The information on the zero-observation class comes almost entirely from the number of things you saw 1 time, and 2 times. That is because you are trying to estimate how many rare things are around that you didn't see, and the rare things you did see (that inform you about the unseen others) will be seen 1 or 2 times. Different methods (e.g. rarefaction or Chao's S1) all have that basic dependence.
The problem is that the various statistical techniques that have been developed don't account the most important error mode in next-gen amplicon data: some sequences are being misclassified as new types that simply don't exist in reality.
That is, some errors/chimeras/artefacts/contaminants get interpreted as real variants (because they are different enough) and those show up largely in the singleton-doubleton class and can almost entirely drive richness estimates to nonsensical values. The literature is replete with massive order-of-magnitude richness overestimates due to this.
DADA2 confronts the difficult problem of calling very low-frequency things by not calling singletons, because its too hard to get right and the FPs outweight the FNs (some other methods, e.g. UPARSE, do the same). However, that means all those old richness estimate methods break explicitly, rather than just failing silently (i.e. by giving terribly wrong values).
I haven't seen a method yet that I believe is accurate enough at calling singletons/doubletons in deep NGS amplicon data that would make richness estimates worth doing. Pooled DADA2 is maybe as close as you'll get, but I still wouldn't do it.
I agree with Ben. I also want to add that I don't see the "problem" in the data motivating the post. That apparent correlation is weak and driven by an outlier. Generally it doesn't look like much, and your rarefaction curves are pretty convincingly plateaued, as you said. It looks to me like there are differences in richness between the two groups, and you're picking that up reliably with dada2 output.
Also, that approximate number of ASVs is consistent with other data from mice that I've seen.
As a side note about terminology, "amplicons" refers to the PCR amplified molecules. I'm pretty sure what you meant throughout OP was "reads", not amplicons, since we don't directly observe or count amplicons, but rather their sequences emitted from the DNA sequencer.
I know it is selfish to ask, but it would be awesome to have a more detailed comparison of this topic in a blogpost or similar. There are some explanations here and there.
Noah Fierer had a blog post comparing some results:
A preprint from Amy Willis, following the presentations from STAMPs is quite clear about new methods for measuring alpha diversity:
But i failed to find a more detailed analysis about the topic: comparison of results with asv vs otu, the concept of richness in the context of sequence variant//OTU at a defined cluster threshold//richness as species;... Maybe you know of some other comparisons out there. Perhaps @adw96 would like to help with it :-)
Thanks for looping me in, @benjjneb and @adriaaula! I'm currently in the process of rewriting my alpha diversity analysis package "breakaway" to better sync with phyloseq, and when that is up and running I will write a blog post or tutorial about microbial diversity estimation and when it is and isn't possible. I'll do my best to get this to you soon and try to remember to link it here.
There are much, much better ways to estimate total microbial richness than with sample richness, which is what you used in your analysis, @adriaaula, but without looking more closely at the structure of your data it's hard to say this is a first-order or second-order issue.
Why would it be a second order issue?
In short, we can talk about estimating total (observed + unobserved) richness at the genus level (and any more coarse grouping), and perhaps even species level in some environments. However, in some environments when we start talking about OTUs and ASV, there is just so much diversity that we haven't seen that it is generally going to be impossible to reliably estimate total richness.
To give a reasonable analogy for why estimating ASV richness is a difficult statistical task, think about how valid it would be to ask your 10 best friends for their rating of the latest Star Wars movie, taking the average, and using that as an estimate for your country's overall rating of the movie. You have sampled a tiny fraction of the population of interest, plus, all of your friends ratings are probably going to be correlated -- just like a microbial network. The combination of tiny sample fraction and highly networked communities means you can draw only little inference from a sample. Furthermore, estimating averages is a much easier problem then estimating richness (in terms of information accumulation). This analogy is a little fuzzy, but I hope the intuition is helpful.
@adriaaula I'll (try to remember to) let you know when I do that blog post.
Amazing answer, thank you for taking the effort @adw96 !
I will try to be aware of your publications in the future just to have a clear idea of what is the best implementation.
Again, thank you for your work :-)
@benjjneb I agree that calculating richness might be futile for sequence analysis. However, what would you suggest for other alpha diversity indexes such as Shannon or Simpson Index, which explains something about the structure of the microbial community -evenness, for instance-? What would you suggest for those indexes that are also affected by richness (i.e., number of observed ASVs)?
How could you correct for the bias on number of reads? It seems that this approach suggested by @TBrach would be the most accurate?
"Do rarefaction already on the filtered reads and then run dada( ,pooled = FALSE) (so one extra run for richness/alpha diversity analyses only)"
Meausures that are primarily driven by "evenness" (i.e. the relative difference of taxa at appreciable freuqencies in the community) are robust to noise in the rare variants, and thus to library size differences (as long as you exclude the very low depth samples).
These indices are excuisitely sensitive to error rates, bioinformatic specifics, and library size. My general suggestion would be to not use them, if you must than observed ASVs at a constant library size w/in a study is probably the most justifiable.
As far as these things go, that is probably about as good as it gets.