Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

In dada2 default data (i.e. pooled = FALSE) richness correlates with total amplicons but rarefaction curves plateau early because the data usually contains almost no singletons. How to correctly correct for this effect? #317

Closed
TBrach opened this issue Aug 21, 2017 · 7 comments

Comments

@TBrach
Copy link

@TBrach TBrach commented Aug 21, 2017

We would like to discuss how best to correct richness/alpha diversity for differences in sequencing depth (total amplicons) when using the default dada2 pipeline, i.e. when using dada(, pooled = FALSE) (https://benjjneb.github.io/dada2/pool.html).

We generally observe in our dada2 generated count tables (seqtab) a clear trend of increasing richness with increasing number of total amplicons per sample. This is not surprising or unexpected in itself. However, the dada2-generated count tables usually contain virtually no singletons. In consequence rarefaction curves plateau at a low sequencing depth that is usually well below the lowest number of total amplicons of any sample in the dataset. This would indicate that there should be no correlation between richness and total amplicons within the range of sequencing depths we are working with. Yet the trend is clearly present. In other words, there is a correlation between richness and total amplicons, but rarefaction on total amplicons does not correct for it.

To illustrate, here are plots from a study that compared the gut microbiome of two groups of mice. The groups are not necessary to illustrate the general point we would like to discuss, but they hopefully help illustrating differences between the discussed methods.

With dada(, pooled = FALSE) we have a clear trend of increasing richness with increasing number of total amplicons. Group 2 (grp2) samples have higher richness, but have also been sequenced deeper.

alpha_div_no_pool

Rarefaction curves plateau very early indicating little or no increase in richness above 10,000 total amplicons (NB: the sample with lowest sequencing depth had 27648 total amplicons).

rarecurves_no_pool

As a comparison, the same samples were run with dada(, pooled = TRUE). Here are singletons in the count table and rarefaction curves do not saturate quickly.

alpha_div_pool

rarecurves_pool

Since rarefaction on total amplicons does not seem to correct for the possible confounding of sequencing depth when working with dada(,pooled = FASLE) data, we would like to discuss options to do so.
Three suggestions we have come up with:

  • running dada( pooled = TRUE) and check if richness trends (e.g. comparison between groups) hold also for pooled data.
    • This is computationally challenging and not really feasible for larger datasets.
  • Do rarefaction already on the filtered reads and then run dada( ,pooled = FALSE) (so one extra run for richness/alpha diversity analyses only)
  • run dada(, pooled = FALSE) and test for richness trends with correction for total amplicons, e.g. lm(richness ~ group + total_amplicons)

To exemplify the second suggestion, rarefaction on filtered reads and then running dada(, pooled = FALSE) resulted in rarefaction curves that saturated much slower than those from running dada(, pooled = FALSE) on all filtered reads and subsequent rarefaction on total amplicons.

rare_filt_reads_vs_rare_total_amplicons

Thanks in advance for any thoughts and/or explanations.

@benjjneb

This comment has been minimized.

Copy link
Owner

@benjjneb benjjneb commented Aug 22, 2017

Short answer on how to calculate richness: Don't.

Long answer (with why): The problem of calculating richness comes down to the problem of estimating the number of types (ASVs here) that were observed zero times in the data. For some background, see the two lectures by Amy Willis on alpha-diversity at STAMPS this year: https://stamps.mbl.edu/index.php/Schedule

The information on the zero-observation class comes almost entirely from the number of things you saw 1 time, and 2 times. That is because you are trying to estimate how many rare things are around that you didn't see, and the rare things you did see (that inform you about the unseen others) will be seen 1 or 2 times. Different methods (e.g. rarefaction or Chao's S1) all have that basic dependence.

The problem is that the various statistical techniques that have been developed don't account the most important error mode in next-gen amplicon data: some sequences are being misclassified as new types that simply don't exist in reality.

That is, some errors/chimeras/artefacts/contaminants get interpreted as real variants (because they are different enough) and those show up largely in the singleton-doubleton class and can almost entirely drive richness estimates to nonsensical values. The literature is replete with massive order-of-magnitude richness overestimates due to this.

DADA2 confronts the difficult problem of calling very low-frequency things by not calling singletons, because its too hard to get right and the FPs outweight the FNs (some other methods, e.g. UPARSE, do the same). However, that means all those old richness estimate methods break explicitly, rather than just failing silently (i.e. by giving terribly wrong values).

I haven't seen a method yet that I believe is accurate enough at calling singletons/doubletons in deep NGS amplicon data that would make richness estimates worth doing. Pooled DADA2 is maybe as close as you'll get, but I still wouldn't do it.

@joey711

This comment has been minimized.

Copy link
Collaborator

@joey711 joey711 commented Aug 25, 2017

I agree with Ben. I also want to add that I don't see the "problem" in the data motivating the post. That apparent correlation is weak and driven by an outlier. Generally it doesn't look like much, and your rarefaction curves are pretty convincingly plateaued, as you said. It looks to me like there are differences in richness between the two groups, and you're picking that up reliably with dada2 output.

Also, that approximate number of ASVs is consistent with other data from mice that I've seen.

As a side note about terminology, "amplicons" refers to the PCR amplified molecules. I'm pretty sure what you meant throughout OP was "reads", not amplicons, since we don't directly observe or count amplicons, but rather their sequences emitted from the DNA sequencer.

@benjjneb benjjneb closed this Sep 20, 2017
@adriaaula

This comment has been minimized.

Copy link

@adriaaula adriaaula commented Jan 3, 2018

Hello there,

Thank you for both explanations here @joey711 and @benjjneb , quite useful! I love the feeling of having a doubt using the package and knowing that there will be a good explanation in the issue page.

I know it is selfish to ask, but it would be awesome to have a more detailed comparison of this topic in a blogpost or similar. There are some explanations here and there.

Noah Fierer had a blog post comparing some results:
http://fiererlab.org/2017/05/02/lumping-versus-splitting-is-it-time-for-microbial-ecologists-to-abandon-otus/

A preprint from Amy Willis, following the presentations from STAMPs is quite clear about new methods for measuring alpha diversity:
https://www.biorxiv.org/content/early/2017/12/11/231878

But i failed to find a more detailed analysis about the topic: comparison of results with asv vs otu, the concept of richness in the context of sequence variant//OTU at a defined cluster threshold//richness as species;... Maybe you know of some other comparisons out there. Perhaps @adw96 would like to help with it :-)

@adw96

This comment has been minimized.

Copy link

@adw96 adw96 commented Jan 8, 2018

Thanks for looping me in, @benjjneb and @adriaaula! I'm currently in the process of rewriting my alpha diversity analysis package "breakaway" to better sync with phyloseq, and when that is up and running I will write a blog post or tutorial about microbial diversity estimation and when it is and isn't possible. I'll do my best to get this to you soon and try to remember to link it here.

There are much, much better ways to estimate total microbial richness than with sample richness, which is what you used in your analysis, @adriaaula, but without looking more closely at the structure of your data it's hard to say this is a first-order or second-order issue.

Why would it be a second order issue?

In short, we can talk about estimating total (observed + unobserved) richness at the genus level (and any more coarse grouping), and perhaps even species level in some environments. However, in some environments when we start talking about OTUs and ASV, there is just so much diversity that we haven't seen that it is generally going to be impossible to reliably estimate total richness.

To give a reasonable analogy for why estimating ASV richness is a difficult statistical task, think about how valid it would be to ask your 10 best friends for their rating of the latest Star Wars movie, taking the average, and using that as an estimate for your country's overall rating of the movie. You have sampled a tiny fraction of the population of interest, plus, all of your friends ratings are probably going to be correlated -- just like a microbial network. The combination of tiny sample fraction and highly networked communities means you can draw only little inference from a sample. Furthermore, estimating averages is a much easier problem then estimating richness (in terms of information accumulation). This analogy is a little fuzzy, but I hope the intuition is helpful.

@adriaaula I'll (try to remember to) let you know when I do that blog post.

Amy

@adriaaula

This comment has been minimized.

Copy link

@adriaaula adriaaula commented Jan 9, 2018

Amazing answer, thank you for taking the effort @adw96 !
In fact, I just jumped in an Issue that was related with the topic. Maybe I should have opened another issue for that.

I will try to be aware of your publications in the future just to have a clear idea of what is the best implementation.

Again, thank you for your work :-)

@jlrolando

This comment has been minimized.

Copy link

@jlrolando jlrolando commented Apr 15, 2019

@benjjneb I agree that calculating richness might be futile for sequence analysis. However, what would you suggest for other alpha diversity indexes such as Shannon or Simpson Index, which explains something about the structure of the microbial community -evenness, for instance-? What would you suggest for those indexes that are also affected by richness (i.e., number of observed ASVs)?

How could you correct for the bias on number of reads? It seems that this approach suggested by @TBrach would be the most accurate?

"Do rarefaction already on the filtered reads and then run dada( ,pooled = FALSE) (so one extra run for richness/alpha diversity analyses only)"

@benjjneb

This comment has been minimized.

Copy link
Owner

@benjjneb benjjneb commented Apr 15, 2019

However, what would you suggest for other alpha diversity indexes such as Shannon or Simpson Index, which explains something about the structure of the microbial community -evenness, for instance-?

Meausures that are primarily driven by "evenness" (i.e. the relative difference of taxa at appreciable freuqencies in the community) are robust to noise in the rare variants, and thus to library size differences (as long as you exclude the very low depth samples).

What would you suggest for those indexes that are also affected by richness (i.e., number of observed ASVs)?

These indices are excuisitely sensitive to error rates, bioinformatic specifics, and library size. My general suggestion would be to not use them, if you must than observed ASVs at a constant library size w/in a study is probably the most justifiable.

How could you correct for the bias on number of reads? It seems that this approach suggested by @TBrach would be the most accurate?

"Do rarefaction already on the filtered reads and then run dada( ,pooled = FALSE) (so one extra run for richness/alpha diversity analyses only)"

As far as these things go, that is probably about as good as it gets.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
6 participants
You can’t perform that action at this time.