DADA2 produced an abnormally high number of uniques #1194

Closed
Listen-Lii opened this issue Nov 9, 2020 · 4 comments

Comments

@Listen-Lii

Hi,
I found that DADA2 produces an abnormally high Q1 (the number of uniques, i.e. species that occur in only one sample) when more than ~40 replicates are pooled.

Here are the details:
I ran DADA2, UPARSE and Deblur in parallel on soil samples from a typical grassland. From these samples I estimated the theoretical richness of the local community with abundance-based estimators (Chao1, ACE and abundance-Jack1) and incidence-based estimators (Chao2, ICE and incidence-Jack1). The DADA2 results were very strange. As shown in Fig.1, the UPARSE and Deblur results were quite normal: the estimators gradually flattened out (Fig.1a, b). With DADA2, however, the incidence-based estimators increased abnormally as the sample size increased, especially Chao2 (Fig.1c).
For the DADA2 pipeline, we used standard filtering parameters in the filterAndTrim function: maxN = 0, truncQ = 2 and maxEE = 2. Given the size of our data set, the “pseudo” option was used in the dada function for sample inference. All other steps were run with default parameters. We followed the workflow recommended by the DADA2 pipeline tutorial (1.8) to generate the ASV table.
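To make the three filtering parameters concrete, here is a minimal Python sketch of what they mean for a single read. It is not DADA2's implementation: the function names are hypothetical, and the order in which the checks are applied is an assumption of this sketch. It relies only on the usual Phred definition (error probability 10^(-Q/10)) and on the documented meaning of maxN, truncQ and maxEE.

```python
def expected_errors(quals):
    """Sum of per-base error probabilities, 10^(-Q/10), for Phred scores."""
    return sum(10 ** (-q / 10) for q in quals)


def truncate_at_quality(seq, quals, trunc_q=2):
    """Cut the read at the first base whose quality is <= trunc_q."""
    for i, q in enumerate(quals):
        if q <= trunc_q:
            return seq[:i], quals[:i]
    return seq, quals


def passes_filter(seq, quals, max_n=0, trunc_q=2, max_ee=2.0):
    """Hypothetical per-read filter: truncQ, then maxN, then maxEE.

    The order of the checks is an assumption made for illustration.
    """
    seq, quals = truncate_at_quality(seq, quals, trunc_q)
    if not seq:
        return False  # nothing left after truncation
    if seq.upper().count("N") > max_n:
        return False  # too many ambiguous bases
    return expected_errors(quals) <= max_ee
```

For example, `passes_filter("ACGT", [30, 30, 30, 30])` keeps the read, while the same read with a quality of 3 at every base is dropped because its expected errors exceed 2.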


Fig.1 Estimated richness. (a) UPARSE; (b) Deblur; (c) DADA2

Let me use Chao2 as a typical example to illustrate this result. According to the classical definition of Chao2, $S_{\mathrm{Chao2}} = S_{\mathrm{obs}} + \frac{m-1}{m} \cdot \frac{Q_1^2}{2 Q_2}$, where S_obs is the observed richness, m is the number of samples, Q1 is the number of uniques (species that occur in only one sample) and Q2 is the number of duplicates (species that occur in exactly two samples). The anomaly was mainly caused by Q1 increasing abnormally with sample size under DADA2. ICE and incidence-Jack1 also include Q1 in their formulas; since they do not square Q1, they were less affected, although the effect was still visible. This suggests that DADA2 may produce extra uniques during the dada function.
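The quadratic versus linear dependence on Q1 can be sketched in a few lines of Python. This is only an illustration of the classical formulas, not the code used for the figures: the function names and the table layout (a dict mapping each ASV to its per-sample read counts) are assumptions, and the first-order incidence jackknife is taken in its usual form S_obs + Q1 (m-1)/m.

```python
def incidence_counts(table):
    """Return (S_obs, Q1, Q2) for a dict {asv_id: [count in each sample]}."""
    prevalence = [sum(1 for c in counts if c > 0) for counts in table.values()]
    s_obs = sum(1 for p in prevalence if p > 0)   # ASVs seen at least once
    q1 = sum(1 for p in prevalence if p == 1)     # uniques
    q2 = sum(1 for p in prevalence if p == 2)     # duplicates
    return s_obs, q1, q2


def chao2(s_obs, q1, q2, m):
    """Classical Chao2: S_obs + ((m - 1) / m) * Q1^2 / (2 * Q2)."""
    if q2 == 0:
        raise ValueError("classical Chao2 is undefined when Q2 == 0")
    return s_obs + ((m - 1) / m) * q1 ** 2 / (2 * q2)


def jack1_incidence(s_obs, q1, m):
    """First-order incidence jackknife: S_obs + Q1 * (m - 1) / m."""
    return s_obs + q1 * (m - 1) / m
```

Because Q1 enters Chao2 squared, doubling Q1 at fixed Q2 quadruples the Chao2 correction term but only doubles the Jack1 term, which is in line with Chao2 being the most affected estimator.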
I then explored how Q1, Q2 and Q1²/Q2 change with sample size (Fig.2). Beyond ~40 samples Q1 keeps increasing, and this produces the anomaly in estimated richness.


Fig.2 Changes of (a) Q1, (b) Q2 and (c) Q1²/Q2 with sample size.
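Curves like those in Fig.2 can be produced by adding samples one at a time and recomputing Q1 and Q2 after each addition. The sketch below assumes each sample is given as the set of ASV ids detected in it and that samples are accumulated in the order supplied; `q_trajectory` is a hypothetical name, not a function from any of the pipelines discussed here.

```python
from collections import Counter


def q_trajectory(samples):
    """Return [(k, Q1, Q2, Q1^2/Q2)] after pooling the first k samples.

    `samples` is a list of iterables of ASV ids (one per sample).
    The ratio is None while Q2 is still zero.
    """
    prevalence = Counter()  # ASV id -> number of samples it occurs in
    rows = []
    for k, sample in enumerate(samples, start=1):
        prevalence.update(set(sample))
        q1 = sum(1 for n in prevalence.values() if n == 1)
        q2 = sum(1 for n in prevalence.values() if n == 2)
        ratio = q1 * q1 / q2 if q2 else None
        rows.append((k, q1, q2, ratio))
    return rows
```

If sampling is saturating the community, Q1 should eventually level off or decline as uniques are re-detected in later samples; a Q1 that keeps climbing means each new sample contributes ASVs that no other sample shares.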

To test this further, I chose one sample from our data set and treated it as the “species pool”. I split this sample into 200 sub-samples by randomly selecting 10,000 reads from it 200 times. These 200 sub-samples were then processed by DADA2 with the same parameters, varying only the pooling option of the dada function (three options: “pool”, “pseudo” and “unpool”). The Q1 results from the “pseudo” option were consistent with our full data set: beyond ~40 samples, Q1 increased abnormally. With the “unpool” option, Q1 increased from the very beginning. The “pool” option appeared normal.
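The resampling step can be sketched as follows. This is a minimal illustration, not the script actually used: `split_into_subsamples` is a hypothetical helper, the reads are assumed to be held in memory as a list, and each draw is assumed to be without replacement within a sub-sample while the 200 draws are independent of one another (so the same read can appear in several sub-samples).

```python
import random


def split_into_subsamples(reads, n_subsamples=200, depth=10_000, seed=0):
    """Draw `n_subsamples` independent random subsets of `depth` reads.

    Assumption: sampling is without replacement inside one sub-sample,
    but different sub-samples may share reads.
    """
    if depth > len(reads):
        raise ValueError("depth exceeds the number of reads in the sample")
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    return [rng.sample(reads, depth) for _ in range(n_subsamples)]
```

Each sub-sample would then be written out as its own FASTQ file and run through DADA2 once per pooling setting. Since every sub-sample comes from the same read pool, any ASV that appears in exactly one sub-sample is either a genuinely rare sequence that happened to be drawn once or an inference artifact.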


Fig.3 Q1 of 200 sub-samples. (a) “pool” option; (b) “pseudo” option; (c) “unpool” option.

UPARSE and Deblur were run with all samples pooled together. Pooling seems to be the right choice given the biological context, and not pooling may cause problems such as an abnormal increase in Q1. I have validated our results repeatedly. To reproduce this, you can take one sample as a local species pool and split it into more than 50 sub-samples.
Thank you for your time and I would be very grateful if you could help.

@Listen-Lii
Author

Have these figures been uploaded successfully? I can only see the figures' links, but I can't open them.

@benjjneb
Owner

Yes, the figures uploaded successfully!

To be honest, however, I am having some trouble interpreting the multipanel figures you have uploaded, in particular which treatment corresponds to each panel.

I would say that the choice of pooling or non-pooling is probably much more important than the choice of ASV algorithm, at least when it comes to estimates that are sensitive to fiddly bits about rare members of the community/data.

@Listen-Lii
Author

Hi,
Thank you for your reply! Let me introduce our experimental design briefly.

We collected 141 replicates from 1 m² of soil and wanted to obtain an accurate richness estimate by pooling these replicates. Theoretically, as sample size (and sequencing depth) increases, richness increases at a smaller and smaller rate and finally approaches the theoretical value. For our small local community in particular, the convergence should be clear and obvious.

The results showed that estimated richness did tend toward a theoretical value with UPARSE (Fig. 1a) and Deblur (Fig. 1b). With DADA2, however, the incidence-based estimators (Chao2, ICE and incidence-Jack1) were an exception. I found that the increase in Q1 was the main reason (Fig. 2). To test this further, I chose one replicate as the “species pool” and split it into 200 sub-samples by resampling. Theoretically, richness should converge to a fixed value (i.e. the richness of the selected replicate) as these sub-samples are pooled. However, this was observed only with the “pool” option (Fig. 3a). The “pseudo” and “unpool” options gave a much higher richness owing to an abnormally high Q1 (Fig. 3b, 3c).

In fact, I have checked other samples from different experiments and habitats, and all showed this kind of anomaly. Based on these results, I think this problem deserves more attention. Richness is the basis of microbial community studies, and this overestimation of richness is not specific to our study. Although current studies usually have fewer than 40 replicates, the number of replicates required will certainly grow, and more and more studies may run into this anomaly.

My concern is that, in the context of a “microbial community”, pooling sequences may be the only scientifically sound way to generate ASVs. You acknowledged in your reply that the choice of pooling or non-pooling is probably much more important than the choice of ASV algorithm. However, the default in the dada function is “unpool”, which yields a higher richness whose biological significance may be questionable.

Overall, I hope you can run more tests to verify the accuracy of the DADA2 algorithm. Please forgive me if I have caused any unintentional offense. I would be very grateful if we could continue this discussion.

@Mathildebd

Mathildebd commented Dec 16, 2020

Hi,
I am a bit new to this forum, but I am wondering whether this issue was solved?
I am facing a similar issue with my 16S amplicon dataset generated through the DADA2 pipeline (non-pooling): I obtain a tremendous number of 'unique' ASVs in my samples. All 93 samples are from similar environments (forest soil samples, northern Germany).

I find 16,978 ASVs in total, but 13,444 occur in only ONE sample.
The number rises to 14,981 if I also exclude ASVs present in only two samples, leaving 1,997 ASVs for further analysis.
I find this number quite extreme, and it makes me uneasy about the validity of the ASV table...
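The kind of prevalence filter described above can be sketched in Python as below. The function name and the table layout (a dict from ASV id to per-sample read counts) are assumptions for illustration. The arithmetic at the end uses only the counts quoted in this comment and reads 14,981 as the number of ASVs present in one or two samples.

```python
def prevalence_filter(table, min_samples=3):
    """Keep ASVs detected (count > 0) in at least `min_samples` samples."""
    return {
        asv: counts
        for asv, counts in table.items()
        if sum(1 for c in counts if c > 0) >= min_samples
    }


# Consistency check on the counts quoted above.
total_asvs = 16_978
in_one_sample = 13_444           # Q1: ASVs found in exactly one sample
in_one_or_two_samples = 14_981   # assumed reading of the second figure

in_two_samples = in_one_or_two_samples - in_one_sample   # Q2 = 1,537
remaining = total_asvs - in_one_or_two_samples            # 1,997 ASVs kept
unique_fraction = in_one_sample / total_asvs              # roughly 79%
```

On this reading, about four out of five ASVs occur in a single sample, and Q1 is nearly nine times Q2, which is the pattern that inflates Chao2 in the original report.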

This issue was closed.
3 participants