Mock2 expected abundance replication issue #85
Comments
Expected abundances in a mock community are never replicated 100%, because many factors skew the relative abundances (human imprecision, copy number variation, PCR and sequencing error or bias), and to a large degree these cannot be corrected for bioinformatically. I have a more in-depth discussion on this on the QIIME2 forum. Your results actually look pretty good, and I don't think you are doing anything "wrong", particularly if you consider that taxa like [Eubacterium] and Eubacterium might be the same thing. Some things that could improve accuracy:
Have you calculated precision/recall or some other accuracy metric on these data? You can follow the notebooks in tax-credit to run these same evaluations, or we have a method for calculating some of these metrics in QIIME2. I am going to close this issue, since there is not really an error here, but please let me know if you have any more questions or comments!
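To illustrate the point about [Eubacterium] and Eubacterium, here is a minimal sketch of how genus labels could be normalized before comparing the expected and observed genus lists. The function names, the `g__` prefix handling, and the decision to treat bracketed and unbracketed names as the same genus are all assumptions made for illustration; this is not the tax-credit or QIIME2 evaluation code.

```python
import re


def normalize_genus(label: str) -> str:
    """Return a comparable genus name.

    Assumption: a leading 'g__' rank prefix and surrounding square
    brackets carry no meaning for this comparison, so
    'g__[Eubacterium]' and 'Eubacterium' are treated as the same genus.
    """
    label = re.sub(r"^g__", "", label.strip())
    return label.strip("[]").lower()


def compare_genera(expected, observed):
    """Split genera into shared, unexpected (observed only) and missed (expected only)."""
    exp = {normalize_genus(g) for g in expected} - {""}
    obs = {normalize_genus(g) for g in observed} - {""}
    return {
        "shared": sorted(exp & obs),
        "unexpected": sorted(obs - exp),
        "missed": sorted(exp - obs),
    }
```

With this normalization, a genus reported as `g__[Eubacterium]` counts as a match for an expected `g__Eubacterium` rather than as one missed genus plus one misclassified genus.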
Thank you for your quick response. https://github.com/caporaso-lab/mockrobiota/tree/master/data/mock-2 In this repository the following files are provided:
According to the README file, this mock community is referred to as the B2 dataset in Bokulich et al. 2015.
Either I am still misunderstanding, or there is no misunderstanding. By "replicate", you mean that you are attempting to analyze the mock-2 data and detect the expected genera at the expected abundances. Correct? I think I have identified the source of confusion. It looks like you are interpreting the Does that make sense? So I think you are actually doing everything correctly and your "found" abundances actually look quite good. For example, by looking at either the 2015 or 2017 preprints, you will see that none of the methods actually have 100% accuracy at species level for mock communities. This is because taxon abundances are always skewed during sample handling, PCR, and sequencing, and no bioinformatic analysis can correct for this perfectly. Genus-level classifications are quite a lot better but still not perfect, for the same reason.
You're absolutely correct about what I was aiming for and about my misinterpretation of the "greengenes_13_8_99 expected taxonomy abundance", and yes, your explanation makes complete sense. Thank you very much for your time and explanation.
Hello,
I'm trying to replicate the Mock2 expected abundance results at 99% confidence, and more specifically only the expected abundance at the genus level.
From the provided data I used only the forward read, which I trimmed at position 135 toward the 3' end. I didn't use the reverse read because its quality didn't seem good enough.
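For reference, the trimming step can be sketched as follows. This is only an illustration: it assumes "trimmed at position 135" means keeping the first 135 bases of each forward read, and that the input is an uncompressed FASTQ file with exactly four lines per record. The function name `truncate_fastq_lines` is hypothetical and is not part of QIIME.

```python
from itertools import islice


def truncate_fastq_lines(lines, length=135):
    """Yield FASTQ lines with sequence and quality cut to `length` bases.

    Assumes four lines per record: header, sequence, '+', quality.
    """
    it = iter(lines)
    while True:
        record = list(islice(it, 4))
        if not record:
            return  # clean end of input
        if len(record) < 4:
            raise ValueError("incomplete FASTQ record")
        header, seq, plus, qual = (line.rstrip("\n") for line in record)
        yield header
        yield seq[:length]   # keep the 5' portion of the read
        yield plus
        yield qual[:length]  # quality string must stay the same length as the sequence
```

To apply it to a file, the output lines can be written back with newlines, for example `dst.writelines(line + "\n" for line in truncate_fastq_lines(src))`, where `src` and `dst` are open file handles.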
I ran various trials using QIIME 1.9.1, and the parameters that brought me closest to the expected results were the following:
Genera identified:
In total 28 genera were identified.
5/28 genera were not in the expected results (denoted as misclassified).
23/24 of the expected genera were identified.
Genera abundances:
In most cases the abundances did not match the expected ones. Please see the results at the bottom.
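One way to put a number on how far the observed abundances are from the expected ones is a dissimilarity measure such as Bray-Curtis. The sketch below assumes both profiles are dictionaries mapping genus name to relative abundance; the genus names and values in the usage note are made up for illustration and are not the mock-2 results.

```python
def bray_curtis(expected, observed):
    """Bray-Curtis dissimilarity between two abundance profiles.

    Genera missing from one profile are treated as abundance 0.
    Returns 0.0 for identical profiles and 1.0 for profiles with no overlap.
    """
    genera = set(expected) | set(observed)
    diff = sum(abs(expected.get(g, 0.0) - observed.get(g, 0.0)) for g in genera)
    total = sum(expected.get(g, 0.0) + observed.get(g, 0.0) for g in genera)
    if total == 0:
        raise ValueError("both profiles are empty")
    return diff / total
```

For example, with hypothetical profiles `{"A": 0.5, "B": 0.5}` (expected) and `{"A": 0.25, "B": 0.5, "C": 0.25}` (observed), the function returns 0.25.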
Bokulich et al. 2015
According to the paper, sortmerna should give better precision, recall and F-measure. I used the parameters mentioned in the methods for sortmerna (0.51:0.8:1:0.8:1) and actually got worse results (8/29 genera misclassified, 21/24 of the expected genera identified).
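To compare the two runs on the same scale, the genus counts above can be turned into presence/absence precision, recall and F-measure. This is a rough sketch that uses only the counts reported in this post; the function name is hypothetical, and the paper's own evaluation may compute these metrics differently.

```python
def genus_metrics(n_observed, n_unexpected, n_expected):
    """Presence/absence precision, recall and F-measure at genus level.

    n_observed:   genera reported by the classifier
    n_unexpected: reported genera that are not in the expected list
    n_expected:   genera in the expected list
    """
    true_pos = n_observed - n_unexpected
    precision = true_pos / n_observed
    recall = true_pos / n_expected
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Counts taken from this post.
rdp = genus_metrics(n_observed=28, n_unexpected=5, n_expected=24)
sortmerna = genus_metrics(n_observed=29, n_unexpected=8, n_expected=24)
print("RDP:       P=%.3f R=%.3f F=%.3f" % rdp)
print("sortmerna: P=%.3f R=%.3f F=%.3f" % sortmerna)
```

On these counts the RDP run comes out at roughly P=0.821, R=0.958, F=0.885, and the sortmerna run at roughly P=0.724, R=0.875, F=0.792.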
Could you please tell me what I'm doing wrong?
Results using RDP classifier: