Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mock2 expected abundance replication issue #85

Closed
dkioroglou opened this issue Feb 28, 2018 · 4 comments
Closed

Mock2 expected abundance replication issue #85

dkioroglou opened this issue Feb 28, 2018 · 4 comments

Comments

@dkioroglou
Copy link

Hello,

I'm trying to replicate the Mock2 expected abundance results at 99% confidence, and more specifically only the expected abundance at the genus level.
From the provided data I have used only the "forward read" which I have trimmed at the position 135 towards the 3'. I didn't use the "reverse read" because the quality didn't seem that good.

I have done various trials using Qiime 1.9.1 and the parameters that brought me closer to the expected results were the following:

split_libraries_fastq.py \
-m ${mapping} \
-i ${raw} \
-b ${index} \
-o splitting_tmp \
-p 0.75 \
-q 19 \
-r 3 \
--rev_comp_mapping_barcodes

pick_open_reference_otus.py \
-i ${fasta_file} \
-o ${output_dir}\
-r $HOME/gg_13_8_otus/rep_set_aligned/99_otus.fasta \
-p ${parameters_file}

Parameters_file:
pick_otus:similarity	0.99
assign_taxonomy:assignment_method	rdp  # version 2.2
assign_taxonomy:id_to_taxonomy	$HOME/gg_13_8_otus/taxonomy/99_otu_taxonomy.txt
assign_taxonomy:reference_seqs_fp	$HOME/gg_13_8_otus/rep_set/99_otus.fasta
align_seqs:template_fp	$HOME/gg_13_8_otus/rep_set_aligned/99_otus.fasta

Genera identified:
In total 28 genera were identified.
5/28 genera were not in the expected results (denoted as misclassified).
23/24 of the expected genera were identified.

Genera abundances:
In most of the cases the abundances were not matching the expected ones. Please see the results at the bottom.

Bokulich et al. 2015
According to the paper sortmerna should give better results regarding precision, recall and F measure. I used the parameters mentioned in the methods for sortmerna (0.51:0.8:1:0.8:1) and actually I got worst results (8/29 genera misclassified, 21/24 of the expected genera were identified).

Could please tell me what I'm doing wrong?

Results using RDP classifier:

Genera            Found    Expected  Source
Acinetobacter     0.0680   NA        NA
Akkermansia       4.5545   2.1277    2.0833
Alistipes         0.6976   2.1277    2.0833
Anaerococcus      2.5559   2.1277    2.0833
Anaerotruncus     0.4444   2.1277    2.0833
Bacteroides       29.8751  25.5319   25.0000
Bifidobacterium   5.3094   4.2553    4.1667
Blautia           2.0612   4.2553    4.1667
Clostridium       4.5551   8.5106    14.5833
Collinsella       0.1558   2.1277    2.0833
Coprococcus       NA       2.1277    2.0833
Dorea             8.4409   8.5106    4.1667
Edwardsiella      2.0291   2.1277    2.0833
Enterobacter      0.0134   2.1277    2.0833
Escherichia       6.8449   4.2553    4.1667
Eubacterium       NA       NA        6.2500
Faecalibacterium  0.0498   2.1277    2.0833
Lachnospira       2.7171   NA        NA
Lactobacillus     0.3774   NA        NA
Parabacteroides   2.6164   2.1277    2.0833
Proteus           0.0535   2.1277    2.0833
Providencia       0.0123   2.1277    2.0833
Roseburia         0.4342   2.1277    2.0833
Ruminococcus      1.3711   4.2553    6.2500
Shigella          0.0262   NA        NA
Streptococcus     8.9549   2.1277    2.0833
Subdoligranulum   4.8672   2.1277    2.0833
Victivallis       0.0375   NA        NA
[Eubacterium]     0.5718   6.3830    NA
[Ruminococcus]    10.3051  4.2553    NA
@nbokulich
Copy link
Contributor

Expected abundances in a mock community are never replicated 100%, because there are so many factors that impact the relative abundance (human imprecision, copy # variation, PCR/sequence error/bias) that skew these results and to a large degree cannot be corrected for bioinformatically. I have a more in-depth discussion on this on the QIIME2 forum.

Your results actually look pretty good, and I don't think you are doing anything "wrong". Particularly if you consider that taxa like [Eubacterium] and Eubacterium might be the same thing. Some things that could improve accuracy:

  1. Use denoising methods (like dada2 and deblur, which are implemented in QIIME2) instead of OTU picking. These will remove some erroneous reads that lead to false positives.
  2. We have a more recent preprint benchmarking different taxonomy classifiers in QIIME1 and QIIME2. Method recommendations have changed. RDP does quite well, but you can try out others. The data for this study are all in tax-credit if you want to take a closer look.
  3. Mock-2 is not the most accurate mock community that we have (though it does do better than some others). You could try something like mock-12 for a community that tends to get higher accuracy scores. Around half-way down this notebook there are heatmaps for precision/recall/F-measure for each classifier method configuration on each individual mock community, so you can see how each method fares, and how I am judging that mock-12 tends to get higher accuracy scores in general than mock-2.

Have you calculated precision/recall or some other accuracy metric on these data? You can follow the notebooks in tax-credit to run these same evaluations, or we have a method for calculating some of these metrics in QIIME2.

I am going to close this issue, since there is not really an error here, but please let me know if you have any more questions or comments!

@dkioroglou
Copy link
Author

Thank you for your quick response.
I see that I have created confusion with the word "replicate", apologies for that.
The aim was not to replicate in the lab the source composition of the Mock2 community and run a bioinformatic analysis afterwards, but to analyze the publicly available data of the following repository:

https://github.com/caporaso-lab/mockrobiota/tree/master/data/mock-2

In this repository the following files are provided:

  • forward-read
  • reverse-read
  • index-read
  • sample-metadata.tsv
  • source taxonomy abundance
  • greengenes_13_8_99 expected taxonomy abundance

According to the README file this mock community is referred as B2 dataset in Bokulich et al. 2015.
So what I tried to do was to download the data and try to get as closer to the expect abundance as possible. The reason that I think I'm doing something wrong is because the provided expected abundance file shows that all the genera of the source have been identified with quite accurate abundances, but my results are quite off. I have been trying to understand the reason but with no luck so far.

@nbokulich
Copy link
Contributor

Either I am still misunderstanding, or there is no misunderstanding. By "replicate", you mean that you are attempting to analyze the mock-2 data and detect the expected genera at the expected abundances. Correct?

I think I have identified the source of confusion. It looks like you are interpreting the greengenes_13_8_99 expected taxonomy abundance to be the abundances observed after analysis. These are in fact just the source taxonomies reformatted to have labels that are consistent with the Greengenes 13_8 99% OTUs taxonomy, not the product of any type of analysis. That is why the expected abundances are so close to the source — they are the source, except that some source taxonomies may be "collapsed" together into a single expected taxonomy where, e.g., the species labels do not exist in the Greengenes taxonomy.

Does that make sense?

So I think you are actually doing everything correctly and your "found" abundances actually look quite good.

For example, by looking at either the 2015 or 2017 preprints, you will see that none of the methods actually have 100% accuracy at species level for mock communities — this is because the taxon abundances are always skewed during sample handling/PCR/sequencing, and no bioinformatic analysis can really correct this perfectly. Genus-level classification are actually quite a lot better but still not perfect for this same reason.

@dkioroglou
Copy link
Author

You're absolutely correct on what I was aiming for and on my misinterpretation of the "greengenes_13_8_99 expected taxonomy abundance", and yes your explanation makes absolutely sense.
During my trials my conclusions on parameters were in congruence with yours in 2015 preprint. It was just that "greengenes_13_8_99 expected taxonomy abundance" file that was driving me nuts.

Thank you very much for your time and explanation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants