How to understand the results? #6

Closed
alienzj opened this issue Nov 5, 2022 · 5 comments

Comments

@alienzj

alienzj commented Nov 5, 2022

Dear @apcamargo,
thank you for developing such a great tool!

I used it to do viral genome taxonomic assignment:

genomad end-to-end --min-score 0.8 --cleanup --splits 16 \
results/09.dereplicate/genomes/virome/representative/vMAGs_hmq.megahit.rep.fa.gz \
genomad_output ~/databases/ecogenomics/geNomad/genomad_db \
>genomad.log 2>&1

Here is the summary of results:

➤ zcat results/09.dereplicate/genomes/virome/representative/vMAGs_hmq.megahit.rep.fa.gz | rg -c "^>"
8439

➤ wc -l genomad_output/vMAGs_hmq.megahit.rep_summary/vMAGs_hmq.megahit.rep_plasmid_summary.tsv
483 genomad_output/vMAGs_hmq.megahit.rep_summary/vMAGs_hmq.megahit.rep_plasmid_summary.tsv

➤ wc -l genomad_output/vMAGs_hmq.megahit.rep_summary/vMAGs_hmq.megahit.rep_virus_summary.tsv
4933 genomad_output/vMAGs_hmq.megahit.rep_summary/vMAGs_hmq.megahit.rep_virus_summary.tsv

All vMAGs were identified by VirSorter2 and phamb and were rated complete, high, or medium quality by CheckV,
so there are two things I don't currently understand:

  1. Why does geNomad identify plasmids among viral genomes (vMAGs)? It found 482 plasmids.
  2. The input contains 8439 genomes, so why do only 4932 viral genomes have taxonomic assignments?

Thanks a lot!

@apcamargo
Owner

apcamargo commented Nov 5, 2022

Hi @alienzj

There are a couple of points here:

  • geNomad's predictions won't necessarily match VirSorter2's. geNomad's precision is higher than VirSorter2's (and even higher if you run VirSorter2 with all the models), so geNomad's predictions tend to be more conservative (although this can vary from dataset to dataset). I can't really say what is causing the classification conflicts with VirSorter2 without looking at the data.
  • I can't really say why those contigs are being classified as plasmids without looking at the data. Can you share the _genes.tsv and _summary.tsv files? Keep in mind that VirSorter2 and phamb don't take plasmids into account, so there's a chance they classify plasmids as viruses.
  • There are two reasons why only 4932 out of 8439 sequences got a taxonomic classification: (1) some sequences were classified as plasmids, and (2) because you used a very conservative cutoff (--min-score 0.8), some sequences weren't classified as either viruses or plasmids. The good news is that you can still check the taxonomic assignment of all sequences (regardless of their classification) in the annotation directory (look for vMAGs_hmq.megahit.rep_annotate/vMAGs_hmq.megahit.rep_taxonomy.tsv).
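As a rough sanity check, the numbers in this thread can be reconciled with a few lines of arithmetic. This is only a sketch: it assumes each `wc -l` count includes exactly one header line and that a sequence appears in at most one of the two summary files (which may not hold if proviruses are excised from contigs). The variable names are illustrative.

```python
# Counts reported earlier in this thread.
n_input = 8439        # FASTA records in the input file
virus_lines = 4933    # wc -l of the _virus_summary.tsv file
plasmid_lines = 483   # wc -l of the _plasmid_summary.tsv file

# Assumption: each summary file starts with exactly one header line.
n_virus = virus_lines - 1
n_plasmid = plasmid_lines - 1

# Assumption: a sequence is listed in at most one of the two summaries,
# so whatever is left over was classified as neither virus nor plasmid.
n_unclassified = n_input - n_virus - n_plasmid

print(f"virus: {n_virus}, plasmid: {n_plasmid}, neither: {n_unclassified}")
```

Under those assumptions, the sequences in neither summary are the ones that fell below the --min-score 0.8 cutoff.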

Aggregating the output of several classification tools is difficult because they will often diverge. geNomad is, on average, more accurate than VirSorter2 (see figure below), but VS2 is an amazing tool and I can't guarantee that geNomad will be correct in every single scenario in which they diverge. You should gather as much information as possible.

[Figure: VIRUS_MAIN_BENCHMARK]

The good news is that geNomad's output includes some information that makes it easier to understand why a given sequence was classified as a plasmid or virus:

This is an example of a _plasmid_summary.tsv file:

seq_name      length   topology   n_genes   genetic_code   plasmid_score   fdr   n_hallmarks   marker_enrichment   conjugation_genes
-----------   ------   --------   -------   ------------   -------------   ---   -----------   -----------------   -----------------
NC_002128.1   92721    Linear     88        11             0.9942          NA    5             46.4458             T_virB11;MOBP1
NC_002127.1   3306     Linear     3         11             0.9913          NA    1             1.6586              NA

Here you can see that these sequences encode plasmid hallmarks, which is a very good indication that they are indeed plasmids. Check whether your sequences also encode them. In addition, the marker_enrichment field is a number that increases proportionally to the number of plasmid markers. So, if the marker_enrichment of a given sequence is high (say, higher than 6), it is probably a plasmid, not a virus.
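The marker_enrichment rule of thumb can be applied with a short script. A minimal sketch, using the two example rows from the table above rebuilt as tab-separated text; the function name `likely_plasmids` is made up for illustration, and the cutoff of 6 is just the "say, higher than 6" suggestion, not a calibrated threshold.

```python
import csv
import io

# The two example rows from the table above, rebuilt as tab-separated text.
header = ["seq_name", "length", "topology", "n_genes", "genetic_code",
          "plasmid_score", "fdr", "n_hallmarks", "marker_enrichment",
          "conjugation_genes"]
rows = [
    ["NC_002128.1", "92721", "Linear", "88", "11", "0.9942", "NA", "5",
     "46.4458", "T_virB11;MOBP1"],
    ["NC_002127.1", "3306", "Linear", "3", "11", "0.9913", "NA", "1",
     "1.6586", "NA"],
]
tsv_text = "\n".join("\t".join(fields) for fields in [header] + rows)


def likely_plasmids(handle, min_enrichment=6.0):
    """Return seq_name values whose marker_enrichment exceeds min_enrichment."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [row["seq_name"] for row in reader
            if float(row["marker_enrichment"]) > min_enrichment]


hits = likely_plasmids(io.StringIO(tsv_text))
print(hits)
```

To run it on real output, pass an open file handle for your _plasmid_summary.tsv instead of the `io.StringIO` object.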

The same is true for the _virus_summary.tsv output. Try running the classification again with a lower --min-score and see whether the sequences look viral from the summary (if you'd like to do the filtering yourself, based on your own criteria, just use --min-score 0). You might have some false positives in your dataset.
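If you go the --min-score 0 route, the filtering can be done afterwards on the summary table. A minimal sketch, assuming the virus summary has a score column named `virus_score` (by analogy with `plasmid_score` in the plasmid table; check the header of your own file). The rows below are invented placeholders, not real results, and the 0.7 cutoff is arbitrary.

```python
import csv
import io


def filter_by_score(handle, score_column, min_score):
    """Keep rows of a tab-separated summary whose score is >= min_score."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [row for row in reader if float(row[score_column]) >= min_score]


# Placeholder table: sequence names and scores are invented for illustration.
example = "\n".join([
    "seq_name\tvirus_score",
    "contig_A\t0.95",
    "contig_B\t0.72",
    "contig_C\t0.41",
])

# "virus_score" is an assumed column name; 0.7 is an arbitrary cutoff.
kept = filter_by_score(io.StringIO(example), "virus_score", 0.7)
kept_names = [row["seq_name"] for row in kept]
print(kept_names)
```

Keeping the column name as a parameter means the same helper works for the plasmid summary with `plasmid_score`.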

Again, if you are only interested in the taxonomy, just look at vMAGs_hmq.megahit.rep_annotate/vMAGs_hmq.megahit.rep_taxonomy.tsv :)

Hope this helps!

@alienzj
Author

alienzj commented Nov 6, 2022

Dear @apcamargo, thanks a lot for your quick and detailed reply.
Sure sure, here are the tsv files generated by geNomad using the above command line: genomad_output_tsv.tar.gz

  1. The figure you provided shows that geNomad is very accurate, which is excellent.
  2. Yes, there's a chance VirSorter2 and phamb classify plasmids as viruses.
  3. I checked vMAGs_hmq.megahit.rep_annotate/vMAGs_hmq.megahit.rep_taxonomy.tsv, and it contains 8269 taxonomic assignments, which is quite useful. Yes, I will change --min-score as you suggest and see what happens.

Thanks a lot again!

@apcamargo
Owner

Thanks @alienzj

There are certainly lots of plasmids in your data. You can easily see that in the _plasmid_summary.tsv file:

  • Sequences with very high marker_enrichment, which means that there are multiple plasmid markers in them.
  • Sequences with multiple plasmid hallmark genes (n_hallmarks)
  • Sequences with multiple conjugation genes (conjugation_genes). It is important to note that there are phages capable of conjugation, though.
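The three signals listed above can be pulled out per sequence and ranked. A minimal sketch, using the two example rows shown earlier in this thread; the helper name `plasmid_evidence` is hypothetical, and it assumes that `NA` means no conjugation genes and that multiple genes are separated by `;`, as in the example table.

```python
def plasmid_evidence(row):
    """Summarise the three plasmid signals from one summary row (a dict)."""
    genes = row["conjugation_genes"]
    # Assumption: "NA" means none; otherwise gene names are ";"-separated.
    n_conjugation = 0 if genes == "NA" else len(genes.split(";"))
    return {
        "seq_name": row["seq_name"],
        "marker_enrichment": float(row["marker_enrichment"]),
        "n_hallmarks": int(row["n_hallmarks"]),
        "n_conjugation_genes": n_conjugation,
    }


# The two example rows from the plasmid summary shown earlier in the thread.
rows = [
    {"seq_name": "NC_002127.1", "n_hallmarks": "1",
     "marker_enrichment": "1.6586", "conjugation_genes": "NA"},
    {"seq_name": "NC_002128.1", "n_hallmarks": "5",
     "marker_enrichment": "46.4458", "conjugation_genes": "T_virB11;MOBP1"},
]

# Strongest plasmid evidence first, ranked by marker_enrichment.
evidence = sorted((plasmid_evidence(r) for r in rows),
                  key=lambda e: e["marker_enrichment"], reverse=True)
for entry in evidence:
    print(entry)
```

Because some phages are capable of conjugation, the conjugation gene count is best read alongside the other two columns rather than on its own.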

@alienzj
Author

alienzj commented Nov 7, 2022

Hi, @apcamargo,

Thanks a lot for your reply.
Yes, I shall remove those plasmids when doing virome profiling.

It is quite interesting to find so many plasmids among the vMAGs identified by VirSorter2 and phamb.

@apcamargo
Owner

No problem! Let me know if you have more questions :)
