Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Taxonomy changed when input contigs are different #33

Closed
xwu35 opened this issue Sep 7, 2023 · 10 comments
Closed

Taxonomy changed when input contigs are different #33

xwu35 opened this issue Sep 7, 2023 · 10 comments

Comments

@xwu35
Copy link

xwu35 commented Sep 7, 2023

Hi,

I used geNomad to identify viruses in addition to VirSorter2 and Cenote-taker2. Subsequently, I used geNomad's annotate module to assign taxonomy to all viral contigs, including those identified by other tools. I noticed that three of them received completely different taxonomic assignment when the input contigs were altered.

For instance, one of them was initially categprozed as Algavirales (Varidnaviria › Bamfordvirae › Nucleocytoviricota › Megaviricetes ) when the input contigs consisted of all assembled contigs (approximately 70,000), but it was then classified as Caudoviricetes (Duplodnaviria › Heunggongvirae › Uroviricota) when only viral contigs were used as the input (around 5000). Is this expected? I would think that taxonomy annotation should be more stable across different input contigs.

Many thanks.

@apcamargo
Copy link
Owner

This shouldn't happen. The only thing that would lead to a difference in the taxonomy is a variance in the annotation process, but changing the input should not alter the annotation of a given sequence.

Can you provide the rows of both _genes.tsv files that contain the genes of this example you mentioned?

@xwu35
Copy link
Author

xwu35 commented Sep 7, 2023

Thanks for the quick response. Please see below:

all assembled contigs as input:
contig_11421_1 2 1918 1917 1 0.529 11 None GENOMAD.002410.VV 8.55e-08 58 0 0 1 11844 Phycodnaviridae NA NA PF16903 Major capsid protein N-terminus
contig_38880_1 3 242 240 1 0.483 11 None NA NA NA 0 0 0 1 NA NA NA NA NA
contig_38880_2 302 967 666 1 0.434 11 None GENOMAD.171343.VP 3.504e-16 80 0 0 0 12031 Mimiviridae NA NA PF18406 Ferredoxin-like domain in Api92-like protein
contig_38880_3 1043 1222 180 1 0.383 11 AGGA NA NA NA 0 0 0 1 NA NA NA NA NA
contig_802_1 1 126 126 -1 0.548 11 GGAG/GAGG NA NA NA 0 0 0 1 NA NA NA NA NA
contig_802_2 153 3035 2883 -1 0.537 11 None GENOMAD.068590.VV 0.0001592 48 0 0 1 2561 Caudoviricetes NA NA PF13871;PF11242 C-terminal domain on Strawberry notch homologue

only viral contigs as input:
contig_11421_1 2 1918 1917 1 0.529 11 None GENOMAD.088220.VV 7.588e-11 68 0 0 0 2561 Caudoviricetes NA NA NA NA
contig_38880_1 3 242 240 1 0.483 11 None NA NA NA 0 0 0 1 NA NA NA NA NA
contig_38880_2 302 967 666 1 0.434 11 None GENOMAD.055798.VV 8.984e-50 178 0 0 0 2561 Caudoviricetes NA NA PF19174 NA
contig_38880_3 1043 1222 180 1 0.383 11 AGGA NA NA NA 0 0 0 1 NA NA NA NA NA
contig_802_1 1 126 126 -1 0.548 11 GGAG/GAGG NA NA NA 0 0 0 1 NA NA NA NA NA
contig_802_2 153 3035 2883 -1 0.537 11 None GENOMAD.056032.VV 1.96e-05 51 0 0 0 11844 Phycodnaviridae NA NA PF00176 ERCC4-related helicase

@apcamargo
Copy link
Owner

Did you use the same geNomad version for both runs? Which version was it? What version of the database did you use?

Not related to this, are these all the genes in your contigs? Be extra careful when working with such short sequences.

@xwu35
Copy link
Author

xwu35 commented Sep 7, 2023

Yes, I used the same geNomad version 1.5.2 and database version 1.2.

We had to use contigs >= 1kb since the majority of the assembled contigs were short than 3 kb.

@apcamargo
Copy link
Owner

apcamargo commented Sep 7, 2023

Indeed, it seems that the size of the input does change the annotation in my testing. I didn't expect that since the E-value depends on the database size, not query size.

My guess is that this is a MMseqs2 thing. @milot-mirdita, do you know what could be causing this? geNomad basically uses the proteins encoded by the input sequences as query and searches a profile database. I then take the best hit of each query protein. You can find the commands here.

Maybe this is because I'm using --max-rejected? If the order of the alignments of a given query is different depending on the number of queries, --max-rejected would stop the alignment computation at different points.

@apcamargo
Copy link
Owner

@xwu35 I did some more investigation and I found the cause of this issue.

Short story: the annotation should be very slightly more reliable when your input has less sequences (in your case, the input with only viral sequences). In any case, I wouldn't really recommend using the taxonomy of very short sequences.

Long story: It seems that the order of the sequences in the input matter when using --max-rejected. I don't really understand why, because I'd assume that the counter resets for each query. The problem is that if I stop using --max-rejected the runtime will increase significantly, so I'll look into other options to solve this.

@xwu35
Copy link
Author

xwu35 commented Sep 8, 2023

Thanks for looking into it.

Will this issue make the virus identification step unreliable since part of the identification method is to find aligned viral marker genes?

@apcamargo
Copy link
Owner

The effects will be minimal, you shouldn't worry. This issue affects very few proteins within a sample and only alignments with high E-value are susceptible to variance.

@xwu35
Copy link
Author

xwu35 commented Sep 8, 2023

Thanks, it is good to know.

@apcamargo
Copy link
Owner

I just pushed a change to the way the MMseqs2 searches are performed that will mitigate this issue

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants