Taxonomy changed when input contigs are different #33
Comments
This shouldn't happen. The only thing that would lead to a difference in the taxonomy is a variance in the annotation process, but changing the input should not alter the annotation of a given sequence. Can you provide the rows of both
Thanks for the quick response. Please see below: all assembled contigs as input: only viral contigs as input:
Did you use the same geNomad version for both runs? Which version was it? What version of the database did you use? Unrelated to this: are these all the genes in your contigs? Be extra careful when working with such short sequences.
Yes, I used the same geNomad version (1.5.2) and database version (1.2) for both runs. We had to use contigs >= 1 kb since the majority of the assembled contigs were shorter than 3 kb.
Indeed, in my testing the size of the input does change the annotation. I didn't expect that, since the E-value depends on the database size, not the query size. My guess is that this is an MMseqs2 thing. @milot-mirdita, do you know what could be causing this? geNomad basically uses the proteins encoded by the input sequences as queries and searches a profile database. I then take the best hit of each query protein. You can find the commands here. Maybe this is because I'm using
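For readers following along, the "best hit of each query protein" step can be sketched roughly as below. This is only an illustration, not geNomad's actual code: it assumes the hits are available as tab-separated text in the BLAST-style 12-column layout (query, target, ..., evalue, bits), and the function name `best_hits` is made up for this example. Ties are broken explicitly so that the selected hit does not depend on the order of the rows.

```python
import csv
import io

# Assumption: BLAST-style tabular columns, in this order.
COLUMNS = ["query", "target", "fident", "alnlen", "mismatch", "gapopen",
           "qstart", "qend", "tstart", "tend", "evalue", "bits"]


def best_hits(tsv_text):
    """Return {query: (target, evalue, bits)} with one best hit per query.

    Hypothetical helper: the best hit is the one with the highest bit score;
    ties are broken by lower E-value, then by target name.
    """
    best = {}
    reader = csv.DictReader(io.StringIO(tsv_text), fieldnames=COLUMNS,
                            delimiter="\t")
    for row in reader:
        evalue = float(row["evalue"])
        bits = float(row["bits"])
        # Smaller key is better: higher bits, then lower E-value, then name.
        key = (-bits, evalue, row["target"])
        current = best.get(row["query"])
        if current is None or key < current[0]:
            best[row["query"]] = (key, row["target"], evalue, bits)
    return {q: (t, e, b) for q, (_, t, e, b) in best.items()}
```

Note that this only makes the selection step deterministic for a fixed set of alignments. It does not address variation in the alignments that the search itself reports.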
@xwu35 I did some more investigation and found the cause of this issue. Short story: the annotation should be very slightly more reliable when your input has fewer sequences (in your case, the input with only viral sequences). In any case, I wouldn't really recommend using the taxonomy of very short sequences. Long story: It seems that the order of the sequences in the input matters when using
Thanks for looking into it. Will this issue make the virus identification step unreliable, since part of the identification method is to find aligned viral marker genes?
The effects will be minimal, so you shouldn't worry. This issue affects very few proteins within a sample, and only alignments with high E-values are susceptible to variance.
Thanks, that is good to know.
I just pushed a change to the way the MMseqs2 searches are performed that will mitigate this issue.
Hi,
I used geNomad to identify viruses, in addition to VirSorter2 and Cenote-taker2. Subsequently, I used geNomad's annotate module to assign taxonomy to all viral contigs, including those identified by the other tools. I noticed that three of them received completely different taxonomic assignments when the input contigs were altered.
For instance, one of them was initially categorized as Algavirales (Varidnaviria › Bamfordvirae › Nucleocytoviricota › Megaviricetes) when the input consisted of all assembled contigs (approximately 70,000), but it was then classified as Caudoviricetes (Duplodnaviria › Heunggongvirae › Uroviricota) when only viral contigs were used as the input (around 5,000). Is this expected? I would think that taxonomy annotation should be stable across different sets of input contigs.
Many thanks.
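For anyone wanting to check their own runs for the same behaviour, a comparison like the one described above could be sketched as follows. This is a minimal illustration under stated assumptions: each run's taxonomy table is taken to be tab-separated text with a header row, and the column names `seq_name` and `lineage` are assumptions that should be adjusted to match the actual output files. The helper names are hypothetical.

```python
import csv
import io


def load_lineages(tsv_text, id_col="seq_name", lineage_col="lineage"):
    """Map each sequence ID to its lineage string.

    Assumption: the table has a header row containing id_col and lineage_col.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {row[id_col]: row[lineage_col] for row in reader}


def changed_lineages(run_a, run_b):
    """List (seq_id, lineage_a, lineage_b) for shared IDs whose lineage differs."""
    shared = sorted(set(run_a) & set(run_b))
    return [(s, run_a[s], run_b[s]) for s in shared if run_a[s] != run_b[s]]
```

Loading the table from the run with all contigs and the table from the run with only viral contigs, then passing both dictionaries to `changed_lineages`, would list the contigs whose assignment differs between the two inputs.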