Hi, thanks for this amazing tool!
I was trying to understand the execution order of the database searches. If I understood correctly, it is:
1. diamond on reference proteomes from UniProt, then
2. blastn on nt, only for the contigs without any hits from the previous step.
However, in your 2017 publication the order seems to be reversed, with blastn run on all sequences and diamond as a second pass for those without any hits.
If my understanding is correct, could you explain why searching the reference proteomes first would be advantageous?
gbdias changed the title from "diamond and blast order" to "database search order (diamond blast)" on Mar 18, 2024
Hi, sorry I missed this. Yes, the search order has changed. blastn searches against nt are useful for very short contigs, but inefficient for longer sequences. As sequencing and assembly have improved, most assembled sequences now carry enough information to get good hits from diamond blast searches against reference proteomes. Using blastn only for the sequences without diamond blast hits speeds up the process quite considerably.
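For anyone wanting to picture the two-pass order described above, here is a minimal sketch of the selection step between the passes. It is not the pipeline's actual code: the function names, contig IDs, and hit lines are made up for illustration, and it only assumes that the diamond results are in BLAST-style tabular format with the query ID in the first column.

```python
import csv
import io


def ids_with_hits(tabular_text):
    """Return the set of query IDs found in BLAST-style tabular output.

    Assumes tab-separated lines with the query ID in the first column;
    blank lines and lines starting with '#' are skipped.
    """
    reader = csv.reader(io.StringIO(tabular_text), delimiter="\t")
    return {row[0] for row in reader if row and not row[0].startswith("#")}


def contigs_for_second_pass(all_contig_ids, diamond_tabular_text):
    """Return contigs with no first-pass (diamond) hit, in input order.

    These are the only sequences that would go on to the slower
    blastn search against nt.
    """
    hit_ids = ids_with_hits(diamond_tabular_text)
    return [cid for cid in all_contig_ids if cid not in hit_ids]


# Hypothetical example data: three contigs, two of which have diamond hits.
contigs = ["contig_1", "contig_2", "contig_3"]
diamond_out = "contig_1\tA0A000\t91.2\ncontig_3\tB0B000\t78.5\n"

remaining = contigs_for_second_pass(contigs, diamond_out)
print(remaining)
```

With this toy input, only `contig_2` is left for the second pass, which is where the time saving comes from when most contigs already have protein-level hits.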
We had a tricky case from a phylum (Nemertea) where, at the time we ran blobtools, there were very few genomes and no reference proteomes available. This resulted in our contigs getting classified as a bunch of equally distant phyla (from Arthropoda to Echinodermata, Chordata, Mollusca, etc.).
Today there's still only a handful of Nemertean genomes, but there is one single reference proteome contributed by the NCBI automatic pipeline, so maybe I should try again.
In such cases I guess it could be good to get blastn results on the whole genome first, since the available unannotated Nemertean genomes could be sufficient to assign contigs to the right phylum. But I understand the rationale for the pipeline change. 👍