database search order (diamond blast) #210

Open
gbdias opened this issue Mar 18, 2024 · 2 comments
gbdias commented Mar 18, 2024

Hi, thanks for this amazing tool!

I was trying to understand the execution order of the database searches. If I understand correctly, it is:

  1. diamond on reference proteomes from UniProt, then
  2. blastn on nt, only for the contigs without any hits from the previous step.

However, in your 2017 publication it seems the order is reversed, with blastn run on all sequences and diamond as a second pass for those without any hits.

[Image: Screenshot 2024-03-18 at 16 48 47]

If my understanding is correct, could you explain why running the search on reference proteomes first would be advantageous?

gbdias changed the title from "diamond and blast order" to "database search order (diamond blast)" Mar 18, 2024
@rjchallis
Contributor

Hi, sorry I missed this. Yes, the search order has changed. blastn searches against nt are useful for very short contigs but inefficient for longer sequences. As sequencing and assembly have improved, most assembled sequences carry enough information to get good hits from diamond blast searches against reference proteomes. Using blastn only for the sequences without diamond blast hits speeds up the process quite considerably.
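
The two-pass logic described above can be sketched as follows. This is only an illustration, not the pipeline's actual code: the function name `contigs_without_hits` is hypothetical, and it assumes the diamond results are in BLAST-style tabular format with the query (contig) ID in the first tab-separated column.

```python
def contigs_without_hits(contig_ids, diamond_rows):
    """Return the contigs that still need a blastn pass.

    contig_ids:   all contig IDs in the assembly, in order.
    diamond_rows: lines of tabular diamond output; the first
                  tab-separated field is assumed to be the query ID.
    """
    hit_ids = set()
    for row in diamond_rows:
        row = row.strip()
        if not row or row.startswith("#"):
            continue  # skip blank lines and comment lines
        hit_ids.add(row.split("\t")[0])
    # Keep the original order; only contigs with no diamond hit remain.
    return [c for c in contig_ids if c not in hit_ids]


# Made-up example data: contig_2 has no diamond hit.
all_contigs = ["contig_1", "contig_2", "contig_3"]
diamond_output = [
    "contig_1\tprotein_A\t91.2",
    "contig_3\tprotein_B\t78.5",
    "contig_3\tprotein_C\t70.1",
]
second_pass = contigs_without_hits(all_contigs, diamond_output)
print(second_pass)
```

Only the contigs in `second_pass` would then be searched with blastn against nt, which is where the time saving comes from.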


gbdias commented May 30, 2024

Hi @rjchallis thanks for the info!

We had a tricky case from a phylum (Nemertea) where, at the time we ran blobtools, there were very few genomes and no reference proteomes available. This resulted in our contigs being assigned to a range of equally distant phyla (from Arthropoda to Echinodermata, Chordata, Mollusca, etc.).

Today there is still only a handful of Nemertean genomes, but there is a single reference proteome contributed by the NCBI automatic pipeline, so maybe I should try again.

In such cases, I guess it could be good to get blastn results for the whole genome first, since the available unannotated Nemertean genomes could be sufficient to assign contigs to the right phylum. But I understand the rationale for the pipeline change. 👍
