Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Attempted sourmash gather optimizations #519

Closed
ctb opened this issue Jul 21, 2018 · 3 comments
Closed

Attempted sourmash gather optimizations #519

ctb opened this issue Jul 21, 2018 · 3 comments

Comments

@ctb
Copy link
Contributor

ctb commented Jul 21, 2018

ref #518 -

I've tried & failed at a few different optimizations for sourmash gather that I wanted to describe here.

First, I tried doing adaptive downsampling, in which we downsample the query signature to a high scaled value (5e5) for the first searches. Unfortunately, the devil is in the details: when you downsample the query signature, you end up with different enough results that in each iteration you need to re-run the search to get the exact right results, and that second search discards any speedup you get from doing the first search.

Second, I tried various ways of caching search results. In particular, you would think that we would get quite a speedup from tracking internal nodes that have 0 matches to the entire query signature and never looking at them again. Not so, it seems.

@ctb
Copy link
Contributor Author

ctb commented Jul 23, 2018

Third, I tried various approaches to preloading, where you raise the match threshold based on what you've seen before (that's what's in #520). No joy - apparently the search truncation based on requested threshold isn't very effective, maybe because the threshold matches are so low.

(I've been using shakya-unaligned-contigs.sig against the old genbank k=31 tree to benchmark).

@ctb
Copy link
Contributor Author

ctb commented Sep 5, 2018

Another thought - if we did a search for containment above some fixed threshold, we could limit ourselves to the results from that search until we fell below that threshold. This would avoid going back to the full tree in some situations. Have to think about when and where that would fail, though.

@ctb
Copy link
Contributor Author

ctb commented Jan 11, 2019

seems like @luizirber is On It with #615.

@ctb ctb closed this as completed Jan 11, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant