Gather performance improvements discussion #838

Open · luizirber opened this issue Jan 11, 2020 · 23 comments
@luizirber (Member) commented Jan 11, 2020

Moving the conversation from #835 here.

@luizirber (Member, Author) commented Jan 12, 2020

We'd love it if you could improve gather's performance.

Here are some stats... it looks like v3 may be slower than v2?

Should we file a task? We are happy to help. Perhaps in v4? Is a smaller release planned on top of the v3 codebase anytime soon?

Thanks!

sourmash v2:

```shell
sourmash compute -k 31 --scaled 5000 -o testv2.sig test.01M_R1_L001.fastq.gz
# 0m17.907s

sourmash gather -k 31 --scaled 5000 -o gatherv2 testv2.sig db/ecolidb.sbt.json
# 0m1.896s
```

sourmash v3.0.1:

```shell
sourmash compute -k 31 --scaled 5000 -o testv3.sig test.01M_R1_L001.fastq.gz
# 0m42.243s

sourmash gather -k 31 --scaled 5000 -o gatherv3 testv3.sig db/ecolidb.sbt.json
# 0m4.429s
```

Originally posted by @satishv in #835 (comment)

@ctb (Member) commented Jan 13, 2020

Whoa, there's something bad going on all right :)

I used the following code:

```python
#! /usr/bin/env python
import sourmash
import time

print('loaded from:', sourmash)
print('version:', sourmash.VERSION)

query = sourmash.load_one_signature('podar-ref/0.fa.sig', ksize=31)
db = sourmash.lca.lca_utils.load_single_database('xxx.lca.json')
if sourmash.VERSION.startswith('3.'):
    dblist = [db]
elif sourmash.VERSION.startswith('2.3.'):
    dblist = [(db[0], 'xxx', 'LCA'),]

start = time.time()
for i in range(500):
    g = sourmash.search.gather_databases(query, dblist,
                                         threshold_bp=0,
                                         ignore_abundance=False)
    g = list(g)
end = time.time()

print(end - start)
```

and got:

```
loaded from: <module 'sourmash' from '/Users/t/dev/sourmash/sourmash/__init__.py'>
version: 3.0.2.dev6+g86e1105
5.426559925079346
```

versus:

```
loaded from: <module 'sourmash' from '/Users/t/dev/sourmash/sourmash/__init__.py'>
version: 2.3.1
0.07218098640441895
```
(This was with the podar-ref LCA database.)

I didn't think we'd changed that code much between 3.0.1 and 2.3.1 but clearly I am mistaken :)

@ctb (Member) commented Jan 13, 2020

Yes, my recollection is indeed ...vastly wrong. We merged in the new index ABC (#556) after 2.3.1 and then moved immediately to Rust and released 3.x. I suppose there's a lesson in there about our versioning, @luizirber!

@ctb (Member) commented Jan 13, 2020

commit ec8a00c, with index ABC:

```
version: 2.3.2.dev1+gec8a00c
5.255207061767578
```

previous commit 03e5269, no index ABC:

```
version: 2.3.1
0.11417484283447266
```

ok, so it was #556 that did the dirty deed.

@ctb (Member) commented Jan 13, 2020

(nothing to do with rust!)

@ctb (Member) commented Jan 13, 2020

OK, using py-spy and digging around a bit, I think the culprit is probably in the new interplay between _find_best in search.py, and Index.gather implementations. _find_best does not pass on any hints about search thresholds, and the Index.gather implementation wouldn't take them into account anyway. 😿

@luizirber (Member, Author) commented Jan 13, 2020

> Yes, my recollection is indeed ...vastly wrong. We merged in the new index ABC (#556) after 2.3.1 and then moved immediately to Rust and released 3.x. I suppose there's a lesson in there about our versioning, @luizirber!

I'm a fan of "one release per PR", but we also wanted to do #556 and #424 in 3.0.0, so... yeah. See #655 for more info.

@luizirber (Member, Author) commented Jan 13, 2020

> OK, using py-spy and digging around a bit, I think the culprit is probably in the new interplay between _find_best in search.py, and Index.gather implementations.

But both SBTs and LCA indices define their own gather, and don't use Index.gather.

> _find_best does not pass on any hints about search thresholds, and the Index.gather implementation wouldn't take them into account anyway. 😿

It's not ideal, but we can pass threshold as a kwarg and define a default if it is not available (like search already does).
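A minimal sketch of that kwarg idea, using toy stand-ins (`FakeSig` and `FakeIndex` are hypothetical illustrations, not sourmash classes): gather accepts `threshold_bp` as a kwarg and falls back to a default of 0 when the caller passes no hint.

```python
# Sketch of the threshold-kwarg idea with toy stand-ins; FakeSig and
# FakeIndex are hypothetical, not sourmash classes.

class FakeSig:
    def __init__(self, name, hashes, scaled=1000):
        self.name = name
        self.hashes = set(hashes)
        self.scaled = scaled

    def count_common(self, other):
        return len(self.hashes & other.hashes)


class FakeIndex:
    def __init__(self, sigs):
        self.sigs = sigs

    def gather(self, query, **kwargs):
        # Default to no threshold when the caller passes no hint,
        # the way search() already handles its own threshold.
        threshold_bp = kwargs.get('threshold_bp', 0)
        # Convert the bp threshold into a minimum number of matching hashes.
        min_hashes = threshold_bp / query.scaled

        results = []
        for sig in self.sigs:
            common = query.count_common(sig)
            if common > 0 and common >= min_hashes:
                results.append((common, sig.name))
        results.sort(reverse=True)
        return results
```

With `threshold_bp=2000` and `scaled=1000`, any match sharing fewer than two hashes is skipped before it is ever scored, which is the pruning _find_best currently gives up.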


@luizirber (Member, Author) commented Jan 13, 2020

I'll dump here some other pointers to gather improvements; they should probably become other issues eventually (and this one stays as a meta/tracking issue?). I'll break it into two sections: Engineering, for making what already exists faster, and Research, for deeper changes that involve rethinking data structures and methods.

Engineering

We can continue running py-spy and heaptrack for optimizing specific routines or lowering overall memory consumption.

  • Finish #799 to allow loading Nodegraph from memory (avoid the tempfile dance when loading data from storage).
  • Generate a Nodegraph from the query, and use it to query internal nodes (instead of checking every hash in the query). This branch has this somewhat implemented, but depends on an unreleased khmer branch. Should also be doable in #799
  • #784 is not so good for gather or long-running processes (in these cases you might prefer to keep the full index in memory), but it can lower the memory consumption substantially. There is also #520 which tries to cache internal nodes, but I think it needs to cut a bit deeper to work well.
    • is there any cache that, once an item is evicted, allows calling a method? That would work well with the .unload() from #784, I think.
  • #648 is a disk-space saver (because you don't need to uncompress the SBT indices), but for now it makes things slower
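On the eviction-callback question above: a plain `OrderedDict` makes this easy to sketch. `EvictingLRU` is hypothetical; the `on_evict` hook is where something like the `.unload()` from #784 could be called.

```python
from collections import OrderedDict


class EvictingLRU:
    """Tiny LRU cache that invokes a callback when an entry is evicted.

    Hypothetical sketch: on_evict(key, value) is where a node's
    .unload() (as in #784) could be called.
    """

    def __init__(self, capacity, on_evict):
        self.capacity = capacity
        self.on_evict = on_evict
        self._data = OrderedDict()

    def get(self, key):
        if key not in self._data:
            return None
        # Mark as most recently used.
        self._data.move_to_end(key)
        return self._data[key]

    def put(self, key, value):
        if key in self._data:
            self._data.move_to_end(key)
        self._data[key] = value
        if len(self._data) > self.capacity:
            # Evict the least recently used entry and notify the owner.
            old_key, old_value = self._data.popitem(last=False)
            self.on_evict(old_key, old_value)
```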

Research

SBT indexing is not very smart: it finds the next available position in the tree and puts the signature there. Ideally we would want signatures to be clustered by some metric (probably similarity, but it can also just be common hashes), because the search algorithms (be it depth-first search as in gather, or breadth-first search as in search) benefit strongly from pruning the search as early as possible (and avoiding loading data from disk).

So, one direction to solve gather issues: better organization of the SBT nodes. At least two ways to do it:

  • Make the insertion of a dataset into an index smarter. We could look for the best position to put the signature, instead of just putting it in the next available place. @phoenixAja showed interest in this approach, because it also makes #710 viable.
    • Benefit: this is also an online/streaming approach
    • Possible problem: SBTs built with the current method are "dense", with the lowest possible depth. Keeping the tree dense with this approach would also involve rebalancing it (a self-balancing binary search tree is ideal), but that is more expensive to do and might not work very well with the current internal implementation of SBTs (which uses enumeration of the nodes instead of pointers, so a rotation involves changing a lot of node positions).
  • Scaffold an index. Given a collection of signatures, we can cluster them by similarity and build a tree which is both dense and maximizes similarity between nodes under the same parent. This is what HowDeSBT does (with the howdesbt cluster command). There is some support for this in sourmash, hidden in the Rust crate.
    • Drawback: this is an offline approach, because it needs a batch of signatures to work.
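As a rough illustration of the first option, here is a toy similarity-guided insertion. Plain sets stand in for the Bloom filters real SBT nodes use; `Node`, `insert`, and the branching constant are all hypothetical, not sourmash code.

```python
BRANCHING = 2  # toy branching factor; real SBTs are binary too


class Node:
    def __init__(self, hashes=None):
        self.hashes = set(hashes or ())
        self.children = []  # a leaf has no children


def insert(node, leaf):
    # Internal nodes store the union of everything below them.
    node.hashes |= leaf.hashes
    if len(node.children) < BRANCHING:
        node.children.append(leaf)
        return
    # Greedy step: descend toward the child sharing the most hashes,
    # instead of taking the first free slot anywhere in the tree.
    best = max(node.children, key=lambda c: len(c.hashes & leaf.hashes))
    if not best.children:
        # The best child is a leaf: push it down under a new internal node
        # that holds both the old leaf and the new one.
        internal = Node(best.hashes)
        internal.children.append(best)
        node.children[node.children.index(best)] = internal
        insert(internal, leaf)
    else:
        insert(best, leaf)
```

The effect is that similar signatures end up under the same parent, which is exactly the property that lets search prune subtrees early.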

Another direction, but depending on the first one: change the search strategy.

  • Use a best-first search to find matches in gather. Eventually the DFS will find the same result, but best-first can reach it faster (by taking the path with the largest similarity first). But this only works if the index is well-behaved...
    • Check best_first for a branch implementing the search strategy. It abuses the results dict a bit, so it breaks some tests with search functions that don't take extra kwargs.
  • Use a simple simulated annealing approach: while descending the tree also check some random signature to see if there is a better score.
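A sketch of best-first search over a toy tree (`TreeNode` and the set-overlap scoring are illustrative stand-ins; real scores would come from Bloom filter lookups). Because an internal node stores the union of everything below it, its score is an upper bound on any descendant leaf's score, so the first leaf popped off the priority queue is guaranteed to be a best match.

```python
import heapq


class TreeNode:
    def __init__(self, hashes, children=()):
        self.hashes = set(hashes)
        self.children = list(children)


def best_first_match(root, query_hashes):
    # Max-heap via negated scores; a counter breaks ties so heapq never
    # tries to compare TreeNode objects directly.
    counter = 0
    heap = [(-len(root.hashes & query_hashes), counter, root)]
    while heap:
        neg_score, _, node = heapq.heappop(heap)
        if neg_score == 0:
            continue  # prune: nothing below this node overlaps the query
        if not node.children:
            return node, -neg_score  # first leaf popped is a best match
        for child in node.children:
            counter += 1
            heapq.heappush(heap, (-len(child.hashes & query_hashes),
                                  counter, child))
    return None, 0
```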
@satishv commented Jan 14, 2020

Thanks for all of your attention on this task. Currently, the gather function does not support multi-threading. Has that been considered before? I also emailed you about this.


@ctb (Member) commented Jan 14, 2020

Embarrassing realization: my earlier benchmarking of lca gather was off, because the v2.3.1 code modified the incoming query object and removed all the minhashes, so I was essentially comparing gather on an empty signature (in v2.3.1) to gather on a full signature (in v3.0.1). When I run a real comparison, the LCA gather is about 50% faster in v3 than in v2.3.1.

OK, now that I have that figured out, going to take a look at the SBT performance.

@ctb (Member) commented Jan 14, 2020

Well, at least one problem (mentioned above) is that neither gather implementation (LCA or SBT) was using thresholding in search. That's fixed in #843. My benchmarking is highly variable but I'm seeing what looks like consistent improvements in gather performance on SBTs.

@luizirber luizirber added this to the pre-4.0 milestone Jan 14, 2020
@luizirber (Member, Author) commented Jan 14, 2020

> Thanks for all of your attention on this task. Currently, the gather function does not support multi-threading. Has that been considered before? I also emailed you about this.

One of the big reasons to move to rust, as I understand it, is better support for multithreading. Before, the C++ layer was a real pain in terms of threading...
The algorithm is not trivial to parallelize, however.

Maybe, maybe... It's not trivial, but I've been thinking about doing something like this, using rayon:
https://github.com/oconnor663/bao/blob/a2fedd649487de63e4d0c0a9e45f5a22d1e46365/src/hash.rs#L224L270

But that's still a bit far in the future, because #532 needs to land (3.1), and then I need to write the Index Python abc/Rust trait bridge (3.3). I want to focus on improving compute (3.2, check #845) before doing the bridge.
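To make the shape of that parallelism concrete in Python terms (`find_best_in_db` and the list-of-pairs database layout are toy stand-ins, not sourmash APIs): each gather round must still run sequentially, because every round subtracts the previous best match from the query, but scoring the candidate databases within one round is embarrassingly parallel.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy stand-ins: a "database" here is a list of (name, hash-set) pairs;
# find_best_in_db is hypothetical, not a sourmash API.


def find_best_in_db(db, query_hashes):
    # Score every signature in one database and keep the best.
    best = max(db, key=lambda item: len(item[1] & query_hashes))
    return len(best[1] & query_hashes), best[0]


def parallel_find_best(dblist, query_hashes, max_workers=4):
    # One gather round: score every database concurrently, then take the
    # overall best. (score, name) tuples compare score-first.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        scores = pool.map(lambda db: find_best_in_db(db, query_hashes),
                          dblist)
        return max(scores)
```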

@satishv commented Jan 15, 2020

For some more context: we are running with a DB file (200 GB zipped) with over 100k genomes. Would sorting these genomes in the DB file help? As a last resort, we may consider chopping the large DB file into multiple DB files; even passing both start and end indexes into the DB file would be awesome, if you can support that kind of DB lookup. Happy to expand on this further. Not sure I am making myself very clear here. cc @metajinomics

@ctb (Member) commented Jan 15, 2020

hi @satishv, one thing you can do today would be to split the large DB into multiple files, as you suggest. If you can do so in such a way that related genomes are kept in the same database subset, that might improve things significantly.

A rough pipeline for doing so might be to:

  • take something like the GTDB taxonomy database that I just posted on my blog
  • run sourmash lca classify on all your genome signatures (should take ~3-6 hours)
  • pick a taxonomic level that groups your genomes into conveniently sized chunks of 20k or less
  • make databases for each of those.
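The grouping and chunking steps above can be sketched as a small pass over the classification output. The CSV column names (`name`, `phylum`), the chosen rank, and the 20k cap are illustrative assumptions here, not the actual `sourmash lca classify` output format.

```python
import csv
import io

# Hypothetical sketch: group signature names by a taxonomic rank so each
# group can be indexed into its own database. Column names are assumed.


def group_by_rank(csv_text, rank='phylum', max_chunk=20000):
    groups = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        groups.setdefault(row[rank], []).append(row['name'])

    # Split any oversized group into chunks of at most max_chunk genomes,
    # matching the "20k or less" guideline above.
    chunks = {}
    for value, names in groups.items():
        for i in range(0, len(names), max_chunk):
            chunks['{}_{}'.format(value, i // max_chunk)] = \
                names[i:i + max_chunk]
    return chunks
```

Each resulting chunk would then be indexed into its own database, keeping related genomes together.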

best,
--titus

@satishv commented Jan 15, 2020

@ctb - Thanks for the above info. Doing the above will certainly add to our timelines, and we are hoping for small changes on our side at this point. It would help if you also make changes on your end to speed things up as much as possible.


@satishv commented Jan 15, 2020

thanks a lot!

We are using sourmash 2.0.1 now. I assume no outputs or formats will change in 4.0 or in the above fix? I assume we will be able to easily swap 2.0.1 for your fix, ideally without a lot of effort?

cc @brendanwee


@satishv commented Jan 17, 2020

Thanks! Looking forward to next week; we hope to incorporate your changes as soon as they are ready. Can't wait.

@ctb (Member) commented Jan 19, 2020

re improving SBT organization, I wanted to link in #756, which is a very nice discussion of the issue, and also #545.
