[gcs] HashMatchAny: faster filter matches for large queries #122
Conversation
force-pushed from e8ae6e6 to 12890a9
force-pushed from 4babee0 to 938f41c
force-pushed from 938f41c to 0b667a2
Nice! This'll significantly speed up larger wallets that are using the neutrino protocol. I left one comment regarding the way it estimates which algorithm to use. As is, it compares the size of the filter rather than the number of elements in the filter. Not sure if that was intended or not.
// TODO(conner): add real heuristics to query optimization
switch {
case len(data) >= int(f.N()/2):
So N here is actually the number of bytes, and not necessarily the number of elements, as it's about 3 bits per element or so as is.
Turns out N is the number of elements, so I've left the heuristics intact. Did realize that the benchmark was using 2000 instead of 5000, so I went back and updated the body with metrics for 5K filters.
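The heuristic discussed above can be sketched as follows. This is a simplified model with hypothetical names, not the actual gcs code: once the number of query entries reaches roughly half the filter's element count N, the hash-join matcher is chosen; otherwise the sort-and-zip matcher stays cheaper.

```go
package main

import "fmt"

// chooseMatcher is a hypothetical sketch of the selection heuristic:
// compare the query size against N/2, where n is the number of
// elements in the filter (not its byte size).
func chooseMatcher(numQueries int, n uint32) string {
	if numQueries >= int(n/2) {
		// Large query set: build a hash index over the filter entries.
		return "hash"
	}
	// Small query set: sort the queries and zip through both sets.
	return "zip"
}

func main() {
	fmt.Println(chooseMatcher(100, 5000))  // prints "zip"
	fmt.Println(chooseMatcher(4000, 5000)) // prints "hash"
}
```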
force-pushed from 0b667a2 to 41876c0
LGTM 🎨
This PR proposes a different gcs filter querying mechanism, intended to perform better as the number of query entries surpasses the number of elements in the filter.
As the number of elements in the query grows, allocating and sorting the query elements begins to dominate the runtime. The solution for large queries is inspired by a hash join, which makes no assumptions about the input ordering of either set. Since the number of filter entries is ultimately bounded by the block size, the filter entries are chosen as the hash index so that the setup latency is minimized.
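The hash-join idea described above can be sketched like this (a simplified model, not the actual gcs implementation): the decoded filter entries, being the bounded set, are loaded into a hash set once, and each query entry is then probed against it without any sorting.

```go
package main

import "fmt"

// hashMatchAny sketches hash-join-style matching: build a hash index
// over the filter entries (the smaller, block-bounded set), then probe
// each query entry against it in expected constant time.
func hashMatchAny(filterEntries, queries []uint64) bool {
	// Setup: build the hash index from the filter entries.
	set := make(map[uint64]struct{}, len(filterEntries))
	for _, v := range filterEntries {
		set[v] = struct{}{}
	}
	// Online: probe each query entry; neither set needs to be sorted.
	for _, q := range queries {
		if _, ok := set[q]; ok {
			return true
		}
	}
	return false
}

func main() {
	filter := []uint64{3, 17, 42}
	fmt.Println(hashMatchAny(filter, []uint64{9, 42})) // prints "true"
	fmt.Println(hashMatchAny(filter, []uint64{9, 11})) // prints "false"
}
```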
Complexity
Number of filter entries: F
Number of query entries: Q
Assumption: Q > F
Setup
Online
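The values of the Setup/Online table are not reproduced here. As an assumption based on the textbook behavior of a hash join over these two sets (not the PR's measured figures), the expected costs would be:

```latex
\text{Setup (build hash index over filter entries):} \quad O(F)
\text{Online (probe each query entry):} \quad O(Q) \text{ expected}
\text{vs. sort-and-zip:} \quad O(Q \log Q + F + Q)
```

This is consistent with the claim above that sorting the query set dominates once Q grows large: the hash join replaces the $O(Q \log Q)$ sort with an $O(F)$ build over the block-bounded filter set.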
Benchmarks
Zip w/ 5K Filter Elements
Zip w/ 10K Filter Elements
Hash-Join w/ 5K Filter Elements
Hash-Join w/ 10K Filter Elements
Hybrid w/ 5K Filter Elements
Hybrid w/ 10K Filter Elements
Ratio Zip/Hash