-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix druid-bloom-filter thread-safety #6584
fix druid-bloom-filter thread-safety #6584
Conversation
Confirmed working as expected on test cluster |
The original intent of PR #6222 , is to build the BloomK sketch in Hive and handed it over to Druid (via the wire as a filter) to filter down the scan, therefore, for this to work we need to make sure that both Hive and Druid are using the same algorithm for building sketch and probing it. |
NOTICE
Outdated
* update this when this patch is committed | ||
|
||
This product contains code adapted from Apache Hive |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there is already one entry for Hive above, to make this more relevant maybe need to add what is copied and commit or release version.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Heh, oops I'm dumb and didn't actually read the NOTICE
file, just knew i should add something since I copied it from elsewhere, will remove this extra 👍
I don't particularly have a goal at this point other than fixing the thread safety issue that is blocking 0.13. FWIW I did add a handful of tests to ensure the copied The thing that makes it ambiguous of what is the best path forward to me is that I'm not certain the thread safety issues faced here necessarily are concerns with how hive is using this stuff (they may be, but I have no idea and don't want to depend on that). In other words, I'm unsure if the thread safety issue is a hive issue or a druid issue (and also I haven't yet even found how to open an issue on hive yet😛) Also as I mentioned, the other issue is that the latest version of I agree that if for whatever reason I need to break hive compatibility it will be more appropriate to put in another extension, or yeah, rework this one to support multiple implementations. |
|
||
public boolean testBytes(byte[] val, int offset, int length) | ||
{ | ||
long hash64 = val == null ? Murmur3.NULL_HASHCODE : |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Particular reason why Murmur3 is used? xxHash from https://github.com/OpenHFT/Zero-Allocation-Hashing is much faster and allows to hash for example integers directly garbage-free, without resorting to ThreadLocals such as BYTE_ARRAY_4
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know why specifically this is implemented the way it is here (since this code is copied from hive), but in the latest version of BloomKFilter
BYTE_ARRAY_4
is no longer used and it uses Murmur3 directly,
public boolean testInt(int val) {
return testHash(Murmur3.hash64(val));
}
However, this seems to be what breaks compatibility with BloomKFilters
generated with the current version of the hive storage library used by this extension. I suspect this is perhaps an unintended consequence of the modification, as I see no mention of it in the jira ticket. I haven't looked closely at the Murmur3 implementation yet, but it at least looks like it avoids using any patterns like BYTE_ARRAY_4
in this version of BloomKFilter
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And to clarify, my modifications to BloomKFilter
besides formatting changes to make it adhere to style stuff, is limited to using ThreadLocal
for the masks
and BYTE_ARRAY_4
variables.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@leventov Do you have any Java Benchmarks that compares XXHash to Murmur3 or 2?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@b-slim https://github.com/OpenHFT/Zero-Allocation-Hashing#performance. xxHash is ~2 times faster on a long run, with ~2 times shorter bootstrap (6 vs 12 ns) too. Please create issues in Druid or Hive for replacing the algo.
@clintropolis: Can you also submit a bug and patch against hive for fixing the same in hive-storage-api. I think its ok to have the code copied over here until these changes are merged on hive side and we get a new release. Have looked at the code changes, they seem good, will do some more testing internally and update here. |
Sure, I should be able to do that. Is this information at https://cwiki.apache.org/confluence/display/Hive/HowToContribute#HowToContribute still accurate? It looks like I send something to the dev mailing list, then jira, then patch? I can bring up the int/float incompatibility I saw between 2.7.0 and latest as well. |
* This is a direct modification of the Apache Hive 'BloomKFilter', found at: | ||
* https://github.com/apache/hive/blob/master/storage-api/src/java/org/apache/hive/common/util/BloomKFilter.java | ||
* modified to store variables which are re-used instead of re-allocated per call as {@link ThreadLocal} so multiple | ||
* threads can share the same filter object. Note that this is snapshot at hive-storag-api version 2.7.0, 3.2.0+ break |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
am not aware of 3.2.0 release, therefore maybe it is better to reframe this statement or remove 3.2.0.
FYI The incompatibility you are referring too it is not released yet.
Also i recommend adding a TODO, statement to clarify that this is a temporary solution and to remove this class as soon as HIVE-20893 is fixed and released.
@clintropolis there is 2 ways to submit your contribution. |
Thanks @b-slim I will open a PR on github today 👍 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 post travis, have tested the latest changes with hive filtering use case.
Updated the PR description to match what was merged |
@clintropolis could you do a backport to 0.13.0 for this? |
* use BloomFilter instead of BloomKFilter since the latters test method is not threadsafe * fix formatting * style and forbidden api * remove redundant hive notice entry * add todo with note to delete copied implementation and link to related hive jira * better fix for masks than ThreadLocal
* use BloomFilter instead of BloomKFilter since the latters test method is not threadsafe * fix formatting * style and forbidden api * remove redundant hive notice entry * add todo with note to delete copied implementation and link to related hive jira * better fix for masks than ThreadLocal
Alternate approach to what was taken in #6547, instead embedding the
BloomKFilter
implementation indruid-bloom-filters
extension, applying a fix from HIVE-20893 and wrappingBYTE_ARRAY_4
which is re-used instead of re-allocated per function call, for ints/floats, withThreadLocal
to fix the thread safety issues.Fixes #6546
The copied
BloomKFilter
is fromhive-storage-api
as of verison 2.7.0 instead of master, since this commit seems to have broken backwards compatibility by using 8 bytes instead of 4 to store and test ints/floats. I went with this version specifically to maintain compatibility with the hive version used in the original PR, #6222, despite I think the latest implementation being a bit better since it doesn't use the static variable.The copied implementation should be removed once the linked hive ticket is resolved and the fix is released so we can update
hive-storage-api
.