Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix druid-bloom-filter thread-safety #6584

Merged
merged 6 commits into from
Nov 9, 2018

Conversation

clintropolis
Copy link
Member

@clintropolis clintropolis commented Nov 7, 2018

Alternate approach to what was taken in #6547, instead embedding the BloomKFilter implementation in druid-bloom-filters extension, applying a fix from HIVE-20893 and wrapping BYTE_ARRAY_4 which is re-used instead of re-allocated per function call, for ints/floats, with ThreadLocal to fix the thread safety issues.

Fixes #6546

The copied BloomKFilter is from hive-storage-api as of verison 2.7.0 instead of master, since this commit seems to have broken backwards compatibility by using 8 bytes instead of 4 to store and test ints/floats. I went with this version specifically to maintain compatibility with the hive version used in the original PR, #6222, despite I think the latest implementation being a bit better since it doesn't use the static variable.

The copied implementation should be removed once the linked hive ticket is resolved and the fix is released so we can update hive-storage-api.

@clintropolis
Copy link
Member Author

Confirmed working as expected on test cluster

@b-slim
Copy link
Contributor

b-slim commented Nov 7, 2018

The original intent of PR #6222 , is to build the BloomK sketch in Hive and handed it over to Druid (via the wire as a filter) to filter down the scan, therefore, for this to work we need to make sure that both Hive and Druid are using the same algorithm for building sketch and probing it.
This implies that the best way to fix this is to push the fix to Hive Storage API and drive a storage API release (FYI storage API release is independent from Hive release life cycle hence goes faster).
In the meantime to not block 0.13 I think we can go with this option where the code is copied over to Druid it self waiting on the release with the bug fix but for the longer term it is better to have it in one place to avoid issues.
@clintropolis if your goal is to build your own bloom filter sketch from scratch for performance reasons or any other goal, then it is better to introduce a new kind of filter. The other way to do it is to introduce a version blob within the sketch it self to pick the correct serde/probing implementation but this will need way more work...

NOTICE Outdated
* update this when this patch is committed

This product contains code adapted from Apache Hive
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is already one entry for Hive above, to make this more relevant maybe need to add what is copied and commit or release version.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Heh, oops I'm dumb and didn't actually read the NOTICE file, just knew i should add something since I copied it from elsewhere, will remove this extra 👍

@clintropolis
Copy link
Member Author

I don't particularly have a goal at this point other than fixing the thread safety issue that is blocking 0.13. FWIW I did add a handful of tests to ensure the copied BloomKFilter implementation is compatible with the hive implementation.

The thing that makes it ambiguous of what is the best path forward to me is that I'm not certain the thread safety issues faced here necessarily are concerns with how hive is using this stuff (they may be, but I have no idea and don't want to depend on that). In other words, I'm unsure if the thread safety issue is a hive issue or a druid issue (and also I haven't yet even found how to open an issue on hive yet😛)

Also as I mentioned, the other issue is that the latest version of BloomKFilter in hive is incompatible with the current version we are using, which my hive compat test on 'floats' caught because I initially copied BloomKFilter from master. I don't know if this is intended or not, the jira associated with the change makes no mention of breaking compatibility so maybe it was accidental.

I agree that if for whatever reason I need to break hive compatibility it will be more appropriate to put in another extension, or yeah, rework this one to support multiple implementations.


public boolean testBytes(byte[] val, int offset, int length)
{
long hash64 = val == null ? Murmur3.NULL_HASHCODE :
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Particular reason why Murmur3 is used? xxHash from https://github.com/OpenHFT/Zero-Allocation-Hashing is much faster and allows to hash for example integers directly garbage-free, without resorting to ThreadLocals such as BYTE_ARRAY_4

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know why specifically this is implemented the way it is here (since this code is copied from hive), but in the latest version of BloomKFilter BYTE_ARRAY_4 is no longer used and it uses Murmur3 directly,

public boolean testInt(int val) {
  return testHash(Murmur3.hash64(val));
}

However, this seems to be what breaks compatibility with BloomKFilters generated with the current version of the hive storage library used by this extension. I suspect this is perhaps an unintended consequence of the modification, as I see no mention of it in the jira ticket. I haven't looked closely at the Murmur3 implementation yet, but it at least looks like it avoids using any patterns like BYTE_ARRAY_4 in this version of BloomKFilter.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And to clarify, my modifications to BloomKFilter besides formatting changes to make it adhere to style stuff, is limited to using ThreadLocal for the masks and BYTE_ARRAY_4 variables.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@leventov Do you have any Java Benchmarks that compares XXHash to Murmur3 or 2?

Copy link
Member

@leventov leventov Nov 9, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@b-slim https://github.com/OpenHFT/Zero-Allocation-Hashing#performance. xxHash is ~2 times faster on a long run, with ~2 times shorter bootstrap (6 vs 12 ns) too. Please create issues in Druid or Hive for replacing the algo.

@jihoonson jihoonson self-assigned this Nov 8, 2018
@jihoonson jihoonson added this to the 0.13.0 milestone Nov 8, 2018
@jihoonson jihoonson added the Bug label Nov 8, 2018
@nishantmonu51
Copy link
Member

@clintropolis: Can you also submit a bug and patch against hive for fixing the same in hive-storage-api. I think its ok to have the code copied over here until these changes are merged on hive side and we get a new release.

Have looked at the code changes, they seem good, will do some more testing internally and update here.

@clintropolis
Copy link
Member Author

Sure, I should be able to do that. Is this information at https://cwiki.apache.org/confluence/display/Hive/HowToContribute#HowToContribute still accurate? It looks like I send something to the dev mailing list, then jira, then patch? I can bring up the int/float incompatibility I saw between 2.7.0 and latest as well.

* This is a direct modification of the Apache Hive 'BloomKFilter', found at:
* https://github.com/apache/hive/blob/master/storage-api/src/java/org/apache/hive/common/util/BloomKFilter.java
* modified to store variables which are re-used instead of re-allocated per call as {@link ThreadLocal} so multiple
* threads can share the same filter object. Note that this is snapshot at hive-storag-api version 2.7.0, 3.2.0+ break
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

am not aware of 3.2.0 release, therefore maybe it is better to reframe this statement or remove 3.2.0.
FYI The incompatibility you are referring too it is not released yet.
Also i recommend adding a TODO, statement to clarify that this is a temporary solution and to remove this class as soon as HIVE-20893 is fixed and released.

@b-slim
Copy link
Contributor

b-slim commented Nov 8, 2018

@clintropolis there is 2 ways to submit your contribution.
Option 1 open a pull request against https://github.com/apache/hive with HIVE-20893: Your_PR_Title XXX.
Option 2 is to submit a diff using https://reviews.apache.org/r/new/ against hive-git

@clintropolis
Copy link
Member Author

Thanks @b-slim I will open a PR on github today 👍

Copy link
Member

@nishantmonu51 nishantmonu51 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 post travis, have tested the latest changes with hive filtering use case.

@b-slim b-slim merged commit c2f020e into apache:master Nov 9, 2018
@clintropolis
Copy link
Member Author

Updated the PR description to match what was merged

@clintropolis clintropolis deleted the bloom-filter-not-threadsafe branch November 9, 2018 20:41
@dclim
Copy link
Contributor

dclim commented Nov 13, 2018

@clintropolis could you do a backport to 0.13.0 for this?

clintropolis added a commit to clintropolis/druid that referenced this pull request Nov 13, 2018
* use BloomFilter instead of BloomKFilter since the latters test method is not threadsafe

* fix formatting

* style and forbidden api

* remove redundant hive notice entry

* add todo with note to delete copied implementation and link to related hive jira

* better fix for masks than ThreadLocal
fjy pushed a commit that referenced this pull request Nov 13, 2018
* use BloomFilter instead of BloomKFilter since the latters test method is not threadsafe

* fix formatting

* style and forbidden api

* remove redundant hive notice entry

* add todo with note to delete copied implementation and link to related hive jira

* better fix for masks than ThreadLocal
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BloomKFilter used by BloomDimFilter not thread-safe leading to unstable results
6 participants