-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add ability to pass in Bloom filter from Hive Queries #6222
Conversation
6458fd3
to
59d6a07
Compare
Hi @nishantmonu51, this is really cool and something that seems useful outside of Hive / Druid interop too. Is there a way to do this without making it so Hive specific? Could the core bloom filter code be extracted into a library and put into a I am thinking of use cases like approximate
It's not exact but with appropriately sized filters it would still be useful for some use cases. |
@gianm: sure, I agree its a useful feature in general. |
A cool extension.IMO it's better to extract the implementation of BloomKFilter to avoid any possible problems. |
59d6a07
to
ea066c1
Compare
@gianm: Updated based on your comments, please review. |
@nishantmonu51 first there is no doc about how to use this extension. Secondly, I think you forget to add this extension to |
*/ | ||
public class BloomDimFilter implements DimFilter | ||
{ | ||
public static byte BLOOM_DIM_FILTER_CACHE_ID = 0x10; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would it be better to put this in DimFilterUtils
? I think it'd be easier to have all the filter cache IDs in one place, and this is a core extension
addDeserializer(BloomKFilter.class, new BloomKFilterDeserializer()); | ||
} | ||
|
||
public static class BloomKFilterserializer extends StdSerializer<BloomKFilter> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BloomKFilterserializer -> BloomKFilterSerializer
Hi @nishantmonu51, do you think this should be a blocking issue for 0.13.0 release? |
@nishantmonu51 Can you add docs for this? An example of what a serialized BloomFilter input looks like and what it represents would be helpful IMO. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tests for virtual columns are failing for me, but after making changes mentioned in my review comments (and adding to distribution pom), this extension works for me and is pretty rad!
@@ -0,0 +1 @@ | |||
org.apache.druid.guice.BloomFilterExtensionModule |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this file has the wrong name, should be org.apache
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed.
@@ -0,0 +1,388 @@ | |||
/* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test is located in a folder that does't match the package structure (io/druid/..
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed.
@nishantmonu51 can you address comments? Otherwise we can move this to 0.13.1 |
Will update the PR today with the fixes. It will be good to get this in 0.13.0 |
ea066c1
to
e8aec5c
Compare
@jon-wei @clintropolis @QiuMM Thanks for the review. Added docs and handled review comments. |
This travis failure looks legitimate, but 👍 after travis and conflicts are resolved |
fix checkstyle review comments Fix wierd failure review comments Revert "Fix wierd failure" This reverts commit a13a83a.
e8aec5c
to
305d083
Compare
By the way, I built a bloom filter aggregator on top of this last weekend once I figured out |
|-------------------------|------------------------------|----------------------------------| | ||
|`type` |Filter Type. Should always be `bloom`|yes| | ||
|`dimension` |The dimension to filter over. | yes | | ||
|`bloomKFilter` |Binary representation of `org.apache.hive.common.util.BloomKFilter`| yes | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nits:
"Base64 encoded Binary representation....." ?
Also, It appears to me that this module(and property) are really usable only if someone was using hive util BloomKFilter
class on the client side because serialized representation is specific to that class and not something standard, so maybe we could add "hive" in the names everywhere e.g. hive-bloom-filter
module and this property being hiveUtilBloomKFilter
.
or I'm missing something :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated, yes, its true that the implementation is currently shared from hive-storage-api jar, but the implementation as such does not depend on any of the hive specific functionality.
I initially made this a hive-bloom-filter but later on renamed to bloom-filter based on review comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's better to not have hive
in the names. I can imagine evolving this extension to support some non-hive format too, and make it more divorced from hive. Especially since with the aggregator @clintropolis mentioned, people could create serialized bloom filters without really needing to worry about the fact that they came from hive.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be clear, if we did evolve it that way, it should still support the hive format (to work with the original use case from @nishantmonu51). I just think there is a lot of potential for this extension to be useful beyond hive integration.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah, I understand the naming now. ( I knew I was missing something ).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks that teamcity fail isn't related to this PR.
This PR adds a BloomDimFilter which can be used by Apache Hive to pass in BloomFilters.
Use Case -
We have fact table in druid and slowly changing dimension/lookup tables in Apache Hive and need to join those tables.
e.g. Consider the case of SSB Benchmark when lineorder is stored in Druid and parts table is in hive For following query from SSB Benchmark -
In the above query Hive can scan parts table, create a bloom filter for possible values for p_part_key where p_category = 'MFGR#14'. This bloom filter can then be pushed to Druid reducing the data that needs to scanned and transferred between Druid and Hive.
Since BloomFilter is probablistic data structure and can have false positives. Hive will still need to do filtering while processing joins.