Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ability to pass in Bloom filter from Hive Queries #6222

Merged
merged 3 commits into from
Sep 26, 2018

Conversation

nishantmonu51
Copy link
Member

This PR adds a BloomDimFilter which can be used by Apache Hive to pass in BloomFilters.

Use Case -
We have fact table in druid and slowly changing dimension/lookup tables in Apache Hive and need to join those tables.
e.g. Consider the case of SSB Benchmark when lineorder is stored in Druid and parts table is in hive For following query from SSB Benchmark -

select sum(total_revenue) from druid.ssb_lineorder_100, hive.ssb_lineorder_100 WHERE lo_partkey = p_partkey and p_category = 'MFGR#14';

In the above query Hive can scan parts table, create a bloom filter for possible values for p_part_key where p_category = 'MFGR#14'. This bloom filter can then be pushed to Druid reducing the data that needs to scanned and transferred between Druid and Hive.
Since BloomFilter is probablistic data structure and can have false positives. Hive will still need to do filtering while processing joins.

@gianm
Copy link
Contributor

gianm commented Aug 23, 2018

Hi @nishantmonu51, this is really cool and something that seems useful outside of Hive / Druid interop too. Is there a way to do this without making it so Hive specific? Could the core bloom filter code be extracted into a library and put into a druid-bloom-filter extension?

I am thinking of use cases like approximate x IN (SELECT y FROM … WHERE …) filters done like:

  • Run the subquery with a bloom filter aggregator that builds a bloom filter object for all y matching the inner query.
  • Turn around and apply that bloom filter object as a filter on the main query.

It's not exact but with appropriately sized filters it would still be useful for some use cases.

@nishantmonu51
Copy link
Member Author

@gianm: sure, I agree its a useful feature in general.
I can rename the extension to druid-bloom-filter, will leave the dependency on hive-storage-api for bloom Filter implementation, I don't think we need to copy the implementation as hive-storage-api is just has few classes and does not have many transitive dependencies.
Does this sounds good to you ?

@QiuMM
Copy link
Member

QiuMM commented Aug 23, 2018

A cool extension.IMO it's better to extract the implementation of BloomKFilter to avoid any possible problems.

@nishantmonu51
Copy link
Member Author

@gianm: Updated based on your comments, please review.

@QiuMM
Copy link
Member

QiuMM commented Sep 8, 2018

@nishantmonu51 first there is no doc about how to use this extension. Secondly, I think you forget to add this extension to distribution/pom.xml.

*/
public class BloomDimFilter implements DimFilter
{
public static byte BLOOM_DIM_FILTER_CACHE_ID = 0x10;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would it be better to put this in DimFilterUtils? I think it'd be easier to have all the filter cache IDs in one place, and this is a core extension

addDeserializer(BloomKFilter.class, new BloomKFilterDeserializer());
}

public static class BloomKFilterserializer extends StdSerializer<BloomKFilter>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BloomKFilterserializer -> BloomKFilterSerializer

@jihoonson
Copy link
Contributor

Hi @nishantmonu51, do you think this should be a blocking issue for 0.13.0 release?

@jon-wei
Copy link
Contributor

jon-wei commented Sep 17, 2018

@nishantmonu51 Can you add docs for this? An example of what a serialized BloomFilter input looks like and what it represents would be helpful IMO.

Copy link
Member

@clintropolis clintropolis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tests for virtual columns are failing for me, but after making changes mentioned in my review comments (and adding to distribution pom), this extension works for me and is pretty rad!

@@ -0,0 +1 @@
org.apache.druid.guice.BloomFilterExtensionModule
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this file has the wrong name, should be org.apache

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed.

@@ -0,0 +1,388 @@
/*
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test is located in a folder that does't match the package structure (io/druid/..)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed.

extensions-core/druid-bloom-filter/pom.xml Outdated Show resolved Hide resolved
@fjy
Copy link
Contributor

fjy commented Sep 25, 2018

@nishantmonu51 can you address comments? Otherwise we can move this to 0.13.1

@nishantmonu51
Copy link
Member Author

Will update the PR today with the fixes. It will be good to get this in 0.13.0

@nishantmonu51
Copy link
Member Author

@jon-wei @clintropolis @QiuMM Thanks for the review. Added docs and handled review comments.

@clintropolis
Copy link
Member

This travis failure looks legitimate, but 👍 after travis and conflicts are resolved

fix checkstyle

review comments

Fix wierd failure

review comments

Revert "Fix wierd failure"

This reverts commit a13a83a.
@clintropolis
Copy link
Member

By the way, I built a bloom filter aggregator on top of this last weekend once I figured out BloomKFilter had a merge function. This way we can generate the bloom filters to feed into this directly from queries, which I'll submit in a PR once this gets merged. It's pretty similar to the cardinality agg. It's currently backported to a 0.12.3 branch of ours, and is missing tests, but it shouldn't be much to get it into master.

|-------------------------|------------------------------|----------------------------------|
|`type` |Filter Type. Should always be `bloom`|yes|
|`dimension` |The dimension to filter over. | yes |
|`bloomKFilter` |Binary representation of `org.apache.hive.common.util.BloomKFilter`| yes |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nits:
"Base64 encoded Binary representation....." ?

Also, It appears to me that this module(and property) are really usable only if someone was using hive util BloomKFilter class on the client side because serialized representation is specific to that class and not something standard, so maybe we could add "hive" in the names everywhere e.g. hive-bloom-filter module and this property being hiveUtilBloomKFilter .
or I'm missing something :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated, yes, its true that the implementation is currently shared from hive-storage-api jar, but the implementation as such does not depend on any of the hive specific functionality.
I initially made this a hive-bloom-filter but later on renamed to bloom-filter based on review comments.

Copy link
Contributor

@gianm gianm Sep 26, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's better to not have hive in the names. I can imagine evolving this extension to support some non-hive format too, and make it more divorced from hive. Especially since with the aggregator @clintropolis mentioned, people could create serialized bloom filters without really needing to worry about the fact that they came from hive.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be clear, if we did evolve it that way, it should still support the hive format (to work with the original use case from @nishantmonu51). I just think there is a lot of potential for this extension to be useful beyond hive integration.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, I understand the naming now. ( I knew I was missing something ).

Copy link
Contributor

@jihoonson jihoonson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks that teamcity fail isn't related to this PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants