Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add bloom filter druid expression #6904

Merged
merged 4 commits into from
Jan 28, 2019
Merged

Conversation

clintropolis
Copy link
Member

Adds a bloom_filter_test Druid expression to allow use in ExpressionVirtualColumn and ExpressionDimFilter, as well as SQL virtual column expression support.

…ExpressionVirtualColumn and ExpressionDimFilter and sql expressions
Copy link
Member

@asdf2014 asdf2014 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM 👍 Also, I left few comments.

The bloom filter extension also adds a bloom filter [Druid expression](../../misc/math-expr.html) which shares syntax
with the SQL operator.

```
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add the sql type for this block

with the SQL operator.

```
bloom_filter_test(<expr>, "<serialized_bytes_for_BloomKFilter>")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use '<serialized_bytes_for_BloomKFilter>' instead of "<serialized_bytes_for_BloomKFilter>"

@@ -60,6 +60,7 @@ The following built-in functions are available.
|like|like(expr, pattern[, escape]) is equivalent to SQL `expr LIKE pattern`|
|case_searched|case_searched(expr1, result1, \[\[expr2, result2, ...\], else-result\])|
|case_simple|case_simple(expr, value1, result1, \[\[value2, result2, ...\], else-result\])|
|bloom_filter_test|bloom_filter_test(expr,filter) tests the value of 'expr' against 'filter', a bloom filter serialized as a base64 string. See [bloom filter extension](../development/extensions-core/bloom-filter.html) documentation for additional details.|
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a blank character between expr and filter, e.g., (expr, filter)



final String serializedFilter = filterExpr.getLiteralValue().toString();
final byte[] decoded = BaseEncoding.base64().decode(serializedFilter);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe use java.util.Base64.getDecoder().decode(serializedFilter) is a better choice..

FYI, I found some benchmark results using the initial Java 8 release on the Xeon E5-2650@2.8 Ghz

Name Encode, 100 bytes Decode, 100 bytes Encode, 1000 bytes Decode, 1000 bytes Encode, 200000000 bytes Decode, 200000000 bytes
JavaXmlImpl 0.721 sec 1.067 sec 0.664 sec 0.947 sec 0.689 sec 0.885 sec
Java8Impl 0.808 sec 1.101 sec 0.712 sec 0.913 sec 0.699 sec 0.887 sec
MiGBase64Impl 0.887 sec 2.113 sec 0.788 sec 1.978 sec 0.812 sec 1.928 sec
IHarderImpl 1.179 sec 2.645 sec 1.012 sec 2.439 sec 0.976 sec 2.431 sec
ApacheImpl 4.113 sec 4.733 sec 2.308 sec 2.667 sec 2.205 sec 2.552 sec
GuavaImpl 3.305 sec 4.178 sec 3.12 sec 3.755 sec 3.102 sec 3.744 sec
SunImpl 11.153 sec 8.281 sec 3.992 sec 4.533 sec 3.289 sec 4.096 sec

filter.addBytes(null, 0, 0);
}
byte[] bytes = BloomFilterSerializersModule.bloomKFilterToBytes(filter);
String base64 = Base64.encodeBase64String(bytes);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same, use java.util.Base64.getEncoder().encodeToString(bytes) would be better.

BloomKFilter filter = new BloomKFilter(1500);
filter.addDouble(20.2);
byte[] bytes = BloomFilterSerializersModule.bloomKFilterToBytes(filter);
String base64 = Base64.encodeBase64String(bytes);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same, use java.util.Base64.getEncoder().encodeToString(bytes) would be better.

@clintropolis
Copy link
Member Author

Thanks for the review and benchmarks @asdf2014 👍, switched to use java.util.Base64 and adjusted the docs.

@asdf2014
Copy link
Member

@clintropolis You are welcome. Maybe we should replace all non JDK Base64 in the project and record it to druid-forbidden-apis. I will open a new PR later.

Copy link
Member

@asdf2014 asdf2014 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@clintropolis
Copy link
Member Author

Maybe we should replace all non JDK Base64 in the project and record it to druid-forbidden-apis. I will open a new PR later.

I only saw 2 other occurrences I think, both in tests, but probably a good idea to mark forbidden given those benchmarks 👍

@clintropolis
Copy link
Member Author

clintropolis commented Jan 24, 2019

Nevermind, I see a handful of commons base64 usage in places, so probably some improvements to be had there by making the switch.

@asdf2014
Copy link
Member

@clintropolis Yep, I think so.

@clintropolis
Copy link
Member Author

@asdf2014 It would also probably be worth considering driving these conversions through StringUtils class to make future changes easier, it already has a utf8Base64 method that converts a string to a base64 string, so just expanding on that to handle to/from bytes/strings might be appropriate.

@asdf2014
Copy link
Member

Instances of {@link Decoder} class are safe for use by multiple concurrent threads.

@clintropolis Yep, I agree with you. In addition, I found the above comment from JDK, so that we can also hold the instance of Decoder and Encoder in the StringUtils class without worrying about the multi-thread problem. (Although, Base64.Encoder#RFC4648 is still a single instance 😅)

@asdf2014
Copy link
Member

@clintropolis I raised up a PR #6913, PTAL.

Copy link
Contributor

@gianm gianm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@gianm gianm added this to the 0.14.0 milestone Jan 28, 2019
@gianm gianm merged commit af3cbc3 into apache:master Jan 28, 2019
@clintropolis clintropolis deleted the bloom-filter-expr branch February 1, 2019 22:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants