Support default BYTES (zero-length byte array) in aggregation function and aggregator #4214
Codecov Report
@@ Coverage Diff @@
## master #4214 +/- ##
============================================
+ Coverage 67.11% 67.43% +0.31%
Complexity 20 20
============================================
Files 1035 1035
Lines 51520 51572 +52
Branches 7220 7245 +25
============================================
+ Hits 34577 34775 +198
+ Misses 14569 14395 -174
- Partials 2374 2402 +28
Can you point us to a concrete query/use case where this PR would help? What would the behavior be without this PR?
@kishoreg Without this PR, in order to create default columns for serialized Object such as HyperLogLog, TDigest etc., user needs to serialize the default Object (empty HyperLogLog or TDigest) in order to get the default value and then put it into the schema. The default values are quite long for these Objects (HyperLogLog: "00000008000000ac00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000"; QDigest: "3fa999999999999a0000000000000000000000000004b9467fffffffffffffff800000000000000000000000"; TDigest: "000000017ff0000000000000fff0000000000000405900000000000000000000"). Besides, this PR allows user to put empty byte array to skip a value which can save storage and reduce computation cost. E.g. for a HyperLogLog column, if there is no value inside, user can simply put an empty array instead of the serialized byte array. |
This is equivalent to supporting NULL for BYTES columns, right?
@kishoreg Yes, and we treat an empty byte array as NULL for BYTES.
OK. It's better to support NULL as first-class instead of checking for NULL on every row, right? Generate a NULL bitmap for this column and filter it out in the filter phase.
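A minimal sketch of the NULL bitmap idea, using `java.util.BitSet` from the JDK. The class and method names are hypothetical and are not Pinot APIs; the column is modeled as a plain `long[]` and the bitmap marks the docIds whose value is NULL.

```java
import java.util.BitSet;

final class NullBitmapSketch {
    // Sums the values of every docId that is NOT marked in the null bitmap.
    static long sumNonNull(long[] column, BitSet nullBitmap) {
        long sum = 0;
        // nextClearBit walks only the docIds that hold a real value.
        for (int docId = nullBitmap.nextClearBit(0);
             docId < column.length;
             docId = nullBitmap.nextClearBit(docId + 1)) {
            sum += column[docId];
        }
        return sum;
    }

    public static void main(String[] args) {
        long[] column = {5, 0, 7, 0}; // the stored value at a null docId is ignored
        BitSet nullBitmap = new BitSet(column.length);
        nullBitmap.set(1);
        nullBitmap.set(3);
        System.out.println(sumNonNull(column, nullBitmap)); // prints 12
    }
}
```

The bitmap is built once per column, so the aggregation loop does not inspect each value to decide whether it is NULL.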
Let’s create an issue for supporting NULL and point to this PR.
@kishoreg It cannot be done in the filter phase because, for a given document, there is no guarantee that all metrics are null. We can filter them out in the data fetching phase though. I think we need to add a new index type or reserve some dictId for null. Will create the issue.
NULL value is per column. Also, I think I now remember the discussion with Mayank. This will be a backward-incompatible change when we add NULL support.
@kishoreg I understand that NULL value is per column, but consider, for example, SELECT SUM(A), SUM(B) FROM table: for the same document, column A might have a NULL value while column B does not. We cannot apply an extra filter in such a case.
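To illustrate the SUM(A), SUM(B) case, here is a small sketch that models each column as a `Long[]` in which `null` stands for a NULL value. All names are hypothetical and nothing here reflects Pinot's actual execution path; it only contrasts skipping NULLs per column with dropping whole documents.

```java
import java.util.Arrays;
import java.util.Objects;

final class PerColumnNullSketch {
    // Sums one column, skipping only that column's own NULL entries.
    static long sumSkippingNulls(Long[] column) {
        return Arrays.stream(column)
                .filter(Objects::nonNull)
                .mapToLong(Long::longValue)
                .sum();
    }

    // Sums `target` after a document-level filter that drops any document
    // in which either column is NULL.
    static long sumWithDocFilter(Long[] target, Long[] other) {
        long sum = 0;
        for (int docId = 0; docId < target.length; docId++) {
            if (target[docId] != null && other[docId] != null) {
                sum += target[docId];
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        Long[] a = {1L, null, 3L};
        Long[] b = {10L, 20L, null};
        System.out.println(sumSkippingNulls(a));    // prints 4
        System.out.println(sumSkippingNulls(b));    // prints 30
        System.out.println(sumWithDocFilter(a, b)); // prints 1
        System.out.println(sumWithDocFilter(b, a)); // prints 10
    }
}
```

With the document-level filter, B's value 20 and A's value 3 are lost even though they are not NULL, which is why the skip has to happen per column.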
Discussed with @kishoreg offline. Will hold off on this PR and see how much effort it takes to support real NULL values, and then decide which route to take.
Treat zero-length byte array as null, and skip processing it
Add DefaultBytesQueriesTest for test coverage
Motivation:
Without this change, in order to create default columns for serialized Object such as HyperLogLog, TDigest etc., user needs to serialize the default Object (empty HyperLogLog or TDigest) to get the default value and put it into the schema. The default values are quite long for these Objects (HyperLogLog: "00000008000000ac00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000"; QDigest: "3fa999999999999a0000000000000000000000000004b9467fffffffffffffff800000000000000000000000"; TDigest: "000000017ff0000000000000fff0000000000000405900000000000000000000").
With this change, we directly treat empty byte array as null, so user doesn't need to specify such default values.
Besides, this change allows user to put empty byte array to skip a value which can save storage and reduce computation cost. E.g. for a HyperLogLog column, if there is no value inside, user can simply put an empty array instead of the serialized byte array.
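The skip behavior described above can be sketched roughly as follows. This is not the code in this PR: the class, the `aggregate` and `encode` helpers, and the use of an 8-byte `long` as a stand-in for a serialized HyperLogLog or TDigest are all assumptions made for illustration.

```java
import java.nio.ByteBuffer;
import java.util.function.BinaryOperator;
import java.util.function.Function;

final class EmptyBytesAggregationSketch {
    // Merges the deserialized values, treating a null or zero-length array
    // as NULL and skipping it. Returns null when every input was skipped.
    static <T> T aggregate(byte[][] values,
                           Function<byte[], T> deserializer,
                           BinaryOperator<T> merger) {
        T result = null;
        for (byte[] value : values) {
            if (value == null || value.length == 0) {
                continue; // default BYTES value: nothing to deserialize or merge
            }
            T deserialized = deserializer.apply(value);
            result = (result == null) ? deserialized : merger.apply(result, deserialized);
        }
        return result;
    }

    // Hypothetical stand-in for a serialized sketch object: an 8-byte long.
    static byte[] encode(long value) {
        return ByteBuffer.allocate(Long.BYTES).putLong(value).array();
    }

    public static void main(String[] args) {
        byte[][] values = {encode(4L), new byte[0], encode(6L)};
        Long total = EmptyBytesAggregationSketch.<Long>aggregate(
                values, bytes -> ByteBuffer.wrap(bytes).getLong(), Long::sum);
        System.out.println(total); // prints 10
    }
}
```

Because empty arrays are skipped before deserialization, a row with no value costs neither storage for a serialized empty object nor a merge call.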