Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support default BYTES (zero-length byte array) in aggregation function and aggregator #4214

Closed
wants to merge 1 commit into from

Conversation

Jackie-Jiang
Copy link
Contributor

@Jackie-Jiang Jackie-Jiang commented May 16, 2019

Treat zero-length byte array as null, and skip processing it
Add DefaultBytesQueriesTest for test coverage

Motivation:
Without this change, in order to create default columns for serialized Object such as HyperLogLog, TDigest etc., user needs to serialize the default Object (empty HyperLogLog or TDigest) to get the default value and put it into the schema. The default values are quite long for these Objects (HyperLogLog: "00000008000000ac00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000"; QDigest: "3fa999999999999a0000000000000000000000000004b9467fffffffffffffff800000000000000000000000"; TDigest: "000000017ff0000000000000fff0000000000000405900000000000000000000").
With this change, we directly treat empty byte array as null, so user doesn't need to specify such default values.
Besides, this change allows user to put empty byte array to skip a value which can save storage and reduce computation cost. E.g. for a HyperLogLog column, if there is no value inside, user can simply put an empty array instead of the serialized byte array.

…n and aggregator

Treat zero-length byte array as null, and skip processing it
Add DefaultBytesQueriesTest for test coverage
@mayankshriv
Copy link
Contributor

I had a similar PR a while back to filter out zero-length byte array (that represented null values for TDigest in that PR). But based on some offline concerns from @kishoreg that I don't recall right now, I had retracted the PR. Tagging @kishoreg to comment.

@codecov-io
Copy link

Codecov Report

Merging #4214 into master will increase coverage by 0.31%.
The diff coverage is 55.03%.

Impacted file tree graph

@@             Coverage Diff              @@
##             master    #4214      +/-   ##
============================================
+ Coverage     67.11%   67.43%   +0.31%     
  Complexity       20       20              
============================================
  Files          1035     1035              
  Lines         51520    51572      +52     
  Branches       7220     7245      +25     
============================================
+ Hits          34577    34775     +198     
+ Misses        14569    14395     -174     
- Partials       2374     2402      +28
Impacted Files Coverage Δ Complexity Δ
.../function/DistinctCountHLLAggregationFunction.java 35.91% <30.76%> (+5.48%) 0 <0> (ø) ⬇️
...ion/function/PercentileEstAggregationFunction.java 61.72% <36.36%> (+8.48%) 0 <0> (ø) ⬇️
...ta/aggregator/DistinctCountHLLValueAggregator.java 76.47% <50%> (-9.25%) 0 <0> (ø)
...re/data/aggregator/MinMaxRangeValueAggregator.java 88% <57.14%> (-12%) 0 <0> (ø)
...pinot/core/data/aggregator/AvgValueAggregator.java 86.95% <57.14%> (-13.05%) 0 <0> (ø)
...function/PercentileTDigestAggregationFunction.java 66.26% <61.9%> (+2.97%) 0 <0> (ø) ⬇️
.../data/aggregator/PercentileEstValueAggregator.java 87.5% <63.63%> (-12.5%) 0 <0> (ø)
...ation/function/MinMaxRangeAggregationFunction.java 74.44% <63.63%> (+1.18%) 0 <0> (ø) ⬇️
...y/aggregation/function/AvgAggregationFunction.java 80.45% <63.63%> (-0.27%) 0 <0> (ø)
...a/aggregator/PercentileTDigestValueAggregator.java 87.5% <66.66%> (-12.5%) 0 <0> (ø)
... and 39 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3cca15f...c8169e4. Read the comment docs.

@kishoreg
Copy link
Member

Can you point us to a concrete query/usecase where this PR would help? What would be the behavior without this PR.

@Jackie-Jiang
Copy link
Contributor Author

Jackie-Jiang commented May 21, 2019

@kishoreg Without this PR, in order to create default columns for serialized Object such as HyperLogLog, TDigest etc., user needs to serialize the default Object (empty HyperLogLog or TDigest) in order to get the default value and then put it into the schema. The default values are quite long for these Objects (HyperLogLog: "00000008000000ac00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000"; QDigest: "3fa999999999999a0000000000000000000000000004b9467fffffffffffffff800000000000000000000000"; TDigest: "000000017ff0000000000000fff0000000000000405900000000000000000000").
With this PR, we directly treat empty byte array as null, so user doesn't need to specify such default values.

Besides, this PR allows user to put empty byte array to skip a value which can save storage and reduce computation cost. E.g. for a HyperLogLog column, if there is no value inside, user can simply put an empty array instead of the serialized byte array.

@kishoreg
Copy link
Member

This is equivalent of supporting NULL for BYTES column right?

@Jackie-Jiang
Copy link
Contributor Author

@kishoreg Yes, and we treat empty byte array as NULL for BYTES.

@kishoreg
Copy link
Member

ok. It's better to support NULL as first-class instead of checking for NULL every row right?. Generate a NULL bitmap for this column and filter it out in the filter phase.

@kishoreg
Copy link
Member

Let’s create an issue for supporting null and point to this PR.

@Jackie-Jiang
Copy link
Contributor Author

@kishoreg It cannot be done in filter phase because for each document, there is no guarantee that all metrics are null. We can filter them out in data fetching phase though. I think we need to add a new index type or reserve some dictId for null. Will create the issue.

@kishoreg
Copy link
Member

NULL value is per column. Also, I think I now remember the discussion with Mayank. This will be backward incompatible change when we add NULL support.

@Jackie-Jiang
Copy link
Contributor Author

@kishoreg I understand that NULL value is per column, but for example: SELECT SUM(A), SUM(B) FROM table, for the same document, column A might have NULL value while column B does not. We cannot apply an extra filter in such case.
Can you describe why this will be backward incompatible (forward incompatible maybe as we don't have NULL support yet)?

@Jackie-Jiang
Copy link
Contributor Author

Discussed with @kishoreg offline. Will hold on this PR and see how much effort it takes to support real NULL values, and then decide which route to take.

@Jackie-Jiang Jackie-Jiang deleted the default_bytes branch October 2, 2019 21:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants