Support default BYTES (zero-length byte array) in aggregation function and aggregator #4214

Jackie-Jiang · 2019-05-16T22:18:22Z

Treat zero-length byte array as null, and skip processing it
Add DefaultBytesQueriesTest for test coverage

Motivation:
Without this change, in order to create default columns for serialized Object such as HyperLogLog, TDigest etc., user needs to serialize the default Object (empty HyperLogLog or TDigest) to get the default value and put it into the schema. The default values are quite long for these Objects (HyperLogLog: "00000008000000ac00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000"; QDigest: "3fa999999999999a0000000000000000000000000004b9467fffffffffffffff800000000000000000000000"; TDigest: "000000017ff0000000000000fff0000000000000405900000000000000000000").
With this change, we directly treat empty byte array as null, so user doesn't need to specify such default values.
Besides, this change allows user to put empty byte array to skip a value which can save storage and reduce computation cost. E.g. for a HyperLogLog column, if there is no value inside, user can simply put an empty array instead of the serialized byte array.

…n and aggregator Treat zero-length byte array as null, and skip processing it Add DefaultBytesQueriesTest for test coverage

mayankshriv · 2019-05-16T22:37:17Z

I had a similar PR a while back to filter out zero-length byte array (that represented null values for TDigest in that PR). But based on some offline concerns from @kishoreg that I don't recall right now, I had retracted the PR. Tagging @kishoreg to comment.

codecov-io · 2019-05-16T23:05:42Z

Codecov Report

Merging #4214 into master will increase coverage by 0.31%.
The diff coverage is 55.03%.

@@             Coverage Diff              @@
##             master    #4214      +/-   ##
============================================
+ Coverage     67.11%   67.43%   +0.31%     
  Complexity       20       20              
============================================
  Files          1035     1035              
  Lines         51520    51572      +52     
  Branches       7220     7245      +25     
============================================
+ Hits          34577    34775     +198     
+ Misses        14569    14395     -174     
- Partials       2374     2402      +28

Impacted Files	Coverage Δ	Complexity Δ
.../function/DistinctCountHLLAggregationFunction.java	`35.91% <30.76%> (+5.48%)`	`0 <0> (ø)`	⬇️
...ion/function/PercentileEstAggregationFunction.java	`61.72% <36.36%> (+8.48%)`	`0 <0> (ø)`	⬇️
...ta/aggregator/DistinctCountHLLValueAggregator.java	`76.47% <50%> (-9.25%)`	`0 <0> (ø)`
...re/data/aggregator/MinMaxRangeValueAggregator.java	`88% <57.14%> (-12%)`	`0 <0> (ø)`
...pinot/core/data/aggregator/AvgValueAggregator.java	`86.95% <57.14%> (-13.05%)`	`0 <0> (ø)`
...function/PercentileTDigestAggregationFunction.java	`66.26% <61.9%> (+2.97%)`	`0 <0> (ø)`	⬇️
.../data/aggregator/PercentileEstValueAggregator.java	`87.5% <63.63%> (-12.5%)`	`0 <0> (ø)`
...ation/function/MinMaxRangeAggregationFunction.java	`74.44% <63.63%> (+1.18%)`	`0 <0> (ø)`	⬇️
...y/aggregation/function/AvgAggregationFunction.java	`80.45% <63.63%> (-0.27%)`	`0 <0> (ø)`
...a/aggregator/PercentileTDigestValueAggregator.java	`87.5% <66.66%> (-12.5%)`	`0 <0> (ø)`
... and 39 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3cca15f...c8169e4. Read the comment docs.

kishoreg · 2019-05-21T00:52:28Z

Can you point us to a concrete query/usecase where this PR would help? What would be the behavior without this PR.

Jackie-Jiang · 2019-05-21T01:03:41Z

@kishoreg Without this PR, in order to create default columns for serialized Object such as HyperLogLog, TDigest etc., user needs to serialize the default Object (empty HyperLogLog or TDigest) in order to get the default value and then put it into the schema. The default values are quite long for these Objects (HyperLogLog: "00000008000000ac00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000"; QDigest: "3fa999999999999a0000000000000000000000000004b9467fffffffffffffff800000000000000000000000"; TDigest: "000000017ff0000000000000fff0000000000000405900000000000000000000").
With this PR, we directly treat empty byte array as null, so user doesn't need to specify such default values.

Besides, this PR allows user to put empty byte array to skip a value which can save storage and reduce computation cost. E.g. for a HyperLogLog column, if there is no value inside, user can simply put an empty array instead of the serialized byte array.

kishoreg · 2019-05-21T03:56:52Z

This is equivalent of supporting NULL for BYTES column right?

Jackie-Jiang · 2019-05-21T18:27:56Z

@kishoreg Yes, and we treat empty byte array as NULL for BYTES.

kishoreg · 2019-05-21T18:38:55Z

ok. It's better to support NULL as first-class instead of checking for NULL every row right?. Generate a NULL bitmap for this column and filter it out in the filter phase.

kishoreg · 2019-05-21T22:12:25Z

Let’s create an issue for supporting null and point to this PR.

Jackie-Jiang · 2019-05-21T22:13:34Z

@kishoreg It cannot be done in filter phase because for each document, there is no guarantee that all metrics are null. We can filter them out in data fetching phase though. I think we need to add a new index type or reserve some dictId for null. Will create the issue.

kishoreg · 2019-05-21T22:31:45Z

NULL value is per column. Also, I think I now remember the discussion with Mayank. This will be backward incompatible change when we add NULL support.

Jackie-Jiang · 2019-05-21T23:29:08Z

@kishoreg I understand that NULL value is per column, but for example: SELECT SUM(A), SUM(B) FROM table, for the same document, column A might have NULL value while column B does not. We cannot apply an extra filter in such case.
Can you describe why this will be backward incompatible (forward incompatible maybe as we don't have NULL support yet)?

Jackie-Jiang · 2019-05-22T00:42:21Z

Discussed with @kishoreg offline. Will hold on this PR and see how much effort it takes to support real NULL values, and then decide which route to take.

Support default BYTES (zero-length byte array) in aggregation functio…

c8169e4

…n and aggregator Treat zero-length byte array as null, and skip processing it Add DefaultBytesQueriesTest for test coverage

kishoreg approved these changes May 21, 2019

View reviewed changes

Jackie-Jiang mentioned this pull request May 21, 2019

NULL value support for all data types #4230

Closed

Jackie-Jiang closed this Oct 2, 2019

Jackie-Jiang deleted the default_bytes branch October 2, 2019 21:20

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support default BYTES (zero-length byte array) in aggregation function and aggregator #4214

Support default BYTES (zero-length byte array) in aggregation function and aggregator #4214

Jackie-Jiang commented May 16, 2019 •

edited

Loading

mayankshriv commented May 16, 2019

codecov-io commented May 16, 2019

kishoreg commented May 21, 2019

Jackie-Jiang commented May 21, 2019 •

edited

Loading

kishoreg commented May 21, 2019

Jackie-Jiang commented May 21, 2019

kishoreg commented May 21, 2019

kishoreg commented May 21, 2019

Jackie-Jiang commented May 21, 2019

kishoreg commented May 21, 2019

Jackie-Jiang commented May 21, 2019

Jackie-Jiang commented May 22, 2019

Support default BYTES (zero-length byte array) in aggregation function and aggregator #4214

Support default BYTES (zero-length byte array) in aggregation function and aggregator #4214

Conversation

Jackie-Jiang commented May 16, 2019 • edited Loading

mayankshriv commented May 16, 2019

codecov-io commented May 16, 2019

Codecov Report

kishoreg commented May 21, 2019

Jackie-Jiang commented May 21, 2019 • edited Loading

kishoreg commented May 21, 2019

Jackie-Jiang commented May 21, 2019

kishoreg commented May 21, 2019

kishoreg commented May 21, 2019

Jackie-Jiang commented May 21, 2019

kishoreg commented May 21, 2019

Jackie-Jiang commented May 21, 2019

Jackie-Jiang commented May 22, 2019

Jackie-Jiang commented May 16, 2019 •

edited

Loading

Jackie-Jiang commented May 21, 2019 •

edited

Loading