Enhance DistinctCountThetaSketchAggregationFunction #6004
What's the motivation to add a new aggregation function as opposed to enhancing the existing one? Is there a backward compatibility issue? If not, having multiple variations of the same aggregation function adds confusion on the user side.
adc1d7c to 06155e4
We discussed offline and ran performance benchmarks to ensure there is no regression. Based on the testing and benchmarking results, we decided to move the code under the existing function instead of creating a new one.
Mostly minor comments, except on `raw` version being feature compatible with `non-raw` version. Please address comments before merging.
// Directly return the size (0) for empty list
if (size == 0) {
  return new byte[Integer.BYTES];
Nit: Can a static final be used here (to avoid creating new object)?
It is safer not to reuse this, as we have no control over whether the caller modifies it.
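To illustrate the concern in this reply, here is a minimal sketch. The class and method names (`EmptyListBytes`, `sharedEmpty`, `freshEmpty`) are hypothetical and are not part of this PR; the only detail taken from the diff is `new byte[Integer.BYTES]`. It shows that a shared `static final` array can be changed by any caller, while a fresh array per call cannot leak changes.

```java
class EmptyListBytes {
  // Hypothetical shared constant, as the review nit suggests.
  // Arrays are mutable, so "final" only protects the reference.
  private static final byte[] SHARED_EMPTY = new byte[Integer.BYTES];

  // Returns the same array instance on every call.
  static byte[] sharedEmpty() {
    return SHARED_EMPTY;
  }

  // Returns a new zero-filled array on every call, as the PR does.
  static byte[] freshEmpty() {
    return new byte[Integer.BYTES];
  }
}
```

If a caller writes into the result of `sharedEmpty()`, every later caller sees the modified bytes. With `freshEmpty()` the cost is one small allocation, and the encoded size stays 0 for everyone.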
  return new byte[Integer.BYTES];
}

// No need to close these 2 streams
Please also add why in the comment.
Added
 * The {@code DistinctCountRawThetaSketchAggregationFunction} collects the values for a given expression (can be
 * single-valued or multi-valued) into a {@link Sketch} object, and returns the sketch as a base64 encoded string. It
 * treats BYTES expression as serialized sketches.
 * <p>The function takes an optional second argument as the parameters for the function. Currently there is only 1
What about predicates, and post-aggregation expression? Are those arguments not supported?
}

@Override
public void aggregate(int length, AggregationResultHolder aggregationResultHolder,
    Map<ExpressionContext, BlockValSet> blockValSetMap) {
  _thetaSketchAggregationFunction.aggregate(length, aggregationResultHolder, blockValSetMap);
  BlockValSet blockValSet = blockValSetMap.get(_expression);
Why is the `raw` version not able to re-use the `non-raw` version of the code as in the previous implementation?
private final SetOperationBuilder _setOperationBuilder;
@SuppressWarnings({"rawtypes", "unchecked"})
public class DistinctCountThetaSketchAggregationFunction implements AggregationFunction<List<Sketch>, Long> {
  private static final String SET_UNION = "SET_UNION";
Use enum for set operations?
Using string has these 2 benefits over enum:
- Save the extra parsing of the enum
- Simplify the handling of invalid operations
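The trade-off described in this reply can be sketched roughly as follows. This is an illustration under assumptions, not the PR's code: only the `SET_UNION` constant appears in the diff, while `SET_INTERSECT`, `SET_DIFF`, the class name, and the use of plain integer sets in place of sketches are all hypothetical.

```java
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

class SetOperationDispatch {
  static final String SET_UNION = "SET_UNION";         // from the diff
  static final String SET_INTERSECT = "SET_INTERSECT"; // assumed name
  static final String SET_DIFF = "SET_DIFF";           // assumed name

  // Dispatches directly on the function name. Unknown names fall through to
  // the default branch, so no separate enum parsing or lookup is needed.
  static Set<Integer> apply(String functionName, Set<Integer> left, Set<Integer> right) {
    Set<Integer> result = new TreeSet<>(left);
    switch (functionName.toUpperCase(Locale.ROOT)) {
      case SET_UNION:
        result.addAll(right);
        return result;
      case SET_INTERSECT:
        result.retainAll(right);
        return result;
      case SET_DIFF:
        result.removeAll(right);
        return result;
      default:
        throw new IllegalArgumentException("Invalid set operation: " + functionName);
    }
  }
}
```

With an enum, an invalid name would first fail in `valueOf` and need its own error handling before the dispatch; here both the dispatch and the error path live in one `switch`.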
  _filterEvaluators = Collections.emptyList();
  _postAggregationExpression = ExpressionContext.forIdentifier(DEFAULT_SKETCH_IDENTIFIER);
} else {
  // Union with post-aggregation
Just to confirm, do we have a test for this?
Yes, we have a unit test and an integration test for this.
// Main expression is always index 0
if (valueTypes[0] != DataType.BYTES) {
  List<UpdateSketch> updateSketches = getUpdateSketches(aggregationResultHolder);
  if (singleValues[0]) {
Are these if/else blocks re-factorable?
I tried and don't see a better way to organize the code. We need to have slightly different logic for each primitive type
… the new one
Make `DistinctCountRawThetaSketchAggregationFunction` have the same usage as `DistinctCountThetaSketchAggregationFunction` and return a base64 encoded sketch. Also fix the issue in `BaseBrokerRequestHandler.updateColumnNames()`
06155e4 to 297874d
@mayankshriv Changed the `DistinctCountRawThetaSketchAggregationFunction` […]
Codecov Report
@@ Coverage Diff @@
## master #6004 +/- ##
==========================================
- Coverage 66.44% 64.05% -2.40%
==========================================
Files 1075 1211 +136
Lines 54773 56796 +2023
Branches 8168 8338 +170
==========================================
- Hits 36396 36380 -16
- Misses 15700 17774 +2074
+ Partials 2677 2642 -35
Description
Enhance `DistinctCountThetaSketchAggregationFunction`, and add the following supports:
- […] `A = 1 AND (B = 2 OR C = 3)`)
- […] `thetaSketch(col)`)
- `$0` as the default sketch (sketch without filter)

Also change the `DistinctCountRawThetaSketchAggregationFunction` to have the same usage as the `DistinctCountThetaSketchAggregationFunction` and return a base64 encoded sketch.

Fix the issue where theta-sketch parameters cannot be correctly handled in `BaseBrokerRequestHandler.updateColumnNames()`
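
As a rough illustration of what "returns base64 encoded sketch" means for a client, the following sketch encodes and decodes a byte array with the JDK's `java.util.Base64`. The class name `RawSketchEncoding` and the sample bytes are hypothetical stand-ins: a real result would hold the serialized sketch bytes, and this PR does not state which Base64 variant or serialization call is used.

```java
import java.util.Base64;

class RawSketchEncoding {
  // Encodes serialized sketch bytes as a standard (RFC 4648) Base64 string.
  static String encode(byte[] serializedSketch) {
    return Base64.getEncoder().encodeToString(serializedSketch);
  }

  // Decodes the string back into the original bytes, for example on the
  // client side before deserializing the sketch.
  static byte[] decode(String encoded) {
    return Base64.getDecoder().decode(encoded);
  }
}
```

For example, the four bytes `{1, 2, 3, 4}` encode to `AQIDBA==`, and decoding that string yields the same four bytes again.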
Release Notes
This aggregation function is still in beta. This PR changes the format of the data sent from server to broker, so it works only when both broker and server are upgraded to the new version.