Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enhance DistinctCountThetaSketchAggregationFunction #6004

Merged
merged 2 commits into from
Sep 29, 2020

Conversation

Jackie-Jiang
Copy link
Contributor

@Jackie-Jiang Jackie-Jiang commented Sep 11, 2020

Description

Enhance DistinctCountThetaSketchAggregationFunction, and add the following supports:

  • Support aggregating on raw values of all data types (both SV and MV)
  • Support nested filter (E.g. A = 1 AND (B = 2 OR C = 3))
  • Support filter on all data types (both SV and MV)
  • Support simple union without filters (E.g. thetaSketch(col))
  • Support $0 as the default sketch (sketch without filter)

Also change the DistinctCountRawThetaSketchAggregationFunction to have the same usage as the DistinctCountThetaSketchAggregationFunction and returns base64 encoded sketch.
Fix the issue where theta-sketch parameters cannot be correctly handled in BaseBrokerRequestHandler.updateColumnNames()

Release Notes

This aggregation function is still in beta version. This PR involves change on the format of data sent from server to broker, so it works only when both broker and server are upgraded to the new version.

Copy link
Contributor

@mayankshriv mayankshriv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the motivation to add a new aggregation function as opposed to enhancing the existing one? Is there a backward compatibility issue? If not, it adds more confusion on the user side to have multiple variations of the same aggregation functions.

@Jackie-Jiang Jackie-Jiang changed the title Add ThetaSketchAggregationFunction Enhance DistinctCountThetaSketchAggregationFunction Sep 29, 2020
@Jackie-Jiang Jackie-Jiang added the backward-incompat Referenced by PRs that introduce or fix backward compat issues label Sep 29, 2020
@mayankshriv
Copy link
Contributor

What's the motivation to add a new aggregation function as opposed to enhancing the existing one? Is there a backward compatibility issue? If not, it adds more confusion on the user side to have multiple variations of the same aggregation functions.

We discussed offline and did some performance benchmark to ensure there is no regression. Based on the testing/benchmarking results, we decided to move the code under existing function instead of creating a new one.

Copy link
Contributor

@mayankshriv mayankshriv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly minor comments, except on raw version being feature compatible with non-raw version. Please address comments before merging.


// Directly return the size (0) for empty list
if (size == 0) {
return new byte[Integer.BYTES];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Can a static final be used here (to avoid creating new object)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is safer to not reuse this as we have no control on not modifying it

return new byte[Integer.BYTES];
}

// No need to close these 2 streams
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please also add why in the comment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added

* The {@code DistinctCountRawThetaSketchAggregationFunction} collects the values for a given expression (can be
* single-valued or multi-valued) into a {@link Sketch} object, and returns the sketch as a base64 encoded string. It
* treats BYTES expression as serialized sketches.
* <p>The function takes an optional second argument as the parameters for the function. Currently there is only 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about predicates, and post-aggregation expression? Are those arguments not supported?

}

@Override
public void aggregate(int length, AggregationResultHolder aggregationResultHolder,
Map<ExpressionContext, BlockValSet> blockValSetMap) {
_thetaSketchAggregationFunction.aggregate(length, aggregationResultHolder, blockValSetMap);
BlockValSet blockValSet = blockValSetMap.get(_expression);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is the raw version not able to re-use the non-raw version of the code as in the previous implementation?

private final SetOperationBuilder _setOperationBuilder;
@SuppressWarnings({"rawtypes", "unchecked"})
public class DistinctCountThetaSketchAggregationFunction implements AggregationFunction<List<Sketch>, Long> {
private static final String SET_UNION = "SET_UNION";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use enum for set operations?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using string has these 2 benefits over enum:

  • Save the extra parsing of the enum
  • Simplify the handling of invalid operations

_filterEvaluators = Collections.emptyList();
_postAggregationExpression = ExpressionContext.forIdentifier(DEFAULT_SKETCH_IDENTIFIER);
} else {
// Union with post-aggregation
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to ensure, we have test for this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we have unit test and integration test for this

// Main expression is always index 0
if (valueTypes[0] != DataType.BYTES) {
List<UpdateSketch> updateSketches = getUpdateSketches(aggregationResultHolder);
if (singleValues[0]) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these if/else blocks re-factorable?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried and don't see a better way to organize the code. We need to have slightly different logic for each primitive type

… the new one

Make `DistinctCountRawThetaSketchAggregationFunction` the same usage as `DistinctCountThetaSketchAggregationFunction` and returns base64 encoded sketch.
Also fix the issue in BaseBrokerRequestHandler.updateColumnNames()
@Jackie-Jiang
Copy link
Contributor Author

@mayankshriv Changed the ``DistinctCountRawThetaSketchAggregationFunctionto have the same usage asDistinctCountThetaSketchAggregationFunction`

@codecov-commenter
Copy link

Codecov Report

Merging #6004 into master will decrease coverage by 2.39%.
The diff coverage is 37.52%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #6004      +/-   ##
==========================================
- Coverage   66.44%   64.05%   -2.40%     
==========================================
  Files        1075     1211     +136     
  Lines       54773    56796    +2023     
  Branches     8168     8338     +170     
==========================================
- Hits        36396    36380      -16     
- Misses      15700    17774    +2074     
+ Partials     2677     2642      -35     
Flag Coverage Δ
#unittests 64.05% <37.52%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...e/pinot/broker/api/resources/PinotBrokerDebug.java 0.00% <0.00%> (-79.32%) ⬇️
...ot/broker/broker/AllowAllAccessControlFactory.java 71.42% <ø> (-28.58%) ⬇️
.../helix/BrokerUserDefinedMessageHandlerFactory.java 33.96% <0.00%> (-32.71%) ⬇️
...ava/org/apache/pinot/client/AbstractResultSet.java 53.33% <0.00%> (-3.81%) ⬇️
.../main/java/org/apache/pinot/client/Connection.java 35.55% <0.00%> (-13.29%) ⬇️
...inot/client/JsonAsyncHttpPinotClientTransport.java 10.90% <0.00%> (-51.10%) ⬇️
.../org/apache/pinot/client/ResultTableResultSet.java 0.00% <0.00%> (-34.29%) ⬇️
...not/common/assignment/InstancePartitionsUtils.java 73.80% <ø> (+0.63%) ⬆️
.../apache/pinot/common/exception/QueryException.java 90.27% <ø> (+5.55%) ⬆️
...pinot/common/function/AggregationFunctionType.java 96.49% <ø> (-3.51%) ⬇️
... and 1015 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2379791...297874d. Read the comment docs.

@Jackie-Jiang Jackie-Jiang merged commit e892cb2 into apache:master Sep 29, 2020
@Jackie-Jiang Jackie-Jiang deleted the theta_sketch branch September 29, 2020 20:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backward-incompat Referenced by PRs that introduce or fix backward compat issues
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants