Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Initial implementation for support Theta Sketches #5316

Merged
merged 1 commit into from May 12, 2020

Conversation

mayankshriv
Copy link
Contributor

@mayankshriv mayankshriv commented Apr 29, 2020

  1. Added an initial implementation for theta-sketch based distinct count
    aggregation function, which can be invoked as follows:
    select distinctCountThetaSketch(thetaSketchColumn, thetaSketchParams, p1, p2..pn, postAggrExpression)
    - Required: thetaSketchColumn is the column of type BYTES that stores serialized theta sketches.
    - Required: thetaSketchParams in the form "param1=v1;param2=v2..", pass empty literal '' if no params.
    - Optional: p1, p2 etc are filter predicates that make up the postAggregationExpression
    - Required: postAggrExpression is the expression built with AND/OR on predicates p1..pn.
    Example:
    select distinctCountThetaSketch(tsCol, "nominalEntries=1024", "dim1='foo', dim2='bar', "dim1 = 'foo and dim2='bar') from table where dim1 = 'foo' or dim2 = 'bar'

  2. The aggregation function works as follows:

    • The postAggrExpression is basically an expression that AND/OR's some predicates,
      these predicates are expected to be specified as part of aggregation function.
    • The aggregation goes over all values in the blockValSet and applies each predicate
      on the values. If the predicate is satisfied, the theta-sketch is aggregated and stored
      as value in a map, where the key is the predicate.
    • Once all theta-sketches corresponding to all predicates are evaluated, across all segments
      and servers, the final result is computed by evaluating the postAggrFunction, by performing
      set operations on predicate theta-sketches (AND = intersection, OR = union).
  3. The Theta-Sketch library being used is from org.apache.datasketches.

  4. Now that more than one aggregation functions take multiple arguments, generalized handling of
    multiple args, and removed special casing of Distinct.

  5. Refactored methods from PinotQuery2BrokerRequestConverter to reusable utility class ParserUtils.

  6. Added unit tests for new code.

TODO:

  1. Performance tuning to ensure the aggregation function works at par.
  2. Support complex predicates p1, p2, p3, current PR only supports predicates of form like LHS = RHS, LHS IN (...), etc.
  3. Evaluate theta-sketch creation params, and pick default or come up with ways
    to configure (by user).
  4. MultiValue aggregation support.
  5. Auto derive predicates that make up postAggrExpression, so they don't need to be specified
    in the aggregation function.
  6. Auto sharding of Sketches with high cardinality into smaller ones to improve accuracy.
  7. Pql2Compiler.compileToExpressionTree does not seem to handle parenthesized expressions,
    e.g., ((col1 = 1 and col2 = 2) or col3 = 3).

@mayankshriv mayankshriv changed the title Initial implementation for support Theta Sketchesi (WIP). Initial implementation for support Theta Sketches (WIP). Apr 29, 2020
@codecov-io
Copy link

codecov-io commented Apr 29, 2020

Codecov Report

Merging #5316 into master will decrease coverage by 9.29%.
The diff coverage is 67.28%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #5316      +/-   ##
==========================================
- Coverage   66.44%   57.15%   -9.30%     
==========================================
  Files        1075     1080       +5     
  Lines       54773    55056     +283     
  Branches     8168     8229      +61     
==========================================
- Hits        36396    31465    -4931     
- Misses      15700    21125    +5425     
+ Partials     2677     2466     -211     
Impacted Files Coverage Δ
...data/manager/realtime/DefaultSegmentCommitter.java 0.00% <ø> (-80.00%) ⬇️
...e/data/manager/realtime/SplitSegmentCommitter.java 0.00% <ø> (-63.64%) ⬇️
...ation/function/AggregationFunctionVisitorBase.java 0.00% <0.00%> (ø)
...startree/executor/StarTreeAggregationExecutor.java 0.00% <0.00%> (-100.00%) ⬇️
...ore/startree/executor/StarTreeGroupByExecutor.java 0.00% <0.00%> (-87.50%) ⬇️
.../java/org/apache/pinot/spi/data/TimeFieldSpec.java 90.69% <ø> (+1.80%) ⬆️
...inot/spi/ingestion/batch/IngestionJobLauncher.java 8.51% <27.27%> (+8.51%) ⬆️
...manager/realtime/LLRealtimeSegmentDataManager.java 47.92% <38.46%> (-21.29%) ⬇️
...n/DistinctCountThetaSketchAggregationFunction.java 62.85% <62.85%> (ø)
...org/apache/pinot/core/common/ObjectSerDeUtils.java 88.29% <80.00%> (+0.02%) ⬆️
... and 354 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d54afba...1587d88. Read the comment docs.

@bkuang88
Copy link
Contributor

Curious in general - should we support modifying some of these theta sketch parameters per union/intersection/diff or possibly a table-wide parameter to tune.

I might want to change the value of nominal entries per OR/AND in the same query right?

@kishoreg
Copy link
Member

can we simplify the way we define the expression?
for e.g.
select distinctCountThetaSketch(tsCol, "dim1='foo', dim2='bar', "dim1 = 'foo and dim2='bar')
to
select distinctCountThetaSketch(tsCol, "dim1='foo'", "dim2='bar'", "1 AND 2")

@mayankshriv
Copy link
Contributor Author

Curious in general - should we support modifying some of these theta sketch parameters per union/intersection/diff or possibly a table-wide parameter to tune.

I might want to change the value of nominal entries per OR/AND in the same query right?

I am planning to add a ThetaSketchParams argument to the aggregation function (second argument, which can be empty), that will be of form "param1=val1;param2=val2...". This would be a backward compatible way to add more params in future.

@mayankshriv
Copy link
Contributor Author

can we simplify the way we define the expression?
for e.g.
select distinctCountThetaSketch(tsCol, "dim1='foo', dim2='bar', "dim1 = 'foo and dim2='bar')
to
select distinctCountThetaSketch(tsCol, "dim1='foo'", "dim2='bar'", "1 AND 2")

Yes agree, this is a good idea, specially in case when predicates are long strings. Will address that in following PRs.

@mayankshriv mayankshriv force-pushed the theta-sketch branch 5 times, most recently from 212b2a8 to 4f87384 Compare May 7, 2020 16:21
@mcvsubbu
Copy link
Contributor

mcvsubbu commented May 7, 2020

can we simplify the way we define the expression?
for e.g.
select distinctCountThetaSketch(tsCol, "dim1='foo', dim2='bar', "dim1 = 'foo and dim2='bar')
to
select distinctCountThetaSketch(tsCol, "dim1='foo'", "dim2='bar'", "1 AND 2")

Yes agree, this is a good idea, specially in case when predicates are long strings. Will address that in following PRs.

Is it possible to do select distinctCountThetaSketch(tsCol, ("dim1=foo" AND "dim2=bar")) and parse it to get what we want? Why is that a problem?

@mayankshriv
Copy link
Contributor Author

can we simplify the way we define the expression?
for e.g.
select distinctCountThetaSketch(tsCol, "dim1='foo', dim2='bar', "dim1 = 'foo and dim2='bar')
to
select distinctCountThetaSketch(tsCol, "dim1='foo'", "dim2='bar'", "1 AND 2")

Yes agree, this is a good idea, specially in case when predicates are long strings. Will address that in following PRs.

Is it possible to do select distinctCountThetaSketch(tsCol, ("dim1=foo" AND "dim2=bar")) and parse it to get what we want? Why is that a problem?

One issue with auto deriving is that we can only derive the lowest level predicates, and not a combination if that was already applied. However, I am unsure we can get to that, so I am already trying to see if I can get rid of p1/p2/p3... in this PR (will update).

@mayankshriv
Copy link
Contributor Author

can we simplify the way we define the expression?
for e.g.
select distinctCountThetaSketch(tsCol, "dim1='foo', dim2='bar', "dim1 = 'foo and dim2='bar')
to
select distinctCountThetaSketch(tsCol, "dim1='foo'", "dim2='bar'", "1 AND 2")

Yes agree, this is a good idea, specially in case when predicates are long strings. Will address that in following PRs.

Is it possible to do select distinctCountThetaSketch(tsCol, ("dim1=foo" AND "dim2=bar")) and parse it to get what we want? Why is that a problem?

One issue with auto deriving is that we can only derive the lowest level predicates, and not a combination if that was already applied. However, I am unsure we can get to that, so I am already trying to see if I can get rid of p1/p2/p3... in this PR (will update).

An advantage of explicitly specifying predicates is that complex predicates (with and/or) can be specified, and can be applied at once to the filtered docs, which improves performance. For example, p1 can be of form (col1 = 1 and col2 = 'x'), as opposed to two separate predicates col1 = 1 and col2 = 'x'. In future if we support accepting the first form, it would probably be faster to apply that at once, as opposed to applying individual once, and then performing intersection of them using theta-sketches.

BTW, I have updated the PR to make p1, p2, p3... optional, i.e. derive from postAggregationExpression.

@mcvsubbu
Copy link
Contributor

mcvsubbu commented May 8, 2020

Is it possible to process the (thetasketch) aggregation if a filter tree is provided? In that case, the query can simply be something like
select distinctThetasketch(column) where ...
just exploring possibilities for not having the user specify the arguments multiple times.

@mayankshriv
Copy link
Contributor Author

Is it possible to process the (thetasketch) aggregation if a filter tree is provided? In that case, the query can simply be something like
select distinctThetasketch(column) where ...
just exploring possibilities for not having the user specify the arguments multiple times.

The filter tree in where clause can be completely independent of postAggregationExpression, so this is not really possible.

Copy link
Contributor

@Jackie-Jiang Jackie-Jiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM otherwise. You can merge first, and I can rebase on yours for the aggregation cleanup.
For the parameters, I vote for using $1, $2 in the last expression to avoid extra compilation and user mistakenly put wrong predicate. With that you can also easily verify whether the number of optional parameters are correct.

@@ -33,6 +33,8 @@
PERCENTILE("percentile"),
PERCENTILEEST("percentileEst"),
PERCENTILETDIGEST("percentileTDigest"),
DISTINCTCOUNTTHETASKETCH("DistinctCountThetaSketch"),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's not change the naming convention (first letter lower case).
Also I would recommend moving it next to FASTHLL

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

* Class for holding Parser specific utility functions.
*/
public class ParserUtils {
static Map<FilterKind, FilterOperator> filterOperatorMapping;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Naming convention)

Suggested change
static Map<FilterKind, FilterOperator> filterOperatorMapping;
private static final Map<FilterKind, FilterOperator> FILTER_OPERATOR_MAP;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

}

// Private constructor to disable instantiation.
private ParserUtils() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(nit, your choice) I usually put this next to the class definition (line 35) so that it is very clear this is a pure util class.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

return filterOperatorMapping.get(filterKind);
}

public static String getFilterColumn(Expression expression) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

javadoc?
How about expression with multiple operands?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated with javadoc, and specify that it only supports single LHS column.

* @return True if the enum is of Range type, false otherwise.
*/
public boolean isRange() {
int ordinal = this.ordinal();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why using ordinal?

Suggested change
int ordinal = this.ordinal();
return this == GREATER_THAN || this == GREATER_THEN_OR_EQUAL || this == LESS_THEN || this == LESS_THAN_OR_EQUAL || this == BETWEEN;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

* @param dataType Data type for RHS of the predicate
* @return Predicate Evaluator
*/
public PredicateEvaluator getPredicateEvaluator(FieldSpec.DataType dataType) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Recommend having all types instead of grouping INT and LONG, FLOAT and DOUBLE to prevent unnecessary type conversion.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Theta-Sketch api's don't support int/float.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This part is for predicate evaluation right? Theta-sketch should always be BYTES type

if (this == o) {
return true;
}
if (o == null || getClass() != o.getClass()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (o == null || getClass() != o.getClass()) {
if (o instanceof PredicateEvaluator) {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Auto-generated code, but will change.

}

@Override
public boolean equals(Object o) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For equals() and hashCode(), if _predicateEvaluator is lazily set, you should not include it into the comparison or hash code calculation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, the lazily setting was a change made later on. Good catch.

throws SqlParseException {

// Predicate Strings are optional. When not specified, they are derived from postAggregationExpression
boolean predicatesSpecified = arguments.size() > 3;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cache arguments.size()

* <ul>
* <li> Required: First expression is interpreted as theta sketch column to aggregate on. </li>
* <li> Required: Second argument is the thetaSketchParams. </li>
* <li> Optional: Second to penultimate are predicates with LHS and RHS. </li>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Third?
Also, is this optional?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, I had updated the PR to make individual predicates optional.

@Jackie-Jiang
Copy link
Contributor

For the last parameter, should we use AND and OR or use the set operation like UNION, INTERSECTION, DIFFERENCE?

@mayankshriv
Copy link
Contributor Author

For the last parameter, should we use AND and OR or use the set operation like UNION, INTERSECTION, DIFFERENCE?

Yeah, I had thought about that, decided to use AND/OR in this PR purely because it can directly be compiled into an expression.

@mayankshriv mayankshriv force-pushed the theta-sketch branch 2 times, most recently from e85b31c to ec51ad7 Compare May 12, 2020 00:17
1. Added an initial implementation for theta-sketch based distinct count
   aggregation function, which can be invoked as follows:
   `select distinctCountThetaSketch(thetaSketchColumn, thetaSketchParams, p1, p2..pn, postAggrExpression)`
    - Required: thetaSketchColumn is the column of type BYTES that stores serialized theta sketches.
    - Required: thetaSketchParams in the form "param1=v1;param2=v2..", pass empty literal '' if no params.
    - Optional: p1, p2 etc are filter predicates that make up the postAggregationExpression
    - Required: postAggrExpression is the expression built with AND/OR on predicates p1..pn.
    Example:
    `select distinctCountThetaSketch(tsCol, "nominalEntries=1024", "dim1='foo', dim2='bar', "dim1 = 'foo and dim2='bar') from table where dim1 = 'foo' or dim2 = 'bar'`

2. The aggregation function works as follows:
   - The postAggrExpression is basically an expression that AND/OR's some predicates,
     these predicates are expected to be specified as part of aggregation function.
   - The aggregation goes over all values in the blockValSet and applies each predicate
     on the values. If the predicate is satisfied, the theta-sketch is aggregated and stored
     as value in a map, where the key is the predicate.
   - Once all theta-sketches corresponding to all predicates are evaluated, across all segments
     and servers, the final result is computed by evaluating the postAggrFunction, by performing
     set operations on predicate theta-sketches (AND = intersection, OR = union).

3. The Theta-Sketch library being used is from org.apache.datasketches.

4. Now that more than one aggregation functions take multiple arguments, generalized handling of
   multiple args, and removed special casing of Distinct.

5. Refactored methods from PinotQuery2BrokerRequestConverter to reusable utility class ParserUtils.

6. Added unit tests for new code.

TODO:
1. Performance tuning to ensure the aggregation function works at par.
2. Support complex predicates p1, p2, p3, current PR only supports predicates of form like `LHS = RHS`, `LHS IN (...)`, etc.
2. Evaluate theta-sketch creation params, and pick default or come up with ways
   to configure (by user).
3. MultiValue aggregation support.
4. Auto derive predicates that make up postAggrExpression, so they don't need to be specified
   in the aggregation function.
5. Auto sharding of Sketches with high cardinality into smaller ones to improve accuracy.
6. Pql2Compiler.compileToExpressionTree does not seem to handle parenthesized expressions,
   e.g., ((col1 = 1 and col2 = 2) or col3 = 3).
@mayankshriv mayankshriv merged commit 1b28e5f into apache:master May 12, 2020
@mayankshriv mayankshriv deleted the theta-sketch branch May 12, 2020 01:56
mayankshriv added a commit to mayankshriv/pinot that referenced this pull request May 12, 2020
1. Add predicate evaluators for all types instead of LONG/DOUBLE/STRING only.
2. Update DistinctCountThetaSketchTest to test more data types.
3. Add hashCode() and equals() for Predicate class.
mayankshriv added a commit that referenced this pull request May 12, 2020
1. Add predicate evaluators for all types instead of LONG/DOUBLE/STRING only.
2. Update DistinctCountThetaSketchTest to test more data types.
3. Add hashCode() and equals() for Predicate class.
@Jackie-Jiang Jackie-Jiang changed the title Initial implementation for support Theta Sketches (WIP). Initial implementation for support Theta Sketches Sep 21, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants