Initial implementation to support Theta Sketches #5316

mayankshriv · 2020-04-29T05:40:00Z

Added an initial implementation for theta-sketch based distinct count
aggregation function, which can be invoked as follows:
select distinctCountThetaSketch(thetaSketchColumn, thetaSketchParams, p1, p2..pn, postAggrExpression)
- Required: thetaSketchColumn is the column of type BYTES that stores serialized theta sketches.
- Required: thetaSketchParams in the form "param1=v1;param2=v2..", pass empty literal '' if no params.
- Optional: p1, p2 etc are filter predicates that make up the postAggregationExpression
- Required: postAggrExpression is the expression built with AND/OR on predicates p1..pn.
Example:
select distinctCountThetaSketch(tsCol, "nominalEntries=1024", "dim1='foo'", "dim2='bar'", "dim1='foo' and dim2='bar'") from table where dim1 = 'foo' or dim2 = 'bar'
The aggregation function works as follows:
- The postAggrExpression is basically an expression that ANDs/ORs some predicates;
  these predicates are expected to be specified as part of the aggregation function.
- The aggregation goes over all values in the blockValSet and applies each predicate
  on the values. If the predicate is satisfied, the theta-sketch is aggregated and stored
  as value in a map, where the key is the predicate.
- Once the theta-sketches corresponding to all predicates are evaluated, across all segments
  and servers, the final result is computed by evaluating the postAggrExpression, performing
  set operations on the predicate theta-sketches (AND = intersection, OR = union).
The Theta-Sketch library being used is from org.apache.datasketches.
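
The flow described above can be sketched roughly as follows. This is only an illustration, not the code in this PR: exact `HashSet`s stand in for theta sketches so the example needs nothing beyond the JDK, and the class and method names (`PredicateSketchDemo`, `aggregate`, `and`, `or`) are hypothetical.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustrative stand-in: exact sets play the role of theta sketches.
final class PredicateSketchDemo {
  // One accumulated "sketch" per predicate, keyed by the predicate string.
  private final Map<String, Set<String>> sketchByPredicate = new HashMap<>();

  // Aggregation step: a value satisfied the predicate, so merge it into that predicate's sketch.
  void aggregate(String predicate, Set<String> value) {
    sketchByPredicate.computeIfAbsent(predicate, k -> new HashSet<>()).addAll(value);
  }

  // Final step, AND = intersection of the two predicate sketches.
  long and(String p1, String p2) {
    Set<String> result = new HashSet<>(get(p1));
    result.retainAll(get(p2));
    return result.size();
  }

  // Final step, OR = union of the two predicate sketches.
  long or(String p1, String p2) {
    Set<String> result = new HashSet<>(get(p1));
    result.addAll(get(p2));
    return result.size();
  }

  private Set<String> get(String predicate) {
    return sketchByPredicate.getOrDefault(predicate, Collections.emptySet());
  }
}
```

With real theta sketches the counts would be estimates rather than exact sizes, and the merge, intersection and union steps would go through the sketch library's set operations instead of `addAll` and `retainAll`.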
Now that more than one aggregation function takes multiple arguments, generalized the handling of
multiple args and removed the special casing of Distinct.
Refactored methods from PinotQuery2BrokerRequestConverter to reusable utility class ParserUtils.
Added unit tests for new code.

TODO:

- Performance tuning to bring the aggregation function up to par.
- Support complex predicates for p1, p2, p3; the current PR only supports predicates of the form LHS = RHS, LHS IN (...), etc.
- Evaluate theta-sketch creation params, and pick defaults or come up with ways for the user to configure them.
- MultiValue aggregation support.
- Auto-derive the predicates that make up postAggrExpression, so they don't need to be specified in the aggregation function.
- Auto-sharding of sketches with high cardinality into smaller ones to improve accuracy.
- Pql2Compiler.compileToExpressionTree does not seem to handle parenthesized expressions, e.g., ((col1 = 1 and col2 = 2) or col3 = 3).

codecov-io · 2020-04-29T06:55:51Z

Codecov Report

Merging #5316 into master will decrease coverage by 9.29%.
The diff coverage is 67.28%.

@@            Coverage Diff             @@
##           master    #5316      +/-   ##
==========================================
- Coverage   66.44%   57.15%   -9.30%     
==========================================
  Files        1075     1080       +5     
  Lines       54773    55056     +283     
  Branches     8168     8229      +61     
==========================================
- Hits        36396    31465    -4931     
- Misses      15700    21125    +5425     
+ Partials     2677     2466     -211

Impacted Files	Coverage Δ
...data/manager/realtime/DefaultSegmentCommitter.java	`0.00% <ø> (-80.00%)`	⬇️
...e/data/manager/realtime/SplitSegmentCommitter.java	`0.00% <ø> (-63.64%)`	⬇️
...ation/function/AggregationFunctionVisitorBase.java	`0.00% <0.00%> (ø)`
...startree/executor/StarTreeAggregationExecutor.java	`0.00% <0.00%> (-100.00%)`	⬇️
...ore/startree/executor/StarTreeGroupByExecutor.java	`0.00% <0.00%> (-87.50%)`	⬇️
.../java/org/apache/pinot/spi/data/TimeFieldSpec.java	`90.69% <ø> (+1.80%)`	⬆️
...inot/spi/ingestion/batch/IngestionJobLauncher.java	`8.51% <27.27%> (+8.51%)`	⬆️
...manager/realtime/LLRealtimeSegmentDataManager.java	`47.92% <38.46%> (-21.29%)`	⬇️
...n/DistinctCountThetaSketchAggregationFunction.java	`62.85% <62.85%> (ø)`
...org/apache/pinot/core/common/ObjectSerDeUtils.java	`88.29% <80.00%> (+0.02%)`	⬆️
... and 354 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update d54afba...1587d88.

bkuang88 · 2020-04-30T19:16:11Z

Curious in general: should we support modifying some of these theta sketch parameters per union/intersection/diff, or possibly a table-wide parameter to tune?

I might want to change the value of nominal entries per OR/AND in the same query, right?

kishoreg · 2020-04-30T20:06:27Z

can we simplify the way we define the expression?
for e.g.
select distinctCountThetaSketch(tsCol, "dim1='foo'", "dim2='bar'", "dim1='foo' and dim2='bar'")
to
select distinctCountThetaSketch(tsCol, "dim1='foo'", "dim2='bar'", "1 AND 2")

mayankshriv · 2020-04-30T22:38:30Z

> Curious in general: should we support modifying some of these theta sketch parameters per union/intersection/diff, or possibly a table-wide parameter to tune?
>
> I might want to change the value of nominal entries per OR/AND in the same query, right?

I am planning to add a ThetaSketchParams argument to the aggregation function (second argument, which can be empty), that will be of form "param1=val1;param2=val2...". This would be a backward compatible way to add more params in future.
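
A minimal sketch of how such a "param1=val1;param2=val2" string could be parsed, assuming keys and values contain no ';' or '=' characters. The class name `ThetaSketchParamsDemo` and the method `parse` are hypothetical and not taken from this PR.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: turns "param1=val1;param2=val2" into an ordered map.
final class ThetaSketchParamsDemo {
  private ThetaSketchParamsDemo() {
  }

  static Map<String, String> parse(String params) {
    Map<String, String> result = new LinkedHashMap<>();
    // An empty literal means "no params".
    if (params == null || params.trim().isEmpty()) {
      return result;
    }
    for (String pair : params.split(";")) {
      String trimmed = pair.trim();
      if (trimmed.isEmpty()) {
        continue; // tolerate a trailing ';'
      }
      int eq = trimmed.indexOf('=');
      if (eq <= 0) {
        throw new IllegalArgumentException("Expected key=value but got: " + pair);
      }
      result.put(trimmed.substring(0, eq).trim(), trimmed.substring(eq + 1).trim());
    }
    return result;
  }
}
```

Unknown keys are simply carried through in the map here, which is one way to keep the argument backward compatible as more params are added.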

mayankshriv · 2020-04-30T22:39:07Z

> can we simplify the way we define the expression?
> for e.g.
> select distinctCountThetaSketch(tsCol, "dim1='foo'", "dim2='bar'", "dim1='foo' and dim2='bar'")
> to
> select distinctCountThetaSketch(tsCol, "dim1='foo'", "dim2='bar'", "1 AND 2")

Yes, agreed, this is a good idea, especially when the predicates are long strings. Will address that in follow-up PRs.

mcvsubbu · 2020-05-07T17:45:10Z

> > can we simplify the way we define the expression?
> > for e.g.
> > select distinctCountThetaSketch(tsCol, "dim1='foo'", "dim2='bar'", "dim1='foo' and dim2='bar'")
> > to
> > select distinctCountThetaSketch(tsCol, "dim1='foo'", "dim2='bar'", "1 AND 2")
>
> Yes, agreed, this is a good idea, especially when the predicates are long strings. Will address that in follow-up PRs.

Is it possible to do select distinctCountThetaSketch(tsCol, ("dim1=foo" AND "dim2=bar")) and parse it to get what we want? Why is that a problem?

mayankshriv · 2020-05-07T18:03:15Z

> > > can we simplify the way we define the expression?
> > > for e.g.
> > > select distinctCountThetaSketch(tsCol, "dim1='foo'", "dim2='bar'", "dim1='foo' and dim2='bar'")
> > > to
> > > select distinctCountThetaSketch(tsCol, "dim1='foo'", "dim2='bar'", "1 AND 2")
> >
> > Yes, agreed, this is a good idea, especially when the predicates are long strings. Will address that in follow-up PRs.
>
> Is it possible to do select distinctCountThetaSketch(tsCol, ("dim1=foo" AND "dim2=bar")) and parse it to get what we want? Why is that a problem?

One issue with auto deriving is that we can only derive the lowest level predicates, and not a combination if that was already applied. However, I am unsure we can get to that, so I am already trying to see if I can get rid of p1/p2/p3... in this PR (will update).

mayankshriv · 2020-05-08T15:23:00Z

An advantage of explicitly specifying predicates is that complex predicates (with and/or) can be specified and applied at once to the filtered docs, which improves performance. For example, p1 can be of the form (col1 = 1 and col2 = 'x'), as opposed to two separate predicates col1 = 1 and col2 = 'x'. If we support the first form in the future, it would probably be faster to apply it at once than to apply the individual ones and then intersect them using theta-sketches.

BTW, I have updated the PR to make p1, p2, p3... optional, i.e. derive from postAggregationExpression.

mcvsubbu · 2020-05-08T17:43:43Z

Is it possible to process the (thetasketch) aggregation if a filter tree is provided? In that case, the query can simply be something like
select distinctThetasketch(column) where ...
just exploring possibilities for not having the user specify the arguments multiple times.

mayankshriv · 2020-05-08T22:10:16Z

> Is it possible to process the (thetasketch) aggregation if a filter tree is provided? In that case, the query can simply be something like
> select distinctThetasketch(column) where ...
> just exploring possibilities for not having the user specify the arguments multiple times.

The filter tree in where clause can be completely independent of postAggregationExpression, so this is not really possible.

Jackie-Jiang

LGTM otherwise. You can merge first, and I can rebase on yours for the aggregation cleanup.
For the parameters, I vote for using $1, $2 in the last expression to avoid extra compilation and to keep users from mistakenly putting in the wrong predicate. With that you can also easily verify whether the number of optional parameters is correct.
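
To make the $1, $2 idea concrete, here is a rough sketch of expanding positional references and checking the parameter count. It is not part of this PR; `PlaceholderExpressionDemo` and `expand` are hypothetical names, and the rule that every predicate must be referenced at least once is an assumption made for the example.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: expands "$1 AND $2" using the predicates passed as separate arguments.
final class PlaceholderExpressionDemo {
  private static final Pattern PLACEHOLDER = Pattern.compile("\\$(\\d+)");

  private PlaceholderExpressionDemo() {
  }

  static String expand(String expression, List<String> predicates) {
    Matcher matcher = PLACEHOLDER.matcher(expression);
    StringBuffer expanded = new StringBuffer();
    boolean[] used = new boolean[predicates.size()];
    while (matcher.find()) {
      int index = Integer.parseInt(matcher.group(1)); // 1-based position
      if (index < 1 || index > predicates.size()) {
        throw new IllegalArgumentException("No predicate for $" + index);
      }
      used[index - 1] = true;
      // quoteReplacement keeps '$' and '\' inside a predicate from being treated as group references.
      matcher.appendReplacement(expanded, Matcher.quoteReplacement("(" + predicates.get(index - 1) + ")"));
    }
    matcher.appendTail(expanded);
    // Assumed rule: a predicate that is never referenced indicates a user mistake.
    for (int i = 0; i < used.length; i++) {
      if (!used[i]) {
        throw new IllegalArgumentException("Predicate $" + (i + 1) + " is never referenced");
      }
    }
    return expanded.toString();
  }
}
```

For instance, `expand("$1 AND $2", Arrays.asList("dim1='foo'", "dim2='bar'"))` yields `(dim1='foo') AND (dim2='bar')`, while `"$1 AND $3"` with only two predicates is rejected.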

Jackie-Jiang · 2020-05-11T21:20:55Z

pinot-common/src/main/java/org/apache/pinot/common/function/AggregationFunctionType.java

@@ -33,6 +33,8 @@
  PERCENTILE("percentile"),
  PERCENTILEEST("percentileEst"),
  PERCENTILETDIGEST("percentileTDigest"),
+  DISTINCTCOUNTTHETASKETCH("DistinctCountThetaSketch"),


Let's not change the naming convention (first letter lower case).
Also I would recommend moving it next to FASTHLL

Jackie-Jiang · 2020-05-11T21:22:33Z

pinot-common/src/main/java/org/apache/pinot/parsers/utils/ParserUtils.java

+ * Class for holding Parser specific utility functions.
+ */
+public class ParserUtils {
+  static Map<FilterKind, FilterOperator> filterOperatorMapping;


(Naming convention)

Suggested change

-  static Map<FilterKind, FilterOperator> filterOperatorMapping;
+  private static final Map<FilterKind, FilterOperator> FILTER_OPERATOR_MAP;

Jackie-Jiang · 2020-05-11T21:24:10Z

pinot-common/src/main/java/org/apache/pinot/parsers/utils/ParserUtils.java

+  }
+
+  // Private constructor to disable instantiation.
+  private ParserUtils() {


(nit, your choice) I usually put this next to the class definition (line 35) so that it is very clear this is a pure util class.
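
For reference, a small sketch of the layout suggested in these comments: the private constructor directly under the class declaration and the lookup table as a `private static final` constant. `ParserUtilsSketch`, `Kind` and `Operator` are hypothetical stand-ins, not the actual Pinot classes.

```java
import java.util.Collections;
import java.util.EnumMap;
import java.util.Map;

// Hypothetical stand-in for a pure utility class.
final class ParserUtilsSketch {
  // Private constructor right below the class declaration: makes it obvious this is a util class.
  private ParserUtilsSketch() {
  }

  enum Kind { EQUALS, IN }        // stand-in for FilterKind
  enum Operator { EQUALITY, IN }  // stand-in for FilterOperator

  // Constant naming: private, static, final, upper snake case.
  private static final Map<Kind, Operator> FILTER_OPERATOR_MAP;

  static {
    Map<Kind, Operator> map = new EnumMap<>(Kind.class);
    map.put(Kind.EQUALS, Operator.EQUALITY);
    map.put(Kind.IN, Operator.IN);
    FILTER_OPERATOR_MAP = Collections.unmodifiableMap(map);
  }

  static Operator getFilterOperator(Kind kind) {
    return FILTER_OPERATOR_MAP.get(kind);
  }
}
```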

Jackie-Jiang · 2020-05-11T21:24:47Z

pinot-common/src/main/java/org/apache/pinot/parsers/utils/ParserUtils.java

+    return filterOperatorMapping.get(filterKind);
+  }
+
+  public static String getFilterColumn(Expression expression) {


javadoc?
How about expression with multiple operands?

Updated with javadoc, and specified that it only supports a single LHS column.

Jackie-Jiang · 2020-05-11T21:34:07Z

pinot-common/src/main/java/org/apache/pinot/pql/parsers/pql2/ast/FilterKind.java

+   * @return True if the enum is of Range type, false otherwise.
+   */
+  public boolean isRange() {
+    int ordinal = this.ordinal();


Why use ordinal?

Suggested change

-    int ordinal = this.ordinal();
+    return this == GREATER_THAN || this == GREATER_THAN_OR_EQUAL || this == LESS_THAN || this == LESS_THAN_OR_EQUAL || this == BETWEEN;
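
A self-contained sketch of the direct-comparison approach, to show why it is less fragile than checking ordinals: reordering or inserting constants cannot change the result. `FilterKindSketch` and its constant list are illustrative only and do not mirror the real FilterKind enum.

```java
// Illustrative enum; the constant list is an assumption for the example.
enum FilterKindSketch {
  EQUALS,
  GREATER_THAN,
  GREATER_THAN_OR_EQUAL,
  LESS_THAN,
  LESS_THAN_OR_EQUAL,
  BETWEEN,
  IN;

  // Compares against the constants by identity, so declaration order does not matter.
  boolean isRange() {
    return this == GREATER_THAN
        || this == GREATER_THAN_OR_EQUAL
        || this == LESS_THAN
        || this == LESS_THAN_OR_EQUAL
        || this == BETWEEN;
  }
}
```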

Jackie-Jiang · 2020-05-11T21:42:31Z