New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Initial implementation for support Theta Sketches #5316
Conversation
386a855
to
7f74fbd
Compare
Codecov Report
@@ Coverage Diff @@
## master #5316 +/- ##
==========================================
- Coverage 66.44% 57.15% -9.30%
==========================================
Files 1075 1080 +5
Lines 54773 55056 +283
Branches 8168 8229 +61
==========================================
- Hits 36396 31465 -4931
- Misses 15700 21125 +5425
+ Partials 2677 2466 -211
Continue to review full report at Codecov.
|
7f74fbd
to
4d26bfb
Compare
Curious in general - should we support modifying some of these theta sketch parameters per union/intersection/diff or possibly a table-wide parameter to tune. I might want to change the value of nominal entries per OR/AND in the same query right? |
can we simplify the way we define the expression? |
I am planning to add a |
Yes agree, this is a good idea, specially in case when predicates are long strings. Will address that in following PRs. |
212b2a8
to
4f87384
Compare
Is it possible to do |
One issue with auto deriving is that we can only derive the lowest level predicates, and not a combination if that was already applied. However, I am unsure we can get to that, so I am already trying to see if I can get rid of p1/p2/p3... in this PR (will update). |
An advantage of explicitly specifying predicates is that complex predicates (with and/or) can be specified, and can be applied at once to the filtered docs, which improves performance. For example, p1 can be of form BTW, I have updated the PR to make |
Is it possible to process the (thetasketch) aggregation if a filter tree is provided? In that case, the query can simply be something like |
The filter tree in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM otherwise. You can merge first, and I can rebase on yours for the aggregation cleanup.
For the parameters, I vote for using $1
, $2
in the last expression to avoid extra compilation and user mistakenly put wrong predicate. With that you can also easily verify whether the number of optional parameters are correct.
@@ -33,6 +33,8 @@ | |||
PERCENTILE("percentile"), | |||
PERCENTILEEST("percentileEst"), | |||
PERCENTILETDIGEST("percentileTDigest"), | |||
DISTINCTCOUNTTHETASKETCH("DistinctCountThetaSketch"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's not change the naming convention (first letter lower case).
Also I would recommend moving it next to FASTHLL
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
* Class for holding Parser specific utility functions. | ||
*/ | ||
public class ParserUtils { | ||
static Map<FilterKind, FilterOperator> filterOperatorMapping; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Naming convention)
static Map<FilterKind, FilterOperator> filterOperatorMapping; | |
private static final Map<FilterKind, FilterOperator> FILTER_OPERATOR_MAP; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
} | ||
|
||
// Private constructor to disable instantiation. | ||
private ParserUtils() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(nit, your choice) I usually put this next to the class definition (line 35) so that it is very clear this is a pure util class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
return filterOperatorMapping.get(filterKind); | ||
} | ||
|
||
public static String getFilterColumn(Expression expression) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
javadoc?
How about expression with multiple operands?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated with javadoc, and specify that it only supports single LHS column.
* @return True if the enum is of Range type, false otherwise. | ||
*/ | ||
public boolean isRange() { | ||
int ordinal = this.ordinal(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why using ordinal?
int ordinal = this.ordinal(); | |
return this == GREATER_THAN || this == GREATER_THEN_OR_EQUAL || this == LESS_THEN || this == LESS_THAN_OR_EQUAL || this == BETWEEN; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
* @param dataType Data type for RHS of the predicate | ||
* @return Predicate Evaluator | ||
*/ | ||
public PredicateEvaluator getPredicateEvaluator(FieldSpec.DataType dataType) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Recommend having all types instead of grouping INT and LONG, FLOAT and DOUBLE to prevent unnecessary type conversion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Theta-Sketch api's don't support int/float.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This part is for predicate evaluation right? Theta-sketch should always be BYTES type
if (this == o) { | ||
return true; | ||
} | ||
if (o == null || getClass() != o.getClass()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (o == null || getClass() != o.getClass()) { | |
if (o instanceof PredicateEvaluator) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Auto-generated code, but will change.
} | ||
|
||
@Override | ||
public boolean equals(Object o) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For equals()
and hashCode()
, if _predicateEvaluator
is lazily set, you should not include it into the comparison or hash code calculation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, the lazily setting was a change made later on. Good catch.
throws SqlParseException { | ||
|
||
// Predicate Strings are optional. When not specified, they are derived from postAggregationExpression | ||
boolean predicatesSpecified = arguments.size() > 3; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cache arguments.size()
* <ul> | ||
* <li> Required: First expression is interpreted as theta sketch column to aggregate on. </li> | ||
* <li> Required: Second argument is the thetaSketchParams. </li> | ||
* <li> Optional: Second to penultimate are predicates with LHS and RHS. </li> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Third?
Also, is this optional?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, I had updated the PR to make individual predicates optional.
For the last parameter, should we use |
Yeah, I had thought about that, decided to use |
e85b31c
to
ec51ad7
Compare
1. Added an initial implementation for theta-sketch based distinct count aggregation function, which can be invoked as follows: `select distinctCountThetaSketch(thetaSketchColumn, thetaSketchParams, p1, p2..pn, postAggrExpression)` - Required: thetaSketchColumn is the column of type BYTES that stores serialized theta sketches. - Required: thetaSketchParams in the form "param1=v1;param2=v2..", pass empty literal '' if no params. - Optional: p1, p2 etc are filter predicates that make up the postAggregationExpression - Required: postAggrExpression is the expression built with AND/OR on predicates p1..pn. Example: `select distinctCountThetaSketch(tsCol, "nominalEntries=1024", "dim1='foo', dim2='bar', "dim1 = 'foo and dim2='bar') from table where dim1 = 'foo' or dim2 = 'bar'` 2. The aggregation function works as follows: - The postAggrExpression is basically an expression that AND/OR's some predicates, these predicates are expected to be specified as part of aggregation function. - The aggregation goes over all values in the blockValSet and applies each predicate on the values. If the predicate is satisfied, the theta-sketch is aggregated and stored as value in a map, where the key is the predicate. - Once all theta-sketches corresponding to all predicates are evaluated, across all segments and servers, the final result is computed by evaluating the postAggrFunction, by performing set operations on predicate theta-sketches (AND = intersection, OR = union). 3. The Theta-Sketch library being used is from org.apache.datasketches. 4. Now that more than one aggregation functions take multiple arguments, generalized handling of multiple args, and removed special casing of Distinct. 5. Refactored methods from PinotQuery2BrokerRequestConverter to reusable utility class ParserUtils. 6. Added unit tests for new code. TODO: 1. Performance tuning to ensure the aggregation function works at par. 2. Support complex predicates p1, p2, p3, current PR only supports predicates of form like `LHS = RHS`, `LHS IN (...)`, etc. 2. Evaluate theta-sketch creation params, and pick default or come up with ways to configure (by user). 3. MultiValue aggregation support. 4. Auto derive predicates that make up postAggrExpression, so they don't need to be specified in the aggregation function. 5. Auto sharding of Sketches with high cardinality into smaller ones to improve accuracy. 6. Pql2Compiler.compileToExpressionTree does not seem to handle parenthesized expressions, e.g., ((col1 = 1 and col2 = 2) or col3 = 3).
1. Add predicate evaluators for all types instead of LONG/DOUBLE/STRING only. 2. Update DistinctCountThetaSketchTest to test more data types. 3. Add hashCode() and equals() for Predicate class.
Added an initial implementation for theta-sketch based distinct count
aggregation function, which can be invoked as follows:
select distinctCountThetaSketch(thetaSketchColumn, thetaSketchParams, p1, p2..pn, postAggrExpression)
- Required: thetaSketchColumn is the column of type BYTES that stores serialized theta sketches.
- Required: thetaSketchParams in the form "param1=v1;param2=v2..", pass empty literal '' if no params.
- Optional: p1, p2 etc are filter predicates that make up the postAggregationExpression
- Required: postAggrExpression is the expression built with AND/OR on predicates p1..pn.
Example:
select distinctCountThetaSketch(tsCol, "nominalEntries=1024", "dim1='foo', dim2='bar', "dim1 = 'foo and dim2='bar') from table where dim1 = 'foo' or dim2 = 'bar'
The aggregation function works as follows:
these predicates are expected to be specified as part of aggregation function.
on the values. If the predicate is satisfied, the theta-sketch is aggregated and stored
as value in a map, where the key is the predicate.
and servers, the final result is computed by evaluating the postAggrFunction, by performing
set operations on predicate theta-sketches (AND = intersection, OR = union).
The Theta-Sketch library being used is from org.apache.datasketches.
Now that more than one aggregation functions take multiple arguments, generalized handling of
multiple args, and removed special casing of Distinct.
Refactored methods from PinotQuery2BrokerRequestConverter to reusable utility class ParserUtils.
Added unit tests for new code.
TODO:
LHS = RHS
,LHS IN (...)
, etc.to configure (by user).
in the aggregation function.
e.g., ((col1 = 1 and col2 = 2) or col3 = 3).