New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add initial SQL support for non-expression sketch postaggs #8487
Conversation
Marking WIP temporarily, need to update the docs |
Updated docs, removing WIP tag |
docs/querying/sql.md
Outdated
|--------|-----| | ||
|`THETA_SKETCH_ESTIMATE(expr)`|Returns the distinct count estimate from a theta sketch. `expr` must return a theta sketch.| | ||
|`THETA_SKETCH_ESTIMATE_WITH_ERROR_BOUNDS(expr, errorBoundsStdDev)`|Returns the distinct count estimate and error bounds from a theta sketch. `expr` must return a theta sketch.| | ||
|`THETA_SKETCH_UNION([size], expr0, expr1, ...)`|Returns a union of theta sketches, where each input expression must return an theta sketch. The `size` can be optionally specified as the first parameter.| |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo: an theta sketch -> a theta sketch
(same for following 2 lines)
FYI, it's possible that the spellcheck in travis will have some false positives on the function names. You can add suppress false positives for this file by adding entries after https://github.com/apache/incubator-druid/blob/master/website/.spelling#L1307
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks, fixed, looks like there were no spellcheck errors
Added |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jon-wei thank you for the very useful feature! Still reviewing, left some comments.
docs/querying/sql.md
Outdated
|
||
|Function|Notes| | ||
|--------|-----| | ||
|`HLL_SKETCH_ESTIMATE(expr)`|Returns the distinct count estimate from an HLL sketch. `expr` must return an HLL sketch.| |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wondering if it's possible to use the same function name for this and APPROX_COUNT_DISTINCT_DS_HLL
. It looks like it's possible to call a proper conversion depending on the context in DruidQuery
. I think we should use the same function name if possible even if it needs a big refactoring because it makes the life of Druid users easier.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, i actually like the new functions added in this PR more than APPROX_COUNT_DISTINCT_DS_HLL
, I think they express a clearer boundary between the aggregated sketch object and the operations that can be performed on said object. The new functions names also have a 1-to-1 correspondence with the underlying native aggs/postaggs which I think is nice.
If not for backwards compatibility, I would actually consider dropping APPROX_COUNT_DISTINCT_DS_HLL
, it was introduced as a way to allow people to use the datasketches implementations before we supported postaggs in SQL (with this patch).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After my second look, the new ones do look better to me too. Thanks.
|
||
|Function|Notes| | ||
|--------|-----| | ||
|`THETA_SKETCH_ESTIMATE(expr)`|Returns the distinct count estimate from a theta sketch. `expr` must return a theta sketch.| |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similarly, can we use the same function name for this and APPROX_COUNT_DISTINCT_DS_THETA
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My reasoning here is the same as in this comment: https://github.com/apache/incubator-druid/pull/8487/files#r335716107
|Function|Notes| | ||
|--------|-----| | ||
|`HLL_SKETCH_ESTIMATE(expr)`|Returns the distinct count estimate from an HLL sketch. `expr` must return an HLL sketch.| | ||
|`HLL_SKETCH_ESTIMATE_WITH_ERROR_BOUNDS(expr, [numStdDev])`|Returns the distinct count estimate and error bounds from an HLL sketch. `expr` must return an HLL sketch. An optional `numStdDev` argument can be provided.| |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, estimate
sounds unclear what it estimates to me. How about HLL_SKETCH_COUNT_DISTINCT
(or APPROX_COUNT_DISTINCT_DS_HLL
if it's same)? Also, does it make sense to use the same name for this and the above function? It sounds like HLL_SKETCH_ESTIMATE(expr)
should use a default error bound.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Within the context of this sketch and its documentation (https://druid.apache.org/docs/latest/development/extensions-core/datasketches-hll.html), I think the meaning of estimate
is clear
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, thanks.
|Function|Notes| | ||
|--------|-----| | ||
|`THETA_SKETCH_ESTIMATE(expr)`|Returns the distinct count estimate from a theta sketch. `expr` must return a theta sketch.| | ||
|`THETA_SKETCH_ESTIMATE_WITH_ERROR_BOUNDS(expr, errorBoundsStdDev)`|Returns the distinct count estimate and error bounds from a theta sketch. `expr` must return a theta sketch.| |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same question for the naming here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think given the context and other docs (https://druid.apache.org/docs/latest/development/extensions-core/datasketches-theta.html), estimate
is clear here
public byte[] getCacheKey() | ||
{ | ||
return new CacheKeyBuilder(PostAggregatorIds.HLL_SKETCH_TO_ESTIMATE_CACHE_TYPE_ID) | ||
.appendString(name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think name
shouldn't be in the cache key?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed, thanks
factory -> Calcites.createSqlType(factory, SqlTypeName.OTHER) | ||
), | ||
null, | ||
OperandTypes.VARIADIC, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here. variadic(SqlOperandCountRanges.from(2))
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to suggested
if (convertedPostAgg == null) { | ||
if (operandCounter == 0) { | ||
try { | ||
lgK = RexLiteral.intValue(operand); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we check the kind of operand
is LITERAL
here before we call intValue()
? Otherwise, this line can throw an AssertionError
which is not caught by the below catch
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added sqlkind checks here and in similar places
try { | ||
lgK = RexLiteral.intValue(operand); | ||
} | ||
catch (RuntimeException re) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should be better to narrow the exception here, so that we won't catch some exceptions unintentionally. Looks like ClassCastException
is the only possible RuntimeException
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Narrowed this to ClassCastException
here and elsewhere
if (convertedPostAgg == null) { | ||
if (operandCounter == 0) { | ||
try { | ||
size = RexLiteral.intValue(operand); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here for literal check and exception handling.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
|
||
public String getSetOperationName() | ||
{ | ||
throw new UnsupportedOperationException("getSetOperationName() is not implemented."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggest public abstract String getSetOperationName()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed this to abstract method
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. Left a couple of trivial comments.
SqlFunctionCategory.USER_DEFINED_FUNCTION | ||
); | ||
|
||
public HllSketchSetUnionOperatorConversion() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems unnecessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed
return oldVal; | ||
} | ||
|
||
public int getCounter() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please remove unused method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed
{ | ||
int oldVal = counter; | ||
counter++; | ||
return oldVal; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: can be simplified to return counter++;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to suggested
@jihoonson Addressed latest comments, also added a "round" parameter to the HLL sketch estimate postagg, and fixed the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks!
This PR adds support for using PostAggregators on Datasketches aggregators in SQL, by adding new SQL operators that accept quantile, HLL, and theta sketches.
To support this, the SQL system has been updated to allow operators to be converted into non-expression-based PostAggregators. This is accomplished by adding a new
PostAggregatorVisitor
class and adding new variants of the existing RexNode conversion methods that accept a post agg visitor.This PR does not add support for all current sketch postaggregators (such as
DoublesSketchToCDFPostAggregator
) and the tuple sketch is currently not supported, I plan to handle this in a follow on PR.This PR has: