Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add initial SQL support for non-expression sketch postaggs #8487

Merged
merged 13 commits into from Oct 18, 2019

Conversation

jon-wei
Copy link
Contributor

@jon-wei jon-wei commented Sep 7, 2019

This PR adds support for using PostAggregators on Datasketches aggregators in SQL, by adding new SQL operators that accept quantile, HLL, and theta sketches.

To support this, the SQL system has been updated to allow operators to be converted into non-expression-based PostAggregators. This is accomplished by adding a new PostAggregatorVisitor class and adding new variants of the existing RexNode conversion methods that accept a post agg visitor.

This PR does not add support for all current sketch postaggregators (such as DoublesSketchToCDFPostAggregator) and the tuple sketch is currently not supported, I plan to handle this in a follow on PR.

This PR has:

  • been self-reviewed.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added unit tests or modified existing tests to cover new code paths.

@jon-wei
Copy link
Contributor Author

jon-wei commented Sep 7, 2019

Marking WIP temporarily, need to update the docs

@jon-wei jon-wei removed the WIP label Sep 12, 2019
@jon-wei
Copy link
Contributor Author

jon-wei commented Sep 12, 2019

Updated docs, removing WIP tag

|--------|-----|
|`THETA_SKETCH_ESTIMATE(expr)`|Returns the distinct count estimate from a theta sketch. `expr` must return a theta sketch.|
|`THETA_SKETCH_ESTIMATE_WITH_ERROR_BOUNDS(expr, errorBoundsStdDev)`|Returns the distinct count estimate and error bounds from a theta sketch. `expr` must return a theta sketch.|
|`THETA_SKETCH_UNION([size], expr0, expr1, ...)`|Returns a union of theta sketches, where each input expression must return an theta sketch. The `size` can be optionally specified as the first parameter.|
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo: an theta sketch -> a theta sketch

(same for following 2 lines)

FYI, it's possible that the spellcheck in travis will have some false positives on the function names. You can add suppress false positives for this file by adding entries after https://github.com/apache/incubator-druid/blob/master/website/.spelling#L1307

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, fixed, looks like there were no spellcheck errors

@jihoonson
Copy link
Contributor

Added Design Review since this PR adds new SQL functions.

Copy link
Contributor

@jihoonson jihoonson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jon-wei thank you for the very useful feature! Still reviewing, left some comments.


|Function|Notes|
|--------|-----|
|`HLL_SKETCH_ESTIMATE(expr)`|Returns the distinct count estimate from an HLL sketch. `expr` must return an HLL sketch.|
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wondering if it's possible to use the same function name for this and APPROX_COUNT_DISTINCT_DS_HLL. It looks like it's possible to call a proper conversion depending on the context in DruidQuery. I think we should use the same function name if possible even if it needs a big refactoring because it makes the life of Druid users easier.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, i actually like the new functions added in this PR more than APPROX_COUNT_DISTINCT_DS_HLL, I think they express a clearer boundary between the aggregated sketch object and the operations that can be performed on said object. The new functions names also have a 1-to-1 correspondence with the underlying native aggs/postaggs which I think is nice.

If not for backwards compatibility, I would actually consider dropping APPROX_COUNT_DISTINCT_DS_HLL, it was introduced as a way to allow people to use the datasketches implementations before we supported postaggs in SQL (with this patch).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After my second look, the new ones do look better to me too. Thanks.


|Function|Notes|
|--------|-----|
|`THETA_SKETCH_ESTIMATE(expr)`|Returns the distinct count estimate from a theta sketch. `expr` must return a theta sketch.|
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly, can we use the same function name for this and APPROX_COUNT_DISTINCT_DS_THETA?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My reasoning here is the same as in this comment: https://github.com/apache/incubator-druid/pull/8487/files#r335716107

|Function|Notes|
|--------|-----|
|`HLL_SKETCH_ESTIMATE(expr)`|Returns the distinct count estimate from an HLL sketch. `expr` must return an HLL sketch.|
|`HLL_SKETCH_ESTIMATE_WITH_ERROR_BOUNDS(expr, [numStdDev])`|Returns the distinct count estimate and error bounds from an HLL sketch. `expr` must return an HLL sketch. An optional `numStdDev` argument can be provided.|
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, estimate sounds unclear what it estimates to me. How about HLL_SKETCH_COUNT_DISTINCT (or APPROX_COUNT_DISTINCT_DS_HLL if it's same)? Also, does it make sense to use the same name for this and the above function? It sounds like HLL_SKETCH_ESTIMATE(expr) should use a default error bound.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Within the context of this sketch and its documentation (https://druid.apache.org/docs/latest/development/extensions-core/datasketches-hll.html), I think the meaning of estimate is clear

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, thanks.

|Function|Notes|
|--------|-----|
|`THETA_SKETCH_ESTIMATE(expr)`|Returns the distinct count estimate from a theta sketch. `expr` must return a theta sketch.|
|`THETA_SKETCH_ESTIMATE_WITH_ERROR_BOUNDS(expr, errorBoundsStdDev)`|Returns the distinct count estimate and error bounds from a theta sketch. `expr` must return a theta sketch.|
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same question for the naming here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think given the context and other docs (https://druid.apache.org/docs/latest/development/extensions-core/datasketches-theta.html), estimate is clear here

public byte[] getCacheKey()
{
return new CacheKeyBuilder(PostAggregatorIds.HLL_SKETCH_TO_ESTIMATE_CACHE_TYPE_ID)
.appendString(name)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think name shouldn't be in the cache key?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed, thanks

factory -> Calcites.createSqlType(factory, SqlTypeName.OTHER)
),
null,
OperandTypes.VARIADIC,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here. variadic(SqlOperandCountRanges.from(2))?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to suggested

if (convertedPostAgg == null) {
if (operandCounter == 0) {
try {
lgK = RexLiteral.intValue(operand);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we check the kind of operand is LITERAL here before we call intValue()? Otherwise, this line can throw an AssertionError which is not caught by the below catch.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added sqlkind checks here and in similar places

try {
lgK = RexLiteral.intValue(operand);
}
catch (RuntimeException re) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be better to narrow the exception here, so that we won't catch some exceptions unintentionally. Looks like ClassCastException is the only possible RuntimeException.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Narrowed this to ClassCastException here and elsewhere

if (convertedPostAgg == null) {
if (operandCounter == 0) {
try {
size = RexLiteral.intValue(operand);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here for literal check and exception handling.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed


public String getSetOperationName()
{
throw new UnsupportedOperationException("getSetOperationName() is not implemented.");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest public abstract String getSetOperationName().

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed this to abstract method

Copy link
Contributor

@jihoonson jihoonson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Left a couple of trivial comments.

SqlFunctionCategory.USER_DEFINED_FUNCTION
);

public HllSketchSetUnionOperatorConversion()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems unnecessary.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed

return oldVal;
}

public int getCounter()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove unused method.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed

{
int oldVal = counter;
counter++;
return oldVal;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can be simplified to return counter++;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to suggested

@jon-wei
Copy link
Contributor Author

jon-wei commented Oct 18, 2019

@jihoonson Addressed latest comments, also added a "round" parameter to the HLL sketch estimate postagg, and fixed the numStdDev param for HLL_SKETCH_ESTIMATE_WITH_ERROR_BOUNDS to actually be optional (it's optional in the docs and the underlying postagg).

Copy link
Contributor

@jihoonson jihoonson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks!

@jon-wei jon-wei merged commit d880752 into apache:master Oct 18, 2019
@gianm gianm added this to the 0.17.0 milestone Oct 19, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants