Add initial SQL support for non-expression sketch postaggs #8487
@@ -180,9 +180,12 @@ Only the COUNT aggregation can accept DISTINCT. | |
|`APPROX_COUNT_DISTINCT(expr)`|Counts distinct values of expr, which can be a regular column or a hyperUnique column. This is always approximate, regardless of the value of "useApproximateCountDistinct". This uses Druid's built-in "cardinality" or "hyperUnique" aggregators. See also `COUNT(DISTINCT expr)`.| | ||
|`APPROX_COUNT_DISTINCT_DS_HLL(expr, [lgK, tgtHllType])`|Counts distinct values of expr, which can be a regular column or an [HLL sketch](../development/extensions-core/datasketches-hll.html) column. The `lgK` and `tgtHllType` parameters are described in the HLL sketch documentation. This is always approximate, regardless of the value of "useApproximateCountDistinct". See also `COUNT(DISTINCT expr)`. The [DataSketches extension](../development/extensions-core/datasketches-extension.html) must be loaded to use this function.| | ||
|`APPROX_COUNT_DISTINCT_DS_THETA(expr, [size])`|Counts distinct values of expr, which can be a regular column or a [Theta sketch](../development/extensions-core/datasketches-theta.html) column. The `size` parameter is described in the Theta sketch documentation. This is always approximate, regardless of the value of "useApproximateCountDistinct". See also `COUNT(DISTINCT expr)`. The [DataSketches extension](../development/extensions-core/datasketches-extension.html) must be loaded to use this function.| | ||
|`DS_HLL(expr, [lgK, tgtHllType])`|Creates an [HLL sketch](../development/extensions-core/datasketches-hll.html) on the values of expr, which can be a regular column or a column containing HLL sketches. The `lgK` and `tgtHllType` parameters are described in the HLL sketch documentation. The [DataSketches extension](../development/extensions-core/datasketches-extension.html) must be loaded to use this function.| | ||
|`DS_THETA(expr, [size])`|Creates a [Theta sketch](../development/extensions-core/datasketches-theta.html) on the values of expr, which can be a regular column or a column containing Theta sketches. The `size` parameter is described in the Theta sketch documentation. The [DataSketches extension](../development/extensions-core/datasketches-extension.html) must be loaded to use this function.| | ||
|`APPROX_QUANTILE(expr, probability, [resolution])`|Computes approximate quantiles on numeric or [approxHistogram](../development/extensions-core/approximate-histograms.html#approximate-histogram-aggregator) exprs. The "probability" should be between 0 and 1 (exclusive). The "resolution" is the number of centroids to use for the computation. Higher resolutions will give more precise results but also have higher overhead. If not provided, the default resolution is 50. The [approximate histogram extension](../development/extensions-core/approximate-histograms.html) must be loaded to use this function.| | ||
|`APPROX_QUANTILE_DS(expr, probability, [k])`|Computes approximate quantiles on numeric or [Quantiles sketch](../development/extensions-core/datasketches-quantiles.html) exprs. The "probability" should be between 0 and 1 (exclusive). The `k` parameter is described in the Quantiles sketch documentation. The [DataSketches extension](../development/extensions-core/datasketches-extension.html) must be loaded to use this function.| | ||
|`APPROX_QUANTILE_FIXED_BUCKETS(expr, probability, numBuckets, lowerLimit, upperLimit, [outlierHandlingMode])`|Computes approximate quantiles on numeric or [fixed buckets histogram](../development/extensions-core/approximate-histograms.html#fixed-buckets-histogram) exprs. The "probability" should be between 0 and 1 (exclusive). The `numBuckets`, `lowerLimit`, `upperLimit`, and `outlierHandlingMode` parameters are described in the fixed buckets histogram documentation. The [approximate histogram extension](../development/extensions-core/approximate-histograms.html) must be loaded to use this function.| | ||
|`DS_QUANTILES_SKETCH(expr, [k])`|Creates a [Quantiles sketch](../development/extensions-core/datasketches-quantiles.html) on the values of expr, which can be a regular column or a column containing quantiles sketches. The `k` parameter is described in the Quantiles sketch documentation. The [DataSketches extension](../development/extensions-core/datasketches-extension.html) must be loaded to use this function.| | ||
|`BLOOM_FILTER(expr, numEntries)`|Computes a bloom filter from values produced by `expr`, with `numEntries` maximum number of distinct values before false positive rate increases. See [bloom filter extension](../development/extensions-core/bloom-filter.html) documentation for additional details.| | ||
|`TDIGEST_QUANTILE(expr, quantileFraction, [compression])`|Builds a T-Digest sketch on values produced by `expr` and returns the value for the quantile. Compression parameter (default value 100) determines the accuracy and size of the sketch. Higher compression means higher accuracy but more space to store sketches. See [t-digest extension](../development/extensions-contrib/tdigestsketch-quantiles.html) documentation for additional details.| | ||
|`TDIGEST_GENERATE_SKETCH(expr, [compression])`|Builds a T-Digest sketch on values produced by `expr`. Compression parameter (default value 100) determines the accuracy and size of the sketch. Higher compression means higher accuracy but more space to store sketches. See [t-digest extension](../development/extensions-contrib/tdigestsketch-quantiles.html) documentation for additional details.| | ||
|
@@ -363,6 +366,44 @@ All 'array' references in the multi-value string function documentation can refe | |
| `MV_TO_STRING(arr,str)` | joins all elements of arr by the delimiter specified by str | | ||
| `STRING_TO_MV(str1,str2)` | splits str1 into an array on the delimiter specified by str2 | | ||
|
||
### Sketch operators | ||
|
||
These functions operate on expressions or columns that return sketch objects. | ||
|
||
#### HLL sketch operators | ||
|
||
The following functions operate on [DataSketches HLL sketches](../development/extensions-core/datasketches-hll.html). | ||
The [DataSketches extension](../development/extensions-core/datasketches-extension.html) must be loaded to use the following functions. | ||
|
||
|Function|Notes| | ||
|--------|-----| | ||
|`HLL_SKETCH_ESTIMATE(expr, [round])`|Returns the distinct count estimate from an HLL sketch. `expr` must return an HLL sketch. The optional `round` boolean parameter will round the estimate if set to `true`, with a default of `false`.| | ||
|`HLL_SKETCH_ESTIMATE_WITH_ERROR_BOUNDS(expr, [numStdDev])`|Returns the distinct count estimate and error bounds from an HLL sketch. `expr` must return an HLL sketch. An optional `numStdDev` argument can be provided.| | ||
|`HLL_SKETCH_UNION([lgK, tgtHllType], expr0, expr1, ...)`|Returns a union of HLL sketches, where each input expression must return an HLL sketch. The `lgK` and `tgtHllType` can be optionally specified as the first parameter; if provided, both optional parameters must be specified.| | ||
|`HLL_SKETCH_TO_STRING(expr)`|Returns a human-readable string representation of an HLL sketch for debugging. `expr` must return an HLL sketch.| | ||
|
||
#### Theta sketch operators | ||
|
||
The following functions operate on [theta sketches](../development/extensions-core/datasketches-theta.html). | ||
The [DataSketches extension](../development/extensions-core/datasketches-extension.html) must be loaded to use the following functions. | ||
|
||
|Function|Notes| | ||
|--------|-----| | ||
|`THETA_SKETCH_ESTIMATE(expr)`|Returns the distinct count estimate from a theta sketch. `expr` must return a theta sketch.| | ||
Similarly, can we use the same function name for this and

My reasoning here is the same as in this comment: https://github.com/apache/incubator-druid/pull/8487/files#r335716107
||
|`THETA_SKETCH_ESTIMATE_WITH_ERROR_BOUNDS(expr, errorBoundsStdDev)`|Returns the distinct count estimate and error bounds from a theta sketch. `expr` must return a theta sketch.| | ||
Same question for the naming here.

I think given the context and other docs (https://druid.apache.org/docs/latest/development/extensions-core/datasketches-theta.html),
||
|`THETA_SKETCH_UNION([size], expr0, expr1, ...)`|Returns a union of theta sketches, where each input expression must return a theta sketch. The `size` can be optionally specified as the first parameter.| | ||
|`THETA_SKETCH_INTERSECT([size], expr0, expr1, ...)`|Returns an intersection of theta sketches, where each input expression must return a theta sketch. The `size` can be optionally specified as the first parameter.| | ||
|`THETA_SKETCH_NOT([size], expr0, expr1, ...)`|Returns a set difference of theta sketches, where each input expression must return a theta sketch. The `size` can be optionally specified as the first parameter.| | ||
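
The set semantics of the three theta operators above can be illustrated with exact `java.util.Set` values standing in for theta sketches. This is only a sketch under stated assumptions: the class and method names are hypothetical, nothing here calls Druid or DataSketches, and treating `THETA_SKETCH_NOT` with more than two inputs as "the first input minus every later input" is an assumption about the operator rather than something the table states.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration: exact sets stand in for theta sketches.
final class ThetaSetOpsSketch
{
  // Analog of THETA_SKETCH_UNION(expr0, expr1, ...): values present in any input.
  static Set<String> union(List<Set<String>> inputs)
  {
    Set<String> result = new LinkedHashSet<>();
    for (Set<String> input : inputs) {
      result.addAll(input);
    }
    return result;
  }

  // Analog of THETA_SKETCH_INTERSECT(expr0, expr1, ...): values present in every input.
  static Set<String> intersect(List<Set<String>> inputs)
  {
    if (inputs.isEmpty()) {
      return new LinkedHashSet<>();
    }
    Set<String> result = new LinkedHashSet<>(inputs.get(0));
    for (Set<String> input : inputs.subList(1, inputs.size())) {
      result.retainAll(input);
    }
    return result;
  }

  // Analog of THETA_SKETCH_NOT(expr0, expr1, ...), ASSUMING "first input minus all later inputs".
  static Set<String> not(List<Set<String>> inputs)
  {
    if (inputs.isEmpty()) {
      return new LinkedHashSet<>();
    }
    Set<String> result = new LinkedHashSet<>(inputs.get(0));
    for (Set<String> input : inputs.subList(1, inputs.size())) {
      result.removeAll(input);
    }
    return result;
  }

  public static void main(String[] args)
  {
    // Made-up visitor sets for two days.
    Set<String> monday = new LinkedHashSet<>(Arrays.asList("alice", "bob", "carol"));
    Set<String> tuesday = new LinkedHashSet<>(Arrays.asList("bob", "carol", "dave"));
    List<Set<String>> days = Arrays.asList(monday, tuesday);

    System.out.println(union(days).size());     // 4 distinct visitors across both days
    System.out.println(intersect(days).size()); // 2 visitors seen on both days
    System.out.println(not(days).size());       // 1 visitor seen on Monday only
  }
}
```

With real sketches the results are sketches rather than sets, so a count would come from applying `THETA_SKETCH_ESTIMATE` to the result, and that count is approximate.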
|
||
#### Quantiles sketch operators | ||
|
||
The following functions operate on [quantiles sketches](../development/extensions-core/datasketches-quantiles.html). | ||
The [DataSketches extension](../development/extensions-core/datasketches-extension.html) must be loaded to use the following functions. | ||
|
||
|Function|Notes| | ||
|--------|-----| | ||
|`DS_GET_QUANTILE(expr, fraction)`|Returns the quantile estimate corresponding to `fraction` from a quantiles sketch. `expr` must return a quantiles sketch.| | ||
|
||
### Other functions | ||
|
||
|Function|Notes| | ||
|
@@ -588,8 +629,6 @@ Connection context can be specified as JDBC connection properties or as a "conte | |
|`useApproximateCountDistinct`|Whether to use an approximate cardinality algorithm for `COUNT(DISTINCT foo)`.|druid.sql.planner.useApproximateCountDistinct on the Broker (default: true)| | ||
|`useApproximateTopN`|Whether to use approximate [TopN queries](topnquery.html) when a SQL query could be expressed as such. If false, exact [GroupBy queries](groupbyquery.html) will be used instead.|druid.sql.planner.useApproximateTopN on the Broker (default: true)| | ||
|
||
|
||
|
||
## Metadata tables | ||
|
||
Druid Brokers infer table and column metadata for each datasource from segments loaded in the cluster, and use this to | ||
|
@@ -0,0 +1,143 @@ | ||
/* | ||
* Licensed to the Apache Software Foundation (ASF) under one | ||
* or more contributor license agreements. See the NOTICE file | ||
* distributed with this work for additional information | ||
* regarding copyright ownership. The ASF licenses this file | ||
* to you under the Apache License, Version 2.0 (the | ||
* "License"); you may not use this file except in compliance | ||
* with the License. You may obtain a copy of the License at | ||
* | ||
* http://www.apache.org/licenses/LICENSE-2.0 | ||
* | ||
* Unless required by applicable law or agreed to in writing, | ||
* software distributed under the License is distributed on an | ||
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY | ||
* KIND, either express or implied. See the License for the | ||
* specific language governing permissions and limitations | ||
* under the License. | ||
*/ | ||
|
||
package org.apache.druid.query.aggregation.datasketches.hll; | ||
|
||
import com.fasterxml.jackson.annotation.JsonCreator; | ||
import com.fasterxml.jackson.annotation.JsonProperty; | ||
import com.yahoo.sketches.hll.HllSketch; | ||
import org.apache.druid.query.aggregation.AggregatorFactory; | ||
import org.apache.druid.query.aggregation.PostAggregator; | ||
import org.apache.druid.query.aggregation.post.ArithmeticPostAggregator; | ||
import org.apache.druid.query.aggregation.post.PostAggregatorIds; | ||
import org.apache.druid.query.cache.CacheKeyBuilder; | ||
|
||
import java.util.Comparator; | ||
import java.util.Map; | ||
import java.util.Objects; | ||
import java.util.Set; | ||
|
||
/** | ||
* Returns a distinct count estimate from a given {@link HllSketch}. | ||
* The result will be a double value. | ||
*/ | ||
public class HllSketchToEstimatePostAggregator implements PostAggregator | ||
{ | ||
private final String name; | ||
private final PostAggregator field; | ||
private final boolean round; | ||
|
||
@JsonCreator | ||
public HllSketchToEstimatePostAggregator( | ||
@JsonProperty("name") final String name, | ||
@JsonProperty("field") final PostAggregator field, | ||
@JsonProperty("round") boolean round | ||
) | ||
{ | ||
this.name = name; | ||
this.field = field; | ||
this.round = round; | ||
} | ||
|
||
@Override | ||
@JsonProperty | ||
public String getName() | ||
{ | ||
return name; | ||
} | ||
|
||
@JsonProperty | ||
public PostAggregator getField() | ||
{ | ||
return field; | ||
} | ||
|
||
@JsonProperty | ||
public boolean isRound() | ||
{ | ||
return round; | ||
} | ||
|
||
@Override | ||
public Set<String> getDependentFields() | ||
{ | ||
return field.getDependentFields(); | ||
} | ||
|
||
@Override | ||
public Comparator<Double> getComparator() | ||
{ | ||
return ArithmeticPostAggregator.DEFAULT_COMPARATOR; | ||
} | ||
|
||
@Override | ||
public Object compute(final Map<String, Object> combinedAggregators) | ||
{ | ||
final HllSketch sketch = (HllSketch) field.compute(combinedAggregators); | ||
return round ? Math.round(sketch.getEstimate()) : sketch.getEstimate(); | ||
} | ||
|
||
@Override | ||
public PostAggregator decorate(final Map<String, AggregatorFactory> aggregators) | ||
{ | ||
return this; | ||
} | ||
|
||
@Override | ||
public String toString() | ||
{ | ||
return getClass().getSimpleName() + "{" + | ||
"name='" + name + '\'' + | ||
", field=" + field + | ||
", round=" + round + | ||
"}"; | ||
} | ||
|
||
@Override | ||
public boolean equals(final Object o) | ||
{ | ||
if (this == o) { | ||
return true; | ||
} | ||
if (!(o instanceof HllSketchToEstimatePostAggregator)) { | ||
return false; | ||
} | ||
|
||
final HllSketchToEstimatePostAggregator that = (HllSketchToEstimatePostAggregator) o; | ||
|
||
if (round != that.round) { | ||
return false; | ||
} | ||
if (!name.equals(that.name)) { | ||
return false; | ||
} | ||
return field.equals(that.field); | ||
} | ||
|
||
@Override | ||
public int hashCode() | ||
{ | ||
return Objects.hash(name, field, round); | ||
} | ||
} | ||
|
||
@Override | ||
public byte[] getCacheKey() | ||
{ | ||
return new CacheKeyBuilder(PostAggregatorIds.HLL_SKETCH_TO_ESTIMATE_CACHE_TYPE_ID) | ||
.appendCacheable(field) | ||
.appendBoolean(round) | ||
.build(); | ||
} | ||
|
||
} |
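
One detail of `compute` above is easy to miss: the conditional expression mixes the `long` returned by `Math.round` with a `double`, so Java's binary numeric promotion makes the whole expression a `double`. The boxed result is therefore a `Double` whether or not `round` is set, which agrees with the Javadoc. The standalone sketch below reproduces only that expression; `RoundedEstimateSketch` and `estimate` are hypothetical names, and the raw value is a made-up input rather than real sketch output.

```java
// Hypothetical illustration of the return expression in compute(); no Druid classes involved.
final class RoundedEstimateSketch
{
  // rawEstimate stands in for sketch.getEstimate().
  static Object estimate(boolean round, double rawEstimate)
  {
    // long and double operands: the conditional expression has type double.
    return round ? Math.round(rawEstimate) : rawEstimate;
  }

  public static void main(String[] args)
  {
    Object rounded = estimate(true, 41.6);
    Object raw = estimate(false, 41.6);

    System.out.println(rounded.getClass().getSimpleName() + " " + rounded); // Double 42.0
    System.out.println(raw.getClass().getSimpleName() + " " + raw);         // Double 41.6
  }
}
```

If a `Long` were wanted for the rounded case, the two branches would have to be boxed separately, for example with an `if`/`else` that returns `Long.valueOf(...)` in one branch.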
Hmm, `estimate` sounds unclear what it estimates to me. How about `HLL_SKETCH_COUNT_DISTINCT` (or `APPROX_COUNT_DISTINCT_DS_HLL` if it's same)? Also, does it make sense to use the same name for this and the above function? It sounds like `HLL_SKETCH_ESTIMATE(expr)` should use a default error bound.

Within the context of this sketch and its documentation (https://druid.apache.org/docs/latest/development/extensions-core/datasketches-hll.html), I think the meaning of `estimate` is clear.

Ok, thanks.