Add initial SQL support for non-expression sketch postaggs #8487

Merged
merged 13 commits into from Oct 18, 2019
21 changes: 19 additions & 2 deletions docs/development/extensions-core/datasketches-hll.md
@@ -67,8 +67,26 @@ druid.extensions.loadList=["druid-datasketches"]

### Post Aggregators

#### Estimate

Returns the distinct count estimate as a double.

```
{
  "type" : "HLLSketchEstimate",
  "name": <output name>,
  "field" : <post aggregator that returns an HLL Sketch>,
  "round" : <if true, round the estimate. Default is false>
}
```
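
As a concrete illustration of the template above, the following Java sketch assembles a filled-in `HLLSketchEstimate` spec whose `field` is a `fieldAccess` post aggregator. The class name, the method, and the output and aggregator names are hypothetical; the string concatenation assumes names that need no JSON escaping.

```java
// Hypothetical helper, not part of Druid: fills in the HLLSketchEstimate template.
final class HllEstimatePostAggJson
{
  // Builds the spec for a post aggregator that reads an HLL sketch aggregator
  // through "fieldAccess". Assumes both names contain no characters that need escaping.
  static String build(String outputName, String sketchAggregatorName, boolean round)
  {
    return "{"
           + "\"type\":\"HLLSketchEstimate\","
           + "\"name\":\"" + outputName + "\","
           + "\"field\":{\"type\":\"fieldAccess\",\"fieldName\":\"" + sketchAggregatorName + "\"},"
           + "\"round\":" + round
           + "}";
  }

  public static void main(String[] args)
  {
    // "unique_users" and "users_sketch" are made-up names for illustration.
    System.out.println(build("unique_users", "users_sketch", true));
  }
}
```

A real client would normally use a JSON library rather than concatenation; the point here is only the shape of the resulting spec.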

#### Estimate with bounds

Returns a distinct count estimate and error bounds from an HLL sketch.
The result will be an array containing three double values: estimate, lower bound and upper bound.
The bounds are provided at a given number of standard deviations (optional, defaults to 1).
This must be an integer value of 1, 2 or 3 corresponding to approximately 68.3%, 95.4% and 99.7% confidence intervals.
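
The rule for the number of standard deviations can be sketched in Java as follows. This is an illustration only: the class and method names are hypothetical, and rejecting out-of-range values at this point is an assumption, since the diff does not show where the real validation happens.

```java
// Hypothetical helper, not part of Druid: illustrates the numStdDevs rule described above.
final class NumStdDevsExample
{
  static final int DEFAULT_NUM_STD_DEVS = 1;

  // Applies the default when the value is omitted (null) and rejects anything outside 1..3.
  static int resolve(Integer numStdDevs)
  {
    final int n = numStdDevs == null ? DEFAULT_NUM_STD_DEVS : numStdDevs;
    if (n < 1 || n > 3) {
      throw new IllegalArgumentException("numStdDevs must be 1, 2 or 3, got " + n);
    }
    return n;
  }

  // Approximate confidence level, in percent, quoted in the text for 1, 2 or 3 standard deviations.
  static double approxConfidencePercent(int n)
  {
    switch (n) {
      case 1:
        return 68.3;
      case 2:
        return 95.4;
      case 3:
        return 99.7;
      default:
        throw new IllegalArgumentException("numStdDevs must be 1, 2 or 3, got " + n);
    }
  }

  public static void main(String[] args)
  {
    System.out.println(approxConfidencePercent(resolve(null))); // default of 1 standard deviation
    System.out.println(approxConfidencePercent(resolve(2)));
  }
}
```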

```
{
  "type" : "HLLSketchEstimateWithBounds",
@@ -92,13 +110,12 @@ druid.extensions.loadList=["druid-datasketches"]

#### Sketch to string

Human-readable sketch summary for debugging.

```
{
  "type" : "HLLSketchToString",
  "name": <output name>,
  "field" : <post aggregator that returns an HLL Sketch>
}
```
43 changes: 41 additions & 2 deletions docs/querying/sql.md
@@ -180,9 +180,12 @@ Only the COUNT aggregation can accept DISTINCT.
|`APPROX_COUNT_DISTINCT(expr)`|Counts distinct values of expr, which can be a regular column or a hyperUnique column. This is always approximate, regardless of the value of "useApproximateCountDistinct". This uses Druid's built-in "cardinality" or "hyperUnique" aggregators. See also `COUNT(DISTINCT expr)`.|
|`APPROX_COUNT_DISTINCT_DS_HLL(expr, [lgK, tgtHllType])`|Counts distinct values of expr, which can be a regular column or an [HLL sketch](../development/extensions-core/datasketches-hll.html) column. The `lgK` and `tgtHllType` parameters are described in the HLL sketch documentation. This is always approximate, regardless of the value of "useApproximateCountDistinct". See also `COUNT(DISTINCT expr)`. The [DataSketches extension](../development/extensions-core/datasketches-extension.html) must be loaded to use this function.|
|`APPROX_COUNT_DISTINCT_DS_THETA(expr, [size])`|Counts distinct values of expr, which can be a regular column or a [Theta sketch](../development/extensions-core/datasketches-theta.html) column. The `size` parameter is described in the Theta sketch documentation. This is always approximate, regardless of the value of "useApproximateCountDistinct". See also `COUNT(DISTINCT expr)`. The [DataSketches extension](../development/extensions-core/datasketches-extension.html) must be loaded to use this function.|
|`DS_HLL(expr, [lgK, tgtHllType])`|Creates an [HLL sketch](../development/extensions-core/datasketches-hll.html) on the values of expr, which can be a regular column or a column containing HLL sketches. The `lgK` and `tgtHllType` parameters are described in the HLL sketch documentation. The [DataSketches extension](../development/extensions-core/datasketches-extension.html) must be loaded to use this function.|
|`DS_THETA(expr, [size])`|Creates a [Theta sketch](../development/extensions-core/datasketches-theta.html) on the values of expr, which can be a regular column or a column containing Theta sketches. The `size` parameter is described in the Theta sketch documentation. The [DataSketches extension](../development/extensions-core/datasketches-extension.html) must be loaded to use this function.|
|`APPROX_QUANTILE(expr, probability, [resolution])`|Computes approximate quantiles on numeric or [approxHistogram](../development/extensions-core/approximate-histograms.html#approximate-histogram-aggregator) exprs. The "probability" should be between 0 and 1 (exclusive). The "resolution" is the number of centroids to use for the computation. Higher resolutions will give more precise results but also have higher overhead. If not provided, the default resolution is 50. The [approximate histogram extension](../development/extensions-core/approximate-histograms.html) must be loaded to use this function.|
|`APPROX_QUANTILE_DS(expr, probability, [k])`|Computes approximate quantiles on numeric or [Quantiles sketch](../development/extensions-core/datasketches-quantiles.html) exprs. The "probability" should be between 0 and 1 (exclusive). The `k` parameter is described in the Quantiles sketch documentation. The [DataSketches extension](../development/extensions-core/datasketches-extension.html) must be loaded to use this function.|
|`APPROX_QUANTILE_FIXED_BUCKETS(expr, probability, numBuckets, lowerLimit, upperLimit, [outlierHandlingMode])`|Computes approximate quantiles on numeric or [fixed buckets histogram](../development/extensions-core/approximate-histograms.html#fixed-buckets-histogram) exprs. The "probability" should be between 0 and 1 (exclusive). The `numBuckets`, `lowerLimit`, `upperLimit`, and `outlierHandlingMode` parameters are described in the fixed buckets histogram documentation. The [approximate histogram extension](../development/extensions-core/approximate-histograms.html) must be loaded to use this function.|
|`DS_QUANTILES_SKETCH(expr, [k])`|Creates a [Quantiles sketch](../development/extensions-core/datasketches-quantiles.html) on the values of expr, which can be a regular column or a column containing quantiles sketches. The `k` parameter is described in the Quantiles sketch documentation. The [DataSketches extension](../development/extensions-core/datasketches-extension.html) must be loaded to use this function.|
|`BLOOM_FILTER(expr, numEntries)`|Computes a bloom filter from values produced by `expr`, with `numEntries` maximum number of distinct values before false positive rate increases. See [bloom filter extension](../development/extensions-core/bloom-filter.html) documentation for additional details.|
|`TDIGEST_QUANTILE(expr, quantileFraction, [compression])`|Builds a T-Digest sketch on values produced by `expr` and returns the value for the quantile. Compression parameter (default value 100) determines the accuracy and size of the sketch. Higher compression means higher accuracy but more space to store sketches. See [t-digest extension](../development/extensions-contrib/tdigestsketch-quantiles.html) documentation for additional details.|
|`TDIGEST_GENERATE_SKETCH(expr, [compression])`|Builds a T-Digest sketch on values produced by `expr`. Compression parameter (default value 100) determines the accuracy and size of the sketch. Higher compression means higher accuracy but more space to store sketches. See [t-digest extension](../development/extensions-contrib/tdigestsketch-quantiles.html) documentation for additional details.|
@@ -363,6 +366,44 @@ All 'array' references in the multi-value string function documentation can refe
| `MV_TO_STRING(arr,str)` | joins all elements of arr by the delimiter specified by str |
| `STRING_TO_MV(str1,str2)` | splits str1 into an array on the delimiter specified by str2 |

### Sketch operators

These functions operate on expressions or columns that return sketch objects.

#### HLL sketch operators

The following functions operate on [DataSketches HLL sketches](../development/extensions-core/datasketches-hll.html).
The [DataSketches extension](../development/extensions-core/datasketches-extension.html) must be loaded to use the following functions.

|Function|Notes|
|--------|-----|
|`HLL_SKETCH_ESTIMATE(expr, [round])`|Returns the distinct count estimate from an HLL sketch. `expr` must return an HLL sketch. The optional `round` boolean parameter will round the estimate if set to `true`, with a default of `false`.|
|`HLL_SKETCH_ESTIMATE_WITH_ERROR_BOUNDS(expr, [numStdDev])`|Returns the distinct count estimate and error bounds from an HLL sketch. `expr` must return an HLL sketch. The optional `numStdDev` argument sets the number of standard deviations used for the bounds; it must be 1, 2 or 3 and defaults to 1.|
Contributor

Hmm, estimate sounds unclear what it estimates to me. How about HLL_SKETCH_COUNT_DISTINCT (or APPROX_COUNT_DISTINCT_DS_HLL if it's same)? Also, does it make sense to use the same name for this and the above function? It sounds like HLL_SKETCH_ESTIMATE(expr) should use a default error bound.

Contributor Author

Within the context of this sketch and its documentation (https://druid.apache.org/docs/latest/development/extensions-core/datasketches-hll.html), I think the meaning of estimate is clear

Contributor

Ok, thanks.

|`HLL_SKETCH_UNION([lgK, tgtHllType], expr0, expr1, ...)`|Returns a union of HLL sketches, where each input expression must return an HLL sketch. The `lgK` and `tgtHllType` can be optionally specified as the first parameter; if provided, both optional parameters must be specified.|
|`HLL_SKETCH_TO_STRING(expr)`|Returns a human-readable string representation of an HLL sketch for debugging. `expr` must return an HLL sketch.|

#### Theta sketch operators

The following functions operate on [theta sketches](../development/extensions-core/datasketches-theta.html).
The [DataSketches extension](../development/extensions-core/datasketches-extension.html) must be loaded to use the following functions.

|Function|Notes|
|--------|-----|
|`THETA_SKETCH_ESTIMATE(expr)`|Returns the distinct count estimate from a theta sketch. `expr` must return a theta sketch.|
Contributor

Similarly, can we use the same function name for this and APPROX_COUNT_DISTINCT_DS_THETA?

Contributor Author

My reasoning here is the same as in this comment: https://github.com/apache/incubator-druid/pull/8487/files#r335716107

|`THETA_SKETCH_ESTIMATE_WITH_ERROR_BOUNDS(expr, errorBoundsStdDev)`|Returns the distinct count estimate and error bounds from a theta sketch. `expr` must return a theta sketch. `errorBoundsStdDev` is the number of standard deviations used for the error bounds.|
Contributor

Same question for the naming here.

Contributor Author

I think given the context and other docs (https://druid.apache.org/docs/latest/development/extensions-core/datasketches-theta.html), estimate is clear here

|`THETA_SKETCH_UNION([size], expr0, expr1, ...)`|Returns a union of theta sketches, where each input expression must return a theta sketch. The `size` can be optionally specified as the first parameter.|
|`THETA_SKETCH_INTERSECT([size], expr0, expr1, ...)`|Returns an intersection of theta sketches, where each input expression must return a theta sketch. The `size` can be optionally specified as the first parameter.|
|`THETA_SKETCH_NOT([size], expr0, expr1, ...)`|Returns a set difference of theta sketches, where each input expression must return a theta sketch. The `size` can be optionally specified as the first parameter.|
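
To make the three set operations concrete, here is a small Java sketch that uses exact sets as a stand-in for theta sketches. It is an illustration only: real theta sketches return approximate results, the class and method names are hypothetical, and treating `NOT` over more than two inputs as "the first input minus all the others" is an assumption about the semantics.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Illustration only: exact sets stand in for theta sketches.
final class ThetaSetOpsExample
{
  // Everything that appears in at least one input.
  static Set<String> union(List<Set<String>> inputs)
  {
    final Set<String> result = new TreeSet<>();
    for (Set<String> input : inputs) {
      result.addAll(input);
    }
    return result;
  }

  // Only what appears in every input.
  static Set<String> intersect(List<Set<String>> inputs)
  {
    final Set<String> result = new TreeSet<>(inputs.get(0));
    for (Set<String> input : inputs.subList(1, inputs.size())) {
      result.retainAll(input);
    }
    return result;
  }

  // Assumed semantics: the first input with the members of all later inputs removed.
  static Set<String> not(List<Set<String>> inputs)
  {
    final Set<String> result = new TreeSet<>(inputs.get(0));
    for (Set<String> input : inputs.subList(1, inputs.size())) {
      result.removeAll(input);
    }
    return result;
  }

  public static void main(String[] args)
  {
    final Set<String> a = new TreeSet<>(Arrays.asList("u1", "u2", "u3"));
    final Set<String> b = new TreeSet<>(Arrays.asList("u2", "u3", "u4"));
    final List<Set<String>> inputs = Arrays.asList(a, b);
    System.out.println(union(inputs));     // [u1, u2, u3, u4]
    System.out.println(intersect(inputs)); // [u2, u3]
    System.out.println(not(inputs));       // [u1]
  }
}
```

Applying `THETA_SKETCH_ESTIMATE` to the result of one of the SQL set functions would correspond to taking the size of the resulting set in this exact model.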

#### Quantiles sketch operators

The following functions operate on [quantiles sketches](../development/extensions-core/datasketches-quantiles.html).
The [DataSketches extension](../development/extensions-core/datasketches-extension.html) must be loaded to use the following functions.

|Function|Notes|
|--------|-----|
|`DS_GET_QUANTILE(expr, fraction)`|Returns the quantile estimate corresponding to `fraction` from a quantiles sketch. `expr` must return a quantiles sketch.|
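
To show what `fraction` means, the following Java sketch computes an exact quantile by sorting. It is only a model of the idea: a quantiles sketch returns an approximation without sorting all values, and the rank convention used here (index `floor(fraction * n)`, clamped to the last element) is an assumption that may differ from the sketch's own. The class and method names are hypothetical.

```java
import java.util.Arrays;

// Illustration only: exact quantile by sorting, as a model for what a quantiles sketch approximates.
final class ExactQuantileExample
{
  // Returns the value at index floor(fraction * n) of the sorted input, clamped to the last element.
  static double quantile(double[] values, double fraction)
  {
    if (values.length == 0) {
      throw new IllegalArgumentException("values must not be empty");
    }
    if (fraction < 0.0 || fraction > 1.0) {
      throw new IllegalArgumentException("fraction must be between 0 and 1");
    }
    final double[] sorted = values.clone(); // leave the caller's array untouched
    Arrays.sort(sorted);
    final int index = (int) Math.min(sorted.length - 1, Math.floor(fraction * sorted.length));
    return sorted[index];
  }

  public static void main(String[] args)
  {
    final double[] latencies = {7, 3, 9, 1, 5, 10, 2, 8, 4, 6};
    System.out.println(quantile(latencies, 0.5)); // 6.0 under this rank convention
    System.out.println(quantile(latencies, 1.0)); // 10.0
  }
}
```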

### Other functions

|Function|Notes|
@@ -588,8 +629,6 @@ Connection context can be specified as JDBC connection properties or as a "conte
|`useApproximateCountDistinct`|Whether to use an approximate cardinality algorithm for `COUNT(DISTINCT foo)`.|druid.sql.planner.useApproximateCountDistinct on the Broker (default: true)|
|`useApproximateTopN`|Whether to use approximate [TopN queries](topnquery.html) when a SQL query could be expressed as such. If false, exact [GroupBy queries](groupbyquery.html) will be used instead.|druid.sql.planner.useApproximateTopN on the Broker (default: true)|



## Metadata tables

Druid Brokers infer table and column metadata for each datasource from segments loaded in the cluster, and use this to
@@ -40,6 +40,7 @@
 */
public abstract class HllSketchAggregatorFactory extends AggregatorFactory
{
  public static final boolean DEFAULT_ROUND = false;
  public static final int DEFAULT_LG_K = 12;
  public static final TgtHllType DEFAULT_TGT_HLL_TYPE = TgtHllType.HLL_4;

@@ -26,7 +26,12 @@
import com.google.inject.Binder;
import com.yahoo.sketches.hll.HllSketch;
import org.apache.druid.initialization.DruidModule;
import org.apache.druid.query.aggregation.datasketches.hll.sql.HllSketchSqlAggregator;
import org.apache.druid.query.aggregation.datasketches.hll.sql.HllSketchApproxCountDistinctSqlAggregator;
import org.apache.druid.query.aggregation.datasketches.hll.sql.HllSketchEstimateOperatorConversion;
import org.apache.druid.query.aggregation.datasketches.hll.sql.HllSketchEstimateWithErrorBoundsOperatorConversion;
import org.apache.druid.query.aggregation.datasketches.hll.sql.HllSketchObjectSqlAggregator;
import org.apache.druid.query.aggregation.datasketches.hll.sql.HllSketchSetUnionOperatorConversion;
import org.apache.druid.query.aggregation.datasketches.hll.sql.HllSketchToStringOperatorConversion;
import org.apache.druid.segment.serde.ComplexMetrics;
import org.apache.druid.sql.guice.SqlBindings;

@@ -46,12 +51,20 @@ public class HllSketchModule implements DruidModule
  public static final String TO_STRING_TYPE_NAME = "HLLSketchToString";
  public static final String UNION_TYPE_NAME = "HLLSketchUnion";
  public static final String ESTIMATE_WITH_BOUNDS_TYPE_NAME = "HLLSketchEstimateWithBounds";
  public static final String ESTIMATE_TYPE_NAME = "HLLSketchEstimate";


  @Override
  public void configure(final Binder binder)
  {
    registerSerde();
    SqlBindings.addAggregator(binder, HllSketchSqlAggregator.class);
    SqlBindings.addAggregator(binder, HllSketchApproxCountDistinctSqlAggregator.class);
    SqlBindings.addAggregator(binder, HllSketchObjectSqlAggregator.class);

    SqlBindings.addOperatorConversion(binder, HllSketchEstimateOperatorConversion.class);
    SqlBindings.addOperatorConversion(binder, HllSketchEstimateWithErrorBoundsOperatorConversion.class);
    SqlBindings.addOperatorConversion(binder, HllSketchSetUnionOperatorConversion.class);
    SqlBindings.addOperatorConversion(binder, HllSketchToStringOperatorConversion.class);
  }

  @Override
@@ -64,7 +77,8 @@ public List<? extends Module> getJacksonModules()
            new NamedType(HllSketchMergeAggregatorFactory.class, TYPE_NAME),
            new NamedType(HllSketchToStringPostAggregator.class, TO_STRING_TYPE_NAME),
            new NamedType(HllSketchUnionPostAggregator.class, UNION_TYPE_NAME),
            new NamedType(HllSketchToEstimateWithBoundsPostAggregator.class, ESTIMATE_WITH_BOUNDS_TYPE_NAME),
            new NamedType(HllSketchToEstimatePostAggregator.class, ESTIMATE_TYPE_NAME)
        ).addSerializer(HllSketch.class, new HllSketchJsonSerializer())
    );
  }
@@ -0,0 +1,143 @@
/*
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements. See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership. The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License. You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing,
 * software distributed under the License is distributed on an
 * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 * KIND, either express or implied. See the License for the
 * specific language governing permissions and limitations
 * under the License.
 */

package org.apache.druid.query.aggregation.datasketches.hll;

import com.fasterxml.jackson.annotation.JsonCreator;
import com.fasterxml.jackson.annotation.JsonProperty;
import com.yahoo.sketches.hll.HllSketch;
import org.apache.druid.query.aggregation.AggregatorFactory;
import org.apache.druid.query.aggregation.PostAggregator;
import org.apache.druid.query.aggregation.post.ArithmeticPostAggregator;
import org.apache.druid.query.aggregation.post.PostAggregatorIds;
import org.apache.druid.query.cache.CacheKeyBuilder;

import java.util.Comparator;
import java.util.Map;
import java.util.Objects;
import java.util.Set;

/**
 * Returns a distinct count estimate from a given {@link HllSketch}.
 * The result will be a double value.
 */
public class HllSketchToEstimatePostAggregator implements PostAggregator
{
  private final String name;
  private final PostAggregator field;
  private final boolean round;

  @JsonCreator
  public HllSketchToEstimatePostAggregator(
      @JsonProperty("name") final String name,
      @JsonProperty("field") final PostAggregator field,
      @JsonProperty("round") boolean round
  )
  {
    this.name = name;
    this.field = field;
    this.round = round;
  }

  @Override
  @JsonProperty
  public String getName()
  {
    return name;
  }

  @JsonProperty
  public PostAggregator getField()
  {
    return field;
  }

  @JsonProperty
  public boolean isRound()
  {
    return round;
  }

  @Override
  public Set<String> getDependentFields()
  {
    return field.getDependentFields();
  }

  @Override
  public Comparator<Double> getComparator()
  {
    return ArithmeticPostAggregator.DEFAULT_COMPARATOR;
  }

  @Override
  public Object compute(final Map<String, Object> combinedAggregators)
  {
    final HllSketch sketch = (HllSketch) field.compute(combinedAggregators);
    // Both branches are promoted to double, so the boxed result is always a Double.
    return round ? Math.round(sketch.getEstimate()) : sketch.getEstimate();
  }

  @Override
  public PostAggregator decorate(final Map<String, AggregatorFactory> aggregators)
  {
    return this;
  }

  @Override
  public String toString()
  {
    return getClass().getSimpleName() + "{" +
           "name='" + name + '\'' +
           ", field=" + field +
           ", round=" + round +
           "}";
  }

  @Override
  public boolean equals(final Object o)
  {
    if (this == o) {
      return true;
    }
    if (!(o instanceof HllSketchToEstimatePostAggregator)) {
      return false;
    }

    final HllSketchToEstimatePostAggregator that = (HllSketchToEstimatePostAggregator) o;

    if (!name.equals(that.name)) {
      return false;
    }
    // round changes the computed value, so it must take part in equality.
    if (round != that.round) {
      return false;
    }
    return field.equals(that.field);
  }

  @Override
  public int hashCode()
  {
    return Objects.hash(name, field, round);
  }

  @Override
  public byte[] getCacheKey()
  {
    // Include round so that rounded and unrounded results do not share a cache entry.
    return new CacheKeyBuilder(PostAggregatorIds.HLL_SKETCH_TO_ESTIMATE_CACHE_TYPE_ID)
        .appendCacheable(field)
        .appendBoolean(round)
        .build();
  }

}
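
One subtlety in `compute` above is worth spelling out. The two branches of the conditional have types `long` (from `Math.round`) and `double`, so Java applies binary numeric promotion and the whole expression has type `double`. The boxed result is therefore a `Double` even when `round` is true, which matches the `Comparator<Double>` returned by `getComparator`. The standalone sketch below demonstrates this with a plain `double` in place of `sketch.getEstimate()`; the class name is hypothetical and nothing here depends on Druid or DataSketches.

```java
// Illustration only: shows the type of a conditional expression with long and double branches.
final class RoundPromotionExample
{
  // Same shape as the conditional in compute(), with the estimate passed in directly.
  static Object compute(boolean round, double estimate)
  {
    return round ? Math.round(estimate) : estimate;
  }

  public static void main(String[] args)
  {
    final Object rounded = compute(true, 1234.56);
    final Object raw = compute(false, 1234.56);
    System.out.println(rounded + " " + rounded.getClass().getSimpleName()); // 1235.0 Double
    System.out.println(raw + " " + raw.getClass().getSimpleName());         // 1234.56 Double
  }
}
```

If a `Long` result were wanted for the rounded case, the two branches would have to be boxed separately, for example with an `if` statement instead of the conditional operator.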
@@ -43,6 +43,7 @@
 */
public class HllSketchToEstimateWithBoundsPostAggregator implements PostAggregator
{
  public static final int DEFAULT_NUM_STD_DEVS = 1;

  private final String name;
  private final PostAggregator field;
@@ -57,7 +58,7 @@ public HllSketchToEstimateWithBoundsPostAggregator(
  {
    this.name = name;
    this.field = field;
    this.numStdDevs = numStdDevs == null ? DEFAULT_NUM_STD_DEVS : numStdDevs;
  }

  @Override