
SQL timeseries no longer skip empty buckets with all granularity #11188

Merged

Conversation

@clintropolis clintropolis commented May 2, 2021

Fixes #11145 (maybe others since this has been an issue for quite a while now, but I couldn't find any while searching)

Description

This PR modifies the Druid SQL planner so that it no longer skips empty buckets when it generates a timeseries query with 'all' granularity, which should produce more SQL-compatible query results.

In SQL, when running a query with only aggregators, such as SELECT COUNT(*), SUM(x) FROM y WHERE z = 'someval', the aggregation is done as if there is a single universal grouping key, so this example query should return a row even when no values match 'someval', producing [0, null] in this case.

We plan queries like this example into a Druid native timeseries query, and prior to this patch we would always set the skipEmptyBuckets query context parameter to 'true'. A timeseries query with a granularity other than 'ALL' is equivalent to grouping by a time floor expression, so removing empty results is correct in that case. But with 'ALL' granularity, as in the example query, there is no time floor expression to group on, so this setting isn't appropriate: it makes Druid return empty results instead of the initialized aggregator values for the universal group, which is what it should correctly return.

I ran into this issue again when trying to write a test for #11157 to verify that something like SELECT ARRAY_AGG(x) FROM y WHERE z = 'missing value' produces a single [null] result, but then remembered that I couldn't write that test because of this issue 😅.

The actual changes are small: ~5 lines in DruidQuery that, when constructing a timeseries query, check for 'ALL' granularity before setting the 'skipEmptyBuckets' context parameter. Everything else is test changes.
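
For illustration, here is a minimal sketch of the shape of that check. It is not the exact diff: the helper method, parameter names, and the exact merge order of the user context are assumptions based on the description above.

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.druid.java.util.common.granularity.Granularities;
import org.apache.druid.java.util.common.granularity.Granularity;
import org.apache.druid.query.timeseries.TimeseriesQuery;

public final class TimeseriesContextSketch
{
  // Sketch of the context-building step in DruidQuery when planning a
  // timeseries query: only non-ALL granularities bucket by time, so only
  // they should skip empty buckets.
  public static Map<String, Object> buildTimeseriesContext(
      final Granularity queryGranularity,
      final Map<String, Object> userContext
  )
  {
    final Map<String, Object> theContext = new HashMap<>();
    if (!Granularities.ALL.equals(queryGranularity)) {
      // Non-ALL granularity is equivalent to grouping on a time floor
      // expression, so skipping empty buckets remains correct there.
      theContext.put(TimeseriesQuery.SKIP_EMPTY_BUCKETS, true);
    }
    // The user-provided context is applied last, so explicitly setting
    // skipEmptyBuckets overrules the planner's choice (as noted below).
    theContext.putAll(userContext);
    return theContext;
  }
}
```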

I went through and added tests for, I think, almost all of our SQL aggregator functions to capture their current outputs in both default mode and SQL-compatible null handling mode for this empty timeseries result case. I got a bit carried away refactoring these tests to be simplified and extend BaseCalciteQueryTest, which is why the PR is sort of large (but it is a nice improvement for all of these SQL tests).

I'm not actually sure all of these aggregator functions behave correctly, notably some of the quantiles sketches which return Double.NaN in both modes, and some other sketches which return an empty sketch in both modes, but I'll save changing the behavior of any of these aggregators for another day.

This changes the default behavior of query results: SQL queries which plan to timeseries queries with ALL granularity will now always produce a single result row instead of nothing. The old behavior can be retained by explicitly setting skipEmptyBuckets to true, which overrules the planner behavior.

Key changed/added classes in this PR
  • DruidQuery.java

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • added comments explaining the "why" and the intent of the code wherever it would not be obvious to an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

"(\"m1\" * 2)",
ValueType.FLOAT,
TestExprMacroTable.INSTANCE
cannotVectorize();
clintropolis (Member, Author):

I think it is a bug that this cannot vectorize, btw. The aggregator's canVectorize method takes a ColumnInspector, but (in TimeseriesQueryEngine at least) this is the segment adapter, not wrapped with the virtual columns, so it does not find column capabilities for v0, since v0 doesn't exist on the actual segment, and so it reports that it is not able to vectorize. The engine itself will confirm that v0 can vectorize, so the aggregator doesn't need to worry about that, I think, but we might want a way to wrap the adapter with a new VirtualizedColumnInspector or something (think VirtualColumns.wrap, but for a ColumnInspector instead of a ColumnSelectorFactory) so that the column capabilities of the virtual columns are available to the aggregator to help it decide whether or not it can vectorize.
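
To make that concrete, here is a hypothetical sketch of such a wrapper. The class name comes from the comment above, and the VirtualColumns method names are assumptions that may differ by Druid version.

```java
import javax.annotation.Nullable;
import org.apache.druid.segment.ColumnInspector;
import org.apache.druid.segment.VirtualColumns;
import org.apache.druid.segment.column.ColumnCapabilities;

// Hypothetical ColumnInspector that consults the virtual columns first,
// falling back to the underlying segment inspector, so aggregators can see
// capabilities for columns like "v0" that only exist virtually.
public class VirtualizedColumnInspector implements ColumnInspector
{
  private final ColumnInspector baseInspector;
  private final VirtualColumns virtualColumns;

  public VirtualizedColumnInspector(
      final ColumnInspector baseInspector,
      final VirtualColumns virtualColumns
  )
  {
    this.baseInspector = baseInspector;
    this.virtualColumns = virtualColumns;
  }

  @Nullable
  @Override
  public ColumnCapabilities getColumnCapabilities(String column)
  {
    if (virtualColumns.exists(column)) {
      // Capabilities of the virtual column itself (assumed API).
      return virtualColumns.getColumnCapabilities(baseInspector, column);
    }
    return baseInspector.getColumnCapabilities(column);
  }
}
```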


clintropolis commented May 3, 2021

If #11157 goes in before this PR, this PR should be updated to add ARRAY_AGG to the non-vectorized built-in aggs test case. If #11182 is merged before this PR, that test case should be moved to the 'vectorized' built-in aggs test case. Likewise, if this PR is merged first, those PRs should make the corresponding changes themselves.

|`COUNT(DISTINCT expr)`|Counts distinct values of expr, which can be string, numeric, or hyperUnique. By default this is approximate, using a variant of [HyperLogLog](http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf). To get exact counts set "useApproximateCountDistinct" to "false". If you do this, expr must be string or numeric, since exact counts are not possible using hyperUnique columns. See also `APPROX_COUNT_DISTINCT(expr)`. In exact mode, only one distinct count per query is permitted unless `useGroupingSetForExactDistinct` is set to true in query contexts or broker configurations.|`0`|
|`SUM(expr)`|Sums numbers.|`0` in 'default' mode, `null` in SQL compatible mode|
|`MIN(expr)`|Takes the minimum of numbers.|`Long.MAX_VALUE` in 'default' mode, `null` in SQL compatible mode|
|`MAX(expr)`|Takes the maximum of numbers.|`Long.MIN_VALUE` in 'default' mode, `null` in SQL compatible mode|
Contributor:

We should spell out what the value is. Druid users aren't necessarily going to be familiar with Java, so we shouldn't reference Java programming concepts.

|--------|-----|-------|
|`COUNT(*)`|Counts the number of rows.|`0`|
|`COUNT(DISTINCT expr)`|Counts distinct values of expr, which can be string, numeric, or hyperUnique. By default this is approximate, using a variant of [HyperLogLog](http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf). To get exact counts set "useApproximateCountDistinct" to "false". If you do this, expr must be string or numeric, since exact counts are not possible using hyperUnique columns. See also `APPROX_COUNT_DISTINCT(expr)`. In exact mode, only one distinct count per query is permitted unless `useGroupingSetForExactDistinct` is set to true in query contexts or broker configurations.|`0`|
|`SUM(expr)`|Sums numbers.|`0` in 'default' mode, `null` in SQL compatible mode|
Contributor:

Hmm, I think we should consider using different phrases for these modes. For some reason, "default mode" doesn't evoke the proper concept for me. Maybe this instead:

null when druid.generic.useDefaultValueForNull = false, 0 when true

clintropolis (Member, Author):

Changed, and also flipped so that druid.generic.useDefaultValueForNull=false is described first, since that behavior is more sane on a page about SQL. Though I hope putting it first isn't too confusing to readers, since it isn't yet the default behavior.

|`APPROX_COUNT_DISTINCT_DS_HLL(expr, [lgK, tgtHllType])`|Counts distinct values of expr, which can be a regular column or an [HLL sketch](../development/extensions-core/datasketches-hll.md) column. The `lgK` and `tgtHllType` parameters are described in the HLL sketch documentation. This is always approximate, regardless of the value of "useApproximateCountDistinct". See also `COUNT(DISTINCT expr)`. The [DataSketches extension](../development/extensions-core/datasketches-extension.md) must be loaded to use this function.|`0`|
|`APPROX_COUNT_DISTINCT_DS_THETA(expr, [size])`|Counts distinct values of expr, which can be a regular column or a [Theta sketch](../development/extensions-core/datasketches-theta.md) column. The `size` parameter is described in the Theta sketch documentation. This is always approximate, regardless of the value of "useApproximateCountDistinct". See also `COUNT(DISTINCT expr)`. The [DataSketches extension](../development/extensions-core/datasketches-extension.md) must be loaded to use this function.|`0`|
|`DS_HLL(expr, [lgK, tgtHllType])`|Creates an [HLL sketch](../development/extensions-core/datasketches-hll.md) on the values of expr, which can be a regular column or a column containing HLL sketches. The `lgK` and `tgtHllType` parameters are described in the HLL sketch documentation. The [DataSketches extension](../development/extensions-core/datasketches-extension.md) must be loaded to use this function.|`'0'` (STRING)|
|`DS_THETA(expr, [size])`|Creates a [Theta sketch](../development/extensions-core/datasketches-theta.md) on the values of expr, which can be a regular column or a column containing Theta sketches. The `size` parameter is described in the Theta sketch documentation. The [DataSketches extension](../development/extensions-core/datasketches-extension.md) must be loaded to use this function.|`'0.0'` (STRING)|
Contributor:

Err, is this good? Why does it return a string instead of a theta sketch?

clintropolis (Member, Author):

Hmm, it actually returns a double, but we don't examine the finalized type, so Calcite thinks it is complex, which I guess is how it ends up as a string instead of a double, because it just tries to serialize complex values (and I assume the same is true of the other sketches that return a string result). The Theta sketch aggregator, along with many of the quantiles, finalizes its sketch by default, producing the estimate, instead of serializing the sketch like some of the other complex types do. There is a bit more discussion about it in #9419, but it doesn't really look like it went anywhere.

I guess this also brings up the question of whether we need to describe the difference between intermediary types and finalized types here. Since this table currently shows finalized types, I called it the default value, because that is what people will see, rather than the initialized value of the aggregator, which for all of these complex types would be an empty sketch.

Contributor:

> Hmm, it actually returns a double, but we don't examine the finalized type so calcite thinks it is complex which is I guess how it ends up as a string instead of double because it just tries to serialize complex values (and I assume the same is true of the other sketches that return a string result).

Hmm, weird, but, OK. We might want to change it to something saner later, but we don't have to do that right now.

> I guess this also brings up the question of if we need to describe the difference between intermediary types and finalized types here

I think the way you did it is right.

I don't think we need to describe the intermediary / finalized type difference in user-facing documentation. That should be an internal detail. IMO, if there are cases where users might benefit from seeing the non-finalized values, it'd be better to expose them through postaggregators that have well-defined semantics (like sketch_bytes_as_base64 or something).

|`APPROX_COUNT_DISTINCT_DS_THETA(expr, [size])`|Counts distinct values of expr, which can be a regular column or a [Theta sketch](../development/extensions-core/datasketches-theta.md) column. The `size` parameter is described in the Theta sketch documentation. This is always approximate, regardless of the value of "useApproximateCountDistinct". See also `COUNT(DISTINCT expr)`. The [DataSketches extension](../development/extensions-core/datasketches-extension.md) must be loaded to use this function.|`0`|
|`DS_HLL(expr, [lgK, tgtHllType])`|Creates an [HLL sketch](../development/extensions-core/datasketches-hll.md) on the values of expr, which can be a regular column or a column containing HLL sketches. The `lgK` and `tgtHllType` parameters are described in the HLL sketch documentation. The [DataSketches extension](../development/extensions-core/datasketches-extension.md) must be loaded to use this function.|`'0'` (STRING)|
|`DS_THETA(expr, [size])`|Creates a [Theta sketch](../development/extensions-core/datasketches-theta.md) on the values of expr, which can be a regular column or a column containing Theta sketches. The `size` parameter is described in the Theta sketch documentation. The [DataSketches extension](../development/extensions-core/datasketches-extension.md) must be loaded to use this function.|`'0.0'` (STRING)|
|`APPROX_QUANTILE(expr, probability, [resolution])`|Computes approximate quantiles on numeric or [approxHistogram](../development/extensions-core/approximate-histograms.md#approximate-histogram-aggregator) exprs. The "probability" should be between 0 and 1 (exclusive). The "resolution" is the number of centroids to use for the computation. Higher resolutions will give more precise results but also have higher overhead. If not provided, the default resolution is 50. The [approximate histogram extension](../development/extensions-core/approximate-histograms.md) must be loaded to use this function.|`Double.NaN`|
Contributor:

Same comment as above: we shouldn't reference Java concepts. Let's just go with NaN.

|`ANY_VALUE(expr)`|Returns any value of `expr` including null. `expr` must be numeric. This aggregator can simplify and optimize performance by returning the first encountered value (including null).|
|`ANY_VALUE(expr, maxBytesPerString)`|Like `ANY_VALUE(expr)`, but for strings. The `maxBytesPerString` parameter determines how much aggregation space to allocate per string. Strings longer than this limit will be truncated. This parameter should be set as low as possible, since high values will lead to wasted memory.|
|`GROUPING(expr, expr...)`|Returns a number to indicate which groupBy dimension is included in a row, when using `GROUPING SETS`. Refer to [additional documentation](aggregations.md#grouping-aggregator) on how to infer this number.|
|Function|Notes|Default|
Contributor:

I hate to ask, but have you double-checked that for all these aggregators, the three implementations return the same values in the two null-handling modes? There are six cases to check: {aggregator, buffer aggregator, vector aggregator} × {SQL-compliant, replace-with-default}.

Contributor:

Or does it not matter because we'll always use one of the impls and never use the other two? (If that's the case, I suggest adding a comment somewhere discussing that, perhaps in TimeseriesQueryQueryToolChest or in the javadocs for the aggregators themselves.)

clintropolis (Member, Author):

I did eyeball it, but that sounds lame when I say it out loud. I guess the added unit tests will cover heap and vector aggregators for the vectorized aggs, so at least those will be confirmed to match, but the ones which cannot be vectorized only test heap aggs.

I'll add a similar group by query test that uses filtered aggs to match nothing to cover the buffer agg path to make sure everything matches up, which seems pretty useful to have anyway.

clintropolis (Member, Author):

Added near-duplicates of the timeseries tests using group by queries with filtered aggregations that match nothing, to get additional coverage on buffer aggregators, and luckily everything matched up.
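
As a tiny standalone illustration of the parity these tests check (the aggregator class names and constructors here are from memory and should be treated as assumptions), a heap aggregator and a buffer aggregator that never see a row should report the same initialized value:

```java
import java.nio.ByteBuffer;
import org.apache.druid.query.aggregation.CountAggregator;
import org.apache.druid.query.aggregation.CountBufferAggregator;

// When no rows are aggregated, the heap and buffer implementations of an
// aggregator should report the same initialized value; this is the property
// the filtered-aggregation group by tests exercise across all aggregators.
public class InitializedValueParitySketch
{
  public static void main(String[] args)
  {
    final CountAggregator onHeap = new CountAggregator();

    final CountBufferAggregator buffered = new CountBufferAggregator();
    final ByteBuffer buf = ByteBuffer.allocate(Long.BYTES);
    buffered.init(buf, 0);

    // Neither aggregator has seen a row; both should report 0.
    System.out.println(onHeap.getLong() + " == " + buffered.getLong(buf, 0));
  }
}
```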


gianm commented May 3, 2021

The concept LGTM (I'm pleasantly surprised it seems so easy); my comments at this point are all about the docs.

|`COUNT(DISTINCT expr)`|Counts distinct values of expr, which can be string, numeric, or hyperUnique. By default this is approximate, using a variant of [HyperLogLog](http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf). To get exact counts set "useApproximateCountDistinct" to "false". If you do this, expr must be string or numeric, since exact counts are not possible using hyperUnique columns. See also `APPROX_COUNT_DISTINCT(expr)`. In exact mode, only one distinct count per query is permitted unless `useGroupingSetForExactDistinct` is set to true in query contexts or broker configurations.|`0`|
|`SUM(expr)`|Sums numbers.|`null` if `druid.generic.useDefaultValueForNull=false`, otherwise `0`|
|`MIN(expr)`|Takes the minimum of numbers.|`null` if `druid.generic.useDefaultValueForNull=false`, otherwise `9223372036854775807` (maximum LONG value)|
|`MAX(expr)`|Takes the maximum of numbers.|`0` in 'default' mode, `null` in SQL compatible mode|
Contributor:

Missed this one?


@@ -313,48 +313,51 @@ possible for two aggregators in the same SQL query to have different filters.

Only the COUNT and ARRAY_AGG aggregations can accept DISTINCT.

When no rows are selected, aggregate functions will return their initialized value for the grouping they belong to. What this value is exactly for a given aggregator is dependent on the configuration of Druid's SQL compatible null handling mode, controlled by `druid.generic.useDefaultValueForNull`. The table below defines the initial values for all aggregate functions in both modes.
Contributor:

Hmm this paragraph reads oddly to me for two reasons:

  • At first blush, I think a typical user would think it's impossible for groups to exist that do not have any rows. (It isn't typical SQLy behavior.) So we should list some examples of when the default value will show up. I can think of two cases: grand total (aggregations with no group by) and filtered aggregators where the filter does not match any rows within the group.
  • Not all aggregators have behavior dependent on druid.generic.useDefaultValueForNull, so it's not technically correct to say this categorically. I don't think we need to mention druid.generic.useDefaultValueForNull at all here, actually, because the individual aggregators in the table below call it out when appropriate. Or, if we do mention it, we could just say that it "may depend on" rather than "is dependent on".

Welcome to opinions from others about how to express this most clearly.


gianm commented May 6, 2021

@clintropolis If you want to merge this patch and then update the docs in a followup, that's fine by me.

@abhishekagarwal87 (Contributor) left a comment:

LGTM. Thank you @clintropolis for adding these tests and documentation for these aggregators.

Comment on lines 304 to 305
and querying, so for correct behavior, it should be set on all Druid service types to be available at both ingestion time and query time. There is some overhead
associated with the ability to handle NULLs; see the [segment internals](../design/segments.md#sql-compatible-null-handling)
Contributor:

Suggested change:
- and querying, so for correct behavior, it should be set on all Druid service types to be available at both ingestion time and query time. There is some overhead
- associated with the ability to handle NULLs; see the [segment internals](../design/segments.md#sql-compatible-null-handling)
+ and querying, so for correct behavior, it should be set on all Druid service types to be available at both ingestion time and query time. There is some overhead associated with the ability to handle NULLs;
+ see the [segment internals](../design/segments.md#sql-compatible-null-handling)

clintropolis (Member, Author):

Oops, I missed this one earlier. Is it just line break changes? Since line breaks in this file are already sort of a mess, I'm going to hold off on adding this so I don't have to churn through CI again, unless I have to make any more changes, in which case I will add it.

clintropolis (Member, Author):

Oof, I do have to make changes, because changes in master have caused soft conflicts leading to a compilation failure, since TIMESERIES_CONTEXT_DEFAULT is no longer a thing after this PR...

clintropolis (Member, Author):

Hmm, we're going to need to watch out for this for a while in any PR that adds SQL tests using this context parameter.

@clintropolis clintropolis merged commit 691d7a1 into apache:master May 10, 2021
@clintropolis clintropolis deleted the sql-timeseries-dont-skip-for-all-gran branch May 10, 2021 17:13
@clintropolis clintropolis added this to the 0.22.0 milestone Aug 12, 2021
abhishekagarwal87 pushed a commit that referenced this pull request Jan 7, 2022
…ery (#12065)

Related to #11188

The above-mentioned PR allowed timeseries queries to return a default result when queries of the type select count(*) from table where dim1="_not_present_dim_" were executed. Before that PR, such a query returned no row; after it, it returns a row with the value of count(*) as 0 (as expected by the SQL standards of different DBs).

In Grouping#applyProject, we can sometimes optimize a groupBy query into a timeseries query when the keys of the groupBy are constants (as generated by automated tools). For example, in select count(*) from table where dim1="_present_dim_" group by "dummy_key", the groupBy clause can be removed. However, when the filter doesn't match anything, i.e. select count(*) from table where dim1="_not_present_dim_" group by "dummy_key", the behavior of general databases would be to return nothing, while Druid (due to the above change) returns an empty row. This PR aims to fix this divergence of behavior.

Example cases:

select count(*) from table where dim1="_not_present_dim_" group by "dummy_key".
CURRENT: Returns a row with count(*) = 0
EXPECTED: Return no row

select 'A', dim1 from foo where m1 = 123123 and dim1 = '_not_present_again_' group by dim1
CURRENT: Returns a row with ('A', 'wat')
EXPECTED: Return no row

To do this, a boolean droppedDimensionsWhileApplyingProject has been added to Grouping, which is true whenever the optimization changes the original shape of the query. Hence, if a timeseries query has a grouping with this flag set to true, we set skipEmptyBuckets=true in the query context (i.e., do not return any row).
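
As a rough sketch of that check, extending the earlier buildTimeseriesContext sketch (assumed names and shape, not the exact diff):

```java
// Same imports as the earlier sketch. The droppedDimensionsWhileApplyingProject
// flag comes from Grouping, per the description above.
public static Map<String, Object> buildTimeseriesContext(
    final Granularity queryGranularity,
    final boolean droppedDimensionsWhileApplyingProject,
    final Map<String, Object> userContext
)
{
  final Map<String, Object> theContext = new HashMap<>();
  // Skip empty buckets when bucketing by time, or when the query began as a
  // GROUP BY whose constant dimensions were dropped: SQL semantics say an
  // empty match for a real GROUP BY should yield zero rows, not one row.
  if (!Granularities.ALL.equals(queryGranularity) || droppedDimensionsWhileApplyingProject) {
    theContext.put(TimeseriesQuery.SKIP_EMPTY_BUCKETS, true);
  }
  theContext.putAll(userContext);
  return theContext;
}
```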
Successfully merging this pull request may close these issues.

sql count(*) returns no results instead of 0 when no rows meet filter