Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SQL: Support SQL folding of GROUP BY Constant #84574

Open
wants to merge 9 commits into
base: main
Choose a base branch
from

Conversation

luigidellaquila
Copy link
Contributor

@luigidellaquila luigidellaquila commented Mar 2, 2022

Allows to avoid pruning of constant groupings from the query plan, that could lead to too aggressive optimizations, like:

  • completely removing the grouping, that leads to different query semantics (ie. duplicates)
  • doing wrong assumptions about the possibility to perform local search (eg. getting results also when the target index is empty)

This PR overlaps with #84510 but it's more general as a fix. Though #84510, on some specific cases, can have a significant performance impact, so it can still be considered.

Fixes #84490
Fixes #74064

@luigidellaquila luigidellaquila added >bug :Analytics/SQL SQL querying Team:QL (Deprecated) Meta label for query languages team labels Mar 2, 2022
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-ql (Team:QL)

@elasticsearchmachine
Copy link
Collaborator

Hi @luigidellaquila, I've created a changelog YAML for you.

@elasticsearchmachine
Copy link
Collaborator

Hi @luigidellaquila, I've updated the changelog YAML for you.

@@ -423,6 +423,8 @@ else if (value instanceof IntervalDayTime
else {
throw new SqlIllegalArgumentException("Cannot GROUP BY function {}", exp);
}
} else if (exp instanceof Literal) {
key = new GroupByValue(aggId, QueryTranslator.nameOf(exp));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit surprised that this actually works but apparently the terms aggregation just ignores fields that do not exist and creates a null bucket. I'm not sure though wether this might not introduce other bugs. E.g. if QueryTranslator.nameOf(exp) evaluates to an actual field name. Could it be an option to have a GroupByConstant for this case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GroupByConstant was my first idea as well (I also implemented it with the exact same name at first), but then it ended up being the exact copy of GroupByValue.

The fact that also scalar functions use GroupByValue (passing the function as a script, see the other constructor) is another pointer that convinced me to use it.

Unless there is some practical differentiation, probably it can be avoided (anyway, I don't have a strong opinion on that).

About the clash with field names, I reasoned a bit about that and did some tests.

On a person (name, surname) schema, the following are possible clashes

SELECT 1 as name FROM person GROUP BY name; // the alias is the same as a field name 

That is interpreted correctly

SELECT 1 FROM person GROUP BY 'name'; // the value of the literal is the same as a field name

Here the field name in the terms is 'name', so there is no clash with the field name.

Still, I just realised that you could define a field named 'name'(!!!). In this case, with the same query you would get an error:
Text fields are not optimised for operations that require per-document field data like aggregations and sorting, so these operations are disabled by default. Please use a keyword field instead. Alternatively, set fielddata=true on ['name'] in order to load field data by uninverting the inverted index. Note that this can use significant memory.

So I think I'll have to specialise the behaviour

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other cases that come to my mind are fields named null, true or false (I think ES does also not stop you from doing this).

@@ -279,7 +279,8 @@ protected LogicalPlan rule(Aggregate agg) {
}
}

if (prunedGroupings.size() > 0) {
// avoid complete removal of the grouping, that could change the query semantics
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// avoid complete removal of the grouping, that could change the query semantics
// avoid complete removal of the grouping because it would turn the aggregation into an implicit grouping and change the semantics

I think the wording can be a bit stronger here because, as far as I understand, it always changes the semantics.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does not always change the semantics actually. If there are aggregate projections (ie. implicit groupings), the semantics do not change

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case there wouldn't be any groupings to remove though?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case the explicit groupings (constants), if present, can be safely removed.

Copy link
Contributor

@astefan astefan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this approach is correct. One reason is the one @Luegg mentioned, the other is that it's an overkill solution (running an aggregation when one is not actually needed).

@luigidellaquila
Copy link
Contributor Author

it's an overkill solution (running an aggregation when one is not actually needed).

The alternative is something like #84510, but:

  • alone it doesn't cover all the cases
  • this kind of optimisations is extremely error prone

Probably the good solution is a mix of the two, ie. support the constant aggregation properly, addressing the issue pointed out by @Luegg, AND reduce the cost of the operation with a LIMIT 1.

Any further feedback is welcome

@Luegg
Copy link
Contributor

Luegg commented Mar 3, 2022

Did you also explore solutions around making the behavior in the optimizer and query folder analog to what's happening for implicit aggregations and only differentiate the behavior during query execution (see Querier)? We already have the query logic for the implicit aggregation in place and it should be equivalent for the group by constant case.

Update: I just realized that this leaves out possible optimizations...

@luigidellaquila
Copy link
Contributor Author

Added some improvements to the solution.
As suggested by @Luegg, the approach is very similar to implicit aggregations, but it still needed some specialisation.

@luigidellaquila
Copy link
Contributor Author

@elasticmachine update branch

@luigidellaquila luigidellaquila marked this pull request as ready for review March 7, 2022 16:16
@elasticsearchmachine elasticsearchmachine changed the base branch from master to main July 22, 2022 23:08
@mark-vieira mark-vieira added v8.5.0 and removed v8.4.0 labels Jul 27, 2022
@csoulios csoulios added v8.6.0 and removed v8.5.0 labels Sep 21, 2022
@kingherc kingherc added v8.7.0 and removed v8.6.0 labels Nov 16, 2022
@rjernst rjernst added v8.8.0 and removed v8.7.0 labels Feb 8, 2023
@gmarouli gmarouli added v8.9.0 and removed v8.8.0 labels Apr 26, 2023
@quux00 quux00 added v8.11.0 and removed v8.10.0 labels Aug 16, 2023
@mattc58 mattc58 added v8.12.0 and removed v8.11.0 labels Oct 4, 2023
@elasticsearchmachine elasticsearchmachine added v8.14.0 Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) and removed v8.13.0 labels Feb 14, 2024
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-analytical-engine (Team:Analytics)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Analytics/SQL SQL querying >bug Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) Team:QL (Deprecated) Meta label for query languages team v9.0.0
Projects
None yet