-
Notifications
You must be signed in to change notification settings - Fork 24.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SQL: Support SQL folding of GROUP BY Constant #84574
base: main
Are you sure you want to change the base?
SQL: Support SQL folding of GROUP BY Constant #84574
Conversation
Pinging @elastic/es-ql (Team:QL) |
Hi @luigidellaquila, I've created a changelog YAML for you. |
Hi @luigidellaquila, I've updated the changelog YAML for you. |
@@ -423,6 +423,8 @@ else if (value instanceof IntervalDayTime | |||
else { | |||
throw new SqlIllegalArgumentException("Cannot GROUP BY function {}", exp); | |||
} | |||
} else if (exp instanceof Literal) { | |||
key = new GroupByValue(aggId, QueryTranslator.nameOf(exp)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit surprised that this actually works but apparently the terms aggregation just ignores fields that do not exist and creates a null
bucket. I'm not sure though wether this might not introduce other bugs. E.g. if QueryTranslator.nameOf(exp)
evaluates to an actual field name. Could it be an option to have a GroupByConstant
for this case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
GroupByConstant
was my first idea as well (I also implemented it with the exact same name at first), but then it ended up being the exact copy of GroupByValue.
The fact that also scalar functions use GroupByValue (passing the function as a script, see the other constructor) is another pointer that convinced me to use it.
Unless there is some practical differentiation, probably it can be avoided (anyway, I don't have a strong opinion on that).
About the clash with field names, I reasoned a bit about that and did some tests.
On a person (name, surname)
schema, the following are possible clashes
SELECT 1 as name FROM person GROUP BY name; // the alias is the same as a field name
That is interpreted correctly
SELECT 1 FROM person GROUP BY 'name'; // the value of the literal is the same as a field name
Here the field name in the terms is 'name'
, so there is no clash with the field name
.
Still, I just realised that you could define a field named 'name'
(!!!). In this case, with the same query you would get an error:
Text fields are not optimised for operations that require per-document field data like aggregations and sorting, so these operations are disabled by default. Please use a keyword field instead. Alternatively, set fielddata=true on ['name'] in order to load field data by uninverting the inverted index. Note that this can use significant memory
.
So I think I'll have to specialise the behaviour
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Other cases that come to my mind are fields named null
, true
or false
(I think ES does also not stop you from doing this).
@@ -279,7 +279,8 @@ protected LogicalPlan rule(Aggregate agg) { | |||
} | |||
} | |||
|
|||
if (prunedGroupings.size() > 0) { | |||
// avoid complete removal of the grouping, that could change the query semantics |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// avoid complete removal of the grouping, that could change the query semantics | |
// avoid complete removal of the grouping because it would turn the aggregation into an implicit grouping and change the semantics |
I think the wording can be a bit stronger here because, as far as I understand, it always changes the semantics.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does not always change the semantics actually. If there are aggregate projections (ie. implicit groupings), the semantics do not change
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In that case there wouldn't be any groupings to remove though?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In that case the explicit groupings (constants), if present, can be safely removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this approach is correct. One reason is the one @Luegg mentioned, the other is that it's an overkill solution (running an aggregation when one is not actually needed).
The alternative is something like #84510, but:
Probably the good solution is a mix of the two, ie. support the constant aggregation properly, addressing the issue pointed out by @Luegg, AND reduce the cost of the operation with a LIMIT 1. Any further feedback is welcome |
Did you also explore solutions around making the behavior in the optimizer and query folder analog to what's happening for implicit aggregations and only differentiate the behavior during query execution (see Update: I just realized that this leaves out possible optimizations... |
…l_folder' into group_by_constant_in_sql_folder
Added some improvements to the solution. |
@elasticmachine update branch |
…l_folder' into group_by_constant_in_sql_folder
Pinging @elastic/es-analytical-engine (Team:Analytics) |
Allows to avoid pruning of constant groupings from the query plan, that could lead to too aggressive optimizations, like:
This PR overlaps with #84510 but it's more general as a fix. Though #84510, on some specific cases, can have a significant performance impact, so it can still be considered.
Fixes #84490
Fixes #74064