-
Notifications
You must be signed in to change notification settings - Fork 24.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ESQL: Fix synthetic attribute pruning #111413
ESQL: Fix synthetic attribute pruning #111413
Conversation
} else { | ||
AttributeSet childOutput = currentPlanNode.inputSet(); | ||
AttributeSet addedAttributes = currentPlanNode.outputSet().subtract(childOutput); | ||
requiredAttributes.set(requiredAttributes.get().subtract(addedAttributes).combine(currentPlanNode.requiredInputSet())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The main change; the way we collected all attributes that occurred and then removed generated attributes is fully abstracted away here.
// skip synthetically added attributes (the ones from AVG), see LogicalPlanOptimizer.SubstituteSurrogates | ||
if (attr.synthetic() == false && aliases.containsKey(attr) == false) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Synthetic attributes were skipped because we had to solve an NPE; IMHO this was not a correct long term fix, as it made heavy assumptions on where synthetic attributes are used.
x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/plan/physical/AggregateExec.java
Show resolved
Hide resolved
Hi @alex-spies, I've created a changelog YAML for you. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is ok imo. I am curious to see how this preliminary step will help with synthetic attributes usage in other places (mainly union types).
@@ -27,6 +28,11 @@ public List<NamedExpression> removals() { | |||
return removals; | |||
} | |||
|
|||
@Override | |||
public AttributeSet requiredInputSet() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't drop
different here? It's usage is to capture what the query wants to remove from projections and, in essence, a drop
doesn't "live" too long, being transformed in a projection in the Analyzer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True. Drop
essentially is not a real logical plan node, it's an AST node. The same is true for Lookup
and Keep
.
We can have this throw UnsupportedOperationException
instead; although I think returning references()
is correct, too; this contains the UnresolvedAttribute
s that this is trying to drop, excluding any wildcards.
x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/analysis/Analyzer.java
Show resolved
Hide resolved
x-pack/plugin/esql-core/src/main/java/org/elasticsearch/xpack/esql/core/plan/QueryPlan.java
Outdated
Show resolved
Hide resolved
Minor suggestion: just for debugging purposes, to see the in trace logging which attributes in the plan are also synthetic. Maybe something like |
I turned attributes synthetic that were meant to be synthetic in this commit. For union types specifically, we turn each expression like |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks pretty good - thanks for incorporating the feedback and left another round.
...gin/esql-core/src/main/java/org/elasticsearch/xpack/esql/core/expression/FieldAttribute.java
Outdated
Show resolved
Hide resolved
...k/plugin/esql-core/src/main/java/org/elasticsearch/xpack/esql/core/expression/Attribute.java
Outdated
Show resolved
Hide resolved
.../plugin/esql/src/main/java/org/elasticsearch/xpack/esql/optimizer/PhysicalPlanOptimizer.java
Show resolved
Hide resolved
} | ||
return lazyReferences; | ||
} | ||
|
||
protected abstract AttributeSet computeReferences(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make the default implementation Expressions.references(expressions())
like before since it applies to most implementations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would actually prefer not to give this a default implementation.
As long as we add new commands, I'd prefer each plan node to be forced to explicitly declare what it requires from its children. This is more verbose, but consistent with output()
- and the fact that references
had the mentioned default implementation led to many plan nodes having wrong implementations, and we built a bunch of workarounds to account for that.
Case in point: when the Join
node was first added, its references
implementation was wrong and needed fixing, but this was not immediately apparent at all.
Let me know how you'd like us to proceed. I'm fine with either way, but wanted to lay out my reasoning first.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tl;dr - I've read your arguments however I'd prefer to use a default implementations, similar to inputSet, etc...
references and output are different.
The input of each set is well determined - the output set of its children. The same patter makes sense for references(), expressions() etc.. Not so with the output since each input had its own particularities.
As you point out, the issue was a bug in the default implementation - better to fix that in one place instead of having multiple places that do the same thing and potentially replicate the bug.
At the end of the day, it's an arbitrary choice. The style of the code-base is trustful towards the author, that is non-defensive: (for example collections are not copied for input/returns by default), method calls don't perform null checks for every params, etc...
Hence why I opt towards a default implementation that is useful and guiding - adding a new command requires the author to know what they're doing to begin with so I'd use this mindset, at least at this stage of the project.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As you point out, the issue was a bug in the default implementation - better to fix that in one place instead of having multiple places that do the same thing and potentially replicate the bug.
The problem was having a default implementation. The default implementation just didn't apply for most query plans but we kept using it. The default implementation was exactly Expressions.references(expressions())
, that's why I'm hesitant to revert back to how it was before.
But it's probably fine either way and I'll revert it to using the default impl. Since we now actually rely on correct references()
implementations in ProjectAwayColumns
and the dependency checkers, there's now a higher chance to notice that the default implementation needed to be overridden if it didn't apply.
x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/plan/logical/Aggregate.java
Outdated
Show resolved
Hide resolved
public static AttributeSet requiredAttributesFromChild(List<Alias> fields) { | ||
AttributeSet generated = new AttributeSet(asAttributes(fields)); | ||
return Expressions.references(fields).subtract(generated); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same remark - is this used elsewhere externally?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, in EvalExec
!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason why the methods across Eval and Aggregate don't have the same name? Something like doComputeReferences() or determineReferences(), etc... required is a misnomer in this context.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll just overload computeReferences
(and output
for Aggregate
), that should be nice and consistent.
...ql/src/test/java/org/elasticsearch/xpack/esql/optimizer/LocalPhysicalPlanOptimizerTests.java
Outdated
Show resolved
Hide resolved
...in/esql/src/test/java/org/elasticsearch/xpack/esql/optimizer/PhysicalPlanOptimizerTests.java
Show resolved
Hide resolved
...gin/esql-core/src/main/java/org/elasticsearch/xpack/esql/core/expression/FieldAttribute.java
Outdated
Show resolved
Hide resolved
@Override | ||
public AttributeSet references() { | ||
AttributeSet refs = super.references(); | ||
if (indexMode == IndexMode.TIME_SERIES) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why did you make the move to EsqlSession? Was the fieldNames
the only place where metrics specific metadata attributes were needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think fieldNames
is the only place where we would somehow call UnresolvedRelation.references()
and do something with the output in case of time series indices; after the first analysis steps, there's not supposed to be any UnresolvedRelation
s anymore.
Removing this code altogether makes the tests in k8s-metrics
fail, whereas it currently runs green; also, the only place in the whole (non-test) code base where IndexMode.TIME_SERIES
is set on an UnresolvedRelation
is LogicalPlanBuilder.visitMetricsCommand
.
I'll double-check if moving this code to EsqlSession.fieldNames
is the best course of action; if so, I'll add a test case to IndexResovlerFieldNamesTests
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, it seems like it's reasonable to have a special case for METRICS
in EsqlSession.fieldNames
. I added a test.
I also added a comment and made this look for Aggregate
of Metrics
type rather than looking for unresolved relations with timeseries mode. It's the Aggregate
that needs to consume the @timestamp
field if it's in Metrics
mode, which I think is easier to grasp than "that's an unresolved time series, so probably we need the @timestamp
field somewhere".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better to create an issue for this - adding a separate node (or extending Aggregate) is a more elegant way for evolving this requirement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, this should be ready to go now. @costin would you like to take another look? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM with a final remark regarding metrics
usage in EsqlSession
.
Thanks @alex-spies
x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/session/EsqlSession.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left a few comments, LGTM otherwise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left a few comments, LGTM otherwise.
CI actually went through and was green; no idea why elasticsearch-ci/part-4 is still showing as pending. |
- Fix ProjectAwayColumns to handle synthetic attributes and mark synthetically introduced Aliases as synthetic again. - Fix QueryPlan.references() - Simplify ProjectAwayColumns. - Simplify DependencyConsistency. - Make AggregateExec track the intermediate attributes it actually outputs/requries.
synthetic
again.ProjectAwayColumns
.DependencyConsistency
.AggregateExec
track the intermediate attributes it actually outputs/requries.ProjectAwayColumns is an optimization, run on the coordinator node, which determines which columns/attributes are required from the data nodes; unused columns/attributes are projected out by adding a new projection to the PhysicalPlan's Fragment, i.e. the logical plan sent to data nodes.
The current approach tries to find all references by traversing the plan nodes, and, roughly, assuming each NamedExpression (that's usually an Attribute or Alias) is going to be required to be provided by the current plan nodes's children - unless it's a NamedExpression that the current node generates itself. E.g. for
EVAL x = 2*field, y = x + 1
, the plan node referencesfield
andx
, butx
is generated in the same node, so what's required from the child plan node is actually justfield
.However, this is brittle, as plan nodes can contain all kinds of NamedExpressions, and it is also very difficult to reason about this optimization rule's correctness. #99188 added a workaround to synthetic attributes accidentally not being accounted for correctly, which prevented us from using synthetic attributes in many situations because synthetic attributes couldn't be sent from data to coordinator node anymore; however, this is required for union types and some push down optimization rules.
This PR fixes and simplifies
ProjectAwayColumns
by observing that determining the minimum set of columns to run a given logical plan node while still creating the same output can be computed from 3 sets:QueryPlan.output()
)output()
)EVAL x = 2*field, y = x + 1
this is onlyfield
.To obtain 3., this PR fixes
QueryPlan.references
, which used to just mechanically traverse all expressions of a plan node and look for any attributes it could find. Because 3. is also implicitly computed inDependencyConsistency
to check whether a plan node's children provide all required attributes, this PR also simplifiesDependencyConsistency
.