Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ESQL: Fix synthetic attribute pruning #111413

Merged
merged 51 commits into from
Sep 4, 2024

Conversation

alex-spies
Copy link
Contributor

@alex-spies alex-spies commented Jul 29, 2024

ProjectAwayColumns is an optimization, run on the coordinator node, which determines which columns/attributes are required from the data nodes; unused columns/attributes are projected out by adding a new projection to the PhysicalPlan's Fragment, i.e. the logical plan sent to data nodes.

The current approach tries to find all references by traversing the plan nodes, and, roughly, assuming each NamedExpression (that's usually an Attribute or Alias) is going to be required to be provided by the current plan nodes's children - unless it's a NamedExpression that the current node generates itself. E.g. for EVAL x = 2*field, y = x + 1, the plan node references field and x, but x is generated in the same node, so what's required from the child plan node is actually just field.

However, this is brittle, as plan nodes can contain all kinds of NamedExpressions, and it is also very difficult to reason about this optimization rule's correctness. #99188 added a workaround to synthetic attributes accidentally not being accounted for correctly, which prevented us from using synthetic attributes in many situations because synthetic attributes couldn't be sent from data to coordinator node anymore; however, this is required for union types and some push down optimization rules.

This PR fixes and simplifies ProjectAwayColumns by observing that determining the minimum set of columns to run a given logical plan node while still creating the same output can be computed from 3 sets:

  1. the expected output (QueryPlan.output())
  2. the attributes generated by this plan node (subtract the children input from output())
  3. any references required for this node specifically, e.g. for EVAL x = 2*field, y = x + 1 this is only field.

To obtain 3., this PR fixes QueryPlan.references, which used to just mechanically traverse all expressions of a plan node and look for any attributes it could find. Because 3. is also implicitly computed in DependencyConsistency to check whether a plan node's children provide all required attributes, this PR also simplifies DependencyConsistency.

Comment on lines 124 to 127
} else {
AttributeSet childOutput = currentPlanNode.inputSet();
AttributeSet addedAttributes = currentPlanNode.outputSet().subtract(childOutput);
requiredAttributes.set(requiredAttributes.get().subtract(addedAttributes).combine(currentPlanNode.requiredInputSet()));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main change; the way we collected all attributes that occurred and then removed generated attributes is fully abstracted away here.

Comment on lines -110 to -111
// skip synthetically added attributes (the ones from AVG), see LogicalPlanOptimizer.SubstituteSurrogates
if (attr.synthetic() == false && aliases.containsKey(attr) == false) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Synthetic attributes were skipped because we had to solve an NPE; IMHO this was not a correct long term fix, as it made heavy assumptions on where synthetic attributes are used.

@alex-spies alex-spies added the test-full-bwc Trigger full BWC version matrix tests label Jul 30, 2024
@elasticsearchmachine
Copy link
Collaborator

Hi @alex-spies, I've created a changelog YAML for you.

@alex-spies alex-spies requested a review from astefan July 30, 2024 09:38
Copy link
Contributor

@astefan astefan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is ok imo. I am curious to see how this preliminary step will help with synthetic attributes usage in other places (mainly union types).

@@ -27,6 +28,11 @@ public List<NamedExpression> removals() {
return removals;
}

@Override
public AttributeSet requiredInputSet() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't drop different here? It's usage is to capture what the query wants to remove from projections and, in essence, a drop doesn't "live" too long, being transformed in a projection in the Analyzer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True. Drop essentially is not a real logical plan node, it's an AST node. The same is true for Lookup and Keep.

We can have this throw UnsupportedOperationException instead; although I think returning references() is correct, too; this contains the UnresolvedAttributes that this is trying to drop, excluding any wildcards.

@astefan
Copy link
Contributor

astefan commented Aug 1, 2024

Minor suggestion: just for debugging purposes, to see the in trace logging which attributes in the plan are also synthetic. Maybe something like return qualifiedName() + "{" + label() + "}" + (synthetic() ? "{s}" : "") + "#" + id(); to be used here?

@alex-spies
Copy link
Contributor Author

I think this is ok imo. I am curious to see how this preliminary step will help with synthetic attributes usage in other places (mainly union types).

I turned attributes synthetic that were meant to be synthetic in this commit.

For union types specifically, we turn each expression like TO_STRING(multi_typed_field) into a synthetic field attribute $$multi_typed_field$converted_to$keyword. This was previously impossible, because they would get projected away before reaching the coordinator node, which is required e.g. to compute the top n over TO_STRING(multi_typed_field).

Copy link
Member

@costin costin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks pretty good - thanks for incorporating the feedback and left another round.

}
return lazyReferences;
}

protected abstract AttributeSet computeReferences();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make the default implementation Expressions.references(expressions()) like before since it applies to most implementations.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would actually prefer not to give this a default implementation.

As long as we add new commands, I'd prefer each plan node to be forced to explicitly declare what it requires from its children. This is more verbose, but consistent with output() - and the fact that references had the mentioned default implementation led to many plan nodes having wrong implementations, and we built a bunch of workarounds to account for that.

Case in point: when the Join node was first added, its references implementation was wrong and needed fixing, but this was not immediately apparent at all.

Let me know how you'd like us to proceed. I'm fine with either way, but wanted to lay out my reasoning first.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tl;dr - I've read your arguments however I'd prefer to use a default implementations, similar to inputSet, etc...

references and output are different.
The input of each set is well determined - the output set of its children. The same patter makes sense for references(), expressions() etc.. Not so with the output since each input had its own particularities.

As you point out, the issue was a bug in the default implementation - better to fix that in one place instead of having multiple places that do the same thing and potentially replicate the bug.

At the end of the day, it's an arbitrary choice. The style of the code-base is trustful towards the author, that is non-defensive: (for example collections are not copied for input/returns by default), method calls don't perform null checks for every params, etc...

Hence why I opt towards a default implementation that is useful and guiding - adding a new command requires the author to know what they're doing to begin with so I'd use this mindset, at least at this stage of the project.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As you point out, the issue was a bug in the default implementation - better to fix that in one place instead of having multiple places that do the same thing and potentially replicate the bug.

The problem was having a default implementation. The default implementation just didn't apply for most query plans but we kept using it. The default implementation was exactly Expressions.references(expressions()), that's why I'm hesitant to revert back to how it was before.

But it's probably fine either way and I'll revert it to using the default impl. Since we now actually rely on correct references() implementations in ProjectAwayColumns and the dependency checkers, there's now a higher chance to notice that the default implementation needed to be overridden if it didn't apply.

Comment on lines 79 to 82
public static AttributeSet requiredAttributesFromChild(List<Alias> fields) {
AttributeSet generated = new AttributeSet(asAttributes(fields));
return Expressions.references(fields).subtract(generated);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same remark - is this used elsewhere externally?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, in EvalExec!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason why the methods across Eval and Aggregate don't have the same name? Something like doComputeReferences() or determineReferences(), etc... required is a misnomer in this context.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll just overload computeReferences (and output for Aggregate), that should be nice and consistent.

@alex-spies alex-spies requested a review from costin August 22, 2024 09:02
@Override
public AttributeSet references() {
AttributeSet refs = super.references();
if (indexMode == IndexMode.TIME_SERIES) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you make the move to EsqlSession? Was the fieldNames the only place where metrics specific metadata attributes were needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think fieldNames is the only place where we would somehow call UnresolvedRelation.references() and do something with the output in case of time series indices; after the first analysis steps, there's not supposed to be any UnresolvedRelations anymore.

Removing this code altogether makes the tests in k8s-metrics fail, whereas it currently runs green; also, the only place in the whole (non-test) code base where IndexMode.TIME_SERIES is set on an UnresolvedRelation is LogicalPlanBuilder.visitMetricsCommand.

I'll double-check if moving this code to EsqlSession.fieldNames is the best course of action; if so, I'll add a test case to IndexResovlerFieldNamesTests.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, it seems like it's reasonable to have a special case for METRICS in EsqlSession.fieldNames. I added a test.

I also added a comment and made this look for Aggregate of Metrics type rather than looking for unresolved relations with timeseries mode. It's the Aggregate that needs to consume the @timestamp field if it's in Metrics mode, which I think is easier to grasp than "that's an unresolved time series, so probably we need the @timestamp field somewhere".

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better to create an issue for this - adding a separate node (or extending Aggregate) is a more elegant way for evolving this requirement.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I opened #112473; I'm not sure I captured your intent @costin , so feel free to comment on or update that.

@alex-spies
Copy link
Contributor Author

Ok, this should be ready to go now.

@costin would you like to take another look?

Copy link
Contributor

@astefan astefan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with a final remark regarding metrics usage in EsqlSession.
Thanks @alex-spies

Copy link
Member

@costin costin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a few comments, LGTM otherwise.

Copy link
Member

@costin costin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a few comments, LGTM otherwise.

@alex-spies
Copy link
Contributor Author

CI actually went through and was green; no idea why elasticsearch-ci/part-4 is still showing as pending.

@alex-spies alex-spies merged commit 9b96665 into elastic:main Sep 4, 2024
14 of 15 checks passed
@alex-spies
Copy link
Contributor Author

Thanks for your reviews @astefan and @costin !

@alex-spies alex-spies deleted the fix-synthetic-attribute-pruning branch September 4, 2024 08:22
cbuescher pushed a commit to cbuescher/elasticsearch that referenced this pull request Sep 4, 2024
- Fix ProjectAwayColumns to handle synthetic attributes and mark synthetically introduced Aliases as synthetic again.
- Fix QueryPlan.references()
- Simplify ProjectAwayColumns.
- Simplify DependencyConsistency.
- Make AggregateExec track the intermediate attributes it actually outputs/requries.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Analytics/ES|QL AKA ESQL >bug Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) test-full-bwc Trigger full BWC version matrix tests v8.16.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ESQL: ProjectAwayColumns rule not handling synthetic attribute correctly
4 participants