-
Notifications
You must be signed in to change notification settings - Fork 25.5k
ESQL: Enable pushing down LOOKUP JOIN past Project #127776
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
List<ValuesSourceReaderOperator.FieldInfo> fields = new ArrayList<>(extractFields.size()); | ||
for (NamedExpression extractField : extractFields) { | ||
String physicalName = extractField instanceof FieldAttribute fa ? fa.fieldName() | ||
: extractField instanceof Alias a ? ((NamedExpression) a.child()).name() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needs a comment: alias and reference attribute cases only relevant for ENRICH
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needs a bunch of additional tests + updating the expectations of the tests inside here.
// TODO: This probably also led to bugs for LOOKUP JOIN on a union typed field, let's add a test. | ||
this(match.exactAttribute().fieldName(), input.channel(), input.type()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The diff touches multiple places that should have used field names but used attribute names, instead.
To make this PR cleaner, I think we should have a separate PR just with these fixes + corresponding tests. This should also address #127521.
This approach would require that we can rename the lookup attributes that |
Closes #119082
Assume a lookup index with fields
language_code, lookup_field
. We want to push down a LOOKUP JOIN past an upstream Project, like so:Pulling up the
Project
allows us to combine it with otherProject
s downstream, which may eliminate some lookup fields entirely. An example is the query from #119082:Avoiding the early
Project
s also allows us to perform field extractions later - theProject
ahead of theLOOKUP JOIN
otherwise causesInsertFieldExtraction
to load any and all fields that we need from the main index before theLOOKUP JOIN
.Like with any pushdown optimization, we have to deal with name conflicts:
LOOKUP JOIN
shadows any conflicting attributes if the lookup fields have the same name; in this regard, it behaves likeENRICH
orEVAL
.Example: Assume the field
lookup_field
occurs both inlookup_index
and inmain_index
:There are 2 ways to deal with this:
Project
orEval
upstream from theJoin
to rename conflicting attributes to some arbitrary names, then in the newProject
that we place downstream from theJoin
, name them to the desired names.LOOKUP JOIN
adds.Option 1. is not ideal, because the renaming before the
LOOKUP JOIN
can still trigger field extractions. This PR thus goes with 2., which is also the approach our other pushdown rules take, see here.To implement 2., we leverage the fact that
LOOKUP JOIN
essentially behaves likeENRICH
: thus, we can represent aLOOKUP JOIN
as a unary plan node by wrapping it in a dedicated class and then we apply the same pushdown logic as toENRICH
,EVAL
etc.This requires that the (field) attributes that a
LOOKUP JOIN
adds to the plan can be renamed to arbitrary names, rather than using the physical field names. Ideally, we'd just use temporary qualifiers for this, but this mechanism doesn't exist yet. But! We already have field attributes with arbitrary attribute names and use them for union-typed fields; so we can do the same here and simply rename the field attributes of theEsRelation
that represents the lookup index (without actually renaming the corresponding physical fields they refer to).For this to work, we need to make sure that the compute code of
LOOKUP JOIN
doesn't rely onFieldAttribute#name
(the, potentially arbitrary, attribute name) but rather onFieldAttribute#fieldName
(the name of the physical field). There are some places in the code where we don't use#fieldName
, yet - these are bugs (and won't work with union types!) and need to be fixed and backported before the bwc tests of this PR can truly pass. This is related to #127521.