@alex-spies alex-spies commented May 6, 2025

Closes #119082

Assume a lookup index with fields language_code, lookup_field. We want to push down a LOOKUP JOIN past an upstream Project, like so:

FROM main_index | KEEP other_field1, language_code | LOOKUP JOIN lookup_index ON language_code 

->

\_Join[LEFT,[language_code]]
  |_Project[[other_field1, language_code]]
  |  \_EsRelation[main_index][language_code, other_field1, other_field2]
  \_EsRelation[lookup_index][LOOKUP][language_code, lookup_field]

Move the Project up from the Join's left hand branch ->

Project[[other_field1, language_code, lookup_field]]
  \_Join[LEFT,[language_code]]
    |_EsRelation[main_index][language_code, other_field1, other_field2]
    \_EsRelation[lookup_index][LOOKUP][language_code, lookup_field]
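
To make the rewrite above concrete, here is a minimal sketch on toy plan nodes. The `Plan`, `Relation`, `Project` and `Join` records are hypothetical stand-ins, not the actual ES|QL classes, and the sketch assumes a plain KEEP with no name conflicts between the main index and the lookup index (conflicts are discussed further down).

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: these records are hypothetical and much simpler than the real plan nodes.
public class ProjectPullUpSketch {
    interface Plan {}
    record Relation(String index, List<String> fields) implements Plan {}
    record Project(List<String> keep, Plan child) implements Plan {}
    record Join(List<String> on, Plan left, Relation lookup) implements Plan {}

    // If the join's left child is a Project, hoist it above the join and extend it
    // with the fields that the join adds (all lookup fields except the join keys).
    // Assumption: none of the added lookup fields clash with names used by the Project.
    static Plan pullUpProject(Join join) {
        if (!(join.left() instanceof Project project)) {
            return join; // nothing to pull up
        }
        List<String> output = new ArrayList<>(project.keep());
        for (String field : join.lookup().fields()) {
            if (!join.on().contains(field)) {
                output.add(field);
            }
        }
        return new Project(output, new Join(join.on(), project.child(), join.lookup()));
    }

    public static void main(String[] args) {
        Relation main = new Relation("main_index", List.of("language_code", "other_field1", "other_field2"));
        Relation lookup = new Relation("lookup_index", List.of("language_code", "lookup_field"));
        Join join = new Join(List.of("language_code"),
            new Project(List.of("other_field1", "language_code"), main), lookup);
        System.out.println(pullUpProject(join));
    }
}
```

The resulting top-level `Project` can then be merged with any downstream `Project`, which is where unused lookup fields can be pruned.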

Pulling up the Project allows us to combine it with other Projects downstream, which may eliminate some lookup fields entirely. An example is the query from #119082:

FROM test
| KEEP languages, emp_no
| EVAL language_code = languages
| LOOKUP JOIN languages_lookup ON language_code
| RENAME language_name AS foo              <- the lookup field is later dropped and shouldn't be loaded at all!
| LOOKUP JOIN languages_lookup ON language_code
| DROP foo

Avoiding the early Projects also allows us to perform field extractions later - the Project ahead of the LOOKUP JOIN otherwise causes InsertFieldExtraction to load any and all fields that we need from the main index before the LOOKUP JOIN.

Like with any pushdown optimization, we have to deal with name conflicts: LOOKUP JOIN shadows any conflicting attributes if the lookup fields have the same name; in this regard, it behaves like ENRICH or EVAL.

Example: Assume the field lookup_field occurs both in lookup_index and in main_index:

FROM main_index | RENAME lookup_field AS ln | LOOKUP JOIN lookup_index ON language_code

\_Join[LEFT,[language_code]]
  |_Project[[language_code, lookup_field AS ln]]
  |  \_EsRelation[main_index][language_code, lookup_field]
  \_EsRelation[lookup_index][LOOKUP][language_code, lookup_field]

Try to move up the Project as before:

Project[[language_code, lookup_field AS ln]]  ⚡! The original lookup_field from main_index got shadowed!
  \_Join[LEFT,[language_code]]
    |_EsRelation[main_index][language_code, lookup_field]
    \_EsRelation[lookup_index][LOOKUP][language_code, lookup_field]

There are 2 ways to deal with this:

  1. Leave a partial Project or Eval upstream from the Join to rename conflicting attributes to some arbitrary names, then in the new Project that we place downstream from the Join, name them to the desired names.
  2. Change the names of the attributes that LOOKUP JOIN adds.

Option 1. is not ideal, because the renaming before the LOOKUP JOIN can still trigger field extractions. This PR thus goes with 2., which is also the approach our other pushdown rules take, see here.
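
Applied to the example above, option 2 could look roughly like the following sketch. Everything in it is illustrative: the `Alias` and `Result` records, the `pullUp` helper and the `$$<field>$temp` naming scheme are assumptions made for this example and are not taken from the PR's code.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model only: hypothetical types, not the ES|QL implementation.
public class LookupRenameSketch {
    // "source AS name"; a plain KEEP has source == name.
    record Alias(String source, String name) {}
    // lookupRenames: physical lookup field -> temporary attribute name used inside the Join.
    record Result(Map<String, String> lookupRenames, List<Alias> projectAboveJoin) {}

    static Result pullUp(List<Alias> project, List<String> addedLookupFields) {
        List<Alias> above = new ArrayList<>();
        for (Alias alias : project) {
            // Outputs with the same name as an added lookup field were shadowed by the
            // join anyway, so they do not survive the rewrite.
            if (!addedLookupFields.contains(alias.name())) {
                above.add(alias);
            }
        }
        List<Alias> surviving = List.copyOf(above);
        Map<String, String> renames = new LinkedHashMap<>();
        for (String field : addedLookupFields) {
            // Conflict: the Project still reads a main-index attribute with this name,
            // which the join would shadow once the Project sits above it.
            boolean conflicts = surviving.stream().anyMatch(a -> a.source().equals(field));
            String attributeName = conflicts ? "$$" + field + "$temp" : field; // hypothetical naming scheme
            if (conflicts) {
                renames.put(field, attributeName);
            }
            // The Project above the join restores the user-visible name.
            above.add(new Alias(attributeName, field));
        }
        return new Result(renames, above);
    }

    public static void main(String[] args) {
        List<Alias> project = List.of(new Alias("language_code", "language_code"), new Alias("lookup_field", "ln"));
        System.out.println(pullUp(project, List.of("lookup_field")));
    }
}
```

For the RENAME example, the lookup attribute gets a temporary name inside the Join, and the Project above it keeps `lookup_field AS ln` from main_index while mapping the temporary attribute back to `lookup_field`.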

To implement 2., we leverage the fact that LOOKUP JOIN essentially behaves like ENRICH: thus, we can represent a LOOKUP JOIN as a unary plan node by wrapping it in a dedicated class and then we apply the same pushdown logic as to ENRICH, EVAL etc.

This requires that the (field) attributes that a LOOKUP JOIN adds to the plan can be renamed to arbitrary names, rather than using the physical field names. Ideally, we'd just use temporary qualifiers for this, but this mechanism doesn't exist yet. But! We already have field attributes with arbitrary attribute names and use them for union-typed fields; so we can do the same here and simply rename the field attributes of the EsRelation that represents the lookup index (without actually renaming the corresponding physical fields they refer to).

For this to work, we need to make sure that the compute code of LOOKUP JOIN doesn't rely on FieldAttribute#name (the, potentially arbitrary, attribute name) but rather on FieldAttribute#fieldName (the name of the physical field). There are some places in the code where we don't use #fieldName, yet - these are bugs (and won't work with union types!) and need to be fixed and backported before the bwc tests of this PR can truly pass. This is related to #127521.
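
The distinction between the two names can be illustrated with a small sketch. `FieldAttr` is a hypothetical stand-in for a field attribute, the `extract` helper and the temporary attribute name are made up for the example; the only point is that values are read by the physical field name and exposed under the attribute name.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model only: hypothetical types, not the ES|QL compute code.
public class FieldNameSketch {
    // name: what the plan calls the column (may be arbitrary after a rename).
    // fieldName: the physical field in the index (never changes).
    record FieldAttr(String name, String fieldName) {
        FieldAttr renamedTo(String newName) {
            return new FieldAttr(newName, fieldName);
        }
    }

    // Reads each value by its physical field name and emits it under the attribute name.
    // Reading by attr.name() instead would silently miss every renamed attribute.
    static Map<String, Object> extract(Map<String, Object> lookupDoc, List<FieldAttr> attrs) {
        Map<String, Object> row = new LinkedHashMap<>();
        for (FieldAttr attr : attrs) {
            row.put(attr.name(), lookupDoc.get(attr.fieldName()));
        }
        return row;
    }

    public static void main(String[] args) {
        Map<String, Object> doc = Map.of("language_code", 1, "lookup_field", "English");
        FieldAttr renamed = new FieldAttr("lookup_field", "lookup_field").renamedTo("$$lookup_field$temp");
        System.out.println(extract(doc, List.of(renamed)));
    }
}
```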

List<ValuesSourceReaderOperator.FieldInfo> fields = new ArrayList<>(extractFields.size());
for (NamedExpression extractField : extractFields) {
    String physicalName = extractField instanceof FieldAttribute fa ? fa.fieldName()
        : extractField instanceof Alias a ? ((NamedExpression) a.child()).name()

Needs a comment: alias and reference attribute cases only relevant for ENRICH

Needs a bunch of additional tests + updating the expectations of the tests inside here.

Comment on lines +721 to +722
// TODO: This probably also led to bugs for LOOKUP JOIN on a union typed field, let's add a test.
this(match.exactAttribute().fieldName(), input.channel(), input.type());

The diff touches multiple places that should have used field names but used attribute names instead.

To make this PR cleaner, I think we should have a separate PR just with these fixes + corresponding tests. This should also address #127521.

@alex-spies

This approach would require that we can rename the lookup attributes that LOOKUP JOIN adds to the plan. This is not possible before 8.18.3 (will become possible only with #129355), and thus bwc between 8.18.0-8.18.2 and 8.19 would be broken; the same holds for bwc between 9.0.0-9.0.2 and 9.1.

Successfully merging this pull request may close these issues.

ESQL: LOOKUP JOIN push down optimizations