Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ESQL: Add support for TEXT fields in comparison operators and SORT #98528

Merged
merged 16 commits into from Aug 30, 2023

Conversation

luigidellaquila
Copy link
Contributor

@luigidellaquila luigidellaquila commented Aug 16, 2023

(Refactoring of #98317, extending the operators instead of modifying QL module)

This fix adds support for TEXT fields in comparison operators (==, !=, >, >= etc., including LIKE and RLIKE) and in SORT command.

The fix consists in passing one more parameter to BinaryComparison, Order and RegexMatch constructors (and all the subclasses) to define whether the instance should resolve on TEXT fields (for ESQL) or not (for SQL and EQL).

Operators pushdown (both for filtering and sorting) is enabled only if the property also declares an exact multi-field that can be used as a fallback, otherwise the optimization gets skipped and the operation is executed in-line.

[edit]

Added a rule that replaces fields with their exact subfield when possible

Fixes #98642

@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-ql (Team:QL)

@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/elasticsearch-esql (:Query Languages/ES|QL)

@luigidellaquila
Copy link
Contributor Author

@elasticmachine run elasticsearch-ci/part-1

@luigidellaquila
Copy link
Contributor Author

@elasticmachine run elasticsearch-ci/bwc

Copy link
Contributor

@bpintea bpintea left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left one more relevant note, otherwise it LGTM.

import static org.elasticsearch.xpack.ql.expression.TypeResolutions.ParamOrdinal.DEFAULT;
import static org.elasticsearch.xpack.ql.expression.TypeResolutions.isString;

public class WildcardLike extends org.elasticsearch.xpack.ql.expression.predicate.regex.WildcardLike {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we expect a case insensitive like to be added? if not, i guess we could get rid of the c'tor parameter.


@Override
public BinaryComparison reverse() {
return super.reverse();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will return a QL GT.

I'm wondering if all these functions shouldn't rather extend QL's BinaryComparsion directly, since they override most methods anyways. Or alterantively, there could be an EsqlBinaryComparion (extending QL's), which could implement the type resolution, which is identical in all the new comparisons.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Argh! The whole point of extending that method was to return the right instance, and I just missed it.
Fixing it right away

@@ -354,11 +354,16 @@ public void testLiteralSimple() throws IOException {
}

public void testOrderSimple() throws IOException {
var orig = new Order(Source.EMPTY, field("val", DataTypes.INTEGER), Order.OrderDirection.ASC, Order.NullsPosition.FIRST);
var orig = new org.elasticsearch.xpack.esql.expression.Order(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess ESQL's Order could be imported directly.

@ChrisHegarty ChrisHegarty deleted the branch elastic:main August 17, 2023 13:05
@luigidellaquila luigidellaquila changed the base branch from feature/esql to main August 17, 2023 13:28
Copy link
Member

@costin costin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for waiting Luigi and looking into this. Left a couple of comments - this is going to clash with #98628, thanks in advanced for working around it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This rule made sense before since the comparison classes were part of QL - now that we have our own classes, the type validation should be within the comparison classes themselves.
There's no advantage of it being outside rather it's confusing since the classes inherit the QL behavior which is used inside resolution and then complemented through the Verifier, which is both redundant and error-prone (the type resolution and Verifier need to be kept in sync and not trip over one another).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did some tests, moving this logic inside the resolution flow; the fix is not complex, but it has some practical implications:

  • the Analyzer does some automatic conversion (KEYWORD->DATETIME), that happen before this rule, and only if the expressions are foldable. At the same time, the resolution is used before the conversion (eg. in ResolveFunctions rule) and fails if we keep the logic as it is. So we need to review the logic a bit.
  • AbstractBinaryComparisonTestCase does not consider these automatic conversions (and today it makes no distinction between foldable and non-foldable expressions, see above), so we have to review that as well.

If you don't mind, I'd prefer to address this problem with a follow-up PR and discuss the changes separately

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WFM - please raise an issue that explains the follow-up items and link it here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines 288 to 293
if (fa.getExactInfo().hasExact() == false) {
return false;
}
} else {
return false;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if (order.child() instanceof FieldAttribute && fa.getExactInfo().hasExact()) == false {
   return false;
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably a bit more readable:

return orders.stream().allMatch(o -> o.child() instanceof FieldAttribute fa && fa.getExactInfo().hasExact());

}

private List<EsQueryExec.FieldSort> buildFieldSorts(List<Order> orders) {
List<EsQueryExec.FieldSort> sorts = new ArrayList<>(orders.size());
for (Order o : orders) {
sorts.add(new EsQueryExec.FieldSort(((FieldAttribute) o.child()), o.direction(), o.nullsPosition()));
sorts.add(new EsQueryExec.FieldSort(((FieldAttribute) o.child()).exactAttribute(), o.direction(), o.nullsPosition()));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the sort would have to be extracted as part of #98642 and instead of letting the original field be passed in and select its exact attribute, have an optimization rule that determines (potentially locally) that an exact field is available an use that instead.
This would apply across sort and aggregations transparently without having to switch things in multiple places.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a rule that implements #98642 and that also covers this case.
I wanted to do it with a separate PR, but with currently supported data types the only way to write thorough tests is to have both fixes together, so I ended up adding it here.

return ExpressionTranslator.wrapIfNested(new SingleValueQuery(querySupplier.get(), fa.name()), field);
return ExpressionTranslator.wrapIfNested(
new SingleValueQuery(querySupplier.get(), fa.exactAttribute().name()),
((FieldAttribute) field).exactAttribute()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above.

Comment on lines 732 to 736
? new org.elasticsearch.xpack.esql.expression.predicate.operator.comparison.Equals(
k.source(),
k,
v.iterator().next(),
finalZoneId
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Introducing a createEquals similar to createIn, avoids the need to duplicate this class and instead keep using the original rule.
Or am I missing something else?

@@ -684,4 +771,35 @@ protected LogicalPlan rule(Limit plan) {
return p;
}
}

public static class ReplaceRegexMatch extends OptimizerRules.OptimizerExpressionRule<RegexMatch<?>> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to copy the class - extend ReplaceRegexMatch in QL and override regextToEquals to return the ESQL Equals class similar to EQL.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are these (regex) classes added?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

only to override resolveType()

@luigidellaquila
Copy link
Contributor Author

Thanks for your feedback @costin

this is going to clash with #98628, thanks in advanced for working around it.

No problem, I'll take care of resolving the conflicts

@elasticsearchmachine
Copy link
Collaborator

Hi @luigidellaquila, I've created a changelog YAML for you.

@elasticsearchmachine
Copy link
Collaborator

Hi @luigidellaquila, I've updated the changelog YAML for you.

Copy link
Member

@costin costin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for incorporating the feedback Luigi on this wide PR. Left a round of comments.

@@ -271,6 +271,9 @@ public static Failure validateBinaryComparison(BinaryComparison bc) {
if (false == r.resolved()) {
return fail(bc, r.message());
}
if (DataTypes.isString(bc.left().dataType()) && DataTypes.isString(bc.right().dataType())) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this needed? Aren't the operators performing their own, full type resolution?
In fact why still have this rule in place?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my comment above #98528 (comment)

Comment on lines 32 to 34
if (e instanceof FieldAttribute fa && fa.dataType() == DataTypes.TEXT) {
return TypeResolution.TYPE_RESOLVED;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes for an utility method in EsqlTypeResolutions or even in the original TypeResolutions class.

Comment on lines 28 to 30
if (e instanceof FieldAttribute fa && fa.dataType() == DataTypes.TEXT) {
return TypeResolution.TYPE_RESOLVED;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comment above.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remind me again why are these two classes here? They don't seem to add anything over the QL ones...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

    protected TypeResolution resolveType() {
        return isString(field(), sourceText(), DEFAULT);
    }

vs

    protected TypeResolution resolveType() {
        return isStringAndExact(field(), sourceText(), DEFAULT);
    }

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When pushing the filter down, we might want to keep as is since it can be used as a prefix query for example.
Worth raising an issue for it for when we have more time to investigate (or get some advice from the search team).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


@Override
protected TypeResolution resolveType() {
return field().dataType() == DataTypes.TEXT ? TypeResolution.TYPE_RESOLVED : super.resolveType();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be another resolution utility method - such as isString() which for ESQL includes TEXT or isStringFamily to not confuse it with the method from QL.

Comment on lines +810 to +813

protected Equals createEquals(Expression k, Set<Expression> v, ZoneId finalZoneId) {
return new Equals(k.source(), k, v.iterator().next(), finalZoneId);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@@ -88,6 +92,7 @@ protected static List<Batch<LogicalPlan>> rules() {

var operators = new Batch<>(
"Operator Optimization",
new ReplaceAttributesWithExact(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. The Rewrite/Replacement only needs to occur once hence move it in the "Substitutions" batch above.
  2. the name should reflect that not all attributes are changed --> ReplaceFieldAttributesWithExactSubfield, quite long but clear about the intent.

}
}

private static class ReplaceAttributesWithExact extends OptimizerRules.OptimizerRule<LogicalPlan> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only the fields that occur inside certain operations need to be replaced - doing this across all fields is going to have side effects. Fields inside sort, stats need to be changed; potentially those used inside filters (though that depends whether full text search is used on them or not).
If the list is too broad, the rule could look into the fields that need to be preserved as are and then change everything else to be exact (though again that might backfire).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 I'm changing this to limit the scope to OrderBy, Aggregate and Filter plans. I'm including Filter since we have no full text search for now, and it would be a shame not pushing down filters when possible. Likely, we'll have to refine this rule in the future.

Comment on lines 848 to 869
private NamedExpression toExactAlias(FieldAttribute e, List<? extends NamedExpression> projections) {

if (e.getExactInfo().hasExact() && e.exactAttribute() != e) {
FieldAttribute calculatedExact = toExact(e);

// avoid using multiple IDs just to load the same field twice
NamedExpression exact = projections.stream()
.filter(x -> x.name().equals(calculatedExact.name()))
.map(NamedExpression.class::cast)
.findFirst()
.orElse(calculatedExact);

return new Alias(e.source(), e.name(), exact);
}
return e;
}

private static FieldAttribute toExact(FieldAttribute fa) {
if (fa.getExactInfo().hasExact() && fa.exactAttribute() != fa) {
return fa.exactAttribute();
}
return fa;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to add new aliases and fields - simply replace the FieldAttribute in place, with the same underlying NameId but set the EsField to its exact counterpart. That is the attribute remains essentially the same, it's just the backing field that has changed.

@luigidellaquila
Copy link
Contributor Author

@elasticmachine run elasticsearch-ci/bwc

Copy link
Member

@costin costin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Grazie Luigi for patiently iterating on this PR.
Looks good to me - however I left a comment on why both the field and its raw subfield are being extracted; should just raw be used instead?

Thanks!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Comment on lines 159 to 165
.mapToInt(
f -> (EstimatesRowSize.estimateSize(EsqlDataTypes.widenSmallNumericTypes(f.getDataType())) + f.getProperties()
.values()
.stream()
.mapToInt(x -> EstimatesRowSize.estimateSize(EsqlDataTypes.widenSmallNumericTypes(x.getDataType())))
.sum())
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please leave a comment on why the properties are considered as well.

assertEquals(Set.of("emp_no"), Sets.newHashSet(names(extract.attributesToExtract())));

var query = as(extract.child(), EsQueryExec.class);
assertThat(query.estimatedRowSize(), equalTo(Integer.BYTES + allFieldRowSize));
}

private Set<String> allFields(Map<String, EsField> mapping) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What purpose does this method serve; in particular where it is being used?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's used in two tests that check the output based on the mapping.
Before this change, it was just mapping.keySet(), but now the mapping also contains a TEXT field with a KEYWORD subfield, so we have to take it into consideration as well.

@@ -423,7 +419,7 @@ public void testExtractorMultiEvalWithSameName() {
var extract = as(project.child(), FieldExtractExec.class);
assertThat(
names(extract.attributesToExtract()),
contains("_meta_field", "emp_no", "first_name", "gender", "job.raw", "languages", "last_name", "salary")
contains("_meta_field", "emp_no", "first_name", "gender", "job", "job.raw", "languages", "last_name", "salary")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are both job and job.raw extracted? Wouldn't job.raw be enough or does jaw leak somewhere?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the initial version of this PR, I replaced all the occurrences in the plan with their exact subfield (including EsRelation), and as a result there was no need to load job anymore, but it made the planning a bit more convoluted, with the addition of some aliases to keep the original names (otherwise the output for from idx would result in job.raw instead of job).
There could be a smarter way to do it and avoid fetching TEXT fields; in case we can further investigate and do it with a follow-up PR (at low level there are other optimizations, so we should not load from _source anyway)

@luigidellaquila luigidellaquila merged commit ba87357 into elastic:main Aug 30, 2023
12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Analytics/ES|QL AKA ESQL >enhancement Team:QL (Deprecated) Meta label for query languages team v8.11.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ESQL: exact operation should use the exact field if available
5 participants