New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ESQL: Add support for TEXT fields in comparison operators and SORT #98528
ESQL: Add support for TEXT fields in comparison operators and SORT #98528
Conversation
Pinging @elastic/es-ql (Team:QL) |
Pinging @elastic/elasticsearch-esql (:Query Languages/ES|QL) |
@elasticmachine run elasticsearch-ci/part-1 |
@elasticmachine run elasticsearch-ci/bwc |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left one more relevant note, otherwise it LGTM.
import static org.elasticsearch.xpack.ql.expression.TypeResolutions.ParamOrdinal.DEFAULT; | ||
import static org.elasticsearch.xpack.ql.expression.TypeResolutions.isString; | ||
|
||
public class WildcardLike extends org.elasticsearch.xpack.ql.expression.predicate.regex.WildcardLike { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we expect a case insensitive like
to be added? if not, i guess we could get rid of the c'tor parameter.
|
||
@Override | ||
public BinaryComparison reverse() { | ||
return super.reverse(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will return a QL GT.
I'm wondering if all these functions shouldn't rather extend QL's BinaryComparsion directly, since they override most methods anyways. Or alterantively, there could be an EsqlBinaryComparion (extending QL's), which could implement the type resolution, which is identical in all the new comparisons.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Argh! The whole point of extending that method was to return the right instance, and I just missed it.
Fixing it right away
@@ -354,11 +354,16 @@ public void testLiteralSimple() throws IOException { | |||
} | |||
|
|||
public void testOrderSimple() throws IOException { | |||
var orig = new Order(Source.EMPTY, field("val", DataTypes.INTEGER), Order.OrderDirection.ASC, Order.NullsPosition.FIRST); | |||
var orig = new org.elasticsearch.xpack.esql.expression.Order( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess ESQL's Order could be imported directly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for waiting Luigi and looking into this. Left a couple of comments - this is going to clash with #98628, thanks in advanced for working around it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This rule made sense before since the comparison classes were part of QL - now that we have our own classes, the type validation should be within the comparison classes themselves.
There's no advantage of it being outside rather it's confusing since the classes inherit the QL behavior which is used inside resolution and then complemented through the Verifier, which is both redundant and error-prone (the type resolution and Verifier need to be kept in sync and not trip over one another).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did some tests, moving this logic inside the resolution flow; the fix is not complex, but it has some practical implications:
- the Analyzer does some automatic conversion (KEYWORD->DATETIME), that happen before this rule, and only if the expressions are foldable. At the same time, the resolution is used before the conversion (eg. in
ResolveFunctions
rule) and fails if we keep the logic as it is. So we need to review the logic a bit. AbstractBinaryComparisonTestCase
does not consider these automatic conversions (and today it makes no distinction between foldable and non-foldable expressions, see above), so we have to review that as well.
If you don't mind, I'd prefer to address this problem with a follow-up PR and discuss the changes separately
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
WFM - please raise an issue that explains the follow-up items and link it here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (fa.getExactInfo().hasExact() == false) { | ||
return false; | ||
} | ||
} else { | ||
return false; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (order.child() instanceof FieldAttribute && fa.getExactInfo().hasExact()) == false {
return false;
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
probably a bit more readable:
return orders.stream().allMatch(o -> o.child() instanceof FieldAttribute fa && fa.getExactInfo().hasExact());
} | ||
|
||
private List<EsQueryExec.FieldSort> buildFieldSorts(List<Order> orders) { | ||
List<EsQueryExec.FieldSort> sorts = new ArrayList<>(orders.size()); | ||
for (Order o : orders) { | ||
sorts.add(new EsQueryExec.FieldSort(((FieldAttribute) o.child()), o.direction(), o.nullsPosition())); | ||
sorts.add(new EsQueryExec.FieldSort(((FieldAttribute) o.child()).exactAttribute(), o.direction(), o.nullsPosition())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the sort would have to be extracted as part of #98642 and instead of letting the original field be passed in and select its exact attribute, have an optimization rule that determines (potentially locally) that an exact field is available an use that instead.
This would apply across sort and aggregations transparently without having to switch things in multiple places.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a rule that implements #98642 and that also covers this case.
I wanted to do it with a separate PR, but with currently supported data types the only way to write thorough tests is to have both fixes together, so I ended up adding it here.
return ExpressionTranslator.wrapIfNested(new SingleValueQuery(querySupplier.get(), fa.name()), field); | ||
return ExpressionTranslator.wrapIfNested( | ||
new SingleValueQuery(querySupplier.get(), fa.exactAttribute().name()), | ||
((FieldAttribute) field).exactAttribute() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See above.
? new org.elasticsearch.xpack.esql.expression.predicate.operator.comparison.Equals( | ||
k.source(), | ||
k, | ||
v.iterator().next(), | ||
finalZoneId |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Introducing a createEquals
similar to createIn
, avoids the need to duplicate this class and instead keep using the original rule.
Or am I missing something else?
@@ -684,4 +771,35 @@ protected LogicalPlan rule(Limit plan) { | |||
return p; | |||
} | |||
} | |||
|
|||
public static class ReplaceRegexMatch extends OptimizerRules.OptimizerExpressionRule<RegexMatch<?>> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need to copy the class - extend ReplaceRegexMatch in QL and override regextToEquals to return the ESQL Equals class similar to EQL.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are these (regex) classes added?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
only to override resolveType()
Hi @luigidellaquila, I've created a changelog YAML for you. |
Hi @luigidellaquila, I've updated the changelog YAML for you. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for incorporating the feedback Luigi on this wide PR. Left a round of comments.
@@ -271,6 +271,9 @@ public static Failure validateBinaryComparison(BinaryComparison bc) { | |||
if (false == r.resolved()) { | |||
return fail(bc, r.message()); | |||
} | |||
if (DataTypes.isString(bc.left().dataType()) && DataTypes.isString(bc.right().dataType())) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this needed? Aren't the operators performing their own, full type resolution?
In fact why still have this rule in place?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See my comment above #98528 (comment)
if (e instanceof FieldAttribute fa && fa.dataType() == DataTypes.TEXT) { | ||
return TypeResolution.TYPE_RESOLVED; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes for an utility method in EsqlTypeResolutions
or even in the original TypeResolutions
class.
if (e instanceof FieldAttribute fa && fa.dataType() == DataTypes.TEXT) { | ||
return TypeResolution.TYPE_RESOLVED; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See comment above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remind me again why are these two classes here? They don't seem to add anything over the QL ones...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
protected TypeResolution resolveType() {
return isString(field(), sourceText(), DEFAULT);
}
vs
protected TypeResolution resolveType() {
return isStringAndExact(field(), sourceText(), DEFAULT);
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When pushing the filter down, we might want to keep as is since it can be used as a prefix query for example.
Worth raising an issue for it for when we have more time to investigate (or get some advice from the search team).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
||
@Override | ||
protected TypeResolution resolveType() { | ||
return field().dataType() == DataTypes.TEXT ? TypeResolution.TYPE_RESOLVED : super.resolveType(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could be another resolution utility method - such as isString()
which for ESQL includes TEXT or isStringFamily
to not confuse it with the method from QL.
|
||
protected Equals createEquals(Expression k, Set<Expression> v, ZoneId finalZoneId) { | ||
return new Equals(k.source(), k, v.iterator().next(), finalZoneId); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
@@ -88,6 +92,7 @@ protected static List<Batch<LogicalPlan>> rules() { | |||
|
|||
var operators = new Batch<>( | |||
"Operator Optimization", | |||
new ReplaceAttributesWithExact(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- The Rewrite/Replacement only needs to occur once hence move it in the "Substitutions" batch above.
- the name should reflect that not all attributes are changed -->
ReplaceFieldAttributesWithExactSubfield
, quite long but clear about the intent.
} | ||
} | ||
|
||
private static class ReplaceAttributesWithExact extends OptimizerRules.OptimizerRule<LogicalPlan> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only the fields that occur inside certain operations need to be replaced - doing this across all fields is going to have side effects. Fields inside sort, stats need to be changed; potentially those used inside filters (though that depends whether full text search is used on them or not).
If the list is too broad, the rule could look into the fields that need to be preserved as are and then change everything else to be exact (though again that might backfire).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 I'm changing this to limit the scope to OrderBy
, Aggregate
and Filter
plans. I'm including Filter since we have no full text search for now, and it would be a shame not pushing down filters when possible. Likely, we'll have to refine this rule in the future.
private NamedExpression toExactAlias(FieldAttribute e, List<? extends NamedExpression> projections) { | ||
|
||
if (e.getExactInfo().hasExact() && e.exactAttribute() != e) { | ||
FieldAttribute calculatedExact = toExact(e); | ||
|
||
// avoid using multiple IDs just to load the same field twice | ||
NamedExpression exact = projections.stream() | ||
.filter(x -> x.name().equals(calculatedExact.name())) | ||
.map(NamedExpression.class::cast) | ||
.findFirst() | ||
.orElse(calculatedExact); | ||
|
||
return new Alias(e.source(), e.name(), exact); | ||
} | ||
return e; | ||
} | ||
|
||
private static FieldAttribute toExact(FieldAttribute fa) { | ||
if (fa.getExactInfo().hasExact() && fa.exactAttribute() != fa) { | ||
return fa.exactAttribute(); | ||
} | ||
return fa; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need to add new aliases and fields - simply replace the FieldAttribute in place, with the same underlying NameId but set the EsField to its exact counterpart. That is the attribute remains essentially the same, it's just the backing field that has changed.
@elasticmachine run elasticsearch-ci/bwc |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Grazie Luigi for patiently iterating on this PR.
Looks good to me - however I left a comment on why both the field and its raw subfield are being extracted; should just raw be used instead?
Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
.mapToInt( | ||
f -> (EstimatesRowSize.estimateSize(EsqlDataTypes.widenSmallNumericTypes(f.getDataType())) + f.getProperties() | ||
.values() | ||
.stream() | ||
.mapToInt(x -> EstimatesRowSize.estimateSize(EsqlDataTypes.widenSmallNumericTypes(x.getDataType()))) | ||
.sum()) | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please leave a comment on why the properties are considered as well.
assertEquals(Set.of("emp_no"), Sets.newHashSet(names(extract.attributesToExtract()))); | ||
|
||
var query = as(extract.child(), EsQueryExec.class); | ||
assertThat(query.estimatedRowSize(), equalTo(Integer.BYTES + allFieldRowSize)); | ||
} | ||
|
||
private Set<String> allFields(Map<String, EsField> mapping) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What purpose does this method serve; in particular where it is being used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's used in two tests that check the output based on the mapping.
Before this change, it was just mapping.keySet()
, but now the mapping also contains a TEXT field with a KEYWORD subfield, so we have to take it into consideration as well.
@@ -423,7 +419,7 @@ public void testExtractorMultiEvalWithSameName() { | |||
var extract = as(project.child(), FieldExtractExec.class); | |||
assertThat( | |||
names(extract.attributesToExtract()), | |||
contains("_meta_field", "emp_no", "first_name", "gender", "job.raw", "languages", "last_name", "salary") | |||
contains("_meta_field", "emp_no", "first_name", "gender", "job", "job.raw", "languages", "last_name", "salary") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are both job and job.raw extracted? Wouldn't job.raw be enough or does jaw leak somewhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the initial version of this PR, I replaced all the occurrences in the plan with their exact subfield (including EsRelation
), and as a result there was no need to load job
anymore, but it made the planning a bit more convoluted, with the addition of some aliases to keep the original names (otherwise the output for from idx
would result in job.raw
instead of job
).
There could be a smarter way to do it and avoid fetching TEXT fields; in case we can further investigate and do it with a follow-up PR (at low level there are other optimizations, so we should not load from _source anyway)
(Refactoring of #98317, extending the operators instead of modifying QL module)
This fix adds support for TEXT fields in comparison operators (
==
,!=
,>
,>=
etc., including LIKE and RLIKE) and in SORT command.The fix consists in passing one more parameter to
BinaryComparison
,Order
andRegexMatch
constructors (and all the subclasses) to define whether the instance should resolve on TEXT fields (for ESQL) or not (for SQL and EQL).Operators pushdown (both for filtering and sorting) is enabled only if the property also declares an exact multi-field that can be used as a fallback, otherwise the optimization gets skipped and the operation is executed in-line.
[edit]
Added a rule that replaces fields with their exact subfield when possible
Fixes #98642