-
Notifications
You must be signed in to change notification settings - Fork 24.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SQL: Add optimisations for not-equalities #51088
Conversation
This commit adds optimisations of not-equalities in conjunctions and disjunctions: * for conjunctions, the not-equality can be optimized away when applied together with a range or inequality, in case the not-equality point falls outside the domain of the later condition; if its on the boarder, it will modify the bound, to simply exclude the equality, if present; otherwise no optimisation can be applied; * for disjunctions, the not-equals filters away the ranges and inequalities, unless these include an equality on the bound, in which case the entire condition becomes always true.
This commit fixes the loop that aggregates inequalities into ranges: - it won't advance the outer loop index in case of a merge, since the current element is removed; - it will break the inner loop, since comparision against the element selected in the outer loop can't continue, as it had been removed.
Pinging @elastic/es-search (:Search/SQL) |
x-pack/plugin/sql/src/main/java/org/elasticsearch/xpack/sql/optimizer/Optimizer.java
Outdated
Show resolved
Hide resolved
…timizer/Optimizer.java apply review suggestion Co-Authored-By: Marios Trivyzas <matriv@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left a couple of comments, and want to revisit the implementation of the new optimisations.
x-pack/plugin/sql/src/main/java/org/elasticsearch/xpack/sql/optimizer/Optimizer.java
Outdated
Show resolved
Hide resolved
@@ -1391,7 +1412,14 @@ private Expression combine(Or or) { | |||
} | |||
} | |||
|
|||
return changed ? Predicates.combineOr(CollectionUtils.combine(exps, bcs, ranges)) : or; | |||
Boolean updated = filterDisjunctionByNotEquals(notEquals, bcs, ranges); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer the return value of the function to be boolean
and not Boolean
since false is never returned currently. Or if I'm confused if false is also a valid return value then why do we return a below literal with Boolean.TRUE
instead of using the updated
variable there?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
false
will be returned in case the fields of a NotEquals
and a BinaryComparison
aren't semantically equal (and can't be simplified otherwise either; smth like a != 2 OR b < 3
).
So I'd need a three states-return: one for no optimisation possible, one for optimisation has been applied and another one when the entire expression evaluates to literal TRUE
.
I thought an enum might be too much, but happy to apply suggestions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, thank you, I missed that somehow.
…timizer/Optimizer.java apply review suggestion. Co-Authored-By: Marios Trivyzas <matriv@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went through half of the code changes and I have a concern regarding OR
simplification, from scoring point of view. And, thinking a bit more about this, I think the AND
simplification also affects scoring.
|
||
return true; | ||
} else { // comp > 0 : a != 4 AND 2 < a < ? : can only remove NotEquals if outside the range | ||
comp = range.upper().foldable() ? BinaryComparison.compare(neqVal, range.upper().fold()) : null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find a bit confusing and potentially problematic, in the future if this code will change, re-using the comp
variable in this block. Initially, it has a value assigned to it in on an upper level, it goes through two if
branches, then in a separate, distinct conditional branch on an inner level gets another value and is treated similarly as before. I am not suggesting to change this, I am just mentioning it as a nitpick.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The result of a comp
assignment is used right away (i.e. next line) and there only. So I considered unnecessary to declare multiple vars. But I can see how this can be regarded with unease. I'd think that if the usage pattern ever changes, a new var is indeed preferable.
for (Iterator<BinaryComparison> bcIterator = bcs.iterator(); bcIterator.hasNext(); ) { | ||
BinaryComparison bc = bcIterator.next(); | ||
if (neq.left().semanticEquals(bc.left())) { | ||
if (bc instanceof LessThan || bc instanceof GreaterThan) { // a != 2 OR a < 3 -> a != 2 (plus LessThen) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand this simplification here. If the not-equality and a <
or >
act on the same variable (a
in this case) why is a != 2 OR a < 3
resulting in a a != 2
condition?
One edge case is a = 2
and in this case it matches a < 3
. The other edge case is anything >= 3
and in this case it matches the condition a != 2
. The way I see it, a != 2 OR a < 3
is always TRUE.
BUT, in Elasticsearch these two conditions being existent in a query (even if they can be simplified), they can contribute to the scoring of a document: an OR
is translated into a bool
query with two should
statements. And, even if, the documents being returned are the same in a simplified (assuming the simplification returns always TRUE) and non-simplified way, if the user also asks for SCORE(), the actual results will be different in the two cases. I would take a step back and re-evaluate the OR
simplification in general to make sure it doesn't affect the scoring of documents, in case SCORE() is returned.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One edge case is
a = 2
and in this case it matchesa < 3
. The other edge case isanything >= 3
and in this case it matches the conditiona != 2
. The way I see it,a != 2 OR a < 3
is alwaysTRUE
.
That's actually TRUE
, thanks for catching it!
I believe the correct optimisation for a != X OR Y </<= a </<= Z
should be to:
a != X
if:- X ∉ (Y, Z); or
- bound equalities are disjunctive:
X==Y && (Y < a ... )
orX==Z && (... a < Z)
;
TRUE
in all other cases.
And same goes for inequalities, where Y or Z would become -∞ or +∞ respectively.
I'll fix it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would take a step back and re-evaluate the OR simplification in general to make sure it doesn't affect the scoring of documents, in case SCORE() is returned.
Interesting point. Do you think it'd be easy (or even necessary) to come up with an example? This being about numerical conditions optimisations, I'm wondering if the scoring would be indeed affected, in isolation: a != 2 OR a > 3 OR b = 'foo'
vs. a != 2 OR b = 'foo'
would yield different scores?
I'll look into it (unless you're positive about it and I can save that time). Others - @costin? - welcome to chime in, I believe the simplifications in the optimiser are not isolated to equalities and inequalities around AND
/OR
s.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bpintea select SCORE(), salary from test_emp where salary <> 70000 OR salary < 30000 ORDER BY salary
which is the case for a != X OR a < Y
. The results are like this (tested this in 6.8, but I am reasonably confident it's the same in master
):
SCORE() | salary
---------------+---------------
2.0 |25324
2.0 |25945
2.0 |25976
2.0 |26436
2.0 |27215
2.0 |28035
2.0 |28336
2.0 |28941
2.0 |29175
1.0 |30404
1.0 |31120
1.0 |31897
1.0 |32263
1.0 |32272
1.0 |32568
1.0 |33370
1.0 |33956
1.0 |34341
1.0 |35222
1.0 |35742
1.0 |36051
1.0 |36174
1.0 |37112
1.0 |37137
1.0 |37691
1.0 |37702
1.0 |37716
1.0 |37853
If you simplify those two conditions to TRUE
, there will be a single score of 1.0 I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks.
It seems indeed that other (existing) optimisations would also impact the scoring. (Smth like WHERE salary < 30000 OR salary < 50000
yield same scores as without the first condition, merging inequalities into ranges would probably also affect it etc.)
So I guess the presence of scoring might require a deeper consideration in regards to deploying the optimiser?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@astefan has a good point. The optimizations for AND
should be safe since either you pass all o them or none - thus there's no scoring.
However with OR
one result can match multiple queries hence why I would trim the optimizations to a minimum.
We could add awarness regarding scoring and thus enable some optimizations only when no scoring is needed however I think that adds additional complexity for unclear gains thus I would spend the effort somewhere else.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
However with OR one result can match multiple queries hence why I would trim the optimizations to a minimum.
Sure, I'll then trim the extras.
We could add awarness regarding scoring and thus enable some optimizations only when no scoring is needed however I think that adds additional complexity for unclear gains thus I would spend the effort somewhere else.
Got it. I guess machine-generated queries - maybe also BIs - can produce very verbose statements, but it's true that the optimisations can be an obscure ground and maybe not worth it if users tend to go for the scores as well.
Would opening a ticket to track/remove existing optimisation be worth it?
elasticsearch/x-pack/plugin/sql/src/main/java/org/elasticsearch/xpack/sql/optimizer/Optimizer.java
Line 1367 in 9396e75
private Expression combine(Or or) { |
elasticsearch/x-pack/plugin/sql/src/main/java/org/elasticsearch/xpack/sql/optimizer/Optimizer.java
Line 1435 in 9396e75
// (1 < a < 3) OR (2 < a < 3) -> (1 < a < 3) |
potentially others?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would opening a ticket to track/remove existing optimisation be worth it?
Please go ahead - it's better to have an issue that we close than forget about it because there's no issue.
Revert the optimisations of inequalities and ranges against not-equals, since this can influence the scoring.
@elasticmachine merge upstream |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Left one comment related to some of the tests.
And and = new And(EMPTY, neq, lte); | ||
|
||
CombineBinaryComparisons rule = new CombineBinaryComparisons(); | ||
Expression exp = rule.rule(and); | ||
assertEquals(And.class, exp.getClass()); // can't optimize |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't here be more correct to also assert that the expression resulted from applying the rule is the same And as before? Meaning, is it enough to assume the type of the expression is And
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That'd be a stricter check than necessary: if the evaluation would build another And instance, the test would still be correct.
Otoh, given the implementation, it's safe to assume that it should be the same And instance, so I've taken on your suggestion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
- make one test assumption tighter
* Optimize not-equalities in con-/disjunctions This commit adds optimisations of not-equalities in conjunctions and disjunctions: * for conjunctions, the not-equality can be optimized away when applied together with a range or inequality, in case the not-equality point falls outside the domain of the later condition; if its on the boarder, it will modify the bound, to simply exclude the equality, if present; otherwise no optimisation can be applied; * for disjunctions, the not-equals could filter away the ranges and inequalities, unless these include an equality on the bound, in which case the entire condition becomes always true, but this would influence the score() function, so it's been omitted; * fix aggregations of inequalities in ranges This commit fixes the loop that aggregates inequalities into ranges: - it won't advance the outer loop index in case of a merge, since the current element is removed; - it will break the inner loop, since comparision against the element selected in the outer loop can't continue, as it had been removed. (cherry picked from commit 789724a)
This PR adds optimisations of not-equalities in conjunctions and
disjunctions:
together with a range or inequality, in case the not-equality point
falls outside the domain of the later condition; if its on the boarder,
it will modify the bound, to simply exclude the equality, if present;
otherwise no optimisation can be applied;
for disjunctions, the not-equals filters away the ranges andinequalities, unless these include an equality on the bound, in which
case the entire condition becomes always true.
It also fixes the aggregation of inequalities into ranges.
Adresses #49637.