SQL: Add optimisations for not-equalities #51088

bpintea · 2020-01-16T09:27:51Z

This PR adds optimisations of not-equalities in conjunctions and
disjunctions:

* for conjunctions, the not-equality can be optimized away when applied
  together with a range or inequality, if the not-equality point falls
  outside the domain of the latter condition; if it's on the border and
  the bound includes the equality, the bound is modified to exclude it;
  otherwise no optimisation can be applied;
* for disjunctions, the not-equals filters away the ranges and
  inequalities, unless these include an equality on the bound, in which
  case the entire condition becomes always true.

It also fixes the aggregation of inequalities into ranges.

Addresses #49637.

This commit adds optimisations of not-equalities in conjunctions and disjunctions:
* for conjunctions, the not-equality can be optimized away when applied together with a range or inequality, if the not-equality point falls outside the domain of the latter condition; if it's on the border and the bound includes the equality, the bound is modified to exclude it; otherwise no optimisation can be applied;
* for disjunctions, the not-equals filters away the ranges and inequalities, unless these include an equality on the bound, in which case the entire condition becomes always true.
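The conjunction rule can be illustrated with a small, self-contained sketch. It does not use the optimizer's own expression classes: `IntRange`, `simplify` and the integer-only values are hypothetical names and simplifications chosen for the example, so treat it as a model of the three cases (outside the range, on a bound, strictly inside) rather than as the code in this PR.

```java
final class NotEqualsAndRange {
    /** Hypothetical stand-in for a range condition: integer bounds, each inclusive or exclusive. */
    static final class IntRange {
        final int lower;
        final boolean includeLower;
        final int upper;
        final boolean includeUpper;

        IntRange(int lower, boolean includeLower, int upper, boolean includeUpper) {
            this.lower = lower;
            this.includeLower = includeLower;
            this.upper = upper;
            this.includeUpper = includeUpper;
        }

        boolean contains(int a) {
            boolean aboveLower = includeLower ? a >= lower : a > lower;
            boolean belowUpper = includeUpper ? a <= upper : a < upper;
            return aboveLower && belowUpper;
        }

        @Override
        public String toString() {
            return lower + (includeLower ? " <= a" : " < a") + (includeUpper ? " <= " : " < ") + upper;
        }
    }

    /**
     * Simplifies "a != x AND range" (illustrative only).
     * Returns a range equivalent to the whole conjunction, or null when x lies
     * strictly inside the range and the not-equality has to be kept.
     */
    static IntRange simplify(int x, IntRange range) {
        if (x < range.lower || x > range.upper) {
            return range; // a != 7 AND 2 <= a < 5: the not-equality is redundant
        }
        if (x == range.lower) {
            // a != 2 AND 2 <= a < 5 -> 2 < a < 5: make the lower bound exclusive
            return new IntRange(range.lower, false, range.upper, range.includeUpper);
        }
        if (x == range.upper) {
            // a != 5 AND 2 < a <= 5 -> 2 < a < 5: make the upper bound exclusive
            return new IntRange(range.lower, range.includeLower, range.upper, false);
        }
        return null; // a != 3 AND 2 <= a < 5: nothing to simplify
    }

    public static void main(String[] args) {
        IntRange range = new IntRange(2, true, 5, false);
        for (int x : new int[] {7, 2, 3}) {
            System.out.println("a != " + x + " AND " + range + " -> " + simplify(x, range));
        }
    }
}
```

A `null` result here plays the role of "no optimisation can be applied"; the real rule works on foldable expressions and compares values generically rather than on ints.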

This commit fixes the loop that aggregates inequalities into ranges:
- it won't advance the outer loop index in case of a merge, since the current element is removed;
- it will break the inner loop, since comparison against the element selected in the outer loop can't continue, as it has been removed.
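The two loop fixes can be sketched in isolation. The following is a simplified model, not the optimizer's code: `Bound` and `mergeIntoRanges` are hypothetical names, all bounds are assumed to be on the same field, and ranges are rendered as strings. It shows the pattern the commit message describes: after a merge the outer index is not advanced, and the inner loop is left immediately.

```java
import java.util.ArrayList;
import java.util.List;

final class MergeInequalities {
    /** Hypothetical one-sided comparison on a single field: "a > value" (lower) or "a < value" (upper). */
    static final class Bound {
        final boolean isLower;
        final int value;

        Bound(boolean isLower, int value) {
            this.isLower = isLower;
            this.value = value;
        }
    }

    /** Merges lower/upper pairs into ranges, removing the merged elements from the given list. */
    static List<String> mergeIntoRanges(List<Bound> bounds) {
        List<String> ranges = new ArrayList<>();
        for (int i = 0; i < bounds.size(); ) {
            Bound main = bounds.get(i);
            boolean merged = false;
            for (int j = i + 1; j < bounds.size(); j++) {
                Bound other = bounds.get(j);
                if (main.isLower != other.isLower) {
                    Bound lower = main.isLower ? main : other;
                    Bound upper = main.isLower ? other : main;
                    ranges.add(lower.value + " < a < " + upper.value);
                    bounds.remove(j); // remove the higher index first so that i stays valid
                    bounds.remove(i);
                    merged = true;
                    break; // 'main' is gone, so comparing against it can't continue
                }
            }
            if (!merged) {
                i++; // only advance when the current element is still in the list
            }
        }
        return ranges;
    }

    public static void main(String[] args) {
        List<Bound> bounds = new ArrayList<>();
        bounds.add(new Bound(true, 1));
        bounds.add(new Bound(false, 5));
        bounds.add(new Bound(true, 2));
        bounds.add(new Bound(false, 7));
        System.out.println(mergeIntoRanges(bounds) + ", left over: " + bounds.size());
    }
}
```

Advancing `i` after a merge would skip the element that has just slid into position `i`, which is the kind of missed merge the fix is meant to avoid.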

elasticmachine · 2020-01-16T09:27:53Z

Pinging @elastic/es-search (:Search/SQL)

x-pack/plugin/sql/src/main/java/org/elasticsearch/xpack/sql/optimizer/Optimizer.java

…timizer/Optimizer.java apply review suggestion Co-Authored-By: Marios Trivyzas <matriv@gmail.com>

matriv

Left a couple of comments, and want to revisit the implementation of the new optimisations.

x-pack/plugin/sql/src/main/java/org/elasticsearch/xpack/sql/optimizer/Optimizer.java

matriv · 2020-01-16T16:39:25Z

x-pack/plugin/sql/src/main/java/org/elasticsearch/xpack/sql/optimizer/Optimizer.java

@@ -1391,7 +1412,14 @@ private Expression combine(Or or) {
                }
            }

-            return changed ? Predicates.combineOr(CollectionUtils.combine(exps, bcs, ranges)) : or;
+            Boolean updated = filterDisjunctionByNotEquals(notEquals, bcs, ranges);


I would prefer the return type of the function to be boolean rather than Boolean, since false is never returned currently. Or, if I've misread this and false is also a valid return value, why do we return a literal with Boolean.TRUE below instead of using the updated variable there?

false will be returned in case the fields of a NotEquals and a BinaryComparison aren't semantically equal (and can't be simplified otherwise either; something like a != 2 OR b < 3).
So I'd need a three-state return: one for no optimisation possible, one for optimisation has been applied and another one for when the entire expression evaluates to literal TRUE.
I thought an enum might be too much, but happy to apply suggestions.
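For comparison, the enum alternative mentioned here might look roughly like the sketch below. The names `Outcome` and `filterByNotEquals`, and the string/int parameters, are hypothetical; the PR itself used a `Boolean`. The sketch only models `a != X OR field < bound` and follows the reading reached later in this review, where a not-equality point lying under the bound makes the disjunction always true.

```java
final class DisjunctionOutcomeSketch {
    /** The three states described above, spelled out instead of encoded in a Boolean. */
    enum Outcome { UNCHANGED, UPDATED, ALWAYS_TRUE }

    /** Models "neqField != neqValue OR cmpField < upperBound" (illustrative only). */
    static Outcome filterByNotEquals(String neqField, int neqValue, String cmpField, int upperBound) {
        if (!neqField.equals(cmpField)) {
            return Outcome.UNCHANGED;   // a != 2 OR b < 3: different fields, nothing to do
        }
        if (neqValue < upperBound) {
            return Outcome.ALWAYS_TRUE; // a != 2 OR a < 3: a = 2 is matched by a < 3
        }
        return Outcome.UPDATED;         // a != 4 OR a < 3: the comparison is redundant and can be dropped
    }

    static String describe(Outcome outcome) {
        switch (outcome) {
            case ALWAYS_TRUE:
                return "replace the whole disjunction with literal TRUE";
            case UPDATED:
                return "rebuild the disjunction from the remaining expressions";
            default:
                return "return the original Or";
        }
    }

    public static void main(String[] args) {
        System.out.println(describe(filterByNotEquals("a", 2, "b", 3)));
        System.out.println(describe(filterByNotEquals("a", 2, "a", 3)));
        System.out.println(describe(filterByNotEquals("a", 4, "a", 3)));
    }
}
```

The trade-off is the one raised in the comment: an enum makes the three cases explicit at the call site, at the cost of one more type for a private helper.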

Ok, thank you, I missed that somehow.

…timizer/Optimizer.java apply review suggestion. Co-Authored-By: Marios Trivyzas <matriv@gmail.com>

matriv

LGTM

astefan

I went through half of the code changes and I have a concern regarding the OR simplification, from a scoring point of view. And, thinking a bit more about this, I think the AND simplification also affects scoring.

astefan · 2020-01-21T07:00:55Z

x-pack/plugin/sql/src/main/java/org/elasticsearch/xpack/sql/optimizer/Optimizer.java

+
+                            return true;
+                        } else { // comp > 0 : a != 4 AND 2 < a < ? : can only remove NotEquals if outside the range
+                            comp = range.upper().foldable() ? BinaryComparison.compare(neqVal, range.upper().fold()) : null;


I find re-using the comp variable in this block a bit confusing, and potentially problematic if this code changes in the future. It is first assigned at an upper level and goes through two if branches; then, in a separate, distinct conditional branch at an inner level, it gets another value and is treated similarly to before. I am not suggesting changing this, I am just mentioning it as a nitpick.

The result of a comp assignment is used right away (i.e. on the next line) and only there, so I considered it unnecessary to declare multiple vars. But I can see how this can be regarded with unease. I'd think that if the usage pattern ever changes, a new var is indeed preferable.

astefan · 2020-01-21T07:23:55Z

x-pack/plugin/sql/src/main/java/org/elasticsearch/xpack/sql/optimizer/Optimizer.java

+                for (Iterator<BinaryComparison> bcIterator = bcs.iterator(); bcIterator.hasNext(); ) {
+                    BinaryComparison bc = bcIterator.next();
+                    if (neq.left().semanticEquals(bc.left())) {
+                        if (bc instanceof LessThan || bc instanceof GreaterThan) { // a != 2 OR a < 3 -> a != 2 (plus LessThen)


I don't understand this simplification here. If the not-equality and a < or > act on the same variable (a in this case) why is a != 2 OR a < 3 resulting in a a != 2 condition?

One edge case is a = 2 and in this case it matches a < 3. The other edge case is anything >= 3 and in this case it matches the condition a != 2. The way I see it, a != 2 OR a < 3 is always TRUE.

BUT, in Elasticsearch, the mere presence of these two conditions in a query (even if they can be simplified) can contribute to the scoring of a document: an OR is translated into a bool query with two should clauses. So even if the documents returned are the same with and without the simplification (assuming the simplification returns always TRUE), if the user also asks for SCORE(), the actual results will differ between the two cases. I would take a step back and re-evaluate the OR simplification in general to make sure it doesn't affect the scoring of documents, in case SCORE() is returned.

One edge case is a = 2 and in this case it matches a < 3. The other edge case is anything >= 3 and in this case it matches the condition a != 2. The way I see it, a != 2 OR a < 3 is always TRUE.

That's actually TRUE, thanks for catching it!
I believe the correct optimisation for a != X OR Y </<= a </<= Z should be to:

* a != X if:
  - X ∉ (Y, Z); or
  - bound equalities are disjunctive: X==Y && (Y < a ... ) or X==Z && (... a < Z);
* TRUE in all other cases.

And same goes for inequalities, where Y or Z would become -∞ or +∞ respectively.
I'll fix it.
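The corrected rule can be checked mechanically. The sketch below is an illustration with made-up names (`NotEqualsOrRange`, `simplify`, `agreesWithOriginal`) and plain doubles instead of optimizer expressions; it condenses the cases above into a single test, namely whether X itself satisfies the range, and then compares the simplified form against the original disjunction over a handful of sample values.

```java
final class NotEqualsOrRange {
    enum Result { NOT_EQUALS_ONLY, ALWAYS_TRUE }

    /** Whether a satisfies "y </<= a </<= z"; use infinities for one-sided inequalities. */
    static boolean inRange(double a, double y, boolean includeY, double z, boolean includeZ) {
        boolean aboveLower = includeY ? a >= y : a > y;
        boolean belowUpper = includeZ ? a <= z : a < z;
        return aboveLower && belowUpper;
    }

    /**
     * Simplifies "a != x OR range": if x satisfies the range, every value matches
     * one side or the other; otherwise the range is a subset of "a != x".
     */
    static Result simplify(double x, double y, boolean includeY, double z, boolean includeZ) {
        return inRange(x, y, includeY, z, includeZ) ? Result.ALWAYS_TRUE : Result.NOT_EQUALS_ONLY;
    }

    /** Brute-force comparison of the original and simplified conditions over sample values of a. */
    static boolean agreesWithOriginal(double x, double y, boolean includeY, double z, boolean includeZ) {
        Result result = simplify(x, y, includeY, z, includeZ);
        for (double a = -10; a <= 10; a += 0.5) {
            boolean original = a != x || inRange(a, y, includeY, z, includeZ);
            boolean simplified = result == Result.ALWAYS_TRUE || a != x;
            if (original != simplified) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        double negInf = Double.NEGATIVE_INFINITY;
        System.out.println("a != 2 OR a < 3      -> " + simplify(2, negInf, false, 3, false));
        System.out.println("a != 4 OR a < 3      -> " + simplify(4, negInf, false, 3, false));
        System.out.println("a != 3 OR 1 < a < 3  -> " + simplify(3, 1, false, 3, false));
        System.out.println("a != 3 OR 1 < a <= 3 -> " + simplify(3, 1, false, 3, true));
    }
}
```

This only covers logical equivalence; as the rest of the thread discusses, equivalent filters can still produce different scores.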

I would take a step back and re-evaluate the OR simplification in general to make sure it doesn't affect the scoring of documents, in case SCORE() is returned.

Interesting point. Do you think it'd be easy (or even necessary) to come up with an example? This being about numerical conditions optimisations, I'm wondering if the scoring would be indeed affected, in isolation: a != 2 OR a > 3 OR b = 'foo' vs. a != 2 OR b = 'foo' would yield different scores?
I'll look into it (unless you're positive about it and I can save that time). Others - @costin? - welcome to chime in, I believe the simplifications in the optimiser are not isolated to equalities and inequalities around AND/ORs.

@bpintea Take select SCORE(), salary from test_emp where salary <> 70000 OR salary < 30000 ORDER BY salary, which is an instance of a != X OR a < Y. The results look like this (tested in 6.8, but I am reasonably confident it's the same in master):

    SCORE()    |    salary
---------------+---------------
2.0            |25324
2.0            |25945
2.0            |25976
2.0            |26436
2.0            |27215
2.0            |28035
2.0            |28336
2.0            |28941
2.0            |29175
1.0            |30404
1.0            |31120
1.0            |31897
1.0            |32263
1.0            |32272
1.0            |32568
1.0            |33370
1.0            |33956
1.0            |34341
1.0            |35222
1.0            |35742
1.0            |36051
1.0            |36174
1.0            |37112
1.0            |37137
1.0            |37691
1.0            |37702
1.0            |37716
1.0            |37853

If you simplify those two conditions to TRUE, there will be a single score of 1.0 I think.
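A toy model reproduces the shape of these numbers. It is not how Lucene or Elasticsearch actually computes scores: the assumption, made only to mirror the results quoted in this comment, is that every matching should clause contributes a constant 1.0. `ShouldClauseScoreSketch` and `score` are hypothetical names.

```java
import java.util.function.IntPredicate;

final class ShouldClauseScoreSketch {
    /** Toy scoring: each matching "should" clause adds a constant 1.0 (an assumption, not Lucene's formula). */
    static double score(int salary, IntPredicate... shouldClauses) {
        double score = 0.0;
        for (IntPredicate clause : shouldClauses) {
            if (clause.test(salary)) {
                score += 1.0;
            }
        }
        return score;
    }

    public static void main(String[] args) {
        IntPredicate notEquals = s -> s != 70000; // salary <> 70000
        IntPredicate lessThan = s -> s < 30000;   // salary < 30000
        IntPredicate alwaysTrue = s -> true;      // the two conditions folded into TRUE

        for (int salary : new int[] {25324, 30404}) {
            System.out.println(salary
                + " original: " + score(salary, notEquals, lessThan)
                + " simplified: " + score(salary, alwaysTrue));
        }
    }
}
```

Under this assumption the original query separates salaries below 30000 from the rest, while the simplified one gives every row the same score, which is the difference being pointed out.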

Thanks.
It seems indeed that other (existing) optimisations would also impact the scoring. (Something like WHERE salary < 30000 OR salary < 50000 yields the same scores as without the first condition; merging inequalities into ranges would probably also affect it, etc.)

So I guess the presence of scoring might require deeper consideration with regard to deploying the optimiser?

@astefan has a good point. The optimizations for AND should be safe, since you either pass all of them or none, thus there's no scoring.
However with OR one result can match multiple queries hence why I would trim the optimizations to a minimum.
We could add awareness regarding scoring and thus enable some optimizations only when no scoring is needed; however, I think that adds additional complexity for unclear gains, thus I would spend the effort somewhere else.

However with OR one result can match multiple queries hence why I would trim the optimizations to a minimum.

Sure, I'll then trim the extras.

We could add awareness regarding scoring and thus enable some optimizations only when no scoring is needed; however, I think that adds additional complexity for unclear gains, thus I would spend the effort somewhere else.

Got it. I guess machine-generated queries - maybe also BIs - can produce very verbose statements, but it's true that the optimisations can be an obscure ground and maybe not worth it if users tend to go for the scores as well.

Would opening a ticket to track/remove existing optimisation be worth it?

elasticsearch/x-pack/plugin/sql/src/main/java/org/elasticsearch/xpack/sql/optimizer/Optimizer.java

Line 1367 in 9396e75

private Expression combine(Or or) {

elasticsearch/x-pack/plugin/sql/src/main/java/org/elasticsearch/xpack/sql/optimizer/Optimizer.java

Line 1435 in 9396e75

// (1 < a < 3) OR (2 < a < 3) -> (1 < a < 3)

potentially others?

Would opening a ticket to track/remove existing optimisation be worth it?

Please go ahead - it's better to have an issue that we close than forget about it because there's no issue.

Revert the optimisations of inequalities and ranges against not-equals, since this can influence the scoring.

bpintea · 2020-01-23T21:07:34Z

@elasticmachine merge upstream

astefan

LGTM. Left one comment related to some of the tests.

astefan · 2020-01-29T11:56:30Z

x-pack/plugin/sql/src/test/java/org/elasticsearch/xpack/sql/optimizer/OptimizerTests.java

+        And and = new And(EMPTY, neq, lte);
+
+        CombineBinaryComparisons rule = new CombineBinaryComparisons();
+        Expression exp = rule.rule(and);
+        assertEquals(And.class, exp.getClass()); // can't optimize


Wouldn't it be more correct here to also assert that the expression resulting from applying the rule is the same And as before? Meaning, is it enough to check only that the type of the expression is And?

That'd be a stricter check than necessary: if the evaluation built another And instance, the test would still be correct.
OTOH, given the implementation, it's safe to assume that it should be the same And instance, so I've taken your suggestion.
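The difference between the two checks can be shown without JUnit. In the sketch below, `Expr`, `And`, `unchangedRule` and `rebuildingRule` are hypothetical stand-ins, not the classes under test: a class-based check accepts both rules, while an identity check (what JUnit's `assertSame` would express) accepts only the rule that returns its input untouched.

```java
final class SameInstanceCheckSketch {
    /** Hypothetical minimal expression hierarchy. */
    static class Expr {}

    static final class And extends Expr {
        final Expr left;
        final Expr right;

        And(Expr left, Expr right) {
            this.left = left;
            this.right = right;
        }
    }

    /** A rule that can't optimise its input and hands back the very same node. */
    static Expr unchangedRule(Expr e) {
        return e;
    }

    /** A rule that rebuilds an equivalent And: same type, different instance. */
    static Expr rebuildingRule(Expr e) {
        if (e instanceof And) {
            And and = (And) e;
            return new And(and.left, and.right);
        }
        return e;
    }

    public static void main(String[] args) {
        And and = new And(new Expr(), new Expr());
        Expr same = unchangedRule(and);
        Expr rebuilt = rebuildingRule(and);

        System.out.println("class check:    " + (same.getClass() == And.class) + " / " + (rebuilt.getClass() == And.class));
        System.out.println("identity check: " + (same == and) + " / " + (rebuilt == and));
    }
}
```

The identity check is the tighter assumption: it would also fail if a future implementation rebuilt an equivalent node, which is the trade-off weighed in the reply above.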

costin

LGTM

- make one test assumption tighter

…timisations

* Optimize not-equalities in con-/disjunctions

This commit adds optimisations of not-equalities in conjunctions and disjunctions:
* for conjunctions, the not-equality can be optimized away when applied together with a range or inequality, if the not-equality point falls outside the domain of the latter condition; if it's on the border and the bound includes the equality, the bound is modified to exclude it; otherwise no optimisation can be applied;
* for disjunctions, the not-equals could filter away the ranges and inequalities, unless these include an equality on the bound, in which case the entire condition becomes always true, but this would influence the score() function, so it's been omitted;

* fix aggregations of inequalities in ranges

This commit fixes the loop that aggregates inequalities into ranges:
- it won't advance the outer loop index in case of a merge, since the current element is removed;
- it will break the inner loop, since comparison against the element selected in the outer loop can't continue, as it has been removed.

(cherry picked from commit 789724a)

bpintea added 2 commits January 16, 2020 10:06

bpintea added >feature :Analytics/SQL SQL querying v8.0.0 v7.7.0 labels Jan 16, 2020

bpintea requested review from costin, astefan and matriv January 16, 2020 09:27

matriv reviewed Jan 16, 2020


Update x-pack/plugin/sql/src/main/java/org/elasticsearch/xpack/sql/op…

23e3aac

…timizer/Optimizer.java apply review suggestion Co-Authored-By: Marios Trivyzas <matriv@gmail.com>

matriv reviewed Jan 16, 2020


Update x-pack/plugin/sql/src/main/java/org/elasticsearch/xpack/sql/op…

4b7fc42

…timizer/Optimizer.java apply review suggestion. Co-Authored-By: Marios Trivyzas <matriv@gmail.com>

matriv approved these changes Jan 17, 2020


astefan reviewed Jan 21, 2020


remove optimisations of ORs influencing scoring

2926328

Revert the optimisations of inequalities and ranges against not-equals, since this can influence the scoring.

Merge branch 'master' into enh/not-equals-optimisations

4bb8899

bpintea requested a review from astefan January 29, 2020 11:14

astefan approved these changes Jan 29, 2020


costin approved these changes Jan 29, 2020


bpintea added 2 commits January 30, 2020 18:26

address review comments

d63e4f4

- make one test assumption tighter

Merge remote-tracking branch 'upstream/master' into enh/not-equals-op…

f83c98a

…timisations

bpintea merged commit 789724a into elastic:master Jan 30, 2020

bpintea mentioned this pull request Jan 30, 2020

SQL: Add optimisations for not-equalities (#51088) #51700

Merged

bpintea deleted the enh/not-equals-optimisations branch January 30, 2020 22:28

bpintea mentioned this pull request Feb 3, 2020

SQL: Improve optimization of binary comparisons and equal #49637

Closed

codebrain mentioned this pull request Apr 1, 2020

7.7.0 meta ticket elastic/elasticsearch-net#4525

Closed

38 tasks

jakelandis added v8.0.0-alpha1 and removed v8.0.0 labels Jul 26, 2021
