Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SQL: Add optimisations for not-equalities #51088

Merged
merged 8 commits into from
Jan 30, 2020

Conversation

bpintea
Copy link
Contributor

@bpintea bpintea commented Jan 16, 2020

This PR adds optimisations of not-equalities in conjunctions and
disjunctions:

  • for conjunctions, the not-equality can be optimized away when applied
    together with a range or inequality, in case the not-equality point
    falls outside the domain of the later condition; if its on the boarder,
    it will modify the bound, to simply exclude the equality, if present;
    otherwise no optimisation can be applied;
  • for disjunctions, the not-equals filters away the ranges and
    inequalities, unless these include an equality on the bound, in which
    case the entire condition becomes always true.

It also fixes the aggregation of inequalities into ranges.

Adresses #49637.

This commit adds optimisations of not-equalities in conjunctions and
disjunctions:
* for conjunctions, the not-equality can be optimized away when applied
together with a range or inequality, in case the not-equality point
falls outside the domain of the later condition; if its on the boarder,
it will modify the bound, to simply exclude the equality, if present;
otherwise no optimisation can be applied;
* for disjunctions, the not-equals filters away the ranges and
inequalities, unless these include an equality on the bound, in which
case the entire condition becomes always true.
This commit fixes the loop that aggregates inequalities into ranges:
- it won't advance the outer loop index in case of a merge, since the
current element is removed;
- it will break the inner loop, since comparision against the element
selected in the outer loop can't continue, as it had been removed.
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-search (:Search/SQL)

…timizer/Optimizer.java


apply review suggestion

Co-Authored-By: Marios Trivyzas <matriv@gmail.com>
Copy link
Contributor

@matriv matriv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a couple of comments, and want to revisit the implementation of the new optimisations.

@@ -1391,7 +1412,14 @@ private Expression combine(Or or) {
}
}

return changed ? Predicates.combineOr(CollectionUtils.combine(exps, bcs, ranges)) : or;
Boolean updated = filterDisjunctionByNotEquals(notEquals, bcs, ranges);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer the return value of the function to be boolean and not Boolean since false is never returned currently. Or if I'm confused if false is also a valid return value then why do we return a below literal with Boolean.TRUE instead of using the updated variable there?

Copy link
Contributor Author

@bpintea bpintea Jan 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

false will be returned in case the fields of a NotEquals and a BinaryComparison aren't semantically equal (and can't be simplified otherwise either; smth like a != 2 OR b < 3).
So I'd need a three states-return: one for no optimisation possible, one for optimisation has been applied and another one when the entire expression evaluates to literal TRUE.
I thought an enum might be too much, but happy to apply suggestions.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, thank you, I missed that somehow.

…timizer/Optimizer.java


apply review suggestion.

Co-Authored-By: Marios Trivyzas <matriv@gmail.com>
Copy link
Contributor

@matriv matriv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@astefan astefan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went through half of the code changes and I have a concern regarding OR simplification, from scoring point of view. And, thinking a bit more about this, I think the AND simplification also affects scoring.


return true;
} else { // comp > 0 : a != 4 AND 2 < a < ? : can only remove NotEquals if outside the range
comp = range.upper().foldable() ? BinaryComparison.compare(neqVal, range.upper().fold()) : null;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find a bit confusing and potentially problematic, in the future if this code will change, re-using the comp variable in this block. Initially, it has a value assigned to it in on an upper level, it goes through two if branches, then in a separate, distinct conditional branch on an inner level gets another value and is treated similarly as before. I am not suggesting to change this, I am just mentioning it as a nitpick.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The result of a comp assignment is used right away (i.e. next line) and there only. So I considered unnecessary to declare multiple vars. But I can see how this can be regarded with unease. I'd think that if the usage pattern ever changes, a new var is indeed preferable.

for (Iterator<BinaryComparison> bcIterator = bcs.iterator(); bcIterator.hasNext(); ) {
BinaryComparison bc = bcIterator.next();
if (neq.left().semanticEquals(bc.left())) {
if (bc instanceof LessThan || bc instanceof GreaterThan) { // a != 2 OR a < 3 -> a != 2 (plus LessThen)
Copy link
Contributor

@astefan astefan Jan 21, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this simplification here. If the not-equality and a < or > act on the same variable (a in this case) why is a != 2 OR a < 3 resulting in a a != 2 condition?

One edge case is a = 2 and in this case it matches a < 3. The other edge case is anything >= 3 and in this case it matches the condition a != 2. The way I see it, a != 2 OR a < 3 is always TRUE.

BUT, in Elasticsearch these two conditions being existent in a query (even if they can be simplified), they can contribute to the scoring of a document: an OR is translated into a bool query with two should statements. And, even if, the documents being returned are the same in a simplified (assuming the simplification returns always TRUE) and non-simplified way, if the user also asks for SCORE(), the actual results will be different in the two cases. I would take a step back and re-evaluate the OR simplification in general to make sure it doesn't affect the scoring of documents, in case SCORE() is returned.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One edge case is a = 2 and in this case it matches a < 3. The other edge case is anything >= 3 and in this case it matches the condition a != 2. The way I see it, a != 2 OR a < 3 is always TRUE.

That's actually TRUE, thanks for catching it!
I believe the correct optimisation for a != X OR Y </<= a </<= Z should be to:

  • a != X if:
    • X ∉ (Y, Z); or
    • bound equalities are disjunctive: X==Y && (Y < a ... ) or X==Z && (... a < Z);
  • TRUE in all other cases.

And same goes for inequalities, where Y or Z would become -∞ or +∞ respectively.
I'll fix it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would take a step back and re-evaluate the OR simplification in general to make sure it doesn't affect the scoring of documents, in case SCORE() is returned.

Interesting point. Do you think it'd be easy (or even necessary) to come up with an example? This being about numerical conditions optimisations, I'm wondering if the scoring would be indeed affected, in isolation: a != 2 OR a > 3 OR b = 'foo' vs. a != 2 OR b = 'foo' would yield different scores?
I'll look into it (unless you're positive about it and I can save that time). Others - @costin? - welcome to chime in, I believe the simplifications in the optimiser are not isolated to equalities and inequalities around AND/ORs.

Copy link
Contributor

@astefan astefan Jan 21, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bpintea select SCORE(), salary from test_emp where salary <> 70000 OR salary < 30000 ORDER BY salary which is the case for a != X OR a < Y. The results are like this (tested this in 6.8, but I am reasonably confident it's the same in master):

    SCORE()    |    salary     
---------------+---------------
2.0            |25324          
2.0            |25945          
2.0            |25976          
2.0            |26436          
2.0            |27215          
2.0            |28035          
2.0            |28336          
2.0            |28941          
2.0            |29175          
1.0            |30404          
1.0            |31120          
1.0            |31897          
1.0            |32263          
1.0            |32272          
1.0            |32568          
1.0            |33370          
1.0            |33956          
1.0            |34341          
1.0            |35222          
1.0            |35742          
1.0            |36051          
1.0            |36174          
1.0            |37112          
1.0            |37137          
1.0            |37691          
1.0            |37702          
1.0            |37716          
1.0            |37853          

If you simplify those two conditions to TRUE, there will be a single score of 1.0 I think.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks.
It seems indeed that other (existing) optimisations would also impact the scoring. (Smth like WHERE salary < 30000 OR salary < 50000 yield same scores as without the first condition, merging inequalities into ranges would probably also affect it etc.)

So I guess the presence of scoring might require a deeper consideration in regards to deploying the optimiser?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@astefan has a good point. The optimizations for AND should be safe since either you pass all o them or none - thus there's no scoring.
However with OR one result can match multiple queries hence why I would trim the optimizations to a minimum.
We could add awarness regarding scoring and thus enable some optimizations only when no scoring is needed however I think that adds additional complexity for unclear gains thus I would spend the effort somewhere else.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However with OR one result can match multiple queries hence why I would trim the optimizations to a minimum.

Sure, I'll then trim the extras.

We could add awarness regarding scoring and thus enable some optimizations only when no scoring is needed however I think that adds additional complexity for unclear gains thus I would spend the effort somewhere else.

Got it. I guess machine-generated queries - maybe also BIs - can produce very verbose statements, but it's true that the optimisations can be an obscure ground and maybe not worth it if users tend to go for the scores as well.

Would opening a ticket to track/remove existing optimisation be worth it?



potentially others?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would opening a ticket to track/remove existing optimisation be worth it?

Please go ahead - it's better to have an issue that we close than forget about it because there's no issue.

Revert the optimisations of inequalities and ranges against not-equals,
since this can influence the scoring.
@bpintea
Copy link
Contributor Author

bpintea commented Jan 23, 2020

@elasticmachine merge upstream

@bpintea bpintea requested a review from astefan January 29, 2020 11:14
Copy link
Contributor

@astefan astefan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Left one comment related to some of the tests.

Comment on lines 1402 to 1406
And and = new And(EMPTY, neq, lte);

CombineBinaryComparisons rule = new CombineBinaryComparisons();
Expression exp = rule.rule(and);
assertEquals(And.class, exp.getClass()); // can't optimize
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't here be more correct to also assert that the expression resulted from applying the rule is the same And as before? Meaning, is it enough to assume the type of the expression is And?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That'd be a stricter check than necessary: if the evaluation would build another And instance, the test would still be correct.
Otoh, given the implementation, it's safe to assume that it should be the same And instance, so I've taken on your suggestion.

Copy link
Member

@costin costin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@bpintea bpintea merged commit 789724a into elastic:master Jan 30, 2020
@bpintea bpintea deleted the enh/not-equals-optimisations branch January 30, 2020 22:28
bpintea added a commit that referenced this pull request Jan 30, 2020
* Optimize not-equalities in con-/disjunctions

This commit adds optimisations of not-equalities in conjunctions and
disjunctions:
* for conjunctions, the not-equality can be optimized away when applied
together with a range or inequality, in case the not-equality point
falls outside the domain of the later condition; if its on the boarder,
it will modify the bound, to simply exclude the equality, if present;
otherwise no optimisation can be applied;
* for disjunctions, the not-equals could filter away the ranges and
inequalities, unless these include an equality on the bound, in which
case the entire condition becomes always true, but this would influence
the score() function, so it's been omitted;

* fix aggregations of inequalities in ranges

This commit fixes the loop that aggregates inequalities into ranges:
- it won't advance the outer loop index in case of a merge, since the
current element is removed;
- it will break the inner loop, since comparision against the element
selected in the outer loop can't continue, as it had been removed.



(cherry picked from commit 789724a)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants