SQL: Fix ORDER BY on aggregates and GROUPed BY fields #51894

matriv · 2020-02-04T22:29:47Z

Previously, in the in-memory sorting module
LocalAggregationSorterListener, only the aggregate functions were used
(grabbed by the sortingColumns). As a consequence, if the ORDER BY
also used columns of the GROUP BY clause (especially with higher
priority, i.e. before the aggregate functions), wrong results were
produced. E.g.:

SELECT gender, MAX(salary) AS max FROM test_emp
GROUP BY gender
ORDER BY gender, max

Add all columns of the ORDER BY to the sortingColumns so that the
LocalAggregationSorterListener can use the correct comparators in
the underlying PriorityQueue used to implement the in-memory sorting.

Fixes: #50355

elasticmachine · 2020-02-04T22:29:50Z

Pinging @elastic/es-search (:Search/SQL)

matriv · 2020-02-04T22:36:29Z

I'm thinking that the only optimisation we can do is to skip the columns after the last aggregate function in the ORDER BY. For example, in

SELECT f1, f2, f3, MIN(f4) AS min, MAX(f5) AS max FROM test
GROUP BY f1, f2, f3
ORDER BY f1, max, f2, min, f3

f3 can be skipped, since the rows already arrive in the order defined by f1, f2, f3.
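To make the proposed optimisation concrete, here is a minimal sketch of trimming an ORDER BY list after its last aggregate. This is not code from the PR (the optimisation was deferred to a separate change); `SortItem` and `columnsNeededForLocalSort` are hypothetical names, and the sketch assumes that rows which tie on every remaining column keep their incoming order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TrailingSortTrimSketch {

    /** Hypothetical stand-in for an ORDER BY item: a name plus whether it is an aggregate. */
    static final class SortItem {
        final String name;
        final boolean aggregate;

        SortItem(String name, boolean aggregate) {
            this.name = name;
            this.aggregate = aggregate;
        }
    }

    /** Keeps the ORDER BY items up to and including the last aggregate; empty if there is none. */
    static List<String> columnsNeededForLocalSort(List<SortItem> orderBy) {
        int lastAggregate = -1;
        for (int i = 0; i < orderBy.size(); i++) {
            if (orderBy.get(i).aggregate) {
                lastAggregate = i;
            }
        }
        List<String> needed = new ArrayList<>();
        for (int i = 0; i <= lastAggregate; i++) {
            needed.add(orderBy.get(i).name);
        }
        return needed;
    }

    public static void main(String[] args) {
        // ORDER BY f1, max, f2, min, f3
        List<SortItem> orderBy = Arrays.asList(
            new SortItem("f1", false), new SortItem("max", true), new SortItem("f2", false),
            new SortItem("min", true), new SortItem("f3", false));
        System.out.println(columnsNeededForLocalSort(orderBy)); // [f1, max, f2, min]
    }
}
```

Only the trailing f3 is dropped; f1 and f2 must stay because they decide whether the comparators for max and min apply at all.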

astefan

Left some comments.

astefan · 2020-02-05T11:52:59Z

.../plugin/sql/src/main/java/org/elasticsearch/xpack/sql/querydsl/container/QueryContainer.java

-                comp = s.missing() == Sort.Missing.FIRST ? Comparator.nullsFirst(comp) : Comparator.nullsLast(comp);
-
-                tuple = new Tuple<>(Integer.valueOf(atIndex), comp);
+                customSort = Boolean.TRUE;


Why not break early, as soon as the first AggregateSort is found?

We cannot break after the first AggregateSort. Maybe we could break after the last AggregateSort.
If we have:

SELECT f1, f2, f3, MAX(f4) as max, MIN(f5) as min FROM test GROUP BY f1, f2, f3 ORDER BY f1, max, f2, min, f3

we cannot break after max, we could break after min.

I'd rather leave the fix as is and introduce this optimisation in a separate PR where it's properly tested that it works.
(Needs some carefully chosen data set to test this ordering case)

I don't understand. That's a simple loop that, when it finds an AggregateSort, sets customSort to TRUE. It doesn't really matter what comes after it in the list of sorts, because that doesn't change the value of customSort.
Also, I meant breaking out of the loop, not out of the method...

Sorry, I misunderstood you there. Sure, we should break once the first one is found.

astefan · 2020-02-05T11:54:05Z

.../plugin/sql/src/main/java/org/elasticsearch/xpack/sql/querydsl/container/QueryContainer.java

+                    break;
+                }
+            }
+            if (atIndex==-1) {


atIndex == -1

Somehow I broke the formatting :(

astefan · 2020-02-05T11:54:27Z

.../plugin/sql/src/main/java/org/elasticsearch/xpack/sql/querydsl/container/QueryContainer.java

+                throw new SqlIllegalArgumentException("Cannot find backing column for ordering aggregation [{}]", s);
+            }
+            // assemble a comparator for it
+            Comparator comp = s.direction()==Sort.Direction.ASC ? Comparator.naturalOrder():Comparator.reverseOrder();


Comparator comp = s.direction() == Sort.Direction.ASC ? Comparator.naturalOrder() : Comparator.reverseOrder();

astefan · 2020-02-05T11:54:41Z

.../plugin/sql/src/main/java/org/elasticsearch/xpack/sql/querydsl/container/QueryContainer.java

+            }
+            // assemble a comparator for it
+            Comparator comp = s.direction()==Sort.Direction.ASC ? Comparator.naturalOrder():Comparator.reverseOrder();
+            comp = s.missing()==Sort.Missing.FIRST ? Comparator.nullsFirst(comp):Comparator.nullsLast(comp);


comp = s.missing() == Sort.Missing.FIRST ? Comparator.nullsFirst(comp) : Comparator.nullsLast(comp);

astefan · 2020-02-05T12:00:06Z

x-pack/plugin/sql/src/main/java/org/elasticsearch/xpack/sql/querydsl/container/ScoreSort.java

@@ -8,13 +8,22 @@
 import java.util.Objects;

 public class ScoreSort extends Sort {
-    public ScoreSort(Direction direction, Missing missing) {
+
+    final String id;


Did you explore the idea of having the id as part of Sort? I'm seeing a lot of repeated code (skimming through this PR's changes, the id seems to be introduced in all classes inheriting from Sort). I am probably missing something that prevents this approach from being used, and would love to hear the reasons.

I decided not to, because two classes, AttributeSort and AggregateSort, can extract the id from their attribute variables. The other three classes need to store it, so I just chose the first approach, which forces each class to implement the public String id() getter.

I would have done it differently. I feel like id and id() should belong to the same base class, Sort, with the help of an additional constructor (that receives the id as an argument). Sort would also provide a default implementation of id() (returning the id), and the other classes would override that default implementation where appropriate. Also, the "special-case" classes would call the freshly added constructor in Sort.

What bothers me most is that there is a required id() method in Sort, but the id itself (which could be related to this id() method) lives in the inheriting classes.

You can leave it as is, I just wanted to point out something that doesn't feel right to me.
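For illustration, the alternative layout suggested in this thread could look roughly like the sketch below. It is an assumption-laden outline, not the PR's code: the classes are nested in a hypothetical `SortIdSketch` holder, the direction and missing-value arguments of the real constructors are omitted, and the attribute is reduced to a plain string.

```java
class SortIdSketch {

    /** Base class that owns both the id field and a default id() accessor. */
    abstract static class Sort {
        private final String id;

        protected Sort() {
            this(null);
        }

        // the additional constructor mentioned in the review comment
        protected Sort(String id) {
            this.id = id;
        }

        public String id() {
            return id;
        }
    }

    /** A sort that has to store its id: it passes it to the base class. */
    static final class ScoreSort extends Sort {
        ScoreSort(String id) {
            super(id);
        }
    }

    /** A sort that can derive its id from what it wraps: it overrides id(). */
    static final class AttributeSort extends Sort {
        private final String attributeId; // stand-in for the real attribute object

        AttributeSort(String attributeId) {
            this.attributeId = attributeId;
        }

        @Override
        public String id() {
            return attributeId;
        }
    }
}
```

The trade-off is that the field and its getter live in one place, at the cost of an unused field in the classes that override id().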

astefan

LGTM

bpintea · 2020-02-07T14:12:25Z

it LGTM, but there are aspects I didn't take time to delve into.

costin · 2020-02-08T20:17:25Z

I might be missing something so bear with me.
The Comparator does not sort on all fields on purpose - to quote the implementation:

// if a sort item is not in the list, it is assumed the sorting happened in ES
// and the results are left as is (by using the row ordering), otherwise it is 
// sorted based on the given criteria.
//
// Take for example ORDER BY a, x, b, y
// a, b - are sorted in ES
// x, y - need to be sorted client-side
// sorting on x kicks in, only if the values for a are equal.

this means the query

SELECT gender, MAX(salary) AS max FROM test_emp
GROUP BY gender
ORDER BY gender, max

gets translated into ORDER BY gender followed by a client-side order by max which acts as a tie-breaker inside equal gender.
So from the AggSortingQueue, gender has an equality comparator which only triggers the max one.

What's incorrect with this approach? (An example would help.) Thanks.

matriv · 2020-02-08T23:34:07Z

Take for example this query that you mentioned:

SELECT gender, MAX(salary) AS max FROM test_emp
GROUP BY gender
ORDER BY gender, max

We receive the rows ordered by gender.
Currently (before this PR), the AggSortingQueue only has a comparator for index 1 (the max field). The lessThan() that does the "magic" iterates over the comparators; the first one it finds is the comparator for max, and it applies it regardless of the values of the two rows for gender (position 0), so it messes up the ordering. Check out the following example:

@SuppressWarnings("rawtypes")
public void testAggSorting_TwoFields_LALA() {
    List<Tuple<Integer, Comparator>> tuples = new ArrayList<>(2);
    // tuples.add(new Tuple<>(0, Comparator.reverseOrder()));
    tuples.add(new Tuple<>(1, Comparator.reverseOrder()));
    Querier.AggSortingQueue queue = new AggSortingQueue(10, tuples);

    for (int i = 1; i <= 100; i++) {
        queue.insertWithOverflow(new Tuple<>(Arrays.asList(100 - i + 1, i), i));
    }
    List<List<?>> results = queue.asList();

    assertEquals(10, results.size());
    for (int i = 0; i < 10; i++) {
        assertEquals(100 - i, results.get(i).get(0));
        assertEquals(i + 1, results.get(i).get(1));
    }
}

With the line commented out (the comparator on the 1st column), the test fails: the first row returned is [1, 100] instead of [100, 1].

Another solution, instead of passing all the comparators, would be to change the implementation of the queue. When a comparator is encountered for column i, it needs to be applied only if the values for all the previous columns [0 ... i - 1] are the same; otherwise, do nothing and leave the ordering as is. Or maybe another approach where we iterate over the columns and not the comparators. I'll check whether it makes sense, keeping performance in mind.
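The failure mode described above can be reproduced outside the queue with a plain list sort. The sketch below is a simplified stand-in and not the `AggSortingQueue` itself: `MissingComparatorDemo`, `SortColumn` and the sample rows are hypothetical, and the rows are assumed to arrive already ordered by column 0 descending.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class MissingComparatorDemo {

    /** A (column index, comparator) pair, a simplified stand-in for the Tuple in the test above. */
    static final class SortColumn {
        final int index;
        final Comparator<Integer> comparator;

        SortColumn(int index, Comparator<Integer> comparator) {
            this.index = index;
            this.comparator = comparator;
        }
    }

    /** Walks the configured comparators in order and returns the first non-zero result. */
    static int compare(List<Integer> left, List<Integer> right, List<SortColumn> columns) {
        for (SortColumn column : columns) {
            int result = column.comparator.compare(left.get(column.index), right.get(column.index));
            if (result != 0) {
                return result;
            }
        }
        return 0;
    }

    static List<List<Integer>> sort(List<List<Integer>> rows, List<SortColumn> columns) {
        List<List<Integer>> copy = new ArrayList<>(rows);
        copy.sort((l, r) -> compare(l, r, columns));
        return copy;
    }

    /** Rows of [group key, aggregate], arriving ordered by the group key descending. */
    static List<List<Integer>> sampleRows() {
        return Arrays.asList(Arrays.asList(3, 10), Arrays.asList(2, 20), Arrays.asList(1, 30));
    }

    /** Only the aggregate column has a comparator (the behaviour before the fix). */
    static List<SortColumn> aggregateOnly() {
        return Arrays.asList(new SortColumn(1, Comparator.reverseOrder()));
    }

    /** Every ORDER BY column has a comparator. */
    static List<SortColumn> allColumns() {
        return Arrays.asList(new SortColumn(0, Comparator.reverseOrder()), new SortColumn(1, Comparator.reverseOrder()));
    }

    public static void main(String[] args) {
        System.out.println(sort(sampleRows(), aggregateOnly())); // [[1, 30], [2, 20], [3, 10]]
        System.out.println(sort(sampleRows(), allColumns()));    // [[3, 10], [2, 20], [1, 30]]
    }
}
```

With only the aggregate comparator, the group key ordering is ignored entirely; once column 0 takes part in the comparison, the aggregate comparator only breaks ties.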

matriv · 2020-02-09T11:11:56Z

Here is a proposed solution: https://gist.github.com/matriv/1f3d6dc3150e5ff231598ec4c68be8b1
Need to check the NULLS FIRST, NULLS LAST handling too though.

matriv · 2020-02-09T14:30:36Z

The approach above doesn't work. The problem is that every result row contains all the returned columns (which can be more than the ones used for ordering), and we lack the information about which columns were involved in the ordering. As far as I can tell, we need this information, because we must apply the AggSorting on column n only if all columns [0 .. n-1] are equal. Therefore we need to pass this info to the AggSortingQueue. We could pass only the integer indices of the columns involved in the ORDER BY, then use them to check that the values for columns [0 .. n-1] are equal before applying the AggSorting comparators. But since we need this equality check on the previous columns anyway, I'd prefer passing all the comparators involved, instead of passing those column indices and using Objects.equals(leftValue, rightValue).

Maybe you have some better idea though, or maybe I'm missing something in my whole approach.

matriv · 2020-02-11T18:12:50Z

@astefan @costin After discussion, pushed a slightly different approach, where we don't pass the comparators for ES pre-ordered columns, but just the indices of those columns (on the output rows) with a null comparator and keep Objects.equals() to find ties.

Added a couple of integ tests for histograms and a couple more unit tests for the AggSortingQueue.
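A rough sketch of the comparison rule described in this comment: a null comparator marks a column that ES has already ordered, and Objects.equals() decides whether the rows tie on it. The names (`PreSortedTieBreakSketch`, `Row`, `SortColumn`) and the sample data are hypothetical, and a stable list sort replaces the priority queue, so this only illustrates the rule and is not the merged implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Objects;

class PreSortedTieBreakSketch {

    /** Column index plus an optional comparator; null means "already ordered by ES". */
    static final class SortColumn {
        final int index;
        final Comparator<Object> comparator;

        SortColumn(int index, Comparator<Object> comparator) {
            this.index = index;
            this.comparator = comparator;
        }
    }

    /** A result row together with its arrival position in the response. */
    static final class Row {
        final int position;
        final List<Object> values;

        Row(int position, Object... values) {
            this.position = position;
            this.values = Arrays.asList(values);
        }
    }

    @SuppressWarnings("unchecked")
    static final Comparator<Object> NATURAL = (a, b) -> ((Comparable<Object>) a).compareTo(b);

    static int compare(Row left, Row right, List<SortColumn> columns) {
        for (SortColumn column : columns) {
            Object vl = left.values.get(column.index);
            Object vr = right.values.get(column.index);
            if (column.comparator != null) {
                // client-side column: a non-zero result decides, a tie moves on
                int result = column.comparator.compare(vl, vr);
                if (result != 0) {
                    return result;
                }
            } else if (Objects.equals(vl, vr) == false) {
                // pre-ordered column with different values: keep the arrival order
                return Integer.compare(left.position, right.position);
            }
        }
        // everything ties: fall back to the arrival order
        return Integer.compare(left.position, right.position);
    }

    static List<List<Object>> sort(List<Row> rows, List<SortColumn> columns) {
        List<Row> copy = new ArrayList<>(rows);
        copy.sort((l, r) -> compare(l, r, columns));
        List<List<Object>> result = new ArrayList<>();
        for (Row row : copy) {
            result.add(row.values);
        }
        return result;
    }

    public static void main(String[] args) {
        // hypothetical rows of [gender, max], already ordered by gender
        List<Row> rows = Arrays.asList(new Row(0, "F", 30), new Row(1, "F", 10), new Row(2, "M", 20), new Row(3, "M", 5));
        List<SortColumn> orderBy = Arrays.asList(new SortColumn(0, null), new SortColumn(1, NATURAL));
        System.out.println(sort(rows, orderBy)); // [[F, 10], [F, 30], [M, 5], [M, 20]]
    }
}
```

The max comparator only reorders rows within the same gender; rows with different genders stay in the order in which they arrived.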

costin

LGTM - thanks for your patience while looking into this one.

costin · 2020-02-11T20:07:38Z

x-pack/plugin/sql/src/main/java/org/elasticsearch/xpack/sql/execution/search/Querier.java

                if (comparator != null) {
                    int result = comparator.compare(vl, vr);
-                    // if things are equals, move to the next comparator
+                    // if things are not equal: return the comparison result,
+                    // or else: move to the next comparator to solve the tie.


Nit: or else -> else or otherwise

matriv · 2020-02-11T20:11:36Z

@costin Thx for your help to tackle the issue correctly!

(cherry picked from commit be680af)

matriv added >bug :Analytics/SQL SQL querying v8.0.0 v7.7.0 v7.6.1 labels Feb 4, 2020

matriv requested review from costin, astefan and bpintea February 4, 2020 22:29

matriv and others added 3 commits February 5, 2020 01:13

Merge remote-tracking branch 'upstream/master' into fix-50355

a68c6de

make field private

90cb862

Merge remote-tracking branch 'upstream/master' into fix-50355

ab0c627

astefan reviewed Feb 5, 2020

matriv added 2 commits February 5, 2020 13:12

fix formatting

9d25c93

break from loop once custom sort is recognized

aaf2bd3

matriv requested a review from astefan February 5, 2020 12:15

astefan approved these changes Feb 5, 2020

bpintea approved these changes Feb 7, 2020

matriv and others added 3 commits February 11, 2020 19:07

change approach - add more tests

a1ec293

Merge remote-tracking branch 'upstream/master' into fix-50355

65a1ca3

revert changes to untouched files

46b9700

matriv requested a review from astefan February 11, 2020 18:10

costin approved these changes Feb 11, 2020

Address comment

637a9f9

matriv merged commit be680af into elastic:master Feb 12, 2020

matriv deleted the fix-50355 branch February 12, 2020 08:37

matriv added backport pending v6.8.7 labels Feb 12, 2020

matriv removed the backport pending label Feb 12, 2020

codebrain mentioned this pull request Apr 1, 2020

7.7.0 meta ticket (Part 3) elastic/elasticsearch-net#4534

Closed

jakelandis added v8.0.0-alpha1 and removed v8.0.0 labels Jul 26, 2021
