Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Grouping Engine fix when a limit spec with different order by columns is applied #16534

Merged
merged 9 commits into from
Jun 20, 2024

Conversation

sreemanamala
Copy link
Contributor

@sreemanamala sreemanamala commented Jun 1, 2024

Description

fixed the grouping engine for a query with grouping sets when limit is applied with order by columns being different from the query dimensions in order to merge the subtotal results correctly.

for example a query like :

SELECT
  isNew, isRobot, COUNT(*) AS "Cnt"
FROM "wikipedia"
GROUP BY GROUPING SETS ((isRobot), (isNew))
ORDER BY 2, 1
limit 100

we need to sort by the 2nd dimension first before applying the merge function.

Release note


Key changed/added classes in this PR
  • GroupingEngine

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@@ -686,13 +686,27 @@ public Sequence<ResultRow> processSubtotalsSpec(
processingConfig.intermediateComputeSizeBytes()
);

List<String> queryDimNames = baseSubtotalQuery.getDimensions().stream().map(DimensionSpec::getOutputName)
List<String> queryDimNamesInOrder = baseSubtotalQuery.getDimensions().stream().map(DimensionSpec::getOutputName)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I understand the reasons/etc here - but if these are the column the query's output will be ordered; can't we place these logics inside a member method as GroupByQuery#getOrderColumnNames or something ?

@@ -577,6 +577,7 @@ private Ordering<ResultRow> getRowOrderingForPushDown(
final List<Boolean> needsReverseList = new ArrayList<>();
final List<ColumnType> dimensionTypes = new ArrayList<>();
final List<StringComparator> comparators = new ArrayList<>();
final List<DimensionSpec> dimensionsInOrder = new ArrayList<>();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this double-book-keeping is a pretty wierd way to supply this detail - but it works...

I think it would be more beneficial

  • identify the differences between getRowOrderingForPushDown and getRowOrdering
  • merge the differing behaviour into the getRowOrderingForPushDown method
    • isn't the main difference is that getRowOrdering ignores the limitSpec regardless its set?
  • possibly also eat-up the RowBasedGrouperHelper which is another copy of the process of creating the comparators - I think correctness of the execution depends on knowing on which columns we ordered and in what order....having just one source of truth could reduce the amount of issues we face

all of the above are refactors - which might delay the fix of this bug....

@@ -565,7 +565,7 @@ private boolean canDoLimitPushDown(
* limit/order spec (unlike non-push down case where the results always use the default natural ascending order),
* so when merging these partial result streams, the merge needs to use the same ordering to get correct results.
*/
private Ordering<ResultRow> getRowOrderingForPushDown(
private OrderingAndDimensions getRowOrderingAndDimensionsForPushDown(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note: I think you could keep the old method with its signature; and service it like: return getOrderingAndDimensions(false).getRowOrdering()

Comment on lines 690 to 693
.getDimensions()
.stream()
.map(DimensionSpec::getOutputName)
.collect(Collectors.toList());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note: can this be hidden somewhere - like a method named: getDimentsionColumnNames or something

@kgyrtkirk kgyrtkirk merged commit 7ac0862 into apache:master Jun 20, 2024
87 checks passed
@317brian
Copy link
Contributor

Should this have the Bug fix label on it too?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants