Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CALCITE-6261] AssertionError with field pruning & duplicate agg calls #3703

Merged
merged 1 commit into from
Feb 26, 2024

Conversation

nielspardon
Copy link
Contributor

This PR fixes issues [CALCITE-6261] and [CALCITE-5888] which experienced an AssertionError when using field pruning with duplicate aggregation calls since Apache Calcite v1.35.0.

The issue was introduced with the code change from [CALCITE-4334] where the groupSet and groupSets variables were replaced by groupSet2 and groupSets2. The replacement was not done consistently so that the code for the elimination of duplicate aggregation calls was still using the groupSet and groupSets variables instead of the new variables.

The issue was caused by groupSet no longer containing the permuted field indices after the pruning of unnecessary fields and still containing the old field indices while the permuted field indices are now only available in groupSet2.

PR #3700 also attempted to fix this in parallel to my work on this PR due to some misunderstanding about the status of [CALCITE-6261] but did not consistently make use of the new variable groupSet2 in the agg call deduplication code.

Also, this PR contains unit tests for reproducing both scenarios from [CALCITE-6261] and [CALCITE-5888].

During testing I also tried reproducing the issue using SQL and the Quidem test suite but was unable to reproduce it via SQL.

@@ -2510,7 +2510,7 @@ private RelBuilder aggregate_(GroupKeyImpl groupKey,
// duplicates, then add a Project.
final Set<AggregateCall> callSet = new HashSet<>();
final PairList<Integer, @Nullable String> projects = PairList.of();
Util.range(groupSet.cardinality())
Util.range(groupSet2.cardinality())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when is groupSet.cardinality() != groupSet2.cardinality()?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From my understanding the return value of groupSet.cardinality() should always be the same as for groupSet2.cardinality() in the current implementation. My main point is that with all the logic at the beginning of the aggregate_() method which introduces groupSet2 we should use it consistently in the latter part of the method.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, but this means that #3700 is actually equivalent to this fix.
So that PR probably fixes both issues as well.
Is there a more descriptive name that could be used instead of groupSet2? Maybe that's the root problem: it is very hard to tell what groupSet2 means. But if it was called e.g., replacementGroupSet, then it would become apparent where it is misused.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, but this means that #3700 is actually equivalent to this fix.
So that PR probably fixes both issues as well.

Yes, functionally the other PR does fix the AssertionError and the functional outcome is the same.

Is there a more descriptive name that could be used instead of groupSet2? Maybe that's the root problem: it is very hard to tell what groupSet2 means. But if it was called e.g., replacementGroupSet, then it would become apparent where it is misused.

Yes, I also think the problem was introduced because of the ambiguity of the variables in the first place. A first thought I had was that maybe some refactoring of this method into smaller methods like extracting the agg call deduplication code into its own method could help to have a clearer separation with less ambiguity.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have approved the PR, but if you are willing to improve the code I will wait for a new commit.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can give it a shot on Monday when I'm back in the office.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • I renamed groupSet2 and groupSets2 to groupSetAfterPruning and groupSetsAfterPruning respectively to give them more obvious names.
  • I also renamed the helper variable map to sourceFieldToTargetFieldMap to make it more obvious.
  • I moved the field pruning and agg call deduplication code into its own pruneAggregateInputFieldsAndDeduplicateAggCalls method to reduce the length of the original aggregate_() method.

Signed-off-by: Niels Pardon <par@zurich.ibm.com>
Copy link

sonarcloud bot commented Feb 26, 2024

@rubenada
Copy link
Contributor

LGTM

@mihaibudiu mihaibudiu added the LGTM-will-merge-soon Overall PR looks OK. Only minor things left. label Feb 26, 2024
@mihaibudiu mihaibudiu merged commit 022d878 into apache:main Feb 26, 2024
17 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
LGTM-will-merge-soon Overall PR looks OK. Only minor things left.
Projects
None yet
3 participants