adjust topn heap algorithm to only use known cardinality path when dictionary is unique#11186
Merged
jon-wei merged 3 commits intoapache:masterfrom Jun 10, 2021
Conversation
…ctionary is unique
| ) | ||
| { | ||
| if (selector.getValueCardinality() != DimensionDictionarySelector.CARDINALITY_UNKNOWN) { | ||
| if (capabilities.isDictionaryEncoded().and(capabilities.areDictionaryValuesUnique()).isTrue()) { |
Contributor
There was a problem hiding this comment.
so it is not possible to have unique dictionary ids but unknown cardinality?
Member
Author
There was a problem hiding this comment.
Hmm, that is a good point, I guess it is implicit of the current state of things that the only things we have right now that report having unique dictionary ids are things with known cardinality, but I suppose this check needs both things to be true. I'll modify this check both conditions.
Member
Author
There was a problem hiding this comment.
I've updated the check, and added a comment to try to explain what is going on with the aggregation algorithm selection and why
jon-wei
approved these changes
Jun 10, 2021
2 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description
This PR adjusts the heap based topN algorithm to only use known the "known" cardinality path only when the dictionary contains unique values. The known cardinality path to aggregate values uses an array based approach, where an array of aggregator arrays the size of the value cardinality is created, and the dictionaryId is expected to index to an array position with the aggregators for that value, as an optimization to avoid a map lookup.
However, when a selector is aggregated which does not have unique dictionaryIds, but does know its cardinality, such as a selector from an
IndexedTablefrom a join result which uses the row number as the dictionaryId instead, it means that each dictionaryId will be 'new', and thus have a null array entry and still incur the map interaction this path is trying to avoid.Instead, these selectors will now just use the map directly by using the cardinality "unknown" path instead.
This PR has: