adjust topn heap algorithm to only use known cardinality path when dictionary is unique #11186

Merged
jon-wei merged 3 commits into apache:master from clintropolis:topn-heap-cardinality-known-when-unique
Jun 10, 2021

@clintropolis
Member

Description

This PR adjusts the heap-based topN algorithm to use the "known" cardinality path only when the dictionary contains unique values. The known cardinality path aggregates values with an array-based approach: an array of aggregator arrays, sized to the value cardinality, is created, and each dictionaryId is expected to index the array position holding the aggregators for that value, as an optimization to avoid a map lookup.
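To make the array-based idea concrete, here is a minimal sketch, not the actual Druid code. All names (`KnownCardinalitySketch`, `SumAggregator`, `mapLookups`) are hypothetical, and each slot holds a single aggregator rather than an array of aggregators to keep it short. It assumes the array acts as a cache in front of a map keyed by value, which is how the description reads.

```java
import java.util.HashMap;
import java.util.Map;

public class KnownCardinalitySketch
{
  /** Hypothetical stand-in for a real aggregator: sums one long metric. */
  static final class SumAggregator
  {
    long sum;

    void aggregate(long value)
    {
      sum += value;
    }
  }

  /** Number of times the fallback map was consulted; for illustration only. */
  static int mapLookups = 0;

  static Map<String, SumAggregator> aggregate(String[] dictionary, int[] rowIds, long[] metrics)
  {
    // One slot per dictionary id; a slot stays null until that id is first seen.
    SumAggregator[] byId = new SumAggregator[dictionary.length];
    Map<String, SumAggregator> byValue = new HashMap<>();

    for (int row = 0; row < rowIds.length; row++) {
      int id = rowIds[row];
      SumAggregator agg = byId[id];
      if (agg == null) {
        // First time this id is seen: consult the map, then cache the result in the array.
        mapLookups++;
        agg = byValue.computeIfAbsent(dictionary[id], k -> new SumAggregator());
        byId[id] = agg;
      }
      agg.aggregate(metrics[row]);
    }
    return byValue;
  }

  public static void main(String[] args)
  {
    String[] dictionary = {"a", "b"};   // unique values, so ids 0 and 1 repeat across rows
    int[] rowIds = {0, 1, 0, 0, 1};
    long[] metrics = {1, 2, 3, 4, 5};
    Map<String, SumAggregator> result = aggregate(dictionary, rowIds, metrics);
    // prints "8 7 lookups=2"
    System.out.println(result.get("a").sum + " " + result.get("b").sum + " lookups=" + mapLookups);
  }
}
```

With a unique dictionary, the map is touched once per distinct value and every later row for that value is a plain array read.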

However, some selectors know their cardinality but do not have unique dictionaryIds, such as a selector from an IndexedTable in a join result, which uses the row number as the dictionaryId. When such a selector is aggregated, each dictionaryId is 'new', so it has a null array entry and still incurs the map interaction this path is trying to avoid.

These selectors will now use the map directly via the cardinality "unknown" path.
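A rough illustration of why the array buys nothing when the row number is the dictionaryId, again as a sketch under assumptions rather than the real implementation. `RowNumberIdSketch` and both method names are hypothetical, and a simple row count stands in for real aggregators.

```java
import java.util.HashMap;
import java.util.Map;

public class RowNumberIdSketch
{
  /** Array-first path: returns how many rows still had to consult the map. */
  static int arrayFirstMapLookups(String[] valueById, int[] rowIds)
  {
    long[][] byId = new long[valueById.length][];   // null slot = id not seen yet
    Map<String, long[]> byValue = new HashMap<>();
    int lookups = 0;
    for (int id : rowIds) {
      long[] counter = byId[id];
      if (counter == null) {
        lookups++;
        counter = byValue.computeIfAbsent(valueById[id], k -> new long[1]);
        byId[id] = counter;
      }
      counter[0]++;
    }
    return lookups;
  }

  /** Map-only ("unknown" cardinality) path: counts rows per value with no array at all. */
  static Map<String, Long> mapOnlyCounts(String[] valueById, int[] rowIds)
  {
    Map<String, Long> counts = new HashMap<>();
    for (int id : rowIds) {
      counts.merge(valueById[id], 1L, Long::sum);
    }
    return counts;
  }

  public static void main(String[] args)
  {
    // Row number used as the id: 4 rows, 4 ids, but only 2 distinct values.
    String[] valueById = {"x", "y", "x", "x"};
    int[] rowIds = {0, 1, 2, 3};
    System.out.println(arrayFirstMapLookups(valueById, rowIds)); // prints 4, one lookup per row
    System.out.println(mapOnlyCounts(valueById, rowIds));
  }
}
```

In this sketch the array-first path allocates a slot per row and still performs one map lookup per row, so going straight to the map does the same work with less overhead.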

This PR has:

  • been self-reviewed.
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

  )
  {
-   if (selector.getValueCardinality() != DimensionDictionarySelector.CARDINALITY_UNKNOWN) {
+   if (capabilities.isDictionaryEncoded().and(capabilities.areDictionaryValuesUnique()).isTrue()) {
Contributor

so it is not possible to have unique dictionary ids but unknown cardinality?

Member Author

Hmm, that is a good point. I guess it is implicit in the current state of things that the only things we have right now that report having unique dictionary ids are things with known cardinality, but I suppose this check needs both things to be true. I'll modify this check to require both conditions.

Member Author

I've updated the check, and added a comment to try to explain what is going on with the aggregation algorithm selection and why.
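For readers following the thread, the combined condition discussed above can be sketched roughly as follows. This is not the merged code: `AggregationPathSketch`, `Path`, and `choosePath` are hypothetical names, plain booleans stand in for the capabilities object, and the `-1` sentinel is an assumed placeholder for the `CARDINALITY_UNKNOWN` constant named in the diff.

```java
public class AggregationPathSketch
{
  /** Assumed sentinel for "cardinality not known"; a placeholder, not taken from the PR. */
  static final int CARDINALITY_UNKNOWN = -1;

  enum Path
  {
    KNOWN_CARDINALITY_ARRAY,
    UNKNOWN_CARDINALITY_MAP
  }

  static Path choosePath(int valueCardinality, boolean dictionaryEncoded, boolean dictionaryValuesUnique)
  {
    // The array path needs a size to allocate, so the cardinality must be known...
    boolean cardinalityKnown = valueCardinality != CARDINALITY_UNKNOWN;
    // ...and it only pays off when each id maps to exactly one distinct value.
    boolean uniqueDictionary = dictionaryEncoded && dictionaryValuesUnique;
    return (cardinalityKnown && uniqueDictionary)
           ? Path.KNOWN_CARDINALITY_ARRAY
           : Path.UNKNOWN_CARDINALITY_MAP;
  }

  public static void main(String[] args)
  {
    System.out.println(choosePath(100, true, true));                  // KNOWN_CARDINALITY_ARRAY
    System.out.println(choosePath(100, true, false));                 // UNKNOWN_CARDINALITY_MAP
    System.out.println(choosePath(CARDINALITY_UNKNOWN, true, true));  // UNKNOWN_CARDINALITY_MAP
  }
}
```

The second call models the join-result case from the description (cardinality known, ids not unique), and the third models the reviewer's case (unique ids, cardinality unknown).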
