Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update ClusterByStatisticsCollectorImpl to use bytes instead of keys #12998

Conversation

adarshsanjeev
Copy link
Contributor

@adarshsanjeev adarshsanjeev commented Aug 30, 2022

Currently, ClusterByStatisticsCollectorImpl uses number of keys retained as a metric to check when to downsample a bucket. The size of keys is not necessarily the best metric, if the size of keys is small, more than the threshold could fit in memory. If the size is very large, it could lead to out of memory issues. This could be improved by changing this to check for the number of bytes currently retained instead. This would let us avoid downsampling if the keys are small, which would result in better accuracy for key collector.
The K value (which determines accuracy) for Quantile sketches which are used by KeyCollectors is set at 4096, which potentially could be higher, since we can downsample (which reduces the K). This has also been increased to 32768 for greater accuracy (at the cost of more memory).

Changes
  • Update ClusterByStatisticsCollector to store average bytes for a key.
  • Update ClusterByStatisticsCollector to use bytes as thresolds for downsampling.
  • Increased initial K value for quantile sketches from 4096 to 32768.

Memory Usage

Currently, the threshold for keys per ClusterByStatisticsCollector is 65,536. Once all sketches have a combined total of this threshold, sketches are downsampled till the total keys retained is halved. The new byte threshold for ClusterByStatisticsCollector is 300MB. For keys of average length 300 (which is around one time column and 8 string columns of 30 characters each), we would have 300MB/300 or 1,000,000 keys.

For quantiles sketches, the K has been increased to 32768, which comes around to 11 * 32,768 or 360,448 keys. Considering a key as 300 bytes again, this comes out to 108MB.

For DistinctKeyCollectors, the initial byte threshold before downsampling is 100MB.


Key changed/added classes in this PR
  • ClusterByStatisticsCollectorImpl
  • DistinctKeyCollector
  • QuantilesSketchKeyCollector
  • DelegateOrMinKeyCollector

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

Copy link
Contributor

@cryptoe cryptoe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor comments.
Will approve once they are addressed.

Thanks for this patch @adarshsanjeev. This would be super usefull.

Copy link
Contributor

@cryptoe cryptoe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reverting the accidental approve :)

Copy link
Contributor

@cryptoe cryptoe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM
+1 (non binding)
Request @gianm to take a look on this also.

@abhishekagarwal87 abhishekagarwal87 merged commit 92d2633 into apache:master Oct 3, 2022
@kfaraz kfaraz added this to the 25.0 milestone Nov 22, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants