-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update ClusterByStatisticsCollectorImpl to use bytes instead of keys #12998
Update ClusterByStatisticsCollectorImpl to use bytes instead of keys #12998
Conversation
...re/multi-stage-query/src/main/java/org/apache/druid/msq/statistics/DistinctKeyCollector.java
Show resolved
Hide resolved
...ge-query/src/main/java/org/apache/druid/msq/statistics/ClusterByStatisticsCollectorImpl.java
Outdated
Show resolved
Hide resolved
...re/multi-stage-query/src/main/java/org/apache/druid/msq/statistics/DistinctKeyCollector.java
Outdated
Show resolved
Hide resolved
...i-stage-query/src/main/java/org/apache/druid/msq/statistics/QuantilesSketchKeyCollector.java
Show resolved
Hide resolved
...query/src/main/java/org/apache/druid/msq/statistics/QuantilesSketchKeyCollectorSnapshot.java
Show resolved
Hide resolved
...stage-query/src/test/java/org/apache/druid/msq/statistics/DelegateOrMinKeyCollectorTest.java
Show resolved
Hide resolved
...i-stage-query/src/main/java/org/apache/druid/msq/statistics/QuantilesSketchKeyCollector.java
Outdated
Show resolved
Hide resolved
...i-stage-query/src/main/java/org/apache/druid/msq/statistics/QuantilesSketchKeyCollector.java
Outdated
Show resolved
Hide resolved
...ge-query/src/main/java/org/apache/druid/msq/statistics/ClusterByStatisticsCollectorImpl.java
Outdated
Show resolved
Hide resolved
...ge-query/src/main/java/org/apache/druid/msq/statistics/ClusterByStatisticsCollectorImpl.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor comments.
Will approve once they are addressed.
Thanks for this patch @adarshsanjeev. This would be super usefull.
...-query/src/main/java/org/apache/druid/msq/statistics/QuantilesSketchKeyCollectorFactory.java
Show resolved
Hide resolved
...re/multi-stage-query/src/main/java/org/apache/druid/msq/statistics/DistinctKeyCollector.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reverting the accidental approve :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
+1 (non binding)
Request @gianm to take a look on this also.
Currently, ClusterByStatisticsCollectorImpl uses number of keys retained as a metric to check when to downsample a bucket. The size of keys is not necessarily the best metric, if the size of keys is small, more than the threshold could fit in memory. If the size is very large, it could lead to out of memory issues. This could be improved by changing this to check for the number of bytes currently retained instead. This would let us avoid downsampling if the keys are small, which would result in better accuracy for key collector.
The K value (which determines accuracy) for Quantile sketches which are used by KeyCollectors is set at 4096, which potentially could be higher, since we can downsample (which reduces the K). This has also been increased to 32768 for greater accuracy (at the cost of more memory).
Changes
Memory Usage
Currently, the threshold for keys per ClusterByStatisticsCollector is 65,536. Once all sketches have a combined total of this threshold, sketches are downsampled till the total keys retained is halved. The new byte threshold for ClusterByStatisticsCollector is 300MB. For keys of average length 300 (which is around one time column and 8 string columns of 30 characters each), we would have 300MB/300 or 1,000,000 keys.
For quantiles sketches, the K has been increased to 32768, which comes around to 11 * 32,768 or 360,448 keys. Considering a key as 300 bytes again, this comes out to 108MB.
For DistinctKeyCollectors, the initial byte threshold before downsampling is 100MB.
Key changed/added classes in this PR
ClusterByStatisticsCollectorImpl
DistinctKeyCollector
QuantilesSketchKeyCollector
DelegateOrMinKeyCollector
This PR has: