Adaptive merge rollup segment sizing #17762
Open
davecromberge wants to merge 6 commits into apache:master
Enable size-based segment generation for tables with variable-sized data
(e.g., Theta sketches) where static row counts produce inconsistent segment
sizes.
Implemented two strategies:
- AdaptiveSegmentNumRowProvider: EMA-based learning for homogeneous data
- PercentileAdaptiveSegmentNumRowProvider: Reservoir sampling with percentile
estimation for heterogeneous/multi-tenant data (resistant to outliers)
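The EMA-based idea can be sketched roughly as follows. This is a minimal illustration, not the PR's `AdaptiveSegmentNumRowProvider`: the class name `EmaRowCountEstimator`, the smoothing factor `alpha`, and the seed value `initialBytesPerRow` are all assumptions made for the example.

```java
/**
 * Hypothetical sketch (not the PR's implementation): keeps an exponential
 * moving average (EMA) of observed bytes per row and derives the row count
 * expected to produce a segment of the desired size.
 */
final class EmaRowCountEstimator {
  private final long desiredSegmentSizeBytes;
  private final double alpha;      // assumed smoothing factor in (0, 1]
  private double emaBytesPerRow;   // current estimate of bytes per row

  EmaRowCountEstimator(long desiredSegmentSizeBytes, double alpha, double initialBytesPerRow) {
    if (desiredSegmentSizeBytes <= 0 || initialBytesPerRow <= 0 || alpha <= 0 || alpha > 1) {
      throw new IllegalArgumentException("invalid estimator parameters");
    }
    this.desiredSegmentSizeBytes = desiredSegmentSizeBytes;
    this.alpha = alpha;
    this.emaBytesPerRow = initialBytesPerRow;
  }

  /** Folds the size of a freshly generated segment into the running estimate. */
  void update(long segmentSizeBytes, long numRows) {
    if (segmentSizeBytes <= 0 || numRows <= 0) {
      return; // ignore empty or invalid observations
    }
    double observed = (double) segmentSizeBytes / numRows;
    emaBytesPerRow = alpha * observed + (1 - alpha) * emaBytesPerRow;
  }

  /** Row count for the next segment, clamped to [1, Integer.MAX_VALUE]. */
  int numRows() {
    long rows = (long) (desiredSegmentSizeBytes / emaBytesPerRow);
    return (int) Math.max(1L, Math.min(Integer.MAX_VALUE, rows));
  }
}
```

A larger `alpha` reacts faster to changes in row size but is noisier, which is presumably why an EMA suits homogeneous data better than data with heavy outliers.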
Configuration reads directly from MergeRollupTask config map, following the
eraseDimensionValues pattern. No changes to shared SegmentConfig or framework.
Example config:
{
  "MergeRollupTask": {
    "desiredSegmentSizeBytes": "209715200",
    "segmentSizingStrategy": "PERCENTILE",
    "sizingPercentile": "75"
  }
}
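As a rough sketch of how these keys might be read from the task's config map: the three key names and the `PERCENTILE` value come from the example above, while the class `SizingConfig`, the `EMA` enum constant, and the defaults used when a key is absent are assumptions for illustration only. The input is the inner map under `MergeRollupTask`.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

/** Hypothetical holder for the sizing options; not a class from the PR. */
final class SizingConfig {
  enum Strategy { EMA, PERCENTILE } // "EMA" is an assumed name for the first strategy

  final long desiredSegmentSizeBytes;
  final Strategy strategy;
  final int sizingPercentile;

  private SizingConfig(long desiredSegmentSizeBytes, Strategy strategy, int sizingPercentile) {
    this.desiredSegmentSizeBytes = desiredSegmentSizeBytes;
    this.strategy = strategy;
    this.sizingPercentile = sizingPercentile;
  }

  /** Returns empty when size-based sizing is not configured for the task. */
  static Optional<SizingConfig> fromTaskConfig(Map<String, String> taskConfig) {
    String size = taskConfig.get("desiredSegmentSizeBytes");
    if (size == null) {
      return Optional.empty();
    }
    long desired = Long.parseLong(size.trim());
    if (desired <= 0) {
      throw new IllegalArgumentException("desiredSegmentSizeBytes must be positive");
    }
    // Defaults below are assumptions, not documented behavior.
    String strategyName = taskConfig.getOrDefault("segmentSizingStrategy", "EMA");
    Strategy strategy = Strategy.valueOf(strategyName.trim().toUpperCase(Locale.ROOT));
    int percentile = Integer.parseInt(taskConfig.getOrDefault("sizingPercentile", "75").trim());
    if (percentile < 1 || percentile > 100) {
      throw new IllegalArgumentException("sizingPercentile must be in [1, 100]");
    }
    return Optional.of(new SizingConfig(desired, strategy, percentile));
  }
}
```

In the example config, `209715200` bytes is 200 MiB (200 × 1024 × 1024).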
Adds maxSegmentSizeBytesPerTask config to group input segments by total size instead of row count, providing more predictable memory usage for tables with variable row sizes (e.g., theta sketches, HLL). Falls back to row-based grouping when not configured.
…PerTask limit
Changed task creation logic to check if adding a segment would exceed the configured size limit before grouping it, rather than after. This prevents tasks from significantly overshooting the maxSegmentSizeBytesPerTask threshold when grouping multiple segments, while still allowing single oversized segments to be merged.
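The check-before-add grouping described here might look roughly like the sketch below. It is an illustration under assumptions, not the PR's task generator code: `SizeBasedGrouper` and `group` are hypothetical names, and segments are represented only by their sizes in bytes.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of size-based grouping; not the PR's implementation. */
final class SizeBasedGrouper {
  /**
   * Groups segment sizes into tasks so that no multi-segment task exceeds
   * maxBytesPerTask. The limit is checked BEFORE a segment is added, so a
   * group is closed as soon as the next segment would push it over. A single
   * segment larger than the limit still gets a task of its own.
   */
  static List<List<Long>> group(List<Long> segmentSizesBytes, long maxBytesPerTask) {
    List<List<Long>> tasks = new ArrayList<>();
    List<Long> current = new ArrayList<>();
    long currentBytes = 0;
    for (long size : segmentSizesBytes) {
      if (!current.isEmpty() && currentBytes + size > maxBytesPerTask) {
        tasks.add(current);          // close the group before it overshoots
        current = new ArrayList<>();
        currentBytes = 0;
      }
      current.add(size);
      currentBytes += size;
    }
    if (!current.isEmpty()) {
      tasks.add(current);
    }
    return tasks;
  }
}
```

With a limit of 100 and sizes 40, 50, 30, 200, 10, this yields the groups [40, 50], [30], [200], [10]. Checking after the add instead would have produced a first group of 120 bytes.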
Summary
Enable size-based segment generation for tables with variable-sized data
(e.g., Theta sketches) where static row counts produce inconsistent segment
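sizes. It also adds size-based segment grouping for MergeRollupTask.
Implemented two strategies:
- AdaptiveSegmentNumRowProvider: EMA-based learning for homogeneous data
- PercentileAdaptiveSegmentNumRowProvider: Reservoir sampling with percentile
estimation for heterogeneous/multi-tenant data (resistant to outliers)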
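To illustrate why a percentile over sampled row sizes resists outliers, here is a minimal sketch using classic reservoir sampling (Algorithm R) and a nearest-rank percentile. It is not the PR's `PercentileAdaptiveSegmentNumRowProvider`; the class name `PercentileRowSizeEstimator`, the reservoir capacity, and the fixed seed are assumptions for the example.

```java
import java.util.Arrays;
import java.util.Random;

/** Hypothetical sketch: reservoir of bytes-per-row samples with percentile lookup. */
final class PercentileRowSizeEstimator {
  private final double[] reservoir;
  private final Random random;
  private long seen; // total number of samples offered so far

  PercentileRowSizeEstimator(int capacity, long seed) {
    if (capacity <= 0) {
      throw new IllegalArgumentException("capacity must be positive");
    }
    this.reservoir = new double[capacity];
    this.random = new Random(seed);
  }

  /** Algorithm R: every sample seen so far is kept with equal probability. */
  void record(double bytesPerRow) {
    seen++;
    if (seen <= reservoir.length) {
      reservoir[(int) (seen - 1)] = bytesPerRow;
    } else {
      long slot = (long) (random.nextDouble() * seen); // uniform in [0, seen)
      if (slot < reservoir.length) {
        reservoir[(int) slot] = bytesPerRow;
      }
    }
  }

  /** Nearest-rank percentile of the sampled bytes-per-row values, p in [1, 100]. */
  double percentile(int p) {
    int n = (int) Math.min(seen, reservoir.length);
    if (n == 0 || p < 1 || p > 100) {
      throw new IllegalStateException("no samples recorded or percentile out of range");
    }
    double[] sorted = Arrays.copyOf(reservoir, n);
    Arrays.sort(sorted);
    int rank = (int) Math.ceil(p / 100.0 * n);
    return sorted[Math.max(0, rank - 1)];
  }

  /** Row count expected to hit the desired size at the chosen percentile. */
  int numRows(long desiredSegmentSizeBytes, int p) {
    long rows = (long) (desiredSegmentSizeBytes / percentile(p));
    return (int) Math.max(1L, Math.min(Integer.MAX_VALUE, rows));
  }
}
```

With samples 10, 10, 10 and one outlier of 1000 bytes per row, the 75th percentile is still 10, whereas a mean would be pulled up to 257.5.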
Configuration reads directly from MergeRollupTask config map, following the
eraseDimensionValues pattern. No changes to shared SegmentConfig or framework.
Instructions:
The PR has to be tagged with at least one of the following labels (*):
- feature
- performance
- release-notes: New configuration options