Adaptive merge rollup segment sizing #17762

Open
davecromberge wants to merge 6 commits into apache:master from
permutive-engineering:feature-contrib/adaptive-merge-rollup-segment-sizing

Summary

Enable size-based segment generation for tables with variable-sized data
(e.g., Theta sketches), where static row counts produce inconsistent segment
sizes. Also add size-based grouping of input segments for MergeRollupTask.

Implemented two strategies:

  • AdaptiveSegmentNumRowProvider: EMA-based learning for homogeneous data
  • PercentileAdaptiveSegmentNumRowProvider: Reservoir sampling with percentile
    estimation for heterogeneous/multi-tenant data (resistant to outliers)
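To make the difference between the two strategies concrete, here is a minimal sketch in Python. It is not the code in this PR: the class names, the `alpha`, `capacity`, and `initial_rows` parameters, and the choice of bytes-per-row as the tracked quantity are all assumptions made for illustration. The first class smooths observations with an exponential moving average; the second keeps a fixed-size reservoir sample (Algorithm R) and sizes segments from a nearest-rank percentile, so a few unusually large rows do not dominate the estimate.

```python
import math
import random


class EmaRowEstimator:
    """Hypothetical sketch: EMA of observed bytes per row."""

    def __init__(self, desired_segment_bytes, alpha=0.3, initial_rows=1_000_000):
        self.desired_segment_bytes = desired_segment_bytes
        self.alpha = alpha                # weight of the newest observation (assumed)
        self.initial_rows = initial_rows  # used until the first observation (assumed)
        self.bytes_per_row = None

    def update(self, rows, size_bytes):
        if rows <= 0:
            return  # nothing to learn from an empty segment
        observed = size_bytes / rows
        if self.bytes_per_row is None:
            self.bytes_per_row = observed
        else:
            self.bytes_per_row = (
                self.alpha * observed + (1 - self.alpha) * self.bytes_per_row
            )

    def num_rows(self):
        if self.bytes_per_row is None:
            return self.initial_rows
        return max(1, int(self.desired_segment_bytes / self.bytes_per_row))


class PercentileRowEstimator:
    """Hypothetical sketch: reservoir sample of bytes per row plus a percentile."""

    def __init__(self, desired_segment_bytes, percentile=75, capacity=1000,
                 initial_rows=1_000_000, seed=0):
        self.desired_segment_bytes = desired_segment_bytes
        self.percentile = percentile
        self.capacity = capacity          # reservoir size (assumed)
        self.initial_rows = initial_rows
        self.samples = []
        self.seen = 0
        self.rng = random.Random(seed)

    def update(self, rows, size_bytes):
        if rows <= 0:
            return
        observed = size_bytes / rows
        self.seen += 1
        if len(self.samples) < self.capacity:
            self.samples.append(observed)
        else:
            # Algorithm R: keep the new value with probability capacity / seen.
            slot = self.rng.randrange(self.seen)
            if slot < self.capacity:
                self.samples[slot] = observed

    def num_rows(self):
        if not self.samples:
            return self.initial_rows
        ordered = sorted(self.samples)
        # Nearest-rank percentile over the sampled bytes-per-row values.
        rank = max(1, math.ceil(self.percentile / 100 * len(ordered)))
        return max(1, int(self.desired_segment_bytes / ordered[rank - 1]))
```

With a single extreme observation in the history, the EMA estimate shifts toward it in proportion to `alpha`, while the percentile estimate only moves if the outlier falls at or below the chosen rank.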

Configuration is read directly from the MergeRollupTask config map, following the
eraseDimensionValues pattern. No changes to the shared SegmentConfig or framework.

Example config:
{
  "MergeRollupTask": {
    "maxSegmentSizeBytesPerTask": "4194000",
    "desiredSegmentSizeBytes": "209715200",
    "segmentSizingStrategy": "PERCENTILE",
    "sizingPercentile": "75"
  }
}
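Because the task config map carries every value as a string, the options above need to be parsed and validated before use. The following is a rough Python sketch of that step, not the PR's implementation: the `SizingConfig` type, the helper names, and the validation rules (positive integers, percentile at most 100) are assumptions. Only the four key names come from the example config. A missing key is returned as `None`, which a caller could treat as "not configured".

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class SizingConfig:
    """Hypothetical container for the parsed sizing options."""
    max_segment_size_bytes_per_task: Optional[int]
    desired_segment_size_bytes: Optional[int]
    strategy: Optional[str]
    percentile: Optional[int]


def _positive_int(config: Mapping[str, str], key: str) -> Optional[int]:
    """Return the value for key as a positive int, or None if the key is absent."""
    raw = config.get(key)
    if raw is None:
        return None
    value = int(raw)
    if value <= 0:
        raise ValueError(f"{key} must be positive, got {raw!r}")
    return value


def parse_sizing_config(task_config: Mapping[str, str]) -> SizingConfig:
    """Parse the string-valued MergeRollupTask options shown in the example."""
    percentile = _positive_int(task_config, "sizingPercentile")
    if percentile is not None and percentile > 100:
        raise ValueError(f"sizingPercentile must be in 1..100, got {percentile}")
    return SizingConfig(
        max_segment_size_bytes_per_task=_positive_int(
            task_config, "maxSegmentSizeBytesPerTask"),
        desired_segment_size_bytes=_positive_int(
            task_config, "desiredSegmentSizeBytes"),
        strategy=task_config.get("segmentSizingStrategy"),
        percentile=percentile,
    )
```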

Instructions:

The PR has to be tagged with at least one of the following labels (*):

  • feature
  • performance
  • release-notes - New configuration options

Adds maxSegmentSizeBytesPerTask config to group input segments by total size instead of
row count, providing more predictable memory usage for tables with variable row sizes (e.g., theta sketches, HLL).
Falls back to row-based grouping when not configured.
…PerTask limit

Changed task creation logic to check if adding a segment would exceed the configured
size limit before grouping it, rather than after. This prevents tasks from significantly
overshooting the maxSegmentSizeBytesPerTask threshold when grouping multiple segments,
while still allowing single oversized segments to be merged.
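The check-before-add behaviour described above can be sketched as a simple greedy grouping. This is an illustrative Python sketch, not the task generator code from this PR: the function name and the `(name, size)` input shape are assumptions. A group is closed before a segment is added if that segment would push the running total past the limit, and a segment that is larger than the limit on its own still gets a group to itself.

```python
def group_segments_by_size(segments, max_bytes_per_task):
    """Group (name, size_bytes) pairs into tasks without overshooting the limit.

    Hypothetical helper: a segment is added to the current group only if the
    group's total stays within max_bytes_per_task. An empty group always
    accepts the next segment, so a single oversized segment still forms a task.
    """
    groups = []
    current = []
    current_bytes = 0
    for name, size_bytes in segments:
        # Check BEFORE adding: would this segment exceed the limit?
        if current and current_bytes + size_bytes > max_bytes_per_task:
            groups.append(current)
            current = []
            current_bytes = 0
        current.append(name)
        current_bytes += size_bytes
    if current:
        groups.append(current)
    return groups


if __name__ == "__main__":
    sizes = [("a", 40), ("b", 40), ("c", 40), ("d", 150), ("e", 10)]
    print(group_segments_by_size(sizes, max_bytes_per_task=100))
```

Checking after the add instead would have placed "c" in the first group, giving it 120 bytes against a limit of 100.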
davecromberge changed the title from "Feature contrib/adaptive merge rollup segment sizing" to "Adaptive merge rollup segment sizing" on Feb 25, 2026