@davecromberge davecromberge commented Nov 19, 2025

Summary

Enables size-based segment generation for tables with variable-sized data
(e.g., Theta sketches), where static row counts produce inconsistent segment
sizes. Also adds size-based grouping of input segments for MergeRollupTask.

Implemented two strategies:

  • AdaptiveSegmentNumRowProvider: EMA-based learning for homogeneous data
  • PercentileAdaptiveSegmentNumRowProvider: Reservoir sampling with percentile
    estimation for heterogeneous/multi-tenant data (resistant to outliers)
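
To make the difference between the two strategies concrete, here is a minimal sketch of how each could turn observed segment sizes into a row-count target. It is not the code from this PR: the class and method names (`SegmentRowEstimators`, `EmaEstimator`, `PercentileEstimator`, `observe`, `rowsFor`), the `alpha` smoothing factor, the reservoir capacity, and the nearest-rank percentile rule are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.Random;

/** Hypothetical sketch of the two sizing strategies; not the PR's actual classes. */
class SegmentRowEstimators {

  /** Exponential moving average (EMA) of observed bytes per row. */
  public static final class EmaEstimator {
    private final double alpha;          // weight of the newest observation (assumed parameter)
    private double avgBytesPerRow = -1;  // negative means "no observation yet"

    public EmaEstimator(double alpha) {
      this.alpha = alpha;
    }

    public void observe(long segmentSizeBytes, long numRows) {
      double sample = (double) segmentSizeBytes / numRows;
      avgBytesPerRow = avgBytesPerRow < 0 ? sample : alpha * sample + (1 - alpha) * avgBytesPerRow;
    }

    public long rowsFor(long desiredSegmentSizeBytes, long defaultRows) {
      if (avgBytesPerRow < 0) {
        return defaultRows; // nothing learned yet
      }
      return Math.max(1L, (long) (desiredSegmentSizeBytes / avgBytesPerRow));
    }
  }

  /** Fixed-size reservoir (Algorithm R) of bytes-per-row samples with a percentile lookup. */
  public static final class PercentileEstimator {
    private final double[] reservoir;
    private final Random random;
    private long seen = 0;

    public PercentileEstimator(int capacity, long seed) {
      this.reservoir = new double[capacity];
      this.random = new Random(seed);
    }

    public void observe(long segmentSizeBytes, long numRows) {
      double sample = (double) segmentSizeBytes / numRows;
      if (seen < reservoir.length) {
        reservoir[(int) seen] = sample; // still filling the reservoir
      } else {
        // Pick a slot uniformly in [0, seen]; keep the sample only if it lands inside the reservoir.
        long slot = (long) (random.nextDouble() * (seen + 1));
        if (slot < reservoir.length) {
          reservoir[(int) slot] = sample;
        }
      }
      seen++;
    }

    /** Nearest-rank percentile over the samples currently held. */
    public double percentile(int p) {
      int n = (int) Math.min(seen, reservoir.length);
      double[] sorted = Arrays.copyOf(reservoir, n);
      Arrays.sort(sorted);
      int idx = (int) Math.ceil(p / 100.0 * n) - 1;
      return sorted[Math.max(0, Math.min(n - 1, idx))];
    }

    public long rowsFor(long desiredSegmentSizeBytes, int p, long defaultRows) {
      if (seen == 0) {
        return defaultRows; // nothing sampled yet
      }
      return Math.max(1L, (long) (desiredSegmentSizeBytes / percentile(p)));
    }
  }

  public static void main(String[] args) {
    EmaEstimator ema = new EmaEstimator(0.5);
    ema.observe(1_000, 10);
    ema.observe(3_000, 10);
    System.out.println(ema.rowsFor(200_000, 1)); // average is 200 bytes/row

    PercentileEstimator pct = new PercentileEstimator(100, 42L);
    pct.observe(1_000, 10);
    pct.observe(1_000, 10);
    pct.observe(1_000, 10);
    pct.observe(100_000, 10); // one outlier tenant
    System.out.println(pct.rowsFor(200_000, 75, 1));
  }
}
```

In this sketch the single outlier pulls a plain mean up to 2,575 bytes per row, while the 75th percentile stays at 100, which is the outlier resistance the second bullet refers to.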

Configuration is read directly from the MergeRollupTask config map, following the
eraseDimensionValues pattern. No changes to the shared SegmentConfig or framework.

Example config:
{
  "MergeRollupTask": {
    "maxSegmentSizeBytesPerTask": "4194000",
    "desiredSegmentSizeBytes": "209715200",
    "segmentSizingStrategy": "PERCENTILE",
    "sizingPercentile": "75"
  }
}
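
As a rough illustration of reading these keys straight from the task config map, the sketch below parses the four entries shown above. Only the key names and the `PERCENTILE` value come from the example config; the `SizingConfig` class, the use of `-1`/`null` for "not configured", and the 1 to 100 validation range are assumptions, not the PR's code.

```java
import java.util.Map;

/** Hypothetical holder for the sizing options; not a class from the PR. */
final class SizingConfig {
  final long maxSegmentSizeBytesPerTask; // -1 when not configured
  final long desiredSegmentSizeBytes;    // -1 when not configured
  final String segmentSizingStrategy;    // null when not configured
  final int sizingPercentile;            // -1 when not configured

  private SizingConfig(long maxBytes, long desiredBytes, String strategy, int percentile) {
    this.maxSegmentSizeBytesPerTask = maxBytes;
    this.desiredSegmentSizeBytes = desiredBytes;
    this.segmentSizingStrategy = strategy;
    this.sizingPercentile = percentile;
  }

  /** Reads the options from the string-to-string map of the MergeRollupTask config. */
  static SizingConfig fromTaskConfig(Map<String, String> taskConfig) {
    long maxBytes = parsePositiveLong(taskConfig, "maxSegmentSizeBytesPerTask");
    long desiredBytes = parsePositiveLong(taskConfig, "desiredSegmentSizeBytes");
    String strategy = taskConfig.get("segmentSizingStrategy");
    int percentile = -1;
    String rawPercentile = taskConfig.get("sizingPercentile");
    if (rawPercentile != null) {
      percentile = Integer.parseInt(rawPercentile.trim());
      if (percentile < 1 || percentile > 100) { // assumed valid range
        throw new IllegalArgumentException("sizingPercentile out of range: " + rawPercentile);
      }
    }
    return new SizingConfig(maxBytes, desiredBytes, strategy, percentile);
  }

  private static long parsePositiveLong(Map<String, String> config, String key) {
    String raw = config.get(key);
    if (raw == null) {
      return -1L; // caller treats this as "not configured"
    }
    long value = Long.parseLong(raw.trim());
    if (value <= 0) {
      throw new IllegalArgumentException(key + " must be positive: " + raw);
    }
    return value;
  }

  /** Without a size limit, grouping would stay row-based. */
  boolean sizeBasedGroupingEnabled() {
    return maxSegmentSizeBytesPerTask > 0;
  }
}
```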

Instructions:

The PR has to be tagged with at least one of the following labels (*):

  • feature
  • performance
  • release-notes - New configuration options


codecov-commenter commented Nov 19, 2025

❌ 1 Tests Failed:

| Tests completed | Failed | Passed | Skipped |
|-----------------|--------|--------|---------|
| 10107           | 1      | 10106  | 47      |
View the top 3 failed test(s) by shortest run time
org.apache.pinot.plugin.minion.tasks.mergerollup.MergeRollupTaskGeneratorTest::testMaxSegmentSizeBytesPerTask
Stack Traces | 0.014s run time
expected [1] but found [2]
org.apache.pinot.controller.helix.core.minion.PinotTaskManagerDistributedLockingTest::testConcurrentCreateTaskFromMultipleControllers
Stack Traces | 9.07s run time
At least one task generation should have occurred expected [1] but found [2]
org.apache.pinot.controller.helix.core.minion.PinotTaskManagerDistributedLockingTest::testConcurrentCreateTaskFromMultipleControllers
Stack Traces | 12s run time
At least one task generation should have occurred expected [1] but found [2]


Adds maxSegmentSizeBytesPerTask config to group input segments by total size instead of
row count, providing more predictable memory usage for tables with variable row sizes (e.g., theta sketches, HLL).
Falls back to row-based grouping when not configured.
…PerTask limit

Changed task creation logic to check if adding a segment would exceed the configured
size limit before grouping it, rather than after. This prevents tasks from significantly
overshooting the maxSegmentSizeBytesPerTask threshold when grouping multiple segments,
while still allowing single oversized segments to be merged.
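
The check-before-add behaviour described above can be sketched as a simple greedy pass over segment sizes. This is an illustration only: `SizeBasedGrouping` and `group` are hypothetical names, segments are reduced to their byte sizes, and the real task generator also handles time buckets and other constraints not modelled here.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of size-based grouping; not the PR's task generator code. */
final class SizeBasedGrouping {

  /**
   * Groups segment sizes into tasks so that a task is closed before a segment
   * would push it past maxBytesPerTask. A segment larger than the limit still
   * gets a task of its own, so it can be merged.
   */
  static List<List<Long>> group(List<Long> segmentSizesBytes, long maxBytesPerTask) {
    List<List<Long>> tasks = new ArrayList<>();
    List<Long> current = new ArrayList<>();
    long currentBytes = 0;
    for (long size : segmentSizesBytes) {
      // Check before adding: close the current task if this segment would exceed the limit.
      if (!current.isEmpty() && currentBytes + size > maxBytesPerTask) {
        tasks.add(current);
        current = new ArrayList<>();
        currentBytes = 0;
      }
      current.add(size);
      currentBytes += size;
    }
    if (!current.isEmpty()) {
      tasks.add(current);
    }
    return tasks;
  }

  public static void main(String[] args) {
    // With a limit of 100, the 200-byte segment ends up alone in its own task.
    System.out.println(group(List.of(40L, 50L, 30L, 200L, 10L), 100));
  }
}
```

Checking after adding instead would place the 30-byte segment in the first task (120 bytes in total), which is the overshoot this commit avoids.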