perf: Optimize SpillingGrouper to avoid unnecessary disk I/O for small spill runs#19439
Conversation
There was a problem hiding this comment.
Pull request overview
This PR optimizes GroupBy spilling by introducing an output stream that buffers small spill runs in memory and only creates disk files when the serialized spill exceeds the configured threshold, reducing unnecessary file I/O for small spills.
Changes:
- Adds
SpillOutputStreamto switch from heap buffering toLimitedTemporaryStorageonly after the threshold is exceeded. - Refactors
SpillingGrouperspill serialization to use the new stream while preserving pending-run batching and disk-spill behavior. - Adds and updates unit tests for in-memory spill handling, threshold behavior, disk fallback, and storage-limit enforcement.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
processing/src/main/java/org/apache/druid/query/groupby/epinephelinae/SpillOutputStream.java |
Adds the threshold-aware spill output stream. |
processing/src/main/java/org/apache/druid/query/groupby/epinephelinae/SpillingGrouper.java |
Routes grouper spill serialization through SpillOutputStream. |
processing/src/test/java/org/apache/druid/query/groupby/epinephelinae/SpillOutputStreamTest.java |
Adds unit coverage for the new stream behavior. |
processing/src/test/java/org/apache/druid/query/groupby/epinephelinae/SpillingGrouperTest.java |
Updates spilling tests for in-memory small-spill behavior and storage-limit scenarios. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
FrankChen021
left a comment
There was a problem hiding this comment.
I have reviewed the code for correctness, edge cases, concurrency, and integration risks; no issues found.
Reviewed 4 of 4 changed files.
This is an automated review by Codex GPT-5.5
gianm
left a comment
There was a problem hiding this comment.
Seems fine relative to what was there before.
perf: Optimize SpillingGrouper to avoid unnecessary disk I/O for small spill runs
Description
This is a followup to #19357
The spill batching logic (introduced to avoid thousands of tiny disk files) previously had to write to disk first and check the file size afterward, because the serialized size isn't known upfront — and if the serialized data turns out to be large, buffering it entirely in memory before deciding would risk OOM. So the safe path was: always write to a temp file, then read it back into memory only if it was small enough to batch.
This is correct but expensive for certain cases. When groupBy queries produce spill runs whose serialized size is much smaller than their in-memory buffer (e.g., HLL sketches in sparse/SET mode serialize to a fraction of their pre-allocated buffer), this creates thousands of unnecessary file create/write/read/delete cycles just to discover the data was small enough to batch in memory.
SpillOutputStream solves both concerns: it writes to a heap buffer first, and only when the buffer exceeds the threshold does it open a file and flush the accumulated bytes to disk. Large spills still go to disk (no OOM risk), but small spills never touch the filesystem. Peak extra heap is bounded to the threshold size (minSpillFileSize, default 1MB).
Key changed/added classes in this PR
Benchmarks result
Before this PR:
After this PR:
The existing queryMultiQueryableIndexWithSpilling/queryMultiQueryableIndexWithSpillingTTFR uses bufferGrouperMaxSize=4000 which produce reasonable spills of ~200kb. I have also added new benchmarks with similar idea but producing spill files on extremes ends for size. queryMultiQueryableIndexWithSmallSpilling/queryMultiQueryableIndexWithSmallSpillingTTFR sets bufferGrouperMaxSize=100, producing spill size ~6 KB. This would result in more batching. queryMultiQueryableIndexWithLargeSpilling/queryMultiQueryableIndexWithLargeSpillingTTFR sets bufferGrouperMaxSize=70000, producing spill size ~4 MB. This would skip batching. These new benchmarks are not added to the PR since they are really the same as queryMultiQueryableIndexWithSpilling/queryMultiQueryableIndexWithSpillingTTFR just with different config values.
Key changed/added classes in this PR
SpillingGrouperSpillOutputStreamThis PR has: