perf: Optimize SpillingGrouper to avoid unnecessary disk I/O for small spill runs by maytasm · Pull Request #19439 · apache/druid

maytasm · 2026-05-09T02:35:06Z

perf: Optimize SpillingGrouper to avoid unnecessary disk I/O for small spill runs

Description

This is a follow-up to #19357.

The spill batching logic (introduced to avoid thousands of tiny disk files) previously had to write to disk first and check the file size afterward, because the serialized size isn't known upfront — and if the serialized data turns out to be large, buffering it entirely in memory before deciding would risk OOM. So the safe path was: always write to a temp file, then read it back into memory only if it was small enough to batch.

This is correct but expensive for certain cases. When groupBy queries produce spill runs whose serialized size is much smaller than their in-memory buffer (e.g., HLL sketches in sparse/SET mode serialize to a fraction of their pre-allocated buffer), this creates thousands of unnecessary file create/write/read/delete cycles just to discover the data was small enough to batch in memory.
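The round trip described above can be sketched roughly as follows. This is a minimal illustration, not the Druid code: `RoundTripSpill`, `RunSerializer`, and `Result` are hypothetical names, a plain temp file stands in for Druid's temporary storage, and treating a run exactly at the threshold as "small" is an assumption.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical illustration of the write-then-check round trip; not Druid code.
final class RoundTripSpill
{
  /** Stand-in for whatever serializes one spill run; its output size is unknown until it runs. */
  interface RunSerializer
  {
    void writeTo(OutputStream out) throws IOException;
  }

  /** Exactly one field is non-null: bytes kept for batching, or a file left on disk. */
  static final class Result
  {
    final byte[] inMemory;
    final Path file;

    Result(byte[] inMemory, Path file)
    {
      this.inMemory = inMemory;
      this.file = file;
    }
  }

  static Result spill(RunSerializer run, long minSpillFileSize) throws IOException
  {
    // 1. Always create and write a temp file first.
    Path tmp = Files.createTempFile("spill-run", ".tmp");
    try (OutputStream out = Files.newOutputStream(tmp)) {
      run.writeTo(out);
    }
    // 2. Only now is the serialized size known (boundary handling is an assumption).
    if (Files.size(tmp) <= minSpillFileSize) {
      // 3. Small run: read it back for batching and delete the file just created.
      byte[] bytes = Files.readAllBytes(tmp);
      Files.delete(tmp);
      return new Result(bytes, null);
    }
    // Large run: leave it on disk.
    return new Result(null, tmp);
  }
}
```

For a small run, every call pays for a file create, write, read, and delete, even though the bytes end up in memory anyway.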

SpillOutputStream solves both concerns: it writes to a heap buffer first, and only when the buffer exceeds the threshold does it open a file and flush the accumulated bytes to disk. Large spills still go to disk (no OOM risk), but small spills never touch the filesystem. Peak extra heap is bounded to the threshold size (minSpillFileSize, default 1MB).
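A minimal sketch of the buffer-then-spill idea, assuming a plain temp file in place of LimitedTemporaryStorage. `ThresholdSpillStream` and its accessors are hypothetical names chosen for illustration; the actual SpillOutputStream in this PR may differ in its API and in how it enforces storage limits.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch of a threshold-aware stream; not Druid's SpillOutputStream.
final class ThresholdSpillStream extends OutputStream
{
  private final long threshold;
  private ByteArrayOutputStream heap = new ByteArrayOutputStream();
  private OutputStream fileOut; // null while everything still fits in memory
  private Path file;
  private long written;

  ThresholdSpillStream(long threshold)
  {
    this.threshold = threshold;
  }

  @Override
  public void write(int b) throws IOException
  {
    write(new byte[]{(byte) b}, 0, 1);
  }

  @Override
  public void write(byte[] buf, int off, int len) throws IOException
  {
    // Switch before buffering past the threshold, so heap use never exceeds it.
    if (fileOut == null && written + len > threshold) {
      switchToDisk();
    }
    if (fileOut != null) {
      fileOut.write(buf, off, len);
    } else {
      heap.write(buf, off, len);
    }
    written += len;
  }

  private void switchToDisk() throws IOException
  {
    // Assumption: a temp file stands in for storage obtained from LimitedTemporaryStorage.
    file = Files.createTempFile("spill-run", ".tmp");
    fileOut = Files.newOutputStream(file);
    heap.writeTo(fileOut); // flush the bytes accumulated so far
    heap = null;           // release the heap buffer
  }

  @Override
  public void close() throws IOException
  {
    if (fileOut != null) {
      fileOut.close();
    }
  }

  boolean isInMemory()
  {
    return fileOut == null;
  }

  byte[] toByteArray()
  {
    if (heap == null) {
      throw new IllegalStateException("data was spilled to disk");
    }
    return heap.toByteArray();
  }

  Path getFile()
  {
    return file;
  }
}
```

In this sketch a caller serializes into the stream, closes it, and then checks `isInMemory()` to decide whether to batch the bytes or keep the file. Writes that land exactly on the threshold stay in memory; only exceeding it opens a file.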

Key changed/added classes in this PR

Introduces SpillOutputStream, an OutputStream that buffers in memory and only spills to disk when written bytes exceed the minSpillFileSize threshold. This eliminates the previous write-to-file → check-size → read-back → delete round-trip for small spill runs. This integrates with LimitedTemporaryStorage and hence still enforces all the limits.
Refactors SpillingGrouper.spill() to serialize through SpillOutputStream instead of always creating a temp file first. The serialization logic is extracted into serializeToStream() to separate it from file lifecycle management.
Adds comprehensive unit tests for SpillOutputStream covering in-memory buffering, disk spillover, single-byte writes, threshold boundary behavior, and error handling.

Benchmark results

Before this PR:

Benchmark                                                  (initialBuckets)  (numProcessingThreads)  (numSegments)  (queryGranularity)  (rowsPerSegment)  (schemaAndQuery)  (vectorize)  Mode  Cnt       Score       Error  Units
GroupByBenchmark.queryMultiQueryableIndexWithSpilling                         -1                       4              4                 all            100000           basic.A        force  avgt   15  352910.001 ±  7477.535  us/op
GroupByBenchmark.queryMultiQueryableIndexWithSpillingTTFR                     -1                       4              4                 all            100000           basic.A        force  avgt   15  149027.367 ±  4829.306  us/op

New Benchmarks
bufferGrouperMaxSize=100 (spill size ~6 KB)
GroupByBenchmark.queryMultiQueryableIndexWithSmallSpilling                    -1                       4              4                 all            100000           basic.A        force  avgt   15  750775.683 ± 10103.062  us/op
GroupByBenchmark.queryMultiQueryableIndexWithSmallSpillingTTFR                -1                       4              4                 all            100000           basic.A        force  avgt   15  539507.482 ±  7332.067  us/op

bufferGrouperMaxSize=70000 (spill size ~4 MB)
GroupByBenchmark.queryMultiQueryableIndexWithLargeSpilling                    -1                       4              4                 all           1000000           basic.A        force  avgt   15  3529647.635 ±  69419.867  us/op
GroupByBenchmark.queryMultiQueryableIndexWithLargeSpillingTTFR                -1                       4              4                 all           1000000           basic.A        force  avgt   15  1358536.490 ± 139732.304  us/op

After this PR:

Benchmark                                                  (initialBuckets)  (numProcessingThreads)  (numSegments)  (queryGranularity)  (rowsPerSegment)  (schemaAndQuery)  (vectorize)  Mode  Cnt       Score       Error  Units
GroupByBenchmark.queryMultiQueryableIndexWithSpilling                         -1                       4              4                 all            100000           basic.A        force  avgt   15  344787.202 ± 13998.456  us/op
GroupByBenchmark.queryMultiQueryableIndexWithSpillingTTFR                     -1                       4              4                 all            100000           basic.A        force  avgt   15  141381.603 ±  3583.413  us/op

New Benchmarks
bufferGrouperMaxSize=100 (spill size ~6 KB)
GroupByBenchmark.queryMultiQueryableIndexWithSmallSpilling                    -1                       4              4                 all            100000           basic.A        force  avgt   15  420431.477 ±  5600.233  us/op
GroupByBenchmark.queryMultiQueryableIndexWithSmallSpillingTTFR                -1                       4              4                 all            100000           basic.A        force  avgt   15  195450.731 ±  2552.290  us/op

bufferGrouperMaxSize=70000 (spill size ~4 MB)
GroupByBenchmark.queryMultiQueryableIndexWithLargeSpilling                    -1                       4              4                 all           1000000           basic.A        force  avgt   15  3569103.927 ± 282616.750  us/op
GroupByBenchmark.queryMultiQueryableIndexWithLargeSpillingTTFR                -1                       4              4                 all           1000000           basic.A        force  avgt   15  1243519.844 ±  80173.130  us/op

The existing queryMultiQueryableIndexWithSpilling/queryMultiQueryableIndexWithSpillingTTFR benchmarks use bufferGrouperMaxSize=4000, which produces moderately sized spills of ~200 KB. I also ran new benchmarks based on the same idea but producing spill files at the two extremes of the size range. queryMultiQueryableIndexWithSmallSpilling/queryMultiQueryableIndexWithSmallSpillingTTFR set bufferGrouperMaxSize=100, producing spills of ~6 KB, which results in more batching. queryMultiQueryableIndexWithLargeSpilling/queryMultiQueryableIndexWithLargeSpillingTTFR set bufferGrouperMaxSize=70000, producing spills of ~4 MB, which skip batching. These new benchmarks are not added to the PR, since they are the same as queryMultiQueryableIndexWithSpilling/queryMultiQueryableIndexWithSpillingTTFR with different config values.

Default spilling (bufferGrouperMaxSize=4000) — within noise:
                                                                                                   
  ┌───────────┬───────────────┬───────────────┬───────┐
  │ Benchmark │      OLD      │      NEW      │ Delta │
  ├───────────┼───────────────┼───────────────┼───────┤
  │ Spilling  │ 352,910 us/op │ 344,787 us/op │ -2.3% │                                            
  ├───────────┼───────────────┼───────────────┼───────┤
  │ TTFR      │ 149,027 us/op │ 141,382 us/op │ -5.1% │
  └───────────┴───────────────┴───────────────┴───────┘
Error bars overlap, so statistically neutral. Spills here are moderately sized and few in number — not the target optimization scenario.
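The overlap claim and the deltas in the table can be rechecked from the raw Score ± Error values with a few lines of arithmetic. The helper below is a hypothetical sketch (`ErrorBars` is not part of the PR); it treats each result as the interval [score - error, score + error] and reports whether two such intervals intersect.

```java
// Hypothetical helper for rechecking the benchmark comparison; not part of the PR.
final class ErrorBars
{
  /** True when [scoreA - errA, scoreA + errA] and [scoreB - errB, scoreB + errB] intersect. */
  static boolean overlap(double scoreA, double errA, double scoreB, double errB)
  {
    return scoreA - errA <= scoreB + errB && scoreB - errB <= scoreA + errA;
  }

  /** Relative change from before to after, in percent (negative means faster). */
  static double deltaPercent(double before, double after)
  {
    return (after - before) / before * 100.0;
  }

  public static void main(String[] args)
  {
    // Default spilling (bufferGrouperMaxSize=4000), values copied from the tables above.
    System.out.println(overlap(352910.001, 7477.535, 344787.202, 13998.456));
    System.out.println(overlap(149027.367, 4829.306, 141381.603, 3583.413));
    System.out.printf("%.1f%%%n", deltaPercent(352910.001, 344787.202));
    System.out.printf("%.1f%%%n", deltaPercent(149027.367, 141381.603));
  }
}
```

Overlapping intervals are only a rough indicator that a difference is within noise; they are not a formal significance test.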

Small spills (~6 KB each, bufferGrouperMaxSize=100) — huge win:

  ┌───────────┬───────────────┬───────────────┬───────┐
  │ Benchmark │      OLD      │      NEW      │ Delta │                                            
  ├───────────┼───────────────┼───────────────┼───────┤
  │ Spilling  │ 750,776 us/op │ 420,431 us/op │ -44%  │
  ├───────────┼───────────────┼───────────────┼───────┤
  │ TTFR      │ 539,507 us/op │ 195,451 us/op │ -64%  │
  └───────────┴───────────────┴───────────────┴───────┘
This is the sweet spot for the optimization. Each spill is ~6 KB, well under the 1 MB MIN_SPILL_FILE_BYTES threshold, so they stay entirely in memory — no file create/write/read/delete round-trip. The TTFR improvement is even larger because the first result no longer waits on disk I/O for early spills.

Large spills (~4 MB each, bufferGrouperMaxSize=70000) — neutral:
  
  ┌───────────┬─────────────────┬─────────────────┬──────────────────────────┐
  │ Benchmark │       OLD       │       NEW       │          Delta           │
  ├───────────┼─────────────────┼─────────────────┼──────────────────────────┤
  │ Spilling  │ 3,529,648 us/op │ 3,569,104 us/op │ +1.1% (noise)            │
  ├───────────┼─────────────────┼─────────────────┼──────────────────────────┤                     
  │ TTFR      │ 1,358,536 us/op │ 1,243,520 us/op │ -8.5% (large error bars) │
  └───────────┴─────────────────┴─────────────────┴──────────────────────────┘
Spills exceed the 1 MB threshold and go to disk in both versions, so no difference.


The optimization eliminates disk I/O for small spills, saving ~330,000 us/op total and ~344,000 us/op to first result in the small-spill case. Large spills exceed the 1MB threshold and hit disk regardless, so no change there.

The optimization delivers exactly where designed — many small spills that previously hit disk now stay in memory, cutting latency 44-64%. Large spills are unaffected. No regressions.

Key changed/added classes in this PR

SpillingGrouper
SpillOutputStream

Copilot

Pull request overview

This PR optimizes GroupBy spilling by introducing an output stream that buffers small spill runs in memory and only creates disk files when the serialized spill exceeds the configured threshold, reducing unnecessary file I/O for small spills.

Changes:

Adds SpillOutputStream to switch from heap buffering to LimitedTemporaryStorage only after the threshold is exceeded.
Refactors SpillingGrouper spill serialization to use the new stream while preserving pending-run batching and disk-spill behavior.
Adds and updates unit tests for in-memory spill handling, threshold behavior, disk fallback, and storage-limit enforcement.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

File	Description
`processing/src/main/java/org/apache/druid/query/groupby/epinephelinae/SpillOutputStream.java`	Adds the threshold-aware spill output stream.
`processing/src/main/java/org/apache/druid/query/groupby/epinephelinae/SpillingGrouper.java`	Routes grouper spill serialization through `SpillOutputStream`.
`processing/src/test/java/org/apache/druid/query/groupby/epinephelinae/SpillOutputStreamTest.java`	Adds unit coverage for the new stream behavior.
`processing/src/test/java/org/apache/druid/query/groupby/epinephelinae/SpillingGrouperTest.java`	Updates spilling tests for in-memory small-spill behavior and storage-limit scenarios.

FrankChen021

I have reviewed the code for correctness, edge cases, concurrency, and integration risks; no issues found.

Reviewed 4 of 4 changed files.

This is an automated review by Codex GPT-5.5

gianm

Seems fine relative to what was there before.

Optimize Spill files more

957c020

maytasm marked this pull request as draft May 9, 2026 02:35

maytasm added 2 commits May 10, 2026 19:14

revert BufferedOutputStream change

b0c49f8

revert BufferedOutputStream change

a93da31

maytasm changed the title ~~Optimize Spill files more v2~~ perf: Optimizes SpillingGrouper spill logic May 14, 2026

maytasm changed the title ~~perf: Optimizes SpillingGrouper spill logic~~ perf: Optimize SpillingGrouper to avoid unnecessary disk I/O for small spill runs May 14, 2026

maytasm marked this pull request as ready for review May 14, 2026 05:04

Add tests

0471ad5

maytasm requested review from Copilot and jtuglu1 May 14, 2026 07:45

Copilot started reviewing on behalf of maytasm May 14, 2026 07:45

Copilot AI reviewed May 14, 2026

Fix check

e417b72

FrankChen021 reviewed May 14, 2026

gianm approved these changes May 18, 2026

maytasm merged commit 781aba6 into apache:master May 19, 2026
138 of 142 checks passed

maytasm deleted the spill_file_improvement_v2 branch May 19, 2026 05:53

github-actions Bot added this to the 38.0.0 milestone May 19, 2026

gianm mentioned this pull request May 19, 2026

perf: vectorize topN native engine #19353

Draft

10 tasks

maytasm merged 5 commits into apache:master from maytasm:spill_file_improvement_v2

4 participants