
perf: add spark.comet.exec.shuffle.maxBufferedBatches config#3800

Closed
andygrove wants to merge 2 commits into apache:main from andygrove:shuffle-max-buffered-batches

Conversation

@andygrove
Member

@andygrove andygrove commented Mar 26, 2026

Which issue does this PR close?

Closes #.

Rationale for this change

When shuffle spills only when the memory pool is exhausted, peak memory usage on executors can be very high — especially with many concurrent tasks. Spilling earlier, before memory pressure is critical, reduces peak memory at the cost of slightly more disk I/O.

What changes are included in this PR?

  • Adds spark.comet.exec.shuffle.maxBufferedBatches config (default 0 = disabled). When set, the native shuffle repartitioner spills once it has buffered this many batches, rather than waiting for the memory pool to refuse an allocation.
  • Fixes a file descriptor leak: spill files are now closed after each spill event and reopened in append mode for the next, so FD usage is proportional to active writes rather than to the number of partitions that have ever spilled.
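A minimal sketch of the early-spill trigger described above. The type and field names here (`PartitionBuffer`, `buffered_batches`, `max_buffered_batches`) are illustrative, not the actual Comet repartitioner types; the only behaviour taken from the PR is that 0 disables the check and that spilling triggers once the buffered-batch count reaches the configured limit.

```rust
// Hypothetical sketch of the batch-count spill trigger; names are
// illustrative, not Comet's real internals.
struct PartitionBuffer {
    buffered_batches: usize,
    // 0 = disabled: spill only when the memory pool is exhausted.
    max_buffered_batches: usize,
}

impl PartitionBuffer {
    /// Returns true when the opt-in batch limit is set and reached,
    /// i.e. the buffer should spill before any memory-pool pressure.
    fn should_spill_early(&self) -> bool {
        self.max_buffered_batches > 0
            && self.buffered_batches >= self.max_buffered_batches
    }
}

fn main() {
    // Default (0) preserves existing behaviour: never spill early.
    let disabled = PartitionBuffer { buffered_batches: 100, max_buffered_batches: 0 };
    assert!(!disabled.should_spill_early());

    // With a limit set, reaching it triggers an early spill.
    let capped = PartitionBuffer { buffered_batches: 8, max_buffered_batches: 8 };
    assert!(capped.should_spill_early());
    println!("ok");
}
```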

How are these changes tested?

Existing shuffle tests cover the spill path. The new config defaults to 0 (disabled), so no existing behaviour changes without opt-in.

Add a new configuration option to limit the number of batches buffered
in memory before spilling during native shuffle. Setting a small value
causes earlier spilling, reducing peak memory usage on executors at the
cost of more disk I/O. The default of 0 preserves existing behavior
(spill only when the memory pool is exhausted).

Also fix a too-many-open-files issue where each partition held one spill
file descriptor open for the lifetime of the task. The spill file is now
closed after each spill event and reopened in append mode for the next,
keeping FD usage proportional to active writes rather than total partitions.
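The FD fix can be sketched as follows. This is not the actual Comet code; it only illustrates the pattern the commit message describes: open the spill file just for the duration of a spill event, in append mode, and let the descriptor close when the handle is dropped, so later spills extend the same file without holding it open.

```rust
use std::fs::OpenOptions;
use std::io::Write;
use std::path::Path;

/// Append one spill's worth of bytes to `path`. The file handle is
/// created inside this function and dropped (closed) on return, so no
/// descriptor outlives the spill event.
fn spill_batch(path: &Path, bytes: &[u8]) -> std::io::Result<()> {
    let mut f = OpenOptions::new()
        .create(true)  // first spill creates the file
        .append(true)  // later spills extend it
        .open(path)?;
    f.write_all(bytes)
}

fn main() -> std::io::Result<()> {
    let path = std::env::temp_dir().join("comet_spill_demo.bin");
    let _ = std::fs::remove_file(&path);

    spill_batch(&path, b"batch-1;")?;
    // Second spill reopens the same file in append mode.
    spill_batch(&path, b"batch-2;")?;

    let contents = std::fs::read(&path)?;
    assert_eq!(contents, b"batch-1;batch-2;");
    std::fs::remove_file(&path)
}
```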
@andygrove
Member Author

This does not work in practice

@andygrove andygrove closed this Mar 27, 2026
@andygrove andygrove deleted the shuffle-max-buffered-batches branch March 27, 2026 15:51
