feat: use byte-based target batch size for shuffle IPC blocks #3913
Closed
andygrove wants to merge 3 commits into apache:main
Conversation
Replace the row-based batch size (8192 rows) with a byte-based threshold (default 1 MiB) for coalescing small batches before writing shuffle IPC blocks. For narrow schemas, 8192 rows can produce tiny blocks with high per-block schema overhead. A byte-based threshold ensures reasonably sized blocks regardless of schema width.

Changes:
- Replace BatchCoalescer (row-based) with byte-based accumulation in BufBatchWriter
- Switch SinglePartitionShufflePartitioner buffering to byte-based
- Add target_batch_bytes parameter to MultiPartitionShuffleRepartitioner (keeping row-based batch_size for scratch space and input slicing)
- Add COMET_SHUFFLE_TARGET_BATCH_BYTES config (spark.comet.exec.shuffle.targetBatchBytes)
- Add target_batch_bytes field to ShuffleWriter protobuf message
- Fix bug: COMET_SHUFFLE_WRITE_BUFFER_SIZE used .max(Int.MaxValue) instead of .min(Int.MaxValue), always sending 2GB regardless of config
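The buffer-size bug is easy to see in isolation. The actual fix is in the Scala config-to-protobuf conversion; the sketch below is only a Rust analogue with an illustrative function name, showing why max produced a ~2 GB value for every configured size.

```rust
/// Illustrative Rust analogue of the Scala fix; the real change swaps
/// `.max(Int.MaxValue)` for `.min(Int.MaxValue)` when narrowing the
/// configured Long to an Int for the protobuf message.
fn buffer_size_for_protobuf(configured_bytes: i64) -> i32 {
    // Buggy version: `configured_bytes.max(i32::MAX as i64) as i32` returns
    // i32::MAX (about 2 GiB) for every realistic configuration value.
    // Fixed version: clamp *down* into the i32 range instead.
    configured_bytes.min(i32::MAX as i64) as i32
}
```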
andygrove (Author):
I tried running this with TPC-H @ 1TB and hit OOM
Which issue does this PR close?
Partial fix for #3882
Rationale for this change
The native shuffle writer currently uses a row-based target batch size of 8192 rows for coalescing small batches before writing IPC blocks. For narrow schemas (few columns, small data types), this produces tiny blocks with disproportionate per-block IPC schema overhead.
A byte-based threshold ensures reasonably sized blocks regardless of schema width, improving shuffle write efficiency for narrow schemas without negatively impacting wide schemas.
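To make this concrete, the snippet below (a standalone arrow-rs example, not Comet code) builds an 8192-row batch with a single Int64 column and prints its in-memory size, which is on the order of 64 KiB: far below the 1 MiB default target, so a row-count limit alone yields many tiny IPC blocks.

```rust
use std::sync::Arc;

use arrow::array::Int64Array;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::error::ArrowError;
use arrow::record_batch::RecordBatch;

fn main() -> Result<(), ArrowError> {
    // A narrow schema: a single non-nullable Int64 column.
    let schema = Arc::new(Schema::new(vec![Field::new("id", DataType::Int64, false)]));

    // 8192 rows, the row-based batch size used before this change.
    let ids = Int64Array::from_iter_values(0..8192);
    let batch = RecordBatch::try_new(schema, vec![Arc::new(ids)])?;

    // Roughly 8192 * 8 bytes = 64 KiB of data, well under a 1 MiB target,
    // so every IPC block written at this row count is tiny.
    println!("rows = {}, bytes = {}", batch.num_rows(), batch.get_array_memory_size());
    Ok(())
}
```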
What changes are included in this PR?
- Replace BatchCoalescer (row-based) with byte-based accumulation in BufBatchWriter: batches are buffered until their total memory size reaches the target, then concatenated and written as a single IPC block (see the sketch after this list)
- Switch SinglePartitionShufflePartitioner buffering from a row-count to a byte-size threshold
- Add a target_batch_bytes parameter to MultiPartitionShuffleRepartitioner, while keeping the row-based batch_size for scratch space sizing and input batch slicing (which is about processing chunk limits, not output block size)
- Add a COMET_SHUFFLE_TARGET_BATCH_BYTES config (spark.comet.exec.shuffle.targetBatchBytes, default 1 MiB)
- Add a target_batch_bytes field to the ShuffleWriter protobuf message
- Add a --target-batch-bytes CLI arg to the standalone shuffle benchmark tool
- Fix a bug where COMET_SHUFFLE_WRITE_BUFFER_SIZE used .max(Int.MaxValue) instead of .min(Int.MaxValue) when converting Long to Int for protobuf, which always sent 2GB regardless of the configured value
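A minimal sketch of the byte-based accumulation idea, written against arrow-rs; the struct and method names are illustrative and this is not the actual BufBatchWriter implementation. Batches accumulate until their combined in-memory size reaches the target, then they are concatenated and emitted as one self-contained IPC block.

```rust
use std::io::Write;
use std::sync::Arc;

use arrow::compute::concat_batches;
use arrow::datatypes::Schema;
use arrow::error::ArrowError;
use arrow::ipc::writer::StreamWriter;
use arrow::record_batch::RecordBatch;

/// Hypothetical byte-based coalescing writer (illustrative names only).
struct ByteCoalescingWriter<W: Write> {
    schema: Arc<Schema>,
    target_batch_bytes: usize,
    buffered: Vec<RecordBatch>,
    buffered_bytes: usize,
    output: W,
}

impl<W: Write> ByteCoalescingWriter<W> {
    fn new(schema: Arc<Schema>, target_batch_bytes: usize, output: W) -> Self {
        Self { schema, target_batch_bytes, buffered: Vec::new(), buffered_bytes: 0, output }
    }

    /// Buffer a batch; once the accumulated in-memory size reaches the
    /// target, flush everything buffered so far.
    fn write(&mut self, batch: RecordBatch) -> Result<(), ArrowError> {
        self.buffered_bytes += batch.get_array_memory_size();
        self.buffered.push(batch);
        if self.buffered_bytes >= self.target_batch_bytes {
            self.flush_buffered()?;
        }
        Ok(())
    }

    /// Concatenate the buffered batches and write them as one IPC stream
    /// (schema + one record batch), mimicking a single shuffle block.
    fn flush_buffered(&mut self) -> Result<(), ArrowError> {
        if self.buffered.is_empty() {
            return Ok(());
        }
        let combined = concat_batches(&self.schema, &self.buffered)?;
        let mut block = Vec::new();
        let mut writer = StreamWriter::try_new(&mut block, &self.schema)?;
        writer.write(&combined)?;
        writer.finish()?;
        drop(writer);
        self.output.write_all(&block)?;
        self.buffered.clear();
        self.buffered_bytes = 0;
        Ok(())
    }
}
```

In this PR, both the single-partition and multi-partition shuffle paths drive the threshold from the new spark.comet.exec.shuffle.targetBatchBytes setting (default 1 MiB) rather than from a fixed row count.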
How are these changes tested?
Existing shuffle tests (19 tests) all pass. The test_batch_coalescing_reduces_size test validates that byte-based coalescing still produces smaller output than no coalescing.
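As a rough stand-in for what that test checks (this is not the actual test code), the example below serializes 64 small batches either as 64 separate IPC streams or as one concatenated batch and prints both totals; the repeated per-block schema and framing overhead makes the un-coalesced total noticeably larger.

```rust
use std::sync::Arc;

use arrow::array::Int64Array;
use arrow::compute::concat_batches;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::error::ArrowError;
use arrow::ipc::writer::StreamWriter;
use arrow::record_batch::RecordBatch;

/// Serialize each batch as its own IPC stream (schema + batch), mimicking
/// one shuffle block per batch, and return the total number of bytes.
fn total_block_bytes(batches: &[RecordBatch], schema: &Schema) -> Result<usize, ArrowError> {
    let mut total = 0;
    for batch in batches {
        let mut buf = Vec::new();
        let mut writer = StreamWriter::try_new(&mut buf, schema)?;
        writer.write(batch)?;
        writer.finish()?;
        drop(writer);
        total += buf.len();
    }
    Ok(total)
}

fn main() -> Result<(), ArrowError> {
    let schema = Arc::new(Schema::new(vec![Field::new("id", DataType::Int64, false)]));

    // 64 tiny batches of 128 rows each.
    let small: Result<Vec<RecordBatch>, ArrowError> = (0..64)
        .map(|_| {
            RecordBatch::try_new(
                schema.clone(),
                vec![Arc::new(Int64Array::from_iter_values(0..128))],
            )
        })
        .collect();
    let small = small?;
    let coalesced = vec![concat_batches(&schema, &small)?];

    println!("64 small blocks: {} bytes", total_block_bytes(&small, &schema)?);
    println!("1 coalesced block: {} bytes", total_block_bytes(&coalesced, &schema)?);
    Ok(())
}
```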