feat(spill): optimize external sort with SpillFileMerger to reduce write amplification#288
Merged
Merged
Conversation
…te amplification Introduce LeveledMerger (LSM-tree-like structure) to manage spill files in levels, reducing read/write amplification from O(N/K) to O(log_K(N)). Also adds dynamic estimation of spill parameters based on actual row sizes. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add Doxygen comments to public methods in LeveledMerger and InMemorySortBuffer - Replace bare 'int' with 'int32_t' in leveled_merger_test.cpp - Remove unused #include <numeric> - Add integer type convention to docs/code-style.md Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- WriteBufferTest.TestMergeSpilledFilesSkipsWithSingleFile: update min file handles from 1 to 2 (now enforced minimum) - WriteBufferTest.TestSpillDiskQuotaEnforcement Case 3: use single spill_file_size as quota since leveled compaction changes disk usage - WriteInteTest: relax intermediate file count assertions to allow leveled merger's multi-level structure (files cleaned at read time) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…-free When FlushWriteBuffer fails mid-operation (e.g., IO error during rolling_writer->Write()), the producer thread may leave a KeyValue cached in merge_function_wrapper_. That KeyValue holds Arrow data referencing SpillReader::arrow_pool_ via raw pointer. After the async producer/consumer is closed and SpillReaders are destroyed, the cached KeyValue becomes a dangling reference, causing SEGV during MergeTreeWriter destruction. Fix: call merge_function_wrapper_->Reset() in the write_guard ScopeGuard to release cached Arrow data before SpillReaders are destroyed. Also adapts MergeTreeWriterTest assertions to accommodate leveled compaction file count behavior. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
lxy-9602
reviewed
May 18, 2026
…t use-after-free Move the merge_function_wrapper_->Reset() into SortMergeReader::Close() (both LoserTree and MinHeap variants) so that cached KeyValue data is released before underlying readers are destroyed. This prevents Arrow Buffer objects from calling Free() on a dangling pool pointer when SpillReader is later destructed. Also reorder the cleanup in MergeTreeCompactRewriter::RewriteCompaction to call reader->Close() before merge_file_split_read_.reset(), ensuring the Reset() triggered by Close() runs while all resource providers are still alive. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…arity - Rename class LeveledMerger → SpillFileMerger, files leveled_merger → spill_file_merger - Rename Compact/Compaction methods to Merge (RunMergeIfNeeded, RunFinalMergeIfNeeded, etc.) - Move kMinFanIn validation from CoreOptions to ExternalSortBuffer::Create - Improve variable names (l→level_idx, a/b→lhs/rhs, n→files_to_merge, f→file) - Convert 6 spill tests from TEST_P to TEST_F (no parameterization needed) - Fix 4 misused TEST_P in btree_global_index and file_system tests - Make test assertions exact instead of range-based Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add TestSpillWithIOException exercising IO error injection across all spill code paths (write, intermediate merge, final merge, flush) - Add SetMaxFanInToLargerValueSuppressesCompaction verifying dynamic fan-in - Improve inline comments explaining leveled merge behavior in spill tests Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…mments - Rename compact/compaction to merge in spill-related comments and test names - Refine Write/FlushMemory return value variable names for clearer semantics - Add Write 3 case to TestSpillDiskQuotaEnforcement - Add comments in sort_buffer_test.cpp for Clear and empty batch behavior - Fix ASSERT_LE to ASSERT_EQ for deterministic spill file counts in inte tests Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add WriteContextBuilder::SetWriteBufferSpillThreadNumber(int32_t) to control Arrow IPC thread usage during spill. When > 0, sets Arrow CPU thread pool capacity and enables use_threads in SpillReader/SpillWriter. The bool is passed through the full constructor chain: KeyValueFileStoreWrite -> MergeTreeWriter -> WriteBuffer -> ExternalSortBuffer -> SpillReader/SpillWriter. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…efault values and option overrides
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Purpose
Optimize external sort spill performance by introducing a SpillFileMerger (LSM-tree-like structure) to manage spill files, and add
SetWriteBufferSpillThreadNumberAPI to control Arrow IPC threading during spill.Key changes:
SpillFileMergerclass that organizes spill files into levels and triggers merge when a level accumulates >= max_fan_in files, reducing read/write amplification from O(N/K) to O(log_K(N))spill_batch_size_,actual_max_fan_in_) based on actual row sizes to better utilize memory budgetWriteContextBuilder::SetWriteBufferSpillThreadNumber(int32_t)API to control Arrow thread pool capacity anduse_threadsin SpillReader/SpillWriter, passed asbool enable_multi_thread_spillthrough the constructor chain: KeyValueFileStoreWrite -> MergeTreeWriter -> WriteBuffer -> ExternalSortBuffer -> SpillReader/SpillWriterlocal-sort.max-num-file-handlesconfig (must be >= 2)MergeAndReplaceFilesnow only deletes input files on success, relying onSpillChannelManager::Reset()for cleanup on failuremerge_function_wrapperinSortMergeReader::Closeand on flush errorGetParam())Tests
SpillFileMergerTest.NoMergeBelowFanInSpillFileMergerTest.MergeTriggeredAtFanInSpillFileMergerTest.MinimalFanInTwoSpillFileMergerTest.MultiLevelMergeSpillFileMergerTest.ManyFilesWithFanInTwoSpillFileMergerTest.FinalCleanupReducesFileCountSpillFileMergerTest.FinalCleanupMergesSmallestFirstSpillFileMergerTest.FinalCleanupNoOpWhenAlreadyBelowTargetSpillFileMergerTest.FinalCleanupConvergesToTargetSpillFileMergerTest.MergeFnFailurePreservesStateSpillFileMergerTest.ClearRemovesAllFilesSpillFileMergerTest.SetMaxFanInAffectsMergeSpillFileMergerTest.SetMaxFanInToLargerValueSuppressesMergeSpillFileMergerTest.MergeOnlyTakesFanInFilesFromLevelSortBufferTest.TestInMemorySortBufferEstimateMemoryUseForEachRowMergeTreeWriterTest.TestSpillWithIOExceptionWriteBufferTest.TestSpillDiskQuotaEnforcement(added Write 3 case)ReadContextTest,ScanContextTest,WriteContextTestwith default value and option override coverageAPI and Format
WriteContextBuilder::SetWriteBufferSpillThreadNumber(int32_t thread_number)<= 0: disables Arrow IPC threading in spill (default)> 0: setsarrow::SetCpuThreadPoolCapacity(thread_number)and enablesuse_threadsin SpillReader/SpillWriterDocumentation
No.
Generative AI tooling
Generated-by: Claude Code (Claude Opus 4.6)