Fix massive spill files for StringView/BinaryView columns II #21633
adriangb merged 9 commits into apache:main from
Conversation
[dependencies]
arrow = { workspace = true }
arrow-data = { workspace = true }
This dependency seems unused.
The only occurrence of arrow_data is at https://github.com/apache/datafusion/pull/21633/changes#diff-1f7d15c867929af294664ebbde4e8c9038186222cbb95ed86e527406cf066e84R463 for a test helper.
// on top of the sliced size for views buffer. This matches the intended semantics of
// "bytes needed if we materialized exactly this slice into fresh buffers".
// This is a workaround until https://github.com/apache/arrow-rs/issues/8230
if let Some(sv) = array.as_any().downcast_ref::<StringViewArray>() {
The same is needed for BinaryViewArray, no?
Add garbage collection for StringView and BinaryView arrays before spilling to disk. This prevents sliced arrays from carrying their entire original buffers when written to spill files.

Changes:
- Add gc_view_arrays() function to apply GC on view arrays
- Integrate GC into InProgressSpillFile::append_batch()
- Use simple threshold-based heuristic (100+ rows, 10KB+ buffer size)

Fixes apache#19414 where GROUP BY on StringView columns created 820MB spill files instead of 33MB due to sliced arrays maintaining references to original buffers. Testing shows 80-98% reduction in spill file sizes for typical GROUP BY workloads.
- Replace row count heuristic with 10KB memory threshold
- Improve documentation and add inline comments
- Remove redundant test_exact_clickbench_issue_19414
- Maintains 96% reduction in spill file sizes
The SpillManager now handles GC for StringView/BinaryView arrays internally via gc_view_arrays(), making the organize_stringview_arrays() function in external sort redundant.

Changes:
- Remove organize_stringview_arrays() call and function from sort.rs
- Use batch.clone() for early return (cheaper than creating new batch)
- Use arrow_data::MAX_INLINE_VIEW_LEN constant instead of custom constant
- Update comment in spill_manager.rs to reference gc_view_arrays()
Address review comments from PR apache#19444:
- Replace row count heuristic with 10KB memory threshold
- Add comprehensive documentation explaining GC rationale and mechanism
- Use direct array parameter for better type safety
- Maintain early return optimization for non-view arrays

The GC now triggers based on actual buffer memory usage rather than row counts, providing more accurate and efficient garbage collection for sliced StringView/BinaryView arrays during spilling. Tests confirm 80%+ reduction in spill file sizes for pathological cases like ClickBench (820MB -> 33MB).
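The memory-threshold heuristic the commits describe can be sketched as a pure function over byte counts. This is an illustrative model, not the DataFusion implementation: the function name, parameter, and the exact 10KB cutoff are taken from the commit message, and `data_buffers_size` is a plain argument here rather than a value computed from an arrow array.

```rust
/// Hypothetical sketch of the spill-time GC heuristic described above:
/// trigger GC only when a view array's backing data buffers exceed a
/// fixed byte threshold, rather than counting rows.
const GC_THRESHOLD_BYTES: usize = 10 * 1024; // 10KB cutoff from the commit message

/// Decide whether a view array is worth compacting before spilling,
/// given the total size of its data buffers in bytes.
fn should_gc_view_array(data_buffers_size: usize) -> bool {
    data_buffers_size > GC_THRESHOLD_BYTES
}

fn main() {
    // Small backing buffers: skip GC, copying would cost more than it saves.
    assert!(!should_gc_view_array(4 * 1024));
    // A slice still pinning large original buffers: compact before writing.
    assert!(should_gc_view_array(820 * 1024 * 1024));
    println!("ok");
}
```

The design choice mirrored here is that a byte threshold tracks the actual cost of spilling the buffers, whereas a row count does not distinguish short inline strings from rows that pin large data buffers.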
- Return post-GC sliced size from append_batch so callers use the correct post-GC size for memory accounting (fixes cetra3's CHANGES_REQUESTED: max_record_batch_size was measured pre-GC in sort.rs and spill_manager.rs)
- Fix incorrect comment claiming Arrow gc() is a no-op; it always allocates new compact buffers
- Add comment in should_gc_view_array explaining why we sum data_buffers directly instead of using get_buffer_memory_size()
- Enhance append_batch doc comment with GC rationale per reviewer request
- Reduce row counts in heavy GC tests
Address PR review: avoid duplicating data-buffer size calculation by deriving it from get_buffer_memory_size minus the views buffer.
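The derivation that commit describes can be sketched as plain arithmetic: total buffer memory minus the views buffer, where each view occupies one `u128`. The function and inputs below are illustrative; in the real code the total would come from arrow's `get_buffer_memory_size()` on the array.

```rust
/// Sketch of deriving the data-buffer size for a view array: take the
/// total buffer memory and subtract the views buffer (one 16-byte u128
/// per view) instead of summing the data buffers separately.
/// Names are illustrative, not the actual DataFusion code.
const VIEW_SIZE_BYTES: usize = std::mem::size_of::<u128>(); // 16

fn data_buffers_size(total_buffer_memory: usize, num_views: usize) -> usize {
    total_buffer_memory - num_views * VIEW_SIZE_BYTES
}

fn main() {
    // e.g. 1000 views (16,000 bytes of view structs) plus 50,000 bytes of string data
    let total = 1000 * VIEW_SIZE_BYTES + 50_000;
    assert_eq!(data_buffers_size(total, 1000), 50_000);
    println!("ok");
}
```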
Force-pushed from bd7fa4c to 1530ba7
cc @alamb in case you want to re-review before merging. Otherwise I plan to merge this in a day or so.
alamb left a comment
This makes sense to me @adriangb -- thank you (and thank you @martin-g for the review).
I would personally recommend also doing some sort of "end to end" test -- specifically, set up a sort of StringView data that is mostly sliced and ensure the spill files are not huge.
It was not clear to me if we have tested the spill file size
/// Size of a single view structure in StringView/BinaryView arrays (in bytes).
/// Each view is 16 bytes: 4 bytes length + 4 bytes prefix + 4 bytes buffer index + 4 bytes offset
/// (or, for short values, 4 bytes length + 12 bytes of inline data).
const VIEW_SIZE_BYTES: usize = 16;
Related constant here: https://docs.rs/arrow-data/58.1.0/arrow_data/constant.MAX_INLINE_VIEW_LEN.html
I think arrow uses std::mem::size_of::<u128>() for this value, as each view is a u128
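As a quick sanity check of that claim (plain std, no arrow dependency), a `u128` is indeed 16 bytes, matching the `VIEW_SIZE_BYTES` constant in the diff above:

```rust
fn main() {
    // Each StringView/BinaryView view struct occupies one u128 (16 bytes):
    // either length + inline data, or length + prefix + buffer index + offset.
    assert_eq!(std::mem::size_of::<u128>(), 16);
    println!("view size = {} bytes", std::mem::size_of::<u128>());
}
```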
}

fn gc_array_children(array: &ArrayRef) -> Result<(ArrayRef, bool)> {
    let data = array.to_data();
FWIW to_data is not free (it allocates a vec, etc)
But I see the need to traverse a nested array and gc the whole thing. Short of adding specific code for each array type I think using ArrayData is the best we can do
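The recursive traversal being discussed can be modeled without arrow at all. The toy types below are purely illustrative (a `ViewArray` leaf tracked only by byte counts, a `Struct` of children); the point is the shape of the recursion: GC each view-array leaf that references less than its full buffers, and report upward whether anything changed so parents rebuild only when needed.

```rust
// Toy model (std only, no arrow) of traversing a nested array and
// GC-ing the whole thing, as the comment above describes.
#[derive(Clone, Debug, PartialEq)]
enum Node {
    // live = bytes actually referenced by views; total = bytes in backing buffers
    ViewArray { live: usize, total: usize },
    Struct(Vec<Node>),
}

/// Returns (possibly compacted node, whether anything was compacted).
fn gc_children(node: &Node) -> (Node, bool) {
    match node {
        Node::ViewArray { live, total } if live < total => {
            // Compacting copies only the referenced bytes into fresh buffers.
            (Node::ViewArray { live: *live, total: *live }, true)
        }
        Node::ViewArray { .. } => (node.clone(), false),
        Node::Struct(children) => {
            let mut changed = false;
            let new_children = children
                .iter()
                .map(|c| {
                    let (n, ch) = gc_children(c);
                    changed |= ch;
                    n
                })
                .collect();
            (Node::Struct(new_children), changed)
        }
    }
}

fn main() {
    // A nested array whose view-array child pins a 10KB buffer for 100 live bytes.
    let nested = Node::Struct(vec![Node::ViewArray { live: 100, total: 10_000 }]);
    let (gced, changed) = gc_children(&nested);
    assert!(changed);
    assert_eq!(gced, Node::Struct(vec![Node::ViewArray { live: 100, total: 100 }]));
}
```

In the real code the generic walk goes through `ArrayData`, which, as noted above, costs an allocation per node but avoids writing type-specific code for every nested array type.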
/// Appends a `RecordBatch` to the spill file, initializing the writer if necessary.
///
/// Before writing, performs GC on StringView/BinaryView arrays to compact backing
FWIW I think the same general approach might be useful / needed for other "view" type arrays -- I am thinking sliced ListView for example, as well as sliced ListArray and sliced Utf8 🤔
Maybe as a follow on PR
Yes I think there's a larger discussion around the footgun of view arrays sharing data and how slicing in general can result in fragmentation of data in arrays and wasted memory. I don't know what the big picture story is but personally I feel it would be reasonable to have some heuristic like "if an array becomes more than 50% dead references gc it"
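The "more than 50% dead references" heuristic floated here can be sketched as a one-line predicate. This is purely illustrative of the idea in the comment, not anything that exists in DataFusion:

```rust
/// Illustrative sketch: compact a view array once the bytes its views
/// actually reference fall below half of the bytes its backing buffers hold.
fn mostly_dead(referenced_bytes: usize, buffer_bytes: usize) -> bool {
    // "more than 50% dead": live data is less than half the buffer.
    referenced_bytes * 2 < buffer_bytes
}

fn main() {
    // A slice referencing ~33MB of an ~820MB buffer is overwhelmingly dead.
    assert!(mostly_dead(33, 820));
    // A freshly built array referencing all of its buffer is fully live.
    assert!(!mostly_dead(100, 100));
}
```

Such a ratio-based rule would complement a fixed byte threshold: the threshold caps absolute waste, while the ratio catches arrays that are small but mostly garbage.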
I don't know that there is anything like that in DataFusion. Do you think it's needed for this PR, or is it just a general idea? Anecdotally we've had this PR cherry-picked for months and it does fix the issues we've seen in production.
I don't think it is needed for this PR. I was just thinking that if the size of spill files is important (which I think it is), we should also be testing it somehow (to avoid regressions, etc.)
I opened #21683 to track |
FYI @EeshanBembi