Fix massive spill files for StringView/BinaryView columns II #21633
adriangb merged 9 commits into apache:main from
Conversation
[dependencies]
arrow = { workspace = true }
arrow-data = { workspace = true }
This dependency seems unused.
The only occurrence of arrow_data is at https://github.com/apache/datafusion/pull/21633/changes#diff-1f7d15c867929af294664ebbde4e8c9038186222cbb95ed86e527406cf066e84R463 for a test helper.
// on top of the sliced size for views buffer. This matches the intended semantics of
// "bytes needed if we materialized exactly this slice into fresh buffers".
// This is a workaround until https://github.com/apache/arrow-rs/issues/8230
if let Some(sv) = array.as_any().downcast_ref::<StringViewArray>() {
The same is needed for BinaryViewArray, no?
Add garbage collection for StringView and BinaryView arrays before spilling to disk. This prevents sliced arrays from carrying their entire original buffers when written to spill files.

Changes:
- Add gc_view_arrays() function to apply GC on view arrays
- Integrate GC into InProgressSpillFile::append_batch()
- Use simple threshold-based heuristic (100+ rows, 10KB+ buffer size)

Fixes apache#19414 where GROUP BY on StringView columns created 820MB spill files instead of 33MB due to sliced arrays maintaining references to original buffers. Testing shows 80-98% reduction in spill file sizes for typical GROUP BY workloads.
- Replace row count heuristic with 10KB memory threshold
- Improve documentation and add inline comments
- Remove redundant test_exact_clickbench_issue_19414
- Maintains 96% reduction in spill file sizes
The SpillManager now handles GC for StringView/BinaryView arrays internally via gc_view_arrays(), making the organize_stringview_arrays() function in external sort redundant.

Changes:
- Remove organize_stringview_arrays() call and function from sort.rs
- Use batch.clone() for early return (cheaper than creating new batch)
- Use arrow_data::MAX_INLINE_VIEW_LEN constant instead of custom constant
- Update comment in spill_manager.rs to reference gc_view_arrays()
Address review comments from PR apache#19444:
- Replace row count heuristic with 10KB memory threshold
- Add comprehensive documentation explaining GC rationale and mechanism
- Use direct array parameter for better type safety
- Maintain early return optimization for non-view arrays

The GC now triggers based on actual buffer memory usage rather than row counts, providing more accurate and efficient garbage collection for sliced StringView/BinaryView arrays during spilling. Tests confirm 80%+ reduction in spill file sizes for pathological cases like ClickBench (820MB -> 33MB).
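The memory-threshold heuristic the commits describe can be sketched as a pure function over byte counts. This is an illustrative model, not the DataFusion implementation: the function name, parameter, and the exact 10KB cutoff are taken from the commit message, and `data_buffers_size` is a plain argument here rather than a value computed from an arrow array.

```rust
/// Hypothetical sketch of the spill-time GC heuristic described above:
/// trigger GC only when a view array's backing data buffers exceed a
/// fixed byte threshold, rather than counting rows.
const GC_THRESHOLD_BYTES: usize = 10 * 1024; // 10KB cutoff from the commit message

/// Decide whether a view array is worth compacting before spilling,
/// given the total size of its data buffers in bytes.
fn should_gc_view_array(data_buffers_size: usize) -> bool {
    data_buffers_size > GC_THRESHOLD_BYTES
}

fn main() {
    // Small backing buffers: skip GC, copying would cost more than it saves.
    assert!(!should_gc_view_array(4 * 1024));
    // A slice still pinning large original buffers: compact before writing.
    assert!(should_gc_view_array(820 * 1024 * 1024));
    println!("ok");
}
```

The design choice mirrored here is that a byte threshold tracks the actual cost of spilling the buffers, whereas a row count does not distinguish short inline strings from rows that pin large data buffers.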
- Return post-GC sliced size from append_batch so callers use the correct post-GC size for memory accounting (fixes cetra3's CHANGES_REQUESTED: max_record_batch_size was measured pre-GC in sort.rs and spill_manager.rs)
- Fix incorrect comment claiming Arrow gc() is a no-op; it always allocates new compact buffers
- Add comment in should_gc_view_array explaining why we sum data_buffers directly instead of using get_buffer_memory_size()
- Enhance append_batch doc comment with GC rationale per reviewer request
- Reduce row counts in heavy GC tests
Address PR review: avoid duplicating data-buffer size calculation by deriving it from get_buffer_memory_size minus the views buffer.
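The derivation that commit describes can be sketched as plain arithmetic: total buffer memory minus the views buffer, where each view occupies one `u128`. The function and inputs below are illustrative; in the real code the total would come from arrow's `get_buffer_memory_size()` on the array.

```rust
/// Sketch of deriving the data-buffer size for a view array: take the
/// total buffer memory and subtract the views buffer (one 16-byte u128
/// per view) instead of summing the data buffers separately.
/// Names are illustrative, not the actual DataFusion code.
const VIEW_SIZE_BYTES: usize = std::mem::size_of::<u128>(); // 16

fn data_buffers_size(total_buffer_memory: usize, num_views: usize) -> usize {
    total_buffer_memory - num_views * VIEW_SIZE_BYTES
}

fn main() {
    // e.g. 1000 views (16,000 bytes of view structs) plus 50,000 bytes of string data
    let total = 1000 * VIEW_SIZE_BYTES + 50_000;
    assert_eq!(data_buffers_size(total, 1000), 50_000);
    println!("ok");
}
```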
Force-pushed from bd7fa4c to 1530ba7
cc @alamb in case you want to re-review before merging. Otherwise I plan to merge this in a day or so.
alamb left a comment
This makes sense to me @adriangb -- thank you (and thank you @martin-g for the review).
I would personally recommend also doing some sort of "end to end" test -- specifically, set up a sort of StringView data that is mostly sliced and ensure the spill files are not huge.
It was not clear to me if we have tested the spill file size
/// Size of a single view structure in StringView/BinaryView arrays (in bytes).
/// Each view is 16 bytes: 4 bytes length + 4 bytes prefix + 4 bytes buffer index + 4 bytes offset
/// (or, for short values, 4 bytes length + 12 bytes of inline data).
const VIEW_SIZE_BYTES: usize = 16;
Related constant here: https://docs.rs/arrow-data/58.1.0/arrow_data/constant.MAX_INLINE_VIEW_LEN.html
I think arrow uses std::mem::size_of::<u128>() for this value, as each view is a u128
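As a quick sanity check of that claim (plain std, no arrow dependency), a `u128` is indeed 16 bytes, matching the `VIEW_SIZE_BYTES` constant in the diff above:

```rust
fn main() {
    // Each StringView/BinaryView view struct occupies one u128 (16 bytes):
    // either length + inline data, or length + prefix + buffer index + offset.
    assert_eq!(std::mem::size_of::<u128>(), 16);
    println!("view size = {} bytes", std::mem::size_of::<u128>());
}
```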
}

fn gc_array_children(array: &ArrayRef) -> Result<(ArrayRef, bool)> {
    let data = array.to_data();
FWIW to_data is not free (it allocates a vec, etc)
But I see the need to traverse a nested array and gc the whole thing. Short of adding specific code for each array type I think using ArrayData is the best we can do
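The recursive traversal being discussed can be modeled without arrow at all. The toy types below are purely illustrative (a `ViewArray` leaf tracked only by byte counts, a `Struct` of children); the point is the shape of the recursion: GC each view-array leaf that references less than its full buffers, and report upward whether anything changed so parents rebuild only when needed.

```rust
// Toy model (std only, no arrow) of traversing a nested array and
// GC-ing the whole thing, as the comment above describes.
#[derive(Clone, Debug, PartialEq)]
enum Node {
    // live = bytes actually referenced by views; total = bytes in backing buffers
    ViewArray { live: usize, total: usize },
    Struct(Vec<Node>),
}

/// Returns (possibly compacted node, whether anything was compacted).
fn gc_children(node: &Node) -> (Node, bool) {
    match node {
        Node::ViewArray { live, total } if live < total => {
            // Compacting copies only the referenced bytes into fresh buffers.
            (Node::ViewArray { live: *live, total: *live }, true)
        }
        Node::ViewArray { .. } => (node.clone(), false),
        Node::Struct(children) => {
            let mut changed = false;
            let new_children = children
                .iter()
                .map(|c| {
                    let (n, ch) = gc_children(c);
                    changed |= ch;
                    n
                })
                .collect();
            (Node::Struct(new_children), changed)
        }
    }
}

fn main() {
    // A nested array whose view-array child pins a 10KB buffer for 100 live bytes.
    let nested = Node::Struct(vec![Node::ViewArray { live: 100, total: 10_000 }]);
    let (gced, changed) = gc_children(&nested);
    assert!(changed);
    assert_eq!(gced, Node::Struct(vec![Node::ViewArray { live: 100, total: 100 }]));
}
```

In the real code the generic walk goes through `ArrayData`, which, as noted above, costs an allocation per node but avoids writing type-specific code for every nested array type.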
/// Appends a `RecordBatch` to the spill file, initializing the writer if necessary.
///
/// Before writing, performs GC on StringView/BinaryView arrays to compact backing
FWIW I think the same general approach might be useful / needed for other "view" type arrays -- I am thinking sliced ListView for example, as well as sliced ListArray and sliced Utf8 🤔
Maybe as a follow on PR
Yes I think there's a larger discussion around the footgun of view arrays sharing data and how slicing in general can result in fragmentation of data in arrays and wasted memory. I don't know what the big picture story is but personally I feel it would be reasonable to have some heuristic like "if an array becomes more than 50% dead references gc it"
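The "more than 50% dead references" heuristic floated here can be sketched as a one-line predicate. This is purely illustrative of the idea in the comment, not anything that exists in DataFusion:

```rust
/// Illustrative sketch: compact a view array once the bytes its views
/// actually reference fall below half of the bytes its backing buffers hold.
fn mostly_dead(referenced_bytes: usize, buffer_bytes: usize) -> bool {
    // "more than 50% dead": live data is less than half the buffer.
    referenced_bytes * 2 < buffer_bytes
}

fn main() {
    // A slice referencing ~33MB of an ~820MB buffer is overwhelmingly dead.
    assert!(mostly_dead(33, 820));
    // A freshly built array referencing all of its buffer is fully live.
    assert!(!mostly_dead(100, 100));
}
```

Such a ratio-based rule would complement a fixed byte threshold: the threshold caps absolute waste, while the ratio catches arrays that are small but mostly garbage.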
I don't know that there is anything like that in DataFusion. Do you think it's needed for this PR, or is it just a general idea? Anecdotally we've had this PR cherry-picked for months and it does fix the issues we've seen in production.
I don't think it is needed for this PR. I was just thinking that if the size of spill files is important (which I think it is), we should also be testing it somehow (to avoid regressions, etc.)
I opened #21683 to track |
FYI @EeshanBembi