Skip to content

Fix massive spill files for StringView/BinaryView columns II#21633

Merged
adriangb merged 9 commits intoapache:mainfrom
pydantic:fix-stringview-spill-gc-2
Apr 16, 2026
Merged

Fix massive spill files for StringView/BinaryView columns II#21633
adriangb merged 9 commits intoapache:mainfrom
pydantic:fix-stringview-spill-gc-2

Conversation

@adriangb
Copy link
Copy Markdown
Contributor

@adriangb adriangb commented Apr 14, 2026

@github-actions github-actions bot added the physical-plan Changes to the physical-plan crate label Apr 14, 2026
@adriangb adriangb changed the title fix: gc StringView/BinaryView arrays before spilling to prevent write amplification Fix massive spill files for StringView/BinaryView columns rev2 Apr 14, 2026
@adriangb adriangb changed the title Fix massive spill files for StringView/BinaryView columns rev2 Fix massive spill files for StringView/BinaryView columns II Apr 14, 2026
@adriangb adriangb requested a review from alamb April 14, 2026 22:40

[dependencies]
arrow = { workspace = true }
arrow-data = { workspace = true }
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This dependency seems unused.
The only occurrence of arrow_data is at https://github.com/apache/datafusion/pull/21633/changes#diff-1f7d15c867929af294664ebbde4e8c9038186222cbb95ed86e527406cf066e84R463 for a test helper.

// on top of the sliced size for views buffer. This matches the intended semantics of
// "bytes needed if we materialized exactly this slice into fresh buffers".
// This is a workaround until https://github.com/apache/arrow-rs/issues/8230
if let Some(sv) = array.as_any().downcast_ref::<StringViewArray>() {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same is needed for BinaryViewArray, no ?

Comment thread datafusion/physical-plan/src/spill/mod.rs Outdated
Comment thread datafusion/physical-plan/src/spill/mod.rs
Comment thread datafusion/physical-plan/src/spill/mod.rs Outdated
EeshanBembi and others added 9 commits April 16, 2026 08:54
Add garbage collection for StringView and BinaryView arrays before spilling
to disk. This prevents sliced arrays from carrying their entire original
buffers when written to spill files.

Changes:
- Add gc_view_arrays() function to apply GC on view arrays
- Integrate GC into InProgressSpillFile::append_batch()
- Use simple threshold-based heuristic (100+ rows, 10KB+ buffer size)

Fixes apache#19414 where GROUP BY on StringView columns created 820MB spill files
instead of 33MB due to sliced arrays maintaining references to original buffers.

Testing shows 80-98% reduction in spill file sizes for typical GROUP BY workloads.
- Replace row count heuristic with 10KB memory threshold
- Improve documentation and add inline comments
- Remove redundant test_exact_clickbench_issue_19414
- Maintains 96% reduction in spill file sizes
The SpillManager now handles GC for StringView/BinaryView arrays internally
via gc_view_arrays(), making the organize_stringview_arrays() function in
external sort redundant.

Changes:
- Remove organize_stringview_arrays() call and function from sort.rs
- Use batch.clone() for early return (cheaper than creating new batch)
- Use arrow_data::MAX_INLINE_VIEW_LEN constant instead of custom constant
- Update comment in spill_manager.rs to reference gc_view_arrays()
Address review comments from PR apache#19444:
- Replace row count heuristic with 10KB memory threshold
- Add comprehensive documentation explaining GC rationale and mechanism
- Use direct array parameter for better type safety
- Maintain early return optimization for non-view arrays

The GC now triggers based on actual buffer memory usage rather than
row counts, providing more accurate and efficient garbage collection
for sliced StringView/BinaryView arrays during spilling.

Tests confirm 80%+ reduction in spill file sizes for pathological cases
like ClickBench (820MB -> 33MB).
- Return post-GC sliced size from append_batch so callers use the
  correct post-GC size for memory accounting (fixes cetra3's
  CHANGES_REQUESTED: max_record_batch_size was measured pre-GC in
  sort.rs and spill_manager.rs)
- Fix incorrect comment claiming Arrow gc() is a no-op; it always
  allocates new compact buffers
- Add comment in should_gc_view_array explaining why we sum
  data_buffers directly instead of using get_buffer_memory_size()
- Enhance append_batch doc comment with GC rationale per reviewer request
- Reduce row counts in heavy GC tests
Address PR review: avoid duplicating data-buffer size calculation by
deriving it from get_buffer_memory_size minus the views buffer.
@adriangb adriangb force-pushed the fix-stringview-spill-gc-2 branch from bd7fa4c to 1530ba7 Compare April 16, 2026 13:54
@adriangb
Copy link
Copy Markdown
Contributor Author

cc @alamb in case you want to re-review before merging.

otherwise I plan to merge this in a day or so

Copy link
Copy Markdown
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this makes sense to me @adriangb -- thank you (and thaink you @martin-g for the review)

I would personally recommend also doing some sort "end to end" test -- specifically setup a sort of StringView data that was mostly sliced and ensure the spill files ar enot huge

It was not clear to me if we have tested the spill file size


/// Size of a single view structure in StringView/BinaryView arrays (in bytes).
/// Each view is 16 bytes: 4 bytes length + 4 bytes prefix + 8 bytes buffer ID/offset.
const VIEW_SIZE_BYTES: usize = 16;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Related constant here: https://docs.rs/arrow-data/58.1.0/arrow_data/constant.MAX_INLINE_VIEW_LEN.html

I think arrow uses std::mem::size_of<u128> for this value as each view is a u128

}

fn gc_array_children(array: &ArrayRef) -> Result<(ArrayRef, bool)> {
let data = array.to_data();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW to_data is not free (it allocates a vec, etc)

But I see the need to traverse a nested array and gc the whole thing. Short of adding specific code for each array type I think using ArrayData is the best we can do


/// Appends a `RecordBatch` to the spill file, initializing the writer if necessary.
///
/// Before writing, performs GC on StringView/BinaryView arrays to compact backing
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW I think the same general approach might be useful / needed for other "view" type arrays -- I am thinking sliced LIstView for example as well as sliced ListArray and sliced Utf8 🤔

Maybe as a follow on PR

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I think there's a larger discussion around the footgun of view arrays sharing data and how slicing in general can result in fragmentation of data in arrays and wasted memory. I don't know what the big picture story is but personally I feel it would be reasonable to have some heuristic like "if an array becomes more than 50% dead references gc it"

@adriangb
Copy link
Copy Markdown
Contributor Author

I would personally recommend also doing some sort "end to end" test -- specifically setup a sort of StringView data that was mostly sliced and ensure the spill files ar enot huge

I don't know that there is anything like that in DataFusion. Do you think it's needed for this PR or just a general idea?

Anecdotally we've had this PR cherry picked for months and it does fix the issues we've seen in production.

@alamb
Copy link
Copy Markdown
Contributor

alamb commented Apr 16, 2026

I don't know that there is anything like that in DataFusion. Do you think it's needed for this PR or just a general idea?

I don't think it is needed for this PR

I was just thinking that if size of spill files is important (which I think it is) we should also be testing that somehow (to avoid regressions, etc)

@adriangb
Copy link
Copy Markdown
Contributor Author

I opened #21683 to track

@adriangb adriangb added this pull request to the merge queue Apr 16, 2026
Merged via the queue into apache:main with commit 4b8c1d9 Apr 16, 2026
40 checks passed
@adriangb adriangb deleted the fix-stringview-spill-gc-2 branch April 16, 2026 19:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

physical-plan Changes to the physical-plan crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] BinaryView/StringView columns are spilled without GC and results in enormous spill files

4 participants