feat(ffi): opt-in buffer alignment in ArrowArrayStreamReader #9951
Closed
dataders wants to merge 1 commit into
Conversation
Author
Closing this in favor of fixing the producer side. The misaligned buffers are produced by the ADBC Snowflake C Data export path; the targeted fix is in dbt-labs/arrow-adbc#129 and will be shipped upstream there. If arrow-rs still wants a generic opt-in compatibility layer for other FFI producers, that can be re-opened or reframed separately, but this PR is no longer the root-cause fix for the Snowflake/ADBC issue.
alamb reviewed May 11, 2026
```rust
        Self::try_new(unsafe { FFI_ArrowArrayStream::from_raw(raw_stream) })
    }

    /// Configure whether buffers from imported arrays should be aligned, copying
```
Contributor
FWIW this seems reasonable to me
# Which issue does this PR close?

- Contributes to apache#7136

(See also the prior attempt #7137, which proposed the always-on form of this fix and was closed after maintainer feedback asking for an opt-in API. This PR implements the builder shape @alamb sketched in that thread.)
# Rationale for this change

Some Arrow C Data Interface producers, notably the Go ADBC drivers in wide use today (e.g. the Snowflake driver), emit data buffers whose pointers are not aligned to the underlying primitive type's alignment. When such buffers reach `arrow-rs`, downstream typed access in `ScalarBuffer::from` panics with `"Memory pointer from external source (e.g, FFI) is not aligned with the specified scalar type"`, taking the consumer process down.

We hit this in production reading real Snowflake `VARCHAR` columns through the Go ADBC Snowflake driver into a Rust consumer, and we are not the only ones; see e.g. apache/arrow-adbc#2526 and the linked downstream reports.

The previous attempt #7137 made `ArrowArrayStreamReader` unconditionally call `align_buffers()` on every imported batch. The discussion there (with @tustvold and @alamb) settled on making the fix opt-in: silently reallocating buffers behind users' backs hides bugs in producers and changes behavior for consumers who would prefer a hard error, but consumers who knowingly receive misaligned data should be able to ask `arrow-rs` to fix it up rather than panicking deep in their kernels. @alamb's sketch:

```rust
let stream_reader = ArrowArrayStreamReader::try_new(ffi_stream)
    .with_align_buffers(true);
```

This PR implements that.
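The failure mode above reduces to a pointer whose address is not a multiple of the element type's alignment. A minimal std-only sketch of that check (no arrow-rs types; `is_aligned_for` is an illustrative helper, not an arrow-rs API):

```rust
use std::mem::align_of;

/// Returns true if `ptr` is suitably aligned for reads of type `T`.
fn is_aligned_for<T>(ptr: *const u8) -> bool {
    (ptr as usize) % align_of::<T>() == 0
}

fn main() {
    // Backing allocation made of i32s, so its start is i32-aligned.
    let backing: Vec<i32> = vec![0; 4];
    let bytes = backing.as_ptr() as *const u8;
    assert!(is_aligned_for::<i32>(bytes));

    // A one-byte offset into the same allocation cannot be i32-aligned,
    // which is the shape of buffer the misbehaving FFI producers emit.
    let misaligned = unsafe { bytes.add(1) };
    assert!(!is_aligned_for::<i32>(misaligned));
}
```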
# What changes are included in this PR?

- Add a private `align_buffers: bool` field to `ArrowArrayStreamReader`, defaulting to `false` so the existing behavior is preserved bit-for-bit.
- Add a public builder method `ArrowArrayStreamReader::with_align_buffers(self, bool) -> Self`.
- In `Iterator::next`, when the flag is set, call `ArrayData::align_buffers()` on the freshly imported `ArrayData` before constructing the `RecordBatch`. `align_buffers` only copies buffers that are actually misaligned, so adequately aligned producers pay no cost beyond the per-buffer alignment check.

The prior attempt #7137's diff was 7 lines and changed default behavior; this PR is opt-in only.
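The opt-in flag follows the consuming-builder pattern common in arrow-rs. A hypothetical, dependency-free mirror of the shape (`Reader` is a stand-in; the real `ArrowArrayStreamReader` wraps an `FFI_ArrowArrayStream` and is constructed from one):

```rust
/// Stand-in for ArrowArrayStreamReader: a reader with an opt-in
/// buffer-alignment flag that defaults to off.
struct Reader {
    align_buffers: bool,
}

impl Reader {
    fn try_new() -> Result<Self, String> {
        // Default is false, so existing behavior is preserved.
        Ok(Reader { align_buffers: false })
    }

    /// Builder-style setter: consumes self and returns it, so calls chain.
    fn with_align_buffers(mut self, align: bool) -> Self {
        self.align_buffers = align;
        self
    }
}

fn main() {
    let default_reader = Reader::try_new().unwrap();
    assert!(!default_reader.align_buffers);

    let opted_in = Reader::try_new().unwrap().with_align_buffers(true);
    assert!(opted_in.align_buffers);
}
```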
# Are these changes tested?

Yes. Two new tests in `arrow-array/src/ffi_stream.rs`:

- `test_unaligned_buffers_default_panics_on_typed_access`: fabricates an `FFI_ArrowArrayStream` whose first batch has a deliberately misaligned `Int32` data buffer (a one-byte-offset slice into an aligned backing buffer). Asserts via `#[should_panic]` that the default reader still panics on typed access, preserving the existing invariant that unaligned producers surface as detectable failures.
- `test_unaligned_buffers_with_align_buffers_opt_in`: same input, but the reader is configured with `with_align_buffers(true)`. Asserts that the imported buffer's pointer is aligned to `align_of::<i32>()`, that typed access (`.values()`) does not panic, and that the stream is exhausted after the single batch.

Both tests bypass the standard `FFI_ArrowArrayStream::new(reader)` helper because that helper materializes the source `RecordBatch` into typed arrays, which itself panics on misaligned input. Instead they build a minimal custom `FFI_ArrowArrayStream` whose `get_next` callback emits an `FFI_ArrowArray` constructed from an untyped `ArrayData`, closer to what a real foreign producer would do.

```
$ cargo test -p arrow-array --features ffi
test result: ok. 948 passed; 3 ignored; 0 failed
$ cargo clippy -p arrow-array --all-features --tests
(clean)
$ cargo fmt -p arrow-array --check
(clean)
```
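Fabricating a misaligned buffer for a test comes down to taking a one-byte-offset view into an aligned allocation. A std-only sketch of the trick (the real tests route this through an untyped `ArrayData`, which is not reproduced here):

```rust
use std::mem::align_of;

fn main() {
    // Back the bytes with i32s so the allocation start is i32-aligned.
    let backing: Vec<i32> = vec![0; 4];
    let bytes: &[u8] =
        unsafe { std::slice::from_raw_parts(backing.as_ptr() as *const u8, 16) };
    assert_eq!(bytes.as_ptr() as usize % align_of::<i32>(), 0);

    // A slice that skips one leading byte has a pointer off by one from
    // i32 alignment: exactly the shape the misalignment tests feed the reader.
    let misaligned = &bytes[1..13];
    assert_eq!(misaligned.as_ptr() as usize % align_of::<i32>(), 1);
}
```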
# Are there any user-facing changes?

Yes, but additive only: one new public method on `ArrowArrayStreamReader`. No existing API or default behavior changes. No breaking changes.

# Notes for reviewers

`with_align_buffers(mut self, bool) -> Self` mirrors the existing `ArrayDataBuilder::align_buffers(mut self, bool) -> Self` (data.rs:2222) for consistency. If you'd prefer a parameterless `with_align_buffers(self) -> Self`, that's an easy change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>