Skip to content

feat(ffi): opt-in buffer alignment in ArrowArrayStreamReader#9951

Closed
dataders wants to merge 1 commit into
apache:mainfrom
dataders:align-buffers-in-ffi-stream-reader
Closed

feat(ffi): opt-in buffer alignment in ArrowArrayStreamReader#9951
dataders wants to merge 1 commit into
apache:mainfrom
dataders:align-buffers-in-ffi-stream-reader

Conversation

@dataders
Copy link
Copy Markdown

@dataders dataders commented May 8, 2026

Which issue does this PR close?

(See also the prior attempt #7137, which proposed the always-on form of this fix and was closed after maintainer feedback asking for an opt-in API. This PR implements exactly the builder shape @alamb sketched in that thread.)

Rationale for this change

Some Arrow C Data Interface producers — notably the Go ADBC drivers in wide use today (e.g. the Snowflake driver) — emit data buffers whose pointers are not aligned to the underlying primitive type's alignment. When such buffers reach arrow-rs, downstream typed access in ScalarBuffer::from panics with "Memory pointer from external source (e.g, FFI) is not aligned with the specified scalar type", taking the consumer process down.

We hit this in production reading real Snowflake VARCHAR columns through the Go ADBC Snowflake driver into a Rust consumer, and we are not the only ones — see e.g. apache/arrow-adbc#2526 and the linked downstream reports.

The previous attempt #7137 made ArrowArrayStreamReader unconditionally call align_buffers() on every imported batch. The discussion there (with @tustvold and @alamb) settled on:

  • silent reallocation hides producer bugs and changes behavior for consumers who would prefer a hard error;
  • but for consumers who know they're talking to a producer that emits unaligned data (e.g. a specific ADBC driver), they need a way to ask arrow-rs to fix it up rather than panicking deep in their kernels.

@alamb's sketch:

let stream_reader = ArrowArrayStreamReader::try_new(ffi_stream)
    .with_align_buffers(true);

This PR implements that.

What changes are included in this PR?

  • Add a private align_buffers: bool field to ArrowArrayStreamReader, defaulting to false so the existing behavior is preserved bit-for-bit.
  • Add a public builder method ArrowArrayStreamReader::with_align_buffers(self, bool) -> Self.
  • In Iterator::next, when the flag is set, call ArrayData::align_buffers() on the freshly imported ArrayData before constructing the RecordBatch. align_buffers only copies buffers that are actually misaligned, so adequately aligned producers pay no cost beyond the per-buffer alignment check.

The existing PR #7137's diff was 7 lines and changed default behavior. This PR is opt-in only.

Are these changes tested?

Yes. Two new tests in arrow-array/src/ffi_stream.rs:

  • test_unaligned_buffers_default_panics_on_typed_access — fabricates an FFI_ArrowArrayStream whose first batch has a deliberately misaligned Int32 data buffer (a one-byte-offset slice into an aligned backing buffer). Asserts via #[should_panic] that the default reader still panics on typed access, preserving the existing invariant that unaligned producers surface as detectable failures.
  • test_unaligned_buffers_with_align_buffers_opt_in — same input, reader configured with with_align_buffers(true). Asserts the imported buffer's pointer is aligned to align_of::<i32>(), that typed access (.values()) does not panic, and that the stream is exhausted after the single batch.

Both tests bypass the standard FFI_ArrowArrayStream::new(reader) helper because that helper materializes the source RecordBatch into typed arrays, which itself panics on misaligned input. Instead they build a minimal custom FFI_ArrowArrayStream whose get_next callback emits an FFI_ArrowArray constructed from an untyped ArrayData — closer to what a real foreign producer would do.

```
$ cargo test -p arrow-array --features ffi
test result: ok. 948 passed; 3 ignored; 0 failed

$ cargo clippy -p arrow-array --all-features --tests
(clean)

$ cargo fmt -p arrow-array --check
(clean)
```

Are there any user-facing changes?

Yes, but additive only: one new public method on ArrowArrayStreamReader. No existing API or default behavior changes. No breaking changes.

Notes for reviewers

  • Marking as draft because the requesting team wants to eyeball it before un-drafting; happy to take any API-shape feedback (e.g. naming the method differently, defaulting differently, adding the same opt-in to other FFI import paths) before that.
  • The signature with_align_buffers(mut self, bool) -> Self mirrors the existing ArrayDataBuilder::align_buffers(mut self, bool) -> Self (data.rs:2222) for consistency. If you'd prefer a parameterless with_align_buffers(self) -> Self, easy change.

# Which issue does this PR close?

- Contributes to apache#7136

# Rationale for this change

Some Arrow C Data Interface producers — notably the Go ADBC drivers
that are in wide use today (e.g. the Snowflake driver) — emit data
buffers whose pointers are not aligned to the underlying primitive
type's alignment. When such buffers reach `arrow-rs`, downstream typed
access in `ScalarBuffer::from` panics with `"Memory pointer from
external source (e.g, FFI) is not aligned with the specified scalar
type"`, taking the consumer process down.

A previous attempt at a fix (apache#7137) made `ArrowArrayStreamReader`
unconditionally call `align_buffers()` on every imported batch.
Reviewers (@tustvold, @alamb) pushed back: silently reallocating
buffers behind users' backs hides bugs in producers and changes
behavior for consumers who would prefer a hard error. @alamb suggested
the API ought to be opt-in via a builder, e.g.

```rust
let stream_reader = ArrowArrayStreamReader::try_new(ffi_stream)
    .with_align_buffers(true);
```

This PR implements exactly that.

# What changes are included in this PR?

- Add a private `align_buffers: bool` field to `ArrowArrayStreamReader`,
  defaulting to `false` so existing behavior is preserved bit-for-bit.
- Add a public builder method
  `ArrowArrayStreamReader::with_align_buffers(self, bool) -> Self`.
- In `Iterator::next`, when the flag is set, call
  `ArrayData::align_buffers()` on the freshly imported `ArrayData`
  before constructing the `RecordBatch`. `align_buffers` only copies
  buffers that are actually misaligned, so adequately aligned producers
  pay no cost beyond a per-buffer alignment check.

# Are these changes tested?

Yes. Two new tests in `arrow-array/src/ffi_stream.rs`:

- `test_unaligned_buffers_default_panics_on_typed_access` — fabricates
  an `FFI_ArrowArrayStream` whose first batch has a deliberately
  misaligned `Int32` data buffer (one-byte offset slice). Asserts (via
  `#[should_panic]`) that the default reader still panics on typed
  access, preserving the existing invariant that unaligned producers
  surface as detectable failures.
- `test_unaligned_buffers_with_align_buffers_opt_in` — same input,
  but the reader is configured with `with_align_buffers(true)`. Asserts
  that the imported buffer is aligned, that typed access does not
  panic, and that the stream is exhausted after the single batch.

Both tests bypass the standard `FFI_ArrowArrayStream::new(reader)`
helper because that helper materializes the source `RecordBatch` into
typed arrays, which itself panics on misaligned input. Instead they
build a minimal custom `FFI_ArrowArrayStream` whose `get_next`
callback emits an `FFI_ArrowArray` constructed from an untyped
`ArrayData` — closer to what a real foreign producer would do.

`cargo test -p arrow-array --features ffi`: 948 passed, 3 ignored, 0
failed. `cargo clippy -p arrow-array --all-features --tests` clean.

# Are there any user-facing changes?

Yes, but additive only: a new public method
`ArrowArrayStreamReader::with_align_buffers`. No existing API or
behavior changes. No breaking changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added the arrow Changes to the arrow crate label May 8, 2026
@dataders dataders marked this pull request as ready for review May 8, 2026 21:13
@dataders
Copy link
Copy Markdown
Author

Closing this in favor of fixing the producer side. The misaligned buffers are produced by the ADBC Snowflake C Data export path; the targeted fix is in dbt-labs/arrow-adbc#129 and will be shipped upstream there.

If arrow-rs still wants a generic opt-in compatibility layer for other FFI producers, that can be re-opened or reframed separately, but this PR is no longer the root-cause fix for the Snowflake/ADBC issue.

@dataders dataders closed this May 11, 2026
Self::try_new(unsafe { FFI_ArrowArrayStream::from_raw(raw_stream) })
}

/// Configure whether buffers from imported arrays should be aligned, copying
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FIWIW this seems reasonable to me

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

arrow Changes to the arrow crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants