Align buffers when importing via from_ffi / ArrowArrayStreamReader #10028

@mbutrovich

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

I am consuming Arrow data from a JVM producer (arrow-java's Data.exportArrayStream) via arrow::ffi_stream::ArrowArrayStreamReader. When a batch contains a Decimal128 column whose underlying buffer happens to land on an offset that is 8-byte aligned but not 16-byte aligned, ArrowArrayStreamReader::next panics inside ScalarBuffer::<i128>::from(Buffer):

panicked at arrow-buffer/src/buffer/scalar.rs:194:43:
Memory pointer from external source (e.g, FFI) is not aligned with the specified scalar type.
Before importing buffer through FFI, please make sure the allocation is aligned.

The producer is spec-conformant. The Arrow C Data Interface only recommends 8-byte alignment, and arrow-java's VectorUnloader and NettyAllocationManager only guarantee 8-byte alignment. The mismatch is on the consumer side: since Rust 1.77 / LLVM 18, align_of::<i128>() == 16 on x86 (it has always been 16 on ARM), so ScalarBuffer::<i128> requires 16-byte alignment when constructing typed arrays from imported ArrayData.
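To make the mismatch concrete, here is a small std-only sketch of the alignment arithmetic. The helper names (`is_aligned`, `misaligned_for_16`) and the address `0x1008` are made up for illustration and are not arrow-rs APIs. The printed value of `align_of::<i128>()` depends on target and toolchain.

```rust
use std::mem::align_of;

/// Returns true when `addr` is a multiple of `align` (`align` must be a power of two).
/// Hypothetical helper, not part of arrow-rs.
fn is_aligned(addr: usize, align: usize) -> bool {
    debug_assert!(align.is_power_of_two());
    addr & (align - 1) == 0
}

/// Counts the 8-byte-aligned offsets in `0..len` that are NOT 16-byte aligned.
fn misaligned_for_16(len: usize) -> usize {
    (0..len)
        .step_by(8)
        .filter(|&off| !is_aligned(off, 16))
        .count()
}

fn main() {
    // Target- and toolchain-dependent; the issue text reports 16 on x86 since Rust 1.77.
    println!("align_of::<i128>() = {}", align_of::<i128>());

    // A made-up buffer address that is 8 mod 16: acceptable to an 8-byte-aligned
    // producer, rejected by a consumer that needs 16-byte alignment.
    let addr = 0x1008usize;
    println!("8-byte aligned:  {}", is_aligned(addr, 8));
    println!("16-byte aligned: {}", is_aligned(addr, 16));

    // Of the eight 8-byte-aligned offsets in 0..64, four are 8 mod 16.
    println!("misaligned offsets in 0..64: {}", misaligned_for_16(64));
}
```

If an allocator hands out 8-byte-aligned offsets with no bias toward either residue (an assumption, not something measured here), half of them are 8 mod 16.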

This is the same root cause as #5553 and PR #5554, which fixed it for the IPC reader by adding IpcReadOptions::require_alignment (triggering a realigning copy on import). The equivalent is missing from the C Data Interface readers.

Describe the solution you'd like

Call ArrayData::align_buffers() unconditionally inside arrow::ffi::from_ffi and arrow::ffi::from_ffi_and_data_type, after consume(). ArrowArrayStreamReader then inherits the fix automatically.

from_ffi is the right layer because:

  • It's the FFI consume entry point; downstream typed-array construction is what panics, so the import path owns the repair.
  • arrow-pyarrow already does this manually (arrow-pyarrow/src/lib.rs:368) — that workaround becomes unnecessary.
  • Direct from_ffi callers hit the same panic today; fixing only the stream reader leaves them broken.
  • align_buffers() is a no-op when buffers are already aligned, so well-behaved producers pay nothing.

This matches the IPC reader's default behavior (auto-realign; require_alignment is opt-in for zero-copy users) established in #5554.
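The "no-op when aligned, copy otherwise" behavior can be modeled without arrow-rs. The following is a std-only sketch of that idea, not the actual `align_buffers()` implementation. `as_i128_values` and `place_at` are hypothetical helpers, and the 16-byte check is hard-coded to match the Decimal128 case.

```rust
use std::borrow::Cow;
use std::mem::align_of;

/// Views `bytes` as i128 values: borrows when the data is 16-byte aligned,
/// otherwise copies into a freshly allocated (and therefore aligned) Vec.
fn as_i128_values(bytes: &[u8]) -> Cow<'_, [i128]> {
    assert_eq!(bytes.len() % 16, 0, "length must be a whole number of i128 values");
    assert!(align_of::<i128>() <= 16);
    if bytes.as_ptr() as usize % 16 == 0 {
        // SAFETY: the pointer is 16-byte aligned (enough for i128, checked above),
        // the length is a multiple of 16, and every bit pattern is a valid i128.
        let values = unsafe {
            std::slice::from_raw_parts(bytes.as_ptr() as *const i128, bytes.len() / 16)
        };
        Cow::Borrowed(values)
    } else {
        // Realigning copy: decode each 16-byte chunk into an owned value.
        let owned = bytes
            .chunks_exact(16)
            .map(|chunk| {
                let mut raw = [0u8; 16];
                raw.copy_from_slice(chunk);
                i128::from_ne_bytes(raw)
            })
            .collect();
        Cow::Owned(owned)
    }
}

/// Copies `values` into a byte buffer so the data starts at an address equal to
/// `rem` mod 16 (`rem` < 16). Returns the buffer and the offset where the data begins.
fn place_at(values: &[i128], rem: usize) -> (Vec<u8>, usize) {
    let mut backing = vec![0u8; values.len() * 16 + 16];
    let base = backing.as_ptr() as usize;
    let off = (rem + 16 - base % 16) % 16;
    for (i, v) in values.iter().enumerate() {
        let start = off + i * 16;
        backing[start..start + 16].copy_from_slice(&v.to_ne_bytes());
    }
    (backing, off)
}

fn main() {
    let values = [1i128, -2, i128::MAX];

    // Data at 8 mod 16, as an 8-byte-aligned producer might hand over.
    let (buf, off) = place_at(&values, 8);
    let view = as_i128_values(&buf[off..off + values.len() * 16]);
    assert!(matches!(&view, Cow::Owned(_))); // copied
    assert_eq!(&view[..], &values[..]);

    // Data at 0 mod 16: no copy is made.
    let (buf, off) = place_at(&values, 0);
    let view = as_i128_values(&buf[off..off + values.len() * 16]);
    assert!(matches!(&view, Cow::Borrowed(_))); // zero-copy
    assert_eq!(&view[..], &values[..]);
}
```

The misaligned case pays a single copy of the affected buffer, and the aligned case stays zero-copy. That is the trade-off the proposal relies on.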

Spec basis

The 16-byte requirement is not in any Arrow spec; it is a consequence of Rust 1.77+ setting align_of::<i128>() == 16. The Columnar format only recommends 8- or 64-byte alignment for primitives; the C Data Interface goes further: "It is recommended, but not required, that the memory addresses of the buffers be aligned… Consumers MAY decide not to support unaligned memory." Since a consumer may reject unaligned memory outright, one that realigns on import instead is well within the spec.

Describe alternatives you've considered

  1. Forcing the JVM producer to allocate decimal buffers with 16-byte alignment. Not portable: there is no alignment hook on arrow-java's BufferAllocator / NettyAllocationManager, and the spec only recommends 8-byte alignment from the producer, so a conformant producer has no obligation to do more.
  2. Wrapping ArrowArrayStreamReader in user code by replicating its internals (driving FFI_ArrowArrayStream::get_next directly, calling from_ffi, then align_buffers(), then building the typed batch). Workable but duplicates arrow-rs internals; every JVM-Arrow consumer hits this and ends up writing the same wrapper.
  3. Realigning post-import. Not possible from outside the reader because the panic happens inside ArrowArrayStreamReader::next before the caller sees a RecordBatch.

Additional context

Related:

Reproducer shape: any JVM producer that exports a RecordBatch containing a Decimal128 column (or List<Decimal128> / Struct<..., Decimal128>) where the data buffer offset within its slab is 8 mod 16. Triggers ~50% of the time with arrow-java's default NettyAllocationManager.
