Add Arrow C Data Interface support with defensive resource management #594

Open
robertbuessow wants to merge 11 commits into apache:main from RelationalAI:rb-check-released-in-getindex

@robertbuessow

Summary

This PR adds support for the Arrow C Data Interface, enabling zero-copy data exchange between Julia and C/C++/Python runtimes in the same process.

Key additions and fixes:

  • C Data Interface implementation (src/cdatainterface.jl): Arrow.from_c_data / Arrow.to_c_data for importing and exporting Arrow arrays and tables across the C ABI boundary. Handles all Arrow types: primitive, boolean, list, fixed-size list, map, struct, union, dict-encoded, and nested types.
  • Defensive resource management: CDataHandle tracks C-side memory lifetime. The GC finalizer uses jl_safe_printf instead of @error (task switches are forbidden in finalizers) and an atomic counter instead of a non-thread-safe Ref{Int}. The finalizer also calls the C release callback as a safety net, so C resources are freed even when release_c_data is not called explicitly.
  • ABI compatibility test: compiles a C probe at test time using offsetof() and compares every field offset of ArrowSchema and ArrowArray against Julia's fieldoffset(), catching any struct layout divergence between Julia and the C compiler.
  • Explicit release in tests: all tests that import C data now call release_c_data explicitly, and a final @testset "no unexpected resource leaks" asserts that UNRELEASED_HANDLE_COUNT increases by exactly one over the run, accounted for by the intentional-leak test.
  • Bug fixes: double-free on import, type narrowing errors, BoundsError when dict-encoding CategoricalArrays with missing values.
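The safety-net behaviour described above relies on the release contract of the Arrow C Data Interface: a producer's release callback marks the struct as released by setting its `release` member to NULL, so a consumer can check that member before calling it. The following is a minimal C sketch of that contract, not the code in this PR; `demo_release`, `release_once`, `make_demo_array`, and `release_calls` are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Struct layout as given in the Arrow C Data Interface specification. */
struct ArrowArray {
    int64_t length;
    int64_t null_count;
    int64_t offset;
    int64_t n_buffers;
    int64_t n_children;
    const void **buffers;
    struct ArrowArray **children;
    struct ArrowArray *dictionary;
    void (*release)(struct ArrowArray *);
    void *private_data;
};

/* Hypothetical counter, only here to observe how often the callback ran. */
static int release_calls = 0;

/* Producer side: free what the producer owns, then mark the struct released. */
static void demo_release(struct ArrowArray *array) {
    release_calls++;
    free(array->private_data);
    array->private_data = NULL;
    array->release = NULL; /* a NULL release member means "already released" */
}

/* Consumer side: release at most once.
 * Returns 1 if the callback ran, 0 if there was nothing left to release. */
static int release_once(struct ArrowArray *array) {
    if (array == NULL || array->release == NULL) {
        return 0;
    }
    array->release(array);
    return 1;
}

/* Build an empty array whose only owned resource is a small allocation. */
static struct ArrowArray make_demo_array(void) {
    struct ArrowArray a = {0};
    a.private_data = malloc(16);
    a.release = demo_release;
    return a;
}
```

With this guard, an explicit release followed by a later safety-net release (for example from a finalizer) runs the producer callback only once.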

Test plan

  • julia --project -e 'using Pkg; Pkg.test()' passes with no leak warnings
  • @testset "struct field offsets match C ABI" — 19 assertions all green
  • @testset "no unexpected resource leaks" — counter == initial + 1
  • The warning "Arrow.CDataHandle GC'd without explicit release_c_data" is not printed during the test run (except inside the redirect_stderr block of the intentional-leak test)
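For reference, the C side of an offset probe like the one in the test plan could look roughly like the sketch below. It is an illustration under assumptions, not the probe shipped in this PR: the struct definitions follow the Arrow C Data Interface specification, while `schema_offsets`, `array_offsets`, and `print_offsets` are hypothetical helper names. The two structs have 9 and 10 fields, which is consistent with the 19 assertions listed above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

struct ArrowSchema {
    const char *format;
    const char *name;
    const char *metadata;
    int64_t flags;
    int64_t n_children;
    struct ArrowSchema **children;
    struct ArrowSchema *dictionary;
    void (*release)(struct ArrowSchema *);
    void *private_data;
};

struct ArrowArray {
    int64_t length;
    int64_t null_count;
    int64_t offset;
    int64_t n_buffers;
    int64_t n_children;
    const void **buffers;
    struct ArrowArray **children;
    struct ArrowArray *dictionary;
    void (*release)(struct ArrowArray *);
    void *private_data;
};

#define N_SCHEMA_FIELDS 9
#define N_ARRAY_FIELDS 10

/* Fill `out` with the byte offset of every ArrowSchema field, in order. */
static void schema_offsets(size_t out[N_SCHEMA_FIELDS]) {
    out[0] = offsetof(struct ArrowSchema, format);
    out[1] = offsetof(struct ArrowSchema, name);
    out[2] = offsetof(struct ArrowSchema, metadata);
    out[3] = offsetof(struct ArrowSchema, flags);
    out[4] = offsetof(struct ArrowSchema, n_children);
    out[5] = offsetof(struct ArrowSchema, children);
    out[6] = offsetof(struct ArrowSchema, dictionary);
    out[7] = offsetof(struct ArrowSchema, release);
    out[8] = offsetof(struct ArrowSchema, private_data);
}

/* Fill `out` with the byte offset of every ArrowArray field, in order. */
static void array_offsets(size_t out[N_ARRAY_FIELDS]) {
    out[0] = offsetof(struct ArrowArray, length);
    out[1] = offsetof(struct ArrowArray, null_count);
    out[2] = offsetof(struct ArrowArray, offset);
    out[3] = offsetof(struct ArrowArray, n_buffers);
    out[4] = offsetof(struct ArrowArray, n_children);
    out[5] = offsetof(struct ArrowArray, buffers);
    out[6] = offsetof(struct ArrowArray, children);
    out[7] = offsetof(struct ArrowArray, dictionary);
    out[8] = offsetof(struct ArrowArray, release);
    out[9] = offsetof(struct ArrowArray, private_data);
}

/* Print one offset per line so a test harness can parse and compare them. */
static void print_offsets(void) {
    size_t s[N_SCHEMA_FIELDS], a[N_ARRAY_FIELDS];
    schema_offsets(s);
    array_offsets(a);
    for (int i = 0; i < N_SCHEMA_FIELDS; i++) printf("ArrowSchema %d %zu\n", i, s[i]);
    for (int i = 0; i < N_ARRAY_FIELDS; i++) printf("ArrowArray %d %zu\n", i, a[i]);
}
```

A harness on the Julia side would then compare each printed value with `fieldoffset` for the corresponding field index of its own struct definitions.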

🤖 Generated with Claude Code

robertbuessow and others added 11 commits April 8, 2026 09:47
Implements both directions of the Arrow C Data Interface spec
(https://arrow.apache.org/docs/format/CDataInterface.html):

- `Arrow.from_c_data(schema_ptr, array_ptr)` — import an Arrow array
  from C-owned memory; zero-copy via `unsafe_wrap`; `CDataHandle`
  finalizer calls the C `release` callbacks automatically.
- `Arrow.to_c_data(col)` — export an `ArrowVector` or `Arrow.Table`
  to C; GC roots kept alive via a token-keyed global dict; `@cfunction`
  release callbacks (initialised in `__init__`) delete roots on consumer
  release.

New public names: the types `ArrowSchema`, `ArrowArray`, `CImportedArray`,
and `CImportedTable`, and the function `release_c_data`.

Supports all Arrow column types: primitives, Bool, String/binary,
List (generic, large, fixed-size), Struct, Map, DenseUnion,
SparseUnion, DictEncoded, Null, and all time/date/duration types.
Handles nullable columns, non-zero array offsets, and custom metadata.

68 new tests in `test/cdatainterface.jl` covering format strings,
buffer layout, validity bitmaps, round-trips, release semantics,
non-zero offsets, and multi-column table import.
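
To make the import direction concrete, here is a hedged C sketch of what a producer on the other side of the ABI might do before handing its two pointers to `Arrow.from_c_data`. It is not part of this PR; `export_int32`, `release_schema`, and `release_array` are hypothetical names, and the struct layouts and the `"i"` (int32) format string come from the C Data Interface specification.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct ArrowSchema {
    const char *format;
    const char *name;
    const char *metadata;
    int64_t flags;
    int64_t n_children;
    struct ArrowSchema **children;
    struct ArrowSchema *dictionary;
    void (*release)(struct ArrowSchema *);
    void *private_data;
};

struct ArrowArray {
    int64_t length;
    int64_t null_count;
    int64_t offset;
    int64_t n_buffers;
    int64_t n_children;
    const void **buffers;
    struct ArrowArray **children;
    struct ArrowArray *dictionary;
    void (*release)(struct ArrowArray *);
    void *private_data;
};

/* format and name are string literals here, so there is nothing to free. */
static void release_schema(struct ArrowSchema *schema) {
    schema->release = NULL;
}

/* Free the data buffer and the buffer table, then mark the array released. */
static void release_array(struct ArrowArray *array) {
    free((void *)array->buffers[1]);
    free((void *)array->buffers);
    array->release = NULL;
}

/* Export a non-nullable int32 column of length n. Returns 0 on success. */
static int export_int32(const int32_t *values, int64_t n,
                        struct ArrowSchema *schema, struct ArrowArray *array) {
    if (n < 0) return -1;
    size_t bytes = (size_t)n * sizeof(int32_t);
    int32_t *data = malloc(bytes ? bytes : 1);
    const void **buffers = malloc(2 * sizeof *buffers);
    if (data == NULL || buffers == NULL) {
        free(data);
        free((void *)buffers);
        return -1;
    }
    if (bytes) memcpy(data, values, bytes);
    buffers[0] = NULL; /* no validity bitmap because null_count is 0 */
    buffers[1] = data; /* the int32 values */

    *schema = (struct ArrowSchema){0};
    schema->format = "i";
    schema->name = "";
    schema->release = release_schema;

    *array = (struct ArrowArray){0};
    array->length = n;
    array->n_buffers = 2;
    array->buffers = buffers;
    array->release = release_array;
    return 0;
}
```

The consumer owns the structs after export and is expected to call each `release` callback exactly once when it is done with the data.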

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Remove duplicate finalizer registrations in `from_c_data`: the
`CDataHandle` was getting a finalizer attached but `CImportedArray`
already owns and manages the handle's lifetime, causing a potential
double-free when the GC collected the handle.

Also widen `child_types` from `DataType[]` to `Type[]` so that
abstract element types (e.g. Union{...}) are accepted without a
type assertion error when building struct arrays.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
CategoricalRefPool uses 0-based indices (0:n) with pool[0] as a missing
sentinel. When passed to arrowvector, ToArrow wraps it with 1-based
iteration (1..length(pool)). Since length(pool) == n+1, the last
iteration calls pool[n+1], which is out of bounds.

Fix: when firstindex(pool) != 1, skip the sentinel to give arrowvector
a standard 1-based view (pool[1:end]). The existing inds adjustment
(inds .-= firstindex(refa)) already produces correct Arrow dict indices
(-1 for missing, 0..n-1 for valid values).
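
The index arithmetic behind this fix can be sketched in a few lines of C. This only illustrates the mapping described above and is not the Julia code in the PR; `ref_to_dict_index` and `pool_without_sentinel` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Map a categorical ref (0 = missing, 1..n = levels) to an Arrow
 * dictionary index (-1 = missing, 0..n-1 = valid values). */
static int32_t ref_to_dict_index(uint32_t ref) {
    return (int32_t)ref - 1;
}

/* Copy the levels out of a pool whose slot 0 is the missing sentinel.
 * pool_len counts the sentinel, so at most pool_len - 1 entries are written.
 * Returns the number of levels written to out. */
static size_t pool_without_sentinel(const char *const *pool, size_t pool_len,
                                    const char **out) {
    size_t n = 0;
    for (size_t i = 1; i < pool_len; i++) {
        out[n++] = pool[i];
    }
    return n;
}
```

Iterating over `1..pool_len` instead of `1..pool_len - 1` is the off-by-one that produced the out-of-bounds access.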

Also add Table(::NamedTuple) constructor for the Arrow C Data path,
and add Arrow as an explicit dep in test/Project.toml so that
`julia --project=test test/runtests.jl` works in a local dev setup.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Extend test/cdatainterface.jl with 13 new testsets covering all
previously untested code paths in src/cdatainterface.jl:
- Generic lists (+l): Int32, missing, String
- Fixed-size lists (+w:N): Float32 and Int64 tuples
- Maps (+m): Dict{String,Int32}
- Dense unions (+ud:) and sparse unions (+us:)
- All four Duration units (tDs/tDm/tDu/tDn)
- Time nanoseconds (ttn)
- Timestamp with UTC timezone
- Interval year-month (tiM) and day-time (tiD)
- Decimal{10,2,Int128} (d:10,2,128)
- Arrow.Table export round-trip via to_c_data
- Bool import with non-byte-aligned bit offset
- release_c_data idempotency (double-release is a no-op)
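
The non-byte-aligned Bool case is worth spelling out, because Arrow bitmaps are packed least-significant bit first and the array `offset` is counted in bits for them. The C sketch below shows the lookup under that assumption; `bit_is_set` and `unpack_bools` are hypothetical helpers and not functions from this PR.

```c
#include <assert.h>
#include <stdint.h>

/* Return bit i of an LSB-first packed bitmap (1 = true / valid). */
static int bit_is_set(const uint8_t *bitmap, int64_t i) {
    return (bitmap[i >> 3] >> (i & 7)) & 1;
}

/* Expand `length` logical values starting at bit `offset` into one byte each. */
static void unpack_bools(const uint8_t *bitmap, int64_t offset, int64_t length,
                         uint8_t *out) {
    for (int64_t i = 0; i < length; i++) {
        out[i] = (uint8_t)bit_is_set(bitmap, offset + i);
    }
}
```

With an offset of 3, for example, the first logical value lives in bit 3 of byte 0, and a run of six values crosses into byte 1.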

Writing the tests uncovered three bugs, all fixed in src/cdatainterface.jl:

1. Dense and sparse union import used `child_types = DataType[]`, which
   rejects abstract element types such as `Union{Missing, Int32}`.
   Fixed to `child_types = Type[]` (same fix already applied to the
   struct path in a prior commit).

2. Decimal precision and scale were parsed as Int32 in
   `_fmt_to_storage_type`, producing `Decimal{Int32(10),...}` instead
   of `Decimal{Int64(10),...}`. Since Julia type parameters carry their
   integer type, the two Decimal types compared unequal even with
   identical values. Fixed by using `parse(Int, ...)`.

3. `to_c_data(::Arrow.Table)` called `Arrow.Table.names(tbl)`, which
   throws an error because `Table` is a type, not a module. Fixed to
   `Tables.columnnames(tbl)`.
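
For the decimal format string in item 2, a parser on the C side might look like the sketch below. It assumes the `d:precision,scale[,bitwidth]` form from the C Data Interface specification, where an omitted bit width means 128, and it stores all three numbers as 64-bit integers. `parse_decimal_format` is a hypothetical name; this is not the `_fmt_to_storage_type` implementation.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Parse "d:P,S" or "d:P,S,W" into 64-bit integers.
 * Returns 0 on success and -1 if fmt is not a decimal format string. */
static int parse_decimal_format(const char *fmt, int64_t *precision,
                                int64_t *scale, int64_t *bitwidth) {
    long long p = 0, s = 0, w = 0;
    int n = sscanf(fmt, "d:%lld,%lld,%lld", &p, &s, &w);
    if (n < 2) {
        return -1; /* covers both a mismatch (0) and empty input (EOF) */
    }
    *precision = (int64_t)p;
    *scale = (int64_t)s;
    *bitwidth = (n == 3) ? (int64_t)w : 128; /* default width per the spec */
    return 0;
}
```

Using one integer width for every parsed parameter mirrors the fix above, where a mismatch between Int32 and Int64 parameters made otherwise identical types compare unequal.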

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Compile a C probe at test time using offsetof() and compare each field
offset to Julia's fieldoffset(), confirming ABI compatibility with the
Arrow C Data Interface struct layout.

Also release CDataHandle in the finalizer and add release_c_data calls
to all tests to prevent resource leaks.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>