Add Arrow C Data Interface support with defensive resource management #594
Open
robertbuessow wants to merge 11 commits into
Implements both directions of the Arrow C Data Interface spec (https://arrow.apache.org/docs/format/CDataInterface.html):

- `Arrow.from_c_data(schema_ptr, array_ptr)`: import an Arrow array from C-owned memory; zero-copy via `unsafe_wrap`; the `CDataHandle` finalizer calls the C `release` callbacks automatically.
- `Arrow.to_c_data(col)`: export an `ArrowVector` or `Arrow.Table` to C; GC roots are kept alive via a token-keyed global dict; `@cfunction` release callbacks (initialised in `__init__`) delete the roots when the consumer releases.

New public names: the types `ArrowSchema`, `ArrowArray`, `CImportedArray`, and `CImportedTable`, plus the function `release_c_data`.

Supports all Arrow column types: primitives, Bool, String/binary, List (generic, large, fixed-size), Struct, Map, DenseUnion, SparseUnion, DictEncoded, Null, and all time/date/duration types. Handles nullable columns, non-zero array offsets, and custom metadata.

68 new tests in `test/cdatainterface.jl` covering format strings, buffer layout, validity bitmaps, round-trips, release semantics, non-zero offsets, and multi-column table import.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
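For readers unfamiliar with the C side of this exchange, the sketch below shows the two structs as the Arrow C Data Interface specification defines them, together with a toy producer and consumer. Everything prefixed `demo_` is hypothetical and exists only for illustration; it is not code from this PR, and a real producer would normally allocate its buffers and free them in `release`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Struct layouts as given in the Arrow C Data Interface specification. */
struct ArrowSchema {
  const char* format;
  const char* name;
  const char* metadata;
  int64_t flags;
  int64_t n_children;
  struct ArrowSchema** children;
  struct ArrowSchema* dictionary;
  void (*release)(struct ArrowSchema*);
  void* private_data;
};

struct ArrowArray {
  int64_t length;
  int64_t null_count;
  int64_t offset;
  int64_t n_buffers;
  int64_t n_children;
  const void** buffers;
  struct ArrowArray** children;
  struct ArrowArray* dictionary;
  void (*release)(struct ArrowArray*);
  void* private_data;
};

/* Hypothetical producer: the data is static, so release only marks the
   struct as released by clearing the callback, as the spec requires. */
static void demo_release_schema(struct ArrowSchema* s) { s->release = NULL; }
static void demo_release_array(struct ArrowArray* a) { a->release = NULL; }

static const int32_t demo_values[3] = {10, 20, 30};
/* Primitive layout: buffer 0 is the validity bitmap (may be NULL when
   null_count is 0), buffer 1 holds the values. */
static const void* demo_buffers[2] = {NULL, demo_values};

void demo_export_int32(struct ArrowSchema* schema, struct ArrowArray* array) {
  assert(schema != NULL && array != NULL);
  *schema = (struct ArrowSchema){
      .format = "i", /* "i" is the format string for int32 */
      .name = "",
      .release = demo_release_schema};
  *array = (struct ArrowArray){
      .length = 3,
      .null_count = 0,
      .offset = 0,
      .n_buffers = 2,
      .buffers = demo_buffers,
      .release = demo_release_array};
}

/* Hypothetical consumer: reads the values in place, honouring `offset`. */
int64_t demo_sum_int32(const struct ArrowArray* array) {
  const int32_t* values = (const int32_t*)array->buffers[1];
  int64_t sum = 0;
  for (int64_t i = 0; i < array->length; i++) {
    sum += values[array->offset + i];
  }
  return sum;
}
```

A consumer that is done with the data calls `array.release(&array)` and `schema.release(&schema)`; that call is what `release_c_data` and the `CDataHandle` finalizer are described as making on the Julia side.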
Remove duplicate finalizer registrations in `from_c_data`: the
`CDataHandle` was getting a finalizer attached but `CImportedArray`
already owns and manages the handle's lifetime, causing a potential
double-free when the GC collected the handle.
Also widen `child_types` from `DataType[]` to `Type[]` so that
element types that are not `DataType`s (e.g. Union{...}) are accepted without a
type assertion error when building struct arrays.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
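The double-free hazard described above arises whenever two owners can each trigger the same release. The C Data Interface convention that guards against it is that a release callback sets its own `release` pointer to `NULL`, and callers check that pointer first. The sketch below illustrates that convention with a hypothetical stripped-down struct (`DemoArray`, keeping only the two relevant fields); it does not reproduce the Julia finalizer logic in this PR.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for ArrowArray with only the fields needed here. */
struct DemoArray {
  void (*release)(struct DemoArray*);
  void* private_data;
};

static int demo_free_count = 0; /* counts how often memory was really freed */

static void demo_release(struct DemoArray* a) {
  free(a->private_data);
  a->private_data = NULL;
  demo_free_count++;
  a->release = NULL; /* marks the struct as released */
}

/* Safe to call from several owners: only the first call frees anything.
   Returns 1 if this call performed the release, 0 if it was a no-op. */
int demo_release_once(struct DemoArray* a) {
  assert(a != NULL);
  if (a->release == NULL) {
    return 0;
  }
  a->release(a);
  return 1;
}

struct DemoArray demo_make(void) {
  struct DemoArray a = {demo_release, malloc(16)};
  return a;
}

int demo_frees(void) { return demo_free_count; }
```

Calling `demo_release_once` twice on the same struct frees the memory once; the second call sees `release == NULL` and returns without touching anything.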
CategoricalRefPool uses 0-based indices (0:n) with pool[0] as a missing sentinel. When passed to arrowvector, ToArrow wraps it with 1-based iteration (1..length(pool)). Since length(pool) == n+1, the last iteration calls pool[n+1], which is out of bounds.

Fix: when firstindex(pool) != 1, skip the sentinel to give arrowvector a standard 1-based view (pool[1:end]). The existing inds adjustment (inds .-= firstindex(refa)) already produces correct Arrow dict indices (-1 for missing, 0..n-1 for valid values).

Also add Table(::NamedTuple) constructor for the Arrow C Data path, and add Arrow as an explicit dep in test/Project.toml so that `julia --project=test test/runtests.jl` works in a local dev setup.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Extend test/cdatainterface.jl with 13 new testsets covering all
previously untested code paths in src/cdatainterface.jl:
- Generic lists (+l): Int32, missing, String
- Fixed-size lists (+w:N): Float32 and Int64 tuples
- Maps (+m): Dict{String,Int32}
- Dense unions (+ud:) and sparse unions (+us:)
- All four Duration units (tDs/tDm/tDu/tDn)
- Time nanoseconds (ttn)
- Timestamp with UTC timezone
- Interval year-month (tiM) and day-time (tiD)
- Decimal{10,2,Int128} (d:10,2,128)
- Arrow.Table export round-trip via to_c_data
- Bool import with non-byte-aligned bit offset
- release_c_data idempotency (double-release is a no-op)
Writing the tests uncovered three bugs, all fixed in src/cdatainterface.jl:
1. Dense and sparse union import used `child_types = DataType[]`, which
rejects element types that are not `DataType`s, such as `Union{Missing, Int32}`.
Fixed to `child_types = Type[]` (same fix already applied to the
struct path in a prior commit).
2. Decimal precision and scale were parsed as Int32 in
`_fmt_to_storage_type`, producing `Decimal{Int32(10),...}` instead
of `Decimal{Int64(10),...}`. Since Julia type parameters carry their
integer type, the two Decimal types compared unequal even with
identical values. Fixed by using `parse(Int, ...)`.
3. `to_c_data(::Arrow.Table)` called `Arrow.Table.names(tbl)`, which
crashes because `Table` is a type, not a module. Fixed to
`Tables.columnnames(tbl)`.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
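As background for bug 2, the decimal format string in the C Data Interface has the shape `d:precision,scale` with an optional third `,bitwidth` field (128 when omitted). The sketch below parses that shape in C using a hypothetical `demo_parse_decimal`. It only illustrates the string format; the `Int32` versus `Int` type-parameter mismatch is specific to Julia and has no counterpart here.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical result type for a parsed decimal format string. */
struct DemoDecimal {
  int precision;
  int scale;
  int bit_width;
};

/* Parses "d:P,S" or "d:P,S,W". Returns 1 on success and 0 if the string
   is not a decimal format string. */
int demo_parse_decimal(const char* fmt, struct DemoDecimal* out) {
  assert(fmt != NULL && out != NULL);
  int p = 0, s = 0, w = 0;
  int n = sscanf(fmt, "d:%d,%d,%d", &p, &s, &w);
  if (n < 2) {
    return 0; /* not a decimal format string */
  }
  if (n == 2) {
    w = 128; /* bit width defaults to 128 when omitted */
  }
  out->precision = p;
  out->scale = s;
  out->bit_width = w;
  return 1;
}
```

For the string used in the new test, `d:10,2,128`, this yields precision 10, scale 2, and bit width 128.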
Compile a C probe at test time using `offsetof()` and compare each field offset to Julia's `fieldoffset()`, confirming ABI compatibility with the Arrow C Data Interface struct layout.

Also release `CDataHandle` in the finalizer and add `release_c_data` calls to all tests to prevent resource leaks.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
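A probe of the kind described above might look like the following sketch. The struct definitions follow the Arrow C Data Interface specification; the `demo_` functions are hypothetical and only show how `offsetof()` yields one number per field (9 for `ArrowSchema`, 10 for `ArrowArray`) that the Julia side can compare with `fieldoffset()`. How the actual probe in the test suite reports its results is not shown in this PR description.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct ArrowSchema {
  const char* format;
  const char* name;
  const char* metadata;
  int64_t flags;
  int64_t n_children;
  struct ArrowSchema** children;
  struct ArrowSchema* dictionary;
  void (*release)(struct ArrowSchema*);
  void* private_data;
};

struct ArrowArray {
  int64_t length;
  int64_t null_count;
  int64_t offset;
  int64_t n_buffers;
  int64_t n_children;
  const void** buffers;
  struct ArrowArray** children;
  struct ArrowArray* dictionary;
  void (*release)(struct ArrowArray*);
  void* private_data;
};

/* Fills `out` with the byte offset of each ArrowSchema field, in order. */
void demo_schema_offsets(size_t out[9]) {
  assert(out != NULL);
  out[0] = offsetof(struct ArrowSchema, format);
  out[1] = offsetof(struct ArrowSchema, name);
  out[2] = offsetof(struct ArrowSchema, metadata);
  out[3] = offsetof(struct ArrowSchema, flags);
  out[4] = offsetof(struct ArrowSchema, n_children);
  out[5] = offsetof(struct ArrowSchema, children);
  out[6] = offsetof(struct ArrowSchema, dictionary);
  out[7] = offsetof(struct ArrowSchema, release);
  out[8] = offsetof(struct ArrowSchema, private_data);
}

/* Fills `out` with the byte offset of each ArrowArray field, in order. */
void demo_array_offsets(size_t out[10]) {
  assert(out != NULL);
  out[0] = offsetof(struct ArrowArray, length);
  out[1] = offsetof(struct ArrowArray, null_count);
  out[2] = offsetof(struct ArrowArray, offset);
  out[3] = offsetof(struct ArrowArray, n_buffers);
  out[4] = offsetof(struct ArrowArray, n_children);
  out[5] = offsetof(struct ArrowArray, buffers);
  out[6] = offsetof(struct ArrowArray, children);
  out[7] = offsetof(struct ArrowArray, dictionary);
  out[8] = offsetof(struct ArrowArray, release);
  out[9] = offsetof(struct ArrowArray, private_data);
}
```

Because the offsets come from the same C compiler that builds the consumer libraries, any padding or pointer-size difference shows up as a mismatch rather than as silent memory corruption at run time.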
Summary
This PR adds support for the Arrow C Data Interface, enabling zero-copy data exchange between Julia and C/C++/Python runtimes in the same process.
Key additions and fixes:
- `src/cdatainterface.jl`: `Arrow.from_c_data` / `Arrow.to_c_data` for importing and exporting Arrow arrays and tables across the C ABI boundary. Handles all Arrow types: primitive, boolean, list, fixed-size list, map, struct, union, dict-encoded, and nested types.
- `CDataHandle` tracks C-side memory lifetime. The GC finalizer uses `jl_safe_printf` instead of `@error` (task switches are forbidden in finalizers) and an atomic counter instead of a non-thread-safe `Ref{Int}`. The finalizer also calls the C release callback as a safety net, so C resources are freed even when `release_c_data` is not called explicitly.
- A C probe compiled at test time uses `offsetof()` and compares every field offset of `ArrowSchema` and `ArrowArray` against Julia's `fieldoffset()`, catching any struct layout divergence between Julia and the C compiler.
- All tests call `release_c_data` explicitly, and a final `@testset "no unexpected resource leaks"` asserts `UNRELEASED_HANDLE_COUNT` only increases by the one intentional leak test.
- Fixes a `BoundsError` when dict-encoding `CategoricalArray`s with missing values.

Test plan

- `julia --project -e 'using Pkg; Pkg.test()'` passes with no leak warnings
- `@testset "struct field offsets match C ABI"`: 19 assertions all green
- `@testset "no unexpected resource leaks"`: counter == initial + 1
- No `Arrow.CDataHandle GC'd without explicit release_c_data` output during the test run (except inside the `redirect_stderr` block of the intentional-leak test)

🤖 Generated with Claude Code