Add VariantArray extension type and introduce IBinaryArray/IIndexes interfaces#325
Conversation
Rename assembly, project, namespace, and test project: - src/Apache.Arrow.Variant -> src/Apache.Arrow.Scalars - test/Apache.Arrow.Variant.Tests -> test/Apache.Arrow.Scalars.Tests - Update all namespace references across the solution Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Move all Variant*.cs files into a Variant/ subdirectory within src/Apache.Arrow.Scalars and update the namespace from Apache.Arrow.Scalars to Apache.Arrow.Scalars.Variant. Update using directives in all consumers: Operations, Tests, and Benchmarks projects. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Remove net6.0 target from Apache.Arrow - Update Apache.Arrow.Flight.AspNetCore from net6.0 to net8.0 - Add net462 target to Apache.Arrow.Scalars and Apache.Arrow.Scalars.Tests - Add ProjectReference from Apache.Arrow to Apache.Arrow.Scalars - Implement VariantExtensionDefinition, VariantType, and VariantArray for the arrow.parquet.variant extension type, backed by struct<metadata: binary, value: binary> - Storage validation accepts Binary, LargeBinary, or BinaryView fields and looks up fields by name (not index) per spec Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…into VariantArray
There was a problem hiding this comment.
Pull request overview
Adds first-class support for Parquet Variant values in Apache.Arrow via a new VariantArray extension type, and introduces internal abstraction interfaces (IBinaryArray, IIndexes) to decouple decoding/enumeration code from concrete array implementations (including Dictionary and Run-End Encoded layouts).
Changes:
- Added
VariantType/VariantArray(plusVariantArray.Builder) and corresponding unit tests, including IPC round-trip coverage. - Introduced internal
IBinaryArrayandIIndexesinterfaces, and updated binary and integer array types (plusRunEndEncodedArray) to implement them. - Updated decoding helpers to use
IIndexesinstead of type-switching on index array types; updated project files/TFMs and added a project reference toApache.Arrow.Scalars.
Reviewed changes
Copilot reviewed 22 out of 22 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| test/Apache.Arrow.Tests/VariantArrayTests.cs | Adds unit tests for building/reading VariantArray and IPC round-trip behavior. |
| test/Apache.Arrow.Scalars.Tests/Apache.Arrow.Scalars.Tests.csproj | Adds net462 test target on Windows and overrides xUnit VS runner version for that TFM. |
| src/Apache.Arrow/Interfaces/IIndexes.cs | Introduces internal index abstraction for dictionary/REE index resolution and sequential enumeration. |
| src/Apache.Arrow/Interfaces/IBinaryArray.cs | Introduces internal binary abstraction to unify GetBytes(...) across binary array variants. |
| src/Apache.Arrow/Extensions/IArrowArrayExtensions.cs | Refactors decoded list support to use IIndexes rather than ArrowTypeId-based casting. |
| src/Apache.Arrow/Arrays/VariantArray.cs | Implements VariantType, VariantArray, extension definition, decoding, and builder. |
| src/Apache.Arrow/Arrays/UInt8Array.cs | Implements IIndexes for dictionary/REE index abstraction. |
| src/Apache.Arrow/Arrays/UInt64Array.cs | Implements IIndexes for dictionary/REE index abstraction (with checked casts). |
| src/Apache.Arrow/Arrays/UInt32Array.cs | Implements IIndexes for dictionary/REE index abstraction (with checked casts). |
| src/Apache.Arrow/Arrays/UInt16Array.cs | Implements IIndexes for dictionary/REE index abstraction. |
| src/Apache.Arrow/Arrays/RunEndEncodedArray.cs | Implements IIndexes to allow physical index mapping via the new abstraction. |
| src/Apache.Arrow/Arrays/LargeBinaryArray.cs | Implements IBinaryArray to unify binary access via GetBytes(...). |
| src/Apache.Arrow/Arrays/Int8Array.cs | Implements IIndexes for dictionary/REE index abstraction. |
| src/Apache.Arrow/Arrays/Int64Array.cs | Implements IIndexes for dictionary/REE index abstraction (with checked casts). |
| src/Apache.Arrow/Arrays/Int32Array.cs | Implements IIndexes for dictionary/REE index abstraction. |
| src/Apache.Arrow/Arrays/Int16Array.cs | Implements IIndexes for dictionary/REE index abstraction. |
| src/Apache.Arrow/Arrays/DictionaryArray.cs | Adds GetIndexes() helper and exposes physical index enumeration via the new abstraction. |
| src/Apache.Arrow/Arrays/BinaryViewArray.cs | Implements IBinaryArray to unify binary access via GetBytes(...). |
| src/Apache.Arrow/Arrays/BinaryArray.cs | Implements IBinaryArray to unify binary access via GetBytes(...). |
| src/Apache.Arrow/Apache.Arrow.csproj | Drops net6.0 TFM and adds project reference to Apache.Arrow.Scalars. |
| src/Apache.Arrow.Scalars/Apache.Arrow.Scalars.csproj | Adds net462 target and required package refs for older frameworks. |
| src/Apache.Arrow.Flight.AspNetCore/Apache.Arrow.Flight.AspNetCore.csproj | Raises target framework from net6.0 to net8.0. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| public ReadOnlySpan<byte> GetMetadataBytes(int index) | ||
| { | ||
| int physicalIndex = _metadataIndexes.GetPhysicalIndex(index); | ||
| return _metadataArray.GetBytes(physicalIndex, out bool isNull); | ||
| } |
There was a problem hiding this comment.
GetMetadataBytes computes a physical index via IIndexes; for dictionary/run-end encodings IIndexes can return -1 (null index) or refer to an underlying value that is null. Calling GetBytes with a negative index will throw an ArgumentOutOfRangeException that’s hard to interpret. Consider validating the physicalIndex and/or the returned isNull flag and throwing a clearer InvalidOperationException (or mapping to VariantValue.Null) when the storage is inconsistent.
There was a problem hiding this comment.
I'm fine with this consequence for malformed data but would welcome other opinions.
There was a problem hiding this comment.
Yeah getting an ArgumentOutOfRangeException in this scenario sounds reasonable.
|
|
||
| namespace Apache.Arrow | ||
| { | ||
| internal interface IBinaryArray |
There was a problem hiding this comment.
This could perhaps be public at some point.
adamreeve
left a comment
There was a problem hiding this comment.
This all looks good to me thanks Curt
What's Changed
VariantArrayextension type.Builderclass forVariantArraythat encodesVariantValueinstances into the variant binary format and constructs the backingStructArray.IBinaryArrayandIIndexesto decoupleVariantArrayand other uses from concrete array types.IBinaryArrayunifiesBinaryArray,LargeBinaryArray, andBinaryViewArraybehind a commonGetBytesAPI.IIndexesabstracts index resolution forDictionaryArrayandRunEndEncodedArray, enabling efficient sequential enumeration for REE arrays. Integer array types implementIIndexeswithGetPhysicalIndexandEnumeratePhysicalIndicesmethods.Scalarsassembly and tests