Add Parquet variant shredding support #328
CurtHagenlocher wants to merge 1 commit into apache:main
Pull request overview
Adds end-to-end Parquet “variant shredding” support to the Arrow .NET operations layer, plus supporting scalar helpers and a checked-in IPC conformance corpus so CI can validate shredded-variant behavior without requiring a Parquet reader or Python.
Changes:
- Adds Apache.Arrow.Operations.Shredding types/helpers and options/enums for shredded-variant typed_value handling.
- Adds VariantValueWriter.CopyValue(VariantReader) and VariantMetadataBuilder.CollectFieldNames(VariantReader) to support cross-dictionary transcoding workflows.
- Adds test/shredded_variant_ipc IPC fixtures (and a regen script) for conformance testing.
Reviewed changes
Copilot reviewed 28 out of 165 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| src/Apache.Arrow.Operations/Apache.Arrow.Operations.csproj | Adds Operations → Apache.Arrow project reference needed by shredding types. |
| src/Apache.Arrow.Operations/Shredding/ShreddingHelpers.cs | Internal helper to construct per-row ShreddedVariant slots from element-group structs. |
| src/Apache.Arrow.Operations/Shredding/ShredOptions.cs | Public options for shredding schema inference. |
| src/Apache.Arrow.Operations/Shredding/ShredType.cs | Enum describing typed_value expectations for shredded variant columns. |
| src/Apache.Arrow.Scalars/Variant/VariantMetadataBuilder.cs | Adds recursive field-name collection to support 2-pass metadata + value encoding. |
| src/Apache.Arrow.Scalars/Variant/VariantValueWriter.cs | Adds CopyValue/CopyPrimitive to transcode from a VariantReader into a writer. |
| test/shredded_variant_ipc/regen.py | Script to regenerate IPC fixtures from the parquet-testing shredded_variant corpus. |
| test/shredded_variant_ipc/case-001.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-002.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-004.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-005.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-006.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-007.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-008.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-009.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-010.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-011.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-012.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-013.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-014.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-015.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-016.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-017.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-018.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-019.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-020.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-021.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-022.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-023.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-024.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-025.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-026.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-027.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-028.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-029.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-030.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-031.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-032.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-033.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-034.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-035.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-036.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-037.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-038.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-039.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-040.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-041.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-042.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-043-INVALID.arrow | IPC fixture for shredded-variant conformance (invalid case). |
| test/shredded_variant_ipc/case-044.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-045.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-046.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-047.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-048.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-049.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-050.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-051.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-052.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-053.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-054.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-055.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-056.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-057.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-058.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-059.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-060.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-061.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-062.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-063.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-064.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-065.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-066.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-067.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-068.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-069.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-070.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-071.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-072.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-073.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-074.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-075.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-076.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-077.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-078.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-079.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-080.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-081.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-082.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-083.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-084-INVALID.arrow | IPC fixture for shredded-variant conformance (invalid case). |
| test/shredded_variant_ipc/case-085.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-086.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-087.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-088.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-089.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-090.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-091.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-092.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-093.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-094.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-095.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-096.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-097.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-098.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-099.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-100.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-101.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-102.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-103.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-104.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-105.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-106.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-107.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-108.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-109.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-110.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-111.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-112.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-113.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-114.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-115.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-116.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-117.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-118.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-119.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-120.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-121.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-122.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-123.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-124.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-125-INVALID.arrow | IPC fixture for shredded-variant conformance (invalid case). |
| test/shredded_variant_ipc/case-126.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-127.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-128.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-129.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-130.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-131.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-132.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-133.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-134.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-135.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-136.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-137.arrow | IPC fixture for shredded-variant conformance. |
| test/shredded_variant_ipc/case-138.arrow | IPC fixture for shredded-variant conformance. |
    public double MinTypeConsistency { get; set; } = 0.8;

    /// <summary>Default options.</summary>
    public static readonly ShredOptions Default = new ShredOptions();
ShredOptions.Default is a mutable static instance (the type has settable properties). Any consumer that does ShredOptions.Default.MaxDepth = ... will mutate global state for the entire process, which is easy to do accidentally and hard to debug. Consider making Default return a new instance each time (e.g., => new ShredOptions()), making the options immutable (init-only), or exposing a CreateDefault() factory instead.
Suggested change:

    - public static readonly ShredOptions Default = new ShredOptions();
    + public static ShredOptions Default => new ShredOptions();
    src_path = os.path.join(src, pf)
    if not os.path.exists(src_path):
        continue
regen.py silently skips a Parquet file when src_path doesn't exist (continue). This can lead to an incomplete regenerated IPC corpus without any signal (e.g., if the parquet-testing subtree isn't fully checked out or a filename in cases.json changes). Consider failing fast or at least emitting a warning/error to stderr when a listed Parquet file is missing, and optionally track a nonzero exit code if any were skipped.
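One way to act on this comment, sketched under assumptions: the helpers below are hypothetical and are not taken from regen.py. They only assume that the script has a source directory (`src`) and a list of Parquet file names to convert. Missing files are collected up front, reported on stderr, and turned into a nonzero exit code instead of being skipped silently.

```python
import os
import sys


def find_missing(src, parquet_files):
    """Return the listed Parquet file names that do not exist under src."""
    return [pf for pf in parquet_files
            if not os.path.exists(os.path.join(src, pf))]


def report_missing(missing, stream=sys.stderr):
    """Print one warning per missing file and return the exit code to use."""
    for pf in missing:
        print(f"warning: listed Parquet file not found: {pf}", file=stream)
    # Nonzero when anything was skipped, so CI or the caller notices.
    return 1 if missing else 0
```

A script could call `sys.exit(report_missing(find_missing(src, files)))` once conversion finishes, or call it before converting anything if a strict fail-fast is preferred.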
    case VariantPrimitiveType.TimestampTzNanos: WriteTimestampTzNanos(source.GetTimestampTzNanos()); return;
    case VariantPrimitiveType.TimestampNtzNanos: WriteTimestampNtzNanos(source.GetTimestampNtzNanos()); return;
    case VariantPrimitiveType.String: WriteString(source.GetString()); return;
    case VariantPrimitiveType.Binary: WriteBinary(source.GetBinary().ToArray()); return;
    case VariantPrimitiveType.Uuid: WriteUuid(source.GetUuid()); return;
CopyPrimitive allocates for binary values via source.GetBinary().ToArray() because WriteBinary only accepts byte[]. For large variants or bulk transcoding this can be a significant overhead. Consider adding a WriteBinary(ReadOnlySpan<byte>)/WriteBinary(ReadOnlyMemory<byte>) overload (or equivalent) and using it here to avoid the intermediate allocation.
    StructType elementGroupType = (StructType)elementGroup.Data.DataType;
    int valueIdx = elementGroupType.GetFieldIndex("value");
    int typedIdx = elementGroupType.GetFieldIndex("typed_value");

    IArrowArray valueArr = valueIdx >= 0 ? elementGroup.Fields[valueIdx] : null;
    IArrowArray typedArr = typedIdx >= 0 ? elementGroup.Fields[typedIdx] : null;
BuildSlot calls StructType.GetFieldIndex("value") / GetFieldIndex("typed_value") on every invocation. This method is a linear scan (StructType.cs even notes caching if on a hot path), and BuildSlot is used inside per-element loops in ShreddedArray/ShreddedObject. Consider caching these field indices once per element-group StructType (or passing them in) to avoid repeated linear lookups.
I need to refactor
What's Changed
Implements the Parquet variant shredding spec end-to-end in a new Apache.Arrow.Operations.Shredding namespace, alongside minor changes to the base scalar and array types.

Operations.Shredding reader side:
- ShreddedVariant / ShreddedObject / ShreddedArray ref-struct trio exposing typed columns and residual bytes side-by-side.
- VariantArrayShreddingExtensions adds GetShreddedVariant(i) and GetLogicalVariantValue(i) on VariantArray.
- ShredSchema.FromArrowType derives a shredding schema from an Arrow typed_value type, rejecting unsupported types (uint32, fixed-size-binary(N≠16)).

Operations.Shredding producer side:
- VariantShredder decomposes a column of VariantValues against a ShredSchema into shared metadata + per-row ShredResults.
- ShreddedVariantArrayBuilder assembles those into a shredded VariantArray with a typed_value Arrow tree matching the schema.

Apache.Arrow changes:
- VariantExtensionDefinition accepts struct<metadata, value?, typed_value?> layouts in addition to the plain unshredded form.
- VariantType gains IsShredded / HasValueColumn / HasTypedValueColumn / TypedValueField properties.
- VariantArray.GetVariantValue and GetVariantReader throw on shredded columns with a pointer to the Operations.Shredding extensions.
- The public VariantArray(IArrowArray) constructor now infers the VariantType (shredded or not) from the storage shape.
- Operations gains a project reference to Apache.Arrow; Apache.Arrow does not reference Operations.

Apache.Arrow.Scalars changes:
- VariantValueWriter.CopyValue(VariantReader source) transcodes a reader into this writer, re-resolving field IDs against the writer's metadata dictionary. Supports cross-dictionary transcoding and multi-source merge-into-one-dictionary workflows.
- VariantMetadataBuilder.CollectFieldNames(VariantReader source) is the two-pass companion that accumulates source field names into the target metadata builder.

Validation:
- apache/parquet-testing (test/parquet-testing/shredded_variant/).
- test/shredded_variant_ipc/regen.py converts each case-NNN.parquet to an Arrow IPC file via pyarrow; 137 resulting .arrow files are checked in so CI needs no Python.
- All 128 valid conformance cases pass; 6 schema-invalid and data-invalid cases are rejected with clear errors; 3 "spec-invalid but permissive" INVALID cases are documented as read-without-throw.