Skip to content

feat(parquet-variant): add dictionary and run-end encoded support to …#10014

Open
mneetika wants to merge 1 commit into
apache:mainfrom
mneetika:feature/variant-dictionary-support
Open

feat(parquet-variant): add dictionary and run-end encoded support to …#10014
mneetika wants to merge 1 commit into
apache:mainfrom
mneetika:feature/variant-dictionary-support

Conversation

@mneetika
Copy link
Copy Markdown

Which issue does this PR close?

Rationale for this change

The current unshred_variant layout restoration engine lacks native support for complex nested and run-length arrays (DataType::Dictionary and DataType::RunEndEncoded). Without this capabilities track, downstream analytical engines (such as Apache DataFusion) cannot fully execute end-to-end optimizations on semi-structured Parquet Variant columns when columns use memory-optimized dictionaries or compressed run layouts.

Rather than writing fragmented, redundant iteration code specific to these two complex types, this PR closes the structural type gap by cleanly routing layout handling through the pre-existing, highly efficient ArrowToVariantRowBuilder abstraction framework.

What changes are included in this PR?

  1. Extended UnshredVariantRowBuilder State Machine: Added an Arrow(ArrowUnshredRowBuilder) variant to handle layout schemas that are natively supported by Arrow-to-Variant records but do not possess localized primitive variant implementations.
  2. Plumbed CastOptions Recursively: Propagated CastOptions configuration down through the recursive try_new_opt lifecycle. This guarantees that type conversions remain perfectly aligned across highly nested structs, lists, and view layers.
  3. Optimized Null and Fallback Interception: Hooked the new row builder directly into the core handle_unshredded_case! macro routine to maintain low-overhead tracking for early row nulls or literal byte vector overrides.

Are these changes tested?

Yes, automated unit tests have been added directly to unshred_variant.rs to guarantee complete encoding and decoding fidelity:

  • test_unshred_dictionary_typed_value: Validates dictionary key-to-value resolution paths, null key handoffs, and index repetition offsets.
  • test_unshred_run_end_encoded_typed_value: Verifies run-length array boundary checks and multi-row string value reconstruction.

Are there any user-facing changes?

No. This change is entirely additive and non-breaking. It expands structural coverage for the experimental Parquet Variant processing pipeline without modifying any public-facing function signatures or traits.

@github-actions github-actions Bot added the parquet-variant parquet-variant* crates label May 24, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

parquet-variant parquet-variant* crates

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Variant] Add variant_to_arrow Dictionary/REE type support

1 participant