feat(parquet-variant): add dictionary and run-end encoded support to … #10014
Open
mneetika wants to merge 1 commit into
Which issue does this PR close?
variant_to_arrow Dictionary/REE type support #10013
Rationale for this change
The current unshred_variant layout restoration engine lacks native support for dictionary-encoded and run-end encoded arrays (DataType::Dictionary and DataType::RunEndEncoded). Without this support, downstream analytical engines (such as Apache DataFusion) cannot fully execute end-to-end optimizations on semi-structured Parquet Variant columns when those columns use memory-optimized dictionaries or compressed run layouts.
Rather than adding redundant iteration code specific to these two types, this PR closes the type gap by routing their handling through the existing ArrowToVariantRowBuilder abstraction.
What changes are included in this PR?
- UnshredVariantRowBuilder State Machine: Added an Arrow(ArrowUnshredRowBuilder) variant to handle layouts that Arrow-to-Variant conversion already supports but that lack a dedicated primitive variant implementation in the unshred builder.
- CastOptions Recursively: Propagated CastOptions configuration down through the recursive try_new_opt path, so that type conversions stay consistent across nested structs, lists, and view layers.
- handle_unshredded_case! macro routine to maintain low-overhead tracking for early row nulls or literal byte vector overrides.
Are these changes tested?
Yes, automated unit tests have been added directly to unshred_variant.rs to verify encoding and decoding fidelity:
- test_unshred_dictionary_typed_value: Validates dictionary key-to-value resolution, null key handling, and repeated keys.
- test_unshred_run_end_encoded_typed_value: Verifies run boundary handling and multi-row string value reconstruction.
Are there any user-facing changes?
No. This change is entirely additive and non-breaking. It expands structural coverage for the experimental Parquet Variant processing pipeline without modifying any public-facing function signatures or traits.
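For reviewers less familiar with the two layouts named in the rationale, the sketch below shows how a dictionary-encoded column and a run-end encoded column logically expand to one value per row, which is the behavior the two new tests exercise. It is a std-only illustration, not the arrow-rs API and not this PR's implementation: the helpers expand_dictionary and expand_run_ends and their slice-based signatures are hypothetical, and it assumes run ends are exclusive, strictly increasing logical row indices.

```rust
/// Hypothetical helper (not part of arrow-rs or this PR): expand a
/// dictionary-encoded column into one value per row.
/// `keys[i]` indexes into `values`; a `None` key represents a null row.
/// Panics if a key is out of bounds for `values`.
fn expand_dictionary<'a>(keys: &[Option<usize>], values: &[&'a str]) -> Vec<Option<&'a str>> {
    keys.iter()
        .copied()
        .map(|key| key.map(|k| values[k]))
        .collect()
}

/// Hypothetical helper (not part of arrow-rs or this PR): expand a run-end
/// encoded column into one value per row.
/// Assumption: `run_ends[j]` is the exclusive logical end index of run `j`,
/// and the run ends are strictly increasing.
fn expand_run_ends<'a>(run_ends: &[usize], values: &[Option<&'a str>]) -> Vec<Option<&'a str>> {
    assert_eq!(run_ends.len(), values.len(), "one value per run");
    let mut rows = Vec::with_capacity(run_ends.last().copied().unwrap_or(0));
    let mut start = 0;
    for (&end, &value) in run_ends.iter().zip(values) {
        assert!(end > start, "run ends must be strictly increasing");
        // Repeat the run's value once for every logical row it covers.
        rows.extend(std::iter::repeat(value).take(end - start));
        start = end;
    }
    rows
}
```

Under these assumptions, keys [0, null, 1, 0] over dictionary values ["a", "b"] expand to a, null, b, a, and run ends [2, 3, 5] over run values ["x", null, "y"] expand to x, x, null, y, y.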