-
Notifications
You must be signed in to change notification settings - Fork 1k
[Variant] Harmonize primitive conversions variant unshredding and casting #8521
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
scovich
wants to merge
5
commits into
apache:main
Choose a base branch
from
scovich:harmonize-unshred-and-cast
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
+633
−61
Open
Changes from all commits
Commits
Show all changes
5 commits
Select commit
Hold shift + click to select a range
0662e12
[Variant] Define and use unshred_variant function
scovich 62b6299
address reviews
scovich 29cdbad
Merge remote-tracking branch 'oss/main' into unshred-variant
scovich 2831e34
remove stale TODO
scovich 509d432
[Variant] Harmonize primitive conversions in cast_to_variant and unsh…
scovich File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One thing I don't love is we now have two complex macros that feel kind of similar. This new macro simplified the
unshred_variant
code but has ~no effect on thecast_to_variant
code in this file. In fact, the builder definition didn't even change (L397 on the left side of the diff):So in retrospect, I'm not sure what duplication this PR actually eliminated?
Seems like it just moved some definitions around.
Any ideas on a better approach?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree this PR doesn't seem to have eliminated much yet
There seems to be two major pieces of functionality
It seems to me like
arrow_to_variant.rs
andunshred_variant.rs
still have non trivial overlap as you mention. What I was hoping was that we could somehow use one set of traits / structs for both those operations which would mean we would getunshred
support "for free" for other types likeUnionArray
sThat was the dream at anyways.
After I spent some time reviewing this PR I think I would like to propose some naming consolidation as we have two functions that are inverses of each other that are confusingly named;
cast_to_variant
which casts arrow arrays to variantvariant_to_arrow
which casts variants to arrow arraysI will make a PR to propose
cast_arrow_to_variant
andcast_variant_to_arrow
for the kernel names and we can keep the modulesvariant_to_arrow
andarrow_to_variant
with the actual codeThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cast_to_variant
is a function that leverages thearrow_to_variant
module.variant_to_arrow
is a module that thevariant_get
function relies on.I guess
variant_get
could be seen as a superset of a hypotheticalcast_variant_to_arrow
function that is an inverse ofcast_[arrow_]to_variant
?That said, I do think it's a good idea to step back and take a hard look at naming conventions as the code matures and the interactions (or lack thereof) become clearer:
cast_to_variant
- converts fully strongly-typed data to binary variantarrow_to_variant
moduleshred_variant
- shreds a binary variant input according to the requested shredding schemaunshred_variant
- converts shredded variant back to binary variantvariant_get
- can be used to extract fully strongly-typed data from variant (shredded or not)variant_to_arrow
module when necessaryAnd, for good measure, some of the conversions probably should move to
type_conversions
module, and I don't know where theListLikeArray
trait should live?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think that is a good way to think about it. I don't think we should add
cast_to_variant_to_arrow
at this time (I was very confused)Or maybe we could consolidate the conversions into
arrow_to_variant
orvariant_to_arrow
🤔There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree this is the intuitive view, and arrow_to_variant and unshred_variant both fall loosely in category 1/.
So this is tricky -- shredded variant
typed_value
columns must be one of the supported variant shredding types defined by the shredding spec.Unsigned integer types are not on that list, so we'll never need to unshred them (even tho we can shred them, and can even
variant_get
them by converting back).Complex types like Union and Map are also not on that list and so we'll never need to unshred them. But we can still convert them to variant: Whatever union branch is active for each row gets converted to variant, which works fine; maps are trickier -- I think our current code forcibly converts the map key column to string (with a
cast
🙀) and then converts the result to variant object.FixedLenBinary is also tricky, because it's not a valid shredded variant type, but UUID uses
FixedLenBinary(16)
as its physical type. So when converting to variant, all binary types (including fixed len) convert toVariant::Binary
except thatFixedLenBinary(16)
with the UUID extension type would convert toVariant::Uuid
.So one immediate problem is that the two operations have an overlapping but not equivalent set of types. And some physical types that seem the same have different interpretation/semantics. Even the simplest -- the NULL builder -- has different semantics between the two operations.
Another problem is that the definition of "primitive type" differs between the two modules:
unshred_variant
takes the variant perspective, so string, binary, and boolean arrays can all implement the genericAppendToVariantBuilder
trait thatUnshredPrimitiveRowBuilder
relies on. But timestamps, which need extra state (timezone info) are not primitive and need their own builder.arrow_to_variant
takes the arrow perspective, so string, binary and boolean are not primitive types and thus need their own builder implementations. Additionally, all decimal and temporal types need special treatment and they get customer builders as well.Another problem is the need to handle
value
column when unshredding, which is not needed when converting strongly typed data to variant.Overall... I couldn't find a way to slice this better, in spite of my intuition screaming it should be possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, and casting failures also... converting arrow value to variant can fail, and cast options decides whether the failure produces
Variant::Null
or an error. In theory, unshredding is infallible and any failure there is due to invalid data (so should always produce an error).There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's an LLM-generated compare/contrast, in case that's helpful:
Analysis of Current Consolidation Effort
Current State
The diff shows an attempt to consolidate primitive type handling by:
ArrowToVariant
trait - A zero-cost trait that both modules can usedefine_arrow_to_variant
macro - A simpler macro for basic type conversionsshared_timestamp_to_variant
functionKey Differences Between Approaches
arrow_to_variant.rs (Cast semantics)
define_row_builder!
- Complex, feature-rich macro supporting:CastOptions
,scale
)Option<T>
return typesT::Native: Into<Variant>
constraintVariant::Binary
unshred_variant.rs (Unshred semantics)
define_arrow_to_variant!
- Simpler macro supporting:ArrowToVariant
traitVariant::Uuid
Areas of Redundancy
1. Primitive Type Implementations
Both modules have nearly identical logic for:
2. Enum Variants and Match Arms
Both have large enums with similar variants:
3. Factory Pattern Logic
Both have similar DataType matching logic to create appropriate builders.
Semantic Differences That Prevent Full Sharing
FixedSizeBinary Handling:
Variant::Binary
Variant::Uuid
Error Handling Philosophy:
Variant::Null
Decimal Support:
Method Signatures:
append_row(builder, index)
append_row(builder, metadata, index)
- needs metadata for unshredded valuesThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is my updated suggestion (basically just update comments)