-
Notifications
You must be signed in to change notification settings - Fork 1k
[Variant] Implement new VariantValueArrayBuilder #8360
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
} | ||
|
||
/// Creates a builder-specific parent state | ||
pub fn parent_state<'a>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NOTE: This is pub
to compensate for the fact that VariantValueArrayBuilder
cannot impl VariantBuilderExt
(because the metadata builder is created on-demand instead of once up front). See PR description for details.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(also see the updated unit test for a realistic example of how to use it)
@alamb -- this (along with the other small PR posted today) is the final groundwork for a new It turns out that true shredding is just different enough from Implementing an actual shredding function also bypasses our current lack of an extension type (so there's actually no way to request shredding via |
Tracked as #8361 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @scovich -- this looks great to me
I had some small test suggestions, but otherwise this looks great to me
/// the value column. It's useful for scenarios like variant unshredding, data | ||
/// transformation, or filtering where you want to reuse existing metadata. | ||
/// | ||
/// The builder produces a [`BinaryViewArray`] that can be combined with existing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
makes sense
/// let mut builder = VariantValueArrayBuilder::new(10); | ||
/// | ||
/// // Append some values with their corresponding metadata | ||
/// // In practice, some of the variant values would be objects with internal metadata. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the metadata would be pre-existing (not "internal")?
/// // In practice, some of the variant values would be objects with internal metadata. | |
/// // In practice, some of the variant values would be objects with pre-existing metadata. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Eventually it would be nice to show an example of appending a pre-pre-existing Variant
value, but I don't think it is necessary here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I'm not sure I understand? Do you mean that I should call append_value
on the result of VariantArray::value
, instead of creating the variant values from thin air?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the metadata would be pre-existing (not "internal")?
I was trying to say that Variant::metadata()
might return Some, if it's a complex type (vs. these primitive values that have no metadata). I'll try rewording?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it would be nice to show an example of appending a pre-existing
Variant
value
Check out the updated unit test?
0a76ed6
to
f286c52
Compare
// TODO: We need a way to convert a Variant directly to bytes. In particular, we want to | ||
// just copy across the underlying value byte slice of a `Variant::Object` or | ||
// `Variant::List`, without any interaction with a `VariantMetadata` (because the shredding | ||
// spec requires us to reuse the existing metadata when unshredding). | ||
// | ||
// One could _probably_ emulate this with parquet_variant::VariantBuilder, but it would do a | ||
// lot of unnecessary work and would also create a new metadata column we don't need. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
impl WritableMetadataBuilder { | ||
/// Upsert field name to dictionary, return its ID | ||
fn upsert_field_name(&mut self, field_name: &str) -> u32 { | ||
pub fn upsert_field_name(&mut self, field_name: &str) -> u32 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It probably should have been public all along, but now it's needed.
parquet-variant/src/variant.rs
Outdated
/// | ||
/// Returns `Some(&VariantMetadata)` for object and list variants, | ||
pub fn metadata(&self) -> Option<&'m VariantMetadata<'_>> { | ||
pub fn metadata(&self) -> Option<&VariantMetadata<'m>> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This had the wrong lifetime specification, which led to weird borrow checker issues in this PR because the resulting VariantMetadata
was tied to the lifetime of self
instead of 'm
-- even if cloned.
CC @codephage2020 @klion26 -- you might be interested in this change |
I'll plan to merge this tomorrow unless @klion26 or @codephage2020 would like additional time to review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
/// builder.append_value(Variant::from(42)); | ||
/// ``` | ||
pub fn append_value(&mut self, value: Variant<'_, '_>) { | ||
let metadata = value.metadata().cloned().unwrap_or(EMPTY_VARIANT_METADATA); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe not related to the current pr. Currently, we'll return None
for Variant::metadata()
if it's not object
or list
, do we need to return EMPTY_VARIANT_METDATA
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a great question!
Until now, that method was only used in tests, so it wasn't clear what the best semantics might be (plus, we didn't have the constant back then). I like the idea a lot tho -- it would eliminate the unwrap_or
(which my next PR also has to use).
} | ||
|
||
#[test] | ||
fn test_variant_value_array_builder_with_objects() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice test!
Thanks everyone -- I'll merge this PR to keep the code moving and we can keep iterating with future PRs. |
Which issue does this PR close?
shred_variant
function #8361Rationale for this change
There is currently no good way to populate a new variant array with variant values that reference an existing metadata, but that functionality is needed when transforming existing variant data (e.g. for shredding and unshredding operations).
What changes are included in this PR?
Add a new
VariantValueArrayBuilder
that does not try to create new metadata; instead, it wraps aReadOnlyMetadata
around theVariantMetadata
instance of theVariant
value being inserted. This takes advantage of the new genericParentState
capability.NOTE: The new array builder does not impl
VariantBuilderExt
because it does not have aMetadataBuilder
instance -- the instance is created on demand as part of the insertion itself. Instead, callers can directly invokeVariantValueArrayBuilder::parent_state()
. This approach avoids the considerable complexity of keeping an internal metadata column index in sync with whatever external indexing might produce the variant value to be appended. It also doesn't seem to matter -- I did some pathfinding of variant shredding (going from binary to shredded variant based on some target schema), and theVariantBuilderExt
does not seem especially helpful for that code.Are these changes tested?
New unit tests.
Are there any user-facing changes?
New class.