Skip to content

Conversation

scovich
Copy link
Contributor

@scovich scovich commented Sep 16, 2025

Which issue does this PR close?

Rationale for this change

There is currently no good way to populate a new variant array with variant values that reference an existing metadata, but that functionality is needed when transforming existing variant data (e.g. for shredding and unshredding operations).

What changes are included in this PR?

Add a new VariantValueArrayBuilder that does not try to create new metadata; instead, it wraps a ReadOnlyMetadata around the VariantMetadata instance of the Variant value being inserted. This takes advantage of the new generic ParentState capability.

NOTE: The new array builder does not impl VariantBuilderExt because it does not have a MetadataBuilder instance -- the instance is created on demand as part of the insertion itself. Instead, callers can directly invoke VariantValueArrayBuilder::parent_state(). This approach avoids the considerable complexity of keeping an internal metadata column index in sync with whatever external indexing might produce the variant value to be appended. It also doesn't seem to matter -- I did some pathfinding of variant shredding (going from binary to shredded variant based on some target schema), and the VariantBuilderExt does not seem especially helpful for that code.

Are these changes tested?

New unit tests.

Are there any user-facing changes?

New class.

@github-actions github-actions bot added the parquet-variant parquet-variant* crates label Sep 16, 2025
}

/// Creates a builder-specific parent state
pub fn parent_state<'a>(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NOTE: This is pub to compensate for the fact that VariantValueArrayBuilder cannot impl VariantBuilderExt (because the metadata builder is created on-demand instead of once up front). See PR description for details.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(also see the updated unit test for a realistic example of how to use it)

@scovich
Copy link
Contributor Author

scovich commented Sep 16, 2025

@alamb -- this (along with the other small PR posted today) is the final groundwork for a new shred_variant function I've been prototyping. That function takes a binary variant as input, along with a target shredding schema (such as a primitive or struct). It converts the requested schema into an actual physical shredding schema with value and typed_value columns, and then attempts to shred the data into a new variant array with that schema.

It turns out that true shredding is just different enough from variant_get that it's worth implementing separately.

Implementing an actual shredding function also bypasses our current lack of an extension type (so there's actually no way to request shredding via variant_get).

@scovich
Copy link
Contributor Author

scovich commented Sep 16, 2025

this (along with the other small PR posted today) is the final groundwork for a new shred_variant function I've been prototyping. That function takes a binary variant as input, along with a target shredding schema (such as a primitive or struct). It converts the requested schema into an actual physical shredding schema with value and typed_value columns, and then attempts to shred the data into a new variant column with that schema.

Tracked as #8361

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @scovich -- this looks great to me

I had some small test suggestions, but otherwise this looks great to me

/// the value column. It's useful for scenarios like variant unshredding, data
/// transformation, or filtering where you want to reuse existing metadata.
///
/// The builder produces a [`BinaryViewArray`] that can be combined with existing
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sense

/// let mut builder = VariantValueArrayBuilder::new(10);
///
/// // Append some values with their corresponding metadata
/// // In practice, some of the variant values would be objects with internal metadata.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the metadata would be pre-existing (not "internal")?

Suggested change
/// // In practice, some of the variant values would be objects with internal metadata.
/// // In practice, some of the variant values would be objects with pre-existing metadata.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Eventually it would be nice to show an example of appending a pre-pre-existing Variant value, but I don't think it is necessary here

Copy link
Contributor Author

@scovich scovich Sep 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I'm not sure I understand? Do you mean that I should call append_value on the result of VariantArray::value, instead of creating the variant values from thin air?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the metadata would be pre-existing (not "internal")?

I was trying to say that Variant::metadata() might return Some, if it's a complex type (vs. these primitive values that have no metadata). I'll try rewording?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it would be nice to show an example of appending a pre-existing Variant value

Check out the updated unit test?

@scovich scovich force-pushed the variant-value-builder branch from 0a76ed6 to f286c52 Compare September 17, 2025 04:12
@scovich scovich requested a review from alamb September 17, 2025 04:13
Comment on lines -267 to -273
// TODO: We need a way to convert a Variant directly to bytes. In particular, we want to
// just copy across the underlying value byte slice of a `Variant::Object` or
// `Variant::List`, without any interaction with a `VariantMetadata` (because the shredding
// spec requires us to reuse the existing metadata when unshredding).
//
// One could _probably_ emulate this with parquet_variant::VariantBuilder, but it would do a
// lot of unnecessary work and would also create a new metadata column we don't need.
Copy link
Contributor Author

@scovich scovich Sep 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alamb -- I just realized while putting up #8366 that the new VariantValueArrayBuilder is perfect for fixing this TODO (in addition to supporting shredding).

So now it has an immediate use, instead of only being a building block for the next PR.

impl WritableMetadataBuilder {
/// Upsert field name to dictionary, return its ID
fn upsert_field_name(&mut self, field_name: &str) -> u32 {
pub fn upsert_field_name(&mut self, field_name: &str) -> u32 {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It probably should have been public all along, but now it's needed.

///
/// Returns `Some(&VariantMetadata)` for object and list variants,
pub fn metadata(&self) -> Option<&'m VariantMetadata<'_>> {
pub fn metadata(&self) -> Option<&VariantMetadata<'m>> {
Copy link
Contributor Author

@scovich scovich Sep 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This had the wrong lifetime specification, which led to weird borrow checker issues in this PR because the resulting VariantMetadata was tied to the lifetime of self instead of 'm -- even if cloned.

@scovich
Copy link
Contributor Author

scovich commented Sep 17, 2025

CC @codephage2020 @klion26 -- you might be interested in this change

@alamb
Copy link
Contributor

alamb commented Sep 17, 2025

I'll plan to merge this tomorrow unless @klion26 or @codephage2020 would like additional time to review

Copy link
Member

@klion26 klion26 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

/// builder.append_value(Variant::from(42));
/// ```
pub fn append_value(&mut self, value: Variant<'_, '_>) {
let metadata = value.metadata().cloned().unwrap_or(EMPTY_VARIANT_METADATA);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe not related to the current pr. Currently, we'll return None for Variant::metadata() if it's not object or list, do we need to return EMPTY_VARIANT_METDATA?

Copy link
Contributor Author

@scovich scovich Sep 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a great question!

Until now, that method was only used in tests, so it wasn't clear what the best semantics might be (plus, we didn't have the constant back then). I like the idea a lot tho -- it would eliminate the unwrap_or (which my next PR also has to use).

}

#[test]
fn test_variant_value_array_builder_with_objects() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice test!

@alamb
Copy link
Contributor

alamb commented Sep 18, 2025

Thanks everyone -- I'll merge this PR to keep the code moving and we can keep iterating with future PRs.

@alamb alamb merged commit f4840f6 into apache:main Sep 18, 2025
17 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet-variant parquet-variant* crates
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants