
[Arrow] Add API to check if Field has a valid ExtensionType #9677

Open
sdf-jkl wants to merge 4 commits into apache:main from sdf-jkl:field-extension-type

Conversation

@sdf-jkl
Contributor

@sdf-jkl sdf-jkl commented Apr 8, 2026

Which issue does this PR close?

Rationale for this change

See the linked issue.

What changes are included in this PR?

  • Added a has_valid_extension_type API to Field
  • Added a unit test to pin down the behavior

Are these changes tested?

  • yes, unit test

Are there any user-facing changes?

@github-actions github-actions bot added the parquet (Changes to the parquet crate), arrow (Changes to the arrow crate), and parquet-variant (parquet-variant* crates) labels Apr 8, 2026
@sdf-jkl
Contributor Author

sdf-jkl commented Apr 8, 2026

@alamb @scovich I was looking at #8474 and the doc update in #8475.

The current recommendation in #8475 is good:

if field.extension_type_name() == Some(MyExtensionType::NAME) {
    if let Ok(extension_type) = field.try_extension_type::<MyExtensionType>() {
        // ...
    }
}

Checking the name first avoids the two name-related error paths in try_new_from_field_metadata (missing/mismatch).

For a full validity check, though, we currently still have to go through try_new_from_field_metadata, which means:

  1. deserialize_metadata(...)
  2. try_new(...)

Both may return Err, and both are part of validation today.
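
To make the two fallible steps concrete, here is a minimal, self-contained sketch. It does not use the arrow crates: `DataType`, `Field`, `Flag`, and `try_flag_from_field` are simplified, hypothetical stand-ins, and errors are plain `String`s rather than `ArrowError`. Only the two metadata key strings are the standard Arrow ones.

```rust
use std::collections::HashMap;

// Simplified stand-ins, NOT the real arrow-schema types.
#[allow(dead_code)]
#[derive(Debug, PartialEq)]
enum DataType {
    Int8,
    Utf8,
}

struct Field {
    data_type: DataType,
    metadata: HashMap<String, String>,
}

const NAME_KEY: &str = "ARROW:extension:name";
const METADATA_KEY: &str = "ARROW:extension:metadata";

// Hypothetical extension type whose storage must be Int8.
#[derive(Debug)]
struct Flag;

impl Flag {
    const NAME: &'static str = "example.flag";

    // Step 1: parse the serialized metadata (here: must be present and empty).
    fn deserialize_metadata(raw: Option<&str>) -> Result<(), String> {
        match raw {
            Some("") => Ok(()),
            other => Err(format!("expected empty metadata, found {other:?}")),
        }
    }

    // Step 2: construct the type, checking the storage data type.
    fn try_new(data_type: &DataType, _metadata: ()) -> Result<Self, String> {
        match data_type {
            DataType::Int8 => Ok(Flag),
            other => Err(format!("expected Int8, found {other:?}")),
        }
    }
}

// Name check first, then the two fallible steps in order.
fn try_flag_from_field(field: &Field) -> Result<Flag, String> {
    let name = field.metadata.get(NAME_KEY).map(String::as_str);
    if name != Some(Flag::NAME) {
        return Err(format!("expected extension name {}, found {name:?}", Flag::NAME));
    }
    let raw = field.metadata.get(METADATA_KEY).map(String::as_str);
    let metadata = Flag::deserialize_metadata(raw)?;
    Flag::try_new(&field.data_type, metadata)
}

// Small helper for building fields from string pairs.
fn make_field(data_type: DataType, pairs: &[(&str, &str)]) -> Field {
    let metadata = pairs
        .iter()
        .map(|(k, v)| (k.to_string(), v.to_string()))
        .collect();
    Field { data_type, metadata }
}
```

With this shape, a boolean check is just `try_flag_from_field(&field).is_ok()`, which still pays for both steps even though the constructed value is thrown away.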

I think a clean follow-up would be a dedicated validate API on ExtensionType, with a default implementation that simply delegates to try_new_from_field_metadata (or equivalent). That gives us a clearer API now without changing behavior, and leaves room for specialized implementations later if needed. For extensions that override it, this could also avoid allocation/work from constructing an ExtensionType when we only need validation.
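
A rough sketch of what that could look like, assuming a heavily simplified trait (the real `ExtensionType` trait has more methods, and the default here delegates to `try_new` for brevity). `Label` and `Flag` are hypothetical types: the first relies on the default, the second overrides `validate` with a plain storage check.

```rust
// Simplified stand-ins, NOT the real arrow-schema API.
#[derive(Debug, PartialEq)]
enum DataType {
    Int8,
    Utf8,
}

trait ExtensionType: Sized {
    type Metadata;

    fn try_new(data_type: &DataType, metadata: Self::Metadata) -> Result<Self, String>;

    // Default: validate by constructing the value and discarding it.
    fn validate(data_type: &DataType, metadata: Self::Metadata) -> Result<(), String> {
        Self::try_new(data_type, metadata).map(|_| ())
    }
}

// Hypothetical type that relies on the default `validate`.
#[allow(dead_code)]
struct Label {
    text: String,
}

impl ExtensionType for Label {
    type Metadata = String;

    fn try_new(data_type: &DataType, metadata: String) -> Result<Self, String> {
        match data_type {
            DataType::Utf8 => Ok(Label { text: metadata }),
            other => Err(format!("Label expects Utf8, found {other:?}")),
        }
    }
}

// Hypothetical type that overrides `validate` with a plain storage check.
struct Flag;

impl ExtensionType for Flag {
    type Metadata = ();

    // Construction reuses the check instead of the other way around.
    fn try_new(data_type: &DataType, metadata: ()) -> Result<Self, String> {
        Self::validate(data_type, metadata).map(|()| Flag)
    }

    fn validate(data_type: &DataType, _metadata: ()) -> Result<(), String> {
        match data_type {
            DataType::Int8 => Ok(()),
            other => Err(format!("Flag expects Int8, found {other:?}")),
        }
    }
}
```

Existing implementors keep compiling because `validate` has a default body; only types that care about skipping construction need to override it.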

@sdf-jkl
Contributor Author

sdf-jkl commented Apr 8, 2026

I could add ExtensionType::validate here

@sdf-jkl
Contributor Author

sdf-jkl commented Apr 8, 2026

It was easy to do right away here.

Now, instead of building the ExtensionType, we only run the simple validation steps from ExtensionType::try_new.

For these types (Bool8, Json, Uuid, Opaque, TimestampWithOffset, VariantType, RowNumber, RowGroupIndex) it's a simple DataType check.

FixedShapeTensor and VariableShapeTensor don't implement validate yet, but they don't really need it since we don't do a boolean check for them anywhere.

Contributor

@scovich scovich left a comment

Generally LGTM, but one design question.

/// The default implementation delegates to [`Self::try_new`]. Extension
/// types may override this to validate without constructing `Self`.
fn validate(data_type: &DataType, metadata: Self::Metadata) -> Result<(), ArrowError> {
    Self::try_new(data_type, metadata).map(|_| ())

Contributor

I fear we have an API clash here:

  • supports_data_type receives &self, in order to allow configurable extension types based on their metadata
  • validate receives metadata but can only use it by instantiating Self (which requires the very allocation we wanted to avoid).

Ideally, supports_data_type should be implemented in terms of validate instead... but I guess that would be a breaking change?

The next best option would be to chase down every extension type that actually has metadata, and implement the two methods in the correct direction.

Does any impl in the arrow crate actually use this provided method? Or is it just a safety net so that third-party impls avoid a breaking change?
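
One way to read "the correct direction", sketched with a hypothetical `FixedWidthId` type whose metadata is a byte width. This is not code from the PR: the types are simplified stand-ins, errors are `String`s, and the metadata is assumed to be a `Copy` value so that passing it by value costs nothing.

```rust
// Simplified stand-ins, NOT the real arrow-schema API.
#[derive(Debug, PartialEq)]
enum DataType {
    Int8,
    FixedSizeBinary(usize),
}

// Hypothetical configurable extension type: its metadata fixes the byte width.
#[derive(Debug)]
struct FixedWidthId {
    byte_width: usize,
}

impl FixedWidthId {
    // Primary check: needs only the metadata, never an instance.
    fn validate(data_type: &DataType, byte_width: usize) -> Result<(), String> {
        match data_type {
            DataType::FixedSizeBinary(n) if *n == byte_width => Ok(()),
            other => Err(format!(
                "expected FixedSizeBinary({byte_width}), found {other:?}"
            )),
        }
    }

    // Instance-level check, expressed in terms of `validate`.
    fn supports_data_type(&self, data_type: &DataType) -> Result<(), String> {
        Self::validate(data_type, self.byte_width)
    }

    // Construction validates first, then builds the value.
    fn try_new(data_type: &DataType, byte_width: usize) -> Result<Self, String> {
        Self::validate(data_type, byte_width)?;
        Ok(Self { byte_width })
    }
}
```

Here the metadata-only check is the single source of truth, and both the `&self` method and the constructor delegate to it. For metadata that is expensive to clone, a by-reference parameter would be needed to keep the same property.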

Contributor Author

@sdf-jkl sdf-jkl Apr 9, 2026

It's just a safety net for now. No extension type uses the default validate impl.

let ext_metadata = self
    .metadata()
    .get(EXTENSION_TYPE_METADATA_KEY)
    .map(|s| s.as_str());

Contributor

nit?

Suggested change
- .map(|s| s.as_str());
+ .as_deref();

Contributor Author

.as_deref() converts Option<String> → Option<&str>.

Here we have an Option<&String> that needs to become Option<&str>, so .as_deref() would not give us a &str.

Could do:

- .map(|s| s.as_str());
+ .map(String::as_str);
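
A small std-only sketch of the difference, assuming a plain `HashMap<String, String>` in place of the field metadata. `lookup` and `lookup_owned` are hypothetical helper names, and `KEY` stands in for `EXTENSION_TYPE_METADATA_KEY`.

```rust
use std::collections::HashMap;

// Stand-in for EXTENSION_TYPE_METADATA_KEY (assumed value, for illustration).
const KEY: &str = "ARROW:extension:metadata";

fn lookup(metadata: &HashMap<String, String>) -> Option<&str> {
    // `get` yields Option<&String>; mapping `String::as_str` gives Option<&str>.
    metadata.get(KEY).map(String::as_str)
}

fn lookup_owned(value: &Option<String>) -> Option<&str> {
    // `as_deref` fits here because the Option holds an owned String.
    value.as_deref()
}
```

Calling `.as_deref()` on the `Option<&String>` from `get` would only peel one reference layer and yield `Option<&String>` again, which is why the `map` form is needed in this spot.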


Development

Successfully merging this pull request may close these issues.

[Arrow] Consider faster way to check if a Field has an extension type

2 participants