Preserve PyArrow extension metadata when chaining Python scalar UDFs #1287
base: main
Conversation
Enhance scalar UDF definitions to retain Arrow Field information, including extension metadata, in DataFusion. Normalize Python UDF signatures to accept pyarrow.Field objects, ensuring metadata survives the Rust bindings roundtrip. Add a regression test for UUID-backed UDFs to verify that the second UDF correctly receives a pyarrow.ExtensionArray, preventing past metadata loss.
Wrap scalar UDF inputs/outputs to maintain extension types during execution. Enhance UUID extension regression test to ensure metadata retention and normalize results for accurate comparison.
…concrete Callable[..., Any] and pa.DataType | pa.Field annotations, removing the lingering references to the deleted _R type variable.
… checkers surface support for pyarrow.Field inputs when defining UDFs
Introduce a shared alias for PyArrowArray and update the extension wrapping helpers to ensure scalar UDF return types are preserved when handling PyArrow arrays. Enhance ScalarUDF signatures, overloads, and documentation to align with the PyArrow array contract for Python scalar UDFs.
Implement a feature flag to check for UUID helper in pyarrow. Add conditional skip to the UUID extension UDF chaining test when the helper is unavailable, retaining original assertions for supported environments.
Ensure collected UUID results are extension arrays or chunked arrays with the UUID extension type before comparison to expected values, preserving end-to-end metadata validation.
Return a wrapped empty extension array for chunked storage arrays with no chunks, preserving extension metadata. Expand UUID UDF regression to support chunked inputs, test empty chunked returns, and ensure UUID extension type remains intact through UDF chaining.
…d improve type-checking
I'll take another look at it later. I found the text in user_defined.py hard to understand, and a lot of it looks like LLM-generated text oriented more toward a developer than a user. I noted a couple of things.
I'll take another look when I can dedicate some more time to understand what is going on here.
src/udf.rs
Outdated
```rust
}

fn return_type(&self, _arg_types: &[DataType]) -> datafusion::error::Result<DataType> {
    Ok(self.return_field.data_type().clone())
}
```
Best practice is to not implement this method. Per the upstream DataFusion documentation:

> If you provide an implementation for `Self::return_field_from_args`, DataFusion will not call `return_type` (this function). In such cases it is recommended to return `DataFusionError::Internal`.
Omitting the implementation prevents compilation:
```
error[E0046]: not all trait items implemented, missing: `return_type`
   --> src/udf.rs:124:1
    |
124 | impl ScalarUDFImpl for PySimpleScalarUDF {
    | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ missing `return_type` in implementation
```
I amended it to return:

```rust
Err(DataFusionError::Internal(
    "return_type should be unreachable when return_field_from_args is implemented"
        .to_string(),
))
```
python/datafusion/user_defined.py
Outdated
```
This list must be of the same length as the number of arguments. Pass
:class:`pyarrow.Field` instances to preserve extension metadata.
return_type (pa.DataType | pa.Field): The return type of the function. Use a
:class:`pyarrow.Field` to preserve metadata on extension arrays.
```
I think this comment is misleading. I do not think there is any guarantee that the output field metadata will be preserved; instead, this should be described as the way to set output metadata. It is entirely possible for a UDF implemented like this to still lose metadata, for example when it consumes metadata on the input side but emits a different kind of metadata on the output side.
Amended.
python/tests/test_udf.py
Outdated
```python
@pytest.mark.skipif(
    not UUID_EXTENSION_AVAILABLE,
    reason="PyArrow uuid extension helper unavailable",
)
```
Can we make the uuid extension a requirement in our developer dependencies?
Removed @pytest.mark.skipif
…ty check from tests
Ensure return_field_from_args is the only metadata source by having PySimpleScalarUDF::return_type raise an internal error. This aligns with DataFusion guidance. Enhance Python UDF helper documentation to clarify how callers can declare extension metadata on both arguments and results.
```toml
dev = [
    "maturin>=1.8.1",
    "numpy>1.25.0",
    "pyarrow>=19.0.0",
```
This is the change added to ensure pa.uuid() is available for test_udf.py.
PyArrow 19.0 is the lowest version that provides `pyarrow.uuid` (https://arrow.apache.org/docs/19.0/python/generated/pyarrow.uuid.html).
The rest are VSCode automatic formatting.
```
Usage:
- As a function: ``udaf(accum, input_types, return_type, state_type,``
  ``volatility, name)``.
- As a decorator: ``@udaf(input_types, return_type, state_type,``
  ``volatility, name)``.
When using ``udaf`` as a decorator, do not pass ``accum`` explicitly.
```
It looks like the formatting got changed. Is this intentional?
Which issue does this PR close?
Closes #1172
Rationale for this change
When multiple scalar UDFs are chained in Python, the intermediate results lose PyArrow extension metadata.
This happens because the existing binding only passed `arrow::datatypes::DataType` to Rust's `ScalarUDF`, discarding extension information embedded in `pyarrow.Field`. This patch ensures that DataFusion's Python UDF layer preserves the complete field metadata, allowing extension arrays (e.g. `arrow.uuid`, custom logical types) to survive round-trips between Python and Rust.

What changes are included in this PR?
🔧 Python (`python/datafusion/user_defined.py`)
- Introduced `PyArrowArray` and `PyArrowArrayT` aliases for unified typing of `Array` and `ChunkedArray`.
- Added normalization utilities: `_normalize_field`, `_normalize_input_fields`, `_normalize_return_field`.
- Added `_wrap_extension_value` and `_wrap_udf_function` to automatically re-wrap extension arrays on UDF input/output.
- Updated the `ScalarUDF` constructor and decorator overloads to accept both `pa.Field` and `pa.DataType` objects.
- Ensured `ScalarUDF` passes fully qualified `Field` objects (with metadata) to the internal layer.

🧰 Rust (`src/udf.rs`)
- Added a new `PySimpleScalarUDF` implementing `ScalarUDFImpl`:
  - Uses `arrow::datatypes::Field` for inputs and return values.
  - Implements `return_field_from_args` to keep field names and extension metadata.
- Updated the PyO3 binding to accept and expose `Vec<Field>` instead of `Vec<DataType>`.
- Refactored construction to use `ScalarUDF::new_from_impl()`.

🤖 Tests (`python/tests/test_udf.py`)
- Added `test_uuid_extension_chain` verifying that chained UDFs preserve `arrow.uuid` arrays.

Are these changes tested?
✅ Yes.
The new test `test_uuid_extension_chain` explicitly covers metadata retention through chained UDFs, including chunked inputs and empty chunked returns. Existing decorator and parameterized UDF tests remain intact and continue to pass.
Are there any user-facing changes?
Yes — enhanced behavior for PyArrow extension arrays in Python UDFs.
input_typesandreturn_typeas eitherpa.DataTypeorpa.Field.arrow.uuid, custom registered extensions).No breaking API changes are introduced — the update is fully backward-compatible while extending functionality.