Preserve PyArrow extension metadata when chaining Python scalar UDFs #1287
base: main
Conversation
Enhance scalar UDF definitions to retain Arrow Field information, including extension metadata, in DataFusion. Normalize Python UDF signatures to accept pyarrow.Field objects, ensuring metadata survives the Rust bindings roundtrip. Add a regression test for UUID-backed UDFs to verify that the second UDF correctly receives a pyarrow.ExtensionArray, preventing past metadata loss.
Wrap scalar UDF inputs/outputs to maintain extension types during execution. Enhance UUID extension regression test to ensure metadata retention and normalize results for accurate comparison.
…concrete Callable[..., Any] and pa.DataType | pa.Field annotations, removing the lingering references to the deleted _R type variable.
… checkers surface support for pyarrow.Field inputs when defining UDFs
Introduce a shared alias for PyArrowArray and update the extension wrapping helpers to ensure scalar UDF return types are preserved when handling PyArrow arrays. Enhance ScalarUDF signatures, overloads, and documentation to align with the PyArrow array contract for Python scalar UDFs.
Implement a feature flag to check for UUID helper in pyarrow. Add conditional skip to the UUID extension UDF chaining test when the helper is unavailable, retaining original assertions for supported environments.
Ensure collected UUID results are extension arrays or chunked arrays with the UUID extension type before comparison to expected values, preserving end-to-end metadata validation.
Return a wrapped empty extension array for chunked storage arrays with no chunks, preserving extension metadata. Expand UUID UDF regression to support chunked inputs, test empty chunked returns, and ensure UUID extension type remains intact through UDF chaining.
…d improve type-checking
I'll take another look at it later. I found the text in user_defined.py hard to understand, and a lot of it looks like LLM-generated text oriented more toward a developer than a user. I noted a couple of things.
I'll take another look when I can dedicate some more time to understand what is going on here.
src/udf.rs
Outdated
```rust
}

fn return_type(&self, _arg_types: &[DataType]) -> datafusion::error::Result<DataType> {
    Ok(self.return_field.data_type().clone())
}
```
Best practice is to not implement this method. Per the upstream DataFusion documentation:

> If you provide an implementation for `Self::return_field_from_args`, DataFusion will not call `return_type` (this function). In such cases it is recommended to return `DataFusionError::Internal`.
Omitting the implementation prevents compilation:
```
error[E0046]: not all trait items implemented, missing: `return_type`
   --> src/udf.rs:124:1
    |
124 | impl ScalarUDFImpl for PySimpleScalarUDF {
    | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ missing `return_type` in implementation
```
I amended it to return:

```rust
Err(DataFusionError::Internal(
    "return_type should be unreachable when return_field_from_args is implemented"
        .to_string(),
))
```
python/datafusion/user_defined.py
Outdated
```
This list must be of the same length as the number of arguments. Pass
:class:`pyarrow.Field` instances to preserve extension metadata.
return_type (pa.DataType | pa.Field): The return type of the function. Use a
:class:`pyarrow.Field` to preserve metadata on extension arrays.
```
I think this comment is misleading. I do not think there is any guarantee that the output field metadata will be preserved; instead, this should be described as the way to set output metadata. It is entirely possible for a UDF implemented like this to still lose metadata, for example when it consumes metadata on the input side but emits a different kind of metadata on the output side.
Amended.
python/tests/test_udf.py
Outdated
```python
@pytest.mark.skipif(
    not UUID_EXTENSION_AVAILABLE,
    reason="PyArrow uuid extension helper unavailable",
)
```
Can we make the uuid extension a requirement in our developer dependencies?
Removed @pytest.mark.skipif
…ty check from tests
Ensure return_field_from_args is the only metadata source by having PySimpleScalarUDF::return_type raise an internal error. This aligns with DataFusion guidance. Enhance Python UDF helper documentation to clarify how callers can declare extension metadata on both arguments and results.
```toml
dev = [
    "maturin>=1.8.1",
    "numpy>1.25.0",
    "pyarrow>=19.0.0",
```
This is the change added to ensure pa.uuid() is available for test_udf.py.
PyArrow 19.0 is the lowest version that provides `pyarrow.uuid` (https://arrow.apache.org/docs/19.0/python/generated/pyarrow.uuid.html).
The rest are VSCode automatic formatting.
```
Usage:
- As a function: ``udaf(accum, input_types, return_type, state_type,``
  ``volatility, name)``.
- As a decorator: ``@udaf(input_types, return_type, state_type,``
  ``volatility, name)``.
When using ``udaf`` as a decorator, do not pass ``accum`` explicitly.
```
It looks like the formatting got changed. Is this intentional?
Which issue does this PR close?
Closes #1172
Rationale for this change
When multiple scalar UDFs are chained in Python, the intermediate results lose PyArrow extension metadata.
This happens because the existing binding only passed `arrow::datatypes::DataType` to Rust's `ScalarUDF`, discarding extension information embedded in `pyarrow.Field`. This patch ensures that DataFusion's Python UDF layer preserves the complete field metadata, allowing extension arrays (e.g. `arrow.uuid`, custom logical types) to survive round-trips between Python and Rust.

What changes are included in this PR?
🔧 Python (`python/datafusion/user_defined.py`)
- Introduced `PyArrowArray` and `PyArrowArrayT` aliases for unified typing of `Array` and `ChunkedArray`.
- Added normalization utilities: `_normalize_field`, `_normalize_input_fields`, `_normalize_return_field`.
- Added `_wrap_extension_value` and `_wrap_udf_function` to automatically re-wrap extension arrays on UDF input/output.
- Updated the `ScalarUDF` constructor and decorator overloads to accept both `pa.Field` and `pa.DataType` objects.
- Ensured `ScalarUDF` passes fully qualified `Field` objects (with metadata) to the internal layer.

🧰 Rust (`src/udf.rs`)
- Added a new `PySimpleScalarUDF` implementing `ScalarUDFImpl`:
  - Uses `arrow::datatypes::Field` for inputs and return values.
  - Implements `return_field_from_args` to keep field names and extension metadata.
- Updated the PyO3 binding to accept and expose `Vec<Field>` instead of `Vec<DataType>`.
- Refactored construction to use `ScalarUDF::new_from_impl()`.

🤖 Tests (`python/tests/test_udf.py`)
- Added `test_uuid_extension_chain` verifying that chained UDFs preserve `arrow.uuid` arrays.

Are these changes tested?
✅ Yes.
The new test `test_uuid_extension_chain` explicitly covers metadata retention through chained UDFs, including chunked inputs and empty chunked returns. Existing decorator and parameterized UDF tests remain intact and continue to pass.
Are there any user-facing changes?
Yes — enhanced behavior for PyArrow extension arrays in Python UDFs.
input_typesandreturn_typeas eitherpa.DataTypeorpa.Field.arrow.uuid, custom registered extensions).No breaking API changes are introduced — the update is fully backward-compatible while extending functionality.