[SPARK-56166][PYTHON] Use ArrowBatchTransformer.enforce_schema to replace column-wise type coercion logic#54967
Conversation
…lace manual type coercion logic

### What changes were proposed in this pull request?

Replace manual column-by-column type coercion with `ArrowBatchTransformer.enforce_schema` in three places:

1. `ArrowStreamArrowUDTFSerializer.apply_type_coercion` in serializers.py
2. `ArrowStreamArrowUDFSerializer.create_batch` in serializers.py
3. `process_results` in worker.py (scalar Arrow iter UDF path)

Also:

- Add `arrow_cast` parameter to `enforce_schema` for strict type matching mode
- Add `KeyError` handling in `enforce_schema` for missing columns with a user-friendly error
- Remove now-unused `coerce_arrow_array` imports from serializers.py and worker.py

### Why are the changes needed?

These three places duplicated the same coerce-and-reassemble logic that `enforce_schema` already provides. Consolidating reduces code duplication and ensures consistent error handling.

### Does this PR introduce _any_ user-facing change?

Error messages for type/schema mismatches in Arrow UDTFs are slightly changed to be consistent with other Arrow UDF error messages.

### How was this patch tested?

Existing tests in `test_arrow_udtf.py` and `test_arrow_udf_scalar.py`.

### Was this patch authored or co-authored using generative AI tooling?

Yes.
```python
# If so, use index-based access (faster than name lookup).
batch_names = [batch.schema.field(i).name for i in range(batch.num_columns)]
target_names = [field.name for field in arrow_schema]
use_index = batch_names == target_names
```
on use_index, do we need to take care of nested cases?
Good question! I believe the answer is no: use_index only controls top-level column lookup. When the column names are in the same order, it uses the positional index for speed; otherwise it falls back to name-based lookup.
Nested type coercion is handled by pa.Array.cast(), which Arrow applies recursively, so nested struct/list fields are covered regardless of the use_index path. I added a unit test (test_enforce_schema_nested_cast) to verify this.
zhengruifeng left a comment
Overall this is a clean refactoring that consolidates three duplicated coerce-and-reassemble code paths into enforce_schema. The coerce_arrow_array removal is complete (no dangling references), the keyword-only signature change is safe for all callers, and the new unit tests cover the added parameters well. One minor nit inline.
merged to master