[SPARK-56816][PYTHON][TESTS] Patch pandas UDF input-type coercion golden in memory for Pandas 3 by zhengruifeng · Pull Request #55793 · apache/spark

zhengruifeng · 2026-05-11T07:50:08Z

What changes were proposed in this pull request?

In PandasUDFInputTypeTests._compare_or_generate_golden, after loading the golden CSV, replace 'object' with 'str' in the Python Type column for rows whose Spark Type is string, but only when running under Pandas >= 3.0. The golden file on disk is unchanged.
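
The in-memory patch described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the helper name `patch_golden_rows`, the use of the stdlib `csv` module, the explicit `pandas_version` argument, and the sample rows are all assumptions made for the example. Only the `Spark Type` and `Python Type` column names and the `'object'` to `'str'` replacement come from the description.

```python
import csv
import io

# "Spark Type" and "Python Type" are the column names given in the PR text.
# The other columns and the second row are made up for illustration and are
# not taken from the real golden file.
GOLDEN_CSV = """Test Case,Spark Type,Spark Value,Python Type
string_values,string,"['abc', '', 'hello']","['object', 'object', 'object']"
other_values,binary,"[b'a', b'b']","['object', 'object']"
"""


def patch_golden_rows(rows, pandas_version):
    """Hypothetical helper: adjust loaded golden rows for the running pandas.

    Returns the rows untouched for pandas < 3.0. For pandas >= 3.0 it returns
    copies in which 'object' is replaced by 'str' in the "Python Type" column,
    but only for rows whose "Spark Type" is "string".
    """
    major = int(pandas_version.split(".")[0])
    if major < 3:
        return rows  # no-op branch: older pandas still reports 'object'

    patched = []
    for row in rows:
        row = dict(row)  # copy, so the loaded golden data is not mutated
        if row["Spark Type"] == "string":
            row["Python Type"] = row["Python Type"].replace("'object'", "'str'")
        patched.append(row)
    return patched


# Load the golden data into memory; nothing is written back to disk.
rows = list(csv.DictReader(io.StringIO(GOLDEN_CSV)))

for row in patch_golden_rows(rows, "3.0.2"):
    print(row["Test Case"], row["Python Type"])
```

Filtering on `Spark Type` keeps the replacement from touching rows of other types that may legitimately record `'object'`, and returning copies leaves the loaded golden data as it was read.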

Why are the changes needed?

The daily-scheduled Build / Python-only (master, Python 3.12, Pandas 3) workflow is failing test_pandas_input_type_coercion_vanilla:

line mismatch: expects ['string_values', 'string', "['abc', '', 'hello']", "['object', 'object', 'object']", ...] but got [..., "['str', 'str', 'str']", ...]
line mismatch: expects ['string_null', 'string', "[None, 'test']", "['object', 'object']", ...] but got [..., "['str', 'str']", ...]

Pandas 3.0 changed the default representation of string columns from numpy object dtype to the dedicated str dtype, so the pandas UDF that records str(series.dtype) now reports 'str' for string inputs and the recorded Python Type column no longer matches the golden. The values themselves are unchanged.
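
The dtype difference can be observed directly with a small script like the one below. It is only a sketch: `record_dtype` is a hypothetical stand-in for the recording step inside the test's pandas UDF, and the printed dtype name depends on the installed pandas version (per the description above, `'object'` before 3.0 and `'str'` under 3.0).

```python
import pandas as pd


def record_dtype(series: pd.Series) -> str:
    # Hypothetical stand-in for what the UDF records: the dtype name as text.
    return str(series.dtype)


strings = pd.Series(["abc", "", "hello"])
dtype_name = record_dtype(strings)

# The dtype name varies with the pandas version; the values do not.
print(pd.__version__, dtype_name, strings.tolist())
```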

See https://github.com/apache/spark/actions/runs/25611987987/job/75184547983 for the failing run.

Does this PR introduce any user-facing change?

No. Test-only change.

How was this patch tested?

Ran the full test module under a Pandas 3 conda env (Python 3.13.12, Pandas 3.0.2, PyArrow 23.0.1, NumPy 2.4.3):

$ python/run-tests --testnames "pyspark.sql.tests.coercion.test_pandas_udf_input_type"
Tests passed in 17 seconds

The Pandas < 3.0 branch is a no-op, so existing behaviour is preserved.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (claude-opus-4-7)

…emory for Pandas 3

In `_compare_or_generate_golden`, when running under Pandas >= 3.0, replace 'object' with 'str' in the Python Type column for rows whose Spark Type is `string`. Pandas 3 reports the dedicated `str` dtype for string columns where earlier versions reported `object`, so the same golden CSV works under both versions without regenerating.

Generated-by: Claude Code (claude-opus-4-7)

HyukjinKwon · 2026-05-11T22:08:51Z

Merged to master and branch-4.x.

…den in memory for Pandas 3

### What changes were proposed in this pull request?

In `PandasUDFInputTypeTests._compare_or_generate_golden`, after loading the golden CSV, replace `'object'` with `'str'` in the `Python Type` column for rows whose `Spark Type` is `string`, but only when running under Pandas `>= 3.0`. The golden file on disk is unchanged.

### Why are the changes needed?

The daily-scheduled `Build / Python-only (master, Python 3.12, Pandas 3)` workflow is failing `test_pandas_input_type_coercion_vanilla`:

```
line mismatch: expects ['string_values', 'string', "['abc', '', 'hello']", "['object', 'object', 'object']", ...] but got [..., "['str', 'str', 'str']", ...]
line mismatch: expects ['string_null', 'string', "[None, 'test']", "['object', 'object']", ...] but got [..., "['str', 'str']", ...]
```

Pandas 3.0 changed the default representation of string columns from numpy `object` dtype to the dedicated `str` dtype, so the pandas UDF that records `str(series.dtype)` now reports `'str'` for string inputs and the recorded `Python Type` column no longer matches the golden. The values themselves are unchanged.

See https://github.com/apache/spark/actions/runs/25611987987/job/75184547983 for the failing run.

### Does this PR introduce _any_ user-facing change?

No. Test-only change.

### How was this patch tested?

Ran the full test module under a Pandas 3 conda env (Python 3.13.12, Pandas 3.0.2, PyArrow 23.0.1, NumPy 2.4.3):

```
$ python/run-tests --testnames "pyspark.sql.tests.coercion.test_pandas_udf_input_type"
Tests passed in 17 seconds
```

The Pandas `< 3.0` branch is a no-op, so existing behaviour is preserved.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (claude-opus-4-7)

Closes #55793 from zhengruifeng/pandas3-pandas-udf-input-type-coercion.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 5d21070)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>

zhengruifeng marked this pull request as ready for review May 11, 2026 11:26

zhengruifeng changed the title ~~[WIP][PYTHON][TESTS] Patch pandas UDF input-type coercion golden in memory for Pandas 3~~ [SPARK-56816][PYTHON][TESTS] Patch pandas UDF input-type coercion golden in memory for Pandas 3 May 11, 2026

zhengruifeng requested a review from HyukjinKwon May 11, 2026 11:26

Yicong-Huang approved these changes May 11, 2026

HyukjinKwon approved these changes May 11, 2026

HyukjinKwon closed this in 5d21070 May 11, 2026

zhengruifeng deleted the pandas3-pandas-udf-input-type-coercion branch May 11, 2026 23:44

zhengruifeng wants to merge 1 commit into apache:master from zhengruifeng:pandas3-pandas-udf-input-type-coercion
