[SPARK-56816][PYTHON][TESTS] Patch pandas UDF input-type coercion golden in memory for Pandas 3 #55793

Closed
zhengruifeng wants to merge 1 commit into apache:master from zhengruifeng:pandas3-pandas-udf-input-type-coercion

@zhengruifeng
Contributor

What changes were proposed in this pull request?

In `PandasUDFInputTypeTests._compare_or_generate_golden`, after loading the golden CSV, replace `'object'` with `'str'` in the `Python Type` column for rows whose `Spark Type` is `string`, but only when running under Pandas `>= 3.0`. The golden file on disk is unchanged.
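
The in-memory patch described above can be sketched roughly as follows. This is a standalone illustration, not the code in the PR: it uses the standard-library `csv` module instead of the test's own loader, and the helper names (`load_golden`, `patch_golden_rows`), the `Test Case` column, the `binary_values` row, and the explicit `pandas_version` parameter are all assumptions made for the example. Only the `Spark Type` and `Python Type` column names and the two `string` rows come from the description.

```python
import csv
import io

# Illustrative stand-in for the golden CSV. The "Test Case" column name and
# the binary_values row are hypothetical; they are here only to show that
# non-string rows are left alone.
GOLDEN_CSV = '''Test Case,Spark Type,Python Type
string_values,string,"['object', 'object', 'object']"
string_null,string,"['object', 'object']"
binary_values,binary,"['object', 'object']"
'''


def load_golden(text):
    """Parse the golden CSV text into a list of row dicts."""
    return list(csv.DictReader(io.StringIO(text)))


def patch_golden_rows(rows, pandas_version):
    """Return golden rows adjusted for the given pandas version.

    Under pandas < 3.0 the rows are returned untouched (no-op). Under
    pandas >= 3.0, 'object' becomes 'str' in the Python Type column, but
    only for rows whose Spark Type is string. The input is never mutated.
    """
    major = int(pandas_version.split(".")[0])
    if major < 3:
        return rows
    patched = []
    for row in rows:
        row = dict(row)  # copy so the loaded golden stays as it is on disk
        if row["Spark Type"] == "string":
            row["Python Type"] = row["Python Type"].replace("'object'", "'str'")
        patched.append(row)
    return patched


golden = load_golden(GOLDEN_CSV)
for row in patch_golden_rows(golden, "3.0.2"):
    print(row["Test Case"], row["Python Type"])
```

Passing the version in as a string keeps the sketch testable on any machine; the real test presumably reads the installed pandas version instead.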

Why are the changes needed?

The daily-scheduled `Build / Python-only (master, Python 3.12, Pandas 3)` workflow is failing `test_pandas_input_type_coercion_vanilla`:

line mismatch: expects ['string_values', 'string', "['abc', '', 'hello']", "['object', 'object', 'object']", ...] but got [..., "['str', 'str', 'str']", ...]
line mismatch: expects ['string_null', 'string', "[None, 'test']", "['object', 'object']", ...] but got [..., "['str', 'str']", ...]

Pandas 3.0 changed the default representation of string columns from numpy `object` dtype to the dedicated `str` dtype, so the pandas UDF that records `str(series.dtype)` now reports `'str'` for string inputs and the recorded `Python Type` column no longer matches the golden. The values themselves are unchanged.

See https://github.com/apache/spark/actions/runs/25611987987/job/75184547983 for the failing run.
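
The dtype change behind the mismatch can be reproduced outside Spark with a few lines of plain pandas. This is a minimal sketch that assumes pandas is installed; `dtype_name` is a hypothetical helper that mirrors the `str(series.dtype)` call mentioned above, not the UDF used by the test.

```python
import pandas as pd


def dtype_name(values):
    """Return the string form of the dtype pandas infers for the values."""
    return str(pd.Series(values).dtype)


# Same inputs as the two failing golden rows.
observed_values = dtype_name(["abc", "", "hello"])
observed_null = dtype_name([None, "test"])

# Per the description above: 'object' before Pandas 3.0, 'str' from 3.0 on.
print(pd.__version__, observed_values, observed_null)
```

Running this in a Pandas 2 and a Pandas 3 environment side by side should show the same values with a different dtype name, which is the only difference the golden patch has to absorb.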

Does this PR introduce any user-facing change?

No. Test-only change.

How was this patch tested?

Ran the full test module under a Pandas 3 conda env (Python 3.13.12, Pandas 3.0.2, PyArrow 23.0.1, NumPy 2.4.3):

$ python/run-tests --testnames "pyspark.sql.tests.coercion.test_pandas_udf_input_type"
Tests passed in 17 seconds

The Pandas `< 3.0` branch is a no-op, so existing behaviour is preserved.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (claude-opus-4-7)

…emory for Pandas 3

In `_compare_or_generate_golden`, when running under Pandas >= 3.0,
replace 'object' with 'str' in the Python Type column for rows whose
Spark Type is `string`. Pandas 3 reports the dedicated `str` dtype for
string columns where earlier versions reported `object`, so the same
golden CSV works under both versions without regenerating.

Generated-by: Claude Code (claude-opus-4-7)
@zhengruifeng zhengruifeng marked this pull request as ready for review May 11, 2026 11:26
@zhengruifeng zhengruifeng changed the title from "[WIP][PYTHON][TESTS] Patch pandas UDF input-type coercion golden in memory for Pandas 3" to "[SPARK-56816][PYTHON][TESTS] Patch pandas UDF input-type coercion golden in memory for Pandas 3" May 11, 2026
@zhengruifeng zhengruifeng requested a review from HyukjinKwon May 11, 2026 11:26
@HyukjinKwon
Member

Merged to master and branch-4.x.

HyukjinKwon pushed a commit that referenced this pull request May 11, 2026
…den in memory for Pandas 3

Closes #55793 from zhengruifeng/pandas3-pandas-udf-input-type-coercion.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 5d21070)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
@zhengruifeng zhengruifeng deleted the pandas3-pandas-udf-input-type-coercion branch May 11, 2026 23:44