[SPARK-56936][PYTHON][TESTS] Fix PandasUDFReturnTypeTests for pandas 3 #55974
Closed
zhengruifeng wants to merge 4 commits into
Conversation
…24.04 tzdata

- Switch the tz-aware fixture from the legacy alias `US/Eastern` to its canonical IANA name `America/New_York`. On Ubuntu 24.04 the system `tzdata` package no longer ships the legacy `US/*` aliases (those moved to `tzdata-legacy`), so under pandas >= 3.0 (which resolves tz via stdlib `zoneinfo` instead of bundled `pytz`), the previous fixture raised `ZoneInfoNotFoundError` in CI.
- Remap the loaded golden DataFrame in memory when running under pandas >= 3.0 so the pandas-2-generated golden columns still line up: `datetime64[ns]` -> `[us]` and `Categorical` categories `object` -> `str`. Only the column keys are remapped; the on-disk golden file is unchanged.

Generated-by: Claude Code
HyukjinKwon approved these changes May 19, 2026
Extend the pandas-3 in-memory adapter so the value comparisons also line up:

- Scale 13+ digit integers in cells of datetime64 / Timedelta-list columns by 1/1000. Pandas 3 returns microseconds where pandas 2 returned nanoseconds for the same cast, e.g. `bigint <- pd.date_range(...).values` flips from `86_400_000_000_000` to `86_400_000_000`.
- Override the single `decimal(10,0) x ['12','34']@list` cell, which flipped from `X` (pandas 2 errored) to `[Decimal('12'), Decimal('34')]` (pandas 3 succeeds).

Test now passes under both pandas 2.3.3 (spark-dev-313) and pandas 3.0.2 (spark-dev-313-p3) locally.

Generated-by: Claude Code
No behavior change. Folds `_patch_golden_for_pandas3` directly into the loader block where it is used, since it is only called once. Also replaces the local `re.sub` helper with `Series.str.replace(regex=True)` to drop the `import re`.

Generated-by: Claude Code
No behavior change. Use `self.repr_value(value)` and `self.repr_type(...)` to derive both rename and scale targets directly from `self.test_data` and the affected Spark type, instead of grep-matching the golden column names. A single loop over `test_data` builds both `rename` and `scale_cols`.

Generated-by: Claude Code
zhengruifeng added a commit that referenced this pull request May 20, 2026
### What changes were proposed in this pull request?
Make `pyspark.sql.tests.coercion.test_pandas_udf_return_type.PandasUDFReturnTypeTests` work under pandas >= 3.0 and on systems whose `tzdata` package no longer ships the legacy `US/*` aliases (e.g. Ubuntu 24.04 / noble).
1. **Switch the tz-aware fixture from `US/Eastern` to `America/New_York`.** The values returned by `pd.date_range(...).values` are identical for the two aliases (same zone, same DST rules), so the on-disk golden file does not need to be regenerated.
2. **Patch the loaded golden DataFrame in memory for pandas >= 3.0.** The golden file was generated under pandas 2 and the on-disk content is unchanged. At load time, when running under pandas >= 3.0, the test:
- Renames column keys whose representation differs between the two versions: datetime64 ndarrays default to `[us]` instead of `[ns]`, and `pd.Categorical` keeps `str`-dtyped categories instead of `object`.
- Scales 13+ digit integers in cells of datetime64 / Timedelta-list columns by 1/1000. Pandas 3 returns microseconds where pandas 2 returned nanoseconds for the same cast (e.g. `bigint <- pd.date_range(...).values` flips from `86_400_000_000_000` to `86_400_000_000`).
- Overrides the single `decimal(10,0) x ['12','34']@list` cell, which flipped from `X` (pandas 2 errored) to `[Decimal('12'), Decimal('34')]` (pandas 3 succeeds at the string -> Decimal coercion).
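
The rename and scale steps above can be sketched in plain Python. This is a minimal illustration, not the code in the PR: it models one golden row as a dict of string cells, and the key format (`a@datetime64[ns]`), the helper names, and the `scale_cols` argument are all assumptions made for the example. The PR itself operates on a pandas DataFrame and uses `Series.str.replace(regex=True)`.

```python
import re

# Integers of 13 or more digits are assumed to be nanosecond values.
_NS_INT = re.compile(r"\d{13,}")


def is_pandas3(version: str) -> bool:
    """Return True when the major component of a version string is >= 3."""
    return int(version.split(".")[0]) >= 3


def rename_key(key: str) -> str:
    """Map a pandas-2-shaped column key to its pandas-3 shape (hypothetical format)."""
    return key.replace("datetime64[ns]", "datetime64[us]")


def scale_ns_to_us(cell: str) -> str:
    """Divide every 13+ digit integer in the cell by 1000 (ns -> us)."""
    return _NS_INT.sub(lambda m: str(int(m.group()) // 1000), cell)


def patch_row(row: dict, scale_cols: set) -> dict:
    """Rename every key; scale only the cells whose renamed key is in scale_cols."""
    patched = {}
    for key, cell in row.items():
        new_key = rename_key(key)
        patched[new_key] = scale_ns_to_us(cell) if new_key in scale_cols else cell
    return patched


# One day is 86_400 s: 86_400 * 10**9 ns under pandas 2, 86_400 * 10**6 us under pandas 3.
row = {"a@datetime64[ns]": "[0, 86400000000000]", "b@int": "1234567890123456"}
patched = patch_row(row, scale_cols={"a@datetime64[us]"})
```

Restricting the scaling to an explicit set of columns keeps large integers in unrelated columns (such as `b@int` here) untouched, which a blanket regex over the whole table would not.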
### Why are the changes needed?
The scheduled CI run on the `python-312-pandas-3` image fails in this suite, e.g. https://github.com/apache/spark/actions/runs/26002965955/job/76430490989. Root causes:
- `pd.date_range("19700101", periods=2, tz="US/Eastern").values` raises `zoneinfo._common.ZoneInfoNotFoundError: 'No time zone found with key US/Eastern'`. Pandas 3 dropped `pytz` as a hard dependency and now resolves tz names through stdlib `zoneinfo`, which on Ubuntu 24.04 cannot find `US/Eastern` because Ubuntu moved the legacy aliases out of `tzdata` into a separate `tzdata-legacy` package that the CI image does not install.
- After the alias fix, `golden.loc[str_t, str_v]` raises `KeyError` because the column keys in the golden file are pandas-2-shaped (`datetime64[ns]`, `Categorical(..., object)`) but the lookup keys built at runtime are pandas-3-shaped (`datetime64[us]`, `Categorical(..., str)`).
- After the key rename, assertions still fail because the cast result values themselves changed: nanoseconds -> microseconds for datetime / Timedelta inputs, and one cell where pandas 3 now succeeds where pandas 2 errored.
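
The first root cause can be checked with the standard library alone. The sketch below is illustrative only: `LEGACY_TO_CANONICAL`, `canonical_tz`, and `tz_available` are hypothetical names that do not appear in the PR, and the mapping lists only the one alias discussed here. Whether `US/Eastern` resolves depends on the time zone data installed on the machine.

```python
from zoneinfo import ZoneInfo, ZoneInfoNotFoundError

# Hypothetical mapping for illustration; only the alias from this PR is listed.
LEGACY_TO_CANONICAL = {"US/Eastern": "America/New_York"}


def canonical_tz(name: str) -> str:
    """Return the canonical IANA name for a known legacy alias, else the name itself."""
    return LEGACY_TO_CANONICAL.get(name, name)


def tz_available(key: str) -> bool:
    """Report whether stdlib zoneinfo can resolve the key on this system."""
    try:
        ZoneInfo(key)
    except ZoneInfoNotFoundError:
        return False
    return True


# On Ubuntu 24.04 without tzdata-legacy this is expected to print False for the alias.
print("US/Eastern available:", tz_available("US/Eastern"))
print("canonical name:", canonical_tz("US/Eastern"))
```

Using the canonical name in the fixture avoids depending on which optional time zone packages an image happens to install.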
### Does this PR introduce _any_ user-facing change?
No. Test-only change.
### How was this patch tested?
Ran the suite locally under two envs:
```
# pandas 2.3.3 / Python 3.13
$ python/run-tests --testnames "pyspark.sql.tests.coercion.test_pandas_udf_return_type PandasUDFReturnTypeTests"
...
Tests passed in 31 seconds
# pandas 3.0.2 / Python 3.13
$ python/run-tests --testnames "pyspark.sql.tests.coercion.test_pandas_udf_return_type PandasUDFReturnTypeTests"
...
Tests passed in 30 seconds
```
### Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Code
Closes #55974 from zhengruifeng/SPARK-fix-tz-uneastern.
Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
(cherry picked from commit c57777e)
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
Contributor
Author
merged to master/4.x