[SPARK-56081][PS] Align idxmax and idxmin NA handling with pandas 3 #54908

Closed
ueshin wants to merge 4 commits into apache:master from ueshin:issues/SPARK-56081/idxmax_idxmin

ueshin (Member) commented Mar 19, 2026

What changes were proposed in this pull request?

This PR updates pandas-on-Spark idxmax and idxmin behavior to follow the pandas 3 semantics when NA values are involved.

In python/pyspark/pandas/series.py, Series.idxmax(skipna=False) and Series.idxmin(skipna=False) now raise ValueError on pandas 3 instead of returning np.nan with a FutureWarning, while preserving the existing behavior on pandas versions below 3.0.
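The version split for the Series case can be illustrated with a small pure-Python sketch. This is not the code in `series.py`: `series_idxmax_sketch`, `is_na`, and the `pandas_major` parameter are hypothetical names used only to show the two branches, and the handling of an all-NA input with `skipna=True` is an assumption, since the description does not cover it.

```python
import math
import warnings


def is_na(value):
    """Treat None and float NaN as missing (a simplification)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def series_idxmax_sketch(labels, values, skipna=True, pandas_major=3):
    """Return the label of the largest value (hypothetical helper)."""
    pairs = list(zip(labels, values))
    has_na = any(is_na(v) for _, v in pairs)
    if not skipna and has_na:
        if pandas_major >= 3:
            # pandas 3 branch: fail explicitly.
            raise ValueError("idxmax with skipna=False encountered an NA value")
        # Pre-3.0 branch: deprecated NaN result plus a FutureWarning.
        warnings.warn("NA with skipna=False is deprecated", FutureWarning)
        return float("nan")
    valid = [(label, v) for label, v in pairs if not is_na(v)]
    if not valid:
        # Assumption: not covered by the description; the sketch simply raises.
        raise ValueError("no non-NA values")
    # max() keeps the first of several equal maxima.
    return max(valid, key=lambda pair: pair[1])[0]
```

With `skipna=True` the NA entries are skipped on either version, so only the `skipna=False` path differs between the two branches.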

In python/pyspark/pandas/frame.py, DataFrame.idxmax(axis=1) and DataFrame.idxmin(axis=1) now raise on pandas 3 when a row contains only NA values, matching pandas, instead of silently materializing None for that row.
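The row-wise case can be sketched the same way. The helper below is a hypothetical stand-in, not the logic in `frame.py`; `row_idxmin_sketch` and `pandas_major` are illustrative names, and the use of `ValueError` is an assumption because the description does not name the exception type for the DataFrame path.

```python
import math


def row_idxmin_sketch(rows, columns, pandas_major=3):
    """Row-wise idxmin over a list of rows (hypothetical helper).

    rows: list of lists holding one value per column.
    """
    result = []
    for position, row in enumerate(rows):
        # Keep only the non-NA cells of this row, paired with column labels.
        valid = [
            (col, v)
            for col, v in zip(columns, row)
            if v is not None and not (isinstance(v, float) and math.isnan(v))
        ]
        if valid:
            result.append(min(valid, key=lambda pair: pair[1])[0])
        elif pandas_major >= 3:
            # Assumed exception type; the all-NA row is rejected.
            raise ValueError(f"row {position} contains only NA values")
        else:
            # Pre-3.0 branch: a null result for the all-NA row.
            result.append(None)
    return result
```

Rows that contain at least one non-NA value behave identically in both branches; only the all-NA row changes outcome.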

The related pandas-on-Spark tests were updated to assert the version-specific behavior in:

  • python/pyspark/pandas/tests/computation/test_idxmax_idxmin.py
  • python/pyspark/pandas/tests/series/test_index.py
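A version-specific assertion of this kind might look roughly like the following. It is only a sketch of the pattern, not the contents of the test files listed above: `fake_idxmax_with_na` stands in for the real call, and `PANDAS_VERSION` stands in for `pandas.__version__` so the example runs without pandas installed.

```python
import unittest
import warnings


def major_version(version_string):
    """Extract the major version number from a string such as '3.0.0'."""
    return int(version_string.split(".")[0])


def fake_idxmax_with_na(pandas_major):
    """Hypothetical stand-in for idxmax(skipna=False) on data containing NA."""
    if pandas_major >= 3:
        raise ValueError("NA encountered")
    warnings.warn("deprecated", FutureWarning)
    return float("nan")


class IdxmaxVersionTest(unittest.TestCase):
    PANDAS_VERSION = "3.0.0"  # stand-in for pandas.__version__

    def test_skipna_false_with_na(self):
        major = major_version(self.PANDAS_VERSION)
        if major >= 3:
            # pandas 3: the call must raise.
            with self.assertRaises(ValueError):
                fake_idxmax_with_na(major)
        else:
            # Older pandas: a warning and a NaN result.
            with self.assertWarns(FutureWarning):
                result = fake_idxmax_with_na(major)
            self.assertNotEqual(result, result)  # NaN is not equal to itself
```

Branching inside a single test keeps both expectations next to each other, which makes it easy to drop the older branch once pre-3.0 support is removed.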

Why are the changes needed?

pandas 3 changed the NA handling for idxmax and idxmin. The existing pandas-on-Spark implementation still followed the older behavior in these paths, which caused mismatches against pandas 3 expectations.

Aligning these code paths keeps pandas-on-Spark behavior consistent with the pandas version it is running against and makes the failure mode explicit instead of returning a deprecated result.

Does this PR introduce any user-facing change?

Yes.

When pandas 3 is used:

  • Series.idxmax(skipna=False) and Series.idxmin(skipna=False) now raise ValueError when an NA is encountered instead of returning np.nan.
  • DataFrame.idxmax(axis=1) and DataFrame.idxmin(axis=1) now raise on rows with all NA values instead of returning a null result.

For pandas versions below 3.0, the existing behavior is preserved.
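Callers that need one code path across both pandas generations could normalize the two outcomes themselves. The wrapper below is a hypothetical caller-side sketch, not part of this PR; `idx_or_none` is an illustrative name, and it deliberately treats any `ValueError` as "no result", which may also hide unrelated errors.

```python
import math


def idx_or_none(compute):
    """Call compute() and map both NA outcomes to None (hypothetical helper)."""
    try:
        label = compute()
    except ValueError:
        # pandas 3 style: the NA case is reported as an exception.
        return None
    if isinstance(label, float) and math.isnan(label):
        # Pre-3.0 style: the NA case is reported as NaN.
        return None
    return label
```

A caller would pass a zero-argument callable, for example `idx_or_none(lambda: s.idxmax(skipna=False))`, where `s` is assumed to be a pandas-on-Spark Series.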

How was this patch tested?

Updated the related tests for the pandas-version-specific idxmax and idxmin behavior.

Was this patch authored or co-authored using generative AI tooling?

No.

ueshin (Member Author) commented Mar 19, 2026

ueshin closed this Mar 19, 2026
ueshin reopened this Mar 19, 2026
HyukjinKwon (Member) commented

Merged to master.

terana pushed a commit to terana/spark that referenced this pull request Mar 23, 2026
Closes apache#54908 from ueshin/issues/SPARK-56081/idxmax_idxmin.

Authored-by: Takuya Ueshin <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>