[SPARK-56041][PS][TESTS] Normalize ndarray values in apply_batch typed result comparison for pandas 3 by ueshin · Pull Request #54873 · apache/spark

ueshin · 2026-03-17T21:20:30Z

What changes were proposed in this pull request?

This PR updates the pandas-on-Spark apply_batch typed test in python/pyspark/pandas/tests/computation/test_apply_func.py.

With pandas 3.x, the typed apply_batch cases that return array-like values can produce pandas results whose nested values are np.ndarray, while the expected pandas DataFrame still contains Python lists. The values are equivalent, but the comparison is stricter and the test fails.
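The mismatch described above can be illustrated in isolation with a minimal sketch. The variable names are hypothetical and this is not the test code from the PR; it only shows that a list and an np.ndarray can hold the same elements while remaining different container types.

```python
import numpy as np

# Hypothetical stand-ins for a single nested cell on each side of the comparison.
expected_cell = [1, 2, 3]            # the expected DataFrame holds Python lists
actual_cell = np.array([1, 2, 3])    # the actual result may hold np.ndarray

# The underlying data agree once both sides use the same container...
same_data = actual_cell.tolist() == expected_cell

# ...but the container types differ, which a strict comparison can reject.
same_type = type(actual_cell) is type(expected_cell)
```

Here `same_data` is true while `same_type` is false, which is the kind of representation-only difference the test needs to tolerate.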

To make the test compare the intended semantics instead of the container representation, this PR normalizes nested np.ndarray values to lists before asserting equality in test_apply_batch_with_type. The normalization is scoped to pandas 3.x and only applied in the affected test cases.
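A rough sketch of what such a normalization could look like follows. Only the name normalize_array_values and the using_pandas3 flag appear in the diff; the function body, the sample DataFrame, and the version check via string splitting are assumptions for illustration, not the code merged in this PR (the diff uses LooseVersion for the version check).

```python
import numpy as np
import pandas as pd

# Assumed version gate for illustration; the PR itself uses LooseVersion.
using_pandas3 = int(pd.__version__.split(".")[0]) >= 3


def normalize_array_values(pdf: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of pdf in which np.ndarray cells are converted to lists."""

    def to_list(value):
        # Leave scalars, strings, and existing lists untouched.
        return value.tolist() if isinstance(value, np.ndarray) else value

    out = pdf.copy()
    for col in out.columns:
        out[col] = out[col].map(to_list)
    return out


# Hypothetical sample result whose nested values are np.ndarray.
actual = pd.DataFrame({"id": [1, 2], "vals": [np.array([1, 2]), np.array([3, 4])]})

# Scope the normalization to pandas 3.x, as the PR description states.
comparable = normalize_array_values(actual) if using_pandas3 else actual
```

Keeping the helper inside the test means the conversion never touches the production apply_batch path; it only changes what the assertion sees.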

Why are the changes needed?

The failure is caused by stricter comparison behavior in pandas 3.x rather than a functional regression in pandas-on-Spark apply_batch.

In the failing cases, the expected and actual results contain the same data, but one side may use Python lists and the other may use np.ndarray for nested array values. Relaxing the production code path would broaden behavior beyond the test. Updating the test keeps runtime behavior unchanged and makes the assertion reflect the intended equivalence.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Ran the related pandas-on-Spark computation test module in both the default Python 3.12 environment and the pandas 3 environment, and both passed.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Codex GPT-5

ueshin · 2026-03-17T21:20:43Z

cc @gaogaotiantian @HyukjinKwon @zhengruifeng

dongjoon-hyun · 2026-03-17T22:13:19Z

python/pyspark/pandas/tests/computation/test_apply_func.py

    def test_apply_batch_with_type(self):
+        using_pandas3 = LooseVersion(pd.__version__) >= "3.0.0"
+
+        def normalize_array_values(pdf: pd.DataFrame) -> pd.DataFrame:


Is this method only usable here (for now and for the future), @ueshin? It's just a question because this looks like a general and useful function.

dongjoon-hyun

+1, LGTM (Pending CIs).

Normalize ndarray values in apply_batch typed result comparison for pandas 3

8bf70a0

HyukjinKwon approved these changes Mar 17, 2026

dongjoon-hyun reviewed Mar 17, 2026

dongjoon-hyun approved these changes Mar 17, 2026

dongjoon-hyun closed this in 5a07d8d Mar 17, 2026

ueshin wants to merge 1 commit into apache:master from ueshin:issues/SPARK-56041/ndarray
