[SPARK-56041][PS][TESTS] Normalize ndarray values in apply_batch typed result comparison for pandas 3 #54873

Closed
ueshin wants to merge 1 commit into apache:master from ueshin:issues/SPARK-56041/ndarray

ueshin (Member) commented Mar 17, 2026

What changes were proposed in this pull request?

This PR updates the pandas-on-Spark apply_batch typed test in python/pyspark/pandas/tests/computation/test_apply_func.py.

With pandas 3.x, the typed apply_batch cases that return array-like values can produce pandas results whose nested values are np.ndarray, while the expected pandas DataFrame still contains Python lists. The values are equivalent, but pandas 3.x compares them more strictly, so the test fails.
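
For illustration only, the mismatch described above can be reproduced in plain pandas without Spark. This is a minimal sketch with made-up data: the column name `a`, the values, and the helpers `cell_types` and `cell_values` are assumptions and are not taken from the actual test.

```python
import numpy as np
import pandas as pd

# Hypothetical frames: same nested data, different container types per cell.
expected = pd.DataFrame({"a": [[1, 2], [3, 4]]})
actual = pd.DataFrame(
    {"a": pd.Series([np.array([1, 2]), np.array([3, 4])], dtype=object)}
)


def cell_types(pdf: pd.DataFrame) -> list:
    """Name of the Python type stored in each cell of column 'a'."""
    return [type(value).__name__ for value in pdf["a"]]


def cell_values(pdf: pd.DataFrame) -> list:
    """Cell contents of column 'a' converted to plain nested lists."""
    return [np.asarray(value).tolist() for value in pdf["a"]]


print(cell_types(expected), cell_types(actual))
print(cell_values(expected) == cell_values(actual))
```

The two frames carry the same numbers, and only the per-cell container type differs, which is the distinction the test needs to ignore.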

To make the test compare the intended semantics instead of the container representation, this PR normalizes nested np.ndarray values to lists before asserting equality in test_apply_batch_with_type. The normalization is scoped to pandas 3.x and only applied in the affected test cases.
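
A rough sketch of what such a normalization could look like follows. It is not the PR's actual helper: the function names `ndarray_cells_to_lists` and `maybe_normalize`, the version check, and the sample data are all assumptions made for this example.

```python
import numpy as np
import pandas as pd

# Assumed version gate for this sketch; the PR itself uses LooseVersion.
USING_PANDAS3 = int(pd.__version__.split(".")[0]) >= 3


def ndarray_cells_to_lists(pdf: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of pdf where every np.ndarray cell becomes a Python list."""

    def to_list(value):
        # Leave scalars, strings, and existing lists untouched.
        return value.tolist() if isinstance(value, np.ndarray) else value

    out = pdf.copy()
    for name in out.columns:
        out[name] = out[name].map(to_list)
    return out


def maybe_normalize(pdf: pd.DataFrame) -> pd.DataFrame:
    """Apply the normalization only when running on pandas 3.x."""
    return ndarray_cells_to_lists(pdf) if USING_PANDAS3 else pdf


# Hypothetical result frame with ndarray cells in column 'a'.
sample = pd.DataFrame(
    {
        "a": pd.Series([np.array([1, 2]), np.array([3, 4])], dtype=object),
        "b": [10, 20],
    }
)
normalized = ndarray_cells_to_lists(sample)
print(normalized["a"].tolist())
```

Keeping a helper like this inside the test keeps the production apply_batch path unchanged, in line with the scoping described above.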

Why are the changes needed?

The failure is caused by stricter comparison behavior in pandas 3.x rather than a functional regression in pandas-on-Spark apply_batch.

In the failing cases, the expected and actual results contain the same data, but one side may use Python lists and the other may use np.ndarray for nested array values. Relaxing the production code path would broaden behavior beyond the test. Updating the test keeps runtime behavior unchanged and makes the assertion reflect the intended equivalence.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Ran the related pandas-on-Spark computation test module in both the default Python 3.12 environment and the pandas 3 environment, and both passed.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Codex GPT-5

ueshin (Member, Author) commented Mar 17, 2026

def test_apply_batch_with_type(self):
    using_pandas3 = LooseVersion(pd.__version__) >= "3.0.0"

    def normalize_array_values(pdf: pd.DataFrame) -> pd.DataFrame:
dongjoon-hyun (Member) commented Mar 17, 2026

Is this method only used here (now and in the future), @ueshin? I ask only because it looks like a general, useful function.

dongjoon-hyun (Member) left a comment

+1, LGTM (Pending CIs).
