[SPARK-54665][PS] Fix comparison between boolean Series and string literals by singhpraveen2010 · Pull Request #54628 · apache/spark

singhpraveen2010 · 2026-03-04T21:13:14Z

What changes were proposed in this pull request?

This resolves the open issue:
https://issues.apache.org/jira/browse/SPARK-54665

This PR resolves an inconsistency in pandas-on-Spark (pyspark.pandas) where comparing a boolean Series to a string literal returns True for every element under equality (==) and False for every element under inequality (!=), regardless of the values in the Series.

This PR explicitly overrides the eq and ne methods in BooleanOps (python/pyspark/pandas/data_type_ops/boolean_ops.py) to correctly handle string comparisons by returning a boolean series filled with False (for eq) and True (for ne), aligning the behavior with native pandas semantics.
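The rule described above can be modelled on plain Python lists. This is only a minimal sketch of the intended semantics, not the code in boolean_ops.py: the helper names bool_eq and bool_ne are hypothetical, and the real implementation operates on Spark columns rather than lists.

```python
from typing import Any, List


def bool_eq(values: List[bool], other: Any) -> List[bool]:
    """Element-wise == for a boolean column, modelled on a plain list."""
    if isinstance(other, str):
        # A string literal never equals a boolean value, so every row is False.
        # (Plain Python already behaves this way; the explicit branch mirrors
        # the short-circuit the PR describes for string operands.)
        return [False] * len(values)
    return [v == other for v in values]


def bool_ne(values: List[bool], other: Any) -> List[bool]:
    """Element-wise !=, defined as the exact negation of bool_eq."""
    return [not matched for matched in bool_eq(values, other)]


print(bool_eq([True, False], "True"))  # [False, False]
print(bool_ne([True, False], "True"))  # [True, True]
print(bool_eq([True, False], True))    # [True, False]
```

Defining the inequality as the negation of the equality keeps the two results consistent for any operand, which is the property the pandas comparison relies on.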

Why are the changes needed?

This change is needed to maintain strict API and behavioral parity with native pandas.

Currently, in pyspark.pandas 4.0.1:
ps.Series([True, False]) == 'True' incorrectly yields [True, True].
In native pandas:
pd.Series([True, False]) == 'True' correctly yields [False, False].

This divergence causes silent logical errors when migrating existing pandas workloads to Spark. By fixing this, we ensure predictable and correct comparison logic for end-users.

Does this PR introduce any user-facing change?

Yes. This fixes incorrect results when a boolean Series is compared with a string literal.

Previous Behavior (pandas-on-Spark):

import pyspark.pandas as ps
s = ps.Series([True, False])
s == "True"
# 0    True
# 1    True

New Behavior (Matching native pandas):

import pyspark.pandas as ps
s = ps.Series([True, False])
s == "True"
# 0    False
# 1    False

How was this patch tested?

This patch was tested by adding explicit unit tests verifying equality (==) and inequality (!=) comparisons between boolean series and string literals.

Added tests in python/pyspark/pandas/tests/data_type_ops/test_boolean_ops.py:

BooleanOpsTests.test_eq_ne_with_string (indirectly via test_eq and test_ne updates)
BooleanExtensionOpsTestsMixin updates to ensure extension types are also covered.
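The shape of such a test might look like the sketch below. It is an illustration only: eq and ne are hypothetical stand-ins for `psser == other` and `psser != other` so that the example runs without Spark, and the class name is invented. The actual tests in test_boolean_ops.py compare pandas-on-Spark results against native pandas.

```python
import unittest
from typing import Any, List


def eq(values: List[bool], other: Any) -> List[bool]:
    # Hypothetical stand-in for `psser == other`; a real test would build a
    # pyspark.pandas Series and compare it with the native pandas result.
    return [(not isinstance(other, str)) and v == other for v in values]


def ne(values: List[bool], other: Any) -> List[bool]:
    # Hypothetical stand-in for `psser != other`.
    return [not r for r in eq(values, other)]


class EqNeWithStringSketch(unittest.TestCase):
    values = [True, False]
    literals = ("True", "False", "x", "")

    def test_eq_with_string_is_all_false(self):
        for literal in self.literals:
            with self.subTest(literal=literal):
                self.assertEqual(eq(self.values, literal), [False, False])

    def test_ne_with_string_is_all_true(self):
        for literal in self.literals:
            with self.subTest(literal=literal):
                self.assertEqual(ne(self.values, literal), [True, True])

    def test_boolean_operand_still_compares_by_value(self):
        # The string guard must not change ordinary boolean comparisons.
        self.assertEqual(eq(self.values, True), [True, False])
        self.assertEqual(ne(self.values, True), [False, True])
```

Saved to a file, the sketch can be run with `python -m unittest <file>`. Covering several literals, including "False" and the empty string, guards against a fix that only special-cases the string "True".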

Tests can be run locally via:

./python/run-tests --modules pyspark-pandas-slow --testnames "pyspark.pandas.tests.data_type_ops.test_boolean_ops"

Was this patch authored or co-authored using generative AI tooling?

co-authored using: Gemini CLI (Model: Gemini 2.5 Pro)

singhpraveen2010 added 2 commits March 5, 2026 02:26

2d94944 [SPARK-54665][PS] Fix comparison between boolean Series and string literals
a5290cf Trigger CI

singhpraveen2010 wants to merge 2 commits into apache:master from singhpraveen2010:SPARK-54665-fix-bool-string-comparison
