[SPARK-54665][PS] Fix comparison between boolean Series and string literals #54628

Open
singhpraveen2010 wants to merge 2 commits into apache:master from singhpraveen2010:SPARK-54665-fix-bool-string-comparison

@singhpraveen2010

What changes were proposed in this pull request?

This resolves the open issue:
https://issues.apache.org/jira/browse/SPARK-54665

This PR resolves an inconsistency in pandas-on-Spark (pyspark.pandas) where comparing a boolean Series to a string literal returns True for every element under equality (==) and False for every element under inequality (!=), regardless of the Series values.

This PR explicitly overrides the eq and ne methods in BooleanOps (python/pyspark/pandas/data_type_ops/boolean_ops.py) to correctly handle string comparisons by returning a boolean series filled with False (for eq) and True (for ne), aligning the behavior with native pandas semantics.
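
The intended semantics of the override can be sketched in plain Python, without Spark. This is not the code from the PR: bool_eq and bool_ne are hypothetical stand-ins for the eq and ne methods in BooleanOps, and plain lists stand in for a boolean Series.

```python
from typing import Any, List


def bool_eq(values: List[bool], other: Any) -> List[bool]:
    """Element-wise ==; a string literal never equals a boolean value."""
    if isinstance(other, str):
        # Assumed rule from the PR description: fill the result with False.
        return [False] * len(values)
    return [v == other for v in values]


def bool_ne(values: List[bool], other: Any) -> List[bool]:
    """Element-wise !=; a string literal always differs from a boolean value."""
    if isinstance(other, str):
        # Assumed rule from the PR description: fill the result with True.
        return [True] * len(values)
    return [v != other for v in values]


print(bool_eq([True, False], "True"))  # [False, False]
print(bool_ne([True, False], "True"))  # [True, True]
print(bool_eq([True, False], True))    # [True, False]
```

Comparisons against non-string operands fall through to ordinary element-wise comparison in this sketch; how the actual patch handles other operand types is not stated in the description.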

Why are the changes needed?

This change is needed to maintain strict API and behavioral parity with native pandas.

Currently, in pyspark.pandas 4.0.1:
ps.Series([True, False]) == 'True' incorrectly yields [True, True].
In native pandas:
pd.Series([True, False]) == 'True' correctly yields [False, False].

This divergence causes silent logical errors when migrating existing pandas workloads to Spark. By fixing this, we ensure predictable and correct comparison logic for end-users.
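
The native pandas side of this comparison can be checked directly. A small sketch, assuming pandas is installed; the variable names are illustrative only.

```python
import pandas as pd

s = pd.Series([True, False])

# Comparing booleans against a string literal: no element is equal.
eq_result = s == "True"
ne_result = s != "True"

print(eq_result.tolist())  # [False, False]
print(ne_result.tolist())  # [True, True]
```

These are the reference values that the pandas-on-Spark results are expected to match after the fix.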

Does this PR introduce any user-facing change?

Yes, this fixes a bug in behavioral semantics.

Previous Behavior (pandas-on-Spark):

import pyspark.pandas as ps
s = ps.Series([True, False])
s == "True"
# 0    True
# 1    True

New Behavior (Matching native pandas):

import pyspark.pandas as ps
s = ps.Series([True, False])
s == "True"
# 0    False
# 1    False

How was this patch tested?

This patch was tested by adding explicit unit tests verifying equality (==) and inequality (!=) comparisons between boolean series and string literals.

Added tests in python/pyspark/pandas/tests/data_type_ops/test_boolean_ops.py:

  • BooleanOpsTests.test_eq_ne_with_string (indirectly via test_eq and test_ne updates)
  • BooleanExtensionOpsTestsMixin updates to ensure extension types are also covered.

Tests can be run locally via:

./python/run-tests --modules pyspark-pandas-slow --testnames "pyspark.pandas.tests.data_type_ops.test_boolean_ops"

Was this patch authored or co-authored using generative AI tooling?

Co-authored using: Gemini CLI (Model: Gemini 2.5 Pro)
