Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-37039][PS] Fix Series.astype to work properly with missing value #44570

Closed
wants to merge 4 commits into from

Conversation

itholic
Copy link
Contributor

@itholic itholic commented Jan 3, 2024

What changes were proposed in this pull request?

This PR proposes to fix Series.astype to work properly with missing value.

Why are the changes needed?

To follow the behavior of latest Pandas.

Does this PR introduce any user-facing change?

Yes, the bug is fixed to follow the behavior of Pandas:

Before

>>> psser = ps.Series([decimal.Decimal(1), decimal.Decimal(2), decimal.Decimal(np.nan)])
>>> psser.astype(bool)
0    True
1    True
2    False
dtype: bool

After

>>> psser = ps.Series([decimal.Decimal(1), decimal.Decimal(2), decimal.Decimal(np.nan)])
>>> psser.astype(bool)
0    True
1    True
2    True
dtype: bool

How was this patch tested?

Enable the existing UTs.

Was this patch authored or co-authored using generative AI tooling?

No.

@@ -54,10 +54,7 @@ def test_astype(self):
lambda: psser.astype(int_type),
)

# TODO(SPARK-37039): the np.nan series.astype(bool) should be True
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should probably add this into migration guide though.

dongjoon-hyun
dongjoon-hyun previously approved these changes Jan 3, 2024
@dongjoon-hyun
Copy link
Member

Oh, it seems the the failures are relevant, @itholic .

pyspark.errors.exceptions.base.PySparkAssertionError: [DIFFERENT_PANDAS_SERIES] Series are not almost equal:
Left:
bool
Right:
bool

@HyukjinKwon
Copy link
Member

@itholic can you retrigger and/or fix the tests?

@itholic
Copy link
Contributor Author

itholic commented Jan 4, 2024

Oh.. seems like we should separately handle the boolean data type. I got some personal errands right now, so let me take a look tomorrow. Thanks!

@itholic itholic changed the title [SPARK-37039][PS] Fix Series.astype to work properly with missing value [WIP][SPARK-37039][PS] Fix Series.astype to work properly with missing value Jan 5, 2024
@itholic itholic marked this pull request as draft January 5, 2024 00:32
@itholic itholic marked this pull request as ready for review January 9, 2024 08:12
@itholic
Copy link
Contributor Author

itholic commented Jan 9, 2024

CI passed. @dongjoon-hyun @HyukjinKwon FYI

@itholic itholic changed the title [WIP][SPARK-37039][PS] Fix Series.astype to work properly with missing value [SPARK-37039][PS] Fix Series.astype to work properly with missing value Jan 9, 2024
@HyukjinKwon
Copy link
Member

Merged to master.

yaooqinn pushed a commit that referenced this pull request Jan 11, 2024
### What changes were proposed in this pull request?

This PR followup for #44570 to add migration guide for behavior change.

### Why are the changes needed?

We should notice user about any behavior change

### Does this PR introduce _any_ user-facing change?

No API change.

### How was this patch tested?

The existing CI should pass.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44684 from itholic/SPARK-37039-foolowup.

Authored-by: Haejoon Lee <haejoon.lee@databricks.com>
Signed-off-by: Kent Yao <yao@apache.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants