[SPARK-43024][PYTHON] Upgrade pandas to 2.0.0#41211

Closed
itholic wants to merge 4 commits into apache:master from itholic:pandas_2

Conversation

Contributor

@itholic commented May 18, 2023

What changes were proposed in this pull request?

This PR proposes to upgrade pandas to 2.0.0.

Why are the changes needed?

To support the latest pandas.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Addressed the existing UTs.

Member

@dongjoon-hyun left a comment


Nice!

Comment on lines +66 to +69
Contributor Author

@itholic commented May 18, 2023


Let me skip the tests for now, since pandas 2.0.0 introduces numerous changes, and most of them are behavior changes in the pandas API on Spark.

In the current PR, our primary goal is to prioritize support for pandas 2.0.0 in PySpark, and I don't want to introduce any behavior changes (there should be no behavior changes unless it is a major release).

I have already created a bunch of tickets for re-enabling the tests affected by pandas 2.0.0, so let me handle them in the follow-ups.

Read #40658 (comment) for additional detail.

Member


For the record, we will need some behaviour changes because of pandas 2.0.0, since we have some APIs that rely on the pandas API on Spark. But I am fine with skipping them for now and working on them separately.

Member


Oh... behaviour changes because of pandas 2.0.0?

Member


Yeah, because we depend on calling pandas directly in several APIs in the pandas API on Spark. For example, this one: https://github.com/apache/spark/pull/40658/files#diff-3d6777133e0de63c6dd3761a15d631d0c6520b918babf74522b0d95580954e72L908

         >>> ser.rename("a").to_frame().set_index("a").index.astype('int64')
-        Int64Index([1, 2], dtype='int64', name='a')
+        Index([1, 2], dtype='int64', name='a')

pandas returns an Index instead of an Int64Index (see also https://pandas.pydata.org/docs/dev/whatsnew/v2.0.0.html#index-can-now-hold-numpy-numeric-dtypes).

There is no Int64Index class anymore in pandas.
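As a side note (not code from this PR), the change is easy to observe: the same constructor call yields a different index class across pandas versions, while the underlying dtype is unchanged. A minimal sketch:

```python
import pandas as pd

# In pandas < 2.0 this returned an Int64Index subclass; in pandas >= 2.0
# the numeric index classes are gone and a plain Index is returned.
idx = pd.Index([1, 2], dtype="int64", name="a")

# The class name differs across versions ("Int64Index" vs. "Index"),
# which is exactly what breaks doctests that match the repr.
print(type(idx).__name__)

# The underlying dtype is int64 either way.
print(idx.dtype)
```

This is why the doctest diff above only changes the expected repr, not the data or dtype.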

Member


So, I believe these changes are inevitable, but we should avoid breaking changes. @itholic is working on listing all of these breaking changes; we will discuss them internally first and then bring them to the community.

Contributor

@bjornjorgensen commented May 19, 2023


Eh, I'm a bit puzzled and wonder if there is a misunderstanding of the language here.
"There should be no behavior changes unless it is a major release": where does this come from?
What @gatorsmile said was "we should not remove these API before the major release Spark 4.0".
Did he mean that functions that are well-functioning but have been removed in pandas version 2.0 should not be removed? He also wrote this under def append, which was meant to be removed in your previous PR.
The whole point of the pandas API on Spark is that it should be as similar as possible to pandas, so there must also be some behavior changes, because they have happened in pandas.
Are there other ways than one person having to fix everything in one and the same PR? E.g., mark all failing tests with @pytest.mark.skip(reason="see JIRA XXXX for updating to pandas 2.0") and also create a JIRA for that, so that more people can help upgrade the pandas API on Spark to version 2.0?

Contributor Author

@itholic commented May 22, 2023


"There should be no behavior changes unless it is a major release" where does this come from?

AFAIK, it appears to be primarily intended to adhere to the "Considerations when breaking APIs" mentioned in Versioning Policy. It also serves as a rule followed uniformly across all components other than the pandas API on Spark.

The whole point of the pandas API on spark is that it should be as similar as possible to pandas and then there must also be some behavior changes, because it has happened in pandas.

Yes, as @HyukjinKwon mentioned in #41211 (comment), that's why we need to discuss whether to allow breaking changes for some APIs in the pandas API on Spark exceptional compared to other components.

Are there other ways than one person having to fix everything in one and the same PR? E.g., mark all failing tests with @pytest.mark.skip(reason="see JIRA XXXX for updating to pandas 2.0") and also create a JIRA for that, so that more people can help upgrade the pandas API on Spark to version 2.0?

Yes, as I mentioned in #41211 (comment), I am currently skipping all failing tests and creating tickets to allow others to participate in the pandas upgrade. The current PR serves as the groundwork for that, and once the PR is completed, anyone who wishes to contribute can help with the pandas upgrade.

So after this PR is merged, anyone who wants to contribute to the pandas API on Spark can pick a ticket from the list at SPARK-42618.

Comment on lines 4033 to 4045
Contributor Author


FYI: I will address the skipped tests like this, since we can't change the behavior for now.
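A minimal sketch of that skip pattern (class and test names here are illustrative, not the actual tests touched by this PR), using unittest as PySpark's test suites do:

```python
import unittest

class IndexesTests(unittest.TestCase):
    # Hypothetical test case standing in for the real pandas-on-Spark tests.
    @unittest.skip(
        "Skipped until the pandas 2.0.0 follow-ups (see SPARK-42618): "
        "the expected output changed, e.g. Int64Index was removed."
    )
    def test_astype(self):
        self.fail("not executed while skipped")

# Running the suite reports the test as skipped rather than failed.
suite = unittest.TestLoader().loadTestsFromTestCase(IndexesTests)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

The reason string keeps a pointer back to the umbrella JIRA, so anyone picking up a follow-up ticket can find the skipped tests by searching for it.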

@itholic itholic marked this pull request as draft May 18, 2023 23:50
@itholic itholic marked this pull request as ready for review May 26, 2023 02:57
@itholic
Contributor Author

itholic commented May 26, 2023

@HyukjinKwon @dongjoon-hyun CI passed.
Also cc @zhengruifeng @ueshin @xinrong-meng FYI.

Member

@HyukjinKwon left a comment


Let's keep your approach consistent: if you're skipping, just skip all for now.

@bjornjorgensen
Contributor

@HyukjinKwon
Member

Merged to master.

Regarding https://pandas.pydata.org/pandas-docs/version/2.0.2/whatsnew/v2.0.2.html, @itholic is working on the list. Separate PR will be created.
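Supporting pandas 2.x while 1.x remains installed in some environments usually means gating behavior on the runtime pandas version. A hedged sketch of such a guard (illustrative only, not the mechanism used in PySpark):

```python
import pandas as pd

# True when the installed pandas is 2.0.0 or newer (simplified check;
# a real implementation should use a proper version parser).
PANDAS_GE_2 = int(pd.__version__.split(".")[0]) >= 2

def expected_index_class_name() -> str:
    """Class name a doctest should expect for an int64 index."""
    # pandas 2.x collapsed Int64Index/UInt64Index/Float64Index into Index.
    return "Index" if PANDAS_GE_2 else "Int64Index"

print(expected_index_class_name())
```

A guard like this lets version-dependent expectations live in one place instead of being scattered across individual tests.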

@dongjoon-hyun
Member

Nice. Thank you, @itholic and all!

czxm pushed a commit to czxm/spark that referenced this pull request Jun 12, 2023
Closes apache#41211 from itholic/pandas_2.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
@itholic itholic deleted the pandas_2 branch November 20, 2023 01:35