[SPARK-43024][PYTHON] Upgrade pandas to 2.0.0#41211

Closed
itholic wants to merge 4 commits into apache:master from itholic:pandas_2

Conversation

Contributor

@itholic commented May 18, 2023

What changes were proposed in this pull request?

This PR proposes to upgrade pandas to 2.0.0.

Why are the changes needed?

To support the latest pandas.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Addressed the existing UTs.

Member

@dongjoon-hyun left a comment


Nice!

Comment on lines +66 to +69
Contributor Author

@itholic commented May 18, 2023


Let me skip the tests for now, since pandas 2.0.0 introduces numerous changes, and most of them are behavior changes in the pandas API on Spark.

In the current PR, our primary goal is to prioritize support for pandas 2.0.0 in PySpark, and I don't want to introduce any behavior changes (there should be no behavior changes unless it is a major release).

I have already created a bunch of tickets for re-enabling the tests affected by pandas 2.0.0, so let me handle them in the follow-ups.

Read #40658 (comment) for additional detail.

Member


For the record, we will need some behaviour changes because of pandas 2.0.0, since we have some APIs that rely on the pandas API on Spark. But I am fine with skipping them for now and working on them separately.

Member


Oh... behaviour changes because of pandas 2.0.0?

Member


Yeah, because we depend on calling pandas directly in several APIs in the pandas API on Spark. For example, this one: https://github.com/apache/spark/pull/40658/files#diff-3d6777133e0de63c6dd3761a15d631d0c6520b918babf74522b0d95580954e72L908

         >>> ser.rename("a").to_frame().set_index("a").index.astype('int64')
-        Int64Index([1, 2], dtype='int64', name='a')
+        Index([1, 2], dtype='int64', name='a')

pandas returns an Index instead of an Int64Index (see also https://pandas.pydata.org/docs/dev/whatsnew/v2.0.0.html#index-can-now-hold-numpy-numeric-dtypes).

There is no Int64Index class anymore in pandas.
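As a side note (not code from this PR), the change is easy to observe: the same constructor call yields a different index class across pandas versions, while the underlying dtype is unchanged. A minimal sketch:

```python
import pandas as pd

# In pandas < 2.0 this returned an Int64Index subclass; in pandas >= 2.0
# the numeric index classes are gone and a plain Index is returned.
idx = pd.Index([1, 2], dtype="int64", name="a")

# The class name differs across versions ("Int64Index" vs. "Index"),
# which is exactly what breaks doctests that match the repr.
print(type(idx).__name__)

# The underlying dtype is int64 either way.
print(idx.dtype)
```

This is why the doctest diff above only changes the expected repr, not the data or dtype.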

Member


So, I believe these changes are inevitable, but we should avoid breaking changes. @itholic is working on listing all of these breaking changes; we will discuss them internally first and then bring them to the community.

Contributor

@bjornjorgensen commented May 19, 2023


Eh, I'm a bit puzzled and wonder if there is a misunderstanding of the language here.
"There should be no behavior changes unless it is a major release": where does this come from?
What @gatorsmile said was "we should not remove these API before the major release Spark 4.0".
Did he mean that functions that are well-functioning but have been removed in pandas version 2.0 should not be removed? He also wrote this under def append, which was meant to be removed in your previous PR.
The whole point of the pandas API on Spark is that it should be as similar as possible to pandas, so there must also be some behavior changes, because they have happened in pandas.
Are there other ways than one person having to fix everything in one and the same PR? E.g., mark all failing tests with @pytest.mark.skip(reason="see JIRA XXXX for updating to pandas 2.0") and also create a JIRA for that, so that more people can help upgrade the pandas API on Spark to version 2.0?

Contributor Author

@itholic commented May 22, 2023


"There should be no behavior changes unless it is a major release" where does this come from?

AFAIK, it appears to be primarily intended to adhere to the "Considerations when breaking APIs" mentioned in Versioning Policy. It also serves as a rule followed uniformly across all components other than the pandas API on Spark.

The whole point of the pandas API on spark is that it should be as similar as possible to pandas and then there must also be some behavior changes, because it has happened in pandas.

Yes, as @HyukjinKwon mentioned in #41211 (comment), that's why we need to discuss whether to allow breaking changes for some APIs in the pandas API on Spark exceptional compared to other components.

Are there other ways than one person having to fix everything in one and the same PR? E.g., mark all failing tests with @pytest.mark.skip(reason="see JIRA XXXX for updating to pandas 2.0") and also create a JIRA for that, so that more people can help upgrade the pandas API on Spark to version 2.0?

Yes, as I mentioned in #41211 (comment), I am currently skipping all failing tests and creating tickets to allow others to participate in the pandas upgrade. The current PR serves as the groundwork for that, and once the PR is completed, anyone who wishes to contribute can help with the pandas upgrade.

So after this PR is merged, anyone who wants to contribute to the pandas API on Spark can pick a ticket from the list at SPARK-42618.

Comment on lines 4033 to 4045
Contributor Author


FYI: I will address the skipped tests like this, since we can't change the behavior for now.
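A minimal sketch of that skip pattern (class and test names here are illustrative, not the actual tests touched by this PR), using unittest as PySpark's test suites do:

```python
import unittest

class IndexesTests(unittest.TestCase):
    # Hypothetical test case standing in for the real pandas-on-Spark tests.
    @unittest.skip(
        "Skipped until the pandas 2.0.0 follow-ups (see SPARK-42618): "
        "the expected output changed, e.g. Int64Index was removed."
    )
    def test_astype(self):
        self.fail("not executed while skipped")

# Running the suite reports the test as skipped rather than failed.
suite = unittest.TestLoader().loadTestsFromTestCase(IndexesTests)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

The reason string keeps a pointer back to the umbrella JIRA, so anyone picking up a follow-up ticket can find the skipped tests by searching for it.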

@itholic itholic marked this pull request as draft May 18, 2023 23:50
@itholic itholic marked this pull request as ready for review May 26, 2023 02:57
@itholic
Contributor Author

itholic commented May 26, 2023

@HyukjinKwon @dongjoon-hyun CI passed.
Also cc @zhengruifeng @ueshin @xinrong-meng FYI.

Member

@HyukjinKwon left a comment


Let's keep your approach consistent: if you're skipping, just skip all for now.

@bjornjorgensen
Contributor

@HyukjinKwon
Member

Merged to master.

Regarding https://pandas.pydata.org/pandas-docs/version/2.0.2/whatsnew/v2.0.2.html, @itholic is working on the list. Separate PR will be created.
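Supporting pandas 2.x while 1.x remains installed in some environments usually means gating behavior on the runtime pandas version. A hedged sketch of such a guard (illustrative only, not the mechanism used in PySpark):

```python
import pandas as pd

# True when the installed pandas is 2.0.0 or newer (simplified check;
# a real implementation should use a proper version parser).
PANDAS_GE_2 = int(pd.__version__.split(".")[0]) >= 2

def expected_index_class_name() -> str:
    """Class name a doctest should expect for an int64 index."""
    # pandas 2.x collapsed Int64Index/UInt64Index/Float64Index into Index.
    return "Index" if PANDAS_GE_2 else "Int64Index"

print(expected_index_class_name())
```

A guard like this lets version-dependent expectations live in one place instead of being scattered across individual tests.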

@dongjoon-hyun
Member

Nice. Thank you, @itholic and all!

czxm pushed a commit to czxm/spark that referenced this pull request Jun 12, 2023
Closes apache#41211 from itholic/pandas_2.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
@itholic itholic deleted the pandas_2 branch November 20, 2023 01:35