[SPARK-43024][PYTHON] Upgrade pandas to 2.0.0#41211
itholic wants to merge 4 commits into apache:master from itholic/pandas_2
Conversation
Let me skip the tests for now, since there are numerous changes introduced in pandas 2.0.0, and most of them are related to behavior changes in the pandas API on Spark.
In the current PR, our primary goal is to prioritize support for pandas 2.0.0 in PySpark, and I don't want to introduce any behavior changes (there should be no behavior changes unless it is a major release).
I already created a bunch of tickets for enabling tests from pandas 2.0.0, so let me handle them in the follow-ups.
Read #40658 (comment) for additional detail.
For the record, we will need some behaviour changes because of pandas 2.0.0 - we have some APIs in the pandas API on Spark that rely on pandas directly. But I am fine with skipping them for now and working on them separately.
Oh... behaviour changes because of pandas 2.0.0?
Yeah, because we depend on calling pandas right away in several APIs in the pandas API on Spark. For example, this one: https://github.com/apache/spark/pull/40658/files#diff-3d6777133e0de63c6dd3761a15d631d0c6520b918babf74522b0d95580954e72L908
>>> ser.rename("a").to_frame().set_index("a").index.astype('int64')
- Int64Index([1, 2], dtype='int64', name='a')
+ Index([1, 2], dtype='int64', name='a')

pandas returns an Index instead of Int64Index (see also https://pandas.pydata.org/docs/dev/whatsnew/v2.0.0.html#index-can-now-hold-numpy-numeric-dtypes). There is no Int64Index class anymore in pandas.
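For context, the behavior difference is easy to reproduce directly with pandas itself — a minimal sketch (the printed class name depends on the installed pandas version):

```python
import pandas as pd

# In pandas < 2.0, integer-dtype indexes were materialized as the
# specialized Int64Index class; in pandas >= 2.0 those classes were
# removed, and a plain Index backed by a numpy dtype is returned.
idx = pd.Index([1, 2], dtype="int64", name="a")

print(type(idx).__name__)  # "Index" on pandas >= 2.0, "Int64Index" before
print(idx.dtype)           # int64
print(idx.name)            # a
```

This is why doctest expected outputs like the one above change their repr under pandas 2.0.0 even though the underlying values and dtype are identical.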
So, I believe these changes would be inevitable, but we should avoid breaking changes. @itholic is working on listing all these breaking changes; we will discuss internally first, then share it with the community.
Eh, I'm a bit puzzled and wonder if there is a misunderstanding of the language here.
"There should be no behavior changes unless it is a major release" where does this come from?
What @gatorsmile said was "we should not remove these API before the major release Spark 4.0"
Did he mean that functions that are well-functioning but have been removed in pandas 2.0 should not be removed? He also wrote this under def append, which was meant to be removed in your previous PR.
The whole point of the pandas API on Spark is that it should be as similar as possible to pandas, so there must also be some behavior changes, because they have happened in pandas.
Are there any other ways than one person having to fix everything in one and the same PR? E.g., could you mark all tests that fail with @pytest.mark.skip(reason="see JIRA XXXX for updating to pandas 2.0") and also create a JIRA for each, so that more people can help upgrade the pandas API on Spark to version 2.0?
"There should be no behavior changes unless it is a major release" where does this come from?
AFAIK, it appears to be primarily intended to adhere to the "Considerations when breaking APIs" section of the Versioning Policy. It is also a rule followed uniformly across all components other than the pandas API on Spark.
The whole point of the pandas API on Spark is that it should be as similar as possible to pandas, so there must also be some behavior changes, because they have happened in pandas.
Yes, as @HyukjinKwon mentioned in #41211 (comment), that's why we need to discuss whether to allow breaking changes for some APIs in the pandas API on Spark, as an exception compared to other components.
Are there any other ways than one person having to fix everything in one and the same PR? E.g., could you mark all tests that fail with @pytest.mark.skip(reason="see JIRA XXXX for updating to pandas 2.0") and also create a JIRA for each, so that more people can help upgrade the pandas API on Spark to version 2.0?
Yes, as I mentioned in #41211 (comment), I am currently skipping all failing tests and creating tickets to allow others to participate in the pandas upgrade. The current PR serves as the groundwork for that, and once the PR is completed, anyone who wishes to contribute can help with the pandas upgrade.
So after this PR is merged, anyone who wants to contribute to the pandas API on Spark can pick a ticket from the list at SPARK-42618.
FYI: I will address the skipped tests like this, since we can't change the behavior for now.
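A minimal sketch of what that skip pattern could look like in a unittest-style suite (the test case and the SPARK-XXXXX ticket number are hypothetical placeholders, not the actual PySpark test code):

```python
import unittest

import pandas as pd

# Parse the pandas major/minor version without extra dependencies.
_PANDAS_GE_2 = tuple(int(x) for x in pd.__version__.split(".")[:2]) >= (2, 0)


class IndexesTests(unittest.TestCase):
    # Skip tests whose expected output changed in pandas 2.0.0 until a
    # follow-up ticket adapts them; "SPARK-XXXXX" is a placeholder.
    @unittest.skipIf(
        _PANDAS_GE_2,
        "TODO(SPARK-XXXXX): behavior changed in pandas 2.0.0; enable in the follow-up.",
    )
    def test_astype(self):
        # The original assertion relying on pre-2.0 behavior would go here.
        pass
```

With this pattern the suite stays green on pandas 2.0.0, and each skip message points to the JIRA ticket that tracks re-enabling the test.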
@HyukjinKwon @dongjoon-hyun CI passed.
HyukjinKwon left a comment
Let's keep your approach consistent. If you're skipping, just skip all for now.
Merged to master. Regarding https://pandas.pydata.org/pandas-docs/version/2.0.2/whatsnew/v2.0.2.html, @itholic is working on the list. A separate PR will be created.
Nice. Thank you, @itholic and all!
Closes apache#41211 from itholic/pandas_2.
Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
What changes were proposed in this pull request?
This PR proposes to upgrade pandas to 2.0.0.
Why are the changes needed?
To support the latest pandas.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Addressed the existing UTs.