
[SPARK-28041][PYTHON] Increase minimum supported Pandas to 0.23.2 #24867

Conversation

@BryanCutler (Member)

What changes were proposed in this pull request?

This increases the minimum supported version of Pandas to 0.23.2. Using a lower version will raise the error `Pandas >= 0.23.2 must be installed; however, your version was 0.XX`. Also, a workaround for using pyarrow with Pandas 0.19.2 was removed.
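A minimal sketch of the version gate this describes; it is illustrative only, not the exact helper PySpark ships (the helper name and the simple version parsing below are assumptions):

```python
# Sketch of a Pandas version gate: reject Pandas older than 0.23.2 with a
# message that includes the detected version. Illustrative, not PySpark's code.

minimum_pandas_version = "0.23.2"


def require_minimum_pandas_version():
    """Raise ImportError if Pandas is missing or older than the minimum."""
    try:
        import pandas
        have_pandas = True
    except ImportError:
        have_pandas = False

    def release_tuple(version):
        # "0.23.2" -> (0, 23, 2); handles plain "X.Y.Z" release strings only.
        return tuple(int(part) for part in version.split(".")[:3] if part.isdigit())

    if not have_pandas or release_tuple(pandas.__version__) < release_tuple(minimum_pandas_version):
        found = pandas.__version__ if have_pandas else "not installed"
        raise ImportError("Pandas >= %s must be installed; however, "
                          "your version was %s." % (minimum_pandas_version, found))
```

Calling such a check before any Pandas-dependent code path would produce the error message quoted above when an older Pandas is found.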

How was this patch tested?

Existing Tests

    wrong_schema = StructType(fields)
    with QuietTest(self.sc):
-       with self.assertRaisesRegexp(Exception, ".*cast.*[s|S]tring.*timestamp.*"):
+       with self.assertRaisesRegexp(Exception, "integer.*required.*got.*str"):
@BryanCutler (Member Author)

Removing the workaround changed this error message, and it seemed clearer for the test to swap an int field instead of a timestamp.
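A hypothetical illustration of what swapping an int field means here (the column names and sample data are made up, not the actual test code):

```python
import pandas as pd
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Data: column "a" holds strings, column "b" holds integers.
pdf = pd.DataFrame({"a": ["x", "y"], "b": [1, 2]})

# Schema with the field types swapped relative to the data.
fields = [
    StructField("a", IntegerType(), True),  # wrong: the data is strings
    StructField("b", StringType(), True),   # wrong: the data is integers
]
wrong_schema = StructType(fields)

# spark.createDataFrame(pdf, schema=wrong_schema) would then be expected to
# fail with a message matching "integer.*required.*got.*str", which is what
# the updated assertRaisesRegexp pattern checks for.
```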

@BryanCutler (Member Author)

I thought we were testing in Jenkins with Pandas 0.23.2, but from this comment #24298 (comment), it looks like 0.24.2.

I think 0.24.2 might be too new to use as a minimum supported version, but that makes choosing 0.23.2 seem kind of arbitrary. What do others think would be a good version to use?
cc @HyukjinKwon @viirya @ueshin @felixcheung @shaneknapp ?

@HyukjinKwon (Member)

0.23.2 sounds fine to me, but can we quickly discuss this one on the dev mailing list?

@SparkQA commented Jun 14, 2019

Test build #106492 has finished for PR 24867 at commit cfaa0a0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung (Member) left a comment

Probably good to keep the minimum version lower, at 0.23.2 rather than 0.24.2.

@viirya (Member) commented Jun 14, 2019

Agreed. 0.23.2 sounds good.

@dongjoon-hyun (Member) left a comment

Nice, @BryanCutler. Could you update the `pandas_udf` function note together?

Also, can we add an item to the upgrade migration doc from 2.4 to 3.0 explicitly? For 0.19.2, we had one line for this under "Upgrading From Spark SQL 2.3 to 2.4".

@BryanCutler (Member Author)

> Could you update the `pandas_udf` function note together?

Thanks @dongjoon-hyun, that one references the table of Pandas/PyArrow conversions, so the data would have to be rerun. @HyukjinKwon, would you be able to do this as a follow-up?

> Can we add an item to the upgrade migration doc from 2.4 to 3.0 explicitly? For 0.19.2, we had one line for this under "Upgrading From Spark SQL 2.3 to 2.4".

Yup, good idea. I'll add that.

- Since Spark 3.0, PySpark requires a Pandas version of 0.23.2 or higher to use Pandas related functionality, such as `toPandas`, `createDataFrame` from Pandas DataFrame, etc.

- Since Spark 3.0, PySpark requires a PyArrow version of 0.12.1 or higher to use PyArrow related functionality, such as `pandas_udf`, `toPandas` and `createDataFrame` with "spark.sql.execution.arrow.enabled=true", etc.
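For context, a minimal usage sketch of the Pandas/PyArrow paths these notes refer to, assuming a local Spark 3.0 session (the app name and sample data are illustrative):

```python
import pandas as pd
from pyspark.sql import SparkSession

# Arrow-backed conversions require Pandas >= 0.23.2 and PyArrow >= 0.12.1
# as of Spark 3.0 (see the migration notes above).
spark = (SparkSession.builder
         .appName("arrow-conversion-sketch")
         .config("spark.sql.execution.arrow.enabled", "true")
         .getOrCreate())

# createDataFrame from a Pandas DataFrame.
pdf = pd.DataFrame({"id": [1, 2, 3], "value": [0.1, 0.2, 0.3]})
df = spark.createDataFrame(pdf)

# toPandas() converts back to Pandas; with the config above it also goes
# through Arrow.
result = df.filter(df.id > 1).toPandas()
print(result)
```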

@BryanCutler (Member Author) commented Jun 14, 2019

Added a note about the minimum PyArrow version. Further down, at https://github.com/apache/spark/pull/24867/files#diff-3f19ec3d15dcd8cd42bb25dde1c5c1a9L58, we talk about safe casting, which I think is still relevant, so I won't modify it unless it seems confusing to talk about versions < 0.12.1?

@SparkQA commented Jun 14, 2019

Test build #106531 has finished for PR 24867 at commit 13b6ed4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member)

Yup, I will update it as a follow-up.

@dongjoon-hyun (Member) left a comment

+1, LGTM.

@HyukjinKwon (Member)

Looks good to me too, but can we hold it for a few days just to let more people read the discussion (since it's the weekend now)?

@HyukjinKwon (Member)

Merged to master.

@BryanCutler deleted the pyspark-increase-min-pandas-SPARK-28041 branch June 20, 2019 21:10
rshkv pushed a commit to palantir/spark that referenced this pull request May 23, 2020
This increases the minimum supported version of Pandas to 0.23.2. Using a lower version will raise an error `Pandas >= 0.23.2 must be installed; however, your version was 0.XX`. Also, a workaround for using pyarrow with Pandas 0.19.2 was removed.

Existing Tests

Closes apache#24867 from BryanCutler/pyspark-increase-min-pandas-SPARK-28041.

Authored-by: Bryan Cutler <cutlerb@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
@rshkv mentioned this pull request May 23, 2020
rshkv pushed a commit to palantir/spark that referenced this pull request Jun 5, 2020