
[SPARK-28041][PYTHON] Increase minimum supported Pandas to 0.23.2 #24867

Conversation

@BryanCutler (Member)

What changes were proposed in this pull request?

This increases the minimum supported version of Pandas to 0.23.2. Using a lower version will raise the error `Pandas >= 0.23.2 must be installed; however, your version was 0.XX`. Also, a workaround for using pyarrow with Pandas 0.19.2 was removed.
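A minimal sketch of the version gate this describes; it is illustrative only, not the exact helper PySpark ships (the helper name and the simple version parsing below are assumptions):

```python
# Sketch of a Pandas version gate: reject Pandas older than 0.23.2 with a
# message that includes the detected version. Illustrative, not PySpark's code.

minimum_pandas_version = "0.23.2"


def require_minimum_pandas_version():
    """Raise ImportError if Pandas is missing or older than the minimum."""
    try:
        import pandas
        have_pandas = True
    except ImportError:
        have_pandas = False

    def release_tuple(version):
        # "0.23.2" -> (0, 23, 2); handles plain "X.Y.Z" release strings only.
        return tuple(int(part) for part in version.split(".")[:3] if part.isdigit())

    if not have_pandas or release_tuple(pandas.__version__) < release_tuple(minimum_pandas_version):
        found = pandas.__version__ if have_pandas else "not installed"
        raise ImportError("Pandas >= %s must be installed; however, "
                          "your version was %s." % (minimum_pandas_version, found))
```

Calling such a check before any Pandas-dependent code path would produce the error message quoted above when an older Pandas is found.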

How was this patch tested?

Existing Tests

    wrong_schema = StructType(fields)
    with QuietTest(self.sc):
-       with self.assertRaisesRegexp(Exception, ".*cast.*[s|S]tring.*timestamp.*"):
+       with self.assertRaisesRegexp(Exception, "integer.*required.*got.*str"):
@BryanCutler (Member Author)

Removing the workaround changed this error message, and it seemed clearer for the test to swap an int field instead of a timestamp.
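A hypothetical illustration of what swapping an int field means here (the column names and sample data are made up, not the actual test code):

```python
import pandas as pd
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Data: column "a" holds strings, column "b" holds integers.
pdf = pd.DataFrame({"a": ["x", "y"], "b": [1, 2]})

# Schema with the field types swapped relative to the data.
fields = [
    StructField("a", IntegerType(), True),  # wrong: the data is strings
    StructField("b", StringType(), True),   # wrong: the data is integers
]
wrong_schema = StructType(fields)

# spark.createDataFrame(pdf, schema=wrong_schema) would then be expected to
# fail with a message matching "integer.*required.*got.*str", which is what
# the updated assertRaisesRegexp pattern checks for.
```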

@BryanCutler (Member Author)

I thought we were testing in Jenkins with Pandas 0.23.2, but from this comment #24298 (comment), it looks like 0.24.2.

I think 0.24.2 might be too new to use as a minimum supported version, but that makes choosing 0.23.2 seem kind of arbitrary. What do others think would be a good version to use?
cc @HyukjinKwon @viirya @ueshin @felixcheung @shaneknapp ?

@HyukjinKwon (Member)

0.23.2 sounds fine to me, but can we quickly discuss this one on the dev mailing list?

@SparkQA commented Jun 14, 2019

Test build #106492 has finished for PR 24867 at commit cfaa0a0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung (Member) left a comment

Probably good to keep the minimum version lower, at 0.23.2 rather than 0.24.2.

@viirya (Member) commented Jun 14, 2019

Agreed. 0.23.2 sounds good.

@dongjoon-hyun (Member) left a comment

Nice, @BryanCutler. Could you update the `pandas_udf` function note together?

Also, can we add an item to the upgrade migration doc from 2.4 to 3.0 explicitly? For 0.19.2, we had one line for this under "Upgrading From Spark SQL 2.3 to 2.4".

@BryanCutler (Member Author)

> Could you update the `pandas_udf` function note together?

Thanks @dongjoon-hyun, that one references the table of Pandas/PyArrow conversions, so the data would have to be rerun. @HyukjinKwon, would you be able to do this as a follow-up?

> Can we add an item to the upgrade migration doc from 2.4 to 3.0 explicitly? For 0.19.2, we had one line for this under "Upgrading From Spark SQL 2.3 to 2.4".

Yup, good idea. I'll add that.

- Since Spark 3.0, PySpark requires a Pandas version of 0.23.2 or higher to use Pandas related functionality, such as `toPandas`, `createDataFrame` from Pandas DataFrame, etc.

- Since Spark 3.0, PySpark requires a PyArrow version of 0.12.1 or higher to use PyArrow related functionality, such as `pandas_udf`, `toPandas` and `createDataFrame` with "spark.sql.execution.arrow.enabled=true", etc.
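For context, a minimal usage sketch of the Pandas/PyArrow paths these notes refer to, assuming a local Spark 3.0 session (the app name and sample data are illustrative):

```python
import pandas as pd
from pyspark.sql import SparkSession

# Arrow-backed conversions require Pandas >= 0.23.2 and PyArrow >= 0.12.1
# as of Spark 3.0 (see the migration notes above).
spark = (SparkSession.builder
         .appName("arrow-conversion-sketch")
         .config("spark.sql.execution.arrow.enabled", "true")
         .getOrCreate())

# createDataFrame from a Pandas DataFrame.
pdf = pd.DataFrame({"id": [1, 2, 3], "value": [0.1, 0.2, 0.3]})
df = spark.createDataFrame(pdf)

# toPandas() converts back to Pandas; with the config above it also goes
# through Arrow.
result = df.filter(df.id > 1).toPandas()
print(result)
```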

@BryanCutler (Member Author) commented Jun 14, 2019

Added a note about the minimum PyArrow version. Further down, at https://github.com/apache/spark/pull/24867/files#diff-3f19ec3d15dcd8cd42bb25dde1c5c1a9L58, we talk about safe casting, which I think is still relevant, so I won't modify it unless it seems confusing to talk about versions < 0.12.1?

@SparkQA commented Jun 14, 2019

Test build #106531 has finished for PR 24867 at commit 13b6ed4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member)

Yup, I will update it as a follow-up.

@dongjoon-hyun (Member) left a comment

+1, LGTM.

@HyukjinKwon (Member)

Looks good to me too, but can we hold it for a few days just to let more people read the discussion (since it's the weekend now)?

@HyukjinKwon (Member)

Merged to master.

@BryanCutler deleted the pyspark-increase-min-pandas-SPARK-28041 branch June 20, 2019 21:10
rshkv pushed a commit to palantir/spark that referenced this pull request May 23, 2020
This increases the minimum supported version of Pandas to 0.23.2. Using a lower version will raise an error `Pandas >= 0.23.2 must be installed; however, your version was 0.XX`. Also, a workaround for using pyarrow with Pandas 0.19.2 was removed.

Existing Tests

Closes apache#24867 from BryanCutler/pyspark-increase-min-pandas-SPARK-28041.

Authored-by: Bryan Cutler <cutlerb@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
@rshkv mentioned this pull request May 23, 2020
rshkv pushed a commit to palantir/spark that referenced this pull request Jun 5, 2020