[SPARK-26566][PYTHON][SQL] Upgrade Apache Arrow to version 0.12.0 #23657
Conversation
Test build #101698 has finished for PR 23657 at commit
srowen
left a comment
Seems like a good idea to keep Arrow up to date for Spark 3
```
import pyarrow
from distutils.version import LooseVersion
# As of Arrow 0.12.0, date_as_object is True by default, see ARROW-3910
if LooseVersion(pyarrow.__version__) < LooseVersion("0.12.0") and type(data_type) == DateType:
```
Not a big deal, but does this get called a lot and would it be better to check if Arrow is < 0.12 once and save that?
Yup, it will be called a lot (per batch rather than per record, at least), and I also think it should ideally be called only once.
I'd roughly guess it has been done this way so far because we're not sure the versions on the worker side and the driver side match. As far as I know the two can differ, because we don't have a check for it (correct me if I am mistaken).
We should probably add a check like the Python version check between driver and worker, and have a few global checks. Of course, we could do that separately, I guess.
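For context, a minimal sketch of what evaluating this comparison once at import time could look like (not the actual code in this PR; the flag and helper names below are hypothetical):

```
import pyarrow
from distutils.version import LooseVersion

from pyspark.sql.types import DateType

# Hypothetical module-level flag: the comparison runs once at import time
# instead of once per batch.
_ARROW_PRE_0_12 = LooseVersion(pyarrow.__version__) < LooseVersion("0.12.0")


def _needs_date_conversion(data_type):
    # Per-batch call sites would then only consult the cached flag.
    return _ARROW_PRE_0_12 and type(data_type) == DateType
```

Either way the check runs per batch, so the saving would be minor.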
BTW, I was thinking about targeting an upgrade of the minimum PyArrow version for Spark 3.0.0, since the code is getting complicated with these if-else branches.
Yes, these are called per batch and wouldn't add noticeable overhead. I think these checks will be temporary and could be removed once we change the minimum version and as Arrow gets more mature. For now, it's probably best to just make sure these kinds of checks are easy to track.
> We should probably add a check like the Python version check between driver and worker, and have a few global checks. Of course, we could do that separately, I guess.
Yeah, we could do this, but it might not really be too big of a deal. I think it will eventually be like Pandas versions: if they are close there will probably be no issues, but major versions might not be completely compatible.
felixcheung
left a comment
cool!
Will our Jenkins run Arrow 0.12?
For PyArrow 0.12.0, I don't think so; it will run with the Arrow 0.12.0 + PyArrow 0.8.0 combination. It needs a manual upgrade. Currently, IIRC, the version is 0.8.0 to test the minimum PyArrow version. I was thinking about bumping up the minimum PyArrow version in Spark 3.0 so that we can reduce such if-else branches and the overhead they bring.
probably a good idea - arrow moves quickly; 0.10 is kinda "dated"
Sounds good to me, though we might want to double-check what dependencies 0.12.0 requires.
Test build #101773 has finished for PR 23657 at commit
retest this please
Test build #101783 has finished for PR 23657 at commit
Test build #101784 has finished for PR 23657 at commit
Merged to master. I need to test SPARK-26759 against the Arrow upgrade as well.
Thanks @HyukjinKwon @felixcheung and @srowen!
## What changes were proposed in this pull request?

Upgrade Apache Arrow to version 0.12.0. This includes the Java artifacts and fixes to enable usage with pyarrow 0.12.0.

Version 0.12.0 includes the following selected fixes/improvements relevant to Spark users:

* Safe cast fails from numpy float64 array with nans to integer, ARROW-4258
* Java, Reduce heap usage for variable width vectors, ARROW-4147
* Binary identity cast not implemented, ARROW-4101
* pyarrow open_stream deprecated, use ipc.open_stream, ARROW-4098
* conversion to date object no longer needed, ARROW-3910
* Error reading IPC file with no record batches, ARROW-3894
* Signed to unsigned integer cast yields incorrect results when type sizes are the same, ARROW-3790
* from_pandas gives incorrect results when converting floating point to bool, ARROW-3428
* Import pyarrow fails if scikit-learn is installed from conda (boost-cpp / libboost issue), ARROW-3048
* Java update to official Flatbuffers version 1.9.0, ARROW-3175

Complete list [here](https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20in%20(Resolved%2C%20Closed)%20AND%20fixVersion%20%3D%200.12.0)

PySpark requires the following fixes to work with PyArrow 0.12.0:

* Encrypted pyspark worker fails due to ChunkedStream missing closed property
* pyarrow now converts dates as objects by default, which causes an error because the type is assumed to be datetime64
* ArrowTests fails due to difference in raised error message
* pyarrow.open_stream deprecated
* tests fail because groupby adds index column with duplicate name

## How was this patch tested?

Ran unit tests with pyarrow versions 0.8.0, 0.10.0, 0.11.1, 0.12.0

Closes apache#23657 from BryanCutler/arrow-upgrade-012.

Authored-by: Bryan Cutler <cutlerb@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
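One of the PySpark fixes listed above deals with the pyarrow.open_stream deprecation (ARROW-4098). As a rough illustration (not the exact change made in this PR), compatible code can gate on the installed version; the helper name below is hypothetical:

```
import pyarrow as pa
from distutils.version import LooseVersion


def open_arrow_stream(source):
    # Hypothetical helper: pyarrow.open_stream was deprecated in 0.12.0
    # in favor of pyarrow.ipc.open_stream.
    if LooseVersion(pa.__version__) >= LooseVersion("0.12.0"):
        return pa.ipc.open_stream(source)
    return pa.open_stream(source)
```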
Currently only Spark 3.0 is supported; can this support Spark 2.4?
Nope, this change only goes into Spark 3.0, but I think Arrow 0.12.x is still supported in Spark 2.4 too.
The following error occurred in Spark 2.4:
`Expected schema message in stream, was null or length 0`
```
import pandas as pd
from pyspark.sql import Row

df = pd.DataFrame([1, 2, 3, 4, 5, 6, 8, 9], columns=['test']).astype(str)
test_data = sparkSession.createDataFrame(df)

def mapfunc(row):
    value = 1 + float(row['test'])
    return Row(test1=value)

aa = test_data.rdd.map(mapfunc).toDF().toPandas()
```


|
|
@melin I tried out your example with pyspark branch-2.4, pyarrow 0.12.1 and pandas 0.24.0 and did not reproduce the error. Your pandas version is a bit old, but I don't think that is the problem. Does the error happen in local mode or only in a cluster?
@melin, please file a JIRA or ask on the mailing list separately.
PyArrow 0.12.1 with Arrow jar 0.10 runs correctly. PyArrow 0.12.1 with Arrow jar 0.12 produces this exception.
In Spark 2.4, the jar should be 0.10. Can we move this topic to the mailing list, @melin?
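As a rough, hypothetical way to spot such a mismatch, one can compare the Python-side pyarrow version with the Arrow Java jars bundled in a Spark installation (assuming SPARK_HOME is set and the usual arrow-*-&lt;version&gt;.jar naming):

```
import glob
import os

import pyarrow

print("pyarrow:", pyarrow.__version__)

# List the Arrow Java artifacts shipped under $SPARK_HOME/jars.
spark_home = os.environ["SPARK_HOME"]
for jar in sorted(glob.glob(os.path.join(spark_home, "jars", "arrow-*.jar"))):
    print("jar:", os.path.basename(jar))
```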