
Conversation

@BryanCutler (Member) commented Jan 26, 2019

What changes were proposed in this pull request?

Upgrade Apache Arrow to version 0.12.0. This includes the Java artifacts and the fixes needed to enable usage with pyarrow 0.12.0.

Version 0.12.0 includes the following selected fixes/improvements relevant to Spark users:

  • Safe cast fails from numpy float64 array with nans to integer, ARROW-4258
  • Java, Reduce heap usage for variable width vectors, ARROW-4147
  • Binary identity cast not implemented, ARROW-4101
  • pyarrow open_stream deprecated, use ipc.open_stream, ARROW-4098
  • conversion to date object no longer needed, ARROW-3910
  • Error reading IPC file with no record batches, ARROW-3894
  • Signed to unsigned integer cast yields incorrect results when type sizes are the same, ARROW-3790
  • from_pandas gives incorrect results when converting floating point to bool, ARROW-3428
  • Import pyarrow fails if scikit-learn is installed from conda (boost-cpp / libboost issue), ARROW-3048
  • Java update to official Flatbuffers version 1.9.0, ARROW-3175

complete list [here](https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20in%20(Resolved%2C%20Closed)%20AND%20fixVersion%20%3D%200.12.0)

PySpark requires the following fixes to work with PyArrow 0.12.0:

  • Encrypted pyspark worker fails due to ChunkedStream missing a closed property
  • pyarrow now converts dates as objects by default, which causes an error because the type is assumed to be datetime64
  • ArrowTests fail due to a difference in the raised error message
  • pyarrow.open_stream is deprecated in favor of pyarrow.ipc.open_stream (see the sketch after this list)
  • tests fail because groupby adds an index column with a duplicate name
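
As an example of the open_stream change, a minimal sketch of reading an IPC stream through the new namespace (the helper name is hypothetical, not the PR's code):

```python
import pyarrow as pa

def read_record_batches(source):
    """Yield record batches from an Arrow IPC stream."""
    # pyarrow.open_stream was deprecated in 0.12.0 in favor of the
    # pyarrow.ipc namespace (ARROW-4098).
    reader = pa.ipc.open_stream(source)
    for batch in reader:
        yield batch
```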

How was this patch tested?

Ran unit tests with pyarrow versions 0.8.0, 0.10.0, 0.11.1, and 0.12.0.

@SparkQA commented Jan 26, 2019

Test build #101698 has finished for PR 23657 at commit a0497ab.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen (Member) left a comment

Seems like a good idea to keep Arrow up to date for Spark 3

import pyarrow
from distutils.version import LooseVersion
# As of Arrow 0.12.0, date_as_object is True by default, see ARROW-3910
if LooseVersion(pyarrow.__version__) < LooseVersion("0.12.0") and type(data_type) == DateType:
    return series.dt.date
else:
    return series
Member:

Not a big deal, but does this get called a lot and would it be better to check if Arrow is < 0.12 once and save that?
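
Something like the following hypothetical sketch would do the comparison once at import time and reuse it on every call (the module constant and the surrounding function shape are illustrative, not the PR's code):

```python
import pyarrow
from distutils.version import LooseVersion
from pyspark.sql.types import DateType

# Computed once at import instead of on every call.
_ARROW_PRE_0_12 = LooseVersion(pyarrow.__version__) < LooseVersion("0.12.0")

def _check_series_convert_date(series, data_type):
    # Older pyarrow hands back datetime64 values, so cast to datetime.date.
    if _ARROW_PRE_0_12 and type(data_type) == DateType:
        return series.dt.date
    return series
```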

Member:

Yup, it will be called a lot (per batch, at least, rather than per record), and I also think it should ideally be called once.

My rough guess is that it has been done this way so far because we're not sure about the versions on the worker side and the driver side. For instance, as far as I know, the two versions can differ because we don't have a check for it (correct me if I am mistaken).

We should probably add a check like the Python version check between driver and worker, and have a few global checks. Of course, we could do that separately, I guess.
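
A hypothetical sketch of such a handshake, modeled loosely on the existing driver/worker Python version check (the environment variable name is made up for illustration):

```python
import os
import pyarrow

# The driver would export its pyarrow version under a hypothetical
# environment variable; each worker compares it against its own at startup.
driver_version = os.environ.get("PYSPARK_DRIVER_PYARROW_VERSION")
if driver_version and driver_version != pyarrow.__version__:
    raise RuntimeError(
        "pyarrow in worker has version %s, but the driver has version %s"
        % (pyarrow.__version__, driver_version))
```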

Member:

BTW, I was thinking about targeting an upgrade of the minimum PyArrow version in Spark 3.0.0, since the code is getting complicated with these if-else branches.

@BryanCutler (Member, Author):

Yes, these are called per batch and wouldn't add noticeable overhead. I think these checks will be temporary and could be removed once we change the minimum version and as Arrow matures. For now, it's probably best to just make sure these kinds of checks are easy to track.

@BryanCutler (Member, Author):

> We should probably add a check like the Python version check between driver and worker, and have a few global checks. Of course, we could do that separately, I guess.

Yeah, we could do this, but it might not really be too big of a deal. I think eventually it will be sort of like Pandas versions: if they are close, there will probably be no issues, but different major versions might not be completely compatible.

@felixcheung (Member) left a comment

cool!

@gatorsmile (Member)

Will our Jenkins run Arrow 0.12?

@HyukjinKwon (Member) commented Jan 28, 2019

For PyArrow 0.12.0, I don't think so, but it will run with the Arrow 0.12.0 + PyArrow 0.8.0 combination. It needs a manual upgrade. Currently, IIRC, the version is 0.8.0 so that we test against the minimum supported PyArrow version.

I was thinking about bumping up the minimum PyArrow version in Spark 3.0 so that we can reduce such if-else branches and the associated overhead.

@felixcheung (Member)

probably a good idea - arrow moves quickly; 0.10 is kinda "dated"

@BryanCutler (Member, Author)

> I was thinking about bumping up the minimum PyArrow version in Spark 3.0 so that we can reduce such if-else branches and the associated overhead.

Sounds good to me; we might want to double-check what dependencies 0.12.0 requires, though.

@SparkQA commented Jan 29, 2019

Test build #101773 has finished for PR 23657 at commit 62e803c.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member)

retest this please

@SparkQA commented Jan 29, 2019

Test build #101783 has finished for PR 23657 at commit 62e803c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jan 29, 2019

Test build #101784 has finished for PR 23657 at commit 62e803c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member)

Merged to master.

I need to test SPARK-26759 against the Arrow upgrade as well.

@asfgit closed this in 16990f9 on Jan 29, 2019
@BryanCutler (Member, Author)

Thanks @HyukjinKwon, @felixcheung, and @srowen!

@BryanCutler deleted the arrow-upgrade-012 branch on January 29, 2019, 19:49
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019

Closes apache#23657 from BryanCutler/arrow-upgrade-012.

Authored-by: Bryan Cutler <cutlerb@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
@melin commented May 9, 2019

Currently only Spark 3.0 is supported; can this support Spark 2.4?

@HyukjinKwon (Member)

Nope, this change only goes into Spark 3.0, but I think Arrow 0.12.x is still supported in Spark 2.4 too.

@melin commented May 14, 2019 via email

@BryanCutler (Member, Author)

@melin I tried out your example with pyspark branch-2.4, pyarrow 0.12.1 and pandas 0.24.0 and did not reproduce the error. Your pandas version is a bit old, but I don't think that is the problem. Does the error happen in local mode or only in a cluster?

@HyukjinKwon (Member)

@melin, please file a JIRA or ask it to the mailing list separately.

@melin commented May 15, 2019 via email

@HyukjinKwon (Member)

In Spark 2.4, the jar should be 0.10. Can we move this topic to the mailing list, @melin? You're talking about something orthogonal to this change.

martinxsliu pushed a commit to opendoor-labs/spark that referenced this pull request Oct 27, 2021

martinxsliu pushed a commit to opendoor-labs/spark that referenced this pull request Nov 26, 2021