
[SPARK-29376][SQL][PYTHON] Upgrade Apache Arrow to version 0.15.1 #26133

Closed

Conversation

BryanCutler
Member

@BryanCutler BryanCutler commented Oct 15, 2019

What changes were proposed in this pull request?

Upgrade Apache Arrow to version 0.15.1. This includes the Java artifacts and also increases the minimum required version of PyArrow.

Versions 0.12.0 through 0.15.1 include the following selected fixes/improvements relevant to Spark users:

  • ARROW-6898 - [Java] Fix potential memory leak in ArrowWriter and several test classes
  • ARROW-6874 - [Python] Memory leak in Table.to_pandas() when conversion to object dtype
  • ARROW-5579 - [Java] shade flatbuffer dependency
  • ARROW-5843 - [Java] Improve the readability and performance of BitVectorHelper#getNullCount
  • ARROW-5881 - [Java] Provide functionalities to efficiently determine if a validity buffer has completely 1 bits/0 bits
  • ARROW-5893 - [C++] Remove arrow::Column class from C++ library
  • ARROW-5970 - [Java] Provide pointer to Arrow buffer
  • ARROW-6070 - [Java] Avoid creating new schema before IPC sending
  • ARROW-6279 - [Python] Add Table.slice method or allow slices in __getitem__
  • ARROW-6313 - [Format] Tracking for ensuring flatbuffer serialized values are aligned in stream/files.
  • ARROW-6557 - [Python] Always return pandas.Series from Array/ChunkedArray.to_pandas, propagate field names to Series from RecordBatch, Table
  • ARROW-2015 - [Java] Use Java Time and Date APIs instead of JodaTime
  • ARROW-1261 - [Java] Add container type for Map logical type
  • ARROW-1207 - [C++] Implement Map logical type

Changelog can be seen at https://arrow.apache.org/release/0.15.0.html

Why are the changes needed?

Upgrade to get bug fixes, improvements, and maintain compatibility with future versions of PyArrow.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing tests, manually tested with Python 3.7, 3.8
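The bump in the minimum required PyArrow version surfaces on the Python side as an import-time check. Below is a minimal illustrative sketch of such a gate, in the spirit of pyspark's `require_minimum_pyarrow_version` helper; the names and error wording are approximations, not necessarily this PR's exact code:

```python
# Illustrative sketch of a minimum-version gate for pyarrow; names and the
# error wording are approximations, not this PR's exact code.
from distutils.version import LooseVersion

MINIMUM_PYARROW_VERSION = "0.15.1"

def require_minimum_pyarrow_version():
    """Raise ImportError if pyarrow is missing or older than the minimum."""
    import pyarrow
    if LooseVersion(pyarrow.__version__) < LooseVersion(MINIMUM_PYARROW_VERSION):
        raise ImportError(
            "PyArrow >= %s must be installed; your version is %s."
            % (MINIMUM_PYARROW_VERSION, pyarrow.__version__)
        )
```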

@SparkQA

SparkQA commented Oct 15, 2019

Test build #112126 has finished for PR 26133 at commit 25763c1.

  • This patch fails build dependency tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@BryanCutler
Member Author

I'm marking this as a WIP because we need to decide if we will increase the minimum required version of PyArrow to 0.15.0 as well. The main issue is that there is a change in the Arrow binary IPC format.

If we do not increase the minimum pyarrow, it will not work by default. There will need to be some configuration set so that Python and Java use the same IPC format, which can be done in two ways (both sketched after this comment):

  1. Have both Python and Java write the legacy IPC format by default. This would not require too many additional changes, but it is not ideal because at some point the ability to write the legacy format will be dropped from pyarrow and this option will no longer work.

  2. Detect the version of pyarrow being used and configure the Java writers accordingly. This would keep compatibility with all pyarrow <= 0.14.1 and future versions too. The problem is that it could be messy to dynamically configure the Java writers for pandas_udfs, because the data is written before we even import pyarrow to tell the version.

If we increase the minimum required version of PyArrow, then all Python and Java writers will use the default settings, which is the cleanest change and will be compatible with future versions of PyArrow. Also, I believe SparkR will require Arrow 0.15.0 as well, so this would keep Python and R in line.

NOTE: this change in the Arrow binary IPC format only affects temporary binary data exchanged between Python <-> Java and is not present in any long-term storage.
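For illustration, here is how the two options look from the Python side. The environment variable is the one pyarrow 0.15 provides for emitting the pre-0.15 format; the version probe stands in for option 2's detection (a sketch, not the PR's code):

```python
# Option 1 (sketch): ask pyarrow >= 0.15 to write the legacy IPC format so
# that older Arrow Java readers still understand it. This must be set before
# any batches are serialized.
import os
os.environ["ARROW_PRE_0_15_IPC_FORMAT"] = "1"

# Option 2 (sketch): detect the installed pyarrow version and configure the
# Java writers accordingly. The catch noted above: for pandas_udfs the Java
# side writes data before Python has imported pyarrow at all.
from distutils.version import LooseVersion
import pyarrow

uses_new_ipc_format = LooseVersion(pyarrow.__version__) >= LooseVersion("0.15.0")
```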

@BryanCutler
Member Author

My vote is to increase the minimum version of PyArrow since it will be the cleanest change, which is what this PR currently does. I can follow up with a message on the dev list if others agree. Before that, though, it looks like 0.15.1 is already in the works to fix some bugs, so we should wait a little to see if it includes anything important to Spark.

cc @HyukjinKwon @felixcheung @viirya @shaneknapp

@shaneknapp
Contributor

> My vote is to increase the minimum version of PyArrow since it will be the cleanest change, which is what this PR currently does. I can follow up with a message on the dev list if others agree. Before that, though, it looks like 0.15.1 is already in the works to fix some bugs, so we should wait a little to see if it includes anything important to Spark.
>
> cc @HyukjinKwon @felixcheung @viirya @shaneknapp

ACK

@HyukjinKwon
Member

Yeah, increasing the minimum version is fine with me too. I'll help double-check after the Spark Summit (end of this week or next week).

@viirya
Member

viirya commented Oct 16, 2019

I also think increasing the minimum version is fine. For option 1, we would need to address the issue again when the legacy format is dropped. Option 2 sounds too complicated.

@HyukjinKwon
Member

Sorry for my late response. +1 for increasing the minimum version, but would you mind if I ask you to send an email to the dev mailing list, @BryanCutler? I will reply on the thread right away as well.

From what I know, multiple major companies rely on Arrow, and I just wanted to make sure everybody is happy.

@HyukjinKwon
Member

cc @ueshin too

@BryanCutler
Member Author

Thanks everyone. It looks like 0.15.1 will be cut soon with some bug fixes, and we should probably use that as the minimum. @HyukjinKwon I will send a note to the mailing list as soon as I verify tests pass with it.

@SparkQA

SparkQA commented Nov 5, 2019

Test build #113229 has finished for PR 26133 at commit 3e267c3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

cc @MaxGekk since he is trying to remove joda-time, which this PR helps with.

@dongjoon-hyun
Member

Hi, @BryanCutler and @HyukjinKwon.
Can we remove [WIP]? 😄

@BryanCutler
Member Author

> Can we remove [WIP]?

The Arrow tests are being skipped right now. @shaneknapp, when you have time, could you do some of your magic and get a worker with pyarrow 0.15.1 to test this PR?

@BryanCutler BryanCutler changed the title [WIP][SPARK-29376][SQL][PYTHON] Upgrade Apache Arrow to version 0.15.0 [WIP][SPARK-29376][SQL][PYTHON] Upgrade Apache Arrow to version 0.15.1 Nov 5, 2019
@BryanCutler
Member Author

ping @shaneknapp do you think you can get pyarrow 0.15.1 installed on the python 3.6 worker envs?

@shaneknapp
Contributor

> ping @shaneknapp do you think you can get pyarrow 0.15.1 installed on the python 3.6 worker envs?

sure, i can probably get that done today!

@shaneknapp
Contributor

done!

@BryanCutler
Member Author

awesome thanks @shaneknapp !

@BryanCutler
Member Author

retest this please

@BryanCutler BryanCutler changed the title [WIP][SPARK-29376][SQL][PYTHON] Upgrade Apache Arrow to version 0.15.1 [SPARK-29376][SQL][PYTHON] Upgrade Apache Arrow to version 0.15.1 Nov 14, 2019
@SparkQA

SparkQA commented Nov 14, 2019

Test build #113809 has finished for PR 26133 at commit 4b8555c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 15, 2019

Test build #113820 has finished for PR 26133 at commit 4b8555c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@BryanCutler
Member Author

The pyarrow 0.15.1 tests have run for Python 3.6, so this is good to merge. I'll be able to do it in a little bit.

@dongjoon-hyun
Member

It's great! Thank you so much!

@BryanCutler
Member Author

Thanks @shaneknapp @HyukjinKwon @viirya and @dongjoon-hyun !

@dongjoon-hyun
Member

dongjoon-hyun commented Nov 15, 2019

Hi, All.

Unfortunately, Jenkins had been broken due to the Python 3 transition until today, and the first recovered run failed with this:

Previous exception in task:
sun.misc.Unsafe or java.nio.DirectByteBuffer.<init>(long, int) not available
	io.netty.util.internal.PlatformDependent.directBuffer(PlatformDependent.java:473)
	io.netty.buffer.NettyArrowBuf.getDirectBuffer(NettyArrowBuf.java:243)
	io.netty.buffer.NettyArrowBuf.nioBuffer(NettyArrowBuf.java:233)
	io.netty.buffer.ArrowBuf.nioBuffer(ArrowBuf.java:245)
	org.apache.arrow.vector.ipc.message.ArrowRecordBatch.computeBodyLength(ArrowRecordBatch.java:222)

@dongjoon-hyun
Member

Since it failed twice consecutively, I made a PR to verify the problem and the solution. The usual suspect is -Dio.netty.tryReflectionSetAccessible=true, but that requirement has been around for a while. I'm not sure why this suddenly fails with 0.15.1.
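For reference, the property is a JVM system option and has to reach both the driver and executor JVMs. One hedged way to pass it from user code (the PR mentioned above instead wires it into Spark's own test configuration):

```python
# Hedged sketch: passing Arrow's documented JDK 9+ requirement,
# -Dio.netty.tryReflectionSetAccessible=true, to the JVMs. Note that
# spark.driver.extraJavaOptions only takes effect if set before the driver
# JVM starts (e.g. spark-submit --driver-java-options "..."), not from code
# inside an already-running driver.
from pyspark.sql import SparkSession

NETTY_FLAG = "-Dio.netty.tryReflectionSetAccessible=true"

spark = (
    SparkSession.builder
    .config("spark.executor.extraJavaOptions", NETTY_FLAG)
    .getOrCreate()
)
```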

@dongjoon-hyun
Member

cc @srowen

@srowen
Member

srowen commented Nov 16, 2019

Hm, looks like https://issues.apache.org/jira/browse/ARROW-5412 and your PR does set the applicable system property. Let's see if that passes. Maybe @BryanCutler knows more. Let me continue on your PR.

@dongjoon-hyun
Member

dongjoon-hyun commented Nov 16, 2019

Another thing is AppVeyor. AppVeyor seems to fail continuously on JDK8 due to OOM. I am seeing this failure in some unrelated PRs now.

Warnings -----------------------------------------------------------------------
1. createDataFrame/collect Arrow optimization (@test_sparkSQL_arrow.R#39) - createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.sparkr.enabled' is set to true; however, failed, attempting non-optimization. Reason: Error in handleErrors(returnStatus, conn): java.lang.OutOfMemoryError: Java heap space
	at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)

Since this is a JDK8 issue, I created another WIP PR to verify it.

This might be an Arrow version mismatch between 0.14 and 0.15; AppVeyor still installs Arrow 0.14.

dongjoon-hyun added a commit that referenced this pull request Nov 16, 2019
… Arrow on JDK9+

### What changes were proposed in this pull request?

This PR aims to add `io.netty.tryReflectionSetAccessible=true` to the testing configuration for JDK11 because this is an officially documented requirement of Apache Arrow.

Apache Arrow community documented this requirement at `0.15.0` ([ARROW-6206](apache/arrow#5078)).
> #### For java 9 or later, should set "-Dio.netty.tryReflectionSetAccessible=true".
> This fixes `java.lang.UnsupportedOperationException: sun.misc.Unsafe or java.nio.DirectByteBuffer.(long, int) not available`. thrown by netty.

### Why are the changes needed?

After ARROW-3191, the Arrow Java library requires the property `io.netty.tryReflectionSetAccessible` to be set to true for JDK >= 9. After #26133, the JDK11 Jenkins jobs seem to fail.

- https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-3.2-jdk-11/676/
- https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-3.2-jdk-11/677/
- https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-3.2-jdk-11/678/

```
Previous exception in task:
sun.misc.Unsafe or java.nio.DirectByteBuffer.<init>(long, int) not available
io.netty.util.internal.PlatformDependent.directBuffer(PlatformDependent.java:473)
io.netty.buffer.NettyArrowBuf.getDirectBuffer(NettyArrowBuf.java:243)
io.netty.buffer.NettyArrowBuf.nioBuffer(NettyArrowBuf.java:233)
io.netty.buffer.ArrowBuf.nioBuffer(ArrowBuf.java:245)
org.apache.arrow.vector.ipc.message.ArrowRecordBatch.computeBodyLength(ArrowRecordBatch.java:222)
```

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Pass the Jenkins with JDK11.

Closes #26552 from dongjoon-hyun/SPARK-ARROW-JDK11.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
@dongjoon-hyun
Member

dongjoon-hyun commented Nov 16, 2019

Hi, @shaneknapp, @felixcheung, @shivaram, @BryanCutler, @HyukjinKwon, @srowen.

It turns out that our Jenkins jobs (including the PR builder) ignore the Arrow tests. Arrow 0.15 has breaking R API changes at ARROW-5505, and AppVeyor was the only environment running the SparkR Arrow tests, but it's broken now.

Jenkins

Skipped ------------------------------------------------------------------------
1. createDataFrame/collect Arrow optimization (@test_sparkSQL_arrow.R#25) - arrow not installed

2. createDataFrame/collect Arrow optimization - many partitions (partition order test) (@test_sparkSQL_arrow.R#48) - arrow not installed

3. createDataFrame/collect Arrow optimization - type specification (@test_sparkSQL_arrow.R#64) - arrow not installed

4. dapply() Arrow optimization (@test_sparkSQL_arrow.R#94) - arrow not installed

5. dapply() Arrow optimization - type specification (@test_sparkSQL_arrow.R#134) - arrow not installed

6. dapply() Arrow optimization - type specification (date and timestamp) (@test_sparkSQL_arrow.R#169) - arrow not installed

7. gapply() Arrow optimization (@test_sparkSQL_arrow.R#188) - arrow not installed

8. gapply() Arrow optimization - type specification (@test_sparkSQL_arrow.R#237) - arrow not installed

9. gapply() Arrow optimization - type specification (date and timestamp) (@test_sparkSQL_arrow.R#276) - arrow not installed

10. Arrow optimization - unsupported types (@test_sparkSQL_arrow.R#297) - arrow not installed

11. sparkJars tag in SparkContext (@test_Windows.R#22) - This test is only for Windows, skipped

For AppVeyor, it currently fails with OOM due to the protocol mismatch between versions. After I matched the versions, we finally hit ARROW-5505.

In short, SparkR Arrow integration is broken now. We need to update it to the new Arrow R interface.

I'll follow-up with a few PRs.

@BryanCutler
Member Author

That's strange, I didn't see any failure in the checks here. Was the AppVeyor check not running at the time?

@dongjoon-hyun
Member

dongjoon-hyun commented Nov 16, 2019

Yes. AppVeyor has a trigger rule, and it seems too specific to have caught this PR.

dongjoon-hyun added a commit that referenced this pull request Nov 17, 2019
### What changes were proposed in this pull request?

[[SPARK-29376] Upgrade Apache Arrow to version 0.15.1](#26133) upgrades to Arrow 0.15 for Scala/Java/Python. This PR aims to upgrade `SparkR` to use the Arrow 0.15 API; currently, it's broken.

### Why are the changes needed?

First of all, it turns out that our Jenkins jobs (including the PR builder) ignore the Arrow tests. Arrow 0.15 has breaking R API changes at [ARROW-5505](https://issues.apache.org/jira/browse/ARROW-5505) and we missed that. AppVeyor was the only one running the SparkR Arrow tests, but it's broken now.

**Jenkins**
```
Skipped ------------------------------------------------------------------------
1. createDataFrame/collect Arrow optimization (test_sparkSQL_arrow.R#25)
- arrow not installed
```

Second, Arrow throws OOM on AppVeyor environment (Windows JDK8) like the following because it still has Arrow 0.14.
```
Warnings -----------------------------------------------------------------------
1. createDataFrame/collect Arrow optimization (test_sparkSQL_arrow.R#39) - createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.sparkr.enabled' is set to true; however, failed, attempting non-optimization. Reason: Error in handleErrors(returnStatus, conn): java.lang.OutOfMemoryError: Java heap space
	at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
	at java.nio.ByteBuffer.allocate(ByteBuffer.java:335)
	at org.apache.arrow.vector.ipc.message.MessageSerializer.readMessage(MessageSerializer.java:669)
	at org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$3.readNextBatch(ArrowConverters.scala:243)
```

It is due to the version mismatch.
```java
int messageLength = MessageSerializer.bytesToInt(buffer.array());
if (messageLength == IPC_CONTINUATION_TOKEN) {
  buffer.clear();
  // ARROW-6313, if the first 4 bytes are continuation message, read the next 4 for the length
  if (in.readFully(buffer) == 4) {
    messageLength = MessageSerializer.bytesToInt(buffer.array());
  }
}

// Length of 0 indicates end of stream
if (messageLength != 0) {
  // Read the message into the buffer.
  ByteBuffer messageBuffer = ByteBuffer.allocate(messageLength);
```
After upgrading this to 0.15, we hit ARROW-5505. This PR upgrades the Arrow version in AppVeyor and fixes the issue.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Pass the AppVeyor.

This PR passed here.
- https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/builds/28909044

```
SparkSQL Arrow optimization: Spark package found in SPARK_HOME: C:\projects\spark\bin\..
................
```

Closes #26555 from dongjoon-hyun/SPARK-R-TEST.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
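The framing change behind this mismatch is ARROW-6313: since 0.15, each IPC message is prefixed with a 0xFFFFFFFF continuation marker before the metadata length, so peers on opposite sides of the change fall out of step on the 4-byte reads and end up with a garbage length, which is how allocations like the OOM above happen. A small Python rendering of the Java reader logic quoted in the commit message (a sketch for explanation, assuming little-endian 32-bit prefixes as in the Arrow IPC stream format):

```python
# Sketch mirroring the post-ARROW-6313 length handshake from the Java snippet
# above; not Spark or Arrow code, just the framing logic for explanation.
import struct

IPC_CONTINUATION_TOKEN = -1  # 0xFFFFFFFF read as a signed little-endian int32

def read_message_length(stream):
    """Return the next message's metadata length, or 0 at end of stream."""
    length = struct.unpack("<i", stream.read(4))[0]
    if length == IPC_CONTINUATION_TOKEN:
        # New (0.15+) framing: the marker precedes the real length.
        length = struct.unpack("<i", stream.read(4))[0]
    return length
```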
@shaneknapp
Contributor

as per my comment here (#26555 (comment)), i will add the R Arrow package on the jenkins workers some time this week.

@HyukjinKwon
Member

(I answered that comment in #26555 (comment) too.)

rshkv pushed commits to palantir/spark that referenced this pull request May 23, 2020, Jun 24, 2020, and Jul 15, 2020