
[SPARK-29376][SQL][PYTHON] Upgrade Apache Arrow to version 0.15.1 #26133

Closed

Conversation

BryanCutler
Member

@BryanCutler BryanCutler commented Oct 15, 2019

What changes were proposed in this pull request?

Upgrade Apache Arrow to version 0.15.1. This includes the Java artifacts and also increases the minimum required version of PyArrow.

Versions 0.12.0 through 0.15.1 include the following selected fixes/improvements relevant to Spark users:

  • ARROW-6898 - [Java] Fix potential memory leak in ArrowWriter and several test classes
  • ARROW-6874 - [Python] Memory leak in Table.to_pandas() when conversion to object dtype
  • ARROW-5579 - [Java] shade flatbuffer dependency
  • ARROW-5843 - [Java] Improve the readability and performance of BitVectorHelper#getNullCount
  • ARROW-5881 - [Java] Provide functionalities to efficiently determine if a validity buffer has completely 1 bits/0 bits
  • ARROW-5893 - [C++] Remove arrow::Column class from C++ library
  • ARROW-5970 - [Java] Provide pointer to Arrow buffer
  • ARROW-6070 - [Java] Avoid creating new schema before IPC sending
  • ARROW-6279 - [Python] Add Table.slice method or allow slices in __getitem__
  • ARROW-6313 - [Format] Tracking for ensuring flatbuffer serialized values are aligned in stream/files.
  • ARROW-6557 - [Python] Always return pandas.Series from Array/ChunkedArray.to_pandas, propagate field names to Series from RecordBatch, Table
  • ARROW-2015 - [Java] Use Java Time and Date APIs instead of JodaTime
  • ARROW-1261 - [Java] Add container type for Map logical type
  • ARROW-1207 - [C++] Implement Map logical type

Changelog can be seen at https://arrow.apache.org/release/0.15.0.html

Why are the changes needed?

Upgrade to get bug fixes, improvements, and maintain compatibility with future versions of PyArrow.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing tests, manually tested with Python 3.7, 3.8
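The bump in the minimum required PyArrow version surfaces on the Python side as an import-time check. Below is a minimal illustrative sketch of such a gate, in the spirit of pyspark's `require_minimum_pyarrow_version` helper; the names and error wording are approximations, not necessarily this PR's exact code:

```python
# Illustrative sketch of a minimum-version gate for pyarrow; names and the
# error wording are approximations, not this PR's exact code.
from distutils.version import LooseVersion

MINIMUM_PYARROW_VERSION = "0.15.1"

def require_minimum_pyarrow_version():
    """Raise ImportError if pyarrow is missing or older than the minimum."""
    import pyarrow
    if LooseVersion(pyarrow.__version__) < LooseVersion(MINIMUM_PYARROW_VERSION):
        raise ImportError(
            "PyArrow >= %s must be installed; your version is %s."
            % (MINIMUM_PYARROW_VERSION, pyarrow.__version__)
        )
```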

@SparkQA

SparkQA commented Oct 15, 2019

Test build #112126 has finished for PR 26133 at commit 25763c1.

  • This patch fails build dependency tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@BryanCutler
Member Author

I'm marking this as a WIP because we need to decide if we will increase the minimum required version of PyArrow to 0.15.0 as well. The main issue is that there is a change in the Arrow binary IPC format.

If we do not increase the minimum pyarrow, it will not work by default. There will need to be some configuration set so that Python and Java use the same IPC format, which can be done in two ways (both sketched after this comment):

  1. Have both Python and Java write the legacy IPC format by default. This would not require too many additional changes, but it is not ideal because at some point the ability to write the legacy format will be dropped from pyarrow and this option will no longer work.

  2. Detect the version of pyarrow being used and configure the Java writers accordingly. This would keep compatibility with all pyarrow <= 0.14.1 and future versions too. The problem is that it could be messy to dynamically configure the Java writers for pandas_udfs, because the data is written before we even import pyarrow to tell the version.

If we increase the minimum required version of PyArrow, then all Python and Java writers will use the default settings, which is the cleanest change and will be compatible with future versions of PyArrow. Also, I believe SparkR will require Arrow 0.15.0 as well, so this would keep Python and R in line.

NOTE: this change in the Arrow binary IPC format only affects temporary binary data exchanged between Python <-> Java and is not present in any long-term storage.
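For illustration, here is how the two options look from the Python side. The environment variable is the one pyarrow 0.15 provides for emitting the pre-0.15 format; the version probe stands in for option 2's detection (a sketch, not the PR's code):

```python
# Option 1 (sketch): ask pyarrow >= 0.15 to write the legacy IPC format so
# that older Arrow Java readers still understand it. This must be set before
# any batches are serialized.
import os
os.environ["ARROW_PRE_0_15_IPC_FORMAT"] = "1"

# Option 2 (sketch): detect the installed pyarrow version and configure the
# Java writers accordingly. The catch noted above: for pandas_udfs the Java
# side writes data before Python has imported pyarrow at all.
from distutils.version import LooseVersion
import pyarrow

uses_new_ipc_format = LooseVersion(pyarrow.__version__) >= LooseVersion("0.15.0")
```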

@BryanCutler
Member Author

My vote is to increase the minimum version of PyArrow since it will be the cleanest change, which is what this PR currently does. I can follow up with a message on the dev list if others agree. Before that, though, it looks like 0.15.1 is already in the works to fix some bugs, so we should wait a little to see if it includes anything important to Spark.

cc @HyukjinKwon @felixcheung @viirya @shaneknapp

@shaneknapp
Contributor

> My vote is to increase the minimum version of PyArrow since it will be the cleanest change, which is what this PR currently does. I can follow up with a message on the dev list if others agree. Before that, though, it looks like 0.15.1 is already in the works to fix some bugs, so we should wait a little to see if it includes anything important to Spark.
>
> cc @HyukjinKwon @felixcheung @viirya @shaneknapp

ACK

@HyukjinKwon
Member

Yeah, increasing the minimum version is fine with me too. I'll help double-check after the Spark Summit (end of this week or next week).

@viirya
Member

viirya commented Oct 16, 2019

I also think increasing the minimum version is fine. For option 1, we would need to address the issue again when the legacy format is dropped. Option 2 sounds too complicated.

@HyukjinKwon
Member

Sorry for my late response. +1 for increasing the minimum version, but would you mind if I ask you to send an email to the dev mailing list, @BryanCutler? I will reply on the thread right away as well.

From what I know, multiple major companies rely on Arrow, and I just wanted to make sure everybody is happy.

@HyukjinKwon
Member

cc @ueshin too

@BryanCutler
Member Author

Thanks everyone. It looks like 0.15.1 will be cut soon with some bug fixes, and we should probably use that as the minimum. @HyukjinKwon I will send a note to the mailing list as soon as I verify tests pass with it.

@SparkQA

SparkQA commented Nov 5, 2019

Test build #113229 has finished for PR 26133 at commit 3e267c3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

cc @MaxGekk since he is trying to remove joda-time, which this PR helps with.

@dongjoon-hyun
Member

Hi, @BryanCutler and @HyukjinKwon.
Can we remove [WIP]? 😄

@BryanCutler
Member Author

> Can we remove [WIP]?

The Arrow tests are being skipped right now. @shaneknapp, when you have time, could you do some of your magic and get a worker with pyarrow 0.15.1 to test this PR?

@BryanCutler BryanCutler changed the title [WIP][SPARK-29376][SQL][PYTHON] Upgrade Apache Arrow to version 0.15.0 [WIP][SPARK-29376][SQL][PYTHON] Upgrade Apache Arrow to version 0.15.1 Nov 5, 2019
@BryanCutler
Member Author

ping @shaneknapp do you think you can get pyarrow 0.15.1 installed on the python 3.6 worker envs?

@shaneknapp
Contributor

> ping @shaneknapp do you think you can get pyarrow 0.15.1 installed on the python 3.6 worker envs?

sure, i can probably get that done today!

@shaneknapp
Contributor

done!

@BryanCutler
Member Author

awesome thanks @shaneknapp !

@BryanCutler
Member Author

retest this please

@BryanCutler BryanCutler changed the title [WIP][SPARK-29376][SQL][PYTHON] Upgrade Apache Arrow to version 0.15.1 [SPARK-29376][SQL][PYTHON] Upgrade Apache Arrow to version 0.15.1 Nov 14, 2019
@SparkQA

SparkQA commented Nov 14, 2019

Test build #113809 has finished for PR 26133 at commit 4b8555c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 15, 2019

Test build #113820 has finished for PR 26133 at commit 4b8555c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@BryanCutler
Member Author

The pyarrow 0.15.1 tests have run for Python 3.6, so this is good to merge. I'll be able to do it in a little bit.

@dongjoon-hyun
Member

It's great! Thank you so much!

@BryanCutler
Member Author

Thanks @shaneknapp @HyukjinKwon @viirya and @dongjoon-hyun !

@dongjoon-hyun
Member

dongjoon-hyun commented Nov 15, 2019

Hi, All.

Unfortunately, Jenkins had been broken due to the Python 3 transition until today, and the first recovered run failed with this:

Previous exception in task:
sun.misc.Unsafe or java.nio.DirectByteBuffer.<init>(long, int) not available
	io.netty.util.internal.PlatformDependent.directBuffer(PlatformDependent.java:473)
	io.netty.buffer.NettyArrowBuf.getDirectBuffer(NettyArrowBuf.java:243)
	io.netty.buffer.NettyArrowBuf.nioBuffer(NettyArrowBuf.java:233)
	io.netty.buffer.ArrowBuf.nioBuffer(ArrowBuf.java:245)
	org.apache.arrow.vector.ipc.message.ArrowRecordBatch.computeBodyLength(ArrowRecordBatch.java:222)

@dongjoon-hyun
Member

Since it failed twice consecutively, I made a PR to verify the problem and the solution. The usual suspect is -Dio.netty.tryReflectionSetAccessible=true, but that requirement has been around for a while. I'm not sure why this suddenly fails with 0.15.1.
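For reference, the property is a JVM system option and has to reach both the driver and executor JVMs. One hedged way to pass it from user code (the PR mentioned above instead wires it into Spark's own test configuration):

```python
# Hedged sketch: passing Arrow's documented JDK 9+ requirement,
# -Dio.netty.tryReflectionSetAccessible=true, to the JVMs. Note that
# spark.driver.extraJavaOptions only takes effect if set before the driver
# JVM starts (e.g. spark-submit --driver-java-options "..."), not from code
# inside an already-running driver.
from pyspark.sql import SparkSession

NETTY_FLAG = "-Dio.netty.tryReflectionSetAccessible=true"

spark = (
    SparkSession.builder
    .config("spark.executor.extraJavaOptions", NETTY_FLAG)
    .getOrCreate()
)
```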

@dongjoon-hyun
Member

cc @srowen

@srowen
Member

srowen commented Nov 16, 2019

Hm, looks like https://issues.apache.org/jira/browse/ARROW-5412 and your PR does set the applicable system property. Let's see if that passes. Maybe @BryanCutler knows more. Let me continue on your PR.

@dongjoon-hyun
Member

dongjoon-hyun commented Nov 16, 2019

Another thing is AppVeyor. AppVeyor seems to fail continuously on JDK8 due to OOM. I am seeing this failure in some unrelated PRs now.

Warnings -----------------------------------------------------------------------
1. createDataFrame/collect Arrow optimization (@test_sparkSQL_arrow.R#39) - createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.sparkr.enabled' is set to true; however, failed, attempting non-optimization. Reason: Error in handleErrors(returnStatus, conn): java.lang.OutOfMemoryError: Java heap space
	at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)

Since this is a JDK8 issue, I created another WIP PR to verify it.

This might be an Arrow version mismatch between 0.14 and 0.15; AppVeyor still installs Arrow 0.14.

dongjoon-hyun added a commit that referenced this pull request Nov 16, 2019
… Arrow on JDK9+

### What changes were proposed in this pull request?

This PR aims to add `io.netty.tryReflectionSetAccessible=true` to the testing configuration for JDK11 because this is an officially documented requirement of Apache Arrow.

Apache Arrow community documented this requirement at `0.15.0` ([ARROW-6206](apache/arrow#5078)).
> #### For java 9 or later, should set "-Dio.netty.tryReflectionSetAccessible=true".
> This fixes `java.lang.UnsupportedOperationException: sun.misc.Unsafe or java.nio.DirectByteBuffer.(long, int) not available`. thrown by netty.

### Why are the changes needed?

After ARROW-3191, the Arrow Java library requires the property `io.netty.tryReflectionSetAccessible` to be set to true for JDK >= 9. After #26133, the JDK11 Jenkins jobs seem to fail.

- https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-3.2-jdk-11/676/
- https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-3.2-jdk-11/677/
- https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-3.2-jdk-11/678/

```
Previous exception in task:
sun.misc.Unsafe or java.nio.DirectByteBuffer.<init>(long, int) not available
io.netty.util.internal.PlatformDependent.directBuffer(PlatformDependent.java:473)
io.netty.buffer.NettyArrowBuf.getDirectBuffer(NettyArrowBuf.java:243)
io.netty.buffer.NettyArrowBuf.nioBuffer(NettyArrowBuf.java:233)
io.netty.buffer.ArrowBuf.nioBuffer(ArrowBuf.java:245)
org.apache.arrow.vector.ipc.message.ArrowRecordBatch.computeBodyLength(ArrowRecordBatch.java:222)
```

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Pass the Jenkins with JDK11.

Closes #26552 from dongjoon-hyun/SPARK-ARROW-JDK11.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
@dongjoon-hyun
Member

dongjoon-hyun commented Nov 16, 2019

Hi, @shaneknapp, @felixcheung, @shivaram, @BryanCutler, @HyukjinKwon, @srowen.

It turns out that our Jenkins jobs (including the PR builder) ignore the Arrow tests. Arrow 0.15 has breaking R API changes at ARROW-5505, and AppVeyor was the only environment running the SparkR Arrow tests, but it's broken now.

Jenkins

Skipped ------------------------------------------------------------------------
1. createDataFrame/collect Arrow optimization (@test_sparkSQL_arrow.R#25) - arrow not installed

2. createDataFrame/collect Arrow optimization - many partitions (partition order test) (@test_sparkSQL_arrow.R#48) - arrow not installed

3. createDataFrame/collect Arrow optimization - type specification (@test_sparkSQL_arrow.R#64) - arrow not installed

4. dapply() Arrow optimization (@test_sparkSQL_arrow.R#94) - arrow not installed

5. dapply() Arrow optimization - type specification (@test_sparkSQL_arrow.R#134) - arrow not installed

6. dapply() Arrow optimization - type specification (date and timestamp) (@test_sparkSQL_arrow.R#169) - arrow not installed

7. gapply() Arrow optimization (@test_sparkSQL_arrow.R#188) - arrow not installed

8. gapply() Arrow optimization - type specification (@test_sparkSQL_arrow.R#237) - arrow not installed

9. gapply() Arrow optimization - type specification (date and timestamp) (@test_sparkSQL_arrow.R#276) - arrow not installed

10. Arrow optimization - unsupported types (@test_sparkSQL_arrow.R#297) - arrow not installed

11. sparkJars tag in SparkContext (@test_Windows.R#22) - This test is only for Windows, skipped

For AppVeyor, it currently fails with OOM due to the protocol mismatch between versions. After I matched the versions, we finally hit ARROW-5505.

In short, SparkR Arrow integration is broken now. We need to update it to the new Arrow R interface.

I'll follow-up with a few PRs.

@BryanCutler
Member Author

That's strange, I didn't see any failure in the checks here. Was the AppVeyor check not running at the time?

@dongjoon-hyun
Member

dongjoon-hyun commented Nov 16, 2019

Yes. AppVeyor has a trigger rule, and it seems too specific to have caught this PR.

dongjoon-hyun added a commit that referenced this pull request Nov 17, 2019
### What changes were proposed in this pull request?

[[SPARK-29376] Upgrade Apache Arrow to version 0.15.1](#26133) upgrades to Arrow 0.15 for Scala/Java/Python. This PR aims to upgrade `SparkR` to use the Arrow 0.15 API; currently, it's broken.

### Why are the changes needed?

First of all, it turns out that our Jenkins jobs (including the PR builder) ignore the Arrow tests. Arrow 0.15 has breaking R API changes at [ARROW-5505](https://issues.apache.org/jira/browse/ARROW-5505) and we missed that. AppVeyor was the only one running the SparkR Arrow tests, but it's broken now.

**Jenkins**
```
Skipped ------------------------------------------------------------------------
1. createDataFrame/collect Arrow optimization (test_sparkSQL_arrow.R#25)
- arrow not installed
```

Second, Arrow throws OOM on AppVeyor environment (Windows JDK8) like the following because it still has Arrow 0.14.
```
Warnings -----------------------------------------------------------------------
1. createDataFrame/collect Arrow optimization (test_sparkSQL_arrow.R#39) - createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.sparkr.enabled' is set to true; however, failed, attempting non-optimization. Reason: Error in handleErrors(returnStatus, conn): java.lang.OutOfMemoryError: Java heap space
	at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
	at java.nio.ByteBuffer.allocate(ByteBuffer.java:335)
	at org.apache.arrow.vector.ipc.message.MessageSerializer.readMessage(MessageSerializer.java:669)
	at org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$3.readNextBatch(ArrowConverters.scala:243)
```

It is due to the version mismatch.
```java
int messageLength = MessageSerializer.bytesToInt(buffer.array());
if (messageLength == IPC_CONTINUATION_TOKEN) {
  buffer.clear();
  // ARROW-6313, if the first 4 bytes are continuation message, read the next 4 for the length
  if (in.readFully(buffer) == 4) {
    messageLength = MessageSerializer.bytesToInt(buffer.array());
  }
}

// Length of 0 indicates end of stream
if (messageLength != 0) {
  // Read the message into the buffer.
  ByteBuffer messageBuffer = ByteBuffer.allocate(messageLength);
```
After upgrading this to 0.15, we hit ARROW-5505. This PR upgrades the Arrow version in AppVeyor and fixes the issue.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Pass the AppVeyor.

This PR passed here.
- https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/builds/28909044

```
SparkSQL Arrow optimization: Spark package found in SPARK_HOME: C:\projects\spark\bin\..
................
```

Closes #26555 from dongjoon-hyun/SPARK-R-TEST.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
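The framing change behind this mismatch is ARROW-6313: since 0.15, each IPC message is prefixed with a 0xFFFFFFFF continuation marker before the metadata length, so peers on opposite sides of the change fall out of step on the 4-byte reads and end up with a garbage length, which is how allocations like the OOM above happen. A small Python rendering of the Java reader logic quoted in the commit message (a sketch for explanation, assuming little-endian 32-bit prefixes as in the Arrow IPC stream format):

```python
# Sketch mirroring the post-ARROW-6313 length handshake from the Java snippet
# above; not Spark or Arrow code, just the framing logic for explanation.
import struct

IPC_CONTINUATION_TOKEN = -1  # 0xFFFFFFFF read as a signed little-endian int32

def read_message_length(stream):
    """Return the next message's metadata length, or 0 at end of stream."""
    length = struct.unpack("<i", stream.read(4))[0]
    if length == IPC_CONTINUATION_TOKEN:
        # New (0.15+) framing: the marker precedes the real length.
        length = struct.unpack("<i", stream.read(4))[0]
    return length
```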
@shaneknapp
Contributor

as per my comment here (#26555 (comment)), i will add the R Arrow package on the jenkins workers some time this week.

@HyukjinKwon
Member

(I answered that comment in #26555 (comment) too.)

rshkv pushed commits to palantir/spark that referenced this pull request May 23, 2020, Jun 24, 2020, and Jul 15, 2020