
[SPARK-22324][SQL][PYTHON] Upgrade Arrow to 0.8.0 #19884

Closed

Conversation

BryanCutler
Member

@BryanCutler BryanCutler commented Dec 4, 2017

What changes were proposed in this pull request?

Upgrade Spark to Arrow 0.8.0 for Java and Python. Also includes an upgrade of Netty to 4.1.17 to resolve dependency requirements.

The highlights that pertain to Spark for the update from Arrow version 0.4.1 to 0.8.0 include:

  • Java refactoring for a simpler API
  • Reduced Java heap usage and streamlined hot code paths
  • Type support for DecimalType, ArrayType
  • Improved type casting support in Python
  • Simplified type checking in Python

How was this patch tested?

Existing tests

@SparkQA

SparkQA commented Dec 4, 2017

Test build #84447 has finished for PR 19884 at commit 4b0790b.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@BryanCutler
Member Author

BryanCutler commented Dec 4, 2017

The highlights that pertain to Spark for the update from Arrow version 0.4.1 to 0.8.0 include:

  • Java refactoring for a simpler API
  • Reduced Java heap usage and streamlined hot code paths
  • Type support for DecimalType, ArrayType
  • Improved type casting support in Python
  • Simplified type checking in Python

@dongjoon-hyun
Member

Great, @BryanCutler . Could you put the highlight in the PR description, too?

@BryanCutler
Member Author

BryanCutler commented Dec 4, 2017

This is a WIP to start updating Spark to use Arrow 0.8.0 which will be released soon.

TODO:

  • Update to reflect Java API changes - Scala Arrow Tests Passing
  • Update to reflect Python API changes - Scala Python Tests Passing
  • Use new Python type checking
  • Remove Python type casting workarounds
  • Incorporate Netty Upgrade
  • Add message if user has older version of pyarrow

pom.xml Outdated
@@ -185,7 +185,7 @@
<paranamer.version>2.8</paranamer.version>
<maven-antrun.version>1.8</maven-antrun.version>
<commons-crypto.version>1.0.0</commons-crypto.version>
<arrow.version>0.4.0</arrow.version>
<arrow.version>0.8.0-SNAPSHOT</arrow.version>
Member

@dongjoon-hyun dongjoon-hyun Dec 4, 2017

Is there any ETA for the official 0.8.0?

Member Author

We are still wrapping a few things up, should be later this week or early next week.

Member

Can we download the snapshot from somewhere for our local tests?

Member

Should be able to cut an RC at the beginning of next week. I would suggest mvn-installing from Arrow master for the time being.

@BryanCutler
Member Author

Great, @BryanCutler . Could you put the highlight in the PR description, too?

Sure, thanks @dongjoon-hyun! Will do, I just want to go back and check the release notes first.

@HyukjinKwon
Member

cc @zsxwing as well, I saw you opened a JIRA about this - SPARK-22656

@BryanCutler
Member Author

@zsxwing, fyi after applying your Netty upgrade patch to Arrow, and then your other patch for Spark, all of the Spark Scala/Java tests pass

spark_type = StringType()
elif at == pa.date32():
spark_type = DateType()
elif type(at) == pa.TimestampType:
elif pa.types.is_timestamp(at):
Member Author

@icexelloss @wesm is this the recommended way to check the type id for the latest pyarrow? For types with a single bit width, I am using the is_* functions, like is_timestamp, but for others I still need to check object equality, such as t == pa.date32(), because there is no is_date32(), only is_date().

Member

Yep, this is right. I'm opening a JIRA to add more functions for testing exact types

Member Author

Sounds good, thanks for confirming!

@SparkQA

SparkQA commented Dec 7, 2017

Test build #84616 has finished for PR 19884 at commit 93b1eb3.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 8, 2017

Test build #84663 has finished for PR 19884 at commit fdba406.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@ueshin ueshin left a comment

When I tried to run tests locally, I got OutOfMemoryException as follows:

[info]   org.apache.arrow.memory.OutOfMemoryException:
[info]   at org.apache.arrow.vector.complex.AbstractContainerVector.allocateNew(AbstractContainerVector.java:52)
[info]   at org.apache.spark.sql.execution.arrow.ArrowWriter$$anonfun$1.apply(ArrowWriter.scala:40)

Should we avoid calling allocateNew() explicitly, or something like that?

I'll wait for the next update. Thanks!

mask = None if casted.dtype == 'object' else s.isnull()
return pa.Array.from_pandas(casted, mask=mask, type=t)
mask = s.isnull()
# Workaround for casting timestamp units with timezone, ARROW-1906
Member

Will the fix for this workaround be included in Arrow 0.8?

Member

Yes, just fixed in ARROW-1906 apache/arrow#1411

pom.xml Outdated
@@ -185,7 +185,7 @@
<paranamer.version>2.8</paranamer.version>
<maven-antrun.version>1.8</maven-antrun.version>
<commons-crypto.version>1.0.0</commons-crypto.version>
<arrow.version>0.4.0</arrow.version>
<arrow.version>0.8.0-SNAPSHOT</arrow.version>
Member

Please don't forget that we also need to update dev/deps/spark-deps-hadoop-2.x files.

Member Author

done

@zsxwing
Member

zsxwing commented Dec 11, 2017

I saw #18974 tried to upgrade Arrow but got closed due to some Jenkins issue. @ueshin do you have any idea what may block this PR? Jenkins cannot support installing multiple versions of PyArrow?

@BryanCutler
Member Author

When I tried to run tests locally, I got OutOfMemoryException

@ueshin , you got that error because the latest Arrow has upgraded Netty to 4.1.17 but Spark has an older version on the classpath. If you apply #19829 on top of this PR, the tests should pass.

@BryanCutler
Member Author

Jenkins cannot support installing multiple versions of PyArrow?

@zsxwing that's right, we will have to coordinate to make sure the Jenkins pyarrow is upgraded to version 0.8 as well. I'm not sure of the best way to coordinate all of this, because this PR, the Jenkins upgrade, and the Spark Netty upgrade all need to happen at the same time.

@holdenk @shaneknapp will one of you be able to work on the pyarrow upgrade for Jenkins sometime around next week? (assuming Arrow 0.8 is released in the next day or so)

@zsxwing
Member

zsxwing commented Dec 11, 2017

@BryanCutler could you just pull my changes into this PR since we need both changes to pass Jenkins? Thanks!

@shaneknapp
Contributor

shaneknapp commented Dec 11, 2017 via email

@wesm
Member

wesm commented Dec 11, 2017

The Arrow 0.8.0 release vote just started today. Assuming it passes, the earliest you could see packages pushed to PyPI or conda-forge would be sometime on Thursday evening or Friday.

@SparkQA

SparkQA commented Dec 11, 2017

Test build #84738 has finished for PR 19884 at commit 46ad595.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 12, 2017

Test build #84740 has finished for PR 19884 at commit c3d612f.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

netty-all-4.0.47.Final.jar
netty-all-4.1.17.Final.jar
netty-buffer-4.1.17.Final.jar
netty-common-4.1.17.Final.jar
Member Author

@BryanCutler BryanCutler Dec 12, 2017

@zsxwing do you think netty-buffer and netty-common can be safely excluded in the Spark pom because the same classes are also in netty-all?

Member

@BryanCutler Yes. It should be safe.

Member Author

Cool, thx just wanted to be sure
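The exclusion being discussed could look roughly like this in Spark's pom (a sketch only; the actual artifact and version coordinates in Spark's pom.xml may differ):

```xml
<!-- Sketch: netty-all already bundles the buffer and common classes,
     so the split artifacts pulled in transitively can be excluded. -->
<dependency>
  <groupId>org.apache.arrow</groupId>
  <artifactId>arrow-vector</artifactId>
  <version>${arrow.version}</version>
  <exclusions>
    <exclusion>
      <groupId>io.netty</groupId>
      <artifactId>netty-buffer</artifactId>
    </exclusion>
    <exclusion>
      <groupId>io.netty</groupId>
      <artifactId>netty-common</artifactId>
    </exclusion>
  </exclusions>
</dependency>
```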

@BryanCutler
Member Author

yeah, i can do the upgrade next week. i'll be working remotely from the east coast, but unavailable at all on monday due to travel.

Great, thanks @shaneknapp ! I'll ping you when I think we are set to go

@SparkQA

SparkQA commented Dec 12, 2017

Test build #84743 has finished for PR 19884 at commit 3a5e3c1.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -110,3 +110,12 @@ def toJArray(gateway, jtype, arr):
for i in range(0, len(arr)):
jarr[i] = arr[i]
return jarr


def _require_minimum_pyarrow_version():
Contributor

@ueshin did we do the same thing for pandas?

Member

No. I just checked if ImportError occurred or not. We should do the same thing for pandas later.
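The kind of guard `_require_minimum_pyarrow_version` adds can be sketched with a stdlib-only, hypothetical helper (Spark's real implementation and error message differ):

```python
def require_minimum_version(package, installed, minimum):
    """Raise ImportError if the installed version is below the minimum.

    Versions are compared as integer tuples: "0.8.0" -> (0, 8, 0).
    """
    def as_tuple(version):
        return tuple(int(part) for part in version.split("."))

    if as_tuple(installed) < as_tuple(minimum):
        raise ImportError(
            "%s >= %s must be installed; found %s"
            % (package, minimum, installed))
```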

@cloud-fan
Contributor

LGTM, I'm also fine to ignore some tests if they are hard to fix, to unblock other PRs sooner.

@BryanCutler
Member Author

I used a workaround for timestamp casts that allows the tests to pass for me locally, and left a note to look into the root cause later. Hopefully this should pass now and we will be good to merge.

>>> df.select(slen("name").alias("slen(name)"), to_upper("name"), add_one("age")) \\
... .show() # doctest: +SKIP
+----------+--------------+------------+
|slen(name)|to_upper(name)|add_one(age)|
+----------+--------------+------------+
| 8| JOHN DOE| 22|
| 8| JOHN| 22|
Contributor

nit: we should revert this too

Member Author

oops, done!

@SparkQA

SparkQA commented Dec 21, 2017

Test build #85244 has finished for PR 19884 at commit b0200ef.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 21, 2017

Test build #85242 has finished for PR 19884 at commit ae84c84.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ueshin
Member

ueshin commented Dec 21, 2017

Jenkins, retest this please.

@SparkQA

SparkQA commented Dec 21, 2017

Test build #85246 has finished for PR 19884 at commit b0200ef.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

LGTM

Merged to master.

@asfgit asfgit closed this in 59d5263 Dec 21, 2017
@HyukjinKwon
Member

Hi @zsxwing is it okay to resolve SPARK-19552?

@wesm
Member

wesm commented Dec 21, 2017

@BryanCutler can you give me a minimal repro for the timestamp issue you cited above?

@zsxwing
Member

zsxwing commented Dec 22, 2017

@HyukjinKwon yeah, I closed the ticket.

@BryanCutler
Member Author

Thanks all for reviewing and getting the Netty upgrade in also!

@BryanCutler
Member Author

@BryanCutler can you give me a minimal repro for the timestamp issue you cited above?

Sure @wesm, I'll ping you with a repro

asfgit pushed a commit that referenced this pull request Dec 27, 2017
## What changes were proposed in this pull request?

This is a follow-up pr of #19884 updating setup.py file to add pyarrow dependency.

## How was this patch tested?

Existing tests.

Author: Takuya UESHIN <ueshin@databricks.com>

Closes #20089 from ueshin/issues/SPARK-22324/fup1.
@HyukjinKwon
Member

@BryanCutler, did we resolve #19884 (comment)? If not, shall we file a JIRA?

@BryanCutler
Member Author

@HyukjinKwon ARROW-1949 was created to add an option to allow truncation when data will be lost. Once that is in Arrow, we can remove the workaround if we want.

@@ -91,7 +91,7 @@ public long position() {
}

@Override
public long transfered() {
public long transferred() {
Member

@gatorsmile gatorsmile Jan 14, 2018

This breaks binary compatibility. Is it intentional? @zsxwing @cloud-fan

Member

It doesn't. The old method is implemented in AbstractFileRegion.transfered. In addition, the whole network module is private, so we don't need to maintain compatibility.

Member

Oh, I see. AbstractFileRegion.transfered is final so it may break binary compatibility. However, this is fine since it's a private module.

Member

I see. Thanks!

sumwale pushed a commit to TIBCOSoftware/snappy-spark that referenced this pull request Feb 7, 2022
Upgrade Spark to Arrow 0.8.0 for Java and Python.  Also includes an upgrade of Netty to 4.1.17 to resolve dependency requirements.

The highlights that pertain to Spark for the update from Arrow version 0.4.1 to 0.8.0 include:

* Java refactoring for a simpler API
* Reduced Java heap usage and streamlined hot code paths
* Type support for DecimalType, ArrayType
* Improved type casting support in Python
* Simplified type checking in Python

Existing tests

Author: Bryan Cutler <cutlerb@gmail.com>
Author: Shixiong Zhu <zsxwing@gmail.com>

Closes apache#19884 from BryanCutler/arrow-upgrade-080-SPARK-22324.