
[SPARK-22324][SQL][PYTHON] Upgrade Arrow to 0.8.0 #19884

Closed

Conversation

BryanCutler
Member

@BryanCutler BryanCutler commented Dec 4, 2017

What changes were proposed in this pull request?

Upgrade Spark to Arrow 0.8.0 for Java and Python. Also includes an upgrade of Netty to 4.1.17 to resolve dependency requirements.

The highlights that pertain to Spark for the update from Arrow version 0.4.1 to 0.8.0 include:

  • Java refactoring for a simpler API
  • Reduced Java heap usage and streamlined hot code paths
  • Type support for DecimalType, ArrayType
  • Improved type casting support in Python
  • Simplified type checking in Python

How was this patch tested?

Existing tests

@SparkQA

SparkQA commented Dec 4, 2017

Test build #84447 has finished for PR 19884 at commit 4b0790b.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@BryanCutler
Member Author

BryanCutler commented Dec 4, 2017

The highlights that pertain to Spark for the update from Arrow version 0.4.1 to 0.8.0 include:

  • Java refactoring for a simpler API
  • Reduced Java heap usage and streamlined hot code paths
  • Type support for DecimalType, ArrayType
  • Improved type casting support in Python
  • Simplified type checking in Python

@dongjoon-hyun
Member

Great, @BryanCutler . Could you put the highlight in the PR description, too?

@BryanCutler
Member Author

BryanCutler commented Dec 4, 2017

This is a WIP to start updating Spark to use Arrow 0.8.0 which will be released soon.

TODO:

  • Update to reflect Java API changes - Scala Arrow Tests Passing
  • Update to reflect Python API changes - Scala Python Tests Passing
  • Use new Python type checking
  • Remove Python type casting workarounds
  • Incorporate Netty Upgrade
  • Add message if user has older version of pyarrow

pom.xml Outdated
@@ -185,7 +185,7 @@
<paranamer.version>2.8</paranamer.version>
<maven-antrun.version>1.8</maven-antrun.version>
<commons-crypto.version>1.0.0</commons-crypto.version>
<arrow.version>0.4.0</arrow.version>
<arrow.version>0.8.0-SNAPSHOT</arrow.version>
Member

@dongjoon-hyun dongjoon-hyun Dec 4, 2017

Is there any ETA for the official 0.8.0?

Member Author

We are still wrapping a few things up, should be later this week or early next week.

Member

Can we download the snapshot from somewhere for our local tests?

Member

Should be able to cut an RC at the beginning of next week. I would suggest mvn-installing from Arrow master for the time being.

@BryanCutler
Member Author

Great, @BryanCutler . Could you put the highlight in the PR description, too?

Sure, thanks @dongjoon-hyun! Will do, I just want to go back and check the release notes first.

@HyukjinKwon
Member

cc @zsxwing as well, I saw you opened a JIRA about this - SPARK-22656

@BryanCutler
Member Author

@zsxwing, fyi after applying your Netty upgrade patch to Arrow, and then your other patch for Spark, all of the Spark Scala/Java tests pass

spark_type = StringType()
elif at == pa.date32():
spark_type = DateType()
elif type(at) == pa.TimestampType:
elif pa.types.is_timestamp(at):
Member Author

@icexelloss @wesm is this the recommended way to check the type id for the latest pyarrow? For types with a single bit width, I am using the is_* functions, like is_timestamp, but for others I still need to check object equality, such as t == pa.date32(), because there is no is_date32(), only is_date().

Member

Yep, this is right. I'm opening a JIRA to add more functions for testing exact types

Member Author

Sounds good, thanks for confirming!

@SparkQA

SparkQA commented Dec 7, 2017

Test build #84616 has finished for PR 19884 at commit 93b1eb3.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 8, 2017

Test build #84663 has finished for PR 19884 at commit fdba406.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@ueshin ueshin left a comment

When I tried to run tests locally, I got OutOfMemoryException as follows:

[info]   org.apache.arrow.memory.OutOfMemoryException:
[info]   at org.apache.arrow.vector.complex.AbstractContainerVector.allocateNew(AbstractContainerVector.java:52)
[info]   at org.apache.spark.sql.execution.arrow.ArrowWriter$$anonfun$1.apply(ArrowWriter.scala:40)

Should we avoid calling allocateNew() explicitly, or something like that?

I'll wait for the next update. Thanks!

mask = None if casted.dtype == 'object' else s.isnull()
return pa.Array.from_pandas(casted, mask=mask, type=t)
mask = s.isnull()
# Workaround for casting timestamp units with timezone, ARROW-1906
Member

Will the fix for this workaround be included in Arrow 0.8?

Member

Yes, just fixed in ARROW-1906 apache/arrow#1411

pom.xml Outdated
@@ -185,7 +185,7 @@
<paranamer.version>2.8</paranamer.version>
<maven-antrun.version>1.8</maven-antrun.version>
<commons-crypto.version>1.0.0</commons-crypto.version>
<arrow.version>0.4.0</arrow.version>
<arrow.version>0.8.0-SNAPSHOT</arrow.version>
Member

Please don't forget that we also need to update dev/deps/spark-deps-hadoop-2.x files.

Member Author

done

@zsxwing
Member

zsxwing commented Dec 11, 2017

I saw #18974 tried to upgrade Arrow but got closed due to some Jenkins issue. @ueshin do you have any idea what may block this PR? Jenkins cannot support installing multiple versions of PyArrow?

@BryanCutler
Member Author

When I tried to run tests locally, I got OutOfMemoryException

@ueshin , you got that error because the latest Arrow has upgraded Netty to 4.1.17 but Spark has an older version on the classpath. If you apply #19829 on top of this PR, the tests should pass.

@BryanCutler
Member Author

Jenkins cannot support installing multiple versions of PyArrow?

@zsxwing that's right, we will have to coordinate to make sure the Jenkins pyarrow is upgraded to version 0.8 as well. I'm not sure of the best way to coordinate all of this, because this PR, the Jenkins upgrade, and the Spark Netty upgrade all need to happen at the same time.

@holdenk @shaneknapp will one of you be able to work on the pyarrow upgrade for Jenkins sometime around next week? (assuming Arrow 0.8 is released in the next day or so)

@zsxwing
Member

zsxwing commented Dec 11, 2017

@BryanCutler could you just pull my changes into this PR since we need both changes to pass Jenkins? Thanks!

@shaneknapp
Contributor

shaneknapp commented Dec 11, 2017 via email

@wesm
Member

wesm commented Dec 11, 2017

The Arrow 0.8.0 release vote just started today. Assuming it passes, the earliest you could see packages pushed to PyPI or conda-forge would be sometime on Thursday evening or Friday.

@SparkQA

SparkQA commented Dec 11, 2017

Test build #84738 has finished for PR 19884 at commit 46ad595.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 12, 2017

Test build #84740 has finished for PR 19884 at commit c3d612f.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

netty-all-4.0.47.Final.jar
netty-all-4.1.17.Final.jar
netty-buffer-4.1.17.Final.jar
netty-common-4.1.17.Final.jar
Member Author

@BryanCutler BryanCutler Dec 12, 2017

@zsxwing do you think netty-buffer and netty-common can be safely excluded in the Spark pom because the same classes are also in netty-all?

Member

@BryanCutler Yes. It should be safe.

Member Author

Cool, thx just wanted to be sure
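The exclusion being discussed could look roughly like this in Spark's pom (a sketch only; the actual artifact and version coordinates in Spark's pom.xml may differ):

```xml
<!-- Sketch: netty-all already bundles the buffer and common classes,
     so the split artifacts pulled in transitively can be excluded. -->
<dependency>
  <groupId>org.apache.arrow</groupId>
  <artifactId>arrow-vector</artifactId>
  <version>${arrow.version}</version>
  <exclusions>
    <exclusion>
      <groupId>io.netty</groupId>
      <artifactId>netty-buffer</artifactId>
    </exclusion>
    <exclusion>
      <groupId>io.netty</groupId>
      <artifactId>netty-common</artifactId>
    </exclusion>
  </exclusions>
</dependency>
```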

@BryanCutler
Member Author

yeah, i can do the upgrade next week. i'll be working remotely from the east coast, but unavailable at all on monday due to travel.

Great, thanks @shaneknapp ! I'll ping you when I think we are set to go

@SparkQA

SparkQA commented Dec 12, 2017

Test build #84743 has finished for PR 19884 at commit 3a5e3c1.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -110,3 +110,12 @@ def toJArray(gateway, jtype, arr):
for i in range(0, len(arr)):
jarr[i] = arr[i]
return jarr


def _require_minimum_pyarrow_version():
Contributor

@ueshin did we do the same thing for pandas?

Member

No. I just checked if ImportError occurred or not. We should do the same thing for pandas later.
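The kind of guard `_require_minimum_pyarrow_version` adds can be sketched with a stdlib-only, hypothetical helper (Spark's real implementation and error message differ):

```python
def require_minimum_version(package, installed, minimum):
    """Raise ImportError if the installed version is below the minimum.

    Versions are compared as integer tuples: "0.8.0" -> (0, 8, 0).
    """
    def as_tuple(version):
        return tuple(int(part) for part in version.split("."))

    if as_tuple(installed) < as_tuple(minimum):
        raise ImportError(
            "%s >= %s must be installed; found %s"
            % (package, minimum, installed))
```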

@cloud-fan
Contributor

LGTM, I'm also fine to ignore some tests if they are hard to fix, to unblock other PRs sooner.

@BryanCutler
Member Author

I used a workaround for timestamp casts that allows the tests to pass for me locally, and left a note to look into the root cause later. Hopefully this should pass now and we will be good to merge.

>>> df.select(slen("name").alias("slen(name)"), to_upper("name"), add_one("age")) \\
... .show() # doctest: +SKIP
+----------+--------------+------------+
|slen(name)|to_upper(name)|add_one(age)|
+----------+--------------+------------+
| 8| JOHN DOE| 22|
| 8| JOHN| 22|
Contributor

nit: we should revert this too

Member Author

oops, done!

@SparkQA

SparkQA commented Dec 21, 2017

Test build #85244 has finished for PR 19884 at commit b0200ef.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 21, 2017

Test build #85242 has finished for PR 19884 at commit ae84c84.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ueshin
Member

ueshin commented Dec 21, 2017

Jenkins, retest this please.

@SparkQA

SparkQA commented Dec 21, 2017

Test build #85246 has finished for PR 19884 at commit b0200ef.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

LGTM

Merged to master.

@asfgit asfgit closed this in 59d5263 Dec 21, 2017
@HyukjinKwon
Member

Hi @zsxwing is it okay to resolve SPARK-19552?

@wesm
Member

wesm commented Dec 21, 2017

@BryanCutler can you give me a minimal repro for the timestamp issue you cited above?

@zsxwing
Member

zsxwing commented Dec 22, 2017

@HyukjinKwon yeah, I closed the ticket.

@BryanCutler
Member Author

Thanks all for reviewing and getting the Netty upgrade in also!

@BryanCutler
Member Author

@BryanCutler can you give me a minimal repro for the timestamp issue you cited above?

Sure @wesm, I'll ping you with a repro

asfgit pushed a commit that referenced this pull request Dec 27, 2017
## What changes were proposed in this pull request?

This is a follow-up pr of #19884 updating setup.py file to add pyarrow dependency.

## How was this patch tested?

Existing tests.

Author: Takuya UESHIN <ueshin@databricks.com>

Closes #20089 from ueshin/issues/SPARK-22324/fup1.
@HyukjinKwon
Member

@BryanCutler, did we resolve #19884 (comment)? If not, shall we file a JIRA?

@BryanCutler
Member Author

@HyukjinKwon ARROW-1949 was created to add an option to allow truncation when data will be lost. Once that is in Arrow, we can remove the workaround if we want.

@@ -91,7 +91,7 @@ public long position() {
}

@Override
public long transfered() {
public long transferred() {
Member

@gatorsmile gatorsmile Jan 14, 2018

This breaks binary compatibility. Is it intentional? @zsxwing @cloud-fan

Member

It doesn't. The old method is implemented in AbstractFileRegion.transfered. In addition, the whole network module is private, so we don't need to maintain compatibility.

Member

Oh, I see. AbstractFileRegion.transfered is final so it may break binary compatibility. However, this is fine since it's a private module.

Member

I see. Thanks!

sumwale pushed a commit to TIBCOSoftware/snappy-spark that referenced this pull request Feb 7, 2022
Upgrade Spark to Arrow 0.8.0 for Java and Python.  Also includes an upgrade of Netty to 4.1.17 to resolve dependency requirements.

The highlights that pertain to Spark for the update from Arrow version 0.4.1 to 0.8.0 include:

* Java refactoring for a simpler API
* Reduced Java heap usage and streamlined hot code paths
* Type support for DecimalType, ArrayType
* Improved type casting support in Python
* Simplified type checking in Python

Existing tests

Author: Bryan Cutler <cutlerb@gmail.com>
Author: Shixiong Zhu <zsxwing@gmail.com>

Closes apache#19884 from BryanCutler/arrow-upgrade-080-SPARK-22324.