Conversation

@sunchao (Member) commented Nov 16, 2021

What changes were proposed in this pull request?

Bump the Apache Arrow version from 2.0.0 to 6.0.0.

Why are the changes needed?

Apache Spark is still using the old Apache Arrow version 2.0.0, while 6.0.0 was already released in October 2021. We should pick up improvements and bug fixes from the newer version.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing tests.

@github-actions github-actions bot added the BUILD label Nov 16, 2021
@HyukjinKwon (Member)

cc @BryanCutler FYI

@BryanCutler (Member) left a comment


LGTM, @sunchao have you tried running PySpark tests with PyArrow 6.0.0?

@SparkQA commented Nov 16, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49724/

@sunchao (Member, Author) commented Nov 16, 2021

@BryanCutler eh I didn't - was hoping that the Spark CI would help with that. Do you know how I can test that?

@HyukjinKwon (Member)

Oh, it'd be better to test it, @sunchao. Running the tests the regular way (https://spark.apache.org/developer-tools.html), but with pip install pyarrow==6.0.0, should work.

@SparkQA commented Nov 16, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49724/

@sunchao (Member, Author) commented Nov 16, 2021

Thanks @HyukjinKwon, you mean run tests like:

python/run-tests --testnames pyspark.sql.tests.test_arrow

but with pip install pyarrow==6.0.0, is that correct?

Let me do it soon.

@HyukjinKwon (Member) commented Nov 16, 2021

Oh, you can instead do:

pip install -r dev/requirements.txt
pip install pyarrow==6.0.0
python/run-tests --modules pyspark-sql

that would verify all the things 👍
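Before running the full suite, it can help to confirm the interpreter actually picked up the upgraded wheel. A minimal version-gate sketch (the helper names and the comparison logic here are illustrative, not Spark's own minimum-version check):

```python
def version_tuple(v):
    # "6.0.0" -> (6, 0, 0); keeps only the leading digits of each part,
    # so "6.0.0rc1" -> (6, 0, 0)
    parts = []
    for p in v.split("."):
        digits = ""
        for ch in p:
            if ch.isdigit():
                digits += ch
            else:
                break
        parts.append(int(digits) if digits else 0)
    return tuple(parts)


def meets_minimum(installed, minimum="6.0.0"):
    # True if the installed version is at least the required minimum
    return version_tuple(installed) >= version_tuple(minimum)


# Example usage against the version the interpreter actually sees:
# import pyarrow
# assert meets_minimum(pyarrow.__version__), pyarrow.__version__
```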

@SparkQA commented Nov 16, 2021

Test build #145254 has finished for PR 34613 at commit 2fe10a3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member) left a comment

+1, LGTM if @sunchao succeeded with @HyukjinKwon 's direction.
Thanks, @sunchao , @BryanCutler , @HyukjinKwon

@HyukjinKwon (Member)

Merged to master.

@sunchao (Member, Author) commented Nov 17, 2021

Thanks for the merge @HyukjinKwon !

Sorry for the delay, I'm still trying to test this. I followed the steps you mentioned above:

pip install -r dev/requirements.txt
pip install pyarrow==6.0.0
python/run-tests --modules pyspark-sql

and also set SPARK_HOME to the dist directory under my Spark repo. However, I'm getting this error:

======================================================================
ERROR [0.000s]: setUpClass (pyspark.sql.tests.test_arrow.MaxResultArrowTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/sunchao/git/spark/dist/python/pyspark/sql/tests/test_arrow.py", line 672, in setUpClass
    "local[4]", cls.__name__, conf=SparkConf().set("spark.driver.maxResultSize", "10k")
  File "/Users/sunchao/git/spark/dist/python/pyspark/conf.py", line 131, in __init__
    self._jconf = _jvm.SparkConf(loadDefaults)
  File "/Users/sunchao/git/spark/dist/python/lib/py4j-0.10.9.2-src.zip/py4j/java_gateway.py", line 1697, in __getattr__
    answer = self._gateway_client.send_command(
  File "/Users/sunchao/git/spark/dist/python/lib/py4j-0.10.9.2-src.zip/py4j/java_gateway.py", line 1036, in send_command
    connection = self._get_connection()
  File "/Users/sunchao/git/spark/dist/python/lib/py4j-0.10.9.2-src.zip/py4j/clientserver.py", line 281, in _get_connection
    connection = self._create_new_connection()
  File "/Users/sunchao/git/spark/dist/python/lib/py4j-0.10.9.2-src.zip/py4j/clientserver.py", line 288, in _create_new_connection
    connection.connect_to_java_server()
  File "/Users/sunchao/git/spark/dist/python/lib/py4j-0.10.9.2-src.zip/py4j/clientserver.py", line 402, in connect_to_java_server
    self.socket.connect((self.java_address, self.java_port))
ConnectionRefusedError: [Errno 61] Connection refused

Do you know what I could have missed here?

@HyukjinKwon (Member)

Oh, SPARK_HOME has to be set to the root directory of the GitHub repo.

@HyukjinKwon (Member) commented Nov 18, 2021

The full command I use:

./build/mvn -DskipTests -Phive-2.3 -Phive clean package  # can also be an sbt build, same as what the GitHub Actions build uses
export SPARK_HOME=`pwd`
./python/run-tests --python-executables=python3 --modules pyspark-sql

@HyukjinKwon (Member) commented Nov 18, 2021

And, if you're running on a Mac, you should also set:

export OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES

@sunchao (Member, Author) commented Nov 18, 2021

Hmm, I must have missed something, since I always get "ConnectionRefusedError: [Errno 61] Connection refused" even though I followed the exact steps above. I'm not sure what host & port it tries to access; I did enable passwordless ssh to localhost.

@HyukjinKwon (Member)

@sunchao, how does it work if you do one of below:

  1. export SPARK_LOCAL_IP=localhost
  2. open /etc/hosts and double-check that it maps 127.0.0.1 to localhost

?

As you already assumed, it seems like it fails to open a socket to localhost.
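For what it's worth, the ConnectionRefusedError in the traceback above is py4j's client socket failing to reach the JVM gateway. The same failure mode can be reproduced outside Spark with a bare TCP connect; a small diagnostic sketch (the helper name is illustrative, and the port to probe is whatever the gateway reports, which this thread does not specify):

```python
import socket


def can_connect(host, port, timeout=1.0):
    # Attempts the same kind of TCP connect that py4j makes in
    # connect_to_java_server(); ConnectionRefusedError (a subclass
    # of OSError) means nothing is listening at host:port.
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

If a connect to "localhost" fails while the same port works via "127.0.0.1", that points at exactly the name-resolution problem the SPARK_LOCAL_IP and /etc/hosts suggestions address.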

sunchao added a commit to sunchao/spark that referenced this pull request Dec 8, 2021

Closes apache#34613 from sunchao/SPARK-37342.

Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>