[SPARK-50135][BUILD] Upgrade ZooKeeper to 3.9.3 #48666

Closed
panbingkun wants to merge 1 commit into apache:master from panbingkun:SPARK-50135

Conversation

@panbingkun (Contributor) commented Oct 26, 2024

### What changes were proposed in this pull request?

This PR aims to upgrade ZooKeeper from 3.9.2 to 3.9.3.

### Why are the changes needed?

The full release notes: https://zookeeper.apache.org/doc/r3.9.3/releasenotes.html

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass GA.

### Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions bot added the BUILD label Oct 26, 2024
@dongjoon-hyun (Member)

It looks good to me, @panbingkun. Is the PR ready?

@panbingkun (Contributor, Author)

> It looks good to me, @panbingkun. Is the PR ready?

Yep, thanks. ❤️

@panbingkun marked this pull request as ready for review October 26, 2024 22:55
@dongjoon-hyun (Member) left a comment

+1, LGTM. Thank you, @panbingkun.
Merged to master.

@HyukjinKwon (Member)

Surprisingly, this caused test failures in Spark Connect, specifically on Mac ....

```shell
./python/run-tests --python-executables=python3 --testnames "pyspark.sql.tests.connect.test_connect_collection"
```

With this PR:

```
google.protobuf.runtime_version.VersionError: Detected incompatible Protobuf Gencode/Runtime versions when loading spark/connect/base.proto: gencode 5.28.3 runtime 5.28.2. Runtime version cannot be older than the linked gencode version. See Protobuf version guarantees at https://protobuf.dev/support/cross-version-runtime-guarantee.
```

Without this PR:

```
Running PySpark tests. Output is in /.../spark/python/unit-tests.log
Will test against the following Python executables: ['python3']
Will test the following Python tests: ['pyspark.sql.tests.connect.test_connect_collection']
python3 python_implementation is CPython
python3 version is: Python 3.11.9
Starting test(python3): pyspark.sql.tests.connect.test_connect_collection (temp output: /.../spark/python/target/c288c2b5-a11a-4d12-8169-3947c2db9cdc/python3__pyspark.sql.tests.connect.test_connect_collection__rwwdqgag.log)
Finished test(python3): pyspark.sql.tests.connect.test_connect_collection (16s)
Tests passed in 16 seconds
```

...

Let me revert this for now, because the change is sort of trivial but it seems to affect all developers on Mac ..

@zhengruifeng (Contributor)

Thanks @HyukjinKwon for addressing this issue. Python developers had been blocked for two weeks.

IIRC, this is not the first time that some Python tests fail only on macOS.
This kind of issue is hard to fix; I actually attempted to resolve it by reverting some suspicious PRs and comparing versions with the OSS GA, but failed to find the root cause.

Is it possible to add a lightweight macOS GA job to guard basic PySpark functionality? @dongjoon-hyun @Yikun @LuciferYang

@HyukjinKwon (Member)

We have a macOS test at https://github.com/apache/spark/actions/workflows/build_maven_java21_macos15.yml but we're not running PySpark tests there now. We could improve this further.

@panbingkun (Contributor, Author)

> Surprisingly, this caused test failures in Spark Connect, specifically on Mac .... [...] Let me revert this for now, because the change is sort of trivial but it seems to affect all developers on Mac ..

Thank you for helping to fix it. Let me investigate it. Thanks!

@dongjoon-hyun (Member)

Thank you for the heads-up and recovery.

@LuciferYang (Contributor)

Are all current PySpark tests run in a container environment? Even if the OS is specified as macOS, are the existing images still based on Ubuntu? Or is this Python-only-on-macOS job supposed to run in a physical machine environment?

@panbingkun (Contributor, Author)

I think I have identified this issue and will submit a new PR this afternoon to solve it.

@zhengruifeng (Contributor)

> Are all current PySpark tests run in a container environment? Even if the OS is specified as macOS, are the existing images still based on Ubuntu? Or is this Python-only-on-macOS job supposed to run in a physical machine environment?

Yes, they are always container jobs.
We should add a macOS non-container job for this purpose, or build a macOS image instead of the Ubuntu image.

@panbingkun (Contributor, Author) commented Nov 6, 2024

  • When we use sbt to compile the Spark project, in the directory assembly/target/scala-2.13/jars/ we will find netty-related dependencies where some are version 4.1.110.Final but others are version 4.1.113.Final, as follows:
```shell
(pyspark) ➜  jars git:(master) ✗ ls -1 netty-*4*
netty-all-4.1.110.Final.jar
netty-buffer-4.1.113.Final.jar
netty-codec-4.1.113.Final.jar
netty-codec-http-4.1.110.Final.jar
netty-codec-http2-4.1.110.Final.jar
netty-codec-socks-4.1.110.Final.jar
netty-common-4.1.113.Final.jar
netty-handler-4.1.113.Final.jar
netty-handler-proxy-4.1.110.Final.jar
netty-resolver-4.1.113.Final.jar
netty-tcnative-boringssl-static-2.0.66.Final-linux-aarch_64.jar
netty-tcnative-boringssl-static-2.0.66.Final-linux-x86_64.jar
netty-tcnative-boringssl-static-2.0.66.Final-osx-aarch_64.jar
netty-tcnative-boringssl-static-2.0.66.Final-osx-x86_64.jar
netty-tcnative-boringssl-static-2.0.66.Final-windows-x86_64.jar
netty-transport-4.1.113.Final.jar
netty-transport-classes-epoll-4.1.113.Final.jar
netty-transport-classes-kqueue-4.1.110.Final.jar
netty-transport-native-epoll-4.1.113.Final-linux-aarch_64.jar
netty-transport-native-epoll-4.1.113.Final-linux-riscv64.jar
netty-transport-native-epoll-4.1.113.Final-linux-x86_64.jar
netty-transport-native-kqueue-4.1.110.Final-osx-aarch_64.jar
netty-transport-native-kqueue-4.1.110.Final-osx-x86_64.jar
netty-transport-native-unix-common-4.1.113.Final.jar
```

and ./python/run-tests --python-executables=python3 --testnames "pyspark.sql.tests.connect.test_connect_collection" will fail:

```
======================================================================
ERROR [0.002s]: test_to_pandas (pyspark.sql.tests.connect.test_connect_collection.SparkConnectCollectionTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/panbingkun/Developer/spark/spark-community/python/pyspark/sql/tests/connect/test_connect_collection.py", line 108, in test_to_pandas
    self.connect.sql(query).toPandas(),
  File "/Users/panbingkun/Developer/spark/spark-community/python/pyspark/sql/connect/session.py", line 753, in sql
    data, properties, ei = self.client.execute_command(cmd.command(self._client))
  File "/Users/panbingkun/Developer/spark/spark-community/python/pyspark/sql/connect/client/core.py", line 1109, in execute_command
    data, _, metrics, observed_metrics, properties = self._execute_and_fetch(
  File "/Users/panbingkun/Developer/spark/spark-community/python/pyspark/sql/connect/client/core.py", line 1517, in _execute_and_fetch
    for response in self._execute_and_fetch_as_iterator(
  File "/Users/panbingkun/Developer/spark/spark-community/python/pyspark/sql/connect/client/core.py", line 1494, in _execute_and_fetch_as_iterator
    self._handle_error(error)
  File "/Users/panbingkun/Developer/spark/spark-community/python/pyspark/sql/connect/client/core.py", line 1764, in _handle_error
    self._handle_rpc_error(error)
  File "/Users/panbingkun/Developer/spark/spark-community/python/pyspark/sql/connect/client/core.py", line 1849, in _handle_rpc_error
    raise SparkConnectGrpcException(str(rpc_error)) from None
pyspark.errors.exceptions.connect.SparkConnectGrpcException: <_MultiThreadedRendezvous of RPC that terminated with:
	status = StatusCode.INTERNAL
	details = "Encountered end-of-stream mid-frame"
	debug_error_string = "UNKNOWN:Error received from peer  {created_time:"2024-11-06T16:13:26.71221+08:00", grpc_status:13, grpc_message:"Encountered end-of-stream mid-frame"}"
>

----------------------------------------------------------------------
Ran 9 tests in 12.791s

FAILED (errors=5)

Generating XML reports...
Generated XML report: target/test-reports/TEST-pyspark.sql.tests.connect.test_connect_collection.SparkConnectCollectionTests-20241106161316.xml
Generated XML report: target/test-reports/TEST-pyspark.sql.tests.connect.test_connect_basic.SparkConnectSQLTestCase-20241106161316.xml

Had test failures in pyspark.sql.tests.connect.test_connect_collection with python3; see logs.
```
  • When we use Maven to compile the Spark project, in the directory assembly/target/scala-2.13/jars we will find that the netty-related dependencies are all version 4.1.110.Final, as follows:
```shell
(pyspark) ➜  jars git:(master) ✗ ls -1 netty-*4*
netty-all-4.1.110.Final.jar
netty-buffer-4.1.110.Final.jar
netty-codec-4.1.110.Final.jar
netty-codec-http-4.1.110.Final.jar
netty-codec-http2-4.1.110.Final.jar
netty-codec-socks-4.1.110.Final.jar
netty-common-4.1.110.Final.jar
netty-handler-4.1.110.Final.jar
netty-handler-proxy-4.1.110.Final.jar
netty-resolver-4.1.110.Final.jar
netty-tcnative-boringssl-static-2.0.66.Final-linux-aarch_64.jar
netty-tcnative-boringssl-static-2.0.66.Final-linux-x86_64.jar
netty-tcnative-boringssl-static-2.0.66.Final-osx-aarch_64.jar
netty-tcnative-boringssl-static-2.0.66.Final-osx-x86_64.jar
netty-tcnative-boringssl-static-2.0.66.Final-windows-x86_64.jar
netty-transport-4.1.110.Final.jar
netty-transport-classes-epoll-4.1.110.Final.jar
netty-transport-classes-kqueue-4.1.110.Final.jar
netty-transport-native-epoll-4.1.110.Final-linux-aarch_64.jar
netty-transport-native-epoll-4.1.110.Final-linux-riscv64.jar
netty-transport-native-epoll-4.1.110.Final-linux-x86_64.jar
netty-transport-native-kqueue-4.1.110.Final-osx-aarch_64.jar
netty-transport-native-kqueue-4.1.110.Final-osx-x86_64.jar
netty-transport-native-unix-common-4.1.110.Final.jar
```

And ./python/run-tests --python-executables=python3 --testnames "pyspark.sql.tests.connect.test_connect_collection" will succeed:

```shell
(pyspark) ➜  spark-community git:(master) ✗ ./python/run-tests --python-executables=python3 --testnames "pyspark.sql.tests.connect.test_connect_collection"
Running PySpark tests. Output is in /Users/panbingkun/Developer/spark/spark-community/python/unit-tests.log
Will test against the following Python executables: ['python3']
Will test the following Python tests: ['pyspark.sql.tests.connect.test_connect_collection']
python3 python_implementation is CPython
python3 version is: Python 3.9.19
Starting test(python3): pyspark.sql.tests.connect.test_connect_collection (temp output: /Users/panbingkun/Developer/spark/spark-community/python/target/f5aa47b5-2b52-4106-a997-30e54b1316e6/python3__pyspark.sql.tests.connect.test_connect_collection__5e2svetz.log)
Finished test(python3): pyspark.sql.tests.connect.test_connect_collection (15s)
Tests passed in 15 seconds
```
  • Although we have excluded the netty* artifacts (e.g. netty-handler...jar) that ZooKeeper depends on in the pom.xml file, the sbt build result clearly does not meet our expectations.

    spark/pom.xml (lines 1543 to 1568 in 4771638):

    ```xml
    <dependency>
      <groupId>org.apache.zookeeper</groupId>
      <artifactId>zookeeper</artifactId>
      <version>${zookeeper.version}</version>
      <scope>${hadoop.deps.scope}</scope>
      <exclusions>
        <exclusion>
          <groupId>org.jboss.netty</groupId>
          <artifactId>netty</artifactId>
        </exclusion>
        <exclusion>
          <groupId>jline</groupId>
          <artifactId>jline</artifactId>
        </exclusion>
        <exclusion>
          <groupId>io.netty</groupId>
          <artifactId>netty-handler</artifactId>
        </exclusion>
        <exclusion>
          <groupId>io.netty</groupId>
          <artifactId>netty-transport-native-epoll</artifactId>
        </exclusion>
        <exclusion>
          <groupId>io.netty</groupId>
          <artifactId>netty-tcnative-boringssl-static</artifactId>
        </exclusion>
    ```

  • So, for future upgrades (some other non-ZooKeeper dependencies may have similar situations and may encounter the above issues), I will solve this problem in SparkBuild.scala; see the sketch below.
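
For illustration, here is a minimal sketch of what such a fix could look like in SparkBuild.scala, assuming sbt's dependencyOverrides mechanism is used to pin the io.netty artifacts to a single version; the object name, module list, and version below are hypothetical, not the actual patch:

```scala
// Hypothetical sketch (not the actual patch): pin io.netty artifacts to one
// version via sbt's dependencyOverrides, so a transitive bump coming from a
// dependency such as ZooKeeper cannot mix 4.1.110 and 4.1.113 jars.
object NettyOverrides {
  import sbt._
  import sbt.Keys._

  private val nettyVersion = "4.1.110.Final" // assumed target version

  private val nettyModules = Seq(
    "netty-buffer", "netty-codec", "netty-common", "netty-handler",
    "netty-resolver", "netty-transport"
  )

  lazy val settings: Seq[Setting[_]] = Seq(
    dependencyOverrides ++= nettyModules.map(m => "io.netty" % m % nettyVersion)
  )
}
```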

@LuciferYang (Contributor)

How was '4.1.113' introduced?

@LuciferYang (Contributor)

```xml
<exclusion>
  <groupId>org.jboss.netty</groupId>
  <artifactId>netty</artifactId>
</exclusion>
```

Here, what is excluded is the dependency on netty 3.x. The groupId for netty 3.x and 4.x is different.

@panbingkun (Contributor, Author)

cc @HyukjinKwon @LuciferYang

> We have a macOS test at https://github.com/apache/spark/actions/workflows/build_maven_java21_macos15.yml but we're not running PySpark tests there now. We could improve this further.

Based on the above analysis, I believe that any Maven-based build may not detect this issue in advance.

@panbingkun (Contributor, Author) commented Nov 6, 2024

Below is the dependency exclusion for netty...

```xml
<exclusion>
  <groupId>org.jboss.netty</groupId>
  <artifactId>netty</artifactId>
</exclusion>
```

Here, what is excluded is the dependency on netty 3.x. The groupId for netty 3.x and 4.x is different.

[image: the netty dependency exclusion]

@panbingkun (Contributor, Author)

A new PR for it: #48771

@LuciferYang (Contributor) commented Nov 6, 2024

We should change to exclude

```xml
<exclusion>
  <groupId>io.netty</groupId>
  <artifactId>*</artifactId>
</exclusion>
```

from ZooKeeper, instead of

```xml
<exclusion>
  <groupId>org.jboss.netty</groupId>
  <artifactId>netty</artifactId>
</exclusion>
```

org.jboss.netty is the groupId of netty 3.x.
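
For completeness, the sbt-side analogue of that wildcard exclusion would look roughly like the sketch below, assuming the ZooKeeper dependency were declared directly in the sbt DSL (the coordinates are illustrative):

```scala
// Sketch only: exclude every netty artifact ZooKeeper pulls in transitively,
// mirroring the proposed Maven wildcard exclusion on groupId io.netty.
// netty 3.x and 4.x use different groupIds, so both rules are listed.
libraryDependencies += ("org.apache.zookeeper" % "zookeeper" % "3.9.3")
  .excludeAll(
    ExclusionRule(organization = "io.netty"),       // netty 4.x
    ExclusionRule(organization = "org.jboss.netty") // netty 3.x
  )
```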

@panbingkun (Contributor, Author)

> We should change to exclude io.netty:* from ZooKeeper, instead of org.jboss.netty:netty. org.jboss.netty is the groupId of netty 3.x.

It doesn't seem to work. I've tried it, and I can confirm again.

@panbingkun (Contributor, Author)

> We should change to exclude io.netty:* from ZooKeeper, instead of org.jboss.netty:netty. org.jboss.netty is the groupId of netty 3.x.

Unfortunately, this issue still exists.

```shell
(pyspark) ➜  spark-community git:(master) ✗ cd assembly/target/scala-2.13/jars
(pyspark) ➜  jars git:(master) ✗ ls -1 netty-*
netty-all-4.1.110.Final.jar
netty-buffer-4.1.113.Final.jar
netty-codec-4.1.113.Final.jar
netty-codec-http-4.1.110.Final.jar
netty-codec-http2-4.1.110.Final.jar
netty-codec-socks-4.1.110.Final.jar
netty-common-4.1.113.Final.jar
netty-handler-4.1.113.Final.jar
netty-handler-proxy-4.1.110.Final.jar
netty-resolver-4.1.113.Final.jar
netty-tcnative-boringssl-static-2.0.66.Final-linux-aarch_64.jar
netty-tcnative-boringssl-static-2.0.66.Final-linux-x86_64.jar
netty-tcnative-boringssl-static-2.0.66.Final-osx-aarch_64.jar
netty-tcnative-boringssl-static-2.0.66.Final-osx-x86_64.jar
netty-tcnative-boringssl-static-2.0.66.Final-windows-x86_64.jar
netty-tcnative-boringssl-static-2.0.66.Final.jar
netty-tcnative-classes-2.0.66.Final.jar
netty-transport-4.1.113.Final.jar
netty-transport-classes-epoll-4.1.113.Final.jar
netty-transport-classes-kqueue-4.1.110.Final.jar
netty-transport-native-epoll-4.1.113.Final-linux-aarch_64.jar
netty-transport-native-epoll-4.1.113.Final-linux-riscv64.jar
netty-transport-native-epoll-4.1.113.Final-linux-x86_64.jar
netty-transport-native-kqueue-4.1.110.Final-osx-aarch_64.jar
netty-transport-native-kqueue-4.1.110.Final-osx-x86_64.jar
netty-transport-native-unix-common-4.1.113.Final.jar
```

@LuciferYang (Contributor)

> Unfortunately, this issue still exists.

OK ~ I'll investigate whether there's a simpler method later.

@panbingkun (Contributor, Author)

The final solution is to upgrade netty to version 4.1.114 first, and then upgrade ZooKeeper.
I have verified locally that the netty jars generated by compiling with sbt are version 4.1.114.
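
As a quick sanity check for that kind of verification, a throwaway script along the following lines could scan the assembly output for mixed netty versions (a hypothetical helper, not part of the patch; the jars path matches the listings above):

```scala
// Hypothetical helper: report the distinct netty 4.x versions present in the
// sbt assembly output, to catch a 4.1.110/4.1.113 mix like the one above.
import java.io.File

object NettyVersionCheck extends App {
  val jarsDir = new File("assembly/target/scala-2.13/jars") // assumed build output path
  val versionPattern = """netty-.*?-(4\.\d+\.\d+)\.Final.*\.jar""".r

  val versions = Option(jarsDir.listFiles()).toSeq.flatten
    .map(_.getName)
    .collect { case versionPattern(v) => v }
    .toSet

  if (versions.size > 1) println(s"Mixed netty versions: ${versions.mkString(", ")}")
  else println(s"Consistent netty version: ${versions.mkString(", ")}")
}
```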

dongjoon-hyun pushed a commit that referenced this pull request Nov 7, 2024
### What changes were proposed in this pull request?
The pr aims to upgrade `ZooKeeper` from `3.9.2` to `3.9.3`.
This PR is to fix potential issues with PR #48666

### Why are the changes needed?
The full release notes: https://zookeeper.apache.org/doc/r3.9.3/releasenotes.html

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
- Pass GA.
- Manually check
```shell
./build/sbt -Phadoop-3 -Pkubernetes -Pkinesis-asl -Phive-thriftserver -Pdocker-integration-tests -Pyarn -Phadoop-cloud -Pspark-ganglia-lgpl -Phive -Pjvm-profiler clean package

[info] Note: Some input files use or override a deprecated API.
[info] Note: Recompile with -Xlint:deprecation for details.
[warn] multiple main classes detected: run 'show discoveredMainClasses' to see the list
[success] Total time: 272 s (04:32), completed Nov 6, 2024, 4:29:52 PM
```

```shell
(pyspark) ➜  spark-community git:(SPARK-50135_FOLLOWUP) ✗ ./python/run-tests --python-executables=python3 --testnames "pyspark.sql.tests.connect.test_connect_collection"
Running PySpark tests. Output is in /Users/panbingkun/Developer/spark/spark-community/python/unit-tests.log
Will test against the following Python executables: ['python3']
Will test the following Python tests: ['pyspark.sql.tests.connect.test_connect_collection']
python3 python_implementation is CPython
python3 version is: Python 3.9.19
Starting test(python3): pyspark.sql.tests.connect.test_connect_collection (temp output: /Users/panbingkun/Developer/spark/spark-community/python/target/097bd7e0-9311-4484-ae2d-c0f4c63fc6f9/python3__pyspark.sql.tests.connect.test_connect_collection__8dzaeio9.log)
Finished test(python3): pyspark.sql.tests.connect.test_connect_collection (14s)
Tests passed in 14 seconds

```

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #48771 from panbingkun/SPARK-50135_FOLLOWUP.

Authored-by: panbingkun <panbingkun@baidu.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>