[SPARK-48919][CONNECT] Move connect code generation and dependency management to a separate project #47378

Closed
hvanhovell wants to merge 6 commits into apache:master from hvanhovell:SPARK-48919

Conversation

@hvanhovell (Contributor)

What changes were proposed in this pull request?

This PR moves the connect protos and code generation into a separate module. This module produces a shaded artifact that is used by both the connect server and client.
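
As a rough illustration of the layout this describes (the module names and paths below are assumptions made for the sketch, not the PR's actual build definitions):

// illustrative SBT wiring only; names/paths are assumptions
lazy val connectApi = project in file("connect/api") // protos + codegen, publishes the shaded artifact
lazy val connectServer = (project in file("connect/server")).dependsOn(connectApi)
lazy val connectClientJvm = (project in file("connect/client/jvm")).dependsOn(connectApi)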

Why are the changes needed?

This is the first step in creating a Scala DataFrame API shared between sql/core and connect.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing tests.

Was this patch authored or co-authored using generative AI tooling?

No

<configuration>
<visitor>true</visitor>
<sourceDirectory>../api/src/main/antlr4</sourceDirectory>
<sourceDirectory>src/main/antlr4</sourceDirectory>
Contributor Author

Artifact from the last refactorings...

<shadedArtifactAttached>false</shadedArtifactAttached>
<promoteTransitiveDependencies>true</promoteTransitiveDependencies>
<artifactSet>
<!-- TODO make sure this is complete. -->
Contributor Author

Most of the shading is now done in connect-api.

Contributor Author

... and yes, I still need to check if the shading rules between SBT and Maven are still consistent.
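
For illustration, this is the shape of the sbt-assembly relocation rules that would have to stay in sync with the Maven shade plugin's <relocation> entries; the exact prefixes below are assumptions for the sketch, not the PR's actual rules:

assembly / assemblyShadeRules := Seq(
  // each rename here must match a <relocation> in the Maven shade plugin config
  ShadeRule.rename("com.google.protobuf.**" -> "org.sparkproject.connect.protobuf.@1").inAll,
  ShadeRule.rename("io.grpc.**" -> "org.sparkproject.connect.grpc.@1").inAll
)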

<artifactId>spark-sql-api_${scala.binary.version}</artifactId>
<version>${project.version}</version>
</dependency>
<dependency>
Contributor Author

This is all moved to connect-api.


<dependencies>
<dependency>
<groupId>org.apache.spark</groupId>
Contributor Author

This has all been moved to connect-api.

<filter>
<artifact>org.apache.tomcat:annotations-api</artifact>
<includes>
<include>javax/annotation/**</include>
Contributor Author, Jul 17, 2024

We actually only need javax.annotation.Generated. We are dropping the rest because shading all of javax is not something we want to do. The alternative is that we don't shade org.apache.tomcat:annotations-api and com.google.code.findbugs:jsr305.
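
In sbt-assembly terms, the same intent could be expressed with a merge strategy; a sketch under that assumption (not the PR's actual build code):

assembly / assemblyMergeStrategy := {
  // keep the one annotation the generated sources reference
  case PathList("javax", "annotation", "Generated.class") => MergeStrategy.first
  // drop the rest of javax.annotation instead of shading all of javax
  case PathList("javax", "annotation", _*) => MergeStrategy.discard
  case other => MergeStrategy.defaultMergeStrategy(other)
}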

</extensions>
<plugins>
<!-- Shading for protobuf & gRPC -->
<plugin>
Contributor Author

In a perfect world we would check that we are not packaging unshaded classes in the uber jar.
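
A hedged sketch of what such a check could look like: scan the assembled jar for class entries that should have been relocated (the helper name and the banned prefixes are illustrative, not part of this PR):

import java.util.jar.JarFile
import scala.jdk.CollectionConverters._

// returns class entries still living under a package the shading should have moved
def findUnshadedClasses(jarPath: String, bannedPrefixes: Seq[String]): Seq[String] = {
  val jar = new JarFile(jarPath)
  try {
    jar.entries().asScala.map(_.getName)
      .filter(n => n.endsWith(".class") && bannedPrefixes.exists(p => n.startsWith(p)))
      .toList
  } finally jar.close()
}
// e.g. findUnshadedClasses("connect-api-assembly.jar", Seq("com/google/protobuf/", "io/grpc/"))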

name.startsWith("pmml-model-") || name.startsWith("scala-collection-compat_") ||
name.startsWith("jsr305-") || name.startsWith("netty-") || name == "unused-1.0.0.jar"
val cp = (Runtime / managedClasspath).value
val prefixesToShade = Seq(
Contributor Author

I'd love to use a less wonky mechanism here.
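
For reference, the mechanism in question boils down to a jar-file-name prefix filter over the managed classpath; a minimal sketch (the real filterClasspath in the build may take Attributed[File] and differ in detail):

// keys off jar file names only, which is what makes it wonky
def filterClasspath(jars: Seq[java.io.File], prefixes: Seq[String]): Seq[java.io.File] =
  jars.filterNot(jar => prefixes.exists(p => jar.getName.startsWith(p)))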

object SparkConnect {
import BuildCommons.protoVersion

object SparkConnectServer {
Contributor Author

This is only here to make assembly work. It is a bit weird because this module does not need a full assembly anymore.

@hvanhovell (Contributor Author)

For the reviewers: most of this PR is mechanical, renaming imports to their new shaded names. Please focus on the Maven and SBT build files first!

@HyukjinKwon (Member)

cc @LuciferYang if you find some time to review.

@pan3793 (Member) commented Jul 17, 2024

I remember there are issues with Maven consuming a shaded module from the same project: you must run mvn install -pl <shaded-module> first, otherwise mvn test or mvn compile cannot see the shaded classes. That's why some Apache projects that use Maven as their build system create a dedicated git repo to manage the shaded deps, e.g.
https://github.com/apache/ratis-thirdparty
https://github.com/apache/hadoop-thirdparty
https://github.com/apache/hbase-thirdparty
https://github.com/apache/kyuubi-shaded
https://github.com/apache/flink-shaded

@hvanhovell (Contributor Author)

@pan3793 thanks for the input. I did check Maven packaging and that seemed to work (this is the command I used: build/mvn package -pl connector/connect/client/jvm -am). I did see that mvn test tried to consume the unshaded dependency and subsequently failed. I am not sure how much of an issue this is since we use SBT for CI.

@LuciferYang (Contributor)

Thank you for pinging me, @HyukjinKwon

@pan3793 (Member) commented Jul 17, 2024

I am not sure how much of an issue this is since we use SBT for CI.

@hvanhovell It significantly affects users who use Maven in local dev; we did this in Apache Kyuubi and eventually reverted it due to the bad dev experience.

IDEA recognizes Spark as a Maven project on import (at least by default; I am not sure how to import Spark as an SBT project), so it also affects users who want to run UTs in IDEA.

PS: due to some network issues in CN, SBT is hard to bootstrap, which blocks a lot of devs from trying such an awesome build tool ...

@LuciferYang (Contributor)

I am not sure how much of an issue this is since we use SBT for CI.

There are multiple daily test jobs now that use Maven.

@LuciferYang (Contributor)

@hvanhovell local run

build/mvn clean install -DskipTests -Phive
build/mvn test -pl connector/connect/client/jvm -Phive

then

[ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 36.08 s <<< FAILURE! -- in org.apache.spark.sql.JavaEncoderSuite
[ERROR] org.apache.spark.sql.JavaEncoderSuite -- Time elapsed: 36.08 s <<< ERROR!
org.apache.spark.sql.connect.client.RetriesExceeded
	at org.apache.spark.sql.connect.client.GrpcRetryHandler$Retrying.waitAfterAttempt(GrpcRetryHandler.scala:215)
	at org.apache.spark.sql.connect.client.GrpcRetryHandler$Retrying.retry(GrpcRetryHandler.scala:224)
	at org.apache.spark.sql.connect.client.GrpcRetryHandler.retry(GrpcRetryHandler.scala:38)
	at org.apache.spark.sql.connect.client.CustomSparkConnectBlockingStub.$anonfun$analyzePlan$1(CustomSparkConnectBlockingStub.scala:76)
	at org.apache.spark.sql.connect.client.GrpcExceptionConverter.convert(GrpcExceptionConverter.scala:58)
	at org.apache.spark.sql.connect.client.CustomSparkConnectBlockingStub.analyzePlan(CustomSparkConnectBlockingStub.scala:75)
	at org.apache.spark.sql.connect.client.SparkConnectClient.analyze(SparkConnectClient.scala:110)
	at org.apache.spark.sql.connect.client.SparkConnectClient.analyze(SparkConnectClient.scala:238)
	at org.apache.spark.sql.connect.client.SparkConnectClient.analyze(SparkConnectClient.scala:209)
	at org.apache.spark.sql.SparkSession.version$lzycompute(SparkSession.scala:82)
	at org.apache.spark.sql.SparkSession.version(SparkSession.scala:81)
	at org.apache.spark.sql.test.SparkConnectServerUtils$.createSparkSession(RemoteSparkSession.scala:192)
	at org.apache.spark.sql.test.SparkConnectServerUtils.createSparkSession(RemoteSparkSession.scala)
	at org.apache.spark.sql.JavaEncoderSuite.setup(JavaEncoderSuite.java:42)

@LuciferYang (Contributor)

local run

./dev/test-dependencies.sh --replace-manifest
git diff
diff --git a/dev/deps/spark-deps-hadoop-3-hive-2.3 b/dev/deps/spark-deps-hadoop-3-hive-2.3
index 9dba90b81b6..a1278741a87 100644
--- a/dev/deps/spark-deps-hadoop-3-hive-2.3
+++ b/dev/deps/spark-deps-hadoop-3-hive-2.3
@@ -10,7 +10,10 @@ aliyun-java-sdk-core/4.5.10//aliyun-java-sdk-core-4.5.10.jar
 aliyun-java-sdk-kms/2.11.0//aliyun-java-sdk-kms-2.11.0.jar
 aliyun-java-sdk-ram/3.1.0//aliyun-java-sdk-ram-3.1.0.jar
 aliyun-sdk-oss/3.13.2//aliyun-sdk-oss-3.13.2.jar
+animal-sniffer-annotations/1.23//animal-sniffer-annotations-1.23.jar
+annotations-api/6.0.53//annotations-api-6.0.53.jar
 annotations/17.0.0//annotations-17.0.0.jar
+annotations/4.1.1.4//annotations-4.1.1.4.jar
 antlr-runtime/3.5.2//antlr-runtime-3.5.2.jar
 antlr4-runtime/4.13.1//antlr4-runtime-4.13.1.jar
 aopalliance-repackaged/3.0.3//aopalliance-repackaged-3.0.3.jar
@@ -63,11 +66,22 @@ derby/10.16.1.1//derby-10.16.1.1.jar
 derbyshared/10.16.1.1//derbyshared-10.16.1.1.jar
 derbytools/10.16.1.1//derbytools-10.16.1.1.jar
 dropwizard-metrics-hadoop-metrics2-reporter/0.1.2//dropwizard-metrics-hadoop-metrics2-reporter-0.1.2.jar
+error_prone_annotations/2.18.0//error_prone_annotations-2.18.0.jar
 esdk-obs-java/3.20.4.2//esdk-obs-java-3.20.4.2.jar
 flatbuffers-java/23.5.26//flatbuffers-java-23.5.26.jar
 gcs-connector/hadoop3-2.2.21/shaded/gcs-connector-hadoop3-2.2.21-shaded.jar
 gmetric4j/1.0.10//gmetric4j-1.0.10.jar
-gson/2.2.4//gson-2.2.4.jar
+grpc-api/1.62.2//grpc-api-1.62.2.jar
+grpc-context/1.62.2//grpc-context-1.62.2.jar
+grpc-core/1.62.2//grpc-core-1.62.2.jar
+grpc-inprocess/1.62.2//grpc-inprocess-1.62.2.jar
+grpc-netty/1.62.2//grpc-netty-1.62.2.jar
+grpc-protobuf-lite/1.62.2//grpc-protobuf-lite-1.62.2.jar
+grpc-protobuf/1.62.2//grpc-protobuf-1.62.2.jar
+grpc-services/1.62.2//grpc-services-1.62.2.jar
+grpc-stub/1.62.2//grpc-stub-1.62.2.jar
+grpc-util/1.62.2//grpc-util-1.62.2.jar
+gson/2.8.9//gson-2.8.9.jar
 guava/14.0.1//guava-14.0.1.jar
 hadoop-aliyun/3.4.0//hadoop-aliyun-3.4.0.jar
 hadoop-annotations/3.4.0//hadoop-annotations-3.4.0.jar
@@ -102,6 +116,7 @@ icu4j/75.1//icu4j-75.1.jar
 ini4j/0.5.4//ini4j-0.5.4.jar
 istack-commons-runtime/3.0.8//istack-commons-runtime-3.0.8.jar
 ivy/2.5.2//ivy-2.5.2.jar
+j2objc-annotations/2.8//j2objc-annotations-2.8.jar
...

I think we should modify assembly/pom.xml to prevent dependency leakage.

"scala-",
"netty-")
val unexpectedUnshadedJars = filterClasspath(unshadedJars, expectedUnshadedPrefixes)
if (unexpectedUnshadedJars.nonEmpty) {
Contributor

Run / Run TPC-DS queries with SF=1:

[error] java.lang.IllegalStateException: Unexpected unshaded jar(s) found:
[error]  - Attributed(/home/runner/.cache/coursier/v1/https/maven-central.storage-download.googleapis.com/maven2/org/jpmml/pmml-model/1.4.8/pmml-model-1.4.8.jar)
[error] 	at SparkConnectApi$.$anonfun$settings$15(SparkBuild.scala:692)
[error] 	at scala.Function1.$anonfun$compose$1(Function1.scala:49)
[error] 	at sbt.internal.util.$tilde$greater.$anonfun$$u2219$1(TypeFunctions.scala:63)
[error] 	at sbt.std.Transform$$anon$4.work(Transform.scala:69)
[error] 	at sbt.Execute.$anonfun$submit$2(Execute.scala:283)
[error] 	at sbt.internal.util.ErrorHandling$.wideConvert(ErrorHandling.scala:24)
[error] 	at sbt.Execute.work(Execute.scala:292)
[error] 	at sbt.Execute.$anonfun$submit$1(Execute.scala:283)
[error] 	at sbt.ConcurrentRestrictions$$anon$4.$anonfun$submitValid$1(ConcurrentRestrictions.scala:265)
[error] 	at sbt.CompletionService$$anon$2.call(CompletionService.scala:65)
[error] 	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
[error] 	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
[error] 	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
[error] 	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
[error] 	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
[error] 	at java.base/java.lang.Thread.run(Thread.java:840)
[error] (connect-api / assembly / assemblyExcludedJars) java.lang.IllegalStateException: Unexpected unshaded jar(s) found:
[error]  - Attributed(/home/runner/.cache/coursier/v1/https/maven-central.storage-download.googleapis.com/maven2/org/jpmml/pmml-model/1.4.8/pmml-model-1.4.8.jar)
[error] Total time: 234 s (03:54), completed Jul 17, 2024, 1:49:40 AM

It's a bit strange that only the Run TPC-DS queries with SF=1 task detected this issue.

@LuciferYang (Contributor)

connect = Module(
name="connect",
dependencies=[hive, avro, protobuf],
source_file_regexes=[
"connect",
"connector/connect",
],
sbt_test_goals=[
"connect/test",
"connect-client-jvm/test",
],
)

Although the connect-api module does not have test cases, should the configuration of the connect module be changed so that the corresponding tests are triggered when changes occur in connect-api?

@LuciferYang (Contributor) commented Jul 17, 2024

local run

build/mvn clean compile -DskipTests

then failed

[INFO] compiling 35 Scala sources to /Users/yangjie01/SourceCode/git/spark-mine-13/connect/common/target/scala-2.13/classes ...
[ERROR] [Error] /Users/yangjie01/SourceCode/git/spark-mine-13/connect/common/src/main/scala/org/apache/spark/sql/connect/common/ProtoUtils.scala:20: object sparkproject is not a member of package org
[ERROR] [Error] /Users/yangjie01/SourceCode/git/spark-mine-13/connect/common/src/main/scala/org/apache/spark/sql/connect/common/ProtoUtils.scala:55: not found: type Message
[ERROR] [Error] /Users/yangjie01/SourceCode/git/spark-mine-13/connect/common/src/main/scala/org/apache/spark/sql/connect/common/ProtoUtils.scala:57: not found: type Parser
[ERROR] [Error] /Users/yangjie01/SourceCode/git/spark-mine-13/connect/common/src/main/scala/org/apache/spark/sql/connect/client/ArtifactManager.scala:35: object sparkproject is not a member of package org
[ERROR] [Error] /Users/yangjie01/SourceCode/git/spark-mine-13/connect/common/src/main/scala/org/apache/spark/sql/connect/client/ArtifactManager.scala:36: object sparkproject is not a member of package org
[ERROR] [Error] /Users/yangjie01/SourceCode/git/spark-mine-13/connect/common/src/main/scala/org/apache/spark/sql/connect/client/ArtifactManager.scala:37: object sparkproject is not a member of package org
[ERROR] [Error] /Users/yangjie01/SourceCode/git/spark-mine-13/connect/common/src/main/scala/org/apache/spark/sql/connect/client/SparkConnectClient.scala:28: object sparkproject is not a member of package org
[ERROR] [Error] /Users/yangjie01/SourceCode/git/spark-mine-13/connect/common/src/main/scala/org/apache/spark/sql/connect/client/ArtifactManager.scala:239: not found: type StatusRuntimeException
[ERROR] [Error] /Users/yangjie01/SourceCode/git/spark-mine-13/connect/common/src/main/scala/org/apache/spark/sql/connect/client/ArtifactManager.scala:246: not found: type StreamObserver
[ERROR] [Error] /Users/yangjie01/SourceCode/git/spark-mine-13/connect/common/src/main/scala/org/apache/spark/sql/connect/client/CustomSparkConnectStub.scala:20: object sparkproject is not a member of package org
[ERROR] [Error] /Users/yangjie01/SourceCode/git/spark-mine-13/connect/common/src/main/scala/org/apache/spark/sql/connect/client/CustomSparkConnectStub.scala:31: not found: type StreamObserver
[ERROR] [Error] /Users/yangjie01/SourceCode/git/spark-mine-13/connect/common/src/main/scala/org/apache/spark/sql/connect/client/CustomSparkConnectStub.scala:30: not found: type StreamObserver
[ERROR] [Error] /Users/yangjie01/SourceCode/git/spark-mine-13/connect/common/src/main/scala/org/apache/spark/sql/connect/client/ArtifactManager.scala:314: not found: type StreamObserver
[ERROR] [Error] /Users/yangjie01/SourceCode/git/spark-mine-13/connect/common/src/main/scala/org/apache/spark/sql/connect/client/ArtifactManager.scala:370: not found: type StreamObserver
[ERROR] [Error] /Users/yangjie01/SourceCode/git/spark-mine-13/connect/common/src/main/scala/org/apache/spark/sql/connect/client/ArtifactManager.scala:325: not found: value ByteString
[ERROR] [Error] /Users/yangjie01/SourceCode/git/spark-mine-13/connect/common/src/main/scala/org/apache/spark/sql/connect/client/ArtifactManager.scala:350: not found: type ByteString
[ERROR] [Error] /Users/yangjie01/SourceCode/git/spark-mine-13/connect/common/src/main/scala/org/apache/spark/sql/connect/client/ArtifactManager.scala:360: not found: value ByteString
[ERROR] [Error] /Users/yangjie01/SourceCode/git/spark-mine-13/connect/common/src/main/scala/org/apache/spark/sql/connect/client/ArtifactManager.scala:361: not found: value ByteString
[ERROR] [Error] /Users/yangjie01/SourceCode/git/spark-mine-13/connect/common/src/main/scala/org/apache/spark/sql/connect/client/ArtifactManager.scala:35: Unused import
Applicable -Wconf / @nowarn filters for this fatal warning: msg=<part of the message>, cat=unused-imports, site=org.apache.spark.sql.connect.client
[ERROR] [Error] /Users/yangjie01/SourceCode/git/spark-mine-13/connect/common/src/main/scala/org/apache/spark/sql/connect/client/ArtifactManager.scala:36: Unused import
Applicable -Wconf / @nowarn filters for this fatal warning: msg=<part of the message>, cat=unused-imports, site=org.apache.spark.sql.connect.client
[ERROR] [Error] /Users/yangjie01/SourceCode/git/spark-mine-13/connect/common/src/main/scala/org/apache/spark/sql/connect/client/ArtifactManager.scala:37: Unused import
....
[INFO] Reactor Summary for Spark Project Parent POM 4.0.0-SNAPSHOT:
[INFO] 
[INFO] Spark Project Parent POM ........................... SUCCESS [  0.930 s]
[INFO] Spark Project Tags ................................. SUCCESS [  2.402 s]
[INFO] Spark Project Sketch ............................... SUCCESS [  1.448 s]
[INFO] Spark Project Common Utils ......................... SUCCESS [  8.096 s]
[INFO] Spark Project Local DB ............................. SUCCESS [  2.192 s]
[INFO] Spark Project Networking ........................... SUCCESS [  3.045 s]
[INFO] Spark Project Shuffle Streaming Service ............ SUCCESS [  2.513 s]
[INFO] Spark Project Variant .............................. SUCCESS [  1.736 s]
[INFO] Spark Project Unsafe ............................... SUCCESS [  2.052 s]
[INFO] Spark Project Launcher ............................. SUCCESS [  1.796 s]
[INFO] Spark Project Core ................................. SUCCESS [ 29.749 s]
[INFO] Spark Project ML Local Library ..................... SUCCESS [  4.818 s]
[INFO] Spark Project GraphX ............................... SUCCESS [  7.816 s]
[INFO] Spark Project Streaming ............................ SUCCESS [  9.809 s]
[INFO] Spark Project Connect API .......................... SUCCESS [  9.421 s]
[INFO] Spark Project SQL API .............................. SUCCESS [ 11.768 s]
[INFO] Spark Project Catalyst ............................. SUCCESS [ 27.705 s]
[INFO] Spark Project SQL .................................. SUCCESS [ 27.801 s]
[INFO] Spark Project ML Library ........................... SUCCESS [ 30.874 s]
[INFO] Spark Project Tools ................................ SUCCESS [  2.164 s]
[INFO] Spark Project Hive ................................. SUCCESS [  9.120 s]
[INFO] Spark Project REPL ................................. SUCCESS [  3.762 s]
[INFO] Spark Project Connect Common ....................... FAILURE [  3.242 s]
....

Running build/mvn clean compile -DskipTests on the master branch is OK.

@hvanhovell (Contributor Author)

@LuciferYang where are those tests running?

@LuciferYang (Contributor)

For example:

hvanhovell closed this on Aug 20, 2024