[SPARK-48919][CONNECT] Move connect code generation and dependency management to a separate project #47378
hvanhovell wants to merge 6 commits into apache:master
Conversation
<configuration>
  <visitor>true</visitor>
  <sourceDirectory>../api/src/main/antlr4</sourceDirectory>
  <sourceDirectory>src/main/antlr4</sourceDirectory>
Artifact from the last refactorings...
<shadedArtifactAttached>false</shadedArtifactAttached>
<promoteTransitiveDependencies>true</promoteTransitiveDependencies>
<artifactSet>
  <!-- TODO make sure this is complete. -->
Most of the shading is now done in connect-api.
... and yes, I still need to check if the shading rules between SBT and Maven are still consistent.
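For context, a sketch of what keeping the two builds consistent means: the sbt-assembly shade rules would need to mirror the Maven <relocation> entries. This is illustrative only; the shaded prefix org.sparkproject.connect is an assumption, not taken from this PR.

```scala
// build.sbt sketch (hypothetical): sbt-assembly counterparts of the Maven
// <relocation> rules for protobuf and gRPC. The "org.sparkproject.connect"
// prefix is an assumption for illustration.
assembly / assemblyShadeRules := Seq(
  ShadeRule.rename("com.google.protobuf.**" -> "org.sparkproject.connect.protobuf.@1").inAll,
  ShadeRule.rename("io.grpc.**" -> "org.sparkproject.connect.grpc.@1").inAll
)
```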
<artifactId>spark-sql-api_${scala.binary.version}</artifactId>
<version>${project.version}</version>
</dependency>
<dependency>
This is all moved to connect-api.
<dependencies>
  <dependency>
    <groupId>org.apache.spark</groupId>
This has all been moved to connect-api.
<filter>
  <artifact>org.apache.tomcat:annotations-api</artifact>
  <includes>
    <include>javax/annotation/**</include>
We actually only need javax.annotation.Generated. We are dropping the rest because shading all of javax is not something we want to do. The alternative is to leave org.apache.tomcat:annotations-api and com.google.code.findbugs:jsr305 unshaded.
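A sketch of the same idea expressed with sbt-assembly (assuming the SBT side uses a merge strategy for this; not taken from the PR): keep only Generated.class and discard the rest of javax.annotation.

```scala
// build.sbt sketch (hypothetical): retain only javax.annotation.Generated
// and drop the other javax.annotation classes pulled in by annotations-api
// and jsr305.
ThisBuild / assemblyMergeStrategy := {
  case PathList("javax", "annotation", "Generated.class") => MergeStrategy.first
  case PathList("javax", "annotation", _*)                => MergeStrategy.discard
  case x =>
    val oldStrategy = (ThisBuild / assemblyMergeStrategy).value
    oldStrategy(x)
}
```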
</extensions>
<plugins>
  <!-- Shading for protobuf & gRPC -->
  <plugin>
In a perfect world we would check that we are not packaging unshaded classes in the uber jar.
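Such a check could look roughly like this (a hypothetical helper, not part of this PR): open the assembled jar and fail if any class file sits under a package prefix that should have been relocated.

```scala
// Hypothetical sanity check: list class files in the uber jar that live under
// prefixes that should have been relocated by shading.
import java.util.jar.JarFile
import scala.jdk.CollectionConverters._

def findUnshadedClasses(uberJar: java.io.File, bannedPrefixes: Seq[String]): Seq[String] = {
  val jar = new JarFile(uberJar)
  try {
    jar.entries().asScala
      .map(_.getName)
      .filter(n => n.endsWith(".class") && bannedPrefixes.exists(n.startsWith))
      .toList
  } finally jar.close()
}

// e.g. fail the build if raw protobuf/gRPC classes leaked into the jar:
// require(findUnshadedClasses(uberJarFile, Seq("com/google/protobuf/", "io/grpc/")).isEmpty)
```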
name.startsWith("pmml-model-") || name.startsWith("scala-collection-compat_") ||
name.startsWith("jsr305-") || name.startsWith("netty-") || name == "unused-1.0.0.jar"
val cp = (Runtime / managedClasspath).value
val prefixesToShade = Seq(
I'd love to use a less wonky mechanism here.
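One possibly less wonky alternative (a sketch of an option, not what the PR does): select jars via the module coordinates sbt attaches to each managed classpath entry, instead of matching jar file name prefixes. The organizations listed are illustrative.

```scala
// Sketch: pick jars to shade by Maven organization via the ModuleID attribute
// on managed classpath entries, rather than by jar name prefix.
val cp = (Runtime / managedClasspath).value
val orgsToShade = Set("com.google.protobuf", "io.grpc", "io.netty")
val jarsToShade = cp.filter { entry =>
  entry.get(Keys.moduleID.key).exists(m => orgsToShade.contains(m.organization))
}
```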
object SparkConnect {
  import BuildCommons.protoVersion

object SparkConnectServer {
This is only here to make assembly work. It is a bit weird because this module does not need a full assembly anymore.
For the reviewers: most of this PR is mechanical, renaming imports to their new shaded names. Please focus on the Maven and SBT build files first!

cc @LuciferYang if you find some time to review.

I remember there are issues with Maven consuming a shaded module in the same project, i.e. you must run …

@pan3793 thanks for the input. I did check maven package and that seemed to work for packaging (this is the command I used: …)

Thank you for pinging me, @HyukjinKwon

@hvanhovell It significantly affects users who use Maven for local dev; we did this in Apache Kyuubi and eventually reverted it due to the bad dev experience. IDEA recognizes Spark as a Maven project on import (at least by default; I'm not sure how to import Spark as an SBT project), so it also affects users who want to run UTs in IDEA. PS: due to some network issues in CN, SBT is hard to bootstrap, which blocks a lot of devs from trying such an awesome build tool ...

There are multiple daily jobs that now use Maven for testing.

@hvanhovell local run, then …

local run: I think we should modify …
| "scala-", | ||
| "netty-") | ||
| val unexpectedUnshadedJars = filterClasspath(unshadedJars, expectedUnshadedPrefixes) | ||
| if (unexpectedUnshadedJars.nonEmpty) { |
Run / Run TPC-DS queries with SF=1
[error] java.lang.IllegalStateException: Unexpected unshaded jar(s) found:
[error] - Attributed(/home/runner/.cache/coursier/v1/https/maven-central.storage-download.googleapis.com/maven2/org/jpmml/pmml-model/1.4.8/pmml-model-1.4.8.jar)
[error] at SparkConnectApi$.$anonfun$settings$15(SparkBuild.scala:692)
[error] at scala.Function1.$anonfun$compose$1(Function1.scala:49)
[error] at sbt.internal.util.$tilde$greater.$anonfun$$u2219$1(TypeFunctions.scala:63)
[error] at sbt.std.Transform$$anon$4.work(Transform.scala:69)
[error] at sbt.Execute.$anonfun$submit$2(Execute.scala:283)
[error] at sbt.internal.util.ErrorHandling$.wideConvert(ErrorHandling.scala:24)
[error] at sbt.Execute.work(Execute.scala:292)
[error] at sbt.Execute.$anonfun$submit$1(Execute.scala:283)
[error] at sbt.ConcurrentRestrictions$$anon$4.$anonfun$submitValid$1(ConcurrentRestrictions.scala:265)
[error] at sbt.CompletionService$$anon$2.call(CompletionService.scala:65)
[error] at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
[error] at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
[error] at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
[error] at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
[error] at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
[error] at java.base/java.lang.Thread.run(Thread.java:840)
[error] (connect-api / assembly / assemblyExcludedJars) java.lang.IllegalStateException: Unexpected unshaded jar(s) found:
[error] - Attributed(/home/runner/.cache/coursier/v1/https/maven-central.storage-download.googleapis.com/maven2/org/jpmml/pmml-model/1.4.8/pmml-model-1.4.8.jar)
[error] Total time: 234 s (03:54), completed Jul 17, 2024, 1:49:40 AM
It's a bit strange that only the Run TPC-DS queries with SF=1 task detected this issue.

spark/dev/sparktestsupport/modules.py, lines 323 to 334 at 3a24555. Although the …

local run, then failed run …

@LuciferYang where are those tests running?

For example: …
What changes were proposed in this pull request?
This PR moves the connect protos and code generation into a separate module. This module produces a shaded artifact that is used by both the connect server and client.
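In sbt terms the resulting layout looks roughly like the sketch below; module and directory names here are illustrative, not the exact ones in the PR.

```scala
// SparkBuild.scala sketch (names illustrative): connect-api owns the protos,
// code generation, and protobuf/gRPC shading; server and client depend on
// the shaded artifact instead of shading protobuf themselves.
lazy val connectApi = (project in file("connect/api"))
  .settings( /* protobuf codegen + shading settings */ )

lazy val connectServer = (project in file("connect/server"))
  .dependsOn(connectApi)

lazy val connectClient = (project in file("connect/client"))
  .dependsOn(connectApi)
```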
Why are the changes needed?
This is the first step in creating a Scala Dataframe API shared between sql/core and connect.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Existing tests.
Was this patch authored or co-authored using generative AI tooling?
No