
[SPARK-42653][CONNECT] Artifact transfer from Scala/JVM client to Server #40256

Closed · wants to merge 5 commits

Conversation

@vicennial (Contributor) commented on Mar 2, 2023

What changes were proposed in this pull request?

This PR introduces a mechanism to transfer artifacts (currently, local `.jar` + `.class` files) from a Spark Connect JVM/Scala client over to the server side of Spark Connect. The mechanism follows the protocol defined in #40147 and supports batching (for multiple "small" artifacts) and chunking (for large artifacts).

Note: Server-side artifact handling is not covered in this PR.

Why are the changes needed?

In the decoupled client-server architecture of Spark Connect, a remote client may use a local JAR or a new class in their UDF that may not be present on the server. To handle these cases of missing "artifacts", we implement a mechanism to transfer artifacts from the client side over to the server side as per the protocol defined in #40147.

Does this PR introduce any user-facing change?

Yes, users would be able to use the `addArtifact` and `addArtifacts` methods (via a `SparkSession` instance) to transfer local files (`.jar` and `.class` extensions).
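A minimal usage sketch (hedged: the connection string, paths, and builder call are illustrative, and the vararg-`URI` shape of `addArtifacts` is an assumption; only the method names come from this PR):

```scala
import java.net.URI
import org.apache.spark.sql.SparkSession

// Illustrative connection string and file paths.
val spark = SparkSession.builder().remote("sc://localhost").build()

spark.addArtifact("/path/to/my-udfs.jar")                                   // single local .jar
spark.addArtifacts(new URI("/path/to/A.class"), new URI("/path/to/b.jar")) // batched transfer (assumed URI varargs)
```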

How was this patch tested?

Unit tests, located in `ArtifactSuite`.

/**
 * Currently only local files with extensions .jar and .class are supported.
 */
def addArtifact(path: String): Unit = client.addArtifact(path)
Contributor: Can you mark these as experimental?
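For reference, marking the method experimental would look roughly like this (a sketch; `Experimental` is Spark's existing annotation in `org.apache.spark.annotation`):

```scala
import org.apache.spark.annotation.Experimental

/**
 * Currently only local files with extensions .jar and .class are supported.
 */
@Experimental
def addArtifact(path: String): Unit = client.addArtifact(path)
```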

  writeBatch()
}
stream.onCompleted()
ThreadUtils.awaitResult(promise.future, Duration.Inf)
Contributor: I am a bit on the fence about this one. This is fine for now, but in the not-so-distant future we shouldn't block indefinitely.

vicennial (author): I agree, we need a timeout policy. Handling this as part of https://issues.apache.org/jira/browse/SPARK-42658 (along with a retry policy).
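A sketch of what a bounded wait could look like once SPARK-42658 lands (the timeout value and where it comes from are assumptions):

```scala
import scala.concurrent.duration._

// Illustrative only: bound the wait instead of blocking forever. The concrete
// timeout (and whether it is configurable) is left to SPARK-42658.
private val transferTimeout = 5.minutes
ThreadUtils.awaitResult(promise.future, transferTimeout)
```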

}
stream.onCompleted()
ThreadUtils.awaitResult(promise.future, Duration.Inf)
// TODO: Handle responses containing CRC failures.
Contributor: File a ticket, please.
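A hypothetical shape for that handling (the `ArtifactSummary` fields are assumed from the #40147 protocol; this is a sketch, not the eventual implementation):

```scala
import scala.collection.JavaConverters._

// Hypothetical: response is the proto.AddArtifactsResponse returned by the
// server. Inspect per-artifact CRC results and fail (or retransmit) the
// artifacts whose checksums did not match on the server side.
val failed = response.getArtifactsList.asScala.filterNot(_.getIsCrcSuccessful)
if (failed.nonEmpty) {
  throw new RuntimeException(
    s"Artifact CRC check failed for: ${failed.map(_.getName).mkString(", ")}")
}
```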

}

trait ClassFinder {
  def findClasses(): Iterator[Artifact]
Contributor: We should document this a bit better. For example, is this method returning all REPL-generated classes, or only the new ones?


}
}

trait ClassFinder {
Contributor: Move it to its own source file?

vicennial (author): Deleting the `ClassFinder`-related code for now; will add it as part of https://issues.apache.org/jira/browse/SPARK-42657 (since we don't use it in this PR).

/**
 * Payload stored on this machine.
 */
sealed trait LocalData extends Storage {
Contributor: I think we can flatten this hierarchy for now. There is no other data than local data.

vicennial (author): Yeah, makes sense. Keeping the name `LocalData` (rather than renaming it to, say, `Data`) to make it explicit that the data needs to be present locally for the transfer to take place (for now).
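After flattening, the hierarchy could look roughly like this (a sketch based on the discussion above; member and class names are illustrative):

```scala
import java.io.{ByteArrayInputStream, InputStream}
import java.nio.file.{Files, Path}

// LocalData is now the root of the (small) hierarchy: a file-backed variant
// and an in-memory variant, both readable as a stream for the transfer.
sealed trait LocalData {
  def stream: InputStream
  def size: Long
}

case class LocalFile(path: Path) extends LocalData {
  override def stream: InputStream = Files.newInputStream(path)
  override def size: Long = Files.size(path)
}

case class InMemory(bytes: Array[Byte]) extends LocalData {
  override def stream: InputStream = new ByteArrayInputStream(bytes)
  override def size: Long = bytes.length
}
```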

 * @param chunk
 * @return
 */
private def checkChunkDataAndCrc(
Contributor: A bit of a high-level point: you are now using the same code to compute the CRC and to verify it. Is it possible to create more separation here? I would consider checking in known CRCs, or creating a file with known CRC segments.

vicennial (author): Added truth/golden files 👍
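A sketch of how such a golden-file check might read (illustrative; the suite's actual file layout and helpers may differ):

```scala
import java.nio.file.{Files, Path}
import java.util.zip.CRC32
import scala.collection.JavaConverters._

// Compute per-chunk CRC32s of a test artifact and compare them against
// pre-computed values checked in next to the artifact (the "golden" file).
def chunkCrcs(artifact: Path, chunkSize: Int): Seq[Long] =
  Files.readAllBytes(artifact).grouped(chunkSize).map { chunk =>
    val crc = new CRC32()
    crc.update(chunk)
    crc.getValue
  }.toSeq

def assertCrcsMatchGolden(artifact: Path, goldenFile: Path, chunkSize: Int): Unit = {
  val expected = Files.readAllLines(goldenFile).asScala.map(_.toLong).toSeq
  assert(chunkCrcs(artifact, chunkSize) == expected)
}
```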

@hvanhovell left a comment: Looks pretty good overall. Left a couple of small comments.

val data = proto.AddArtifactsRequest.ArtifactChunk
  .newBuilder()
  .setData(ByteString.readFrom(in))
  .setCrc(in.getChecksum.getValue)
@amaliujia (Mar 3, 2023): I am not an expert on networking, so just a question for my own education: is gRPC-level byte transmission not 100% reliable, so that we need another CRC to check nothing is corrupted?

Contributor: I am not sure about gRPC's guarantees. However, I have seen network transfers go wrong, and then checksums are your friend.

Contributor: Sounds good. Thanks.
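For context, the per-chunk CRC in the snippet above comes from reading through a checked stream; a minimal sketch of that pattern (the file path is a placeholder):

```scala
import java.io.FileInputStream
import java.util.zip.{CRC32, CheckedInputStream}
import com.google.protobuf.ByteString

// A CheckedInputStream accumulates a CRC32 as bytes flow through it, so the
// checksum of everything read so far is available from getChecksum.
val in = new CheckedInputStream(new FileInputStream("/path/to/artifact.jar"), new CRC32())
val data = ByteString.readFrom(in) // reads the stream, updating the CRC as a side effect
val crc = in.getChecksum.getValue  // CRC32 of exactly the bytes just read
```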

@amaliujia: Overall looks good. Thank you!

@hvanhovell left a comment: LGTM

hvanhovell pushed a commit that referenced this pull request on Mar 3, 2023; its commit message duplicates the PR description above.

Closes #40256 from vicennial/SPARK-42653.

Authored-by: vicennial <venkata.gudesa@databricks.com>
Signed-off-by: Herman van Hovell <herman@databricks.com>
(cherry picked from commit 8a0d626)
Signed-off-by: Herman van Hovell <herman@databricks.com>
hvanhovell closed this in 8a0d626 on Mar 3, 2023.
  .readAllLines(artifactCrcPath.resolve(crcFileName))
  .asScala
  .map(_.toLong)
Member: The Scala 2.13 build is broken by this. I made a quick followup: #40267

vicennial (author): Thank you!
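The usual shape of that fix (a sketch; see #40267 for the actual followup): in Scala 2.13, `Seq` means `immutable.Seq`, so the mutable `Buffer` returned by `asScala` no longer conforms and needs an explicit `.toSeq`.

```scala
import java.nio.file.Files
import scala.collection.JavaConverters._ // deprecated in 2.13, but cross-builds with 2.12

// artifactCrcPath and crcFileName come from the surrounding test code.
val expectedCrcs: Seq[Long] = Files
  .readAllLines(artifactCrcPath.resolve(crcFileName))
  .asScala
  .map(_.toLong)
  .toSeq // materialize the Buffer; required for Scala 2.13's immutable Seq
```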

hvanhovell pushed a commit that referenced this pull request Mar 23, 2023
### What changes were proposed in this pull request?

This PR adds server-side artifact management as a follow up to the client-side artifact transfer introduced in #40256.

Note: The artifacts added on the server are visible to **all users** of the cluster. This is a limitation of the current spark architecture (unisolated classloaders).

Apart from storing generic artifacts, we handle jars and classfiles in specific ways:

- Jars:
  - Jars may be added but not removed or overwritten.
  - Added jars would be visible to **all** users/tasks/queries.
- Classfiles:
  - Classfiles may not be explicitly removed but are allowed to be overwritten.
  - We piggyback on top of the REPL architecture to serve classfiles to the executors
    - If a REPL is initialized, classfiles are stored in the existing `spark.repl.class.outputDir` and served via the URI in `spark.repl.class.uri`.
    - If a REPL is not being used, we use a custom directory (root: `sparkContext.sparkConnectArtifactDirectory`) to store classfiles and point `spark.repl.class.uri` towards it.
  - Class files are visible to **all** users/tasks/queries.

### Why are the changes needed?

#40256 implements the client-side transfer of artifacts to the server but currently, the server does not process these requests.

We need to implement a server-side management mechanism to handle the storage of these artifacts on the driver as well as perform further processing (such as adding jars and moving class files to the right directories).

### Does this PR introduce _any_ user-facing change?

Yes, a new experimental API but no behavioural changes: a new method, `sparkConnectArtifactDirectory`, is accessible through `SparkContext` and returns the directory storing all artifacts from Spark Connect.

### How was this patch tested?

New unit tests.

Closes #40368 from vicennial/SPARK-42748.

Lead-authored-by: vicennial <venkata.gudesa@databricks.com>
Co-authored-by: Venkata Sai Akhil Gudesa <venkata.gudesa@databricks.com>
Signed-off-by: Herman van Hovell <herman@databricks.com>
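A rough sketch of the no-REPL path described above (illustrative wiring only; `classServerUri` is hypothetical, and the config keys are the internal ones named in the commit message):

```scala
// Illustrative only: without a REPL, the server designates its own artifact
// directory as the class output dir and points the class-serving URI at it,
// so executors fetch client-uploaded classfiles the same way they would
// fetch REPL-generated classes.
val artifactDir = sparkContext.sparkConnectArtifactDirectory
conf.set("spark.repl.class.outputDir", artifactDir.toString)
conf.set("spark.repl.class.uri", classServerUri) // hypothetical URI serving artifactDir
```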
val buf = new Array[Byte](CHUNK_SIZE)
var bytesRead = 0
var count = 0
while (count != -1 && bytesRead < CHUNK_SIZE) {
Member: qq: why do we need this while loop? It seems like

    count = in.read(buf, 0, CHUNK_SIZE)
    if (count == 0) ByteString.empty()
    else ByteString.copyFrom(buf, 0, count)

would be good enough, because read blocks until it meets EOF, IIRC.
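For context, `InputStream.read` may return fewer bytes than requested even before EOF, which is presumably why the loop keeps reading until the buffer is full; a reconstruction of its likely shape (not the exact PR code):

```scala
import java.io.InputStream

// Loop until the buffer is full or the stream ends: a single read(...) call
// is allowed to return a short count, so one call may not fill buf.
def readChunk(in: InputStream, chunkSize: Int): (Array[Byte], Int) = {
  val buf = new Array[Byte](chunkSize)
  var bytesRead = 0
  var count = 0
  while (count != -1 && bytesRead < chunkSize) {
    count = in.read(buf, bytesRead, chunkSize - bytesRead)
    if (count != -1) bytesRead += count
  }
  (buf, bytesRead)
}
```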

HyukjinKwon added a commit that referenced this pull request May 23, 2023
…in Python client

### What changes were proposed in this pull request?

This PR implements `SparkSession.addArtifact(s)`. The logic is basically translated from Scala (#40256) to Python here.

One difference is that it does not support `class` files or `cache` (#40827), because it is not realistic for the Python client to add `class` files. For `cache`, this implementation will serve as base work.

This PR is also base work for implementing the sending of py-files and archive files.

### Why are the changes needed?

For feature parity with the Scala client. In addition, this is also base work for the `cache` implementation and for Python dependency management (https://www.databricks.com/blog/2020/12/22/how-to-manage-python-dependencies-in-pyspark.html).

### Does this PR introduce _any_ user-facing change?

Yes, this exposes an API `SparkSession.addArtifact(s)`.

### How was this patch tested?

Unit tests were added; also manually tested.

Closes #41250 from HyukjinKwon/python-addArtifacts.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
snmvaughan pushed a commit to snmvaughan/spark that referenced this pull request on Jun 20, 2023; its commit message duplicates the PR description above.

Closes apache#40256 from vicennial/SPARK-42653.

Authored-by: vicennial <venkata.gudesa@databricks.com>
Signed-off-by: Herman van Hovell <herman@databricks.com>
(cherry picked from commit 8a0d626)
Signed-off-by: Herman van Hovell <herman@databricks.com>