[SPARK-42748][CONNECT] Server-side Artifact Management
### What changes were proposed in this pull request?

This PR adds server-side artifact management as a follow-up to the client-side artifact transfer introduced in #40256.

Note: The artifacts added on the server are visible to **all users** of the cluster. This is a limitation of the current Spark architecture (unisolated classloaders).

Apart from storing generic artifacts, we handle jars and classfiles in specific ways:
- Jars:
  - Jars may be added but not removed or overwritten.
  - Added jars are visible to **all** users/tasks/queries.
- Classfiles:
  - Classfiles may not be explicitly removed but are allowed to be overwritten.
  - We piggyback on top of the REPL architecture to serve classfiles to the executors:
    - If a REPL is initialized, classfiles are stored in the existing `spark.repl.class.outputDir` and share the URI with `spark.repl.class.uri`.
    - If a REPL is not being used, we use a custom directory (root: `sparkContext.sparkConnectArtifactDirectory`) to store classfiles and point `spark.repl.class.uri` towards it.
  - Class files are visible to **all** users/tasks/queries.

### Why are the changes needed?

#40256 implements the client-side transfer of artifacts to the server, but currently the server does not process these requests. We need to implement a server-side management mechanism to handle the storage of these artifacts on the driver, as well as perform further processing (such as adding jars and moving class files to the right directories).

### Does this PR introduce _any_ user-facing change?

Yes, a new experimental API but no behavioural changes. A new method called `sparkConnectArtifactDirectory` is accessible through `SparkContext` (the directory storing all artifacts from Spark Connect).

### How was this patch tested?

New unit tests.

Closes #40368 from vicennial/SPARK-42748.
Lead-authored-by: vicennial <venkata.gudesa@databricks.com> Co-authored-by: Venkata Sai Akhil Gudesa <venkata.gudesa@databricks.com> Signed-off-by: Herman van Hovell <herman@databricks.com>
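The jar/classfile handling described above boils down to path-based routing: artifacts under `classes/` go to a shared class directory and may be overwritten, while everything else (including jars) is written exactly once. A simplified, hypothetical sketch of that routing using plain `java.nio.file` (not the actual server code, which also registers jars with the `SparkContext`):

```scala
import java.nio.file.{Files, Path, StandardCopyOption}

// Route a staged artifact to its final location.
// Class files may be overwritten; other artifacts (e.g. jars) may not.
def stageArtifact(root: Path, classDir: Path, relativePath: String, staged: Path): Unit = {
  if (relativePath.startsWith("classes/")) {
    // Class files: move into the shared class directory, overwriting old versions.
    val target = classDir.resolve(relativePath.stripPrefix("classes/"))
    Files.createDirectories(target.getParent)
    Files.move(staged, target, StandardCopyOption.REPLACE_EXISTING)
  } else {
    // Other artifacts: refuse duplicates, since Spark cannot remove an added jar.
    val target = root.resolve(relativePath)
    Files.createDirectories(target.getParent)
    if (Files.exists(target)) {
      throw new IllegalStateException(s"Duplicate artifact: $relativePath")
    }
    Files.move(staged, target)
  }
}
```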
commit ec02224 (1 parent: 631e8eb)

Showing 26 changed files with 930 additions and 46 deletions.
Binary file added (+5.54 KB, not shown): `connector/connect/common/src/test/resources/artifact-tests/Hello.class`
1 change: 1 addition & 0 deletions — `connector/connect/common/src/test/resources/artifact-tests/crc/Hello.txt`

```diff
@@ -0,0 +1 @@
+553633018
```
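The `crc/Hello.txt` resource above holds a single checksum value for the `Hello.class` test artifact. Assuming the tests use a plain CRC32 over the file bytes (an assumption; the exact scheme is not shown in this excerpt), such a value can be produced with the JDK's `java.util.zip.CRC32`:

```scala
import java.util.zip.CRC32

// Compute the CRC32 checksum of a byte array, as java.util.zip defines it.
def crc32Of(bytes: Array[Byte]): Long = {
  val crc = new CRC32()
  crc.update(bytes)
  crc.getValue
}
```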
9 files renamed without changes.
162 changes: 162 additions & 0 deletions — `...er/src/main/scala/org/apache/spark/sql/connect/artifact/SparkConnectArtifactManager.scala`

```scala
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.spark.sql.connect.artifact

import java.net.{URL, URLClassLoader}
import java.nio.file.{Files, Path, Paths, StandardCopyOption}
import java.util.concurrent.CopyOnWriteArrayList

import scala.collection.JavaConverters._

import org.apache.spark.{SparkContext, SparkEnv}
import org.apache.spark.sql.SparkSession
import org.apache.spark.util.Utils

/**
 * The Artifact Manager for the [[SparkConnectService]].
 *
 * This class handles the storage of artifacts as well as preparing the artifacts for use.
 * Currently, jars and classfile artifacts undergo additional processing:
 *   - Jars are automatically added to the underlying [[SparkContext]] and are accessible by all
 *     users of the cluster.
 *   - Class files are moved into a common directory that is shared among all users of the
 *     cluster. Note: Under a multi-user setup, class file conflicts may occur between user
 *     classes as the class file directory is shared.
 */
class SparkConnectArtifactManager private[connect] {

  // The base directory where all artifacts are stored.
  // Note: If a REPL is attached to the cluster, class file artifacts are stored in the
  // REPL's output directory.
  private[connect] lazy val artifactRootPath = SparkContext.getActive match {
    case Some(sc) =>
      sc.sparkConnectArtifactDirectory.toPath
    case None =>
      throw new RuntimeException("SparkContext is uninitialized!")
  }
  private[connect] lazy val artifactRootURI = {
    val fileServer = SparkEnv.get.rpcEnv.fileServer
    fileServer.addDirectory("artifacts", artifactRootPath.toFile)
  }

  // The base directory where all class files are stored.
  // Note: If a REPL is attached to the cluster, we piggyback on the existing REPL output
  // directory to store class file artifacts.
  private[connect] lazy val classArtifactDir = SparkEnv.get.conf
    .getOption("spark.repl.class.outputDir")
    .map(p => Paths.get(p))
    .getOrElse(artifactRootPath.resolve("classes"))

  private[connect] lazy val classArtifactUri: String =
    SparkEnv.get.conf.getOption("spark.repl.class.uri") match {
      case Some(uri) => uri
      case None =>
        throw new RuntimeException("Class artifact URI had not been initialised in SparkContext!")
    }

  private val jarsList = new CopyOnWriteArrayList[Path]

  /**
   * Get the URLs of all jar artifacts added through the [[SparkConnectService]].
   *
   * @return
   */
  def getSparkConnectAddedJars: Seq[URL] = jarsList.asScala.map(_.toUri.toURL).toSeq

  /**
   * Add and prepare a staged artifact (i.e. an artifact that has been rebuilt locally from bytes
   * over the wire) for use.
   *
   * @param session
   * @param remoteRelativePath
   * @param serverLocalStagingPath
   */
  private[connect] def addArtifact(
      session: SparkSession,
      remoteRelativePath: Path,
      serverLocalStagingPath: Path): Unit = {
    require(!remoteRelativePath.isAbsolute)
    if (remoteRelativePath.startsWith("classes/")) {
      // Move class files to the common location (shared among all users).
      val target = classArtifactDir.resolve(remoteRelativePath.toString.stripPrefix("classes/"))
      Files.createDirectories(target.getParent)
      // Allow overwriting class files to capture updates to classes.
      Files.move(serverLocalStagingPath, target, StandardCopyOption.REPLACE_EXISTING)
    } else {
      val target = artifactRootPath.resolve(remoteRelativePath)
      Files.createDirectories(target.getParent)
      // Disallow overwriting jars because Spark doesn't support removing jars that were
      // previously added.
      if (Files.exists(target)) {
        throw new RuntimeException(
          s"Duplicate Jar: $remoteRelativePath. " +
            s"Jars cannot be overwritten.")
      }
      Files.move(serverLocalStagingPath, target)
      if (remoteRelativePath.startsWith("jars")) {
        // Add jars to the underlying Spark context (visible to all users).
        session.sessionState.resourceLoader.addJar(target.toString)
        jarsList.add(target)
      }
    }
  }
}

object SparkConnectArtifactManager {

  private var _activeArtifactManager: SparkConnectArtifactManager = _

  /**
   * Obtain the active artifact manager or create a new artifact manager.
   *
   * @return
   */
  def getOrCreateArtifactManager: SparkConnectArtifactManager = {
    if (_activeArtifactManager == null) {
      _activeArtifactManager = new SparkConnectArtifactManager
    }
    _activeArtifactManager
  }

  private lazy val artifactManager = getOrCreateArtifactManager

  /**
   * Obtain a classloader that contains jar and classfile artifacts on the classpath.
   *
   * @return
   */
  def classLoaderWithArtifacts: ClassLoader = {
    val urls = artifactManager.getSparkConnectAddedJars :+
      artifactManager.classArtifactDir.toUri.toURL
    new URLClassLoader(urls.toArray, Utils.getContextOrSparkClassLoader)
  }

  /**
   * Run a segment of code utilising a classloader that contains jar and classfile artifacts on
   * the classpath.
   *
   * @param thunk
   * @tparam T
   * @return
   */
  def withArtifactClassLoader[T](thunk: => T): T = {
    Utils.withContextClassLoader(classLoaderWithArtifacts) {
      thunk
    }
  }
}
```
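`withArtifactClassLoader` delegates to `Utils.withContextClassLoader`, which swaps the thread's context classloader for the duration of the thunk so that artifact-provided classes resolve during execution. A standalone sketch of that swap-and-restore pattern (my own minimal version, not Spark's `Utils` implementation):

```scala
import java.net.{URL, URLClassLoader}

// Run `thunk` with `loader` installed as the thread's context classloader,
// restoring the previous loader afterwards, even if `thunk` throws.
def withContextClassLoader[T](loader: ClassLoader)(thunk: => T): T = {
  val thread = Thread.currentThread()
  val previous = thread.getContextClassLoader
  thread.setContextClassLoader(loader)
  try thunk
  finally thread.setContextClassLoader(previous)
}
```

The `try`/`finally` guarantees the original loader is restored on both normal and exceptional exit, which matters on shared threads (e.g. a gRPC worker pool) where a leaked classloader would bleed into unrelated requests.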