[SPARK-43906][PYTHON][CONNECT] Implement the file support in SparkSession.addArtifacts #41415
Conversation
cc @hvanhovell @vicennial, mind taking a look please?
```diff
@@ -154,6 +154,8 @@ class SparkConnectArtifactManager private[connect] {
       val canonicalUri =
         fragment.map(UriBuilder.fromUri(target.toUri).fragment).getOrElse(target.toUri)
       sessionHolder.session.sparkContext.addArchive(canonicalUri.toString)
+    } else if (remoteRelativePath.startsWith(s"files${File.separator}")) {
+      sessionHolder.session.sparkContext.addFile(target.toString)
```
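For context, the server-side artifact manager dispatches on the prefix of the artifact's relative path. A minimal Python sketch of that dispatch logic (illustrative only — the function name is hypothetical and this is not Spark's actual API; the `archives`/`files` prefixes follow the diff above, and `jars` is assumed from the surrounding discussion):

```python
import os

def classify_artifact(remote_relative_path):
    # Illustrative sketch: mirrors the prefix checks in
    # SparkConnectArtifactManager. "archives/" entries go to
    # SparkContext.addArchive, "files/" entries (added by this PR) to
    # SparkContext.addFile, and "jars/" entries to SparkContext.addJar.
    sep = os.sep
    if remote_relative_path.startswith(f"archives{sep}"):
        return "addArchive"
    if remote_relative_path.startswith(f"files{sep}"):
        return "addFile"
    if remote_relative_path.startswith(f"jars{sep}"):
        return "addJar"
    return "unknown"

print(classify_artifact(os.path.join("files", "my_file.txt")))  # addFile
```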
We are going to add session isolation for Scala UDFs soon. How do you think we should implement file support when multiple users upload files with the same name?
I believe jars have the same problem, and we could share the same fix.
My only concern here is that we need to design this for the Python side. In practice, artifacts for a session are exposed in a session-specific location. How would a Python user interact with these files? Through `org.apache.spark.SparkFiles`?
For regular files and archives, I don't intend to expose `org.apache.spark.SparkFiles` for now.
Since the files and archives are always stored in the current working directory of executors in production, I was simply thinking of creating a session-dedicated directory and changing the current working directory to it (during Python UDF execution). Meaning that end users would continue accessing their files with `./myfile.txt` or `./myarchive`.
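A minimal sketch of that idea — a per-session directory plus a working-directory switch so `./myfile.txt` resolves per session rather than colliding across users. The directory layout and the `session_working_dir` helper are assumptions for illustration, not Spark's actual implementation:

```python
import os
import tempfile
from contextlib import contextmanager

@contextmanager
def session_working_dir(root, session_id):
    # Hypothetical: give each Spark Connect session its own directory
    # under the executor's artifact root, and chdir into it while a
    # Python UDF for that session runs, so relative paths like
    # "./my_file.txt" are isolated per session.
    session_dir = os.path.join(root, session_id)
    os.makedirs(session_dir, exist_ok=True)
    previous = os.getcwd()
    os.chdir(session_dir)
    try:
        yield session_dir
    finally:
        os.chdir(previous)

# Two sessions can each have their own "my_file.txt" without clashing.
root = tempfile.mkdtemp()
for session_id, content in [("session-a", "alpha"), ("session-b", "beta")]:
    with session_working_dir(root, session_id):
        with open("my_file.txt", "w") as f:
            f.write(content)

with session_working_dir(root, "session-a"):
    with open("./my_file.txt") as f:
        print(f.read())  # alpha
```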
(`SparkFiles` is being used in the test case here, but that's a sort of hack to make sure of cleaning up, etc.)
Just to be extra clear, each Spark Connect session will have a dedicated directory on worker nodes.
I will make a PR right away after this.
Discussed offline with Herman. Will merge this first, and make a follow-up PR to address it.
Merged to master.
…sion.addArtifacts

### What changes were proposed in this pull request?

This PR proposes to add support for regular files in `SparkSession.addArtifacts`.

### Why are the changes needed?

So users can add regular files to the worker nodes.

### Does this PR introduce _any_ user-facing change?

Yes, it adds support for arbitrary regular files in `SparkSession.addArtifacts`.

### How was this patch tested?

Added a couple of unit tests. Also manually tested in `local-cluster`:

```bash
./sbin/start-connect-server.sh --jars `ls connector/connect/server/target/**/spark-connect*SNAPSHOT.jar` --master "local-cluster[2,2,1024]"
./bin/pyspark --remote "sc://localhost:15002"
```

```python
import os
import tempfile
from pyspark.sql.functions import udf
from pyspark import SparkFiles

with tempfile.TemporaryDirectory() as d:
    file_path = os.path.join(d, "my_file.txt")
    with open(file_path, "w") as f:
        f.write("Hello world!!")

    @udf("string")
    def func(x):
        with open(
            os.path.join(SparkFiles.getRootDirectory(), "my_file.txt"), "r"
        ) as my_file:
            return my_file.read().strip()

    spark.addArtifacts(file_path, file=True)
    spark.range(1).select(func("id")).show()
```

Closes apache#41415 from HyukjinKwon/addFile.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>