
[SPARK-43474][SS][CONNECT] Add Spark Connect access to runtime DataFrames by ID. #41580

Closed
wants to merge 17 commits

Conversation

rangadi
Contributor

@rangadi rangadi commented Jun 13, 2023

[This is a continuation of #41146, to change the author of the PR. Retains the description.]

What changes were proposed in this pull request?

This change adds a new Spark Connect relation type, CachedRemoteRelation, which can represent a DataFrame that has been cached on the server side.

On the server side, each SessionHolder has a cache that maintains the mapping from DataFrame ID to the actual DataFrame.

On the client side, a new relation type and function are added. The new function can create a DataFrame reference given a key. The key is the ID of a cached DataFrame, which is usually passed from the server to the client. When transforming the DataFrame reference, the server finds the actual DataFrame in the cache and replaces the reference with it.

One use case for this function is streaming foreachBatch(): the server needs to call the user function for every batch, and that function takes a DataFrame as an argument. With the new function, we can cache the DataFrame on the server and pass its ID back to the client, which can then create the DataFrame reference.
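The cache-then-resolve mechanics described above can be sketched as follows. This is a standalone illustration, not the actual Spark implementation: the method names mirror `cacheDataFrameById` and `getDataFrameOrThrow` from the PR, but the `DataFrame` stand-in (a plain object), the cache, and `InvalidPlanInput` here are simplified assumptions.

```python
import threading
import uuid


class InvalidPlanInput(Exception):
    """Stand-in for the server-side InvalidPlanInput error."""


class CachedDataFrameManager:
    """Toy per-session cache mapping DataFrame reference ID -> cached DataFrame."""

    def __init__(self):
        self._lock = threading.Lock()
        self._cache = {}

    def cache_dataframe(self, df):
        """Cache a DataFrame (any object here) and return the ID sent to the client."""
        df_id = str(uuid.uuid4())
        with self._lock:
            self._cache[df_id] = df
        return df_id

    def get_dataframe_or_throw(self, df_id):
        """Resolve a client-side DataFrame reference back to the cached DataFrame."""
        with self._lock:
            if df_id not in self._cache:
                raise InvalidPlanInput(f"No DataFrame with id {df_id} found")
            return self._cache[df_id]


# Server caches the per-batch DataFrame and hands the ID to the client;
# the client's DataFrame reference later resolves through the same ID.
manager = CachedDataFrameManager()
ref_id = manager.cache_dataframe({"plan": "streaming batch 0"})
assert manager.get_dataframe_or_throw(ref_id) == {"plan": "streaming batch 0"}
```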

Why are the changes needed?

This change is needed to support streaming foreachBatch() in Spark Connect.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Scala unit test.
Manual test.
(More end-to-end tests will be added when foreachBatch() is supported. Currently there is no way to add a DataFrame to the server cache using Python.)

// Represents a remote relation that has been cached on server.
message CachedRemoteRelation {
// (Required) ID of the remote relation (assigned by the service).
string relation_id = 3;
Contributor Author

[continuation of the comment here]
@grundprinzip removed user_id & 'session_id'.

@@ -788,6 +790,14 @@ class SparkConnectPlanner(val session: SparkSession) extends Logging {
.logicalPlan
}

private def transformCachedRemoteRelation(
session: SparkSession,
Contributor

btw the session is already a class member of SparkConnectPlanner.


// Session.sessionUUID -> Map[DF Reference ID -> DF]
@GuardedBy("this")
private val dataFrameCache = mutable.Map[String, mutable.Map[String, DataFrame]]()
Contributor
Once you move this into the session holder, please use a ConcurrentHashMap instead, because you then only need one map.
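The suggestion above (one flat map keyed by DataFrame ID once the cache lives inside the session holder) hinges on `putIfAbsent` semantics for duplicate detection. A minimal Python analog, with hypothetical names and a lock standing in for Java's `ConcurrentHashMap`:

```python
import threading


class DataFrameCache:
    """One instance per session holder, so keys are just DataFrame IDs."""

    def __init__(self):
        self._lock = threading.Lock()
        self._map = {}

    def put_if_absent(self, df_id, df):
        """Insert df under df_id unless the key is already present.

        Returns the previous value, or None if the insert happened,
        mirroring java.util.concurrent.ConcurrentHashMap.putIfAbsent.
        """
        with self._lock:
            previous = self._map.get(df_id)
            if previous is None:
                self._map[df_id] = df
            return previous


cache = DataFrameCache()
assert cache.put_if_absent("df-1", "first") is None      # inserted
assert cache.put_if_absent("df-1", "second") == "first"  # duplicate detected
```

A non-None return value is exactly the condition the server uses to raise "a dataframe is already associated with id ...".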

import org.apache.spark.sql.connect.common.InvalidPlanInput
import org.apache.spark.sql.test.SharedSparkSession

class SparkConnectCachedDataFrameManagerSuite extends SharedSparkSession {
Contributor
I think it makes sense to add tests that, when you have different sessions with different users, you don't get access to a "guessed" cached plan. This is the key part for security.

hvanhovell pushed a commit that referenced this pull request Jun 16, 2023
### What changes were proposed in this pull request?

This passes SessionHolder, rather than just SparkSession, to `SparkConnectPlanner`, to allow access to session-specific state at the Connect server level. Note that this is Spark-Connect-specific session state and is not stored with SparkSession.

E.g.
  * Mapping from _dataframe reference id_ to actual dataframe in  #41580
  * File and archives stored with session in #41495

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
 - Existing unit tests.

Closes #41618 from rangadi/session-holder.

Authored-by: Raghu Angadi <raghu.angadi@databricks.com>
Signed-off-by: Herman van Hovell <herman@databricks.com>
Contributor

@bogao007 bogao007 left a comment

LGTM

*/
private[connect] def cacheDataFrameById(dfId: String, df: DataFrame): Unit = {
if (dataFrameCache.putIfAbsent(dfId, df) != null) {
throw new IllegalArgumentException(s"A dataframe is already associated with id $dfId")
Member

Can't you use the error framework for this?

Contributor Author

Any existing ones I can use? This should be rare since it would only be caused by our bug. This is not very user visible.
I didn't see much of framework errors used in connect server yet.

Member

Any existing ones I can use?

Add a new error class to error-classes.json, and raise a SparkException.

it would only be caused by our bug

Then we should consider SparkException.internalError

I didn't see much of framework errors used in connect server yet.

We should start doing that if we are going to transfer errors/exceptions from server to client in a consistent way.

Contributor Author

@rangadi rangadi Jun 28, 2023

I can use SparkException.internalError. Seems like it is not used anywhere in connect yet.

Member

So far, we are migrating to error classes, and have converted 72 cases already:

$ find . -name "*.scala" -print0|xargs -0 grep 'SparkException.internalError'|wc -l
      72

Contributor Author

Nice. Updated it to use SparkException.internalError.

private[connect] def getDataFrameOrThrow(dfId: String): DataFrame = {
Option(dataFrameCache.get(dfId))
.getOrElse {
throw InvalidPlanInput(s"No DataFrame with id $dfId is found in the session $sessionId")
Member

Same here: how about introducing an error class?

Contributor Author

Maybe not needed, since InvalidPlanInput is widely used for this exact purpose and this is less user-visible.

@HyukjinKwon
Member

Merged to master.

dongjoon-hyun added a commit that referenced this pull request Nov 4, 2023
…scala` to `SparkConnectSessionHolderSuite.scala`

### What changes were proposed in this pull request?

This PR aims to fix a typo `Hodler` in file name.
- `SparkConnectSessionHodlerSuite.scala` (from)
- `SparkConnectSessionHolderSuite.scala` (to)

The file name also doesn't match the class name inside the file, since the class name itself is correct.

https://github.com/apache/spark/blob/3363c2af3f6a59363135451d251f25e328a4fddf/connector/connect/server/src/test/scala/org/apache/spark/sql/connect/service/SparkConnectSessionHodlerSuite.scala#L37

### Why are the changes needed?

This is a typo from the original PR.
- #41580

Since the original PR shipped in Apache Spark 3.5.0, I created a JIRA instead of a follow-up. We need to backport this patch to `branch-3.5`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #43657 from dongjoon-hyun/SPARK-45791.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(cherry picked from commit 6d669fa)
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
*/
private[connect] def cacheDataFrameById(dfId: String, df: DataFrame): Unit = {
if (dataFrameCache.putIfAbsent(dfId, df) != null) {
SparkException.internalError(s"A dataframe is already associated with id $dfId")
Member

The internalError call just creates a SparkException, so apparently it needs to be thrown. Here is PR #44400 with a minor fix of this mistake and another one.

HyukjinKwon added a commit that referenced this pull request May 21, 2024
…k Connect

### What changes were proposed in this pull request?

This PR proposes to add the `DataFrame.checkpoint` and `DataFrame.localCheckpoint` APIs in Spark Connect.

#### Overview

![Screenshot 2024-05-16 at 10 39 25 AM](https://github.com/apache/spark/assets/6477701/c5c4754f-3d5e-4f4a-8f9d-a7218ce49320)

1. Spark Connect Client invokes [local]checkpoint
    - Connects to the server, stores (Session ID, UUID) <> Checkpointed DataFrame
2. Execute [local]checkpoint
3. Returns the UUID for the checkpointed DataFrame.
   - The client side holds the UUID, with the protobuf message truncated (replaced)
4. When the DataFrame on the client side is garbage-collected, a request is made to clear the state within the Spark Connect server.
5. If the checkpointed RDD is not referred to anymore (e.g., not even by a temp view), it is cleaned by ContextCleaner (which runs separately, and periodically)
6. *When the session is closed, it attempts to clear all mapped state in the Spark Connect server (because `DataFrame.__del__` in Python is not guaranteed to be called upon garbage-collection)
7. *If the checkpointed RDD is not referred to anymore (e.g., not even by a temp view), it is cleaned by ContextCleaner (which runs separately, and periodically)

*In 99.999% of cases, the state (map<(session_id, uuid), c'p'dataframe>) will be cleared when the DataFrame is garbage-collected, e.g., unless there are crashes. Practically, Py4J also leverages this approach to clean up its Java objects. For the remaining 0.001% of cases, steps 6 and 7 address them. Both steps happen when the session is closed and the session holder is released; see also [#41580](#41580).
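Step 4 above (clearing server-side state when the client handle is garbage-collected) can be sketched with Python's `weakref.finalize`, which is the standard way to attach cleanup to an object's collection. Everything here is illustrative: the `server_cache` dict stands in for the real server-side map, and `RemoteDataFrameRef` is a hypothetical stand-in for the client-side plan holding the server-assigned UUID.

```python
import gc
import weakref

# Stand-in for the server-side map<(session_id, uuid), checkpointed dataframe>.
server_cache = {}


class RemoteDataFrameRef:
    """Client-side handle holding only the server-assigned UUID."""

    def __init__(self, df_id):
        self.df_id = df_id
        # When this handle is garbage-collected, ask the "server" to drop
        # its entry. The callback must not reference self, or it would keep
        # the handle alive forever.
        weakref.finalize(self, server_cache.pop, df_id, None)


server_cache["uuid-1"] = "checkpointed dataframe"
ref = RemoteDataFrameRef("uuid-1")

del ref        # handle goes away on the client...
gc.collect()   # ...and the finalizer clears the server-side entry
assert "uuid-1" not in server_cache
```

This also shows why step 6 is still needed: a finalizer only runs if the interpreter gets the chance, so closing the session must clear any entries that were never finalized.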

#### Command/RPCs

Reuse `CachedRemoteRelation` (from [#41580](#41580))

```proto
message Command {
  oneof command_type {
    ...
    CheckpointCommand checkpoint_command = 14;
    RemoveCachedRemoteRelationCommand remove_cached_remote_relation_command = 15;
    ...
  }
}

// Command to remove `CachedRemoteRelation`
message RemoveCachedRemoteRelationCommand {
  // (Required) The cached remote relation to be removed
  CachedRemoteRelation relation = 1;
}

message CheckpointCommand {
  // (Required) The logical plan to checkpoint.
  Relation relation = 1;

  // (Optional) Locally checkpoint using a local temporary
  // directory in Spark Connect server (Spark Driver)
  optional bool local = 2;

  // (Optional) Whether to checkpoint this dataframe immediately.
  optional bool eager = 3;
}

message CheckpointCommandResult {
  // (Required) The logical plan checkpointed.
  CachedRemoteRelation relation = 1;
}
```

```proto
message ExecutePlanResponse {

  ...

  oneof response_type {

    ...

    CheckpointCommandResult checkpoint_command_result = 19;
  }

  ...

  message Checkpoint {
    // (Required) The logical plan checkpointed.
    CachedRemoteRelation relation = ...;
  }
}
```

#### Usage

```bash
./sbin/start-connect-server.sh --conf spark.checkpoint.dir=/path/to/checkpoint
```

```python
spark.range(1).localCheckpoint()
spark.range(1).checkpoint()
```

### Why are the changes needed?

For feature parity with Spark without Spark Connect.

### Does this PR introduce _any_ user-facing change?

Yes, it adds both `DataFrame.checkpoint` and `DataFrame.localCheckpoint` API in Spark Connect.

### How was this patch tested?

Unittests, and manually tested as below:

**Code**

```bash
./bin/pyspark --remote "local[*]"
```

```python
>>> df = spark.range(1).localCheckpoint()
>>> df.explain(True)
== Parsed Logical Plan ==
LogicalRDD [id#1L], false

== Analyzed Logical Plan ==
id: bigint
LogicalRDD [id#1L], false

== Optimized Logical Plan ==
LogicalRDD [id#1L], false

== Physical Plan ==
*(1) Scan ExistingRDD[id#1L]

>>> df._plan
<pyspark.sql.connect.plan.CachedRemoteRelation object at 0x147734a50>
>>> del df
```

**Logs**

```
...
{"ts":"2024-05-14T06:18:01.711Z","level":"INFO","msg":"Caching DataFrame with id 7316f315-d20d-446d-b5e7-ac848870e280","context":{"dataframe_id":"7316f315-d20d-446d-b5e7-ac848870e280"},"logger":"SparkConnectAnalyzeHandler"}
...
{"ts":"2024-05-14T06:18:11.718Z","level":"INFO","msg":"Removing DataFrame with id 7316f315-d20d-446d-b5e7-ac848870e280 from the cache","context":{"dataframe_id":"7316f315-d20d-446d-b5e7-ac848870e280"},"logger":"SparkConnectPlanner"}
...
```

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46570 from HyukjinKwon/SPARK-48258.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>