[SPARK-42585][CONNECT] Streaming of local relations #40827

Closed
wants to merge 61 commits

Conversation


@MaxGekk MaxGekk commented Apr 17, 2023

What changes were proposed in this pull request?

In this PR, I propose to transfer a local relation to the server in a streaming way when it exceeds the size defined by the SQL config spark.sql.session.localRelationCacheThreshold (64MB by default). In particular (a minimal sketch of the client-side flow follows this list):

  1. The client applies the sha256 function to the Arrow form of the local relation;
  2. It checks the presence of the relation on the server side by sending the relation's hash to the server;
  3. If the server does not have the local relation, the client transfers it as an artifact named cache/<sha256>;
  4. Once the relation is already present at the server, or has just been transferred, the client transforms the logical plan by replacing the LocalRelation node with a CachedLocalRelation that carries the hash;
  5. The server, in turn, converts CachedLocalRelation back to LocalRelation by retrieving the relation body from its local cache.
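Below is a minimal sketch of this client-side flow, written against assumed names: LocalRelation/CachedLocalRelation here are plain placeholders rather than the real protobuf messages, and isCached/uploadArtifact stand in for the ArtifactStatus and AddArtifacts RPCs described in the next section.

```scala
import java.security.MessageDigest

object LocalRelationStreamingSketch {
  // Default of spark.sql.session.localRelationCacheThreshold (64MB).
  val cacheThresholdBytes: Long = 64L * 1024 * 1024

  // Placeholder plan nodes; the real client builds protobuf relations.
  sealed trait RelationNode
  final case class LocalRelation(arrowData: Array[Byte]) extends RelationNode
  final case class CachedLocalRelation(hash: String) extends RelationNode

  // Step 1: sha256 over the Arrow form of the relation.
  def sha256Hex(bytes: Array[Byte]): String =
    MessageDigest.getInstance("SHA-256").digest(bytes).map("%02x".format(_)).mkString

  def planLocalRelation(
      arrowData: Array[Byte],
      isCached: String => Boolean,                   // step 2: ask the server for the hash
      uploadArtifact: (String, Array[Byte]) => Unit  // step 3: upload as cache/<sha256>
  ): RelationNode = {
    if (arrowData.length <= cacheThresholdBytes) {
      // Small relations are still sent inline inside the plan.
      LocalRelation(arrowData)
    } else {
      val hash = sha256Hex(arrowData)
      if (!isCached(hash)) uploadArtifact(s"cache/$hash", arrowData)
      // Step 4: reference the cached relation by hash instead of inlining the data.
      CachedLocalRelation(hash)
    }
  }
}
```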

Details of the implementation

The client sends a new command, ArtifactStatusesRequest, to check whether the local relation is cached at the server. The command goes through a new RPC endpoint, ArtifactStatus, and the server answers with a new message, ArtifactStatusesResponse; see base.proto.
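The following is a hedged sketch of that exchange; the case classes mirror the message names above but are not the generated proto classes, and the response shape (artifact name mapped to an "exists" flag) is an assumption made for illustration.

```scala
object ArtifactStatusSketch {
  final case class ArtifactStatusesRequest(names: Seq[String])
  final case class ArtifactStatusesResponse(statuses: Map[String, Boolean])

  // Ask the ArtifactStatus endpoint whether the artifact cache/<sha256> already exists.
  def isRelationCached(
      hash: String,
      sendRequest: ArtifactStatusesRequest => ArtifactStatusesResponse): Boolean = {
    val name = s"cache/$hash"
    sendRequest(ArtifactStatusesRequest(Seq(name))).statuses.getOrElse(name, false)
  }
}
```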

The client transfers the serialized (Arrow) body of the local relation and its schema via the RPC endpoint AddArtifacts. The server, in turn, stores the received artifact in the block manager under an id, CacheId, which consists of three parts (sketched below):

  • userId - the identifier of the user who created the local relation,
  • sessionId - the identifier of the session that the relation belongs to,
  • hash - a SHA-256 hash of the relation body.

See SparkConnectArtifactManager.addArtifact().
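For illustration, the three-part id might be shaped like the case class below; this is only a sketch of its structure, not the actual BlockId subclass used on the server.

```scala
// Sketch of the three-part cache id (field types are assumptions).
final case class CacheId(userId: String, sessionId: String, hash: String) {
  // Name under which the client uploads the relation body via AddArtifacts.
  def artifactName: String = s"cache/$hash"
}
```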

The current query is blocked until the local relation has been cached on the server side.

When the server receives the query, it retrieves userId, sessionId, and hash from CachedLocalRelation, and gets the local relation data from the block manager. See SparkConnectPlanner.transformCachedLocalRelation().

The occupied blocks in the block manager are removed when a user session is invalidated in userSessionMapping. See SparkConnectService.RemoveSessionListener and BlockManager.removeCache().
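A minimal server-side sketch of these two paragraphs, under assumed names: an in-memory map stands in for the block manager, resolveCachedRelation plays the role of SparkConnectPlanner.transformCachedLocalRelation(), and onSessionRemoved mirrors what the remove-session listener does when a user session is invalidated.

```scala
import scala.collection.concurrent.TrieMap

object RelationCacheServerSketch {
  final case class CacheId(userId: String, sessionId: String, hash: String)

  // Stand-in for the block manager's storage of cached relation bodies.
  private val blocks = TrieMap.empty[CacheId, Array[Byte]]

  // AddArtifacts path: store the relation body under its CacheId.
  def store(id: CacheId, arrowData: Array[Byte]): Unit = {
    blocks.put(id, arrowData)
  }

  // Resolve a CachedLocalRelation(userId, sessionId, hash) back to the relation body.
  def resolveCachedRelation(id: CacheId): Array[Byte] =
    blocks.getOrElse(id, throw new NoSuchElementException(s"No cached relation for $id"))

  // Free every block owned by the (userId, sessionId) pair when the session goes away.
  def onSessionRemoved(userId: String, sessionId: String): Unit =
    blocks.keys
      .filter(id => id.userId == userId && id.sessionId == sessionId)
      .foreach(id => blocks.remove(id))
}
```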

Why are the changes needed?

To allow creating a DataFrame from a large local collection. Without the changes, spark.createDataFrame(...) fails with the following error:

23/04/21 20:32:20 WARN NettyServerStream: Exception processing message
org.sparkproject.connect.grpc.StatusRuntimeException: RESOURCE_EXHAUSTED: gRPC message exceeds maximum size 134217728: 268435456
	at org.sparkproject.connect.grpc.Status.asRuntimeException(Status.java:526)

Does this PR introduce any user-facing change?

No. The changes extend the existing proto API.

How was this patch tested?

By running the new tests:

$ build/sbt "test:testOnly *.ArtifactManagerSuite"
$ build/sbt "test:testOnly *.ClientE2ETestSuite"
$ build/sbt "test:testOnly *.ArtifactStatusesHandlerSuite"

@MaxGekk MaxGekk changed the title from [WIP][SPARK-42585][CONNECT] Streaming the createDataFrame implementation to [WIP][SPARK-42585][CONNECT] Streaming of local relations on Apr 17, 2023
tmpFile = tmpFile,
blockSize = tmpFile.length())
updater.save()
}(catchBlock = {tmpFile.delete()})
Contributor

Will the Connect server remove the temp files after the Connect session is closed?
I guess we could add the session id and user id to the blockId, and release all the related blocks when a session ends.

Member Author

Yep, makes sense. Let me try that.

Member Author

done

@grundprinzip
Contributor

Are the Python changes done in a follow-up?

val arrowData = ConvertToArrow(encoder, data, timeZoneId, allocator)
localRelationBuilder.setData(arrowData)
val (arrowData, arrowDataSize) = ConvertToArrow(encoder, data, timeZoneId, allocator)
if (arrowDataSize <= conf.get(SQLConf.LOCAL_RELATION_CACHE_THRESHOLD.key).toInt) {
Contributor

It's kind of weird that we're using an internal API for the client-side confs.

Ideally, we'd leverage the existing stub configs in connector/connect/common/src/main/scala/org/apache/spark/sql/connect/common/config/ConnectCommon.scala for now?

Member Author

It's kind of weird that we're using an internal API for the client-side confs.

  1. I thought the caching approach could be implemented not only in Connect.
  2. The place you pointed out contains only constants, but I want to give users some control over this feature.

Contributor

Yes, absolutely, you're right that this needs to be configurable. My point is mostly that we don't have Spark confs on the client. In Python, for example, we don't have the JVM to parse them on startup; you can set them via spark.conf.set, but that's it.

My general recommendation would be to avoid pulling additional SQL/Core dependencies into the client.

Member Author

My point is mostly that we don't have Spark confs on the client.

Ahh, you meant this:

Suggested change
if (arrowDataSize <= conf.get(SQLConf.LOCAL_RELATION_CACHE_THRESHOLD.key).toInt) {
if (arrowDataSize <= conf.get("spark.sql.session.localRelationCacheThreshold").toInt) {

like @hvanhovell did above:

val timeZoneId = conf.get("spark.sql.session.timeZone")
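For illustration only (not the PR's final code), reading the threshold on the client by its string key could look like the sketch below; sessionConf is a stand-in for whatever key/value conf accessor the client exposes, and the 64MB fallback follows the default named in the description.

```scala
// Hypothetical helper: avoid a compile-time dependency on SQLConf by using the
// plain string key, falling back to the 64MB default when the conf is not set.
def localRelationCacheThreshold(sessionConf: Map[String, String]): Long =
  sessionConf
    .get("spark.sql.session.localRelationCacheThreshold")
    .map(_.toLong)
    .getOrElse(64L * 1024 * 1024)
```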

@MaxGekk
Member Author

MaxGekk commented May 1, 2023

Are the Python changes done in a follow-up?

Yep, in a separate PR.

@HyukjinKwon
Member

Merged to master.

LuciferYang pushed a commit to LuciferYang/spark that referenced this pull request May 10, 2023
Closes apache#40827 from MaxGekk/streaming-createDataFrame-2.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
HyukjinKwon added a commit that referenced this pull request May 23, 2023
…in Python client

### What changes were proposed in this pull request?

This PR implements `SparkSession.addArtifact(s)`. The logic is basically translated from Scala (#40256) to Python here.

One difference is that it does not support `class` files and `cache` (#40827), because it's not realistic for the Python client to add `class` files. For `cache`, this implementation will be used as base work.

This PR is also base work for implementing the sending of py-files and archive files.

### Why are the changes needed?

For feature parity with the Scala client. In addition, this is also base work for the `cache` implementation and for Python dependency management (https://www.databricks.com/blog/2020/12/22/how-to-manage-python-dependencies-in-pyspark.html)

### Does this PR introduce _any_ user-facing change?

Yes, this exposes an API `SparkSession.addArtifact(s)`.

### How was this patch tested?

Unittests were added. Also manually tested.

Closes #41250 from HyukjinKwon/python-addArtifacts.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
MaxGekk added a commit that referenced this pull request Jun 8, 2023
### What changes were proposed in this pull request?
In this PR, I propose to extend the Artifact API of the Python Connect client with two new methods, similarly to #40827:
1. `is_cached_artifact()` checks whether the cache entry for the given hash is present at the server side.
2. `cache_artifact()` caches a blob in memory at the server side.

### Why are the changes needed?
To allow creating a DataFrame from a large local collection. Without the changes, `spark.createDataFrame(...)` fails with the following error:
```python
pyspark.errors.exceptions.connect.SparkConnectGrpcException: <_MultiThreadedRendezvous of RPC that terminated with:
	status = StatusCode.RESOURCE_EXHAUSTED
	details = "Sent message larger than max (629146388 vs. 134217728)"
	debug_error_string = "UNKNOWN:Error received from peer localhost:58218 {grpc_message:"Sent message larger than max (629146388 vs. 134217728)", grpc_status:8, created_time:"2023-06-05T18:35:50.912817+03:00"}"
```

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
By running the new tests:
```
$ python/run-tests --parallelism=1 --testnames 'pyspark.sql.tests.connect.client.test_artifact ArtifactTests'
```

Closes #41465 from MaxGekk/streaming-createDataFrame-python-3.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
MaxGekk added a commit that referenced this pull request Jun 9, 2023
…reaming manner

### What changes were proposed in this pull request?
In this PR, I propose to transfer a local relation from **the Python connect client** to the server in a streaming way when it exceeds the size defined by the SQL config `spark.sql.session.localRelationCacheThreshold`. The implementation is similar to #40827. In particular:
1. The client applies the `sha256` function to **the proto form** of the local relation;
2. It checks the presence of the relation on the server side by sending the relation's hash to the server;
3. If the server does not have the local relation, the client transfers it as an artifact named `cache/<sha256>`;
4. Once the relation is already present at the server, or has just been transferred, the client transforms the logical plan by replacing the `LocalRelation` node with a `CachedLocalRelation` that carries the hash;
5. The server, in turn, converts `CachedLocalRelation` back to `LocalRelation` by retrieving the relation body from its local cache.

### Why are the changes needed?
To fix the issue of creating a large DataFrame from a local collection:
```python
pyspark.errors.exceptions.connect.SparkConnectGrpcException: <_MultiThreadedRendezvous of RPC that terminated with:
	status = StatusCode.RESOURCE_EXHAUSTED
	details = "Sent message larger than max (134218508 vs. 134217728)"
	debug_error_string = "UNKNOWN:Error received from peer localhost:50982 {grpc_message:"Sent message larger than max (134218508 vs. 134217728)", grpc_status:8, created_time:"2023-06-09T15:34:08.362797+03:00"}
```

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
By running the new test:
```
$ python/run-tests --parallelism=1 --testnames 'pyspark.sql.tests.connect.test_connect_basic SparkConnectBasicTests.test_streaming_local_relation'
```

Closes #41537 from MaxGekk/streaming-createDataFrame-python-4.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
czxm pushed a commit to czxm/spark that referenced this pull request Jun 12, 2023
czxm pushed a commit to czxm/spark that referenced this pull request Jun 12, 2023