[SPARK-44006][CONNECT][PYTHON] Support cache artifacts #41465

Closed

Conversation

MaxGekk (Member) commented Jun 5, 2023

What changes were proposed in this pull request?

In this PR, I propose to extend the Artifact API of the Python Connect client with two new methods, similar to #40827:

  1. is_cached_artifact() checks whether an artifact with the given hash is present in the server-side cache.
  2. cache_artifact() caches a blob in memory on the server side.
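
For illustration, a minimal sketch of how the two methods fit together. The method names come from this PR; reaching them via `spark.client` and the connection URL are assumptions made for the example:

```python
import hashlib

from pyspark.sql.connect.session import SparkSession

# Hedged sketch: spark.client as the accessor path is an assumption.
spark = SparkSession.builder.remote("sc://localhost").getOrCreate()

blob = b"a large serialized payload"
blob_hash = hashlib.sha256(blob).hexdigest()

if not spark.client.is_cached_artifact(blob_hash):
    # cache_artifact() stores the blob in the server-side in-memory
    # cache and returns its hash.
    assert spark.client.cache_artifact(blob) == blob_hash
```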

Why are the changes needed?

To allow creating a DataFrame from a large local collection. Without the changes, spark.createDataFrame(...) fails with the following error:

```
pyspark.errors.exceptions.connect.SparkConnectGrpcException: <_MultiThreadedRendezvous of RPC that terminated with:
	status = StatusCode.RESOURCE_EXHAUSTED
	details = "Sent message larger than max (629146388 vs. 134217728)"
	debug_error_string = "UNKNOWN:Error received from peer localhost:58218 {grpc_message:"Sent message larger than max (629146388 vs. 134217728)", grpc_status:8, created_time:"2023-06-05T18:35:50.912817+03:00"}"
```
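
For context, the 134217728-byte cap in the error appears to be the Connect server's default maximum inbound gRPC message size (128 MiB), which a single-message serialization of the relation blows past. A quick size check using the numbers from the error:

```python
# Sizes taken from the error message above.
sent = 629_146_388    # serialized local relation in one gRPC message
limit = 134_217_728   # 128 * 1024 * 1024 bytes
print(f"{sent / 2**20:.1f} MiB sent vs {limit / 2**20:.0f} MiB limit")
# -> 600.0 MiB sent vs 128 MiB limit
```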

Does this PR introduce any user-facing change?

No.

How was this patch tested?

By running new tests:

```
$ python/run-tests --parallelism=1 --testnames 'pyspark.sql.tests.connect.client.test_artifact ArtifactTests'
```

```diff
@@ -623,6 +623,20 @@ def test_with_local_list(self):
         ):
             self.connect.createDataFrame(data, "col1 int, col2 int, col3 int")

+    def test_streaming_local_relation(self):
+        import random
+        import string
```
Member:
we can just import them at the top
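
For context on the hunk above: test_streaming_local_relation evidently builds a large random local collection. A hedged sketch of the idea (the schema, sizes, and helper names are illustrative assumptions, not the PR's actual test body):

```python
import random
import string

# Build a local collection large enough that, serialized, it would have
# exceeded the old single-message gRPC cap; sizes are illustrative.
def random_payload(width: int = 64) -> str:
    return "".join(random.choices(string.ascii_letters, k=width))

data = [(i, random_payload()) for i in range(100_000)]
# With artifact caching in place, something like
#   spark.createDataFrame(data, "id long, payload string")
# can ship the relation via the cache instead of one oversized message.
```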

```diff
@@ -271,6 +271,17 @@ def func(x):
         self.spark.addArtifacts(f"{archive_path}.zip#my_files", archive=True)
         self.assertEqual(self.spark.range(1).select(func("id")).first()[0], "hello world!")

+    def test_cache_artifact(self):
+        import hashlib
```
Member:
ditto
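
Judging by the hashlib import and the is_cached_artifact()/cache_artifact() pair, the cache is addressed by the blob's hash (SHA-256 is an assumption here). A toy in-memory model of those semantics, not Spark code:

```python
import hashlib

class BlobCache:
    """Toy stand-in for the server-side artifact cache."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    def cache_artifact(self, blob: bytes) -> str:
        # Content-addressed: the key is the blob's hash, so
        # re-caching the same bytes is idempotent.
        h = hashlib.sha256(blob).hexdigest()
        self._blobs.setdefault(h, bytes(blob))
        return h

    def is_cached_artifact(self, h: str) -> bool:
        return h in self._blobs

cache = BlobCache()
h = cache.cache_artifact(b"Hello, World!")
assert cache.is_cached_artifact(h)
```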

```diff
@@ -294,6 +295,15 @@ def func(x):
         self.spark.addArtifacts(file_path, file=True)
         self.assertEqual(self.spark.range(1).select(func("id")).first()[0], "Hello world!!")

+    def test_cache_artifact(self):
```
MaxGekk (Member, Author):
@HyukjinKwon For some unknown reason, the test suite freezes after the test completes successfully.

I have added a similar test for the Scala client; please review it: #41493

@MaxGekk changed the title from "[WIP][CONNECT][PYTHON] Support Python's createDataFrame in streaming manner" to "[SPARK-44006][CONNECT][PYTHON] Support cache artifacts" on Jun 8, 2023
@MaxGekk marked this pull request as ready for review on Jun 8, 2023 08:19
MaxGekk (Member, Author) commented Jun 8, 2023

Merging to master. Thank you, @HyukjinKwon, for the review.

@MaxGekk closed this in 958b854 on Jun 8, 2023
czxm pushed a commit to czxm/spark that referenced this pull request Jun 12, 2023
Closes apache#41465 from MaxGekk/streaming-createDataFrame-python-3.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>