-
Notifications
You must be signed in to change notification settings - Fork 28k
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
[SPARK-43971][CONNECT][PYTHON] Support Python's createDataFrame in st…
…reaming manner ### What changes were proposed in this pull request? In the PR, I propose to transfer a local relation from **the Python connect client** to the server in streaming way when it exceeds some size which is defined by the SQL config `spark.sql.session.localRelationCacheThreshold`. The implementation is similar to #40827. In particular: 1. The client applies the `sha256` function over **the proto form** of the local relation; 2. It checks presents of the relation at the server side by sending the relation hash to the server; 3. If the server doesn't have the local relation, the client transfers the local relation as an artefact with the name `cache/<sha256>`; 4. As soon as the relation has presented at the server already, or transferred recently, the client transform the logical plan by replacing the `LocalRelation` node by `CachedLocalRelation` with the hash. 5. On another hand, the server converts `CachedLocalRelation` back to `LocalRelation` by retrieving the relation body from the local cache. ### Why are the changes needed? To fix the issues of creating a large dataframe from a local collection: ```python pyspark.errors.exceptions.connect.SparkConnectGrpcException: <_MultiThreadedRendezvous of RPC that terminated with: status = StatusCode.RESOURCE_EXHAUSTED details = "Sent message larger than max (134218508 vs. 134217728)" debug_error_string = "UNKNOWN:Error received from peer localhost:50982 {grpc_message:"Sent message larger than max (134218508 vs. 134217728)", grpc_status:8, created_time:"2023-06-09T15:34:08.362797+03:00"} ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? By running new test: ``` $ python/run-tests --parallelism=1 --testnames 'pyspark.sql.tests.connect.test_connect_basic SparkConnectBasicTests.test_streaming_local_relation' ``` Closes #41537 from MaxGekk/streaming-createDataFrame-python-4. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com>
- Loading branch information
Showing
4 changed files
with
79 additions
and
3 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters