[SPARK-41001] [CONNECT] [PYTHON] Implementing Connection String for Python Client #38485

Closed
wants to merge 9 commits into from

grundprinzip
Contributor

@grundprinzip grundprinzip commented Nov 2, 2022

What changes were proposed in this pull request?

This PR implements the connection string for Spark Connect clients according to the documentation added in #38470.

With this patch it becomes possible to connect to a Spark Connect endpoint using

spark = SparkRemoteSession(user_id="martin", connection_string="sc://hostname/;use_ssl=true;token=abcd")
spark.read.table("test").limit(10).toPandas()

The connection string is parsed and filtered, which allows SSL and bearer token authentication to be configured dynamically. All remaining parameters are converted into gRPC metadata pairs and submitted as part of the request.
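
The parsing described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `parse_connection_string` and `ConnectionConfig` are hypothetical names, and the sketch assumes the `sc://endpoint/;key=value;key=value` shape shown in the example. It only separates `use_ssl` and `token` from the remaining parameters; how the token is attached to the channel is left out.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class ConnectionConfig:
    """Hypothetical container for the parsed pieces (not part of the PR)."""

    endpoint: str
    use_ssl: bool
    token: Optional[str]
    metadata: List[Tuple[str, str]]


def parse_connection_string(conn: str) -> ConnectionConfig:
    """Split 'sc://endpoint/;k=v;k=v' into endpoint, known options, and metadata."""
    prefix = "sc://"
    if not conn.startswith(prefix):
        raise ValueError(f"Connection string must start with {prefix!r}: {conn!r}")

    # Everything before the first '/' is the endpoint; the rest holds parameters.
    endpoint, _, param_str = conn[len(prefix):].partition("/")
    if not endpoint:
        raise ValueError(f"Missing endpoint in connection string: {conn!r}")

    params: Dict[str, str] = {}
    for part in param_str.split(";"):
        if not part:
            continue  # skip the empty piece before the first ';'
        key, sep, value = part.partition("=")
        if not sep or not key:
            raise ValueError(f"Malformed parameter {part!r} in {conn!r}")
        params[key] = value

    # Filter out the options that configure the channel itself.
    use_ssl = params.pop("use_ssl", "false").lower() == "true"
    token = params.pop("token", None)

    # Whatever is left becomes (key, value) pairs, the shape gRPC metadata uses.
    return ConnectionConfig(endpoint, use_ssl, token, list(params.items()))
```

For example, `parse_connection_string("sc://hostname/;use_ssl=true;token=abcd;x-my-header=abc")` would yield the endpoint `hostname`, SSL enabled, the token `abcd`, and a single metadata pair for the hypothetical `x-my-header` key.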

Why are the changes needed?

User experience.

Does this PR introduce any user-facing change?

No, experimental API.

How was this patch tested?

Unit tests.

Member

@HyukjinKwon HyukjinKwon left a comment

Looks fine from a cursory look. cc @zhengruifeng too

python/pyspark/sql/connect/client.py (outdated review thread, resolved)
@AmplabJenkins

Can one of the admins verify this patch?

grundprinzip and others added 2 commits November 4, 2022 06:05
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
@grundprinzip
Contributor Author

Accepted the suggestion and fixed a doc example that was missing a quote.

@HyukjinKwon
Member

Merged to master.

@@ -167,6 +171,43 @@ def test_simple_datasource_read(self) -> None:
self.assertEqual(len(expectResult), len(actualResult))


class ChannelBuilderTests(ReusedPySparkTestCase):
Member

This test class should be skipped when should_test_connect is false, like SparkConnectSQLTestCase in this file:

@unittest.skipIf(not should_test_connect, connect_requirement_message)
class SparkConnectSQLTestCase(ReusedPySparkTestCase):

I made a PR for that.
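
Applied to the new test class, the suggestion might look like the sketch below. To keep it self-contained, `should_test_connect` and `connect_requirement_message` are defined here as stand-ins for the real values, the base class is plain `unittest.TestCase` rather than `ReusedPySparkTestCase`, and `test_placeholder` and `run_suite` are hypothetical helpers added only for illustration.

```python
import unittest

# Assumption: stand-ins for the flag and message that the real test module provides.
should_test_connect = False
connect_requirement_message = "Spark Connect dependencies are not available"


# The decorator marks every test in the class as skipped when the flag is false.
@unittest.skipIf(not should_test_connect, connect_requirement_message)
class ChannelBuilderTests(unittest.TestCase):
    def test_placeholder(self) -> None:
        # Hypothetical test body; the real tests exercise the channel builder.
        self.assertTrue(True)


def run_suite() -> unittest.TestResult:
    """Run the class above and return the collected result."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ChannelBuilderTests)
    result = unittest.TestResult()
    suite.run(result)
    return result
```

With the flag set to false, the run reports the test as skipped with the given message instead of failing on missing dependencies.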

SandishKumarHN pushed a commit to SandishKumarHN/spark that referenced this pull request Dec 12, 2022
…hon Client

### What changes were proposed in this pull request?

This PR implements the connection string for Spark Connect clients according to the documentation added in apache#38470.

With this patch it becomes possible to connect to a Spark Connect endpoint using

```
spark = SparkRemoteSession(user_id="martin", connection_string="sc://hostname/;use_ssl=true;token=abcd")
spark.read.table("test").limit(10).toPandas()
```

The connection string is parsed and filtered, which allows SSL and bearer token authentication to be configured dynamically. All remaining parameters are converted into gRPC metadata pairs and submitted as part of the request.

### Why are the changes needed?
User experience.

### Does this PR introduce _any_ user-facing change?
No, experimental API.

### How was this patch tested?
Unit tests.

Closes apache#38485 from grundprinzip/SPARK-41001.

Lead-authored-by: Martin Grund <martin.grund@databricks.com>
Co-authored-by: Martin Grund <grundprinzip@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
5 participants