
[SPARK-41114] [CONNECT] [PYTHON] [FOLLOW-UP] Python Client support for local data #38803

Closed
wants to merge 5 commits

Conversation

grundprinzip
Contributor

What changes were proposed in this pull request?

Since the Spark Connect server now supports reading local data from the client, this patch implements the necessary changes in the Python client to support reading from a local pandas DataFrame.

```
import pandas

pdf = pandas.DataFrame({"a": [1, 2, 3], "b": ["a", "b", "c"]})
df = spark.createDataFrame(pdf)
rows = df.filter(df.a == lit(3)).collect()
self.assertTrue(len(rows) == 1)
self.assertEqual(rows[0][0], 3)
self.assertEqual(rows[0][1], "c")
```

Why are the changes needed?

Compatibility

Does this PR introduce any user-facing change?

No

How was this patch tested?

UT

@grundprinzip
Contributor Author

R: @zhengruifeng @HyukjinKwon

@bjornjorgensen
Contributor

Not the biggest issue in the world, but it's common to import pandas as pd.

@grundprinzip
Contributor Author

> Not the biggest issue in the world, but it's common to import pandas as pd.

Fixed.

@@ -205,6 +207,31 @@ def __init__(self, connectionString: str, userId: Optional[str] = None):
        # Create the reader
        self.read = DataFrameReader(self)

    def createDataFrame(self, data: "pd.DataFrame") -> "DataFrame":
Member


Actually, the implementation here isn't matched to what we have in createDataFrame(pandas).

By default, the Arrow message conversion (more specifically in https://github.com/apache/spark/pull/38659/files#diff-d630cc4be6c65a3c3f7d6dbfe990f99ba992ccc26d9c3aaf6cfe46e163cb7389R514-R521) has to happen in an RDD so we can parallelize this.

For a bit of history, PySpark added the initial version with RDD first, and later added this local relation as an optimization for small datasets (see also #36683).
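
For context, a simplified, hypothetical sketch of the idea being referenced (not the actual Spark code path): in classic PySpark the pandas DataFrame is sliced into several Arrow record batches so the conversion can be distributed as RDD partitions rather than shipped as a single blob. The slicing scheme and names below are illustrative only.

```python
# Hypothetical sketch: slice a pandas DataFrame into Arrow record batches so
# the conversion could be distributed. This is NOT the actual Spark code path.
import pandas as pd
import pyarrow as pa

pdf = pd.DataFrame({"a": range(10), "b": [str(i) for i in range(10)]})

num_slices = 4
step = -(-len(pdf) // num_slices)  # ceiling division
batches = [
    pa.RecordBatch.from_pandas(pdf.iloc[i : i + step], preserve_index=False)
    for i in range(0, len(pdf), step)
]
# In PySpark these batches would become partitions of an RDD; here we only
# show that the original frame can be reassembled from the batches.
assert pa.Table.from_batches(batches).num_rows == len(pdf)
```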

Member


I am fine with the current approach, but the main problems here are that 1. we can't stream the input, and 2. it will have a size limit (likely 4KB). cc @hvanhovell FYI

Contributor Author


It is impossible to match the implementation, because in PySpark a first serialization already happens during parallelize to pass the input DataFrame to the executors.

In our case, to even send the data to Spark, we have to serialize it.

That said, you're right that this currently does not support streaming of local data from the client. But the limit is not 4 KB; it is probably whatever the max message size of gRPC is, so in the megabytes.

I think we need to add the client-side streaming APIs at some point, but I'd like to defer that for a bit.
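
As a minimal sketch of the serialization step being discussed (not the actual client code), the local pandas DataFrame is written as an Arrow IPC stream, and the resulting bytes are what would be embedded in the LocalRelation message; the 4 MB figure below is gRPC's common default max message size, used purely for illustration.

```python
# Sketch: serialize a local pandas DataFrame to Arrow IPC bytes, the kind of
# payload that would be embedded in a LocalRelation message.
import pandas as pd
import pyarrow as pa

pdf = pd.DataFrame({"a": [1, 2, 3], "b": ["a", "b", "c"]})

sink = pa.BufferOutputStream()
table = pa.Table.from_pandas(pdf)
with pa.ipc.new_stream(sink, table.schema) as writer:
    writer.write_table(table)

payload = sink.getvalue().to_pybytes()
GRPC_DEFAULT_MAX_MESSAGE_SIZE = 4 * 1024 * 1024  # illustrative default only
print(len(payload), len(payload) < GRPC_DEFAULT_MAX_MESSAGE_SIZE)
```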

Contributor


For a large pd.DataFrame, I guess we can optimize it in this way in the future: split it into several batches, create a LocalRelation for each batch, and finally union them.
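
A hedged sketch of this suggestion, assuming a session object `spark` whose createDataFrame builds one LocalRelation per call; the helper name and chunk size are illustrative and not part of this PR.

```python
import functools

import pandas as pd


def create_dataframe_chunked(spark, pdf: pd.DataFrame, chunk_size: int = 10_000):
    # Split the pandas DataFrame into row chunks, build one DataFrame
    # (one LocalRelation) per chunk, and union the results.
    chunks = [pdf.iloc[i : i + chunk_size] for i in range(0, len(pdf), chunk_size)]
    dfs = [spark.createDataFrame(chunk) for chunk in chunks]
    return functools.reduce(lambda left, right: left.union(right), dfs)
```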

Member

@HyukjinKwon HyukjinKwon left a comment


A couple of comments. I am fine with this as the initial version.

@AmplabJenkins

Can one of the admins verify this patch?

        self._pdf = pdf

    def plan(self, session: "SparkConnectClient") -> proto.Relation:
        assert self._pdf is not None
Contributor

@amaliujia amaliujia Nov 27, 2022


Nit: isn't this a bit redundant, though, given that plan.py is an internal API, the constructor does not accept an Optional pandas DataFrame, and we have mypy to do type checking?

Contributor Author


I think you're right. It makes sense to move the assertion into the session.

As an FYI, all of the mypy checks are really just for the code that we write. At runtime, the user can pass whatever they want, and we should make sure that we have proper checks for it. But since plan is an internal API, it makes a lot of sense to have the checks on the public API instead.
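
A hedged sketch of that split, using hypothetical stand-in names rather than the real classes: runtime validation lives on the public createDataFrame API, while the internal plan node relies on mypy and assumes it receives a valid pandas DataFrame.

```python
import pandas as pd


class LocalRelation:  # simplified stand-in for the internal plan node in plan.py
    def __init__(self, pdf: pd.DataFrame):
        # No runtime check needed here: this is internal API and mypy covers it.
        self._pdf = pdf


def createDataFrame(data: pd.DataFrame) -> LocalRelation:
    # Public API: users can pass anything at runtime, so validate here.
    if data is None or not isinstance(data, pd.DataFrame):
        raise TypeError("data must be a pandas DataFrame")
    if len(data) == 0:
        raise ValueError("Input data cannot be empty")
    return LocalRelation(data)
```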


        sink = pa.BufferOutputStream()
        table = pa.Table.from_pandas(self._pdf)
        with pa.ipc.new_stream(sink, table.schema) as writer:
Contributor

@amaliujia amaliujia Nov 27, 2022


I am not familiar with this area, so a question:

Is it possible that an empty pandas DataFrame is used here (e.g. it has a schema but no data)? If so, maybe add a test case?

Contributor Author


I'll add a test for that, thanks for the proposal!

"""
assert data is not None
if len(data) == 0:
raise ValueError("Input data cannot be empty")
Contributor


IIRC, createDataFrame in PySpark does not support an empty pandas DataFrame. I think it would be fine to throw an error here.
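
A hedged sketch of the kind of test discussed above; the test class, the fixture name `self.connect`, and the method name are assumptions, not taken from the PR.

```python
import unittest

import pandas as pd


class LocalDataTests(unittest.TestCase):
    # `self.connect` is assumed to be a Spark Connect session created by the
    # surrounding test harness.
    def test_empty_pandas_dataframe_is_rejected(self):
        empty_pdf = pd.DataFrame({"a": [], "b": []})  # has columns but no rows
        with self.assertRaises(ValueError):
            self.connect.createDataFrame(empty_pdf)
```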

@hvanhovell
Contributor

Alright, merging this one.

@amaliujia
Contributor

LGTM!

beliefer pushed a commit to beliefer/spark that referenced this pull request Dec 15, 2022

Closes apache#38803 from grundprinzip/SPARK-41114.

Authored-by: Martin Grund <martin.grund@databricks.com>
Signed-off-by: Herman van Hovell <herman@databricks.com>
beliefer pushed a commit to beliefer/spark that referenced this pull request Dec 18, 2022