
[SPARK-43032][CONNECT][SS] Add Streaming query manager #40861

Closed
WweiL wants to merge 22 commits into apache:master from WweiL:SPARK-43032-streaming-query-manager

Conversation


WweiL (Contributor) commented Apr 19, 2023

What changes were proposed in this pull request?

Add support for `StreamingQueryManager()` to the Spark Connect Python client.

Why are the changes needed?

Users can now use the typical streaming query manager methods by calling `session.streams` (see the usage sketch after the session transcript below).

Does this PR introduce any user-facing change?

Yes

How was this patch tested?

Manual test and unit test

```
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.5.0.dev0
      /_/

Using Python version 3.9.16 (main, Dec  7 2022 01:11:58)
Client connected to the Spark Connect server at localhost
SparkSession available as 'spark'.
>>> q = spark.readStream.format("rate").load().writeStream.format("memory").queryName("test").start()
23/04/19 23:10:43 WARN ResolveWriteToStream: Temporary checkpoint location created which is deleted normally when the query didn't fail: /tmp/temporary-913e48b9-26d8-448f-899f-d9f5ae08707d. If it's required to delete it under any circumstances, please set spark.sql.streaming.forceDeleteTempCheckpointLocation to true. Important to know deleting temp checkpoint folder is best effort.
23/04/19 23:10:43 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.
>>> spark.streams.active
[<pyspark.sql.connect.streaming.query.StreamingQuery object at 0x7f4590400d90>]
>>> q1 = spark.streams.active[0]
>>> q1.id == q.id
True
>>> q1.runId == q.runId
True
>>> q1.runId == q.runId
True
>>> q.name
'test'
>>> q1.name
'test'
>>> q == q1
False
>>> q1.stop()
>>> q
<pyspark.sql.connect.streaming.query.StreamingQuery object at 0x7f4590400b20>
>>> q1
<pyspark.sql.connect.streaming.query.StreamingQuery object at 0x7f4590400ee0>
>>> q.isActive
False
```
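For reference, a minimal usage sketch of the manager surface from the Connect client, assuming the same method names as the classic PySpark `StreamingQueryManager` (`active`, `get`, `awaitAnyTermination`, `resetTerminated`); the connection string and the source/sink choices below are illustrative, not taken from this PR:

```python
from pyspark.sql import SparkSession

# Connect to a Spark Connect server (the address is illustrative).
spark = SparkSession.builder.remote("sc://localhost").getOrCreate()

q = (
    spark.readStream.format("rate").load()
    .writeStream.format("memory").queryName("test").start()
)

# Typical StreamingQueryManager calls, now routed through the Connect client.
active = spark.streams.active                 # list of active StreamingQuery handles
same_q = spark.streams.get(q.id)              # look up a query by its id
spark.streams.awaitAnyTermination(timeout=5)  # wait up to 5 seconds for any query to terminate
spark.streams.resetTerminated()               # clear terminated queries tracked by the manager

q.stop()
```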

Contributor

This is a breaking change...

Probably you should just do `StreamingQueryManagerCommandResult streaming_query_manager_command_result = 11;` and not re-use an existing proto field number.

Contributor Author

I see. Thank you, I'll change it back. I just wanted to keep similar names closer together...

HyukjinKwon changed the title [SPARK-43032] streaming query manager → [SPARK-43032][CONNECT][SS] Adds Streaming query manager on Apr 20, 2023
WweiL changed the title [SPARK-43032][CONNECT][SS] Adds Streaming query manager → [SPARK-43032][CONNECT][SS] Add Streaming query manager on Apr 20, 2023
WweiL force-pushed the SPARK-43032-streaming-query-manager branch from f70fc9b to aced778 on May 1, 2023 05:07
WweiL marked this pull request as ready for review on May 1, 2023 17:05
rangadi left a comment

Mostly LGTM. Left a couple of comments.

string run_id = 2;

// (Optional) The name of this query.
optional string name = 3;

Why is this needed? Note that the query name is not part of the real identity of the query. This could be an extra field, like in the response for 'WriteStreamOperation'.

Contributor Author

Right, I'm also not very sure here. Basically I added this because of the sqm.get(queryID) handler: the query may or may not have a name, but if it does and we don't return it, the client won't have the query's name.

We could also maintain a local cache of query names, but that would add more complexity for cleaning up that cache... What do you think?
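To make the trade-off concrete, here is a minimal client-side sketch using hypothetical class and helper names (not the actual Connect proto or client code): if the server's get() result carried only the query id and run id, the client-side handle could not populate its name without an extra round trip or a client-side name cache.

```python
from typing import Optional


class StreamingQueryHandle:
    """Hypothetical stand-in for the client-side StreamingQuery wrapper."""

    def __init__(self, query_id: str, run_id: str, name: Optional[str] = None):
        self._id = query_id
        self._run_id = run_id
        # Stays None unless the server includes the name in its response.
        self._name = name

    @property
    def name(self) -> Optional[str]:
        return self._name


def handle_from_get_response(response: dict) -> StreamingQueryHandle:
    # `response` is a hypothetical stand-in for the server's get() result.
    # Returning the name alongside the id avoids maintaining (and cleaning up)
    # a separate client-side cache of query names.
    return StreamingQueryHandle(
        response["id"], response["run_id"], response.get("name")
    )
```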


We can return the name, but we don't need to change StreamingQueryInstanceId. We can return both StreamingQueryInstanceId and the name.

WweiL requested a review from amaliujia on May 1, 2023 19:37

WweiL commented May 1, 2023

PTAL! @amaliujia @HyukjinKwon Thank you!

yield {
"streaming_query_manager_command_result": b.streaming_query_manager_command_result
}
cmd_result = b.streaming_query_manager_command_result
WweiL commented May 1, 2023

If I don't do this, dev/reformat-python always changes this line to

yield {
    "streaming_query_manager_command_result": b.streaming_query_manager_command_result
}

which results in a Python lint error, because the second line ends up 115 characters long (there is long indentation before it in the original file).


Looks fine to me.

rangadi left a comment

LGTM. Suggested one change, but that is OK; we could merge this.


WweiL commented May 2, 2023

Also verified null query names:

>>> q = spark.readStream.format("rate").load().writeStream.format("console").start()
>>> q1 = spark.streams.get(q.id)
>>> q1.name
''
>>> q.name
''
>>> q.name == q1.name
True

@xinrong-meng
Member

Merged to master, thanks!

LuciferYang pushed a commit to LuciferYang/spark that referenced this pull request May 10, 2023

Closes apache#40861 from WweiL/SPARK-43032-streaming-query-manager.

Lead-authored-by: Wei Liu <wei.liu@databricks.com>
Co-authored-by: Wei Liu <z920631580@gmail.com>
Signed-off-by: Xinrong Meng <xinrong@apache.org>