[SPARK-43032][CONNECT][SS] Add Streaming query manager #40861
WweiL wants to merge 22 commits into apache:master from WweiL/SPARK-43032-streaming-query-manager
Conversation
This is a breaking change... You should probably just do `StreamingQueryManagerCommandResult streaming_query_manager_command_result = 11;` and not re-use an existing proto number.
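The suggestion above can be sketched as a protobuf fragment. The field number 11 and the field name come from the comment; the surrounding message and its other fields are illustrative, not the merged schema:

```protobuf
// Sketch only: a new result variant gets the next unused field number
// rather than reusing an existing one. Reusing a number would silently
// break wire compatibility for clients built against the old schema.
message StreamingQueryCommandResult {
  // ... existing fields keep their original numbers ...
  StreamingQueryManagerCommandResult streaming_query_manager_command_result = 11;
}
```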
I see. Thank you, I'll change it back.
I just wanted to keep similar names closer together...
rangadi left a comment:
Mostly LGTM. Left a couple of comments.
string run_id = 2;

// (Optional) The name of this query.
optional string name = 3;
Why is this needed? Note that the query name is not part of the real identity of the query. This can be an extra field, like in the response for 'WriteStreamOperation'.
Right, I'm also not very sure here. Basically I added this because in the `sqm.get(queryID)` handler, the query may or may not have a name. But in case it does, if we don't return the name then the client won't have it.
We could also maintain a local cache of query names, but that would add more complexity around cleaning up that cache... What do you think?
We can return the name, but we don't need to change StreamingQueryInstanceId. We can return both StreamingQueryInstanceId and the name
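This suggestion could look like the following fragment. The message and field names here are illustrative placeholders, not the schema that was actually merged:

```protobuf
// Sketch of the suggestion: keep StreamingQueryInstanceId untouched and
// carry the name alongside it in the result message.
message ActiveQueryInfo {
  StreamingQueryInstanceId id = 1;
  // (Optional) Set only when the query was given a name.
  optional string name = 2;
}
```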
connector/connect/common/src/main/protobuf/spark/connect/commands.proto
PTAL! @amaliujia @HyukjinKwon Thank you!
yield {
    "streaming_query_manager_command_result": b.streaming_query_manager_command_result
}
cmd_result = b.streaming_query_manager_command_result
If I don't do this, dev/reformat-python always changes this line to
yield {
"streaming_query_manager_command_result": b.streaming_query_manager_command_result
}
which would result in a Python lint error because the second line reaches 115 characters (there is deep indentation before it in the original file).
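A minimal, self-contained sketch of the workaround being discussed. Only the local-variable pattern matters; the `SimpleNamespace` stub is a hypothetical stand-in for the protobuf response object:

```python
from types import SimpleNamespace

def handle(b):
    # Binding the long attribute access to a short local first keeps the
    # following line under the length limit, so dev/reformat-python does
    # not reflow it into an over-long dict literal at deep indentation.
    cmd_result = b.streaming_query_manager_command_result
    yield {"streaming_query_manager_command_result": cmd_result}

# Hypothetical stand-in for the protobuf response object.
msg = SimpleNamespace(streaming_query_manager_command_result="result-payload")
out = next(handle(msg))
```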
python/pyspark/sql/tests/connect/test_parity_pandas_grouped_map_with_state.py
rangadi left a comment:
LGTM. Suggested one change, but that is OK; we could merge this.
connector/connect/common/src/main/protobuf/spark/connect/commands.proto
Also verified null query names:

Merged to master, thanks!
### What changes were proposed in this pull request?
Add support for `StreamingQueryManager()` in the Spark Connect Python client.
### Why are the changes needed?
Users can now use the typical streaming query manager methods by calling `session.streams`.
### Does this PR introduce _any_ user-facing change?
Yes
### How was this patch tested?
Manual test and unit test
```
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/__ / .__/\_,_/_/ /_/\_\ version 3.5.0.dev0
/_/
Using Python version 3.9.16 (main, Dec 7 2022 01:11:58)
Client connected to the Spark Connect server at localhost
SparkSession available as 'spark'.
>>> q = spark.readStream.format("rate").load().writeStream.format("memory").queryName("test").start()
23/04/19 23:10:43 WARN ResolveWriteToStream: Temporary checkpoint location created which is deleted normally when the query didn't fail: /tmp/temporary-913e48b9-26d8-448f-899f-d9f5ae08707d. If it's required to delete it under any circumstances, please set spark.sql.streaming.forceDeleteTempCheckpointLocation to true. Important to know deleting temp checkpoint folder is best effort.
23/04/19 23:10:43 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.
>>> spark.streams.active
[<pyspark.sql.connect.streaming.query.StreamingQuery object at 0x7f4590400d90>]
>>> q1 = spark.streams.active[0]
>>> q1.id == q.id
True
>>> q1.runId == q.runId
True
>>> q1.runId == q.runId
True
>>> q.name
'test'
>>> q1.name
'test'
>>> q == q1
False
>>> q1.stop()
>>> q
<pyspark.sql.connect.streaming.query.StreamingQuery object at 0x7f4590400b20>
>>> q1
<pyspark.sql.connect.streaming.query.StreamingQuery object at 0x7f4590400ee0>
>>> q.isActive
False
```
Closes apache#40861 from WweiL/SPARK-43032-streaming-query-manager.
Lead-authored-by: Wei Liu <wei.liu@databricks.com>
Co-authored-by: Wei Liu <z920631580@gmail.com>
Signed-off-by: Xinrong Meng <xinrong@apache.org>