
[SPARK-40509][SS][PYTHON] Add example for applyInPandasWithState #38013

Closed
wants to merge 9 commits

Conversation

@chaoqin-li1123 (Contributor) commented Sep 27, 2022

What changes were proposed in this pull request?

An example of applyInPandasWithState usage. This example splits lines into words, groups by word as the key, and uses per-key state to track a session for each key.

Why are the changes needed?

To demonstrate the usage of applyInPandasWithState.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

This is an example that can be run manually.

To run this on your local machine, you need to first run a Netcat server
$ nc -lk 9999
and then run the example
$ bin/spark-submit examples/src/main/python/sql/streaming/structured_network_wordcount_session_window.py localhost 9999

@chaoqin-li1123 (Contributor, Author)

@HeartSaVioR The applyInPandasWithState session window example.

@HeartSaVioR (Contributor)

Thanks for the contribution @chaoqin-li1123! Looks like the Python linter is complaining - could you please look into this?
https://github.com/chaoqin-li1123/spark/actions/runs/3133327292/jobs/5086593368

It would also be good to explicitly document how you ran the example in the "How was this patch tested?" section.

Thanks in advance!

@AmplabJenkins

Can one of the admins verify this patch?

@HeartSaVioR (Contributor)

One tip: unlike Scala/Java code, we can leverage dev/reformat-python to reformat Python code automatically.

@chaoqin-li1123 (Contributor, Author) commented Sep 27, 2022

One tip, unlike Scala/Java code, we can leverage dev/reformat-python to reformat python code automatically.

It seems that dev/reformat-python skips this file; I need to run python3 -m black examples/src/main/python/sql/streaming/structured_network_wordcount_session_window.py instead.

@HeartSaVioR (Contributor)

@chaoqin-li1123
https://github.com/chaoqin-li1123/spark/actions/runs/3138156803/jobs/5097193712

Linter is still complaining. Could you take a look?

You can install the necessary Python dependency requirements from ./dev/requirements.txt, run ./dev/lint-python, and ensure everything passes.

@HeartSaVioR (Contributor) left a review

+1, but I'd like to have second eyes of reviews from PySpark experts to make sure the code looks good for them as well.

@HyukjinKwon (Member) left a review

LGTM2

@HyukjinKwon HyukjinKwon changed the title [SPARK-40509][SS][PYTHON] add example for applyInPandasWithState [SPARK-40509][SS][PYTHON] Add example for applyInPandasWithState Sep 28, 2022
@HeartSaVioR (Contributor)

Thanks! Merging to master.

)

def func(
    key: Any, pdf_iter: Iterable[pd.DataFrame], state: GroupState
Member

Sorry for post-hoc reviews.

Let's change Iterable to Iterator. This can only be an iterator.

Contributor Author

The linter forces me to annotate the type here as Iterable instead of Iterator; maybe we should do some investigation?

Member

Ohh, okay NVM. Let's leave it as is for now 👍
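For readers following along: the function in the example is a generator, every generator is an Iterator, and every Iterator is also an Iterable, so the broader Iterable annotation the linter insists on is still type-correct. A minimal pure-Python illustration:

```python
from collections.abc import Iterable, Iterator

def gen():
    # Calling a generator function returns a generator object.
    yield 1

g = gen()
# Every generator is an Iterator, and every Iterator is also an Iterable,
# so annotating the parameter with the broader Iterable type still checks out.
print(isinstance(g, Iterator))  # True
print(isinstance(g, Iterable))  # True
```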

    key: Any, pdf_iter: Iterable[pd.DataFrame], state: GroupState
) -> Iterable[pd.DataFrame]:
    if state.hasTimedOut:
        count, start, end = state.get
Member

I would extract the key like:

(session_id,) = key

or

(word,) = key
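For context, the grouping key is passed to the function as a tuple with one element per grouping column, which is why the single-element unpacking above works; a pure-Python sketch with an illustrative value:

```python
# With a single grouping column, the key arrives as a 1-tuple, e.g.:
key = ("spark",)  # illustrative value, not taken from the PR

# Tuple unpacking documents the expected arity and fails fast if it
# ever changes, unlike indexing with key[0]:
(word,) = key
print(word)  # spark
```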

        "sessionId": [key[0]],
        "count": [count],
        "start": [start],
        "end": [end],
Member

Can we cast this and show it as a timestamp in the console? Numeric timestamp values look a bit difficult to read.
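One possible way to do this (a sketch of the idea, not the change actually made in the followup; the helper name is made up) is to render the numeric epoch values as strings before building the output row:

```python
from datetime import datetime, timezone

def to_readable(epoch_seconds):
    # Render a numeric epoch timestamp as a human-readable UTC string,
    # which is easier to scan in console output than raw numbers.
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).strftime(
        "%Y-%m-%d %H:%M:%S"
    )

print(to_readable(0))  # 1970-01-01 00:00:00
```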

)

def func(
    key: Any, pdf_iter: Iterable[pd.DataFrame], state: GroupState
Member

BTW, I would rename pdf_iter to pdfs.

    end = 0
    count = 0
    for pdf in pdf_iter:
        start = min(start, min(pdf["timestamp"]))
Member

I would use the pandas API here instead of built-in Python functions, to show users that we can use pandas, e.g. int(min(start, pdf["timestamp"].min())) and int(max(end, pdf["timestamp"].max())).


Just tried a similar scenario; this causes data type errors downstream:

executor driver: net.razorvine.pickle.PickleException (expected zero arguments for construction of ClassDict (for numpy.dtype). This happens when an unsupported/unregistered class is being unpickled that requires construction arguments. Fix it by registering a custom IObjectConstructor for this class.)

Member

I suspect the return type does not match the SQL type provided. Do you mind sharing a reproducer?
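A pandas-only sketch of the suggestion with made-up data; the explicit int(...) cast matters because pandas reductions return numpy scalar types, which is a plausible source of the pickle error reported above:

```python
import pandas as pd

pdf = pd.DataFrame({"timestamp": [100, 105, 110]})  # illustrative epoch values

# Series.min()/Series.max() return numpy scalars (e.g. numpy.int64);
# wrapping them in int() yields plain Python ints that serialize cleanly.
start = int(pdf["timestamp"].min())
end = int(pdf["timestamp"].max())

print(start, end)  # 100 110
```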

        end = max(end, max(pdf["timestamp"]))
        count = count + len(pdf)
    if state.exists:
        old_session = state.get
Member

I would do:

(old_count, start, old_end) = state.get
count = count + old_count
end = max(end, old_end)
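The merge above can be exercised in isolation as plain Python (function and variable names are illustrative; in the example the tuple comes from state.get):

```python
def merge_with_old_session(count, start, end, old_session):
    # old_session is the (old_count, old_start, old_end) tuple held in the
    # GroupState; fold it into the values computed from the current batch.
    (old_count, old_start, old_end) = old_session
    return (count + old_count, min(start, old_start), max(end, old_end))

# 3 new events between t=105 and t=110 merged into an existing session
# of 5 events between t=100 and t=108:
print(merge_with_old_session(3, 105, 110, (5, 100, 108)))  # (8, 100, 110)
```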

sessions = events.groupBy(events["sessionId"]).applyInPandasWithState(
    func,
    session_schema,
    session_state_schema,
Member

FYI, you can just pass a DDL-formatted string schema too, e.g. "sessionId STRING, count LONG, start LONG, end LONG", which is shorter.


def func(
    key: Any, pdf_iter: Iterable[pd.DataFrame], state: GroupState
) -> Iterable[pd.DataFrame]:
Member

Ditto: the output annotation should better be Iterator in this case as well.


r"""
Split lines into words, group by words and use the state per key to track session of each key.

Member

Can we add a bit more explanation? e.g. the timeout is set to 10 seconds, so each session window lasts until there is no more input for the key for 10 seconds.
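The timeout behavior being described can be sketched outside Spark; this is a pure-Python illustration of the semantics, not the GroupState API itself:

```python
def session_expired(last_event_time, now, timeout_seconds=10):
    # A session window for a key closes once no new input has arrived
    # for that key within the timeout interval.
    return (now - last_event_time) >= timeout_seconds

print(session_expired(100, 105))  # False: input arrived only 5s ago
print(session_expired(100, 112))  # True: 12s of silence closes the session
```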

@HyukjinKwon (Member)

@chaoqin-li1123, I happened to nitpick a bit. Would you mind creating a followup PR reusing the same JIRA?

@chaoqin-li1123 (Contributor, Author)

No problem, I will create a new PR with the suggested changes.

HyukjinKwon pushed a commit that referenced this pull request Oct 3, 2022
…thState followup

### What changes were proposed in this pull request?

This is a followup of #38013, which introduced an example for applyInPandasWithState. It addresses some comments on code style.

Closes #38066 from chaoqin-li1123/example_followup.

Authored-by: Chaoqin Li <chaoqin.li@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>