[SPARK-36314][SS] Update Sessionization examples to use native support of session window by HeartSaVioR · Pull Request #33548 · apache/spark

HeartSaVioR · 2021-07-27T22:50:58Z

What changes were proposed in this pull request?

This PR updates the sessionization examples to use the native session window support. It also adds a PySpark example, since native session window support is available in PySpark as well.

Why are the changes needed?

The examples should show the simplest way to implement this workload. I'll provide a separate example for cases that native session window support can't handle.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Manually tested.

HeartSaVioR · 2021-07-27T22:54:58Z

Target branches are master/3.2.

HeartSaVioR · 2021-07-27T22:59:44Z

cc. @viirya @xuanyuanking

SparkQA · 2021-07-27T23:28:08Z

Test build #141737 has finished for PR 33548 at commit c62ea25.

This patch fails Python style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-07-27T23:57:08Z

Test build #141741 has finished for PR 33548 at commit 039f53a.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-07-28T00:13:39Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46250/

viirya · 2021-07-28T00:18:11Z

examples/src/main/python/sql/streaming/structured_sessionization.py

+
+if __name__ == "__main__":
+    if len(sys.argv) != 3 and len(sys.argv) != 2:
+        msg = "Usage: structured_network_wordcount_windowed.py <hostname> <port> "


structured_sessionization.py?

viirya

Looks good. One typo found.

SparkQA · 2021-07-28T00:25:54Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46254/

SparkQA · 2021-07-28T00:59:23Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46254/

SparkQA · 2021-07-28T01:04:28Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46250/

SparkQA · 2021-07-28T01:39:07Z

Test build #141744 has finished for PR 33548 at commit f6cf587.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

xuanyuanking

LGTM

SparkQA · 2021-07-28T02:07:23Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46257/

SparkQA · 2021-07-28T02:42:50Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46257/

viirya · 2021-07-28T03:09:43Z

Thanks! Merging to master/3.2.

…t of session window

### What changes were proposed in this pull request?

This PR proposes to update Sessionization examples to use native support of session window. It also adds the example for PySpark as native support of session window is available to PySpark as well.

### Why are the changes needed?

We should guide the simplest way to achieve the same workload. I'll provide another example for cases we can't do with native support of session window.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manually tested.

Closes #33548 from HeartSaVioR/SPARK-36314.

Authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
(cherry picked from commit 1fafa8e)
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>

HeartSaVioR · 2021-07-28T03:13:27Z

Thanks all for the quick reviews and merge!

thenooz · 2022-08-29T07:17:48Z

Hello fellow Spark Structured Streaming sessionizers. According to the Apache Spark docs, update output mode is not allowed with session windows. I tried running the example code on my local machine with 3.2.1 and 3.2.0, and it failed with this error:

org.apache.spark.sql.AnalysisException: Update output mode not supported for session window on streaming DataFrames/DataSets;

According to this example code it can be done. What am I missing?

HeartSaVioR · 2022-08-29T07:47:32Z

Ah, it seems we missed updating the example... We allowed update mode at first and later disallowed it because its semantics were unclear.

Could you please raise a PR that changes the output mode to append in the Scala/Java/Python session window examples, if you don't mind? I can jump in if you're not available to address this. Thanks in advance!
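For illustration, a minimal PySpark sketch of what a session window query in append mode could look like. This is not the actual follow-up patch: the function name `build_session_query`, its parameters, and the 10-second defaults are assumptions made for this sketch. It relies on `session_window` from `pyspark.sql.functions` (available since Spark 3.2) and adds a watermark, which a streaming aggregation needs in append mode.

```python
def build_session_query(spark, host, port,
                        gap="10 seconds", watermark_delay="10 seconds"):
    """Return an unstarted DataStreamWriter that counts events per session.

    Hypothetical helper for illustration; names and defaults are assumptions.
    """
    # Imported lazily so this module still loads where PySpark is not installed.
    from pyspark.sql.functions import count, explode, session_window, split

    # Read lines from a socket; includeTimestamp adds a `timestamp` column.
    lines = (
        spark.readStream.format("socket")
        .option("host", host)
        .option("port", port)
        .option("includeTimestamp", "true")
        .load()
    )

    # Treat every word as a session id and the receive time as the event time.
    events = lines.select(
        explode(split(lines.value, " ")).alias("sessionId"),
        lines.timestamp.alias("eventTime"),
    )

    # The watermark lets Spark decide when a session is closed and can be emitted.
    session_counts = (
        events.withWatermark("eventTime", watermark_delay)
        .groupBy(session_window(events.eventTime, gap).alias("session"),
                 events.sessionId)
        .agg(count("*").alias("numEvents"))
        .selectExpr("sessionId", "session.start", "session.end", "numEvents")
    )

    # Append mode: each session is written once, after it has been finalized.
    return session_counts.writeStream.outputMode("append").format("console")
```

Given an existing `SparkSession` named `spark`, it could be started with `build_session_query(spark, "localhost", 9999).start().awaitTermination()`.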

thenooz · 2022-08-29T08:53:51Z

Sure, but if we use append mode, sessions will be returned only after they are closed. That defeats the purpose of real-time session tagging. Should we not revert the example to the mapGroupsWithState method?

HeartSaVioR · 2022-08-29T09:58:24Z

We still have a session window example built on (flat)mapGroupsWithState. We just don't want to duplicate the code example.

The main reason I proposed dropping update mode for session windows is the nature of session windows. When someone uses update mode for a streaming query, it means not only that they want updates immediately, but also that they are going to "upsert" into the target table.

An upsert is normally performed by replacing the existing value(s) for a specific key with new value(s). Once we consider out-of-order events, a session window can expand arbitrarily: both the start and the end of a session can change. (Technically speaking, the old session window example was incorrect in terms of event-time semantics.) Given that Spark does not produce the old value(s), the query can't always safely upsert the result. For example, user A has multiple sessions in the target table and some of them are already closed. Spark never emits those again, so the query is at risk of dropping these old-but-valid sessions.
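The hazard can be reproduced without Spark. Below is a small pure-Python simulation, a sketch under simplifying assumptions rather than Spark's implementation: sessions are `(start, end)` pairs with `end = last event time + gap`, the gap is 10 seconds, and the sink only receives the sessions that changed. `sessionize` and `simulate` are hypothetical names for this illustration.

```python
from typing import Dict, List, Tuple

Session = Tuple[int, int]  # (start, end) in seconds; end = last event time + gap


def sessionize(event_times: List[int], gap: int = 10) -> List[Session]:
    """Merge event times into gap-based sessions [start, last_event + gap)."""
    sessions: List[Session] = []
    for t in sorted(event_times):
        if sessions and t < sessions[-1][1]:
            # The event falls inside the latest session: extend its end.
            start, end = sessions[-1]
            sessions[-1] = (start, max(end, t + gap))
        else:
            sessions.append((t, t + gap))
    return sessions


def simulate() -> Tuple[List[Session], List[Session], List[Session]]:
    """Upsert user A's sessions before and after a late event at t=5."""
    before = sessionize([0, 12, 100])    # three separate sessions
    after = sessionize([0, 12, 100, 5])  # the late event merges the first two
    # Only changed sessions are emitted; old values are never retracted.
    changed = [s for s in after if s not in before]

    # Sink 1: upsert keyed by (user, session start) keeps a stale row.
    by_start: Dict[Tuple[str, int], Session] = {("A", s[0]): s for s in before}
    for s in changed:
        by_start[("A", s[0])] = s

    # Sink 2: upsert keyed by user replaces all of A's sessions with the update.
    by_user: Dict[str, List[Session]] = {"A": list(before)}
    by_user["A"] = list(changed)

    return after, sorted(by_start.values()), by_user["A"]


if __name__ == "__main__":
    correct, keyed_by_start, keyed_by_user = simulate()
    print("correct sessions:    ", correct)
    print("keyed by (user,start):", keyed_by_start)
    print("keyed by user:       ", keyed_by_user)
```

In this run the correct result is `[(0, 22), (100, 110)]`. The sink keyed by `(user, start)` still holds the stale session `(12, 22)`, and the sink keyed by user has lost the untouched session `(100, 110)`.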

thenooz · 2022-08-29T10:30:46Z

Got it. So the problem is that we risk making wrong real-time decisions based on immediate updates, since a session we calculated may well turn out to be incorrect because of late events. The safer and more accurate solution is append mode with watermarking.
But still, back to a probably annoying question: can we allow update mode with session windows, with a warning? You know, for people who take big risks.
Also, I will look into updating the example code. I am new to this.
Thank you very much.

[SPARK-36314][SS] Update Sessionization example to use native support…

c62ea25

… of session window

github-actions bot added EXAMPLES PYTHON SQL STRUCTURED STREAMING labels Jul 27, 2021

Fix pylint

039f53a

viirya reviewed Jul 28, 2021

fix

f6cf587

xuanyuanking approved these changes Jul 28, 2021

viirya approved these changes Jul 28, 2021

viirya closed this in 1fafa8e Jul 28, 2021
