[SPARK-36314][SS] Update Sessionization examples to use native support of session window#33548

Closed
HeartSaVioR wants to merge 3 commits into apache:master from HeartSaVioR:SPARK-36314

@HeartSaVioR
Contributor

What changes were proposed in this pull request?

This PR updates the sessionization examples to use the native session window support. It also adds a PySpark example, since native session window support is available in PySpark as well.

Why are the changes needed?

The examples should show the simplest way to achieve the same workload. I'll provide another example for cases that native session window support can't handle.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Manually tested.

@HeartSaVioR
Contributor Author

HeartSaVioR commented Jul 27, 2021

Target branches are master/3.2.

@HeartSaVioR
Contributor Author

cc. @viirya @xuanyuanking

@SparkQA

SparkQA commented Jul 27, 2021

Test build #141737 has finished for PR 33548 at commit c62ea25.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 27, 2021

Test build #141741 has finished for PR 33548 at commit 039f53a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 28, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46250/


if __name__ == "__main__":
    if len(sys.argv) != 3 and len(sys.argv) != 2:
        msg = "Usage: structured_network_wordcount_windowed.py <hostname> <port> "
Member

structured_sessionization.py?

Member

@viirya viirya left a comment

Looks good. One typo found.

@SparkQA

SparkQA commented Jul 28, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46254/

@SparkQA

SparkQA commented Jul 28, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46254/

@SparkQA

SparkQA commented Jul 28, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46250/

@SparkQA

SparkQA commented Jul 28, 2021

Test build #141744 has finished for PR 33548 at commit f6cf587.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@xuanyuanking xuanyuanking left a comment

LGTM

@SparkQA

SparkQA commented Jul 28, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46257/

@SparkQA

SparkQA commented Jul 28, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46257/

@viirya
Member

viirya commented Jul 28, 2021

Thanks! Merging to master/3.2.

@viirya viirya closed this in 1fafa8e Jul 28, 2021
viirya pushed a commit that referenced this pull request Jul 28, 2021
…t of session window

### What changes were proposed in this pull request?

This PR proposes to update Sessionization examples to use native support of session window. It also adds the example for PySpark as native support of session window is available to PySpark as well.

### Why are the changes needed?

We should guide the simplest way to achieve the same workload. I'll provide another example for cases we can't do with native support of session window.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manually tested.

Closes #33548 from HeartSaVioR/SPARK-36314.

Authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
(cherry picked from commit 1fafa8e)
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
@HeartSaVioR
Contributor Author

Thanks all for the quick reviews and merge!

@thenooz

thenooz commented Aug 29, 2022

Hello fellow Spark Structured Streaming sessionizers. According to the Apache Spark docs, update output mode is not allowed with session windows. I tried running the example code on my local machine with 3.2.1 and 3.2.0, and it failed with this error:

org.apache.spark.sql.AnalysisException: Update output mode not supported for session window on streaming DataFrames/DataSets;

According to this example code it can be done. What am I missing?

@HeartSaVioR
Contributor Author

Ah, we seem to have missed updating the example... We allowed update mode at first and later disallowed it due to unclear semantics.

If you don't mind, could you please raise a PR that changes the output mode to append in the Scala/Java/Python session window examples? I can jump in if you're not available to address this. Thanks in advance!

@thenooz

thenooz commented Aug 29, 2022

Sure, but you see, if we do it with append mode, the sessions will be returned only after they are closed. That defeats the purpose of real-time session tagging. Should we not revert the example to the mapGroupsWithState method?
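To make the append-mode behavior concrete, here is a minimal pure-Python sketch (not Spark code). The `Session` tuple, the `emit_closed` helper, and all timestamps are hypothetical. It assumes a session is output once the watermark has reached its end, where the end is the last event time plus the gap.

```python
from typing import List, NamedTuple, Tuple


class Session(NamedTuple):
    user: str
    start: int  # event time of the first event, in seconds
    end: int    # event time of the last event plus the gap, in seconds


def emit_closed(
    open_sessions: List[Session], watermark: int
) -> Tuple[List[Session], List[Session]]:
    """Split sessions into (emitted, still_open) for a given watermark.

    Assumption: a session is final once the watermark has reached its end,
    because no accepted event can extend it any more.
    """
    emitted = [s for s in open_sessions if s.end <= watermark]
    still_open = [s for s in open_sessions if s.end > watermark]
    return emitted, still_open


sessions = [Session("A", 0, 15), Session("B", 20, 40)]

# Watermark at 10: no session is final yet, so nothing is output.
emitted_at_10, open_at_10 = emit_closed(sessions, watermark=10)

# Watermark at 30: user A's session is final and is output exactly once.
emitted_at_30, open_at_30 = emit_closed(open_at_10, watermark=30)
```

In this model, user A's session is visible downstream only once the watermark reaches 15, which is the latency being discussed here.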

@HeartSaVioR
Contributor Author

We still have a session window example built on (flat)mapGroupsWithState. We just don't want to duplicate the code example.

The main reason I proposed dropping update mode for session window lies in the nature of session windows. When someone uses update mode for a streaming query, it means not only that they want updates immediately, but also that they are going to "upsert" into the target table.

An upsert is normally performed by replacing the existing value(s) for a specific key with new value(s). If we take out-of-order events into account, a session window can expand arbitrarily: both the start and the end of a session can change. (Technically speaking, the old session window example was incorrect in terms of event-time semantics.) Given that Spark does not produce the old value(s), the query can't always upsert the result safely. E.g., user A has multiple sessions in the target table, and some of them are already closed. Spark never emits those again, so the query risks dropping these old-but-valid sessions.
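The hazard can be sketched in plain Python (this is an illustrative model, not Spark's implementation). The `sessionize` and `upsert` helpers, the 10-second `GAP`, the `(user, session_start)` key, and the event times are all hypothetical. Boundary handling is simplified, and the sink is assumed to receive only new values, never the old ones.

```python
from typing import Dict, List, Tuple

GAP = 10  # hypothetical session gap, in seconds


def sessionize(event_times: List[int], gap: int = GAP) -> List[Tuple[int, int]]:
    """Merge event times into (start, end) sessions; end = last event + gap."""
    sessions: List[Tuple[int, int]] = []
    for t in sorted(event_times):
        if sessions and t < sessions[-1][1]:
            # The event falls inside the current session, so extend it.
            sessions[-1] = (sessions[-1][0], t + gap)
        else:
            # The event is outside every session so far; start a new one.
            sessions.append((t, t + gap))
    return sessions


# Target table keyed by (user, session_start), storing the session end.
table: Dict[Tuple[str, int], int] = {}


def upsert(user: str, sessions: List[Tuple[int, int]]) -> None:
    """Write only the new values, with no knowledge of the old rows."""
    for start, end in sessions:
        table[(user, start)] = end


# First micro-batch: events at 100 and 105 form one session (100, 115).
upsert("A", sessionize([100, 105]))

# Second micro-batch: an out-of-order event at 95 moves the start to 95.
upsert("A", sessionize([95, 100, 105]))

# The row keyed by the old start (100) is never retracted and stays behind.
stale_rows = sorted(table.items())
```

After both batches the table holds rows for starts 95 and 100, although only one session exists. Keying by user alone would avoid the stale row but would overwrite closed sessions that are never re-emitted.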

@thenooz

thenooz commented Aug 29, 2022

Got it. So the problem is that we risk making wrong real-time decisions based on immediate updates, since the session we calculated may well turn out to be incorrect once late events arrive. So the safer and more accurate solution would be append mode with watermarking.
But still, back to a probably annoying question: can we allow update mode with session window, with a warning? You know, for people who take big risks.
Also, I will look into updating the example code. I am new to this.
Thank you very much.
