
[SPARK-36463][SS] Prohibit update mode in streaming aggregation with session window #33689

Closed
wants to merge 2 commits

Conversation

@HeartSaVioR (Contributor) commented Aug 10, 2021

What changes were proposed in this pull request?

This PR proposes to prohibit update mode in streaming aggregation with session window.

UnsupportedOperationChecker will detect and prohibit this case. As a side effect, this PR also simplifies the code, since we can remove the iterator implementation that produced update-mode outputs.
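For illustration, here is a minimal sketch of the kind of query that is now rejected; the rate source, column names, and durations are placeholder assumptions, not code from this PR:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, count, session_window}

val spark = SparkSession.builder().appName("session-window-update-demo").getOrCreate()

// Streaming aggregation grouped on a session window (placeholder source/columns).
val events = spark.readStream.format("rate").load()
  .selectExpr("timestamp", "CAST(value % 10 AS STRING) AS userId")

val sessionCounts = events
  .withWatermark("timestamp", "10 minutes")
  .groupBy(session_window(col("timestamp"), "5 minutes"), col("userId"))
  .agg(count("*").as("numEvents"))

// With this PR, analysis fails here because of the update output mode;
// append (with a watermark) or complete mode has to be used instead.
sessionCounts.writeStream
  .outputMode("update")
  .format("console")
  .start()
```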

This PR also cleans up test code by deduplicating queries.

Why are the changes needed?

The semantics of "update" mode for session-window-based streaming aggregation are quite unclear.

For normal streaming aggregation, Spark provides outputs that can be "upsert"ed based on the grouping key. This relies on the fact that the grouping key never changes.

This doesn't hold for session-window-based streaming aggregation, since the session range can change.

If end users apply their knowledge of streaming aggregation, they will treat the key as grouping key + session (since that is what they specify in groupBy), and it is quite likely that an existing row is not updated (overwritten), leaving the sink with multiple different rows for the same session.

If end users treat the key as just the grouping key, there is a small chance they can upsert the session correctly, though only the last updated session will be stored, so it won't work with event-time processing, where there can be multiple active sessions.
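As a concrete illustration (an invented example, not taken from the PR), suppose the sink is a key-value store upserted with the key users would naturally pick, grouping key plus session start:

```scala
// Illustrative only: upsert update-mode output keyed by (userId, sessionStart).
val sink = scala.collection.mutable.Map.empty[(String, Long), Long]

// Batch 1: the session for user "a" spans [100, 110) with 3 events.
sink.update(("a", 100L), 3L)

// Batch 2: a late event at t = 95 extends the session backward to [95, 110).
// The new output row carries a different session start, so it does not
// overwrite the old row; the sink now holds two rows for one logical session.
sink.update(("a", 95L), 4L)

assert(sink.size == 2) // the stale row ("a", 100) -> 3 is never removed
```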

Does this PR introduce any user-facing change?

No, as we haven't released this feature.

How was this patch tested?

Updated tests.

@HeartSaVioR (Contributor, Author) commented:

To respect the "update" semantics properly, we would need "retraction", which provides two different events: remove the old session and insert the updated session. This could be implemented manually with flatMapGroupsWithState, though I'm not sure it would be performant enough in practice, since it requires two operations ("delete" and "insert") against external storage.
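A rough sketch of that retraction idea, simplified to a single session per key with a fixed gap and no timeouts; the names and details below are assumptions for illustration, not code from Spark or this PR:

```scala
import org.apache.spark.sql.streaming.GroupState

case class Event(userId: String, ts: Long)
case class SessionState(start: Long, end: Long, count: Long)
// action is "delete" (retract the previously emitted session row)
// or "insert" (emit the updated session row).
case class SessionChange(action: String, userId: String, start: Long, end: Long, count: Long)

def updateSession(
    userId: String,
    events: Iterator[Event],
    state: GroupState[SessionState]): Iterator[SessionChange] = {
  val gapMs = 10000L
  val old = state.getOption
  val zero = old.getOrElse(SessionState(Long.MaxValue, Long.MinValue, 0L))
  val merged = events.foldLeft(zero) { (s, e) =>
    SessionState(math.min(s.start, e.ts), math.max(s.end, e.ts + gapMs), s.count + 1)
  }
  state.update(merged)
  // Retract the old row (if any) and insert the updated one; the sink then has
  // to apply these as a delete followed by an insert against external storage.
  val retract = old.filterNot(_ == merged)
    .map(s => SessionChange("delete", userId, s.start, s.end, s.count))
  val insert = SessionChange("insert", userId, merged.start, merged.end, merged.count)
  (retract.toList :+ insert).iterator
}

// Wiring sketch:
// events.groupByKey(_.userId)
//   .flatMapGroupsWithState(OutputMode.Update, GroupStateTimeout.NoTimeout)(updateSession _)
```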

"numEvents")

sessionUpdates.explain()
val sessionUpdates = sessionWindowQuery(inputData)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of the changes in this suite are about deduplicating the queries. We can simply use two queries (keyed window vs. global window) regardless of output mode.
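For context, the shared helper referenced in the diff above might look roughly like this; the column names, gap duration, and watermark are guesses, not the actual test code:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.execution.streaming.MemoryStream
import org.apache.spark.sql.functions.{col, count, session_window}

// Hypothetical shape of the shared test helper; details are assumptions.
def sessionWindowQuery(input: MemoryStream[(String, Long)]): DataFrame = {
  input.toDF().toDF("sessionId", "time")
    .withColumn("eventTime", col("time").cast("timestamp"))
    .withWatermark("eventTime", "10 seconds")
    .groupBy(session_window(col("eventTime"), "5 seconds"), col("sessionId"))
    .agg(count("*").as("numEvents"))
}
```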

@SparkQA commented Aug 10, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46760/

@SparkQA commented Aug 10, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46760/

@viirya (Member) left a comment

I think we also need to update the documentation (or at least mention this there).

@viirya (Member) left a comment

Looks okay to me. I will take another look after the documentation is updated as well.

@github-actions bot added the DOCS label Aug 10, 2021
@SparkQA commented Aug 10, 2021

Test build #142253 has finished for PR 33689 at commit 5eabdaf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 10, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46776/

@SparkQA commented Aug 10, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46776/

@SparkQA commented Aug 10, 2021

Test build #142269 has finished for PR 33689 at commit 188fe70.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya (Member) commented Aug 10, 2021

> This doesn't hold for session-window-based streaming aggregation. If you're trying to upsert the output based on the grouping key, it is quite likely that an existing row is not updated (overwritten), ending up with multiple different rows.

To be clear, I looked at the current approach for UPDATE mode. We consider a row updated if there is no existing row with the state key (session key + session start time), or if the stored value isn't the same as the current value that will be stored.
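In other words (a code paraphrase of the check described above, not the actual implementation), the decision behaves roughly like:

```scala
// Paraphrase of the described check; not the real state store code.
case class SessionRow(sessionKey: String, sessionStart: Long, value: Long)

def isUpdated(store: Map[(String, Long), SessionRow], row: SessionRow): Boolean =
  store.get((row.sessionKey, row.sessionStart)) match {
    case None           => true            // no existing row with this state key
    case Some(existing) => existing != row // the stored value differs from the new one
  }
```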

But it is also possible that the session window was extended backward, changing the session start time, so the set of "updated" rows is not actually accurate.

@HeartSaVioR (Contributor, Author) commented Aug 11, 2021

I think the semantics are meaningful only when end users can store the output correctly. That said, we should evaluate the semantics from the end users' point of view. They will decide whether to treat the key as the grouping key vs. grouping key + session. Grouping key + session start is something Spark uses internally as the state key, which end users wouldn't know about, so it has no meaning from their point of view.

If they apply their knowledge of streaming aggregation, they will treat the key as grouping key + session (since that is what they specify in groupBy), for which I already demonstrated the problem.

If they treat the key as just the grouping key, there is a chance they can upsert the session correctly, though only the last updated session will be stored, so it won't work with event-time processing, where there can be multiple active sessions.

@viirya (Member) commented Aug 11, 2021

Thanks for updating the description. LGTM.

@HeartSaVioR (Contributor, Author) commented:

Thanks again @viirya for the quick review! Merging to master/3.2.

HeartSaVioR added a commit that referenced this pull request Aug 11, 2021
[SPARK-36463][SS] Prohibit update mode in streaming aggregation with session window

### What changes were proposed in this pull request?

This PR proposes to prohibit update mode in streaming aggregation with session window.

UnsupportedOperationChecker will detect and prohibit this case. As a side effect, this PR also simplifies the code, since we can remove the iterator implementation that produced update-mode outputs.

This PR also cleans up test code by deduplicating queries.

### Why are the changes needed?

The semantics of "update" mode for session-window-based streaming aggregation are quite unclear.

For normal streaming aggregation, Spark provides outputs that can be "upsert"ed based on the grouping key. This relies on the fact that the grouping key never changes.

This doesn't hold for session-window-based streaming aggregation, since the session range can change.

If end users apply their knowledge of streaming aggregation, they will treat the key as grouping key + session (since that is what they specify in groupBy), and it is quite likely that an existing row is not updated (overwritten), leaving the sink with multiple different rows for the same session.

If end users treat the key as just the grouping key, there is a small chance they can upsert the session correctly, though only the last updated session will be stored, so it won't work with event-time processing, where there can be multiple active sessions.

### Does this PR introduce _any_ user-facing change?

No, as we haven't released this feature.

### How was this patch tested?

Updated tests.

Closes #33689 from HeartSaVioR/SPARK-36463.

Authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
(cherry picked from commit ed60aaa)
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
flyrain pushed a commit to flyrain/spark that referenced this pull request Sep 21, 2021
[SPARK-36463][SS] Prohibit update mode in streaming aggregation with session window

Closes apache#33689 from HeartSaVioR/SPARK-36463.

Authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
(cherry picked from commit ed60aaa)
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>