[SPARK-36463][SS] Prohibit update mode in streaming aggregation with session window #33689
To respect the semantics of "update" properly, we would need "retraction", which emits two separate events: one that removes the old session and one that inserts the updated session. This could be implemented manually with flatMapGroupsWithState, though I'm not sure it would be performant enough in practice, since it requires two operations, "delete" and "insert", against external storage.
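A minimal sketch of what such retraction output could look like, in plain Scala rather than Spark code. The names `Session`, `Change`, `Retract`, `Insert`, and `changes` are hypothetical and exist only for illustration; nothing here is part of this PR or of the Spark API.

```scala
object RetractionSketch {
  // Hypothetical aggregated session row: grouping key, session range, and event count.
  final case class Session(key: String, start: Long, end: Long, numEvents: Long)

  // Hypothetical change events that a retraction-aware sink would consume.
  sealed trait Change
  final case class Retract(session: Session) extends Change
  final case class Insert(session: Session) extends Change

  // Emit the changes needed to move the sink from the previous session to the updated one.
  def changes(previous: Option[Session], updated: Session): Seq[Change] =
    previous match {
      case Some(prev) if prev == updated => Seq.empty                         // nothing changed
      case Some(prev)                    => Seq(Retract(prev), Insert(updated)) // "delete" + "insert"
      case None                          => Seq(Insert(updated))              // brand-new session
    }

  def main(args: Array[String]): Unit = {
    val before = Session("user-1", start = 10L, end = 20L, numEvents = 2L)
    // A late event at time 5 extends the session backward.
    val after = Session("user-1", start = 5L, end = 20L, numEvents = 3L)
    changes(Some(before), after).foreach(println)
  }
}
```

Every changed session costs two writes on the sink side, which is the performance concern raised above.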
"numEvents")
sessionUpdates.explain()
val sessionUpdates = sessionWindowQuery(inputData)
Most of the changes in this suite are about deduplicating queries. We can simply use two queries (keyed window vs. global window) regardless of output mode.
Kubernetes integration test starting
Kubernetes integration test status success
I think we also need to update the documentation (or at least mention this change there).
Looks okay to me. I will look at this again once the documentation is updated too.
Test build #142253 has finished for PR 33689 at commit
Test build #142269 has finished for PR 33689 at commit
To be clear, I looked at the current approach for UPDATE mode. We consider a row updated if there is no existing row with the same state key (session key + session start time), or if the stored value differs from the current value that will be stored. But the session window may also have been extended backward, which changes the session start time. So the set of updated rows is not actually accurate.
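The inaccuracy can be illustrated with a small plain-Scala model of that check. This is only a sketch under the assumptions stated in the comment above (state keyed by session key + session start time); `Session`, `updatedRows`, and `staleKeys` are hypothetical names, not Spark internals.

```scala
object UpdateDetectionSketch {
  // Hypothetical aggregated session row.
  final case class Session(key: String, start: Long, end: Long, numEvents: Long)

  // State key as described above: session key + session start time.
  type StateKey = (String, Long)

  // A row counts as "updated" when no state exists for its key,
  // or the stored value differs from the value about to be stored.
  def updatedRows(state: Map[StateKey, Session], current: Seq[Session]): Seq[Session] =
    current.filter(s => !state.get((s.key, s.start)).contains(s))

  // State keys that no longer correspond to any current session.
  // The update output never mentions them, so a sink cannot learn they are gone.
  def staleKeys(state: Map[StateKey, Session], current: Seq[Session]): Set[StateKey] =
    state.keySet.diff(current.map(s => (s.key, s.start)).toSet)

  // Stored state: one session [10, 20) with two events.
  val state: Map[StateKey, Session] =
    Map(("user-1", 10L) -> Session("user-1", 10L, 20L, 2L))

  // A late event at time 5 extends the session backward to [5, 20).
  val current: Seq[Session] = Seq(Session("user-1", 5L, 20L, 3L))

  def main(args: Array[String]): Unit = {
    println(updatedRows(state, current)) // the extended session looks like a new row
    println(staleKeys(state, current))   // the old start time silently disappears
  }
}
```

The extended session is reported as if it were a fresh row, while the row stored under the old start time is never signalled as removed.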
I think the semantics are meaningful only when end users can store the output correctly. That said, we should evaluate the semantics from the end users' point of view. They will evaluate whether they need to see the grouping key as If they leverage their knowledge about streaming aggregation, they will consider the key as If they consider the key as
Thanks for updating the description. LGTM.
Thanks again @viirya for the quick review! Merging to master/3.2.
…session window

### What changes were proposed in this pull request?

This PR proposes to prohibit update mode in streaming aggregation with session window.

UnsupportedOperationChecker will check and prohibit the case. As a side effect, this PR also simplifies the code as we can remove the implementation of iterator to support outputs of update mode.

This PR also cleans up test code via deduplicating.

### Why are the changes needed?

The semantic of "update" mode for session window based streaming aggregation is quite unclear.

For normal streaming aggregation, Spark will provide the outputs which can be "upsert"ed based on the grouping key. This is based on the fact grouping key won't be changed.

This doesn't hold true for session window based streaming aggregation, as session range is changing.

If end users leverage their knowledge about streaming aggregation, they will consider the key as grouping key + session (since they'll specify these things in groupBy), and it's high likely possible that existing row is not updated (overwritten) and ended up with having different rows.

If end users consider the key as grouping key, there's a small chance for end users to upsert the session correctly, though only the last updated session will be stored so it won't work with event time processing which there could be multiple active sessions.

### Does this PR introduce _any_ user-facing change?

No, as we haven't released this feature.

### How was this patch tested?

Updated tests.

Closes #33689 from HeartSaVioR/SPARK-36463.

Authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
(cherry picked from commit ed60aaa)
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
What changes were proposed in this pull request?
This PR proposes to prohibit update mode in streaming aggregation with session window.
UnsupportedOperationChecker will check for and prohibit this case. As a side effect, this PR also simplifies the code, since we can remove the iterator implementation that supported update-mode output.
This PR also cleans up test code by deduplicating it.
Why are the changes needed?
The semantics of "update" mode for session-window-based streaming aggregation are quite unclear.
For normal streaming aggregation, Spark provides outputs that can be "upsert"ed based on the grouping key. This relies on the fact that the grouping key does not change.
This does not hold for session-window-based streaming aggregation, because the session range changes.
If end users rely on their knowledge of streaming aggregation, they will treat the key as grouping key + session (since that is what they specify in groupBy), and it is highly likely that an existing row is not updated (overwritten), so the sink ends up with different rows for the same session.
If end users treat the key as the grouping key alone, there is a small chance they can upsert the session correctly, but only the last updated session will be stored, so this won't work with event-time processing, where there can be multiple active sessions.
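Both failure modes can be reproduced with a map standing in for an upsert sink. This is a plain-Scala sketch, not Spark code; `Session`, `upsertAll`, and the sample data are hypothetical and chosen only to illustrate the argument.

```scala
object SessionUpsertSketch {
  // Hypothetical aggregated session row.
  final case class Session(key: String, start: Long, end: Long, numEvents: Long)

  // Upsert every output row into a map-backed "sink", using the key the end user picked.
  def upsertAll[K](outputs: Seq[Session])(keyOf: Session => K): Map[K, Session] =
    outputs.foldLeft(Map.empty[K, Session])((sink, s) => sink.updated(keyOf(s), s))

  // Outputs over two micro-batches: session [10, 20) is later extended backward to [5, 20).
  val extended: Seq[Session] =
    Seq(Session("user-1", 10L, 20L, 2L), Session("user-1", 5L, 20L, 3L))

  // Two distinct sessions of the same user that are active at the same time.
  val concurrent: Seq[Session] =
    Seq(Session("user-1", 10L, 20L, 2L), Session("user-1", 100L, 110L, 1L))

  // Key = grouping key + session: the old row is never overwritten, so a stale row remains.
  val byKeyAndSession: Map[(String, Long, Long), Session] =
    upsertAll(extended)(s => (s.key, s.start, s.end))

  // Key = grouping key only: the second session overwrites the first one.
  val byKeyOnly: Map[String, Session] =
    upsertAll(concurrent)(_.key)

  def main(args: Array[String]): Unit = {
    println(byKeyAndSession.size) // two rows for what is really one session
    println(byKeyOnly.size)       // one row for what are really two sessions
  }
}
```

With grouping key + session as the key, the sink keeps both the old and the extended session; with the grouping key alone, one of two active sessions is lost.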
Does this PR introduce any user-facing change?
No, as we haven't released this feature.
How was this patch tested?
Updated tests.