[SPARK-34893][SS] Support session window natively #33081
Conversation
```diff
@@ -181,7 +181,8 @@ class UpdatingSessionsIterator(

   private val valueProj = GenerateUnsafeProjection.generate(valuesExpressions, inputSchema)
   private val restoreProj = GenerateUnsafeProjection.generate(inputSchema,
-    groupingExpressions.map(_.toAttribute) ++ valuesExpressions.map(_.toAttribute))
+    groupingWithoutSession.map(_.toAttribute) ++ Seq(sessionExpression.toAttribute) ++
```
There's a case where the session column is not placed at the end of the key; this line addresses that case.
Same here; I marked this as draft since the other PRs have to be reviewed and merged first. I'll rebase this PR once all the other PRs are merged.
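The reordering described in the comment above can be illustrated with a plain-Scala sketch. All names here are hypothetical stand-ins (real code works on Catalyst attributes, not strings): the restore projection must emit the non-session grouping keys first, then the session column, then the value columns, regardless of where the session column sat in the original grouping key.

```scala
// Hypothetical sketch of the attribute reordering described above.
// In real code these are Catalyst attributes; plain case classes suffice here.
case class Attr(name: String)

val groupingWithoutSession = Seq(Attr("userId"), Attr("country"))
val sessionExpression = Attr("session")
val valuesExpressions = Seq(Attr("count"))

// Mirrors groupingWithoutSession ++ Seq(session) ++ values:
// the session column is pinned right after the other grouping keys,
// even if it appeared elsewhere (e.g. first) in the original key.
val restored =
  groupingWithoutSession ++ Seq(sessionExpression) ++ valuesExpressions
```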
Kubernetes integration test unable to build dist. exiting with code: 1
Test build #140314 has finished for PR 33081 at commit
Force-pushed from af538b9 to 0af2fd1
Kubernetes integration test starting
Kubernetes integration test status success
Test build #140999 has finished for PR 33081 at commit
Force-pushed from 0af2fd1 to 11b44bb
Kubernetes integration test starting
Kubernetes integration test status success
Test build #141015 has finished for PR 33081 at commit
Test build #141013 has finished for PR 33081 at commit
Test build #141022 has finished for PR 33081 at commit
Kubernetes integration test unable to build dist. exiting with code: 1
cc @viirya @xuanyuanking This is the last PR for the session window feature. Please take a look. Thanks!
Kubernetes integration test starting
Kubernetes integration test status success
I will take another look tonight.
Force-pushed from 4a6cff3 to e7a2a37
Rebased onto the master branch to see whether the pyspark lint failure was introduced by another PR and has since been fixed.
Kubernetes integration test unable to build dist. exiting with code: 1
Test build #141110 has finished for PR 33081 at commit
```scala
val providerOptions = Seq(
  classOf[HDFSBackedStateStoreProvider].getCanonicalName).map { value =>
  (SQLConf.STATE_STORE_PROVIDER_CLASS.key, value.stripSuffix("$"))
```
Don't we want to test against RocksDBStateStoreProvider?
Missed spot. Nice finding!
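To cover RocksDB as well, the provider list can be parameterized over both state store providers. A hedged sketch in plain Scala — the fully-qualified class names and the config key are assumptions based on Spark 3.2, shown as plain strings (not this PR's actual code) to keep the sketch self-contained:

```scala
// Sketch: run the same test against both state store providers.
// Class names and the config key are assumptions, not taken from this PR.
val providerClasses = Seq(
  "org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider",
  "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider")

val providerOptions = providerClasses.map { value =>
  // getCanonicalName on a Scala object's class ends with "$"; strip it
  // so the config value names the provider class itself.
  ("spark.sql.streaming.stateStore.providerClass", value.stripSuffix("$"))
}
```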
Only one question about test coverage for RocksDB. Otherwise LGTM. Thanks for the work!
Test build #141131 has finished for PR 33081 at commit
Kubernetes integration test starting
Kubernetes integration test status success
GA passed. Thanks! Merging to master/3.2!
Introduction: this PR is the last part of SPARK-10816 (event-time based sessionization / session window). Please refer to #31937 for an overall view of the code change. (Note that the code diff may have diverged a bit.)

### What changes were proposed in this pull request?

This PR proposes to support session windows natively. Please refer to the comments/design doc in SPARK-10816 for more details on the rationale and design (which may be slightly outdated compared to this PR).

The boundary of a "session window" is defined as [timestamp of the start event ~ timestamp of the last event + gap duration). Unlike a time window, a session window is a dynamic window which can expand when a new input row is added to the session. To handle expansion, Spark defines a session window per input row and "merges" windows whose boundaries overlap.

This PR leverages two different approaches to merging session windows:

1. merging session windows with Spark's aggregation logic (a variant of sort aggregation)
2. updating the session window for all rows bound to the same session, and applying the aggregation logic afterwards

The first approach is preferable as it outperforms the second, but it can only be used when merging session windows can be applied together with aggregation. Since that is not applicable in all cases, the second approach covers the remaining ones.

This PR also applies an optimization to merging input rows with existing sessions while retaining the order (group keys + start timestamp of the session window), leveraging the fact that the number of existing sessions per group key won't be huge.

The state format is versioned, so that we can introduce a new state format if we find a better one.

### Why are the changes needed?

For now, to deal with sessionization, Spark requires end users to work with (flat)MapGroupsWithState directly, which has a couple of major drawbacks:

1. (flat)MapGroupsWithState is a lower-level API and end users have to code every detail of defining session windows and merging windows
2. built-in aggregate functions cannot be used and end users have to handle aggregation by themselves
3. (flat)MapGroupsWithState is only available in Scala/Java

With native support for session windows, end users simply use "session_window" as they use "window" for tumbling/sliding windows, and leverage built-in aggregate functions as well as UDAFs to define aggregations. Quoting the query example from the test suite:

```scala
val inputData = MemoryStream[(String, Long)]

// Split the lines into words, treat words as sessionId of events
val events = inputData.toDF()
  .select($"_1".as("value"), $"_2".as("timestamp"))
  .withColumn("eventTime", $"timestamp".cast("timestamp"))
  .selectExpr("explode(split(value, ' ')) AS sessionId", "eventTime")
  .withWatermark("eventTime", "30 seconds")

val sessionUpdates = events
  .groupBy(session_window($"eventTime", "10 seconds") as 'session, 'sessionId)
  .agg(count("*").as("numEvents"))
  .selectExpr("sessionId", "CAST(session.start AS LONG)", "CAST(session.end AS LONG)",
    "CAST(session.end AS LONG) - CAST(session.start AS LONG) AS durationMs",
    "numEvents")
```

which is the same as StructuredSessionization (the native session window version is shorter and clearer even ignoring the model classes).

https://github.com/apache/spark/blob/39542bb81f8570219770bb6533c077f44f6cbd2a/examples/src/main/scala/org/apache/spark/examples/sql/streaming/StructuredSessionization.scala#L66-L105

(Worth noting that the code in StructuredSessionization only works with processing time; it doesn't consider that an old event can update the start time of an old session.)

### Does this PR introduce _any_ user-facing change?

Yes. This PR brings the new feature of supporting session windows on both batch and streaming queries, adding a new function "session_window" whose usage is similar to "window".

### How was this patch tested?

New test suites. Also tested with benchmark code.

Closes #33081 from HeartSaVioR/SPARK-34893-SPARK-10816-PR-31570-part-5.

Lead-authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Co-authored-by: Yuanjian Li <yuanjian.li@databricks.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
(cherry picked from commit f2bf8b0)
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
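The session window definition above can be made concrete with a small, Spark-independent sketch: each event at time `ts` initially gets the window `[ts, ts + gap)`, and overlapping per-row windows are merged into one session. All names here are illustrative, not the PR's actual classes.

```scala
// Illustrative only: per-row session windows and their merge step.
case class Window(start: Long, end: Long) // end is exclusive

def assignAndMerge(eventTimes: Seq[Long], gap: Long): List[Window] =
  eventTimes.sorted.foldLeft(List.empty[Window]) { (acc, ts) =>
    acc.lastOption match {
      // The new per-row window [ts, ts + gap) overlaps the current session:
      // expand the session (this is how a session window grows dynamically).
      case Some(last) if ts < last.end =>
        acc.init :+ Window(last.start, math.max(last.end, ts + gap))
      case _ =>
        acc :+ Window(ts, ts + gap) // gap elapsed: start a new session
    }
  }
```

With a 10-second gap, events at 0 and 5 collapse into one session [0, 15), while an event at 30 starts a new one.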
Thanks all for reviewing! I merged this into master/3.2. I filed https://issues.apache.org/jira/browse/SPARK-36172 to deal with documentation - we probably need to explain session windows and how to use them in the Structured Streaming guide doc.
Test build #141145 has finished for PR 33081 at commit
Thank you @HeartSaVioR and @xuanyuanking!
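The second merging approach described in this PR (update every row with its final session window first, then apply normal aggregation afterwards) can also be sketched in plain Scala. All names are illustrative and do not correspond to the PR's actual classes:

```scala
// Hypothetical sketch of "update sessions per row, then aggregate".
case class Event(key: String, ts: Long)
case class Session(start: Long, end: Long) // end is exclusive

def sessionize(events: Seq[Event], gap: Long): Map[(String, Session), Int] =
  events.groupBy(_.key).flatMap { case (key, rows) =>
    // Step 1: merge per-row windows [ts, ts + gap) into final sessions.
    val sessions = rows.map(_.ts).sorted.foldLeft(List.empty[Session]) {
      (acc, ts) =>
        acc.lastOption match {
          case Some(last) if ts < last.end =>
            acc.init :+ Session(last.start, math.max(last.end, ts + gap))
          case _ => acc :+ Session(ts, ts + gap)
        }
    }
    // Step 2: update each row with the session it belongs to...
    val tagged = rows.map { e =>
      ((key, sessions.find(s => e.ts >= s.start && e.ts < s.end).get), e)
    }
    // Step 3: ...then apply a normal aggregation per (key, session): count(*).
    tagged.groupBy(_._1).map { case (k, vs) => k -> vs.size }
  }
```

For events a@0, a@5, b@0 with a 10-unit gap, key "a" yields one session [0, 15) containing both rows, and key "b" yields [0, 10) with one row.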