[SPARK-20030][SS] Event-time-based timeout for MapGroupsWithState #17361

Closed
wants to merge 10 commits into from

Contributor

@tdas tdas commented Mar 20, 2017

What changes were proposed in this pull request?

Adds event-time-based timeout. The user sets the timeout timestamp directly using KeyedState.setTimeoutTimestamp. A key times out when the watermark crosses its timeout timestamp.
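As a rough illustration of the rule described above, the following toy model tracks one timeout timestamp per key and reports which keys the watermark has crossed. It does not use Spark; `EventTimeTimeoutSketch` and `timedOutKeys` are hypothetical names, and the strict comparison is an assumption about what "crosses" means.

```scala
// Toy model only: these names are hypothetical and are not Spark classes.
object EventTimeTimeoutSketch {
  /** Per-key timeout timestamps in epoch milliseconds, as a state function might set them. */
  type Timeouts = Map[String, Long]

  /** Keys whose timeout timestamp lies strictly before the watermark (assumed meaning of "crossed"). */
  def timedOutKeys(timeouts: Timeouts, watermarkMs: Long): Set[String] =
    timeouts.collect { case (key, timeoutMs) if watermarkMs > timeoutMs => key }.toSet
}
```

For example, with timeouts of 1000 ms for key "a" and 5000 ms for key "b", a watermark of 3000 ms times out only "a".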

How was this patch tested?

Unit tests

SparkQA commented Mar 20, 2017

Test build #74881 has finished for PR 17361 at commit 3f77c01.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Mar 20, 2017

Test build #74894 has finished for PR 17361 at commit 6e9f408.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Mar 21, 2017

Test build #74910 has finished for PR 17361 at commit ac17886.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Mar 21, 2017

Test build #74919 has finished for PR 17361 at commit f6d2143.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

if (outputMode != InternalOutputModes.Update) {
throwError("flatMapGroupsWithState in update mode is not supported with " +
// mapGroupsWithState and flatMapGroupsWithState
case m: FlatMapGroupsWithState if m.isStreaming =>
Contributor Author

@tdas tdas Mar 21, 2017

Refactored this to contain all tests related to map/flatMapGroupsWithState under a single case statement. This way it's easier to reason about whether all the possible combinations of operator + output mode + aggregation have been covered.

It also consolidates all the "valid combinations" of mode + aggs on which additional checks can be made (e.g. presence of a watermark when timeoutConf = EventTimeTimeout).

Contributor Author

This refactoring passes all existing tests in UnsupportedOperationsSuite.

Contributor

Wow, this is getting complicated...

@tdas tdas changed the title from "[SPARK-20030][SS][WIP]Event-time-based timeout for MapGroupsWithState" to "[SPARK-20030][SS] Event-time-based timeout for MapGroupsWithState" Mar 21, 2017

SparkQA commented Mar 21, 2017

Test build #74969 has finished for PR 17361 at commit 2c5592c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Mar 21, 2017

Test build #74971 has finished for PR 17361 at commit 0523aaf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Mar 21, 2017

Test build #75003 has finished for PR 17361 at commit d0758eb.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Mar 21, 2017

Test build #75001 has finished for PR 17361 at commit 6759165.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

/**
* Timeout based on processing time. The duration of timeout can be set for each group in
* `map/flatMapGroupsWithState` by calling `KeyedState.setTimeoutDuration()`.
*/
public static KeyedStateTimeout ProcessingTimeTimeout() { return ProcessingTimeTimeout$.MODULE$; }
Contributor

Nit: I'd consider removing the Timeout suffix here, as it's kind of redundant.

Contributor Author

It's just that if someone does import KeyedStateTimeout._ then the code boils down to
flatMapGroupsWithState(Update, ProcessingTime) { ... } with no reference to timeout.

Fine either way.

Contributor

I'd probably still remove it.
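To make the naming trade-off concrete, here is a small sketch of how the two spellings read at a call site after a wildcard import. The `NamingSketch` object, its members, and `describe` are hypothetical stand-ins written for this illustration; they are not Spark's definitions.

```scala
// Hypothetical stand-ins for illustration; not the definitions in this PR.
object NamingSketch {
  sealed trait TimeoutConf
  object TimeoutConf {
    case object ProcessingTimeTimeout extends TimeoutConf // name with the suffix
    case object ProcessingTime extends TimeoutConf        // shorter name under discussion
  }

  import TimeoutConf._

  /** Stand-in for an operator call that takes a timeout configuration. */
  def describe(conf: TimeoutConf): String = s"timeoutConf = $conf"

  // After the wildcard import, only the bare object name appears at the call site.
  val withSuffix: String = describe(ProcessingTimeTimeout)
  val withoutSuffix: String = describe(ProcessingTime)
}
```

With the shorter name, nothing at the call site says that the argument configures a timeout, which is the concern raised above.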

}
if (watermarkAttributes.isEmpty) {
throwError(
"Event time timeout is not supported in a [map|flatMap]GroupsWithState " +
Contributor

I think we are hyphenating event-time?

Contributor Author

Are we? I didn't know there was a policy. I am fine with hyphenating.

Contributor

I want it to be consistent, and the docs hyphenate.

Contributor Author

Ah, I get it. The Apache docs do hyphenate.

if (watermarkAttributes.isEmpty) {
throwError(
"Event time timeout is not supported in a [map|flatMap]GroupsWithState " +
"without watermark. Use '[Dataset/DataFrame].withWatermark()' to " +
Contributor

@marmbrus marmbrus Mar 21, 2017

I would consider making this an affirmative statement. "You must define a watermark on a dataframe in order to use event-time based timeouts".

Contributor Author

Okay.
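A minimal sketch of the check being discussed, with the message phrased affirmatively as suggested. The types and the `check` function are hypothetical stand-ins rather than Spark's analysis classes, and the message wording is only one possible phrasing.

```scala
// Hypothetical sketch; these types are stand-ins, not Spark's analysis classes.
object WatermarkCheckSketch {
  sealed trait TimeoutConf
  case object NoTimeout extends TimeoutConf
  case object ProcessingTimeTimeout extends TimeoutConf
  case object EventTimeTimeout extends TimeoutConf

  /** Returns an error message if the configuration is invalid, otherwise None. */
  def check(timeoutConf: TimeoutConf, watermarkColumns: Seq[String]): Option[String] =
    timeoutConf match {
      // Event-time timeouts need a watermark to compare timeout timestamps against.
      case EventTimeTimeout if watermarkColumns.isEmpty =>
        Some("You must define a watermark in order to use event-time-based timeouts in " +
          "[map|flatMap]GroupsWithState. Use '[Dataset/DataFrame].withWatermark()' to define one.")
      case _ => None
    }
}
```

Returning an `Option[String]` instead of throwing keeps the sketch easy to test; the code under review calls `throwError` instead.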

.withWatermark("eventTime", "10 seconds")
.as[(String, Long)]
.groupByKey[String]((x: (String, Long)) => x._1)
.flatMapGroupsWithState[Long, (String, Int)](Update, EventTimeTimeout)(stateFunc)
Contributor

These types are just here for testing? (i.e. we didn't break inference right?)

Contributor Author

I was debugging and left them there, thinking they would help the readability of the tests. I can remove them.

Contributor

As long as they aren't required, it's okay.

* Timeout based on event time. The event time timestamp for timeout can be set for each
* group in `map/flatMapGroupsWithState` by calling `KeyedState.setTimeoutTimestamp()`.
* In addition, you have to define the watermark in the query using `Dataset.withWatermark`.
* When the watermark advances beyond the set timestamp of a group, then the group times out.
Contributor

And no data has arrived for that group.
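The extra condition can be sketched as a per-trigger filter: a group times out only if the watermark has advanced beyond its timeout timestamp and the trigger carries no new data for it. This is a toy model with hypothetical names (`TimeoutWithDataSketch`, `timedOut`), not the operator's actual implementation.

```scala
// Toy model with hypothetical names; not the operator's implementation.
object TimeoutWithDataSketch {
  /**
   * Keys that time out in one trigger: the watermark is strictly beyond the key's
   * timeout timestamp and the trigger contains no new data for that key.
   */
  def timedOut(
      timeouts: Map[String, Long],
      keysWithNewData: Set[String],
      watermarkMs: Long): Set[String] =
    timeouts.collect {
      case (key, timeoutMs) if watermarkMs > timeoutMs && !keysWithNewData(key) => key
    }.toSet
}
```

In this model a key that receives data in the trigger is never reported as timed out, even if its old timeout timestamp is behind the watermark.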

* @param isTimeoutEnabled Whether timeout is enabled. This will be used to check whether the user
* is allowed to configure timeouts.
* @param timeoutConf Type of timeout configured. Based on this, different operations will
* be supported.
Contributor

nit: indent is inconsistent

@marmbrus
Contributor

LGTM

SparkQA commented Mar 22, 2017

Test build #75010 has finished for PR 17361 at commit 64b6abf.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Mar 22, 2017

Test build #75017 has finished for PR 17361 at commit 9c9668b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Mar 22, 2017

Test build #3604 has finished for PR 17361 at commit 9c9668b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@asfgit asfgit closed this in c1e87e3 Mar 22, 2017
@hhbyyh
Contributor

hhbyyh commented Mar 22, 2017

@tdas Just FYI, I'm getting lint-java errors:

yuhao@yuhao-devbox:~/workspace/github/hhbyyh/spark$ ./dev/lint-java
~Using mvn from path: /usr/bin/mvn
Checkstyle checks failed at following occurrences:
[ERROR] src/main/java/org/apache/spark/sql/streaming/KeyedStateTimeout.java:[41,35] (naming) MethodName: Method name 'ProcessingTimeTimeout' must match pattern '^[a-z][a-z0-9][a-zA-Z0-9_]*$'.
[ERROR] src/main/java/org/apache/spark/sql/streaming/KeyedStateTimeout.java:[51,35] (naming) MethodName: Method name 'EventTimeTimeout' must match pattern '^[a-z][a-z0-9][a-zA-Z0-9_]*$'.
[ERROR] src/main/java/org/apache/spark/sql/streaming/KeyedStateTimeout.java:[54,35] (naming) MethodName: Method name 'NoTimeout' must match pattern '^[a-z][a-z0-9][a-zA-Z0-9_]*$'.

Is it just me? Maybe we should suppress the style error.
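For reference, the pattern quoted in the checkstyle output can be tried directly against the reported names. The `MethodNameCheck` object below is a hypothetical helper written for this illustration; it is not part of checkstyle or Spark.

```scala
// Hypothetical helper for illustration; not part of checkstyle or Spark.
object MethodNameCheck {
  // Method-name pattern as quoted in the lint-java output above.
  private val MethodNamePattern = "^[a-z][a-z0-9][a-zA-Z0-9_]*$".r

  /** True if the whole name matches the quoted pattern. */
  def isValid(name: String): Boolean =
    MethodNamePattern.pattern.matcher(name).matches()
}
```

All three reported names start with an uppercase letter, so they fail on the first character class; a lower-camel-case spelling such as processingTimeTimeout would pass.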
