[SPARK-35897][SS] Support user defined initial state with flatMapGroupsWithState in Structured Streaming #33093

Closed

rahulsmahadev
Contributor

@rahulsmahadev rahulsmahadev commented Jun 25, 2021

What changes were proposed in this pull request?

This PR adds support for specifying a user-defined initial state for arbitrary stateful processing in Structured Streaming using the [flat]MapGroupsWithState operator.

Why are the changes needed?

Users can load the previous state of their stateful processing as an initial state instead of redoing the entire processing from scratch.

Does this PR introduce any user-facing change?

Yes, this PR introduces the following new APIs:

  def mapGroupsWithState[S: Encoder, U: Encoder](
      timeoutConf: GroupStateTimeout,
      initialState: KeyValueGroupedDataset[K, S])(
      func: (K, Iterator[V], GroupState[S]) => U): Dataset[U]

  def flatMapGroupsWithState[S: Encoder, U: Encoder](
      outputMode: OutputMode,
      timeoutConf: GroupStateTimeout,
      initialState: KeyValueGroupedDataset[K, S])(
      func: (K, Iterator[V], GroupState[S]) => Iterator[U]): Dataset[U]
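
A minimal sketch of how the new flatMapGroupsWithState overload might be called, assuming a Spark version that includes this change. The RunningCount case class, the key names, the nextCount helper, and the use of the built-in rate source are assumptions made for illustration and are not taken from the PR.

```scala
import org.apache.spark.sql.{KeyValueGroupedDataset, SparkSession}
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

// Hypothetical state type for this sketch; not part of the PR.
case class RunningCount(count: Long)

object InitialStateSketch {
  // Pure helper: previous count (if any) plus the number of new events.
  def nextCount(previous: Option[Long], newEvents: Int): Long =
    previous.getOrElse(0L) + newEvents

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("initial-state-sketch")
      .getOrCreate()
    import spark.implicits._

    // Batch Dataset of previously computed state, grouped by the same
    // key type (String) as the streaming Dataset below.
    val initialState: KeyValueGroupedDataset[String, RunningCount] =
      Seq(("even", RunningCount(100L)), ("idle", RunningCount(7L)))
        .toDS()
        .groupByKey(_._1)
        .mapValues(_._2)

    // The built-in "rate" source emits (timestamp, value) rows;
    // derive a string key from the value column.
    val keys = spark.readStream
      .format("rate")
      .option("rowsPerSecond", "5")
      .load()
      .select($"value")
      .as[Long]
      .map(v => if (v % 2 == 0) "even" else "odd")

    val counts = keys
      .groupByKey(k => k)
      .flatMapGroupsWithState[RunningCount, (String, Long)](
        OutputMode.Update(), GroupStateTimeout.NoTimeout(), initialState) {
        (key: String, values: Iterator[String], state: GroupState[RunningCount]) =>
          // state.getOption already holds the initial state for "even" and "idle".
          val updated = nextCount(state.getOption.map(_.count), values.size)
          state.update(RunningCount(updated))
          Iterator((key, updated))
      }

    counts.writeStream
      .format("console")
      .outputMode("update")
      .start()
      .awaitTermination()
  }
}
```

In this sketch the key "idle" exists only in the initial state and never receives streaming data. The test expectations quoted later in this conversation (for example keyOnlyInState-1 with an empty value list) suggest such keys are still passed to the function in the first batch.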


How was this patch tested?

Through unit tests in FlatMapGroupsWithStateSuite

@rahulsmahadev
Contributor Author

@tdas can you enable CI on this

@rahulsmahadev
Contributor Author

cc: @tdas can you enable CI on this

@zsxwing
Member

zsxwing commented Jun 25, 2021

add to whitelist

@SparkQA

SparkQA commented Jun 25, 2021

Test build #140337 has finished for PR 33093 at commit 4a0b591.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 25, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44868/

@SparkQA

SparkQA commented Jun 25, 2021

Test build #140342 has finished for PR 33093 at commit 69b3c7e.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 25, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44873/

@zsxwing
Member

zsxwing commented Jun 25, 2021

ok to test

@SparkQA

SparkQA commented Jun 25, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44877/

@SparkQA

SparkQA commented Jun 25, 2021

Test build #140347 has finished for PR 33093 at commit 69b3c7e.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon HyukjinKwon marked this pull request as draft June 28, 2021 03:17
@HyukjinKwon
Member

@rahulsmahadev, mind keeping the PR description template (https://github.com/apache/spark/blob/master/.github/PULL_REQUEST_TEMPLATE)?

@SparkQA

SparkQA commented Jun 29, 2021

Test build #140405 has finished for PR 33093 at commit 6c1443c.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait WatermarkSupport

@SparkQA

SparkQA commented Jun 29, 2021

Test build #140407 has finished for PR 33093 at commit ec2b972.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 30, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44920/

@SparkQA

SparkQA commented Jun 30, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44922/

@SparkQA

SparkQA commented Jun 30, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44920/

@SparkQA

SparkQA commented Jun 30, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44922/

@SparkQA

SparkQA commented Jun 30, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44928/

@SparkQA

SparkQA commented Jun 30, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44930/

@SparkQA

SparkQA commented Jun 30, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44930/

@SparkQA

SparkQA commented Jun 30, 2021

Test build #140413 has finished for PR 33093 at commit f221358.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 30, 2021

Test build #140415 has finished for PR 33093 at commit 673008d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 1, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45031/

@SparkQA

SparkQA commented Jul 1, 2021

Test build #140514 has finished for PR 33093 at commit 57c9c2f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

("keyInStateAndData-2", Seq[String](), "1"),
("keyOnlyInData", Seq[String]("keyOnlyInData"), "1") // inc by 1
),
assertNumStateRows(total = 5, updated = 5),
Contributor

@tdas tdas Jul 1, 2021

You are not testing whether the initial group state is actually being saved or not. You could have just created the input GroupState object with the initial state and not saved it to the state store, and this test would still pass. So you need to run another batch to retrieve and test the saved state.

Furthermore, you need to explicitly test whether the initial state is saved to the state store even if you don't call GroupState.update(). Right now, your test function always calls update. So even if you incorrectly did not save the initial state to the state store, the update will always make sure the state store is updated. So you need to test more cases, with more keys.
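
The argument above can be illustrated with a toy model. The following is only a sketch under stated assumptions: ToyOperator is a hypothetical in-memory stand-in, not Spark's operator or state store, and its user function only "updates" state when a key has new data. It shows that a single batch cannot tell a correct operator from one that forgets to persist the initial state, while a second batch can.

```scala
// Toy model only: a hypothetical in-memory operator, not Spark's implementation.
object InitialStatePersistenceModel {
  final class ToyOperator(initialState: Map[String, Long], persistsInitialState: Boolean) {
    private var store = Map.empty[String, Long]
    private var firstBatch = true

    // Runs one batch and returns key -> count as computed by the user function.
    // The modeled user function only updates state when the key has new data.
    def runBatch(data: Seq[String]): Map[String, Long] = {
      // The initial state is only visible during the first batch.
      val seed = if (firstBatch) initialState else Map.empty[String, Long]
      val perKey = data.groupBy(identity).map { case (k, vs) => k -> vs.size.toLong }
      val keys = perKey.keySet ++ seed.keySet
      val output = keys.map { key =>
        val visible = store.get(key).orElse(seed.get(key)).getOrElse(0L)
        val n = perKey.getOrElse(key, 0L)
        val result = visible + n
        if (n > 0) {
          store = store.updated(key, result) // explicit update by the user function
        } else if (persistsInitialState && seed.contains(key)) {
          store = store.updated(key, visible) // operator saves the untouched initial state
        }
        key -> result
      }.toMap
      firstBatch = false
      output
    }
  }

  def main(args: Array[String]): Unit = {
    val init = Map("keyOnlyInState" -> 2L)
    val correct = new ToyOperator(init, persistsInitialState = true)
    val buggy = new ToyOperator(init, persistsInitialState = false)

    // Batch 1 (no data): both operators report the same output.
    println(correct.runBatch(Seq.empty) == buggy.runBatch(Seq.empty))

    // Batch 2 (one record): only the correct operator still remembers the state.
    println(correct.runBatch(Seq("keyOnlyInState")))
    println(buggy.runBatch(Seq("keyOnlyInState")))
  }
}
```

In this model the first comparison is true, and the second batch yields a count of 3 for the correct operator but 1 for the buggy one, which is why the test needs a follow-up batch and a key that never calls update.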

@SparkQA

SparkQA commented Jul 1, 2021

Test build #140518 has finished for PR 33093 at commit 4e85a62.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zsxwing zsxwing marked this pull request as ready for review July 1, 2021 19:57
@SparkQA

SparkQA commented Jul 1, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45051/

@SparkQA

SparkQA commented Jul 1, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45051/

def testWithTimeout(timeoutConf: GroupStateTimeout): Unit = {
test("SPARK-20714: watermark does not fail query when timeout = " + timeoutConf) {
// Function to maintain running count up to 2, and then remove the count
// Returns the data and the count (-1 if count reached beyond 2 and state was just removed)
val stateFunc =
(key: String, values: Iterator[(String, Long)], state: GroupState[RunningCount]) => {
(key: String, values: Iterator[(String, Long)], state: GroupState[RunningCount]) => {
Contributor Author

nit: fix this

@SparkQA
Copy link

SparkQA commented Jul 1, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45056/

assert(state.exists)
assertCanGetProcessingTime { state.getCurrentProcessingTimeMs() >= 0 }
assertCannotGetWatermark { state.getCurrentWatermarkMs() }
assert(!state.hasTimedOut)
Contributor

Why do the last 3 tests need to be inside if (valList.isEmpty) {?

// We need to check if not explicitly calling update will still save the init state or not
if (valList.nonEmpty || state.getOption.map(_.count).getOrElse(0L) != 2L) {
// this is not reached when valList is empty and the state count is 2
state.update(new RunningCount(count))
Contributor

Rather than this complicated logic of not updating when some magical set of conditions is met, isn't it simpler to have

if (!key.contains("NoUpdate")) state.update(...)

and then pass a key named keyOnlyInStateButNoUpdate or keyInStateAndDataButNoUpdate?
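
A sketch of what a state function following this suggestion might look like, assuming keys opt out of updates by containing the substring "NoUpdate" in their name. The RunningCount case class, the shouldUpdate helper, and the (key, count) output shape are assumptions for illustration and may differ from the test code in the PR.

```scala
import org.apache.spark.sql.streaming.GroupState

object NoUpdateKeySketch {
  // Hypothetical state type for this sketch.
  case class RunningCount(count: Long)

  // Keys whose name contains "NoUpdate" never call state.update().
  def shouldUpdate(key: String): Boolean = !key.contains("NoUpdate")

  type StateFunc =
    (String, Iterator[String], GroupState[RunningCount]) => Iterator[(String, String)]

  val stateFunc: StateFunc = (key, values, state) => {
    // Start from the existing (possibly initial) state and add the new records.
    val count = state.getOption.map(_.count).getOrElse(0L) + values.size
    // The key name alone decides whether the state is explicitly updated.
    if (shouldUpdate(key)) state.update(RunningCount(count))
    Iterator((key, count.toString))
  }
}
```

With this shape, the intent of each test key is readable from its name, and a key that never calls update can be used to check that the initial state is still saved.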

("keyOnlyInState-1", Seq[String](), "1"),
("keyOnlyInState-2", Seq[String](), "2"),
("keyInStateAndData-2", Seq[String]("keyInStateAndData-2"), "3"), // inc by 1
("keyInStateAndData-1", Seq[String](), "1"),
Contributor

keyInStateAndData-1 is NOT in data. This is confusing!

Contributor Author

Hmm, it is added in the second batch.

Contributor

This is a nit. The point of naming it like this is that it's in both the state and the data of the first batch, which is what we are mainly testing. Hence it is confusing.

assert(state.exists)
assertCanGetProcessingTime { state.getCurrentProcessingTimeMs() >= 0 }
assertCannotGetWatermark { state.getCurrentWatermarkMs() }
assert(!state.hasTimedOut)
Contributor

same questions as above

@SparkQA

SparkQA commented Jul 1, 2021

Test build #140545 has finished for PR 33093 at commit b47ac23.

  • This patch fails Java style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 2, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45056/

@SparkQA

SparkQA commented Jul 2, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45058/

@SparkQA

SparkQA commented Jul 2, 2021

Test build #140538 has finished for PR 33093 at commit 0374bdb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 2, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45063/

@SparkQA

SparkQA commented Jul 2, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45063/

@SparkQA

SparkQA commented Jul 2, 2021

Test build #140543 has finished for PR 33093 at commit b8c70ab.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang
Member

Thanks, merging to master

@SparkQA

SparkQA commented Jul 2, 2021

Test build #140550 has finished for PR 33093 at commit eb83b68.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

6 participants