[SPARK-24699][SS][WIP] Watermark / Append mode should work with Trigger.Once #21676

c-horn · 2018-06-29T23:05:15Z

What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-24699

Structured streaming using Trigger.Once does not persist watermark state between batches, causing streams to never yield output. I will attach some scripts that reproduce this behavior in the Jira issue.

It seems like the microbatcher only calculates the watermark off of the previous batch's input and emits new aggs based off of that timestamp. I believe the issue here is that the previous batch state is not persisted to the checkpoint, and therefore cannot be used when the stream is started again with Trigger.Once.

This behavior can also be seen when restarting a normal stream from a checkpoint: output is never generated on the first batch.

I will investigate ways of fixing this but I am definitely interested in input from anyone who worked on SS.

My assumption is that the watermarking should update with at least the batch-to-batch latency that it does under microbatch/Trigger.ProcessingTime.

How was this patch tested?

Failing unit test provided.

AmplabJenkins · 2018-06-29T23:07:58Z

Can one of the admins verify this patch?

c-horn · 2018-06-30T00:11:13Z

@tdas @marmbrus

c-horn · 2018-07-02T21:09:58Z

Changing OneTimeExecutor like this resolves the test case:

case class OneTimeExecutor() extends TriggerExecutor {

  /**
   * Execute a single batch using `batchRunner`.
   */
-  override def execute(batchRunner: () => Boolean): Unit = batchRunner()
+  override def execute(batchRunner: () => Boolean): Unit = batchRunner() && batchRunner()
}

... but the type becomes semantically incorrect.

Is this an acceptable solution? It appears that a lot of the MicroBatchExecution code makes assumptions about state from the previous batch, which may or may not hold in the first iteration after a stream restart.
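To make the effect of the one-line change concrete, here is a minimal, self-contained sketch. The `TriggerExecutor` trait below is a simplified stand-in written for illustration, not Spark's actual class, and `RunOnce`, `RunTwice`, and `countRuns` are hypothetical names. It only shows that `batchRunner() && batchRunner()` runs a second batch when, and only when, the first call returns `true`.

```scala
// Simplified model of the executors discussed above; NOT Spark's real classes.
object OneTimeExecutorSketch {
  trait TriggerExecutor {
    def execute(batchRunner: () => Boolean): Unit
  }

  // Behavior before the change: run exactly one batch and discard the result.
  object RunOnce extends TriggerExecutor {
    override def execute(batchRunner: () => Boolean): Unit = {
      batchRunner()
      ()
    }
  }

  // Behavior after the change: `&&` short-circuits, so the second batch
  // runs only if the first call returned true.
  object RunTwice extends TriggerExecutor {
    override def execute(batchRunner: () => Boolean): Unit = {
      batchRunner() && batchRunner()
      ()
    }
  }

  // Hypothetical helper: counts how often the executor invokes a batch runner
  // that always returns `result`.
  def countRuns(executor: TriggerExecutor, result: Boolean): Int = {
    var runs = 0
    executor.execute { () => runs += 1; result }
    runs
  }
}
```

The explicit `()` makes the discarded `Boolean` visible, which is the "semantically incorrect" type issue mentioned above: the executor is declared to return `Unit`, yet the expression it evaluates is a `Boolean`.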

tdas · 2018-07-05T23:15:30Z

I think the right solution is to record the updated watermark in the commit log, so that it can be read back from the commit log the next time the stream is started. Right now, no information is written in the commit log; only the existence of the commit file is used as proof that the batch has completed. We should add a new field to the JSON written out to a commit file, which will store the updated watermark. This should be done in a backward-compatible way, so that old checkpoints that do not have the new field can recover as well.
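As a rough illustration of the backward-compatibility requirement, the sketch below parses a commit-file payload and falls back to a default watermark of 0 when the field is absent. The case class, the field name `nextBatchWatermarkMs`, and the regex-based parsing are assumptions made for this example; they are not taken from this thread, and a real implementation would use a proper JSON library.

```scala
// Hypothetical sketch of a commit-log payload with an optional watermark field.
object CommitMetadataSketch {
  // Assumed shape: a single field that defaults to 0 when missing.
  final case class CommitMetadata(nextBatchWatermarkMs: Long = 0L)

  // Matches `"nextBatchWatermarkMs": <digits>` anywhere in the payload.
  private val WatermarkField = "\"nextBatchWatermarkMs\"\\s*:\\s*(\\d+)".r

  def serialize(metadata: CommitMetadata): String =
    s"""{"nextBatchWatermarkMs":${metadata.nextBatchWatermarkMs}}"""

  // Old commit files (empty or `{}`) lack the field, so they yield the default.
  def parse(json: String): CommitMetadata =
    WatermarkField.findFirstMatchIn(json) match {
      case Some(m) => CommitMetadata(m.group(1).toLong)
      case None    => CommitMetadata()
    }
}
```

For example, `CommitMetadataSketch.parse("{}")` yields the default metadata, while a payload produced by `serialize` round-trips to the same value.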

c-horn · 2018-07-06T15:52:02Z

I was under the assumption that the offset log contained this data?

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/OffsetSeq.scala#L32
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/OffsetSeq.scala#L81
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala#L260

It does not seem to read the correct watermark (or rather any watermark).

tdas · 2018-07-11T03:33:50Z

The offset log contains the watermark value that is going to be used in the batch corresponding to that offset. For example, "checkpoint/offsets/10" will contain the watermark value to be used for batch 10. The problem is that when batch 10 completes and a new watermark value is computed, it is not saved in a persistent location until batch 11 is planned and "offsets/11" is written out. With Trigger.Once, this never happens, because the query is terminated as soon as batch 10 completes, so the new watermark value is not saved. If the query has been running in Trigger.Once mode right from the beginning, that is, from batch 0, then no new watermark value is ever written, and so the watermark always shows up as 0.
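The timing problem described here can be reproduced with a toy model. The sketch below is not Spark code: `run`, its parameters, and the `Vector[Long]` standing in for the offset log are all hypothetical. It assumes the watermark is the maximum event time seen minus a fixed delay, that the offset log entry for batch N stores the watermark planned for batch N, and that the value computed after a batch lives only in memory.

```scala
// Toy model of the offset log; an illustration only, not Spark's implementation.
object WatermarkCheckpointSketch {
  /**
   * Simulates one query run. `offsets(n)` is the watermark planned for batch n.
   * The run restores the watermark from the latest entry, then processes one
   * batch per element of `maxEventTimes`. Returns the updated offset log.
   */
  def run(offsets: Vector[Long], maxEventTimes: Seq[Long], delayMs: Long): Vector[Long] = {
    var log = offsets
    var watermark = log.lastOption.getOrElse(0L) // restored on (re)start
    for (maxEventTime <- maxEventTimes) {
      log = log :+ watermark                      // planning a batch persists the watermark it will use
      watermark = math.max(watermark, maxEventTime - delayMs) // new value is held in memory only
    }
    log // the query stops here, so the last in-memory watermark is lost
  }
}
```

In this model, running three batches in one continuous run produces the log `0, 9000, 19000` for event times 10000, 20000, 30000 and a 1000 ms delay. Running the same three batches as three separate single-batch runs, as Trigger.Once would, produces `0, 0, 0`, which matches the "always 0" behavior described above.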

tdas · 2018-07-11T09:30:15Z

Here is my solution based on my suggestion - #21746
I stole your unit test from this PR :) Thank you! I will add you as a co-author in that PR.

c-horn · 2018-07-11T17:44:09Z

@tdas I merged your changes into my branch, test passed, thank you 👍

tdas · 2018-07-20T23:18:41Z

Hey @c-horn, I am ready to merge my PR #21746 (I added more tests). To add you as a co-author, I think I need to know the email address associated with your GitHub account. Can you provide that?

tdas · 2018-07-23T07:51:20Z

ping ^^^

c-horn · 2018-07-23T18:59:25Z

Hi @tdas, sorry for the delay.
My email for github account: chorn4033@gmail.com

This looks fine to me, we can close this PR (and jira ticket) when yours is merged.

c-horn · 2018-07-23T20:25:22Z

already resolved by #21746

a failing test case

1b42cc4

c-horn changed the title ~~[SPARK-24699][SQL][WIP] Watermark / Append mode should work with Trigger.Once~~ [SPARK-24699][SS][WIP] Watermark / Append mode should work with Trigger.Once Jun 30, 2018

c-horn added 3 commits July 2, 2018 17:13

make it clear that Trigger variations are being tested

c746a15

Merge remote-tracking branch 'origin' into SPARK-24699

6b00de5

Merge remote-tracking branch 'origin/master' into SPARK-24699

6ce1137

Fixed bug

7e54a89

Co-authored-by: Tathagata Das <tathagata.das1565@gmail.com> Co-authored-by: c-horn

c-horn added 2 commits July 11, 2018 13:21

Merge remote-tracking branch 'origin' into SPARK-24699

aec0593

Merge remote-tracking branch 'tdas/SPARK-24699' into SPARK-24699

9c2e7fe

c-horn closed this Jul 23, 2018
