
[SPARK-18790][SS] Keep a general offset history of stream batches #16219

Closed
wants to merge 6 commits

Conversation


@tcondie tcondie commented Dec 8, 2016

What changes were proposed in this pull request?

Instead of only keeping the minimum number of offsets around, we should keep enough information to allow us to roll back n batches and re-execute the stream starting from a given point. In particular, we should create a config in SQLConf, spark.sql.streaming.retainedBatches, that defaults to 100, and ensure that we keep enough log files in the following places to roll back the specified number of batches (see the sketch after this list):
  • the offsets that are present in each batch
  • versions of the state store
  • the file lists stored for the FileStreamSource
  • the metadata log stored by the FileStreamSink
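A minimal sketch of the proposed knob, assuming the SQLConfigBuilder style used by SQLConf at the time (the key referenced by the review comments below is spark.sql.streaming.minBatchesToRetain, rather than the retainedBatches name from the JIRA):

    val MIN_BATCHES_TO_RETAIN = SQLConfigBuilder("spark.sql.streaming.minBatchesToRetain")
      .internal()
      .doc("The minimum number of batches that must be retained and made recoverable.")
      .intConf
      .createWithDefault(100)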

@marmbrus @zsxwing

How was this patch tested?

The following tests were added.

StreamExecution offset metadata

Test added to StreamingQuerySuite that ensures offset metadata is garbage collected according to minBatchesToRetain

CompactibleFileStreamLog

Tests added in CompactibleFileStreamLogSuite to ensure that logs are purged starting before the first compaction file that precedes the current batch id - minBatchesToRetain (see the sketch below).
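A hedged sketch of the retention rule these tests exercise; purgeBoundary is an illustrative helper rather than the actual CompactibleFileStreamLog API, and isCompactionBatch assumes the convention that every compactInterval-th batch is a compaction batch:

    // A batch is a compaction batch when (batchId + 1) is a multiple of compactInterval.
    def isCompactionBatch(batchId: Long, compactInterval: Int): Boolean =
      (batchId + 1) % compactInterval == 0

    // The latest compaction batch strictly before (currentBatchId - minBatchesToRetain);
    // log files before this boundary are safe to delete.
    def purgeBoundary(currentBatchId: Long, minBatchesToRetain: Int,
                      compactInterval: Int): Option[Long] =
      (0L until (currentBatchId - minBatchesToRetain))
        .filter(isCompactionBatch(_, compactInterval))
        .lastOption

    // Example: compactInterval = 3, minBatchesToRetain = 2, currentBatchId = 10.
    // Compaction batches are 2, 5, 8; the boundary is 5, so batches 0-4 can be purged.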

Please review http://spark.apache.org/contributing.html before opening a pull request.


SparkQA commented Dec 8, 2016

Test build #69879 has finished for PR 16219 at commit fc1557e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Dec 9, 2016

Test build #69882 has finished for PR 16219 at commit 4791209.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Dec 9, 2016

Test build #69898 has finished for PR 16219 at commit 7b6538c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


@zsxwing zsxwing left a comment


Overall looks good. Just left some comments.

    @@ -58,6 +58,8 @@ class StreamExecution(
      private val pollingDelayMs = sparkSession.sessionState.conf.streamingPollingDelay

    + private val minBatchesToRetain = sparkSession.sessionState.conf.minBatchesToRetain

Member


It's better to add a require(...) here to make sure minBatchesToRetain >= 1.
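A one-line sketch of the suggested guard (the message text is an assumption):

    require(minBatchesToRetain > 0, "minBatchesToRetain has to be positive")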

      if (isCompactionBatch(batchId, compactInterval)) {
    -   compact(batchId, logs)
    +   batchAdded = compact(batchId, logs)
Member


nit: you can rewrite it like

    val batchAdded =
      if (isCompactionBatch(batchId, compactInterval)) {
        compact(batchId, logs)
      } else {
        super.add(batchId, logs)
      }

    - val storeConf = StateStoreConf.empty
    + val sqlConf = new SQLConf()
    + sqlConf.setConf(SQLConf.MIN_BATCHES_TO_RETAIN, 2)
    + val storeConf = StateStoreConf(sqlConf) // StateStoreConf.empty
Member


nit: this is not StateStoreConf.empty. Please remove the stale comment.

    @@ -303,7 +303,6 @@ private[state] class HDFSBackedStateStoreProvider(
      val mapFromFile = readSnapshotFile(version).getOrElse {
        val prevMap = loadMap(version - 1)
        val newMap = new MapType(prevMap)
    -   newMap.putAll(prevMap)
Member


Why remove this line? newMap should be prevMap plus the delta file in that case.

Contributor Author


new MapType(prevMap) already copies every entry from prevMap (it is equivalent to calling newMap.putAll(prevMap)), so the explicit newMap.putAll(prevMap) was redundant work.
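A small self-contained illustration of this point, assuming MapType aliases java.util.concurrent.ConcurrentHashMap as in HDFSBackedStateStoreProvider:

    import java.util.concurrent.ConcurrentHashMap

    val prevMap = new ConcurrentHashMap[String, String]()
    prevMap.put("k", "v")

    // The copy constructor already copies every entry from prevMap ...
    val newMap = new ConcurrentHashMap[String, String](prevMap)
    assert(newMap.get("k") == "v")

    // ... so a follow-up putAll(prevMap) just re-inserts the same entries.
    newMap.putAll(prevMap)
    assert(newMap.size == 1)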


    spark.conf.set("spark.sql.streaming.minBatchesToRetain", 1)
Member


nit: you can use

    withSQLConf(SQLConf.MIN_BATCHES_TO_RETAIN.key -> "1") {
      ...
    }

to simplify the code; withSQLConf will restore the conf afterwards.

    } catch {
      case _: NumberFormatException =>
        false
    ...
    private def deleteExpiredLog(currentBatchId: Long): Unit = {
Member


Could you also update the comments since they are out of date?

    withSQLConf(
      SQLConf.FILE_SINK_LOG_COMPACT_INTERVAL.key -> "3",
    -   SQLConf.FILE_SINK_LOG_CLEANUP_DELAY.key -> "0") {
    +   SQLConf.FILE_SINK_LOG_CLEANUP_DELAY.key -> "0",
    +   SQLConf.MIN_BATCHES_TO_RETAIN.key -> "1") {
Member


Could you also test other values of SQLConf.MIN_BATCHES_TO_RETAIN? Testing only 1 is not enough.
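A sketch of the requested coverage, looping the existing scenario over several retention values (the scenario body is elided and assumed to reuse the suite's existing assertions):

    Seq("1", "2", "3").foreach { retain =>
      withSQLConf(
          SQLConf.FILE_SINK_LOG_COMPACT_INTERVAL.key -> "3",
          SQLConf.FILE_SINK_LOG_CLEANUP_DELAY.key -> "0",
          SQLConf.MIN_BATCHES_TO_RETAIN.key -> retain) {
        // ... run the compaction scenario and assert which log files survive
      }
    }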


SparkQA commented Dec 10, 2016

Test build #69938 has finished for PR 16219 at commit 4dbada0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Dec 10, 2016

Test build #69943 has finished for PR 16219 at commit 0830349.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


zsxwing commented Dec 12, 2016

LGTM


SparkQA commented Dec 12, 2016

Test build #3493 has finished for PR 16219 at commit 0830349.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


zsxwing commented Dec 12, 2016

Thanks! Merging to master and 2.1.

asfgit pushed a commit that referenced this pull request Dec 12, 2016
[SPARK-18790][SS] Keep a general offset history of stream batches
Author: Tyson Condie <tcondie@gmail.com>

Closes #16219 from tcondie/offset_hist.

(cherry picked from commit 83a4289)
Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
@asfgit asfgit closed this in 83a4289 Dec 12, 2016
robert3005 pushed a commit to palantir/spark that referenced this pull request Dec 15, 2016
[SPARK-18790][SS] Keep a general offset history of stream batches

Author: Tyson Condie <tcondie@gmail.com>

Closes apache#16219 from tcondie/offset_hist.
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
[SPARK-18790][SS] Keep a general offset history of stream batches

Author: Tyson Condie <tcondie@gmail.com>

Closes apache#16219 from tcondie/offset_hist.