
[SPARK-18790][SS] Keep a general offset history of stream batches #16219

Closed
wants to merge 6 commits

Conversation


@tcondie tcondie commented Dec 8, 2016

What changes were proposed in this pull request?

Instead of only keeping the minimum number of offsets around, we should keep enough information to allow us to roll back n batches and re-execute the stream starting from a given point. In particular, we should create a config in SQLConf, spark.sql.streaming.retainedBatches, that defaults to 100, and ensure that we keep enough log files in the following places to roll back the specified number of batches (see the sketch after this list):
  • the offsets that are present in each batch
  • versions of the state store
  • the file lists stored for the FileStreamSource
  • the metadata log stored by the FileStreamSink
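A minimal sketch of the proposed knob, assuming the SQLConfigBuilder style used by SQLConf at the time (the key referenced by the review comments below is spark.sql.streaming.minBatchesToRetain, rather than the retainedBatches name from the JIRA):

    val MIN_BATCHES_TO_RETAIN = SQLConfigBuilder("spark.sql.streaming.minBatchesToRetain")
      .internal()
      .doc("The minimum number of batches that must be retained and made recoverable.")
      .intConf
      .createWithDefault(100)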

@marmbrus @zsxwing

How was this patch tested?

The following tests were added.

StreamExecution offset metadata

Test added to StreamingQuerySuite that ensures offset metadata is garbage collected according to minBatchesToRetain

CompactibleFileStreamLog

Tests added in CompactibleFileStreamLogSuite to ensure that logs are purged starting before the first compaction file that precedes the current batch id - minBatchesToRetain (see the sketch below).
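A hedged sketch of the retention rule these tests exercise; purgeBoundary is an illustrative helper rather than the actual CompactibleFileStreamLog API, and isCompactionBatch assumes the convention that every compactInterval-th batch is a compaction batch:

    // A batch is a compaction batch when (batchId + 1) is a multiple of compactInterval.
    def isCompactionBatch(batchId: Long, compactInterval: Int): Boolean =
      (batchId + 1) % compactInterval == 0

    // The latest compaction batch strictly before (currentBatchId - minBatchesToRetain);
    // log files before this boundary are safe to delete.
    def purgeBoundary(currentBatchId: Long, minBatchesToRetain: Int,
                      compactInterval: Int): Option[Long] =
      (0L until (currentBatchId - minBatchesToRetain))
        .filter(isCompactionBatch(_, compactInterval))
        .lastOption

    // Example: compactInterval = 3, minBatchesToRetain = 2, currentBatchId = 10.
    // Compaction batches are 2, 5, 8; the boundary is 5, so batches 0-4 can be purged.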

Please review http://spark.apache.org/contributing.html before opening a pull request.


SparkQA commented Dec 8, 2016

Test build #69879 has finished for PR 16219 at commit fc1557e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Dec 9, 2016

Test build #69882 has finished for PR 16219 at commit 4791209.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Dec 9, 2016

Test build #69898 has finished for PR 16219 at commit 7b6538c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


@zsxwing zsxwing left a comment


Overall looks good. Just left some comments.

    @@ -58,6 +58,8 @@ class StreamExecution(
      private val pollingDelayMs = sparkSession.sessionState.conf.streamingPollingDelay

    + private val minBatchesToRetain = sparkSession.sessionState.conf.minBatchesToRetain

Member


It's better to add a require(...) here to make sure minBatchesToRetain >= 1.
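A one-line sketch of the suggested guard (the message text is an assumption):

    require(minBatchesToRetain > 0, "minBatchesToRetain has to be positive")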

      if (isCompactionBatch(batchId, compactInterval)) {
    -   compact(batchId, logs)
    +   batchAdded = compact(batchId, logs)
Member


nit: you can rewrite it like

    val batchAdded =
      if (isCompactionBatch(batchId, compactInterval)) {
        compact(batchId, logs)
      } else {
        super.add(batchId, logs)
      }

    - val storeConf = StateStoreConf.empty
    + val sqlConf = new SQLConf()
    + sqlConf.setConf(SQLConf.MIN_BATCHES_TO_RETAIN, 2)
    + val storeConf = StateStoreConf(sqlConf) // StateStoreConf.empty
Member


nit: this is not StateStoreConf.empty. Please remove the stale comment.

    @@ -303,7 +303,6 @@ private[state] class HDFSBackedStateStoreProvider(
      val mapFromFile = readSnapshotFile(version).getOrElse {
        val prevMap = loadMap(version - 1)
        val newMap = new MapType(prevMap)
    -   newMap.putAll(prevMap)
Member


Why remove this line? newMap should be prevMap plus the delta file in that case.

Contributor Author


new MapType(prevMap) already copies every entry from prevMap (it is equivalent to calling newMap.putAll(prevMap)), so the explicit newMap.putAll(prevMap) was redundant work.
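A small self-contained illustration of this point, assuming MapType aliases java.util.concurrent.ConcurrentHashMap as in HDFSBackedStateStoreProvider:

    import java.util.concurrent.ConcurrentHashMap

    val prevMap = new ConcurrentHashMap[String, String]()
    prevMap.put("k", "v")

    // The copy constructor already copies every entry from prevMap ...
    val newMap = new ConcurrentHashMap[String, String](prevMap)
    assert(newMap.get("k") == "v")

    // ... so a follow-up putAll(prevMap) just re-inserts the same entries.
    newMap.putAll(prevMap)
    assert(newMap.size == 1)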


    spark.conf.set("spark.sql.streaming.minBatchesToRetain", 1)
Member


nit: you can use

    withSQLConf(SQLConf.MIN_BATCHES_TO_RETAIN.key -> "1") {
      ...
    }

to simplify the code; withSQLConf will restore the conf afterwards.

    } catch {
      case _: NumberFormatException =>
        false
    ...
    private def deleteExpiredLog(currentBatchId: Long): Unit = {
Member


Could you also update the comments since they are out of date?

    withSQLConf(
      SQLConf.FILE_SINK_LOG_COMPACT_INTERVAL.key -> "3",
    -   SQLConf.FILE_SINK_LOG_CLEANUP_DELAY.key -> "0") {
    +   SQLConf.FILE_SINK_LOG_CLEANUP_DELAY.key -> "0",
    +   SQLConf.MIN_BATCHES_TO_RETAIN.key -> "1") {
Member


Could you also test other values of SQLConf.MIN_BATCHES_TO_RETAIN? Testing only 1 is not enough.
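A sketch of the requested coverage, looping the existing scenario over several retention values (the scenario body is elided and assumed to reuse the suite's existing assertions):

    Seq("1", "2", "3").foreach { retain =>
      withSQLConf(
          SQLConf.FILE_SINK_LOG_COMPACT_INTERVAL.key -> "3",
          SQLConf.FILE_SINK_LOG_CLEANUP_DELAY.key -> "0",
          SQLConf.MIN_BATCHES_TO_RETAIN.key -> retain) {
        // ... run the compaction scenario and assert which log files survive
      }
    }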


SparkQA commented Dec 10, 2016

Test build #69938 has finished for PR 16219 at commit 4dbada0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Dec 10, 2016

Test build #69943 has finished for PR 16219 at commit 0830349.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


zsxwing commented Dec 12, 2016

LGTM


SparkQA commented Dec 12, 2016

Test build #3493 has finished for PR 16219 at commit 0830349.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


zsxwing commented Dec 12, 2016

Thanks! Merging to master and 2.1.

asfgit pushed a commit that referenced this pull request Dec 12, 2016
[SPARK-18790][SS] Keep a general offset history of stream batches
Author: Tyson Condie <tcondie@gmail.com>

Closes #16219 from tcondie/offset_hist.

(cherry picked from commit 83a4289)
Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
@asfgit asfgit closed this in 83a4289 Dec 12, 2016
robert3005 pushed a commit to palantir/spark that referenced this pull request Dec 15, 2016
[SPARK-18790][SS] Keep a general offset history of stream batches

Author: Tyson Condie <tcondie@gmail.com>

Closes apache#16219 from tcondie/offset_hist.
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
[SPARK-18790][SS] Keep a general offset history of stream batches

Author: Tyson Condie <tcondie@gmail.com>

Closes apache#16219 from tcondie/offset_hist.