
[SPARK-43421][SS] Implement Changelog based Checkpointing for RocksDB State Store Provider #41099

Closed
wants to merge 42 commits

Conversation

Contributor

@chaoqin-li1123 commented May 9, 2023

What changes were proposed in this pull request?

In order to reduce checkpoint duration and end-to-end latency, we propose Changelog Based Checkpointing for the RocksDB State Store Provider. The mechanism is as follows.

  1. Changelog checkpoint: Upon each put() or delete() call to the local RocksDB instance, log the operation to a changelog file. During the state change commit, sync the compressed changelog of the current batch to DFS as checkpointDir/{version}.delta.
  2. Version reconstruction: For version j, find the latest snapshot i.zip such that i <= j, load snapshot i, and replay i+1.delta ~ j.delta. This is used when loading the initial state as well as when creating the latest version snapshot. Note: if a query shuts down without an exception, there won't be changelog replay during query restart, because a maintenance task is executed before the state store instance is unloaded.
  3. Background snapshot: A maintenance thread in the executors launches maintenance tasks periodically. Inside the maintenance task, sync the latest RocksDB local snapshot to DFS as checkpointDir/{version}.zip. Snapshots enable faster failure recovery and allow old versions to be purged.
  4. Garbage collection: Inside the maintenance task, delete snapshot and delta files from DFS for versions that are outside the retention range (by default, 100 versions are retained).
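As a rough illustration of step 2, the file-selection logic can be sketched as follows (names are hypothetical, not the actual Spark implementation):

```scala
// Sketch of version reconstruction: given a target version j and the snapshot
// versions available in DFS, pick the latest snapshot i <= j and the delta
// files (i+1).delta .. j.delta to replay on top of it.
// Assumes at least one snapshot version <= targetVersion exists.
def filesToLoad(targetVersion: Long, snapshotVersions: Seq[Long]): (Long, Seq[Long]) = {
  val snapshot = snapshotVersions.filter(_ <= targetVersion).max
  val deltasToReplay = ((snapshot + 1) to targetVersion).toSeq
  (snapshot, deltasToReplay)
}
```

For example, loading version 7 with snapshots {1, 3, 5} present would load 5.zip and replay 6.delta and 7.delta on top of it.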

Why are the changes needed?

We have identified state checkpointing latency as one of the major performance bottlenecks for stateful streaming queries. Currently, the RocksDB state store pauses the RocksDB instances to upload a snapshot to the cloud when committing a batch, which is heavyweight and has unpredictable performance.
With changelog-based checkpointing, we allow the RocksDB instance to run uninterrupted, which improves RocksDB operation performance. This also dramatically reduces the commit time and batch duration because we upload a smaller amount of data during state commit. With this change, stateful queries with the RocksDB state store will have lower and more predictable latency.

How was this patch tested?

Add unit tests for the changelog checkpointing utility.
Add unit and integration tests that check backward compatibility with existing checkpoints.
Enable RocksDB state store unit tests and stateful streaming query integration tests to run with changelog checkpointing enabled.

@chaoqin-li1123 chaoqin-li1123 changed the title [SS] Implement Changelog based Checkpointing for RocksDB State Store Provider [SPARK-43421][SS] Implement Changelog based Checkpointing for RocksDB State Store Provider May 9, 2023
@chaoqin-li1123
Contributor Author

@HeartSaVioR

assert(changelogWriter.isDefined)
val newVersion = loadedVersion + 1
newVersion - fileManager.getLastUploadedSnapshotVersion() >= conf.minDeltasForSnapshot ||
changelogWriter.get.size > 1000
Contributor

1,000 entries feels relatively small to trigger a snapshot upload. Consider removing the limit, or making it very large, such as 1 million.

Contributor Author

This does not trigger a snapshot upload; it simply flushes and creates a local checkpoint. I can increase it to 10K, which guarantees that the changelog replay between every two snapshots is < 50K records.

@@ -56,6 +56,15 @@ class RocksDB(
hadoopConf: Configuration = new Configuration,
loggingId: String = "") extends Logging {

case class RocksDBCheckpoint(checkpointDir: File, version: Long, numKeys: Long) {
Contributor

Is a case class the preferred pattern for something like this? cc @HeartSaVioR

@anishshri-db
Contributor

@chaoqin-li1123 - is the test failure related to the change?

 - Handle unsupported input of message type *** FAILED *** (12 milliseconds)

@@ -279,7 +280,7 @@ class StateStoreSuite extends StateStoreSuiteBase[HDFSBackedStateStoreProvider]
val timeoutDuration = 1.minute

quietly {
withSpark(new SparkContext(conf)) { sc =>
withSpark(SparkContext.getOrCreate(conf)) { sc =>
Contributor

Why do we need this change?

Contributor Author

Because we now use a shared Spark session, creating a new Spark context would conflict with it.

Contributor

  1. SparkContext.getOrCreate() does not allow you to set the config if there is an active SparkContext. If there is no active SparkContext, the config takes effect. We don't seem to set any configs here, so hopefully nothing is messed up, but it's worth noting.
  2. If you stop the active SparkContext (withSpark), the SparkSession containing that SparkContext will also stop functioning until you set a SparkContext again.

That said, if you have a test which intends to spin up a new SparkContext, the test suite should not extend the shared SparkSession. There is another trait named LocalSparkContext, which spins up and tears down a new SparkContext per test, but that may also be overkill for this test suite, because only a few tests require a SparkContext instance to be set.

I'd say let's roll back unless your change requires SparkSession to be set.

Contributor

I feel like you had to do this because withSQLConf may require a SparkSession. But I suspect this test won't use the SQLConf initialized in the shared SparkSession, which makes withSQLConf a no-op.

Have you checked that all tests are running intentionally, testing both incremental checkpointing and changelog checkpointing?

Contributor Author

I don't mean to override the config here. The reason I need this is that I extend SharedSparkSession, which already has an active Spark context; if I don't modify the test, it throws an error when trying to create a new Spark context.

Contributor

@HeartSaVioR left a comment

Done first pass in code part. Looking through tests.

One major concern about tests is that we define a trait which runs tests for two different config values, leveraging withSQLConf, and blindly apply it to existing test suites. I suspect many tests may not even pick up the change, because they "intentionally" do not use the config in the shared SparkSession. The test suite didn't even use the shared SparkSession at all.

Can we double-check on that?

@@ -2389,6 +2394,12 @@ If you want to cap RocksDB memory usage in your Spark Structured Streaming deplo
You can also determine the max allowed memory for RocksDB instances by setting the `spark.sql.streaming.stateStore.rocksdb.maxMemoryUsageMB` value to a static number or as a fraction of the physical memory available on the node.
Limits for individual RocksDB instances can also be configured by setting `spark.sql.streaming.stateStore.rocksdb.writeBufferSizeMB` and `spark.sql.streaming.stateStore.rocksdb.maxWriteBufferNumber` to the required values. By default, RocksDB internal defaults are used for these settings.
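For illustration, the settings above could be combined in `spark-defaults.conf` like this (the values here are arbitrary examples for a sketch, not recommendations):

```
spark.sql.streaming.stateStore.rocksdb.maxMemoryUsageMB      500
spark.sql.streaming.stateStore.rocksdb.writeBufferSizeMB     64
spark.sql.streaming.stateStore.rocksdb.maxWriteBufferNumber  4
```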

##### RocksDB State Store Changelog Checkpointing
Contributor

Can we provide a higher-level description of how this works? We don't even explain what a changelog is.

I understand we have no explanation of changelog checkpointing for the HDFS-backed state store provider, which is unfortunate, but for the RocksDB state store provider, users have to decide whether to use the old approach (incremental checkpointing) or the new one, which requires them to understand the characteristics of the two options before choosing one.

Contributor Author

Drafted a newer version with ChatGPT, PTAL.

Contributor

Looks OK :)

docs/structured-streaming-programming-guide.md (outdated, resolved)
@@ -865,10 +1076,14 @@ class RocksDBSuite extends SparkFunSuite {
}
}

private def sqlConf = SQLConf.get
Contributor

I don't feel like this is a good idea in terms of isolation. The reason we do withSQLConf is simple: we don't want the modified config to remain applied for the remaining tests. Despite using a def, this gives the same config instance across multiple calls, because the SparkSession is shared among the tests in this suite, so any modification will affect all remaining tests.

Contributor Author

I feel it is fine in the RocksDB unit tests because we only modify a small number of configs here. I do it so that I can propagate the changelog checkpointing SQL conf conveniently.

Contributor

I'd rather clone the config and use the cloned one than modify the config in the session directly. If doing so breaks any tests, I'd consider them not properly isolated.
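The isolation concern can be illustrated with a toy sketch (hypothetical names, not Spark's actual SQLConf): a `def` re-evaluates its body on every call, but it still hands every caller the same shared instance, so modifications leak across tests, whereas a clone isolates them.

```scala
import scala.collection.mutable

// Toy stand-in for a session-scoped config object.
class Conf {
  val settings: mutable.Map[String, String] = mutable.Map.empty
  // Copy the current settings into a fresh, independent instance.
  def cloned: Conf = { val c = new Conf; c.settings ++= settings; c }
}

object SharedSession { val conf = new Conf }

// A def, but it returns the one shared instance on every call.
def sqlConf: Conf = SharedSession.conf
```

Mutating `sqlConf.settings` changes the shared config for every later caller; mutating a `cloned` copy does not.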

@@ -865,10 +1076,14 @@ class RocksDBSuite extends SparkFunSuite {
}
}

private def sqlConf = SQLConf.get

private def dbConf = RocksDBConf(StateStoreConf(sqlConf))
Contributor

Same here, doesn't seem to be good in terms of isolation.

Contributor Author

This should be fine because it creates a new RocksDBConf every time. Changed the above code to clone the sqlConf.


@chaoqin-li1123
Contributor Author

I can verify that all the streaming query integration tests we override inherit StateStoreMetricsTest, which uses a shared Spark session. They also use the SQL conf to override the state format version, so they should also respect the changelog checkpointing config we set.
RocksDBStateStoreIntegrationSuite extends StreamTest, which has a shared Spark session and uses the SQL conf.
I did find that the RocksDBStateStore suite ignores the SQL conf; I have fixed that.

Contributor

@HeartSaVioR left a comment

Mostly minors. Thanks for the patience!

SQLConf.get.getConfString(rocksdbChangelogCheckpointingConfKey) == "true"
}

class RocksDBSuite extends AlsoTestWithChangelogCheckpointingEnabled with SharedSparkSession {
Contributor

Shall we extract out the tests which do not need to be executed twice? Either a separate suite, or some tag/annotation on the tests. I'll comment "check" on the ones I spot, but it would be nice if you could revisit the tests and find more.

Contributor Author

Added testWithChangelogCheckpointingEnabled and testWithChangelogCheckpointingDisabled for tests that don't need to be duplicated.

}
}

test("RocksDBFileManager: read and write changelog") {
Contributor

check

@@ -588,28 +822,6 @@ class RocksDBSuite extends SparkFunSuite {
verifyMetrics(putCount = 2, getCount = 3, metrics = db.metrics)
}
}

// force compaction and check the compaction metrics
Contributor

Can we retain this part of the test and have it selectively executed for incremental checkpointing? Please try not to remove a test when the functionality is not removed.

Contributor

(The config is even applied to the snapshots of changelog checkpointing, although we do not take a snapshot in every microbatch, hence it's not easy to test. OK to skip the test for changelog checkpointing.)

Contributor Author

Added it back; it runs with changelog checkpointing disabled.


extends StreamingAggregationSuite with RocksDBStateStoreTest {
import testImplicits._

def snapshotVersionsPresent(dir: File): Seq[Long] = {
Contributor

Maybe better to have a helper object for RocksDB tests to deduplicate this and the methods below.

Contributor Author

Moved to a TestUtil trait.
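A helper like this presumably looks something like the sketch below (illustrative, not the PR's exact TestUtil code): it lists the versions for which a `<version>.zip` snapshot file exists in the checkpoint directory.

```scala
import java.io.File
import scala.util.Try

// List snapshot versions present in a checkpoint directory, sorted ascending.
// Non-snapshot files (e.g. "2.delta") and unparsable names are skipped.
def snapshotVersionsPresent(dir: File): Seq[Long] =
  Option(dir.listFiles()).getOrElse(Array.empty[File]).toSeq
    .map(_.getName)
    .filter(_.endsWith(".zip"))
    .flatMap(name => Try(name.stripSuffix(".zip").toLong).toOption)
    .sorted
```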

.sorted
}

test("Streaming aggregation RocksDB State Store backward compatibility.") {
Contributor

This can move to a RocksDB-side suite. Let's not put provider-specific tests in the operator test suite. RocksDBStateStoreIntegrationSuite?

Contributor Author

Moved to RocksDBStateStoreIntegrationSuite.

Contributor

@HeartSaVioR left a comment

+1 pending CI. Thanks for the work!

@HeartSaVioR
Contributor

@chaoqin-li1123
Could you please look at how long the [Run / Build modules: sql - other tests](https://github.com/chaoqin-li1123/spark/actions/runs/5135977794/jobs/9242562743#logs) module takes in the latest master branch?

I think the tests we are adding here should not increase the build time by something like an hour. If it has already been exceeding 5.5 hours or so, then we will probably have to look at how to deal with that (one probably easy way is moving the test suites to the sql - slow tests module), but if it doesn't take that long, I would doubt that the tests we touch are really functioning.

@HeartSaVioR
Contributor

@chaoqin-li1123
I see you've made another attempt and it passed. Thanks for doing that!
I'll merge to unblock, but let's file a JIRA ticket and investigate to ensure that this change is NOT an offending commit.

@HeartSaVioR
Contributor

Thanks! Merging to master.

@chaoqin-li1123
Contributor Author

I suspect the 5.5 hours is an accident on the test runner side (it has never happened before), but sure, we can do some investigation.

@HeartSaVioR
Contributor

Yeah... it would be awesome if we could do a manual comparison between attempt #1 and attempt #2 by looking into the log files and tracking the elapsed time per test suite. We actually saw the timeout in previous commits in this PR as well (what did we do to fix that, btw?), so we should take responsibility for looking into it, or at least give it a try.

@chaoqin-li1123
Contributor Author

I don't know why there was a timeout in the previous commit, but it was gone after I removed the new streaming join suite that uses the RocksDB state store.
I will try to see what happened in the 5-hour CI run; I suspect it is due to unavailability in the GitHub testing infra. @HeartSaVioR

czxm pushed a commit to czxm/spark that referenced this pull request Jun 12, 2023
… State Store Provider

(PR description omitted; identical to the description above.)

Closes apache#41099 from chaoqin-li1123/changelog.

Authored-by: Chaoqin Li <chaoqin.li@databricks.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>