[SPARK-54590][SS] Support Checkpoint V2 for State Rewriter and Repartitioning #53720

zifeif2 · 2026-01-08T00:29:25Z

What changes were proposed in this pull request?

Support checkpoint V2 for the repartition writer and StateRewriter by returning the checkpoint ID to the caller after the write is done.
Changes include:

RocksDB loadWithCheckpointId supports loadEmpty
StatePartitionAllColumnFamiliesWriter returns StateStoreCheckpointInfo
StateRewriter also propagates StateStoreCheckpointInfo back to the RepartitionRunner
RepartitionRunner stores the checkpoint IDs in the commit log
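
The flow listed above can be sketched roughly as follows. This is a simplified, self-contained model and not the actual Spark code: `CheckpointInfo`, `writePartition`, `rewriteAll`, and `commitLogEntry` are hypothetical stand-ins for `StateStoreCheckpointInfo`, the partition writer, `StateRewriter`, and the commit log write in the runner.

```scala
import java.util.UUID

object CheckpointFlowSketch {
  // Hypothetical stand-in for StateStoreCheckpointInfo.
  final case class CheckpointInfo(
      partitionId: Int,
      batchVersion: Long,
      checkpointId: Option[String])

  // Hypothetical partition writer: consumes the rows, then returns the ID
  // of the checkpoint it just produced instead of returning nothing.
  def writePartition(partitionId: Int, version: Long, rows: Iterator[String]): CheckpointInfo = {
    rows.foreach(_ => ()) // pretend every row is written to the state store
    CheckpointInfo(partitionId, version, Some(UUID.randomUUID().toString))
  }

  // Hypothetical rewriter: runs the writer on each partition and hands the
  // per-partition results back to its caller.
  def rewriteAll(version: Long, partitions: Seq[Seq[String]]): Seq[CheckpointInfo] =
    partitions.zipWithIndex.map { case (rows, id) =>
      writePartition(id, version, rows.iterator)
    }

  // Hypothetical commit-log payload: checkpoint IDs keyed by partition.
  def commitLogEntry(infos: Seq[CheckpointInfo]): Map[Int, String] =
    (for { info <- infos; id <- info.checkpointId } yield info.partitionId -> id).toMap
}
```

The point of the sketch is only the direction of data flow: the ID is created at write time on each partition and travels back up so the runner can record it.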

Why are the changes needed?

This is required in PrPr for the repartition project.

Does this PR introduce any user-facing change?

No

How was this patch tested?

See the added unit tests on both operators with a single state store and operators with multiple state stores.

Was this patch authored or co-authored using generative AI tooling?

Yes. Sonnet 4.5

github-actions · 2026-01-08T00:29:34Z

JIRA Issue Information

=== New Feature SPARK-54590 ===
Summary: State Writer supports checkpoint V2
Assignee: None
Status: Open
Affected: ["4.1.0"]

This comment was automatically generated by GitHub Actions

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDB.scala

...ain/scala/org/apache/spark/sql/execution/streaming/state/OfflineStateRepartitionRunner.scala

.../apache/spark/sql/execution/streaming/state/StatePartitionAllColumnFamiliesWriterSuite.scala

...ain/scala/org/apache/spark/sql/execution/streaming/state/OfflineStateRepartitionRunner.scala

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDB.scala

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateRewriter.scala

...test/scala/org/apache/spark/sql/execution/streaming/state/OfflineStateRepartitionSuite.scala

sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/state/RocksDBSuite.scala

.../apache/spark/sql/execution/streaming/state/StatePartitionAllColumnFamiliesWriterSuite.scala

...test/scala/org/apache/spark/sql/execution/streaming/state/OfflineStateRepartitionSuite.scala

...ain/scala/org/apache/spark/sql/execution/streaming/state/OfflineStateRepartitionRunner.scala

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateRewriter.scala

micheal-o

I am stamping this PR so we can move forward, but please let's address the review comments correctly. Thanks.

common/utils/src/main/resources/error/error-conditions.json

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateRewriter.scala

.../apache/spark/sql/execution/streaming/state/StatePartitionAllColumnFamiliesWriterSuite.scala

micheal-o · 2026-01-15T04:24:59Z

@zifeif2 Also fix the PR title to: Support Checkpoint V2 for State Rewriter and Repartitioning

Fix logging format in StateRewriter.scala

common/utils/src/main/resources/error/error-conditions.json

anishshri-db · 2026-01-15T21:28:34Z

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDB.scala

    try {
-      if (loadedVersion != version || (loadedStateStoreCkptId.isEmpty ||
-          stateStoreCkptId.get != loadedStateStoreCkptId.get)) {
+      if (loadEmpty || loadedVersion != version || loadedStateStoreCkptId.isEmpty ||


Why do we need this?

We need to support loadEmpty in RocksDB's loadWithCheckpointId because repartitioning does not need to read the previous data.

I put loadEmpty in this if statement alongside loadedVersion != version || loadedStateStoreCkptId.isEmpty ||... to reduce duplicate code, but it looks like that makes the code harder to understand. I can refactor it so that loadEmpty gets its own separate block.
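
A minimal sketch of what "loadEmpty as its own block" could look like, assuming a simplified decision function. `LoadAction`, `decide`, and the parameter names are hypothetical and only model the condition discussed above; the sketch also compares the two optional IDs directly rather than calling `.get` on them, which is a simplification.

```scala
object LoadDecisionSketch {
  sealed trait LoadAction
  case object StartEmpty extends LoadAction    // skip DFS, begin from an empty store
  case object ReloadFromDfs extends LoadAction // fetch the requested version and checkpoint ID
  case object ReuseLoaded extends LoadAction   // the loaded state already matches the request

  def decide(
      loadEmpty: Boolean,
      version: Long,
      ckptId: Option[String],
      loadedVersion: Long,
      loadedCkptId: Option[String]): LoadAction = {
    if (loadEmpty) {
      // Handled first and separately, so the reload condition below
      // reads the same as it did before loadEmpty was introduced.
      require(ckptId.isEmpty, "stateStoreCkptId should be empty when loadEmpty is true")
      StartEmpty
    } else if (loadedVersion != version || loadedCkptId.isEmpty || ckptId != loadedCkptId) {
      ReloadFromDfs
    } else {
      ReuseLoaded
    }
  }
}
```

Separating the branches costs a few extra lines but keeps the "nothing to read" case from being mistaken for a version or ID mismatch.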

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDB.scala

anishshri-db · 2026-01-15T21:29:34Z

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDB.scala

-        log"with uniqueId ${MDC(LogKeys.UUID, stateStoreCkptId)}")
+      if (loadEmpty) {
+        logInfo(log"Loaded empty store at version ${MDC(LogKeys.VERSION_NUM, version)} " +
+          log"with uniqueId")


Is the unique ID not available here?

No. We don't expect the caller to pass in a uniqueId when calling loadWithCheckpointId with loadEmpty = true, because we do not load any previous version of the data in that case. We also have a require check above:

require(stateStoreCkptId.isEmpty, "stateStoreCkptId should be empty when loadEmpty is true")

I can change it to a less confusing message.
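
One possible shape for a less confusing message, as a sketch only. It uses plain string interpolation instead of Spark's structured `log"..."` interpolator, and `loadedMessage` is a hypothetical helper, not something in the PR.

```scala
object LoadLogMessageSketch {
  // Builds the post-load log line. When loadEmpty is true there is no
  // checkpoint ID to report, so the message says so instead of ending
  // in a dangling "with uniqueId".
  def loadedMessage(version: Long, loadEmpty: Boolean, ckptId: Option[String]): String =
    if (loadEmpty) {
      s"Loaded empty store at version $version (no checkpoint ID is read when loadEmpty is true)"
    } else {
      s"Loaded version $version with uniqueId ${ckptId.getOrElse("<none>")}"
    }
}
```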

anishshri-db · 2026-01-15T21:30:58Z

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateRewriter.scala

-      partitionWriter.write(partitionIter)
-    }
+      Iterator(partitionWriter.write(partitionIter))
+    }.collect()


Why are we calling collect here ?

I thought we needed to add collect() to make the rewrite actually run and to get back a list of StateStoreCheckpointInfo.
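
For context, `mapPartitions` on an RDD is a lazy transformation, and an action such as `collect()` is what triggers execution and brings the results back to the driver. The sketch below illustrates the same idea without Spark, using a plain Scala `Iterator` as an analogy; `write`, `plan`, and `written` are hypothetical names.

```scala
import scala.collection.mutable.ArrayBuffer

object LazyWriteSketch {
  // Records which partitions were actually written.
  val written: ArrayBuffer[Int] = ArrayBuffer.empty[Int]

  // Hypothetical write: has a side effect and returns a checkpoint ID.
  def write(partitionId: Int): String = {
    written += partitionId
    s"ckpt-$partitionId"
  }

  // Like an RDD transformation, Iterator.map only describes the work;
  // nothing is written until the iterator is consumed.
  def plan(partitions: Seq[Int]): Iterator[String] =
    partitions.iterator.map(write)
}
```

Calling `LazyWriteSketch.plan(Seq(0, 1, 2))` performs no writes on its own; consuming the result with `.toList` (the analogue of `collect()` here) both runs the writes and returns the IDs.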

zifeif2 · 2026-01-15T22:38:17Z

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDBFileManager.scala

      // Since we cleared the local dir, we should also clear the local file mapping
      rocksDBFileMapping.clear()
+      // Set empty metrics since we're not loading any files from DFS
+      loadCheckpointMetrics = RocksDBFileManagerMetrics.EMPTY_METRICS


Added this line to make sure that loadCheckpointMetrics is set correctly when we load an empty store. cc @anishshri-db
RocksDB runs fileManagerMetrics = fileManager.latestLoadCheckpointMetrics, and latestLoadCheckpointMetrics returns loadCheckpointMetrics.
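
A simplified model of why the reset matters, assuming the metrics field is otherwise only updated when files are loaded from DFS. `Metrics`, `FileManager`, `loadFromDfs`, and `loadEmpty` below are hypothetical stand-ins, not the real `RocksDBFileManager` API.

```scala
object FileManagerMetricsSketch {
  final case class Metrics(filesCopied: Long, bytesCopied: Long)
  object Metrics {
    val Empty: Metrics = Metrics(0L, 0L)
  }

  final class FileManager {
    private var loadCheckpointMetrics: Metrics = Metrics.Empty

    // The caller reads the metrics of the most recent load through this accessor.
    def latestLoadCheckpointMetrics: Metrics = loadCheckpointMetrics

    def loadFromDfs(files: Long, bytes: Long): Unit =
      loadCheckpointMetrics = Metrics(files, bytes)

    // Assumption: without this reset, the accessor would still return the
    // numbers from whichever load ran before the empty load.
    def loadEmpty(): Unit =
      loadCheckpointMetrics = Metrics.Empty
  }
}
```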

anishshri-db

lgtm pending green CI

github-actions bot added SQL STRUCTURED STREAMING labels Jan 8, 2026

zifeif2 force-pushed the repartition-cp-v2 branch from 80c03a2 to 1199413 Compare January 8, 2026 01:15

zifeif2 commented Jan 8, 2026

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDB.scala

zifeif2 marked this pull request as ready for review January 8, 2026 01:17

zifeif2 force-pushed the repartition-cp-v2 branch from 1199413 to 4879f3a Compare January 9, 2026 23:43

zifeif2 added 2 commits January 9, 2026 23:46

initial commit for ckptV2

5f0411e

reduce duplicate code

33b9fc1

zifeif2 force-pushed the repartition-cp-v2 branch from 4879f3a to 672daff Compare January 9, 2026 23:46

rebase on offlineStateRepartitionRunner

327039f

zifeif2 force-pushed the repartition-cp-v2 branch from 672daff to 327039f Compare January 10, 2026 00:00

fix compile error

96c8198

micheal-o reviewed Jan 11, 2026

address comment

72efef7

zifeif2 force-pushed the repartition-cp-v2 branch from 546e49c to 72efef7 Compare January 13, 2026 01:38

Update StateRewriter.scala

213fa3e

micheal-o reviewed Jan 13, 2026

zifeif2 commented Jan 13, 2026

address comments and set correct checkpoitn

f2ac9e5

zifeif2 force-pushed the repartition-cp-v2 branch from 3253637 to f2ac9e5 Compare January 14, 2026 05:57

zifeif2 commented Jan 14, 2026

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateRewriter.scala

address more comments

9e6b29a

zifeif2 force-pushed the repartition-cp-v2 branch from 545968e to 9e6b29a Compare January 14, 2026 22:40

micheal-o approved these changes Jan 15, 2026

zifeif2 changed the title ~~[SPARK-54590][SS] Support CheckpointV2 for StatePartitionAllColumnFamiliesWriter~~ [SPARK-54590][SS] Support Checkpoint V2 for State Rewriter and Repartitioning Jan 15, 2026

zifeif2 force-pushed the repartition-cp-v2 branch from 34eba2e to 8e940a1 Compare January 15, 2026 18:49

address comments

6568663

Fix logging format in StateRewriter.scala

zifeif2 force-pushed the repartition-cp-v2 branch from ce6a893 to 6568663 Compare January 15, 2026 20:00

anishshri-db reviewed Jan 15, 2026

common/utils/src/main/resources/error/error-conditions.json

anishshri-db reviewed Jan 15, 2026

zifeif2 commented Jan 15, 2026

address comments

72878bb

zifeif2 force-pushed the repartition-cp-v2 branch from 9777f6b to 72878bb Compare January 15, 2026 22:39

anishshri-db approved these changes Jan 16, 2026

anishshri-db closed this in 813f757 Jan 16, 2026
