[SPARK-53751][SDP] Explicit Versioned Checkpoint Location #52487
Conversation
```scala
val resolvedGraph = resolveGraph()
if (context.fullRefreshTables.nonEmpty) {
  State.reset(resolvedGraph, context)
}
```
With an explicit storage location for the checkpoint, we shouldn't need to create the tables and obtain their paths beforehand. `resolvedGraph` should suffice.
(Resolved review comment on `sql/pipelines/src/main/scala/org/apache/spark/sql/pipelines/graph/DatasetManager.scala`)
Force-pushed from f3533ff to 60cfada.
```diff
-  override def afterEach(): Unit = {
+  protected override def afterEach(): Unit = {
```
Keeps consistency with the signature in `BeforeAndAfterEach` and fixes an inheritance error when introducing `StorageRootMixin`.
```scala
 * The path to the temporary directory is available via the `storageRoot` variable.
 */
trait StorageRootMixin extends BeforeAndAfterEach { self: Suite =>
```
Extracting this out of `PipelineTest` so the Spark Connect pipeline tests can also extend it.
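For readers following along, a self-contained sketch of what such a mixin could do. The real trait extends ScalaTest's `BeforeAndAfterEach` (as shown in the diff above); here the hooks are plain methods so the sketch runs standalone, and the directory prefix and cleanup strategy are illustrative assumptions, not the PR's exact code:

```scala
import java.nio.file.{Files, Path, Paths}
import java.util.Comparator

// Simplified stand-in for StorageRootMixin: creates a temp directory before
// each test and deletes it afterwards. The real trait mixes in ScalaTest's
// BeforeAndAfterEach; here the hooks are plain methods so this runs standalone.
trait StorageRootMixinSketch {
  protected var storageRoot: String = ""

  protected def beforeEach(): Unit = {
    storageRoot = Files.createTempDirectory("storage-root-").toString
  }

  protected def afterEach(): Unit = {
    // Delete deepest paths first so each directory is empty when removed.
    val walk = Files.walk(Paths.get(storageRoot))
    try walk.sorted(Comparator.reverseOrder[Path]()).forEach(p => Files.delete(p))
    finally walk.close()
  }
}
```

Each test that mixes this in gets a fresh, isolated `storageRoot`, which is exactly what a storage-location feature needs for its tests.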
Force-pushed from 7ecc4dd to 22a4c49.
@sryza the diff is a bit large, but I tried to include only the changes necessary for checkpoints.
Just one small comment about a comment. Otherwise, this looks great.
```scala
/**
 * Resets the checkpoint for the given flow by creating the next consecutive directory. Also
 * clears out batch append state if it exists.
```
I don't think this clears out batch append state, right?
Yeah, stale comment; let me remove it.
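For context, the "next consecutive directory" behavior described in the doc comment above could look roughly like this. This is a sketch under the assumption that version directories are plain integer names; `nextCheckpointVersion` is a hypothetical helper, not the PR's actual code:

```scala
// Hypothetical helper (assumption, not the PR's code): given the names of
// existing version directories for a flow, pick the next consecutive version.
// Non-numeric entries are ignored; an empty directory starts at version 0.
object CheckpointVersions {
  def nextCheckpointVersion(existingDirNames: Seq[String]): Int = {
    val versions = existingDirNames.flatMap(n => scala.util.Try(n.toInt).toOption)
    if (versions.isEmpty) 0 else versions.max + 1
  }
}
```

Creating a fresh directory on each full refresh means the old checkpoint is left behind rather than mutated, so a refresh can never corrupt prior streaming state.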
Force-pushed from 9413fd8 to dcad414.
Referenced by: [SPARK-54043] Update Spark Connect-generated Swift source code with Apache Spark `4.1.0-preview3` RC1 (#252, authored by Dongjoon Hyun).
What changes were proposed in this pull request?
Add a `storage` field in the pipeline spec to allow users to specify the location of metadata such as streaming checkpoints. The checkpoint directory layout supports multiple flows as well as versioned directories, where the version number is incremented after a full refresh.
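To illustrate the description above, here is a hedged sketch of how a versioned, multi-flow checkpoint path might be derived from the `storage` root. The `_checkpoints` segment and the `<storage>/<flow>/<version>` shape are assumptions for illustration, not necessarily the PR's exact layout:

```scala
// Sketch of a versioned checkpoint path under the pipeline's storage root.
// The "_checkpoints" segment and path shape are illustrative assumptions:
// each flow gets its own subdirectory, and each full refresh bumps the
// integer version so streaming state restarts from a clean directory.
object CheckpointLayout {
  def checkpointPath(storage: String, flowName: String, version: Int): String =
    s"$storage/_checkpoints/$flowName/$version"
}
```

With this shape, two flows in the same pipeline never share a checkpoint directory, and a full refresh of one flow leaves the others' state untouched.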
Why are the changes needed?
Currently, SDP stores streaming flow checkpoints in the table path. This does not support versioned checkpoints and does not work for sinks.
Does this PR introduce any user-facing change?
How was this patch tested?
New and existing tests
Was this patch authored or co-authored using generative AI tooling?
No