[SPARK-53751][SDP] Explicit Versioned Checkpoint Location #52487
Conversation
```scala
val resolvedGraph = resolveGraph()
if (context.fullRefreshTables.nonEmpty) {
  State.reset(resolvedGraph, context)
}
```
With an explicit storage location for the checkpoint, we shouldn't need to create the tables and obtain their paths beforehand. `resolvedGraph` should suffice.
(Resolved review comment on `sql/pipelines/src/main/scala/org/apache/spark/sql/pipelines/graph/DatasetManager.scala`)
Force-pushed from f3533ff to 60cfada.
```diff
-  override def afterEach(): Unit = {
+  protected override def afterEach(): Unit = {
```
Keeps consistency with the signature in `BeforeAndAfterEach` and fixes an inheritance error when introducing `StorageRootMixin`.
```scala
 * The path to the temporary directory is available via the `storageRoot` variable.
 */
trait StorageRootMixin extends BeforeAndAfterEach { self: Suite =>
```
Extracting this out of `PipelineTest` so the Spark Connect pipeline tests can also extend it.
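For readers following along, a self-contained sketch of what such a mixin could do. The real trait extends ScalaTest's `BeforeAndAfterEach` (as shown in the diff above); here the hooks are plain methods so the sketch runs standalone, and the directory prefix and cleanup strategy are illustrative assumptions, not the PR's exact code:

```scala
import java.nio.file.{Files, Path, Paths}
import java.util.Comparator

// Simplified stand-in for StorageRootMixin: creates a temp directory before
// each test and deletes it afterwards. The real trait mixes in ScalaTest's
// BeforeAndAfterEach; here the hooks are plain methods so this runs standalone.
trait StorageRootMixinSketch {
  protected var storageRoot: String = ""

  protected def beforeEach(): Unit = {
    storageRoot = Files.createTempDirectory("storage-root-").toString
  }

  protected def afterEach(): Unit = {
    // Delete deepest paths first so each directory is empty when removed.
    val walk = Files.walk(Paths.get(storageRoot))
    try walk.sorted(Comparator.reverseOrder[Path]()).forEach(p => Files.delete(p))
    finally walk.close()
  }
}
```

Each test that mixes this in gets a fresh, isolated `storageRoot`, which is exactly what a storage-location feature needs for its tests.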
Force-pushed from 7ecc4dd to 22a4c49.
@sryza the diff is a bit large, but I tried to include only the changes necessary for checkpoints.
Just one small comment about a comment. Otherwise, this looks great.
```scala
/**
 * Resets the checkpoint for the given flow by creating the next consecutive directory. Also
 * clears out batch append state if it exists.
```
I don't think this clears out batch append state, right?
Yeah, stale comment; let me remove it.
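For context, the "next consecutive directory" behavior described in the doc comment above could look roughly like this. This is a sketch under the assumption that version directories are plain integer names; `nextCheckpointVersion` is a hypothetical helper, not the PR's actual code:

```scala
// Hypothetical helper (assumption, not the PR's code): given the names of
// existing version directories for a flow, pick the next consecutive version.
// Non-numeric entries are ignored; an empty directory starts at version 0.
object CheckpointVersions {
  def nextCheckpointVersion(existingDirNames: Seq[String]): Int = {
    val versions = existingDirNames.flatMap(n => scala.util.Try(n.toInt).toOption)
    if (versions.isEmpty) 0 else versions.max + 1
  }
}
```

Creating a fresh directory on each full refresh means the old checkpoint is left behind rather than mutated, so a refresh can never corrupt prior streaming state.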
Force-pushed from 9413fd8 to dcad414.
Referenced by: [SPARK-54043] Update Spark Connect-generated Swift source code with Apache Spark `4.1.0-preview3` RC1 (#252, authored by Dongjoon Hyun).
What changes were proposed in this pull request?
Add a `storage` field in the pipeline spec to allow users to specify the location of metadata such as streaming checkpoints. The checkpoint directory layout supports multiple flows as well as versioned directories, where the version number is incremented after a full refresh.
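To illustrate the description above, here is a hedged sketch of how a versioned, multi-flow checkpoint path might be derived from the `storage` root. The `_checkpoints` segment and the `<storage>/<flow>/<version>` shape are assumptions for illustration, not necessarily the PR's exact layout:

```scala
// Sketch of a versioned checkpoint path under the pipeline's storage root.
// The "_checkpoints" segment and path shape are illustrative assumptions:
// each flow gets its own subdirectory, and each full refresh bumps the
// integer version so streaming state restarts from a clean directory.
object CheckpointLayout {
  def checkpointPath(storage: String, flowName: String, version: Int): String =
    s"$storage/_checkpoints/$flowName/$version"
}
```

With this shape, two flows in the same pipeline never share a checkpoint directory, and a full refresh of one flow leaves the others' state untouched.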
Why are the changes needed?
Currently, SDP stores streaming flow checkpoints in the table path. This does not support versioned checkpoints and does not work for sinks.
Does this PR introduce any user-facing change?
How was this patch tested?
New and existing tests
Was this patch authored or co-authored using generative AI tooling?
No