[SPARK-XXXXX][SS] Use latestCommittedBatchId as currentBatchId when resuming late batch #38716

Closed
viirya wants to merge 2 commits into apache:master from viirya:fix_ss_resume_from_late_batch

viirya commented Nov 18, 2022

What changes were proposed in this pull request?

This patch changes how currentBatchId is chosen when MicroBatchExecution resumes from the last batch recorded in the offset log. Previously it took latestBatchId from the offset log. With this patch it uses latestCommittedBatchId from the commit log instead.

Why are the changes needed?

We have a customer streaming job that cannot restart after failing to commit state delta files. For example, if the previous run failed to commit 14.delta, the restarted job tries to read 14.delta. Because 14.delta does not exist (it was never committed), the job cannot resume from the last batch.

When MicroBatchExecution populates start offsets, it reads the latest batch id (latestBatchId) from the offset log and the latest committed batch id (latestCommittedBatchId) from the commit log. Currently, latestCommittedBatchId == latestBatchId - 1 means that we are resuming from the last batch, which is in the offset log but not in the commit log. In that case latestBatchId is still used as currentBatchId to run the batch. In the example above, latestBatchId is 14 and latestCommittedBatchId is 13.
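The choice described above can be sketched as a small Python model. This is only an illustration of the case discussed in this PR, not the actual MicroBatchExecution code: the function name current_batch_id and the patched flag are hypothetical, and any other relationship between the two logs is deliberately left unmodelled.

```python
from typing import Optional


def current_batch_id(latest_batch_id: int,
                     latest_committed_batch_id: int,
                     patched: bool) -> Optional[int]:
    """Pick the batch id to run when resuming from the last, uncommitted batch.

    Hypothetical model: only the case discussed in this PR is handled.
    Any other combination of the two ids returns None.
    """
    if latest_committed_batch_id == latest_batch_id - 1:
        # Last batch is in the offset log but not in the commit log.
        # Before the patch: use latestBatchId. After: use latestCommittedBatchId.
        return latest_committed_batch_id if patched else latest_batch_id
    return None


# Example from the description: offset log at 14, commit log at 13.
before_patch = current_batch_id(14, 13, patched=False)
after_patch = current_batch_id(14, 13, patched=True)
print(before_patch, after_patch)
```

With the ids from the example, the model returns 14 for the existing behavior and 13 for the proposed one.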

Because IncrementalExecution uses currentBatchId to load checkpointed state, it tries to load version 14 of the delta files. But version 14 was not committed in the last run, so every resume attempt fails because it cannot load a nonexistent delta file.
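The failure mode described above can be illustrated with a toy model. It assumes, as in the 14.delta example, that each state version maps to a file named <version>.delta. The load_state helper and the set of committed files are hypothetical stand-ins, not Spark's state store API.

```python
from typing import Set


def load_state(committed_deltas: Set[str], version: int) -> str:
    """Return the delta file name for a version, or fail if it was never committed.

    Hypothetical stand-in for loading checkpointed state by version.
    """
    name = f"{version}.delta"
    if name not in committed_deltas:
        raise FileNotFoundError(name)
    return name


# Assumed checkpoint contents: 1.delta .. 13.delta were committed, 14.delta was not.
committed = {f"{v}.delta" for v in range(1, 14)}

loaded = load_state(committed, 13)  # version 13 exists

try:
    load_state(committed, 14)       # version 14 was never committed
    resume_failed = False
except FileNotFoundError:
    resume_failed = True
```

In this model, asking for version 14 fails on every attempt, while asking for version 13 succeeds, which is the situation the description attributes to using latestBatchId versus latestCommittedBatchId.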

When resuming the last batch, we should use latestCommittedBatchId as currentBatchId instead of latestBatchId.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing tests.

viirya commented Nov 18, 2022

retest this please

viirya force-pushed the fix_ss_resume_from_late_batch branch from 309ee28 to 3faf2f6 November 19, 2022 00:32
viirya marked this pull request as draft November 19, 2022 02:57
viirya closed this Nov 19, 2022