[FLINK-4717] Add CancelJobWithSavepoint by uce · Pull Request #2609 · apache/flink

uce · 2016-10-07T15:59:30Z

This builds on top of #2608.

Adds CancelJobWithSavepoint message, which triggers a savepoint before cancelling the respective job. The periodic checkpoint scheduler is stopped before doing this so that no checkpoints happen between the savepoint and cancellation. Note that users could manually trigger another savepoint while cancel-with-savepoint happens, but I think this is tolerable.
Adds -s [targetDirectory] option to CLI cancel command:
- Regular cancelling:bin/flink cancel <jobID>
- Cancel with savepoint to default dir: bin/flink cancel -s <jobID>
- Cancel with savepoint to target dir: bin/flink cancel -s [targetDir] <jobID>

uce · 2016-10-11T09:08:12Z

To help review, the main change is in JobManager and CliFrontend. Some questions there may be:

Should we keep the behaviour that the job is not cancelled if the savepoint fails? My reasoning for this was that the user might otherwise accidentally loose state when cancelling the job.
Does the way that the cancellation is specified on the CLI make sense? Should we keep the -s option or add another proper command like bin/flink cancel-with-savepoint?

Furthermore, only the 2nd commit is relevant in this PR as the 1st one is from #2608.

tillrohrmann

Changes look again very good @uce :-)

I had a comment concerning stopping the checkpoint scheduler which should be fixed imo.

A minor comment concerned the order of cancel -s arguments. But I understand the reason why it is how it is. I prefer to not have a dedicated command for the cancel with savepoint. Thus, I like the approach you've chosen.

Concerning not executing the cancel operation if the savepoint fails, I think that this is the right way to go (with the same arguments as you've presented).

Apart from that +1 for merging.

tillrohrmann · 2016-10-11T14:31:53Z

docs/setup/cli.md


+-   Cancel a job with a savepoint:
+
+        ./bin/flink cancel -s [targetDirectory] <jobID>


It is a little bit confusing that the savepoint command has a different argument order.

We could also change the savepoint trigger command to savepoint -t [targetDir] <jobID>.

Or we could allow cancel <jobID> [targetDirectory] in addition to the -s option variant, but that might be slightly confusing, too. We need the -s option without an argument for cancel with savepoint to the default target.

What do you think?

tillrohrmann · 2016-10-11T15:07:42Z

flink-runtime/src/main/scala/org/apache/flink/runtime/jobmanager/JobManager.scala

+          case Some((executionGraph, _)) =>
+            // We don't want any checkpoint between the savepoint and cancellation
+            val coord = executionGraph.getCheckpointCoordinator
+            coord.stopCheckpointScheduler()


I think it's not enough to simply call stopCheckpointScheduler. If I'm not mistaken, then the following could happen: You call stopCheckpointScheduler which will try to cancel the last currentPeriodicTrigger. Now assume that the last TimerTask to trigger the next checkpoint has just been triggered but not executed (just before cancelling it). Now the stopCheckpointScheduler finishes without the TimerTask having completed. Now the TimerTask can still trigger a checkpoint even though we've stopped the checkpoint scheduler.

The way to fix this (admittedly academic corner case), is to filter out outdated TimerTask calls in the CheckpointCoordinator by having a kind of fencing tokens for the trigger checkpoint calls.

uce · 2016-10-12T09:31:05Z

Thanks for this review, too. Good catch with the broken stopCheckpointScheduler. I will add your proposed fix as a separate commit.

uce · 2016-10-12T12:11:13Z

Addressed the issue and found another bug in #2608 that I've fixed in f769e8e. If Travis gives the green light, I will rebase on #2608 and merge this later today.

uce · 2016-10-13T09:34:30Z

Local Travis build passed (https://travis-ci.org/uce/flink/builds/167108720). I'm going to rebase and merge this later.

[FLINK-4509] [FLIP-10] Specify savepoint directory per savepoint [FLINK-4507] [FLIP-10] Deprecate savepoint backend config

- Adds CancelJobWithSavepoint message, which triggers a savepoint before cancelling the respective job. - Adds -s [targetDirectory] option to CLI cancel command: * bin/flink cancel <jobID> (regular cancelling) * bin/flink cancel -s <jobID> (cancel with savepoint to default dir) * bin/flink cancek -s <targetDir> <jobID> (cancel with savepoint to targetDir)

- Adds CancelJobWithSavepoint message, which triggers a savepoint before cancelling the respective job. - Adds -s [targetDirectory] option to CLI cancel command: * bin/flink cancel <jobID> (regular cancelling) * bin/flink cancel -s <jobID> (cancel with savepoint to default dir) * bin/flink cancel -s <targetDir> <jobID> (cancel with savepoint to targetDir) This closes apache#2609.

uce force-pushed the 4717-cancel_with_savepoint branch 2 times, most recently from 673dba2 to 25ffc04 Compare October 10, 2016 16:54

tillrohrmann requested changes Oct 11, 2016

View reviewed changes

uce mentioned this pull request Oct 12, 2016

[FLINK-4787] [runtime-web] Expose cancel-with-savepoint via REST API #2626

Closed

uce force-pushed the 4717-cancel_with_savepoint branch from 3e175f2 to ecd53ec Compare October 13, 2016 13:44

uce added 2 commits October 13, 2016 17:07

[FLINK-4512] [FLIP-10] Add option to persist periodic checkpoints

8339965

[FLINK-4509] [FLIP-10] Specify savepoint directory per savepoint [FLINK-4507] [FLIP-10] Deprecate savepoint backend config

uce force-pushed the 4717-cancel_with_savepoint branch from ecd53ec to 4357b25 Compare October 13, 2016 15:07

asfgit closed this in 5783671 Oct 14, 2016

uce deleted the 4717-cancel_with_savepoint branch November 2, 2016 09:15

rmetzger added the component=Runtime/StateBackends label Mar 14, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[FLINK-4717] Add CancelJobWithSavepoint #2609

[FLINK-4717] Add CancelJobWithSavepoint #2609
uce wants to merge 2 commits intoapache:masterfrom
uce:4717-cancel_with_savepoint

uce commented Oct 7, 2016

Uh oh!

uce commented Oct 11, 2016 •

edited

Loading

Uh oh!

tillrohrmann left a comment

Uh oh!

tillrohrmann Oct 11, 2016 •

edited

Loading

Uh oh!

uce Oct 12, 2016

Uh oh!

tillrohrmann Oct 11, 2016

Uh oh!

uce commented Oct 12, 2016

Uh oh!

uce commented Oct 12, 2016

Uh oh!

uce commented Oct 13, 2016

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants


		- Cancel a job with a savepoint:

		./bin/flink cancel -s [targetDirectory] <jobID>

Conversation

uce commented Oct 7, 2016

Uh oh!

uce commented Oct 11, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

tillrohrmann left a comment

Choose a reason for hiding this comment

Uh oh!

tillrohrmann Oct 11, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

uce Oct 12, 2016

Choose a reason for hiding this comment

Uh oh!

tillrohrmann Oct 11, 2016

Choose a reason for hiding this comment

Uh oh!

uce commented Oct 12, 2016

Uh oh!

uce commented Oct 12, 2016

Uh oh!

uce commented Oct 13, 2016

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

uce commented Oct 11, 2016 •

edited

Loading

tillrohrmann Oct 11, 2016 •

edited

Loading