Skip to content

[FLINK-4717] Add CancelJobWithSavepoint #2609

Closed
uce wants to merge 2 commits intoapache:masterfrom
uce:4717-cancel_with_savepoint
Closed

[FLINK-4717] Add CancelJobWithSavepoint #2609
uce wants to merge 2 commits intoapache:masterfrom
uce:4717-cancel_with_savepoint

Conversation

@uce
Copy link
Contributor

@uce uce commented Oct 7, 2016

This builds on top of #2608.

  • Adds CancelJobWithSavepoint message, which triggers a savepoint before cancelling the respective job. The periodic checkpoint scheduler is stopped before doing this so that no checkpoints happen between the savepoint and cancellation. Note that users could manually trigger another savepoint while cancel-with-savepoint happens, but I think this is tolerable.
  • Adds -s [targetDirectory] option to CLI cancel command:
    • Regular cancelling:bin/flink cancel <jobID>
    • Cancel with savepoint to default dir: bin/flink cancel -s <jobID>
    • Cancel with savepoint to target dir: bin/flink cancel -s [targetDir] <jobID>

@uce uce force-pushed the 4717-cancel_with_savepoint branch 2 times, most recently from 673dba2 to 25ffc04 Compare October 10, 2016 16:54
@uce
Copy link
Contributor Author

uce commented Oct 11, 2016

To help review, the main change is in JobManager and CliFrontend. Some questions there may be:

  • Should we keep the behaviour that the job is not cancelled if the savepoint fails? My reasoning for this was that the user might otherwise accidentally loose state when cancelling the job.
  • Does the way that the cancellation is specified on the CLI make sense? Should we keep the -s option or add another proper command like bin/flink cancel-with-savepoint?

Furthermore, only the 2nd commit is relevant in this PR as the 1st one is from #2608.

Copy link
Contributor

@tillrohrmann tillrohrmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes look again very good @uce :-)

I had a comment concerning stopping the checkpoint scheduler which should be fixed imo.

A minor comment concerned the order of cancel -s arguments. But I understand the reason why it is how it is. I prefer to not have a dedicated command for the cancel with savepoint. Thus, I like the approach you've chosen.

Concerning not executing the cancel operation if the savepoint fails, I think that this is the right way to go (with the same arguments as you've presented).

Apart from that +1 for merging.


- Cancel a job with a savepoint:

./bin/flink cancel -s [targetDirectory] <jobID>
Copy link
Contributor

@tillrohrmann tillrohrmann Oct 11, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a little bit confusing that the savepoint command has a different argument order.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could also change the savepoint trigger command to savepoint -t [targetDir] <jobID>.

Or we could allow cancel <jobID> [targetDirectory] in addition to the -s option variant, but that might be slightly confusing, too. We need the -s option without an argument for cancel with savepoint to the default target.

What do you think?

case Some((executionGraph, _)) =>
// We don't want any checkpoint between the savepoint and cancellation
val coord = executionGraph.getCheckpointCoordinator
coord.stopCheckpointScheduler()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's not enough to simply call stopCheckpointScheduler. If I'm not mistaken, then the following could happen: You call stopCheckpointScheduler which will try to cancel the last currentPeriodicTrigger. Now assume that the last TimerTask to trigger the next checkpoint has just been triggered but not executed (just before cancelling it). Now the stopCheckpointScheduler finishes without the TimerTask having completed. Now the TimerTask can still trigger a checkpoint even though we've stopped the checkpoint scheduler.

The way to fix this (admittedly academic corner case), is to filter out outdated TimerTask calls in the CheckpointCoordinator by having a kind of fencing tokens for the trigger checkpoint calls.

@uce
Copy link
Contributor Author

uce commented Oct 12, 2016

Thanks for this review, too. Good catch with the broken stopCheckpointScheduler. I will add your proposed fix as a separate commit.

@uce
Copy link
Contributor Author

uce commented Oct 12, 2016

Addressed the issue and found another bug in #2608 that I've fixed in f769e8e. If Travis gives the green light, I will rebase on #2608 and merge this later today.

@uce
Copy link
Contributor Author

uce commented Oct 13, 2016

Local Travis build passed (https://travis-ci.org/uce/flink/builds/167108720). I'm going to rebase and merge this later.

@uce uce force-pushed the 4717-cancel_with_savepoint branch from 3e175f2 to ecd53ec Compare October 13, 2016 13:44
uce added 2 commits October 13, 2016 17:07
[FLINK-4509] [FLIP-10] Specify savepoint directory per savepoint
[FLINK-4507] [FLIP-10] Deprecate savepoint backend config
- Adds CancelJobWithSavepoint message, which triggers a savepoint
  before cancelling the respective job.
- Adds -s [targetDirectory] option to CLI cancel command:
    * bin/flink cancel <jobID> (regular cancelling)
    * bin/flink cancel -s <jobID> (cancel with savepoint to default dir)
    * bin/flink cancek -s <targetDir> <jobID> (cancel with savepoint to targetDir)
@uce uce force-pushed the 4717-cancel_with_savepoint branch from ecd53ec to 4357b25 Compare October 13, 2016 15:07
@asfgit asfgit closed this in 5783671 Oct 14, 2016
@uce uce deleted the 4717-cancel_with_savepoint branch November 2, 2016 09:15
liuyuzhong pushed a commit to liuyuzhong/flink that referenced this pull request Dec 5, 2016
- Adds CancelJobWithSavepoint message, which triggers a savepoint
  before cancelling the respective job.
- Adds -s [targetDirectory] option to CLI cancel command:
    * bin/flink cancel <jobID> (regular cancelling)
    * bin/flink cancel -s <jobID> (cancel with savepoint to default dir)
    * bin/flink cancel -s <targetDir> <jobID> (cancel with savepoint to targetDir)

This closes apache#2609.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants