[FLINK-4717] Add CancelJobWithSavepoint #2609
Conversation
673dba2 to
25ffc04
Compare
|
To help review, the main change is in
Furthermore, only the 2nd commit is relevant in this PR as the 1st one is from #2608. |
tillrohrmann
left a comment
There was a problem hiding this comment.
Changes look again very good @uce :-)
I had a comment concerning stopping the checkpoint scheduler which should be fixed imo.
A minor comment concerned the order of cancel -s arguments. But I understand the reason why it is how it is. I prefer to not have a dedicated command for the cancel with savepoint. Thus, I like the approach you've chosen.
Concerning not executing the cancel operation if the savepoint fails, I think that this is the right way to go (with the same arguments as you've presented).
Apart from that +1 for merging.
|
|
||
| - Cancel a job with a savepoint: | ||
|
|
||
| ./bin/flink cancel -s [targetDirectory] <jobID> |
There was a problem hiding this comment.
It is a little bit confusing that the savepoint command has a different argument order.
There was a problem hiding this comment.
We could also change the savepoint trigger command to savepoint -t [targetDir] <jobID>.
Or we could allow cancel <jobID> [targetDirectory] in addition to the -s option variant, but that might be slightly confusing, too. We need the -s option without an argument for cancel with savepoint to the default target.
What do you think?
| case Some((executionGraph, _)) => | ||
| // We don't want any checkpoint between the savepoint and cancellation | ||
| val coord = executionGraph.getCheckpointCoordinator | ||
| coord.stopCheckpointScheduler() |
There was a problem hiding this comment.
I think it's not enough to simply call stopCheckpointScheduler. If I'm not mistaken, then the following could happen: You call stopCheckpointScheduler which will try to cancel the last currentPeriodicTrigger. Now assume that the last TimerTask to trigger the next checkpoint has just been triggered but not executed (just before cancelling it). Now the stopCheckpointScheduler finishes without the TimerTask having completed. Now the TimerTask can still trigger a checkpoint even though we've stopped the checkpoint scheduler.
The way to fix this (admittedly academic corner case), is to filter out outdated TimerTask calls in the CheckpointCoordinator by having a kind of fencing tokens for the trigger checkpoint calls.
|
Thanks for this review, too. Good catch with the broken |
|
Local Travis build passed (https://travis-ci.org/uce/flink/builds/167108720). I'm going to rebase and merge this later. |
3e175f2 to
ecd53ec
Compare
[FLINK-4509] [FLIP-10] Specify savepoint directory per savepoint [FLINK-4507] [FLIP-10] Deprecate savepoint backend config
- Adds CancelJobWithSavepoint message, which triggers a savepoint
before cancelling the respective job.
- Adds -s [targetDirectory] option to CLI cancel command:
* bin/flink cancel <jobID> (regular cancelling)
* bin/flink cancel -s <jobID> (cancel with savepoint to default dir)
* bin/flink cancek -s <targetDir> <jobID> (cancel with savepoint to targetDir)
ecd53ec to
4357b25
Compare
- Adds CancelJobWithSavepoint message, which triggers a savepoint
before cancelling the respective job.
- Adds -s [targetDirectory] option to CLI cancel command:
* bin/flink cancel <jobID> (regular cancelling)
* bin/flink cancel -s <jobID> (cancel with savepoint to default dir)
* bin/flink cancel -s <targetDir> <jobID> (cancel with savepoint to targetDir)
This closes apache#2609.
This builds on top of #2608.
-s [targetDirectory]option to CLI cancel command:bin/flink cancel <jobID>bin/flink cancel -s <jobID>bin/flink cancel -s [targetDir] <jobID>