[FLINK-2354] Add job graph and checkpoint recovery #1153

Closed · wants to merge 8 commits

Conversation

uce (Contributor) commented Sep 21, 2015

tl;dr

This PR introduces JobGraph and SuccessfulCheckpoint recovery for submitted programs in case of JobManager failures.

General Idea

The general idea is to persist job graphs and successful checkpoints in ZooKeeper.

We have introduced JobManager high availability via ZooKeeper in #1016. My PR builds on top of this and adds initial support for program recovery. We can recover both programs and successful checkpoints in case of a JobManager failure as soon as a standby job manager is granted leadership.

ZooKeeper's sweet spot is rather small data (in the KB range), but job graph and checkpoint state can grow larger. Therefore we don't persist the actual metadata directly, but use the state backend as a layer of indirection: we create state handles for the job graph and completed checkpoints and persist those. The state handle acts as a pointer to the actual data.

At the moment, only the file system state backend is supported for this. The state handles need to be accessible from both task and job managers (e.g. a DFS).
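
To make the indirection concrete, here is a minimal sketch, assuming Curator and illustrative paths (the actual implementation serializes a state handle object rather than a plain path string):

import java.nio.charset.StandardCharsets;

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class StateHandlePointerSketch {
    public static void main(String[] args) throws Exception {
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "localhost:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();

        // Stand-in for a serialized state handle: the payload is only a
        // pointer to the actual (possibly large) data in the DFS.
        byte[] pointer = "hdfs:///path/to/recovery/submittedJobGraph-42"
                .getBytes(StandardCharsets.UTF_8);

        // The ZNode stays small regardless of how big the job graph is.
        client.create()
                .creatingParentsIfNeeded()
                .forPath("/flink/jobgraphs/some-job-id", pointer);

        client.close();
    }
}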

Configuration

The minimal required configuration:

recovery.mode: ZOOKEEPER
ha.zookeeper.quorum: <ZooKeeper quorum peers>
state.backend: FILESYSTEM
state.backend.fs.dir.recovery: /path/to/recovery

I don't like the current configuration keys. Before the next release, I would like to move to more consistent naming, e.g. prefixing everything with recovery.zookeeper.

ZooKeeper Nodes Overview

Overview of ZNodes and components managing them:

O- /flink
|
+----O /flink/jobgraphs (SubmittedJobGraphs)
|    |
|    +----O /flink/jobgraphs/<job-id>
|
+----O /flink/checkpoints  (CompletedCheckpoints)
|    |
|    +----O /flink/checkpoints/<job-id>
|    .    |
|    .    +----O /flink/checkpoints/<job-id>/1
|    .    |
|    .    +----O /flink/checkpoints/<job-id>/N
|
+----O /flink/checkpoint-counter (CheckpointIDCounter)
     |
     +----O /flink/checkpoint-counter/<job-id>

Implementation

Submission vs. Recovery (JobManager and SubmittedJobGraphs)

  • ZooKeeperSubmittedJobGraphs manages SubmittedJobGraph state handles in ZooKeeper.
  • Submission and recovery follow mostly the same code paths (see JobManager#submitJob()).
  • On (initial) submission:
    • After writing to ZooKeeper, the JobManager synchronously checks whether it is still the leader.
    • If not, the job is not scheduled for execution but is kept in ZooKeeper; a future leading JobManager has to recover it. The client currently sees this as a successful submission. The job is not removed in this case, because another job manager might recover it between the write and the remove. In such a case, a job would be running without being in ZooKeeper and without being acked to the client. (A sketch of this check follows this list.)
  • On recovery:
    • Recovery is triggered once leadership is granted, after the configured execution delay.
    • All available jobs are scheduled for execution.
  • The ZNode for job graphs is monitored for modifications during operations. This way, a job manager can (eventually) detect if another job manager adds/removes a job and react to it.
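
A sketch of the submission-time leader check described above (submittedJobGraphs, leaderElectionService and scheduleForExecution are illustrative stand-ins, not the PR's exact API):

// Hypothetical sketch of the ZooKeeper submission path.
void submitJob(SubmittedJobGraph jobGraph, boolean isRecovery) throws Exception {
    if (!isRecovery) {
        // Persist the state handle (pointer to the DFS data) in ZooKeeper.
        submittedJobGraphs.putJobGraph(jobGraph);

        // Synchronously re-check leadership *after* the write.
        if (!leaderElectionService.hasLeadership()) {
            // Keep the ZNode: another job manager might already be
            // recovering this job. A future leader will pick it up.
            return;
        }
    }

    // Initial submission as leader, or recovery: schedule for execution.
    scheduleForExecution(jobGraph);
}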

CompletedCheckpoints

  • ZooKeeperCompletedCheckpoints manages SuccessfulCheckpoint state handles in ZooKeeper (per job). Note that a SuccessfulCheckpoint usually has pointers to further state handles; in that case, we add another layer of indirection.
  • Every completed checkpoint is added to ZooKeeper and identified by its checkpoint ID.
  • On recovery, only the latest checkpoint is recovered, even if more than one is available. This keeps the checkpoint history consistent in corner cases where multiple job managers run the same job with checkpointing for some time (currently we retain only one checkpoint anyway, but this matters if we ever choose to retain more). A sketch of this lookup follows this list.
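
A minimal sketch of that lookup, assuming checkpoint ZNodes are named by their numeric IDs as in the overview above (the PR's ZooKeeperCompletedCheckpoints works with state handles rather than raw child names):

import java.util.List;

import org.apache.curator.framework.CuratorFramework;

// Sketch: find the ID of the latest completed checkpoint for a job.
long findLatestCheckpointId(CuratorFramework client, String jobId) throws Exception {
    List<String> children = client.getChildren()
            .forPath("/flink/checkpoints/" + jobId);

    // Children are the checkpoint IDs (1..N); recover the largest one only.
    return children.stream()
            .mapToLong(Long::parseLong)
            .max()
            .orElseThrow(() -> new IllegalStateException("No completed checkpoint"));
}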

CheckpointIDCounter

  • ZooKeeperCheckpointIDCounter manages a shared counter in ZooKeeper (per job).
  • The Checkpointed interface requires ascending checkpoint IDs for each checkpoint.
  • We use a shared counter (per job) via a Curator recipe for this (see the sketch after this list).
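
For reference, a self-contained sketch of such a per-job counter using Curator's SharedCount recipe (connection string and job ID are placeholders):

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.shared.SharedCount;
import org.apache.curator.framework.recipes.shared.VersionedValue;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class CheckpointCounterSketch {
    public static void main(String[] args) throws Exception {
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "localhost:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();

        // One counter per job; checkpoint IDs start at 1.
        SharedCount counter = new SharedCount(
                client, "/flink/checkpoint-counter/some-job-id", 1);
        counter.start();

        // Optimistic get-and-increment: retry if another job manager raced us.
        long checkpointId;
        while (true) {
            VersionedValue<Integer> current = counter.getVersionedValue();
            if (counter.trySetCount(current, current.getValue() + 1)) {
                checkpointId = current.getValue();
                break;
            }
        }
        System.out.println("Next checkpoint ID: " + checkpointId);

        counter.close();
        client.close();
    }
}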

Akka messages

  • This PR introduces two new JobManager message types:
    • RecoverAllJobs
    • RecoverJob(JobID)
  • The ZooKeeper operations are blocking, and all JobManager actor calls need to make sure not to block the JobManager. I've tried to cover all cases where a ZooKeeper operation is triggered (see the sketch after this list).
  • For tests, I didn't manage to stop the JobManager actor without running the postStop method. Because this method has some cleanup logic (removing job graphs and checkpoints), all JobManager recovery tests run the JobManager as a separate JobManagerProcess. This is quite heavyweight. If someone knows a way to stop the actor without postStop being called, it would be great to refactor this.
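
One common way to keep such blocking calls off the actor thread, sketched with Akka's Java API (RecoverAllJobs and the ZooKeeper call are stand-ins for the PR's actual messages and logic):

import java.util.Collections;
import java.util.List;

import akka.actor.UntypedActor;
import akka.dispatch.Futures;
import akka.pattern.Patterns;
import scala.concurrent.ExecutionContext;

public class RecoverySketchActor extends UntypedActor {

    public static final class RecoverAllJobs {}

    @Override
    public void onReceive(Object message) {
        if (message instanceof RecoverAllJobs) {
            ExecutionContext ec = getContext().system().dispatcher();

            // Run the blocking ZooKeeper access on a future and pipe the
            // result back to this actor as an ordinary message; the actor
            // thread itself never blocks.
            Patterns.pipe(
                    Futures.future(this::recoverAllJobGraphsFromZooKeeper, ec),
                    ec
            ).to(getSelf());
        }
        else if (message instanceof List) {
            // The recovered job graphs arrive here; schedule them.
        }
        else {
            unhandled(message);
        }
    }

    private List<String> recoverAllJobGraphsFromZooKeeper() {
        // Placeholder for the blocking ZooKeeper read.
        return Collections.emptyList();
    }
}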

Next Steps

  • Behaviour on recovery via fixed delay is too simplistic.
  • The client is not fully integrated and submits jobs in detached mode if recovery mode is set to ZooKeeper.

Tests

There was a Travis/AWS outage yesterday, so I couldn't run as many builds as I wanted to yet. I would like to get a couple of runs in before we merge this.

tillrohrmann (Contributor) commented:

Massive PR @uce :-) Reviewing it will take me 1-2 days I guess.

Concerning the problem with the postStop method, you could start a JobManager where you simply override the postStop method to do nothing.
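
For illustration, in Akka's Java API this could look like the following (TestingJobManager and its parent are hypothetical; the actual JobManager is a Scala actor):

// Inherit all behaviour but make postStop a no-op, so stopping the actor
// in a test skips the HA cleanup (removal of job graphs and checkpoints).
public class TestingJobManager extends JobManagerActor {
    @Override
    public void postStop() {
        // Intentionally empty for tests.
    }
}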

uce force-pushed the recovery-2354 branch 6 times, most recently from 2d4fbc0 to ae19782 on September 23, 2015 18:01
uce (Contributor, Author) commented Sep 23, 2015

@tillrohrmann I've rebased this on the current master and fixed two issues (the two new commits).

Session management has been added recently to the master. I don't think it works with HA at this point. I haven't checked this yet and would postpone until session management is exposed to the user.

tillrohrmann (Contributor) commented:

Good to hear. I'll try to review it then, but probably won't get to it before Monday, because tomorrow I'm on vacation.

StephanEwen (Contributor) commented:

The implication of sessions would only be that jobs are kept in the currentJobs map even once they are finished. That should work transparently with HA: the jobs would not be removed from ZooKeeper until they are disposed.

uce (Contributor, Author) commented Sep 24, 2015

Currently, ZooKeeper only stores the latest added job graph (existing ones are overwritten), whereas in the JobManager the JobGraph is attached to the existing ExecutionGraph. In case of recovery, only the latest "attached" JobGraph will be recovered.

uce (Contributor, Author) commented Sep 30, 2015

I've rebased this on the current master and added a manual ChaosMonkeyTest. The test is friendly in the sense that it waits for task managers to reconnect to the job manager before it stops the next job manager/task manager.

uce (Contributor, Author) commented Oct 2, 2015

While working on FLINK-2804, I've noticed an issue: the user jars are only uploaded to the leading job manager at submission time and are then not available to the other job managers on recovery. A simple solution is to make the user jars available via the file state backend as well.


// some sanity checks
if (job == null || tasksToTrigger == null ||
        tasksToWaitFor == null || tasksToCommitTo == null) {
    throw new NullPointerException();
}
if (numSuccessfulCheckpointsToRetain < 1) {
    throw new IllegalArgumentException("Must retain at least one successful checkpoint");
}
if (checkpointTimeout < 1) {
    throw new IllegalArgumentException("Checkpoint timeout must be larger than zero");
}

this.job = job;
Review comment (Contributor):
Maybe we could harmonize the null checking as you've done it.

Reply (Contributor, Author):

Yes, I've changed it in some places and not in others.

Reply (Contributor, Author):
Resolved

uce (Contributor, Author) commented Oct 8, 2015

Till found another issue in one of his Travis runs, which has been addressed in e54a86c.

This is now rebased on the current master.

tillrohrmann pushed commits to tillrohrmann/flink that referenced this pull request, Oct 8-9, 2015
uce (Contributor, Author) commented Oct 9, 2015

Rebased on the current master and incorporated the job manager state modification fix. Thanks for that!

Can we merge this after Travis gives the green light?

Internal actor states must only be modified within the actor thread.
This avoids the well-known issues that come with concurrency.

Fix RemoveCachedJob by introducing RemoveJob

Fix JobManagerITCase
tillrohrmann (Contributor) commented:

I made some more fixes for the shading of the Curator dependency. Once Travis gives the green light, I'll merge it.

tillrohrmann pushed commits to tillrohrmann/flink that referenced this pull request, Oct 11-19, 2015
asfgit closed this in 73c73e9 on Oct 20, 2015
cfmcgrady pushed a commit to cfmcgrady/flink that referenced this pull request Oct 23, 2015
lofifnc pushed a commit to lofifnc/flink that referenced this pull request Oct 23, 2015
@uce uce deleted the recovery-2354 branch December 24, 2015 17:38