[FLINK-2354] Add job graph and checkpoint recovery #1153
Conversation
Massive PR @uce :-) Reviewing it will take me 1-2 days, I guess. Concerning the problem with the …
Force-pushed from 2d4fbc0 to ae19782.
@tillrohrmann I've rebased this on the current master and fixed two issues (the two new commits). Session management has been added recently to the master. I don't think it works with HA at this point. I haven't checked this yet and would postpone until session management is exposed to the user.
Good to hear. I'll try to review it then. Probably won't be before Monday, though.
The implication of sessions would only be that jobs are kept in the "currentJobs" map, even once they are finished. That should transparently work with HA - the jobs would not be removed from ZooKeeper until they are disposed.
Currently, ZooKeeper only stores the latest added job graph (existing ones are overwritten), whereas in the JobManager the JobGraph is attached to the existing ExecutionGraph. In case of recovery, only the latest "attached" JobGraph will be recovered.
I've rebased this on the current master and added a manual ChaosMonkeyTest. The test is friendly in the sense that it waits for task managers to reconnect to the job manager before continuing to stop a jobmanager/taskmanager.
While working on FLINK-2804 I've noticed an issue: the user jars are only uploaded to the leading job manager at submission time and are then not available to the other job managers on recovery. A simple solution is to make the user jars available in the file state backend as well.
// some sanity checks
if (job == null || tasksToTrigger == null ||
        tasksToWaitFor == null || tasksToCommitTo == null) {
    throw new NullPointerException();
}
if (numSuccessfulCheckpointsToRetain < 1) {
    throw new IllegalArgumentException("Must retain at least one successful checkpoint");
}
if (checkpointTimeout < 1) {
    throw new IllegalArgumentException("Checkpoint timeout must be larger than zero");
}

this.job = job;
Maybe we could harmonize the null checking the way you've done it here.
Yes, I've changed it in some places and not in others
Resolved
Till found another issue in one of his Travis runs, which has been addressed in e54a86c. This is now rebased on the current master.
Rebased on the current master and incorporated the job manager state modification fix. Thanks for that! Can we merge this after Travis gives the green light?
Internal actor states must only be modified within the actor thread. This avoids all the well-known issues that come with concurrency. Fix RemoveCachedJob by introducing RemoveJob. Fix JobManagerITCase.
…ubmittedJobGraphStore
… the state backend
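To make the first commit note above concrete, here is a minimal, hypothetical Akka sketch of keeping actor-state mutations on the actor thread by piping asynchronous results back to the actor as messages. ArchiveActor and its string messages are made up for illustration and are not code from this PR.

import akka.actor.UntypedActor;
import akka.dispatch.Futures;
import scala.concurrent.Future;
import java.util.concurrent.Callable;
import static akka.pattern.Patterns.pipe;

public class ArchiveActor extends UntypedActor {

    // Internal actor state: only read/written from the actor thread.
    private int archivedJobs = 0;

    @Override
    public void onReceive(Object message) {
        if ("archive".equals(message)) {
            // Run the slow work off the actor thread ...
            Future<Object> result = Futures.future(new Callable<Object>() {
                @Override
                public Object call() {
                    return "archived";
                }
            }, getContext().dispatcher());

            // ... but pipe the result back as a message instead of mutating
            // 'archivedJobs' from the future callback (wrong thread).
            pipe(result, getContext().dispatcher()).to(getSelf());
        }
        else if ("archived".equals(message)) {
            // Safe: we are back on the actor thread.
            archivedJobs++;
        }
        else {
            unhandled(message);
        }
    }
}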
I made some more fixes for the shading of the Curator dependency. Once Travis gives the green light, I'll merge it.
tl;dr
This PR introduces JobGraph and SuccessfulCheckpoint recovery for submitted programs in case of JobManager failures.

General Idea
The general idea is to persist job graphs and successful checkpoints in ZooKeeper.
We have introduced JobManager high availability via ZooKeeper in #1016. My PR builds on top of this and adds initial support for program recovery. We can recover both programs and successful checkpoints in case of a JobManager failure as soon as a standby job manager is granted leadership.
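For context, the leader election itself comes from #1016; the following is only a rough sketch, assuming Curator's LeaderLatch recipe, of how a standby process could learn that it has been granted leadership and should start recovery. The class name, connection string, and znode path are placeholders, not the PR's actual implementation.

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.leader.LeaderLatch;
import org.apache.curator.framework.recipes.leader.LeaderLatchListener;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class StandbyElectionSketch {
    public static void main(String[] args) throws Exception {
        // Connection string and znode path are placeholders.
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "localhost:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();

        // Every (standby) job manager competes for the same latch path.
        LeaderLatch latch = new LeaderLatch(client, "/flink/leader-latch");
        latch.addListener(new LeaderLatchListener() {
            @Override
            public void isLeader() {
                // Granted leadership: this is where job graph and
                // checkpoint recovery would be kicked off.
                System.out.println("Granted leadership, starting recovery");
            }

            @Override
            public void notLeader() {
                System.out.println("Leadership revoked");
            }
        });
        latch.start();

        Thread.sleep(Long.MAX_VALUE); // keep the process alive for the demo
    }
}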
ZooKeeper's sweet spot is rather small data (in KB range), but job graph and checkpoint state can grow larger. Therefore we don't directly persist the actual metadata, but use the state backend as a layer of indirection. We create state handles for the job graph and completed checkpoints and persist those. The state handle acts as a pointer to the actual data.
At the moment, only the file system state backend is supported for this. The state handles need to be accessible from both task and job managers (e.g. a DFS).
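A minimal sketch of this indirection, assuming a reachable ZooKeeper and a local directory standing in for a shared DFS path: the large payload goes to the file system, and only a small pointer (the state handle) is written to ZooKeeper with Curator. All class names, file names, and znode paths are illustrative, not the PR's actual state backend code.

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class StateHandleIndirectionSketch {
    public static void main(String[] args) throws Exception {
        // 1) Write the (potentially large) metadata to the shared file system.
        //    The directory stands in for a DFS path reachable by all managers.
        Path storageDir = Paths.get("/tmp/flink-recovery");
        Files.createDirectories(storageDir);
        byte[] serializedJobGraph =
                "...serialized JobGraph bytes...".getBytes(StandardCharsets.UTF_8);
        Path handleFile = Files.write(storageDir.resolve("jobgraph-0001"), serializedJobGraph);

        // 2) Store only the small "pointer" (here: the file path) in ZooKeeper,
        //    under a per-job znode, so a recovering JobManager can find it.
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "localhost:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();
        client.create().creatingParentsIfNeeded().forPath(
                "/flink/jobgraphs/0001",
                handleFile.toString().getBytes(StandardCharsets.UTF_8));

        client.close();
    }
}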
Configuration
The minimal required configuration:
I don't like the current configuration keys. Until the next release, I would like a more consistent naming, e.g. prefix everything with recovery.zookeeper.

ZooKeeper Nodes Overview
Overview of ZNodes and components managing them:
Implementation
Submission vs. Recovery (JobManager and SubmittedJobGraphs)
ZooKeeperSubmittedJobGraphs manages SubmittedJobGraph state handles in ZooKeeper (JobManager#submitJob()).

CompletedCheckpoints

ZooKeeperCompletedCheckpoints manages SuccessfulCheckpoint state handles in ZooKeeper (per job). Note that a SuccessfulCheckpoint has pointers to further state handles in most cases. In this case, we add another layer of indirection.

CheckpointIDCounter
ZooKeeperCheckpointIDCounter manages a shared counter in ZooKeeper (per job). The Checkpointed interface requires ascending checkpoint IDs for each checkpoint.
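One way to keep such a shared, ascending counter is Curator's SharedCount recipe with optimistic retries, sketched below; the znode path and class name are placeholders, and this is not necessarily the exact logic of ZooKeeperCheckpointIDCounter.

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.shared.SharedCount;
import org.apache.curator.framework.recipes.shared.VersionedValue;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class CheckpointIdCounterSketch {
    public static void main(String[] args) throws Exception {
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "localhost:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();

        // One shared counter per job; the znode path is a placeholder.
        SharedCount counter = new SharedCount(client, "/flink/checkpoint-counter/job-0001", 1);
        counter.start();

        // Get-and-increment with optimistic concurrency: retry if another
        // (newly leading) job manager bumped the counter in between.
        int nextCheckpointId;
        while (true) {
            VersionedValue<Integer> current = counter.getVersionedValue();
            nextCheckpointId = current.getValue();
            if (counter.trySetCount(current, nextCheckpointId + 1)) {
                break;
            }
        }
        System.out.println("Next checkpoint ID: " + nextCheckpointId);

        counter.close();
        client.close();
    }
}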
Akka messages

Stopping the JobManager actor also triggers its postStop method. Because this method has some cleanup logic (removing job graphs and checkpoints), all JobManager recovery tests run the JobManager as a separate JobManagerProcess. This is quite heavyweight. If someone knows a way to stop the actor without postStop being called, it would be great to refactor this.

Next Steps
Tests
There was a Travis/AWS outage yesterday, so I couldn't run as many builds as we should have yet. I would like to do a couple more runs before we merge this.