fix(purge): Simplify graceful agent shutdown #332
Conversation
not sure if this is correct yet |
Can we spec out the expected behaviour in the functional tests? |
@jonboulle It'll be easier to tackle this after we land the state-machine changes. |
BTW, the TestNodeShutdown functional test will be broken until we sort this out. |
Got some thoughts down:
|
For now, I think we should treat all downtime as permanent. I do not believe we can guarantee downtime will be temporary enough to not have an impact on the system. We do need to handle the case where an agent loses network connectivity to etcd and its presence data times out, causing its jobs to get scheduled elsewhere. That's outside of the scope of what I'm trying to accomplish with this PR, but it's worth thinking about. |
I think it's fine to make this assumption at this early stage (I mean, we already do); but we should absolutely leave the door open to supporting "temporary downtime" in future. It's a powerful feature (e.g. consider the possibility of having jobs with varying SLAs), and would be a clear differentiator from other systems like Mesos, where any kind of transient failure results in a bunch of scheduling churn (because any "loss" is immediately and permanently irrevocable) |
Do you have something in mind for how it could conceivably "catch up" on events? |
When we start exploring temporary downtime further, it'll eventually be useful to make the further distinction between "anticipated" (fleet upgrades, CoreOS graceful reboots, reconfiguration) and "unanticipated" (cold reboots, loss of network connectivity). |
Actually, I lied - we need to do this better right now :-) - like binding the life of the fleet agent to the jobs it is running (as I think we discussed). Since right now a |
@jonboulle All I've done on the planning side of things here is sit and get my thoughts down in that markdown document. I'd like to land this PR as-is, as it does improve the situation. Let's brainstorm this further next monday. |
I missed that there was something to review here - I thought you just wanted some thoughts ;-) |
Both, please! |
// locally so it doesn't confuse later state calculations | ||
for _, j := range launched { | ||
a.registry.ClearJobHeartbeat(j) | ||
//TODO(bcwaldon): The agent should not have to ask the Registry |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeahhhh, we really need to track this better
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could easily move into the AgentState.
fix(purge): Simplify graceful agent shutdown
No description provided.