This repository has been archived by the owner on Jan 30, 2020. It is now read-only.

fix(purge): Simplify graceful agent shutdown #332

Merged
bcwaldon merged 1 commit into coreos:master from fix-agent-shutdown on Apr 19, 2014

Conversation

bcwaldon
Contributor

No description provided.

@bcwaldon
Contributor Author

not sure if this is correct yet

@jonboulle
Contributor

Can we spec out the expected behaviour in the functional tests?

@bcwaldon
Contributor Author

@jonboulle It'll be easier to tackle this after we land the state-machine changes.

@bcwaldon
Contributor Author

BTW, the TestNodeShutdown functional test will be broken until we sort this out.

bcwaldon mentioned this pull request on Apr 17, 2014
@bcwaldon
Contributor Author

Got some thoughts down:

# fleet going down

There exist several situations where a fleet agent may halt operation.
We can group these situations together based on whether the agent downtime is permanent or temporary.

## permanent downtime

A fleet agent may go down and stay down for several reasons:

* power loss
* node decommission

In any of these situations, any jobs scheduled to the agent must be rescheduled elsewhere.
The engine is going to rely on a loss of agent presence to trigger rescheduling.
If an agent is shutting down gracefully, explicitly removing presence data will expedite the rescheduling and make the cluster appear more responsive.

## temporary downtime

A fleet agent may go down temporarily for multiple reasons:

* upgrading fleet code
* upgrading CoreOS
* reconfiguration
* loss of network connectivity

As long as the downtime can be guaranteed to be temporary, there is no need to take additional action before halting operation.
When rejoining the cluster, the fleet agent must attempt to catch back up and react to any events it missed while it was down.
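
To make the permanent-downtime path above concrete, here is a minimal sketch of what "explicitly removing presence data" before exit could look like. The Registry interface and its method names are hypothetical illustrations, not fleet's actual API; the point is only the ordering: drop presence data first so the engine can begin rescheduling immediately instead of waiting for a timeout.

```go
// Hypothetical sketch only: Registry method names are illustrative, not fleet's API.
package agent

import "log"

type Registry interface {
	RemoveMachineState(machID string) error      // hypothetical: delete presence data
	ClearJobTarget(jobName, machID string) error // hypothetical: unschedule a job from this machine
}

type Agent struct {
	machID   string
	registry Registry
	launched []string // jobs currently running on this agent
}

// Purge runs during graceful shutdown (the permanent-downtime path above).
func (a *Agent) Purge() {
	if err := a.registry.RemoveMachineState(a.machID); err != nil {
		log.Printf("failed removing presence data: %v", err)
	}
	for _, j := range a.launched {
		if err := a.registry.ClearJobTarget(j, a.machID); err != nil {
			log.Printf("failed clearing job %s: %v", j, err)
		}
	}
}
```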

@bcwaldon
Contributor Author

For now, I think we should treat all downtime as permanent. I do not believe we can guarantee downtime will be temporary enough to not have an impact on the system.

We do need to handle the case where an agent loses network connectivity to etcd and its presence data times out, causing its jobs to get scheduled elsewhere. That's outside of the scope of what I'm trying to accomplish with this PR, but it's worth thinking about.
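
One common way to implement presence data that can time out is a periodically refreshed key with a TTL; the sketch below illustrates the idea against a hypothetical interface, not fleet's actual API. If the agent is partitioned from etcd, the refreshes stop, the key expires, and the engine treats the agent as lost.

```go
// Hypothetical sketch of TTL-based presence; names are illustrative only.
package agent

import "time"

type PresenceRegistry interface {
	SetMachinePresence(machID string, ttl time.Duration) error // hypothetical
}

// Heartbeat refreshes the presence key until stop is closed. If etcd becomes
// unreachable the refreshes fail, the key expires after ttl, and the engine
// reschedules this agent's jobs elsewhere.
func Heartbeat(reg PresenceRegistry, machID string, ttl time.Duration, stop <-chan struct{}) {
	ticker := time.NewTicker(ttl / 2)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			// Errors are ignored here for brevity; a real agent would log them.
			reg.SetMachinePresence(machID, ttl)
		case <-stop:
			return
		}
	}
}
```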

@jonboulle
Contributor

For now, I think we should treat all downtime as permanent. I do not believe we can guarantee downtime will be temporary enough to not have an impact on the system.

I think it's fine to make this assumption at this early stage (I mean, we already do); but we should absolutely leave the door open to supporting "temporary downtime" in future. It's a powerful feature (e.g. consider the possibility of having jobs with varying SLAs), and would be a clear differentiator from other systems like Mesos, where any kind of transient failure results in a bunch of scheduling churn (because any "loss" is immediately and permanently irrevocable)

@jonboulle
Contributor

the fleet agent must attempt to catch back up and react to any events it missed while it was down.

Do you have something in mind for how it could conceivably "catch up" on events?

@jonboulle
Contributor

When we start exploring temporary downtime further, it'll eventually be useful to make the further distinction between "anticipated" (fleet upgrades, CoreOS graceful reboots, reconfiguration) and "unanticipated" (cold reboots, loss of network connectivity).

@jonboulle
Contributor

For now, I think we should treat all downtime as permanent. I do not believe we can guarantee downtime will be temporary enough to not have an impact on the system.

I think it's fine to make this assumption at this early stage (I mean, we already do);

Actually, I lied - we need to do this better right now :-) - e.g. by binding the life of the fleet agent to the jobs it is running (as I think we discussed), since right now a pkill -9 fleet (or equivalent) will leave a bunch of jobs running that should not be (under this assumption).
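
To make that gap concrete: a graceful purge can only run on signals the process is able to catch, so SIGKILL (pkill -9) bypasses it entirely. A rough sketch of that limitation, with a hypothetical purge hook standing in for this PR's shutdown work:

```go
// Sketch of why a graceful purge alone is not enough: SIGKILL never reaches a
// handler, so cleanup for that case must live outside the agent process,
// e.g. by binding job lifetimes to the agent itself.
package main

import (
	"os"
	"os/signal"
	"syscall"
)

func main() {
	sigc := make(chan os.Signal, 1)
	// SIGTERM and SIGINT can be caught and give the agent a chance to purge.
	// SIGKILL cannot be caught, blocked, or ignored.
	signal.Notify(sigc, syscall.SIGTERM, syscall.SIGINT)

	<-sigc
	purge()
	os.Exit(0)
}

// purge is a hypothetical placeholder for clearing presence data and stopping launched units.
func purge() {}
```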

@bcwaldon
Contributor Author

@jonboulle All I've done on the planning side of things here is sit and get my thoughts down in that markdown document. I'd like to land this PR as-is, as it does improve the situation. Let's brainstorm this further next Monday.

@jonboulle
Contributor

I missed that there was something to review here - I thought you just wanted some thoughts ;-)

@bcwaldon
Contributor Author

Both, please!

```go
// locally so it doesn't confuse later state calculations
for _, j := range launched {
	a.registry.ClearJobHeartbeat(j)
	//TODO(bcwaldon): The agent should not have to ask the Registry
```

Contributor

yeahhhh, we really need to track this better

Contributor Author

This could easily move into the AgentState.
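
For illustration, a minimal sketch of what tracking launched jobs in the AgentState could look like, so the agent would not need to ask the Registry at shutdown. The field and method names here are hypothetical, not fleet's actual AgentState API:

```go
// Hypothetical sketch: keep the set of launched jobs in AgentState so the
// agent can purge them at shutdown without a Registry query.
package agent

import "sync"

type AgentState struct {
	mu       sync.Mutex
	launched map[string]bool // job name -> launched by this agent
}

func (s *AgentState) TrackLaunched(jobName string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.launched == nil {
		s.launched = make(map[string]bool)
	}
	s.launched[jobName] = true
}

// LaunchedJobs returns the locally tracked jobs, replacing a Registry lookup.
func (s *AgentState) LaunchedJobs() []string {
	s.mu.Lock()
	defer s.mu.Unlock()
	jobs := make([]string, 0, len(s.launched))
	for j := range s.launched {
		jobs = append(jobs, j)
	}
	return jobs
}
```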

@jonboulle
Contributor

:shipit:

bcwaldon added a commit that referenced this pull request Apr 19, 2014
fix(purge): Simplify graceful agent shutdown
bcwaldon merged commit 5cf2c60 into coreos:master on Apr 19, 2014
bcwaldon deleted the fix-agent-shutdown branch on April 19, 2014 at 00:53