This repository has been archived by the owner on Jan 30, 2020. It is now read-only.

fix(purge): Simplify graceful agent shutdown #332

Merged
bcwaldon merged 1 commit into coreos:master from fix-agent-shutdown on Apr 19, 2014

Conversation

bcwaldon
Contributor

No description provided.

@bcwaldon
Contributor Author

not sure if this is correct yet

@jonboulle
Contributor

Can we spec out the expected behaviour in the functional tests?

@bcwaldon
Contributor Author

@jonboulle It'll be easier to tackle this after we land the state-machine changes.

@bcwaldon
Contributor Author

BTW, the TestNodeShutdown functional test will be broken until we sort this out.

bcwaldon mentioned this pull request on Apr 17, 2014
@bcwaldon
Contributor Author

Got some thoughts down:

# fleet going down

There exist several situations where a fleet agent may halt operation.
We can group these situations together based on whether the agent downtime is permanent or temporary.

## permanent downtime

A fleet agent may go down and stay down for several reasons:

* power loss
* node decommission

In any of these situations, any jobs scheduled to the agent must be rescheduled elsewhere.
The engine is going to rely on a loss of agent presence to trigger rescheduling.
If an agent is shutting down gracefully, explicitly removing presence data will expedite the rescheduling and make the cluster appear more responsive.

## temporary downtime

A fleet agent may go down temporarily for multiple reasons:

* upgrading fleet code
* upgrading CoreOS
* reconfiguration
* loss of network connectivity

As long as the downtime can be guaranteed to be temporary, there is no need to take additional action before halting operation.
When rejoining the cluster, the fleet agent must attempt to catch back up and react to any events it missed while it was down.
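
To make the permanent-downtime path above concrete, here is a minimal sketch of what "explicitly removing presence data" before exit could look like. The Registry interface and its method names are hypothetical illustrations, not fleet's actual API; the point is only the ordering: drop presence data first so the engine can begin rescheduling immediately instead of waiting for a timeout.

```go
// Hypothetical sketch only: Registry method names are illustrative, not fleet's API.
package agent

import "log"

type Registry interface {
	RemoveMachineState(machID string) error      // hypothetical: delete presence data
	ClearJobTarget(jobName, machID string) error // hypothetical: unschedule a job from this machine
}

type Agent struct {
	machID   string
	registry Registry
	launched []string // jobs currently running on this agent
}

// Purge runs during graceful shutdown (the permanent-downtime path above).
func (a *Agent) Purge() {
	if err := a.registry.RemoveMachineState(a.machID); err != nil {
		log.Printf("failed removing presence data: %v", err)
	}
	for _, j := range a.launched {
		if err := a.registry.ClearJobTarget(j, a.machID); err != nil {
			log.Printf("failed clearing job %s: %v", j, err)
		}
	}
}
```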

@bcwaldon
Contributor Author

For now, I think we should treat all downtime as permanent. I do not believe we can guarantee downtime will be temporary enough to not have an impact on the system.

We do need to handle the case where an agent loses network connectivity to etcd and its presence data times out, causing its jobs to get scheduled elsewhere. That's outside of the scope of what I'm trying to accomplish with this PR, but it's worth thinking about.
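
One common way to implement presence data that can time out is a periodically refreshed key with a TTL; the sketch below illustrates the idea against a hypothetical interface, not fleet's actual API. If the agent is partitioned from etcd, the refreshes stop, the key expires, and the engine treats the agent as lost.

```go
// Hypothetical sketch of TTL-based presence; names are illustrative only.
package agent

import "time"

type PresenceRegistry interface {
	SetMachinePresence(machID string, ttl time.Duration) error // hypothetical
}

// Heartbeat refreshes the presence key until stop is closed. If etcd becomes
// unreachable the refreshes fail, the key expires after ttl, and the engine
// reschedules this agent's jobs elsewhere.
func Heartbeat(reg PresenceRegistry, machID string, ttl time.Duration, stop <-chan struct{}) {
	ticker := time.NewTicker(ttl / 2)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			// Errors are ignored here for brevity; a real agent would log them.
			reg.SetMachinePresence(machID, ttl)
		case <-stop:
			return
		}
	}
}
```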

@jonboulle
Contributor

For now, I think we should treat all downtime as permanent. I do not believe we can guarantee downtime will be temporary enough to not have an impact on the system.

I think it's fine to make this assumption at this early stage (I mean, we already do); but we should absolutely leave the door open to supporting "temporary downtime" in future. It's a powerful feature (e.g. consider the possibility of having jobs with varying SLAs), and would be a clear differentiator from other systems like Mesos, where any kind of transient failure results in a bunch of scheduling churn (because any "loss" is immediately and permanently irrevocable)

@jonboulle
Contributor

the fleet agent must attempt to catch back up and react to any events it missed while it was down.

Do you have something in mind for how it could conceivably "catch up" on events?

@jonboulle
Contributor

When we start exploring temporary downtime further, it'll eventually be useful to make the further distinction between "anticipated" (fleet upgrades, CoreOS graceful reboots, reconfiguration) and "unanticipated" (cold reboots, loss of network connectivity).

@jonboulle
Contributor

For now, I think we should treat all downtime as permanent. I do not believe we can guarantee downtime will be temporary enough to not have an impact on the system.

I think it's fine to make this assumption at this early stage (I mean, we already do);

Actually, I lied - we need to do this better right now :-) - e.g. by binding the life of the fleet agent to the jobs it is running (as I think we discussed), since right now a pkill -9 fleet (or equivalent) will leave a bunch of jobs running that should not be (under this assumption).
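
To make that gap concrete: a graceful purge can only run on signals the process is able to catch, so SIGKILL (pkill -9) bypasses it entirely. A rough sketch of that limitation, with a hypothetical purge hook standing in for this PR's shutdown work:

```go
// Sketch of why a graceful purge alone is not enough: SIGKILL never reaches a
// handler, so cleanup for that case must live outside the agent process,
// e.g. by binding job lifetimes to the agent itself.
package main

import (
	"os"
	"os/signal"
	"syscall"
)

func main() {
	sigc := make(chan os.Signal, 1)
	// SIGTERM and SIGINT can be caught and give the agent a chance to purge.
	// SIGKILL cannot be caught, blocked, or ignored.
	signal.Notify(sigc, syscall.SIGTERM, syscall.SIGINT)

	<-sigc
	purge()
	os.Exit(0)
}

// purge is a hypothetical placeholder for clearing presence data and stopping launched units.
func purge() {}
```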

@bcwaldon
Contributor Author

@jonboulle All I've done on the planning side of things here is sit and get my thoughts down in that markdown document. I'd like to land this PR as-is, as it does improve the situation. Let's brainstorm this further next Monday.

@jonboulle
Contributor

I missed that there was something to review here - I thought you just wanted some thoughts ;-)

@bcwaldon
Contributor Author

Both, please!

```go
// locally so it doesn't confuse later state calculations
for _, j := range launched {
	a.registry.ClearJobHeartbeat(j)
	//TODO(bcwaldon): The agent should not have to ask the Registry
```

Contributor

yeahhhh, we really need to track this better

Contributor Author

This could easily move into the AgentState.
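
For illustration, a minimal sketch of what tracking launched jobs in the AgentState could look like, so the agent would not need to ask the Registry at shutdown. The field and method names here are hypothetical, not fleet's actual AgentState API:

```go
// Hypothetical sketch: keep the set of launched jobs in AgentState so the
// agent can purge them at shutdown without a Registry query.
package agent

import "sync"

type AgentState struct {
	mu       sync.Mutex
	launched map[string]bool // job name -> launched by this agent
}

func (s *AgentState) TrackLaunched(jobName string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.launched == nil {
		s.launched = make(map[string]bool)
	}
	s.launched[jobName] = true
}

// LaunchedJobs returns the locally tracked jobs, replacing a Registry lookup.
func (s *AgentState) LaunchedJobs() []string {
	s.mu.Lock()
	defer s.mu.Unlock()
	jobs := make([]string, 0, len(s.launched))
	for j := range s.launched {
		jobs = append(jobs, j)
	}
	return jobs
}
```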

@jonboulle
Contributor

:shipit:

bcwaldon added a commit that referenced this pull request Apr 19, 2014
fix(purge): Simplify graceful agent shutdown
bcwaldon merged commit 5cf2c60 into coreos:master on Apr 19, 2014
bcwaldon deleted the fix-agent-shutdown branch on April 19, 2014 at 00:53