
spec: define pod lifecycle #276

jonboulle opened this issue Apr 1, 2015 · 22 comments

@jonboulle
Contributor

We need to address several aspects of pod lifecycles in the spec:

  • Garbage collection lifecycle of the pod filesystem, including:
    • Format of app exit codes and signals (see the sketch after this list)
    • The refcounting plan for resources consumed by the ACE, such as volumes
  • Lifecycle of apps within the pod, including:
    • Whether the pod ends when all apps exit or when the first app exits
    • Start and stop order
    • Can a pod's apps be updated while the pod is running?
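
To make the first two sub-bullets slightly more concrete, here is a minimal sketch (hypothetical names only, not proposed spec language) of what a recorded per-app exit status could look like:

```go
// Hypothetical sketch: one way a runtime could record per-app exit status,
// which a garbage-collection pass could consult before reclaiming the pod
// filesystem and refcounted resources such as volumes. Names are
// illustrative, not spec language.
package lifecycle

import "time"

// AppExitStatus records how a single app in the pod terminated.
type AppExitStatus struct {
	Name     string    // app name from the pod manifest
	ExitCode int       // process exit code, if the app exited normally
	Signal   string    // terminating signal (e.g. "SIGKILL"), if any
	ExitedAt time.Time // when the runtime observed the exit
}

// PodExitRecord groups the per-app statuses for one pod instance.
type PodExitRecord struct {
	PodUUID string
	Apps    []AppExitStatus
}
```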
@jonboulle jonboulle added this to the v1.0.0 milestone Apr 1, 2015
jonboulle added a commit to jonboulle/spec that referenced this issue Apr 1, 2015
Replace TODOs in spec text with GitHub issues:
- appc#276
- appc#277
- appc#278

Change block isolator wording to apply to specific devices (looks like
the previous wording was copy-pasted from the network isolators)
@jonboulle
Contributor Author

Another aspect to this, which has come up in various discussion channels, is the relationship between the pod's lifecycle and identity. What is a pod's identity, and to what extent should it be immutable?

Should a pod be considered a resource envelope (similar to the alloc in Google's Borg), which individual applications can be added to or removed from? Or is a pod's identity more tightly linked to its constituent applications and their respective lifetimes; i.e. if a single application dies within a pod, or it is desired to stop one of the applications within a pod, then it is necessary to tear the entire pod down and create it again? (Coming back to the old pets vs cattle metaphor: the pod/resource-envelope that can be mutated at will is the pet, while the pod that must be destroyed and recreated is the cattle.)

Opening the door to things like "updating an application within a pod" fundamentally throws into question the whole idea of identity (by updating or stopping/starting every application in the pod in turn, we can easily end up with an entirely new set of applications in the pod, so what is really left to link it back to its original identity?).

I am personally strongly in favour of leaning towards the cattle side of the equation and retaining the pod as the fundamental scheduling unit; it is inherently easier to reason about and implement, and we don't need to define a complicated "inter-pod" API/ABI (i.e. what operations are permissible for individual apps vs on the pod as a whole). But there are practical arguments for allowing at least some operations to be performed at a greater level of granularity.

/cc @thockin who has plenty of thoughts on this, /cc @vbatts

@kelseyhightower
Collaborator

@jonboulle Will there be support for just containers without pods?

@jonboulle
Contributor Author

@kelseyhightower please elaborate

@vbatts
Contributor

vbatts commented Apr 27, 2015

Per chat, I think as pods become like templates, there ought to be a bullet that covers pod discovery and signature. At this point they are seeming like their own *.acp or similar, which would at a minimum contain a /pod manifest.

@jonboulle
Contributor Author

@vbatts not to derail this thread too much (since I'd consider that a separate issue), but the nice part about that direction is that the spec then addresses the use case of related efforts like nulecule - it clearly fulfils those listed goals:

  • Provide a simple, flexible way to describe a multi-container application, including all dependencies.
  • Provide a way for an application designer to describe an application while allowing a sysadmin a clear way to parameterize the deployment at runtime.
  • Provide a versioned specification for developer tools and runtime implementations to agree on.

@vbatts
Contributor

vbatts commented Apr 27, 2015

As for cattle versus pets, one question is the content-addressed image requirement in the pod manifest, which largely duplicates the image manifest components. There may need to be a floating aspect to it: for instance, I may want to enforce that I'm using image foo.com/bar from identity foo.com, but not necessarily a precise build.
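
To illustrate that distinction, a hedged sketch (the Go field names here are illustrative, not the actual pod manifest schema):

```go
// Sketch of pinning an app to an exact, content-addressed image versus
// "floating" on a name plus signing identity/labels. Illustrative only;
// these are not the real schema types.
package podref

// ImageRef describes which image an app in a pod runs.
type ImageRef struct {
	Name   string            // e.g. "foo.com/bar" -- may float across builds
	ID     string            // e.g. "sha512-..." -- pins one exact build (optional)
	Labels map[string]string // e.g. {"version": "1.2.0", "os": "linux"}
}

// Pinned reports whether the reference is content-addressed (one exact build)
// or floating (the runtime may resolve the name to a newer build).
func (r ImageRef) Pinned() bool {
	return r.ID != ""
}
```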

@xiang90
Contributor

xiang90 commented Apr 27, 2015

> Should a pod be considered a resource envelope (similar to the alloc in Google's Borg), which individual applications can be added to or removed from?

From our discussion at #google-container around 4:22 PDT, a pod is a fixed resource envelope and a spec. An alloc (a flexible resource envelope) is too flexible, and almost everyone just wants a one-to-one mapping from a task (pod) to an alloc.

> if a single application dies within a pod, or it is desired to stop one of the applications within a pod, then it is necessary to tear the entire pod down and create it again

In a pod, there is a restart policy. The lifecycle of a pod should not be the same as that of a container inside it. So I guess if a container dies and needs a restart, we should only restart the container without tearing down the pod. (So in our spec, we should also support this use case?)

If the pod spec is changed, we should tear down the entire pod and restart all.

@thockin
Contributor

thockin commented Apr 28, 2015

A whole bunch of topics. It's hard to say which are right answers and which are simply answers that we chose before.

Identity. Kubernetes pods have a mortal identity - once a pod is bound to a node, that pod's UID (GUID) is burned and never ever reused. If that pod dies and is restarted on the same node it is the same pod (same UID). If that pod is no longer viable on that node, for whatever reason, the pod is destroyed and a new one is created with a new UID. Now, this is somewhat different than what some people expect, and the point has been debated in recent weeks, but I believe fundamentally that this is the correct model.

Add/remove apps from a container. This is something we naturally support in Borg because of the Alloc abstraction, but we do not support in Kubernetes (but only by punt, not by intent). I could see an argument for not supporting add/remove of containers, but it has to consider...

Updates. If you can't update a pod's (or apps within that pod) resource requests without a restart, you've blown it. That is so fundamental to how we operate at scale that we would grind to a halt without it. Once you accept that, it's an easy walk to seeing how much you can support without a restart. We get a lot of value from in-place updates, and we go to great lengths to make all updates as minimally invasive as possible. One of the things that Kubernetes does differently from Borg is that replicas of a pod are not tightly coupled in a grouping abstraction. This means that, once birthed, a Pod is free to live its own life. One replica can get more memory while another gets less. They can be manipulated independently, but they are considered fungible from an administrative point of view (replication controller).

So now, can you update the version of a container that an app-in-pod is running (e.g. v1 -> v2)? Obviously that requires restarting the one app. Does it require killing the pod? Hopefully not. If not, can you update the actual image name that an app-in-pod is running (e.g. nginx -> haproxy)? If so, is that not the same as adding and removing? There's a line to draw somewhere, but it's really not an obvious line. Sometimes you need to allow things you think are a bad idea and let policy be set by higher levels of the stack.

> so what is really left to link it back to its original identity?

Is this really an important question? Just because something is treated like cattle does not mean you can't individually care for the cattle. What harm comes of this evolution over time?

> If the pod spec is changed, we should tear down the entire pod and restart all.

You might get away with that for now, but it should not be a resting state. There are lots of things about a pod that should not require app restarts to change.

@kelseyhightower
Collaborator

@thockin Would I be correct in thinking that the fact that Docker does not implement pods contributed to the success of pods in Kubernetes? I could imagine a world where every container runtime implemented pods in a different way, which would make it really hard for Kubernetes to manage containers at the granularity you've outlined above.

@thockin
Contributor

thockin commented Apr 28, 2015

Kelsey, an interesting take on it. Docker gave us primitives to build higher abstractions. If Docker gave us similar but different pods, we'd probably have to abstract or adapt.


@xiang90
Contributor

xiang90 commented Apr 28, 2015

@thockin

> Updates. If you can't update a pod's (or apps within that pod) resource requests without a restart, you've blown it. That is so fundamental to how we operate at scale that we would grind to a halt without it.

Doesn't update simply mean a restricted remove-and-add? The image name is not the identity of the container, statically or dynamically, right? So in reality this restriction only enforces the total number of containers in the original pod.

> They can be manipulated independently

This means the same pod (as viewed when it gets scheduled and replicated) might diverge, and that is by design?

> There's a line to draw somewhere, but it's really not an obvious line.

Once we allow updates, the line is super unclear, right?

@jonboulle
Contributor Author

Well, let's not lose track of the fact that the whole point of the spec is to define this stuff and try to encourage consistency in implementations :-)


@thockin
Contributor

thockin commented Apr 28, 2015

> Updates. If you can't update a pod's (or apps within that pod) resource requests without a restart, you've blown it. That is so fundamental to how we operate at scale that we would grind to a halt without it.

> Does not update simply mean a restricted remove and add? The image name is not the identification of the container statically or dynamically, right? So this restriction only enforces the total number of containers in the original pod in reality.

No, update here means "adjust the cgroup settings, don't touch the running processes". Why would you kill my process just to increase the memory limit? Even decreases can be done safely, though slowly.

> They can be manipulated independently

> This means the same pod (in the view of when it get scheduled and replicated) might diverge and it is by design?

Yes. Particularly in things like resources. You could maybe draw the line at version updates. It seems like a reasonable starting place. Don't underestimate the value of uptime. Once people have it, they don't want to give it up.

@xiang90
Contributor

xiang90 commented Apr 28, 2015

@thockin

Correct me if I misunderstand something.

So, during the lifecycle of a pod (without tearing down the pod or killing all its containers), at least:

  1. we should be able to update its resources (once we support this, a pod is not immutable)
  2. we should be able to restart a failed container based on the restart policy

We might also want to enable in-place upgrade of a container inside a pod (as I mentioned, this is like a restricted remove-then-add).

@thockin
Contributor

thockin commented Apr 28, 2015

"at least" is right. It's hard to say "that's all" and be confident. :)


@vbatts
Contributor

vbatts commented Apr 28, 2015

Start/stop order could be arbitrary, but often will not be. We likely just need a vocabulary around it; maybe in annotations we could have After, Before, etc. Not terribly unlike systemd.unit.
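
A sketch of what such a vocabulary could drive, assuming a hypothetical per-app "after" relation derived from annotations; a runtime could compute a start order with a topological sort (stop order would simply be the reverse):

```go
// Derive an app start order from hypothetical "after" dependencies, in the
// spirit of systemd's After=. The annotation key/structure is illustrative.
package startorder

import "fmt"

// StartOrder returns app names so that every app appears after everything it
// is declared to start "after"; e.g. if after["web"] lists "db", then "db"
// precedes "web" in the result.
func StartOrder(apps []string, after map[string][]string) ([]string, error) {
	state := map[string]int{} // 0 = unvisited, 1 = in progress, 2 = done
	var order []string
	var visit func(name string) error
	visit = func(name string) error {
		switch state[name] {
		case 1:
			return fmt.Errorf("dependency cycle involving %q", name)
		case 2:
			return nil
		}
		state[name] = 1
		for _, dep := range after[name] {
			if err := visit(dep); err != nil {
				return err
			}
		}
		state[name] = 2
		order = append(order, name)
		return nil
	}
	for _, app := range apps {
		if err := visit(app); err != nil {
			return nil, err
		}
	}
	return order, nil
}
```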

@jonboulle
Contributor Author

> Is this really an important question? Just because something is treated like cattle does not mean you can't individually care for the cattle. What harm comes of this evolution over time?

The harm comes if the individual cattle in the herd start to diverge from one another; then suddenly, when one of them catches smallpox and falls over dead, you can't simply replace it with another from a different herd because you've nurtured it into a special unique snowflake status. Also you can't trivially increase your herd size and expect an O(f(N)) improvement in return because there's no longer any consistency in N. I am honestly a bit baffled by how to reconcile this sentence (from replication-controller.md) without defining terms much more explicitly:

> Pods created by a replication controller are intended to be fungible and semantically identical, though their configurations may become heterogeneous over time.

sounds like cognitive dissonance :-). Unless we explicitly restrict which aspects of the configuration may become heterogeneous I can't see how fungibility can possibly be maintained. (I tried to find what guidance Kubernetes has on this but came away with an overall impression of ambivalence/punting, please point me in the right direction if I'm missing something obvious.)

> So now - can you update the version of a container that an app-in-pod is running (e.g. v1 -> v2)? Obviously that requires restarting the one app. Does it require killing the pod? Hopefully not. If not, can you update the actual image name that an app-in-pod is running (e.g. nginx -> haproxy). If so, that not the same as adding and removing? There's a line to draw somewhere, but it's really not an obvious line. Sometimes you need to allow things you think are a bad idea and let policy be set by higher levels of the stack.

Updating isolators is absolutely something we intend to support/address in the spec (it has its own issue in #54). AIUI this can mostly be done online (for argument's sake let's assume that the equivalent of a sigstop/sigcont might be necessary, but not considered invasive to the processes), and I'd categorise this loosely as an "execution runtime adjustment" or similar, something that's not necessarily elemental to the pod identity (with the caveat in my final paragraph).

For me there's a much clearer line with updating apps, because that requires filesystem modification and (arguably arbitrary) intervention in the life of processes in the pod. I think you kinda glossed over this with "requires restarting the one app"; it strikes me that the reality is much more nuanced. Re: updating nginx->haproxy, since that's mutating every important facet of the identity of the app (name, image, processes), it's clearly equivalent to adding+removing as far as I'm concerned. I still feel like I need more convincing that this is something with a strong enough use case to support, at least in a 1.0 version of the spec.


tl;dr, I say let's draw the line at filesystem/process lifecycle mutations now, but clearly permit online resource isolator adjustments.


Having said all that: the only major blocker here for me w.r.t the spec allowing pods to change over time is that we need to be clear that the reified Pod Manifest exposed by the Metadata Service is no longer a definitive runtime identity of the state of the pod (unless we were to stipulate that it must be updated accordingly as the pod changes, but that sounds like a whole new can of worms..)

@thockin
Contributor

thockin commented Apr 29, 2015

> Is this really an important question? Just because something is treated like cattle does not mean you can't individually care for the cattle. What harm comes of this evolution over time?

> The harm comes if the individual cattle in the herd start to diverge from one another; then suddenly, when one of them catches smallpox and falls over dead, you can't simply replace it with another from a different herd because you've nurtured it into a special unique snowflake status. Also you can't trivially increase your herd size and expect an O(f(N)) improvement in return because there's no longer any consistency in N. I am honestly a bit baffled by how to reconcile this sentence (from replication-controller.md) without defining terms much more explicitly:

> Pods created by a replication controller are intended to be fungible and semantically identical, though their configurations may become heterogeneous over time.

It's about perspective. If you are looking at an ocean of pods, they are individuals. They can be managed as individuals (e.g. auto-sized, health-checked, killed). If you are looking at a replication controller, the pods it selects for are assumed semantically identical (pure replicas) and fungible (any one of them can be killed with ~equal cost). This does allow for people to more radically diverge their pods (e.g. change what containers are running in some but not all replicas), but they do so at their own peril. If they pick one replica to be their special master instance, they risk their master being killed because the replication controller can't know their choice.

This stems in part from experience with the auto-control system being unhappy when it had to consider 100 replicas with varying load (whether because of locality or affinity or imperfect sharding) and try to make decisions like "should I add more memory?" for the whole set.

> sounds like cognitive dissonance :-). Unless we explicitly restrict which aspects of the configuration may become heterogeneous I can't see how fungibility can possibly be maintained. (I tried to find what guidance Kubernetes has on this but came away with an overall impression of ambivalence/punting, please point me in the right direction if I'm missing something obvious.)

We have very restricted update semantics for now because it is all we can practically implement. We should tread carefully here, though it's somewhat easier for me to iterate some software than for you to iterate a spec that underpins many implementations.

We only allow updating the image name/tag for now.

> So now - can you update the version of a container that an app-in-pod is running (e.g. v1 -> v2)? Obviously that requires restarting the one app. Does it require killing the pod? Hopefully not. If not, can you update the actual image name that an app-in-pod is running (e.g. nginx -> haproxy). If so, that not the same as adding and removing? There's a line to draw somewhere, but it's really not an obvious line. Sometimes you need to allow things you think are a bad idea and let policy be set by higher levels of the stack.

> Updating isolators is absolutely something we intend to support/address in the spec (has its own issue in #54). AIUI this can mostly be done online (for argument's sake let's assume that the equivalent of a sigstop/sigcont might be necessary, but not considered invasive to the processes), and I'd categorise this loosely as an "execution runtime adjustment" or so that's not necessarily elemental to the pod identity (with the caveat in my final paragraph).

> For me there's a much clearer line with updating apps because that requires filesystem modification and (arguably arbitrary) intervention in the life of processes in the pod. I think you kinda glossed over this with "requires restarting the one app", strikes me that the reality is much more nuanced. Re: updating nginx->haproxy - since that's mutating every important facet of the identity of the app (name, image, processes) it's clearly equivalent to adding+removing as far as I'm concerned. I still feel like I need more convincing this is something with a strong enough use case to support, at least in a 1.0 version of the spec.

It's hard to point to concrete cases in docker space since they don't really exist yet. I will say that people do this sort of update frequently in borg. They can rev one of their containers while keeping state (shared volumes, shared memory) intact. They can update helper apps (log-savers, etc) without killing their main apps. They can update apps without risking a re-schedule (latency, overcrowding). Some even do updates of a server by adding a new rev of the app, starting it, communicating between the two apps to transfer state, and finally terminating the old rev. This is actually very critical to the operation of some very large systems I shouldn't name, but I know you know.

Maybe these sorts of ultra-HA apps won't emerge in the rest of the world? I doubt it, but maybe...

> tl;dr, I say let's draw the line at filesystem/process lifecycle mutations now, but clearly permit online resource isolator adjustments.

That might be good enough for v1. I don't know what the plans are to rev this spec over time.


> Having said all that: the only major blocker here for me w.r.t the spec allowing pods to change over time is that we need to be clear that the reified Pod Manifest exposed by the Metadata Service is no longer a definitive runtime identity of the state of the pod (unless we were to stipulate that it must be updated accordingly as the pod changes, but that sounds like a whole new can of worms..)

@yifan-gu
Contributor

Coming back to this, I'm trying to propose something simple, just for the stop/exit of the pod:

As I mentioned in rkt/rkt#1407 (comment), we will need the container runtime to provide several options for when a pod exits:

  • Pod exits when any of the apps exits (no matter what the exit codes are)
  • Pod exits when all of the apps exit (no matter what the exit codes are)
  • Pod exits when any of the apps exits and the exit code is non-zero

The first option can be used if an upper-level orchestration tool wants the pod to always restart (e.g. Kubernetes' RestartPolicy=Always). Similarly, the second option can enable something like Kubernetes' RestartPolicy=Never, and the third one is for RestartPolicy=OnFailure (see the sketch below).
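
A minimal sketch of how a runtime might represent and evaluate these three options (the type, constant, and function names here are illustrative, not settled spec vocabulary):

```go
// Sketch of the three proposed pod exit conditions and how a runtime might
// evaluate them. All names here are illustrative, not settled spec vocabulary.
package exitpolicy

// ExitPolicy selects when the pod as a whole is considered exited.
type ExitPolicy int

const (
	ExitOnAny        ExitPolicy = iota // any app exits, regardless of exit code
	ExitOnAll                          // all apps have exited, regardless of exit codes
	ExitOnAnyFailure                   // any app exits with a non-zero exit code
)

// PodShouldExit decides whether the pod should exit, given the exit codes of
// the apps that have already terminated (app name -> exit code) and the total
// number of apps in the pod.
func PodShouldExit(policy ExitPolicy, exited map[string]int, totalApps int) bool {
	switch policy {
	case ExitOnAny:
		return len(exited) > 0
	case ExitOnAll:
		return len(exited) == totalApps
	case ExitOnAnyFailure:
		for _, code := range exited {
			if code != 0 {
				return true
			}
		}
	}
	return false
}
```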

@jonboulle
Contributor Author

how about

  • any
  • all
  • anyOnFailure

@jonboulle
Contributor Author

/cc @dchen1107

@yifan-gu
Contributor

@dchen1107 Follow-up PR and discussion is in #500
