spec: define pod lifecycle #276
Comments
Another aspect to this, which has come up in various discussion channels, is the relationship between a pod's lifecycle and its identity. What is a pod's identity, and to what extent should it be immutable? Should a pod be considered a resource envelope (similar to the alloc in Google's Borg), which individual applications can be added to or removed from? Or is a pod's identity more tightly linked to its constituent applications and their respective lifetimes; i.e. if a single application dies within a pod, or it is desired to stop one of the applications within a pod, is it then necessary to tear the entire pod down and create it again? (Coming back to the old pets-vs-cattle metaphor: the pod/resource-envelope that can be mutated at will is the pet; the pod which must be destroyed + recreated is the cattle.)

Opening the door to things like "updating an application within a pod" fundamentally throws into question the whole idea of identity (by updating or stopping/starting every application in the pod in turn, we can easily end up with an entirely new set of applications in the pod, so what is really left to link it back to its original identity?).

I am personally in strong favour of leaning towards the cattle side of the equation and retaining the pod as the fundamental scheduling unit; it is inherently easier to reason about and implement, and we don't need to define a complicated intra-pod API/ABI (i.e. what operations are permissible on individual apps vs on the pod as a whole). But there are practical arguments for allowing at least some operations to be performed at a finer level of granularity.

/cc @thockin who has plenty of thoughts on this, /cc @vbatts
@jonboulle Will there be support for just containers without pods?
@kelseyhightower please elaborate
per chat, I think as pods become like templates, there ought to be a bullet that covers pod discovery and signature. At this point, they are seeming like their own
@vbatts not to derail this thread too much (since I'd consider that a separate issue), but the nice part about that direction is that the spec then addresses the use case of related efforts like nulecule - it clearly fulfils those listed goals:
As for cattle versus pet, one question is the content-addressed image requirement in the pod manifest, which largely duplicates the image manifest components. There may need to be a floating aspect to it. For instance, I may want to enforce that I'm using image
From our discussion at #google-container around 4:22 PDT, a pod is a fixed resource envelope and a spec.
In a pod, there is a restart policy. The lifecycle of a pod should not be the same as that of a container inside it. So I guess if a container dies and needs a restart, we should only restart the container without tearing down the pod (so in our spec, we should also support this use case?). If the pod spec is changed, we should tear down the entire pod and restart everything.
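The distinction in the comment above can be sketched as a small decision function: a spec change recreates the whole pod, while a single container death only restarts that container. This is a minimal illustration, not appc spec behavior; all type and constant names (`RestartPolicy`, `PodEvent`, `Action`, etc.) are hypothetical.

```go
package main

import "fmt"

// RestartPolicy and the types below are hypothetical, for illustration only.
type RestartPolicy int

const (
	RestartNever RestartPolicy = iota
	RestartOnFailure
	RestartAlways
)

// PodEvent describes something that happened to a running pod.
type PodEvent struct {
	ContainerExited bool // a single container in the pod died
	ExitCode        int
	SpecChanged     bool // the pod spec itself was modified
}

type Action string

const (
	ActionNone             Action = "none"
	ActionRestartContainer Action = "restart-container" // keep the pod up
	ActionRecreatePod      Action = "recreate-pod"      // tear down and restart all
)

// decide maps an event to an action: a spec change tears down the whole pod,
// while a container death restarts only that container, per the policy.
func decide(p RestartPolicy, ev PodEvent) Action {
	if ev.SpecChanged {
		return ActionRecreatePod
	}
	if !ev.ContainerExited {
		return ActionNone
	}
	switch p {
	case RestartAlways:
		return ActionRestartContainer
	case RestartOnFailure:
		if ev.ExitCode != 0 {
			return ActionRestartContainer
		}
	}
	return ActionNone
}

func main() {
	fmt.Println(decide(RestartOnFailure, PodEvent{ContainerExited: true, ExitCode: 1}))
	fmt.Println(decide(RestartOnFailure, PodEvent{SpecChanged: true}))
}
```

The key design point is that the pod's lifecycle and a container's lifecycle are separate state machines, and only a change to the pod's own spec reaches the pod-level one.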
A whole bunch of topics. It's hard to say what are right answers and what are simply answers that we chose before.

Identity. Kubernetes pods have a mortal identity - once a pod is bound to a node, that pod's UID (GUID) is burned and never ever reused. If that pod dies and is restarted on the same node it is the same pod (same UID). If that pod is no longer viable on that node, for whatever reason, the pod is destroyed and a new one is created with a new UID. Now, this is somewhat different than what some people expect, and the point has been debated in recent weeks, but I believe fundamentally that this is the correct model.

Add/remove apps from a container. This is something we naturally support in Borg because of the Alloc abstraction, but we do not support in Kubernetes (but only by punt, not by intent). I could see an argument for not supporting add/remove of containers, but it has to consider...

Updates. If you can't update a pod's (or apps within that pod) resource requests without a restart, you've blown it. That is so fundamental to how we operate at scale that we would grind to a halt without it. Once you accept that, it's an easy walk to seeing how much you can support without a restart. We get a lot of value from in-place updates, and we go to great lengths to make all updates as minimally invasive as possible.

One of the things that Kubernetes does differently from Borg is that replicas of a pod are not tightly coupled in a grouping abstraction. This means that, once birthed, a Pod is free to live its own life. One replica can get more memory while another gets less. They can be manipulated independently, but they are considered fungible from an administrative point of view (replication controller).

So now - can you update the version of a container that an app-in-pod is running (e.g. v1 -> v2)? Obviously that requires restarting the one app. Does it require killing the pod? Hopefully not.
If not, can you update the actual image name that an app-in-pod is running (e.g. nginx -> haproxy)? If so, is that not the same as adding and removing? There's a line to draw somewhere, but it's really not an obvious line. Sometimes you need to allow things you think are a bad idea and let policy be set by higher levels of the stack.
Is this really an important question? Just because something is treated like cattle does not mean you can't individually care for the cattle. What harm comes of this evolution over time?
You might get away with that for now, but it should not be a resting state. There are lots of things about a pod that should not require app restarts to change.
@thockin Would I be correct in thinking the fact that Docker does not implement pods contributed to the success of pods in Kubernetes? I could imagine a world where every container runtime implemented pods in a different way, which would make it really hard for Kubernetes to manage containers at the granularity you've outlined above.
Kelsey, an interesting take on it. Docker gave us primitives to build
Doesn't update simply mean a restricted remove and add? The image name is not the identity of the container, statically or dynamically, right? So in reality this restriction only enforces the total number of containers in the original pod.

This means the same pod (from the point of view of when it gets scheduled and replicated) might diverge, and that is by design?

Once we allow the update, then the line is super unclear. Right?
well, let's not lose track of the fact that the whole point of the spec is
No, update here means "adjust the cgroup settings, don't touch the

Yes. Particularly in things like resources. You could maybe draw the
Correct me if I misunderstand something. So, during the lifecycle of a pod (without tearing down the pod or killing all its containers), at least:

We might also want to enable in-place upgrade of a container inside a pod (as I mentioned, this is like a restricted remove-then-add).
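The "restricted remove-then-add" reading of an in-place upgrade can be made concrete: the pod's UID (its identity) and its set of app names stay fixed, and only the image behind one app is swapped. The `Pod`/`App` types and `updateApp` helper below are illustrative sketches, not appc spec schema types.

```go
package main

import (
	"errors"
	"fmt"
)

// App and Pod are hypothetical types for illustration; in a real pod manifest
// the image would be a content-addressed image ID.
type App struct {
	Name  string
	Image string
}

// Pod's UID is its immutable identity; Apps may evolve within the envelope.
type Pod struct {
	UID  string
	Apps []App
}

// updateApp models "update = restricted remove-then-add": the app count and
// the app's name are preserved, and only the image reference changes. The
// pod itself (its UID, its resource envelope) is never torn down.
func updateApp(p *Pod, name, newImage string) error {
	for i := range p.Apps {
		if p.Apps[i].Name == name {
			p.Apps[i].Image = newImage
			return nil
		}
	}
	return errors.New("no such app: " + name)
}

func main() {
	p := &Pod{UID: "uid-1", Apps: []App{{Name: "web", Image: "sha512-aaa"}}}
	if err := updateApp(p, "web", "sha512-bbb"); err != nil {
		panic(err)
	}
	fmt.Println(p.UID, p.Apps[0].Image)
}
```

The restriction is visible in the signature: there is no way to change the number of apps or their names, only the image an existing app runs.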
"at least" is right. It's hard to say "that's all" and be confident. :) On Mon, Apr 27, 2015 at 11:04 PM, Xiang Li notifications@github.com wrote:
start/stop order could be arbitrary, but often will not be. We likely just need a vocabulary around it; maybe in annotations we could have After, Before, etc. Not terribly unlike
The harm comes if the individual cattle in the herd start to diverge from one another; then suddenly, when one of them catches smallpox and falls over dead, you can't simply replace it with another from a different herd because you've nurtured it into a special unique snowflake status. Also you can't trivially increase your herd size and expect an
sounds like cognitive dissonance :-). Unless we explicitly restrict which aspects of the configuration may become heterogeneous I can't see how fungibility can possibly be maintained. (I tried to find what guidance Kubernetes has on this but came away with an overall impression of ambivalence/punting, please point me in the right direction if I'm missing something obvious.)
Updating isolators is absolutely something we intend to support/address in the spec (it has its own issue in #54). AIUI this can mostly be done online (for argument's sake let's assume that the equivalent of a sigstop/sigcont might be necessary, but not considered invasive to the processes), and I'd categorise this loosely as an "execution runtime adjustment" or so, which is not necessarily elemental to the pod identity (with the caveat in my final paragraph).

For me there's a much clearer line with updating apps, because that requires filesystem modification and (arguably arbitrary) intervention in the life of processes in the pod. I think you kinda glossed over this with "requires restarting the one app"; it strikes me that the reality is much more nuanced. Re: updating nginx->haproxy - since that's mutating every important facet of the identity of the app (name, image, processes), it's clearly equivalent to adding+removing as far as I'm concerned. I still feel like I need more convincing that this is something with a strong enough use case to support, at least in a 1.0 version of the spec.

tl;dr, I say let's draw the line at filesystem/process-lifecycle mutations for now, but clearly permit online resource isolator adjustments.

Having said all that: the only major blocker here for me w.r.t. the spec allowing pods to change over time is that we need to be clear that the reified Pod Manifest exposed by the Metadata Service is no longer a definitive runtime identity of the state of the pod (unless we were to stipulate that it must be updated accordingly as the pod changes, but that sounds like a whole new can of worms..)
It's about perspective. If you are looking at an ocean of pods, they are individuals. They can be managed as individuals (e.g. auto-sized, health-checked, killed). If you are looking at a replication controller, the pods it selects for are assumed semantically identical (pure replicas) and fungible (any one of them can be killed with ~equal cost). This does allow for people to more radically diverge their pods (e.g. change what containers are running in some but not all replicas), but they do so at their own peril. If they pick one replica to be their special master instance, they risk their master being killed because the replication controller can't know their choice. This stems in part from experience with the auto-control system being unhappy when it had to consider 100 replicas with varying load (whether because of locality or affinity or imperfect sharding) and try to make decisions like "should I add more memory?" for the whole set.
We have very restricted update semantics for now because it is all we can practically implement. We should tread carefully here, though it's somewhat easier for me to iterate some software than for you to iterate a spec that underpins many implementations. We only allow updating the image name/tag for now.
It's hard to point to concrete cases in docker space since they don't really exist yet. I will say that people do this sort of update frequently in borg. They can rev one of their containers while keeping state (shared volumes, shared memory) intact. They can update helper apps (log-savers, etc) without killing their main apps. They can update apps without risking a re-schedule (latency, overcrowding). Some even do updates of a server by adding a new rev of the app, starting it, communicating between the two apps to transfer state, and finally terminating the old rev. This is actually very critical to the operation of some very large systems I shouldn't name, but I know you know. Maybe these sorts of ultra-HA apps won't emerge in the rest of the world? I doubt it, but maybe...
That might be good enough for v1. I don't know what the plans are to rev this spec over time.
Going back to this, trying to propose something simple and just for the stop/exit of the pod: as I mentioned in rkt/rkt#1407 (comment), we will need the container runtime to provide several options when a pod exits:

The first option can be used if any upper-level orchestration tool wants the pod to
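The concrete options are elided in the thread above, so purely as an illustrative sketch (the policy names and `podShouldExit` helper are assumptions, not the proposal from rkt/rkt#1407), an exit policy might look like a small enum the runtime consults whenever an app exits:

```go
package main

import "fmt"

// ExitPolicy values are illustrative only; they do not reproduce the
// concrete options discussed in rkt/rkt#1407.
type ExitPolicy int

const (
	ExitWhenAnyAppExits ExitPolicy = iota // pod exits as soon as one app exits
	ExitWhenAllAppsExit                   // pod exits only when every app has exited
)

// podShouldExit decides, given how many of the pod's apps are still running
// out of the total, whether the runtime should tear the pod down.
func podShouldExit(p ExitPolicy, running, total int) bool {
	switch p {
	case ExitWhenAnyAppExits:
		return running < total
	case ExitWhenAllAppsExit:
		return running == 0
	}
	return false
}

func main() {
	// One of three apps has exited: the policies disagree on what to do.
	fmt.Println(podShouldExit(ExitWhenAnyAppExits, 2, 3))
	fmt.Println(podShouldExit(ExitWhenAllAppsExit, 2, 3))
}
```

Whatever the final option set is, the useful property for orchestrators is that the pod's exit condition is declared up front rather than inferred per-runtime.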
how about
/cc @dchen1107
@dchen1107 Follow-up PR and discussion is in #500
We need to address several aspects of pod lifecycles in the spec: