As an operator I would like my apps to stay online during Kubernetes upgrades #600

braunsonm opened this issue Dec 21, 2020 · 54 comments

@braunsonm

braunsonm commented Dec 21, 2020

Is your feature request related to a problem? Please describe.
During Kubernetes upgrades, nodes are drained, which shuts down the pods running CF applications before they are started on a new node. This could mean an application goes up and down for the duration of the upgrade.

Describe the solution you'd like
Some way for developers who push apps to configure a PodDisruptionBudget so that their application can stay online. It should only be possible to configure this budget when there is more than 1 instance so that upgrades can complete.

At first I was thinking CF-for-k8s could just add a budget with a minimum available of 1, but on larger deployments that see a lot of traffic, that would probably cause 503s if the service got hammered on a single replica.

You should only be able to configure this if your replica count is more than 1. Configuring a minimum available when you only have a single replica will mean that upgrades cannot happen at all.
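For readers less familiar with PDBs, a minimal sketch of the kind of object being requested is shown below. The namespace, names, and labels are illustrative only and not necessarily what cf-for-k8s/Eirini actually uses:

```yaml
# Sketch only: policy/v1beta1 was current at the time; policy/v1 exists from Kubernetes 1.21 onward.
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb                       # illustrative name
  namespace: cf-workloads                # assumed app namespace
spec:
  minAvailable: 1                        # keep at least one instance running during voluntary disruptions
  selector:
    matchLabels:
      example.org/app-guid: my-app-guid  # hypothetical label; real cf-for-k8s labels may differ
```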

Describe alternatives you've considered
There are no alternatives. You can scale your CF apps and "hope for the best" that you won't have all of them unscheduled at the same time.


@cf-gitbot
Collaborator

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/176243873

The labels on this github issue will be updated when the story is started.

@jamespollard8
Contributor

Oh interesting, I'd expect that for production, HA (highly available) apps:

  • the default number of instances (as set via CF cli) would be 2 [set by the CF user]
  • and that the apps should already be fairly resilient to K8s cluster upgrades through standard K8s cluster draining and the nature of having 2+ app instances

@Birdrock @davewalter @acosta11 I'd be curious to hear y'all's input here.

@braunsonm
Author

braunsonm commented Jan 21, 2021

I think your second point is mostly true. However, there are a few instances where apps would still go offline even with 2+ instances.

  1. Imagine I have a 5 node cluster in Azure with 2 instances of a production app and I need to upgrade Kubernetes. I want the upgrade to go quickly so my AKS Surge Capacity is set to 2 nodes. If AKS chooses to drain the two nodes that contain my app, I will experience downtime and I do not get much of a choice or guarantees in Kubernetes that it will remain online during the drain.

  2. Imagine I have a 5 node cluster near capacity with 2 instances on a production app and I need to upgrade Kubernetes. Kubernetes makes no guarantees about the pod scheduling but does a "best effort" to spread the distribution of pods. It is possible that a cluster near resource constraints or with complex node affinities for other apps will have the 2 instances created on the same node. Then no matter my surge capacity the Kubernetes upgrade may take my production app offline during draining.

You can see how these scenarios can happen no matter the number of instances I choose in CF, and they may be more likely to happen depending on the surge capacity. For instance, scenario 1 is very unlikely to happen if you were to create 5 instances (since all 5 instances would have to be scheduled on just 2 nodes), but it is still possible and would be a waste of resources just to achieve HA. The second scenario is not very likely for the majority of workloads, but the first is.

cf-for-k8s specifically makes this downtime more likely because it uses StatefulSets to define CF apps. This means that a pod must be completely terminated before another can be scheduled. In other deployments like ReplicaSets this is not the case. In a ReplicaSet if a pod is in a terminating state another is scheduled immediately.

The following has some useful documentation that details best practices around this:

Curious what others perhaps with more K8S experience think though.

@Birdrock
Member

@braunsonm Thank you for the explanation - that makes sense. I'll take a deeper look into Eirini and see how it matches against Kubernetes recommendations. My feeling is that Kubernetes already has the machinery to handle situations like this, but we aren't properly taking advantage of it.

@braunsonm
Author

No problem @Birdrock. I actually just made a small edit which might be relevant to your investigation into Eirini.

cf-for-k8s specifically makes this downtime more likely because it uses StatefulSets to define CF apps. This means that a pod must be completely terminated before another can be scheduled. In other deployments like ReplicaSets this is not the case. In a ReplicaSet if a pod is in a terminating state another is scheduled immediately.

@loewenstein

Could we solve this without exposing PDBs to the end-user? I imagine that Eirini could automatically set up a PDB with a min available of 1, or of 50% of the desired instance count.
@braunsonm @cloudfoundry/eirini WDYT?

@voelzmo

voelzmo commented Jan 21, 2021

Absolutely agree on taking yet another look at the StatefulSet vs Deployment/ReplicaSet discussion. IIRC the only reason to use StatefulSets was to allow for addressing individual instances. /cc @herrjulz and also cross-referencing cloudfoundry/eirini#94

If we could deprecate routing to individual instances, I guess using Deployments could be a thing?

@bkrannich

Big +1 from my side on @voelzmo's comment. I recall a not-so-recent call with @herrjulz where, I believe, single-instance routing was indeed the primary reason to stick with StatefulSets.

@herrjulz

That's true. If we deprecate routing to individual instances, we could switch to Deployments instead of using StatefulSets. @bkrannich @voelzmo

@loewenstein

@herrjulz I guess this is still parallel to the question of PDBs, isn't it?

@herrjulz

@loewenstein yes, it is parallel. As PDBs are on a per-app basis, we could consider having Eirini create a PDB for every app that has more than one instance.

@braunsonm
Author

Could we solve this without exposing PDBs to the end-user?

50% could work but it makes assumptions about the load that an app can handle. I would prefer it being user configurable.

@loewenstein

@braunsonm one thought I had was that this might rather be a setting of the foundation instead of the individual app. If you run an app in production on 20 instances, you probably have a reason for it, and you'd likely fail to keep the app available if you drop below, say, 15 instances. If you didn't, why would you run 20 instances in the first place?

This might be different for staging, QA, and playground systems though.
In short: if you want app HA, you probably want a min available >50%; if you don't want HA, you might be fine without any PDB, or at least with a different minimum percentage.

WDYT?

@braunsonm
Author

@loewenstein Hmm, I'm confused by your reply. What you said makes perfect sense, but that's exactly why I was thinking it's better as an individual app setting rather than a foundation-wide one. I don't care about some app in my dev space going offline during an upgrade (I don't need a PDB), but for my prod app I would want control over the PDB. For exactly the reason you said: apps might be under different loads and need more instances available at any time.

@loewenstein

@braunsonm Good point. I was seeing dev foundation vs. prod foundation. With dev spaces and prod spaces in the same foundation, this is of course looking different.

I'd still prefer not to expose PDBs to app developers. They shouldn't know anything about Pods or the details of Kubernetes node updates. How's this handled with Diego, BTW?

@braunsonm
Author

@loewenstein Ah yeah, I still don't think that would be preferred. For instance, I might have an app with two instances but still not really worry about downtime (perhaps it's a consumer of a RabbitMQ queue and not user facing). I wouldn't want to make assumptions about what availability it needs just because some other app needs 75%.

I'd prefer not to expose PDBs either. Not sure how Diego handles this. The only thing I could think of would be a new manifest property for minAvailable or something that supports a number or percent like the PDB. Eirini then makes the PDB for us?
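Purely to illustrate the idea, such a manifest property might look like the snippet below; the minAvailable key is hypothetical and does not exist in CF today:

```yaml
applications:
- name: my-prod-app        # illustrative
  instances: 4
  minAvailable: 75%        # hypothetical property; Eirini would translate it into a PDB
```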

@9numbernine9

@loewenstein In Diego-land, an app developer doesn't need to specify anything more than instances: 2 in their app manifest (or cf push --instances 2). My understanding is that when a Diego cell needs to be shut down for maintenance/upgrades, if all the instances of the app are on that one cell, then they will be terminated sequentially and started back up on a different cell (or cells), using the health check defined for that app to know when the new instances are successfully migrated. Essentially, one instance of a given app is always guaranteed to be running as a result.
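For comparison, the entire availability-related configuration on the Diego side is the instance count in the app manifest (app name illustrative):

```yaml
applications:
- name: my-app      # illustrative
  instances: 2      # Diego keeps at least one instance running during cell evacuation
```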

FYI, I'm a colleague of @braunsonm and a CF-for-VMs operator, in case you're wondering where I'm coming from. 😄 In our CF-for-VMs foundations I don't think we've ever seen an app suffer total failure during a CF upgrade if it was running 2+ instances; this kind of behaviour is ultimately what @braunsonm (and I, by extension!) are looking to have with cf-for-k8s.

@bkrannich

Reading earlier comments in this thread, I came here to say what @9numbernine9 has meanwhile said: I believe that a good (IMHO: the best) default is to do what Diego does because we'll see people moving over from CF-for-VMs not expecting a behavior change when it comes to CF app scaling. If we later on want to add additional flexibility by considering something like dev vs. prod foundations this is fine, but I'd advocate for keeping the status quo first.

Re-reading @9numbernine9's comment, I'm not sure if it is suggesting to keep the exact same Diego behavior or if the suggestion is to at least keep an app instance up-and-running to be able to serve requests. As mentioned above, my strong preference would be to retain Diego behavior.

@loewenstein

As mentioned above, my strong preference would be to retain Diego behavior.

That's why I've added the question about Diego behavior. My guess would be Diego drain makes sure all apps are properly evacuated to other cells. Getting the exact behavior could get complicated, though.

@bkrannich

Adding in @PlamenDoychev, both for visibility but also to add comments around Diego draining behavior in case they have not been covered here already.

@9numbernine9

@bkrannich Sorry, I should've expressed myself more clearly! I don't necessarily think that emulating Diego's draining behaviour exactly should be a requirement, but providing a behaviour that keeps a subset of app instances alive during an upgrade probably should be.

In my experience, Diego's behaviour seems quite reasonable to me. If an app is deployed with instances: 4 in the manifest (e.g.), it means that the person deploying it probably wants at least one instance at all times, but they might also never want more than 4 either. (Consider scenarios where the app is constrained by outside resources, e.g. a data store that might perform significantly worse if there were more than 4 instances of the app at any time, or a database connection provided by a cloud database provider that limits the maximum number of connections that could be used at any time.)

Personally, Diego's current drain behaviour makes existing infrastructure upgrades (e.g. a CF upgrade or upgrading all the stemcells) the kind of activity that can be done during working hours without disruption to clients, whereas if we didn't have this behaviour we would need to be scheduling upgrades during off-hours - and I hate working weekends. 😆

@braunsonm
Author

A guarantee of only one instance could cause issues under load. Or are we discussing having a PDB with minAvailable = instances-1?

Thus, minimizing the amount of app disruption

@loewenstein

Depending on the ratio of app instances to k8s worker nodes, a PDB with a minimum of instances-1 is likely to block worker node updates, I think. When the instance count is double the number of workers, an optimal spread would mean there's no worker that can be shut down.

@bkrannich

@9numbernine9:

If an app is deployed with instances: 4 in the manifest (e.g.), it means that the person deploying it probably wants at least one instance at all times, but they might also never want more than 4 either.

So far, we have educated our users around the current Diego behavior which is that if they specify instances: 4 they express that they want 4 instances at least (and they will do the sizing of backing services accordingly). Our users (and this might be different for people who are not CF users today, have a K8s background and thus operate cf-for-k8s themselves or for people that both operate CF as well as their own apps) do not even know or care if an update is running - they care about their apps only and they expect updates to be a non-event for them. This includes the ability to keep processing the expected load they have forecasted.

As mentioned, I believe today's Diego behavior should be the default for cf-for-k8s as well (or at least there should be a system-wide option to retain this behavior) because we want people to upgrade to cf-for-k8s without too many changes (otherwise, why upgrade to cf-for-k8s and not alternatives for running stateless apps with buildpacks).

I think part of the discussion here is a different one: from a coding perspective, what options does K8s offer to achieve one or the other behavior once we have settled on a default? But I'd suggest making this a second step, informed by answering the question of "what do our users want"?

@herrjulz

herrjulz commented Jan 25, 2021

@loewenstein @bkrannich @9numbernine9 @braunsonm I just discussed this topic with the team and we realised that we already set PDBs, but we set them to a minimum of 1 available instance if the instance count is >1. Currently this is a hardcoded value.
We should find an appropriate default setting for the PDBs we create in Eirini, and we should expose the parameter in Eirini so it is configurable by the end-user.

@braunsonm
Author

@herrjulz @bkrannich

I agree that a default behaviour that matches Diego would be good, but I don't think that we should stop there. As @herrjulz said, I would like to see it configurable per app. A guarantee of only a single instance being available would take production workloads offline because of the increased load.

Also, interesting that you already default to a PDB of at least 1 instance! I didn't notice that before. In that case the default behaviour already is what Diego has, and this issue is more about improving on that, since having only a single instance available during an upgrade would result in downtime for higher-traffic apps.

@herrjulz

herrjulz commented Jan 25, 2021

With the current approach, we have one single setting for PDBs for every app. If we want to make the setting more individual (eg for every single app), the cloud controller would need to provide this information to Eirini such that Eirini can set the PDB for every app individually. This would require some work on the CC side.

@IvanHristov98

IvanHristov98 commented Jan 28, 2021

Hi folks 👋, a bit off-topic but I think that to decide what's best it's good to better understand how Diego maintains zero downtime for apps during cf/infra updates. So I decided to put in a few words regarding it.

Note - To understand the details below it's good to have a basic understanding of how Diego works. It's enough to read the first few lines in this doc.

During an update, BOSH stops, updates, and then starts all the jobs on a subset of the diego-cells. The stop procedure contains a drain step which starts an evacuation in the rep and waits until it finishes. Once the evacuation is triggered:

In short, the behaviour is to kill an evacuating app instance only after it is replicated on another diego-cell or replication times out. By doing so it is "guaranteed" that even single-instance apps are kept available during updates.

Hope this gives some insight into how Diego handles updates. 😄

Another note - Diego always tries to schedule instances of a single app across different cells to maintain HA.

PS0 - I'm not the author of the evacuation feature but I'm familiar with the code base since I had to research it once.
PS1 - I only linked examples of evacuating running app instances since they're easier to grasp.

@herrjulz

Hi all,

We started to work on exposing the Eirini PDB setting in the configmap and making the default (keep one instance up if the instance count is >1) configurable. While working on it we realised that this can lead to unevictable pods in some cases. For example, if the default minAvailable in the PDB were changed to 5, all apps would need to be deployed with at least 5 instances.
Also, we were not able to come up with a use case where it makes sense to change the current default for the PDBs Eirini creates. We think a per-app basis makes the most sense, but that would require an end-to-end change (CLI, cloud controller, Eirini) and a broader discussion. Part of that discussion would be how such an end-to-end change would affect Diego.

Another thing to point out is that we have to deal with the fact that depending on the backend (diego or k8s) a user has more/less features available.

@gcapizzi
Contributor

gcapizzi commented Feb 25, 2021

Hi everyone!
One thing I'd like to point out is that, actually, according to @IvanHristov98's comment, our current behaviour (minAvailable: 1) is not equivalent to Diego's, which never runs fewer instances than the ones requested by the app (but sometimes will run more, e.g. during evacuations).

In Kubernetes this would be equivalent to maxUnavailable: 0, which unfortunately doesn't seem to be supported:

If you set maxUnavailable to 0% or 0, or you set minAvailable to 100% or the number of replicas, you are requiring zero voluntary evictions. When you set zero voluntary evictions for a workload object such as ReplicaSet, then you cannot successfully drain a Node running one of those Pods. If you try to drain a Node where an unevictable Pod is running, the drain never completes. This is permitted as per the semantics of PodDisruptionBudget.

We're going to run a spike about this to see if there's a way to achieve Diego's behaviour in Kubernetes. If we can't find any, we might settle on maxUnavailable: 1 for apps with more than 1 instance, and no PodDisruptionBudget at all for apps with one instance (very similar to what we do now, but using maxUnavailable: 1 instead of minAvailable: 1).
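Concretely, that fallback would change only one field in the generated PDBs; a hedged sketch (names and selector illustrative, not the actual Eirini output):

```yaml
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb                       # illustrative
spec:
  maxUnavailable: 1                      # proposed: at most one instance disrupted at a time
  # the current behaviour differs only in using `minAvailable: 1` here instead
  selector:
    matchLabels:
      example.org/app-guid: my-app-guid  # hypothetical label
```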

@braunsonm
Author

Hey @gcapizzi

I'd like to point out that maxUnavailable: 1 would result in very slow upgrades at scale. I'm not sure what your acceptance criteria are for how CF scales, but imagine having ~50 instances of a large production application. Some napkin math: 50 instances, a 60-second shutdown time, a 60-second startup time (waiting for health checks to pass), and ~60 seconds for the pod to be scheduled (pull time, Istio sidecars, volume mounts becoming available). 50 * (60 * 3) seconds = 2 hours 30 minutes for an app to be moved onto a new, upgraded node pool. That's just a single app, and that's the best-case scenario where the instances are evenly spread; it does not include the rest of the drain time.

I think this is acceptable but I wanted to point it out.

You could speed this up by deploying the CF apps as ReplicaSets/Deployments instead, since you won't need to wait for a pod to shut down before starting a new one.

@gcapizzi
Contributor

gcapizzi commented Feb 25, 2021

Hi @braunsonm! Isn't this a problem Diego is already facing? My main argument for this is that it would be the closest thing to what we already have on cf-for-vms (and have had for a long time).

It's hard to establish what the contract we have with our users is in this case, which is why we tend to consider the cf-on-vms/diego behaviour to be it, although it might not be the behaviour expected by all users. Maybe we need to talk with more operators about their experience upgrading cf-for-vms and how that would translate into expectations for cf-for-k8s.

@loewenstein

maxUnavailable: 1 can also stall node rollout if the number of application instances is twice the number of nodes...

@gcapizzi
Contributor

gcapizzi commented Feb 25, 2021

maxUnavailable: 1 can also stall node rollout if the number of application instances is twice the number of nodes...

@loewenstein could you explain this in a bit more detail? I haven't seen this mentioned anywhere in the Kubernetes docs and it seems quite odd to be honest. We'll make sure that we test this scenario in our spike!

@loewenstein

Well, with e.g. 10 application instances on 5 nodes - if we consider an even spread - we'll have 2 instances on each node. But you might be right and it's just taking longer and Kubernetes is taking off Pods one by one (after spinning them up on other nodes) until finally one node is completely empty and can be replaced.

I'd run an experiment though, before I'd go with maxUnavailable: 1 as a fixed value that can't be configured.

@braunsonm
Author

Isn't this a problem Diego is already facing? My main argument for this is that it would the closest thing to what we already have on cf-for-vms (and we've had for a long time).

It doesn't sound like it. From what was said above, the instance is started up on another cell before the old is torn down. If you specify a large max_in_flight during the deployment then, assuming you have the capacity, you could move multiple instances of an app at once since it allows the app instance count to surge.

This behaviour cannot be replicated because CF-for-k8s schedules apps as StatefulSets, which require the old instance to be destroyed before a new one can be spun up.

can also stall node rollout if the number of application instances is twice the number of nodes...

This is true since the PDB would require only 1 app can be moved at a time. An example:

I have a 10-node cluster on AKS with a surge capacity of 50%. This means during an upgrade AKS will give me 5 new nodes and will drain 5 old nodes at a time. Even though 5 nodes are draining, the app only moves one instance at a time to the new node pool, which will stall the upgrade (potentially causing it to time out, though you'd have to confirm that). As @loewenstein said, I think some testing would be required.

@Birdrock
Member

@braunsonm I believe this is the related issue in Eirini: [request] Support to create a K8s Deployment instead of Statefulset resource

We'll revisit this after the Eirini team has completed work on generating Deployments instead of StatefulSets.

cc @jspawar

@braunsonm
Author

@Birdrock That does not seem related, and they seem against the idea. Although a Deployment helps avoid downtime, it does not provide the guarantee that a PDB does.

@gcapizzi
Contributor

gcapizzi commented Mar 17, 2021

Hey everyone! We're currently investigating this, to see which options we have. Re: StatefulSets vs Deployments, we set podManagementPolicy: "Parallel" on our StatefulSets, which should make them behave a bit more like Deployments, although we want to investigate that too.
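For context, that field sits directly on the StatefulSet spec and makes Pods launch and terminate in parallel during scaling (it does not change rolling updates). A trimmed-down sketch; names and image are placeholders, not what cf-for-k8s actually generates:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: my-app                      # placeholder
spec:
  replicas: 3
  podManagementPolicy: Parallel     # default is OrderedReady
  serviceName: my-app
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: app
        image: example/app:latest   # placeholder
```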

@loewenstein

loewenstein commented Mar 18, 2021

I commented on cloudfoundry/eirini#94 as well, but thought I'd throw it in here directly too.

Deployment update strategies support a maxSurge parameter, which will make sure the number of Pods is at least the number of requested instances - during normal application operation.

This won't help for cluster operation like Node updates though. The only thing I can imagine doing is PDBs with maxUnavailable=0, a controller watching Nodes, tainting those that are due to be updated (I hope there's a way to detect this for sure) and triggering a forced rolling update of Deployments running on that Node. The deployment controller should adhere to the maxSurge and due to the Node tainting there shouldn't be any Pods scheduled on the node. Eventually there should be no PDB left that blocks Node draining.
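To make the maxSurge part concrete, this is the relevant fragment of a Deployment's rolling-update strategy (a sketch with placeholder names; as noted, it only governs Deployment-initiated rollouts, not node drains):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app                      # placeholder
spec:
  replicas: 4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1                   # allow one Pod above the desired count during a rollout
      maxUnavailable: 0             # never drop below the desired count during a rollout
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: app
        image: example/app:latest   # placeholder
```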

@braunsonm
Author

braunsonm commented Mar 19, 2021

The only thing I can imagine doing is PDBs with maxUnavailable=0, a controller watching Nodes, tainting those that are due to be updated (I hope there's a way to detect this for sure) and triggering a forced rolling update of Deployments running on that Node. The deployment controller should adhere to the maxSurge and due to the Node tainting there shouldn't be any Pods scheduled on the node. Eventually there should be no PDB left that blocks Node draining.

This is interesting, but it seems more complicated than allowing a user to set maxUnavailable to something > 0? I suppose the reason you don't want to expose this is that it would not be compatible with CF for VMs?

@loewenstein

... but seems more complicated than ...

I guess I should have stated that I was not necessarily suggesting we implement this. It might just not be worth replicating the exact behaviour here. On the other hand, partially leaking Kubernetes constructs out to users sounds dangerous to me.

Maybe we should just decide on a reasonable expectation for the percentage of instances that are guaranteed to be up, running, and accepting traffic, set this as a PDB, and rely on the cf-for-k8s/Kubernetes operator to detect and resolve any Node update issues that might be caused by those PDBs.

@gcapizzi
Contributor

gcapizzi commented Mar 29, 2021

Hi everyone! 👋
These are the possible solutions to the problem that we have found:

1. Set all PDBs to maxUnavailable: 1

This would maximise availability but would result in very slow upgrades, which could time out.

2. Set all PDBs to minAvailable: 50%

This would be a more balanced default, as it trades some upgrade speed for higher availability. It would also never result in unevictable Pods (if we don't set it on single-instance LRPs, which already is the current behaviour).

3. Allow developers to specify minAvailable/maxUnavailable for single LRPs

This would require a CF-wide change.

4. Allow operators to specify minAvailable/maxUnavailable for the whole foundation

This can lead to unevictable Pods, but we can detect those and fix them by forcing maxUnavailable: 1.

5. Use Deployments instead of StatefulSets

This would make upgrades a bit faster, but also requires a massive, backwards-incompatible CF-wide change as we would have to give up instance indexes.

5a. Block the default draining behaviour by using maxUnavailable: 0, and use the controller rolling update instead, leveraging surging

This would result in rolling all the Deployment Pods, including the ones that don't need to be drained. Also, overriding the default behaviour is risky and could lead to unpredictable problems.

6. React to eviction events with a validating webhook and try to override the default draining behaviour

This could be done by blocking evictions and manipulating the StatefulSet replicas number to force the creation of new instances before the old ones get evicted. It is probably impossible to implement with StatefulSets, and would be quite hacky and racy.

We have decided to go with 2 (story here), which seems like a net improvement that can be implemented quite quickly, and maybe 4 (stories here, here and here) if there is enough support for it.
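For concreteness, a sketch of the PDB option 2 would generate for a multi-instance app (names and labels illustrative; single-instance LRPs would keep getting no PDB at all):

```yaml
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb                       # illustrative
spec:
  minAvailable: 50%                      # at least half of the requested instances stay up during drains
  selector:
    matchLabels:
      example.org/app-guid: my-app-guid  # hypothetical label
```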

What do people think?

@braunsonm
Author

braunsonm commented Mar 29, 2021

Option 3 is still the most flexible option here. Your change would require us to run with double the number of instances we would normally need, just to ensure we have a decent number of available instances during upgrades.

I understand it's additional work but 2 and 4 are not satisfactory in the long term. 2 is still okay as a temporary solution but also does not line up with the process in CF for VMs.

@gcapizzi
Contributor

gcapizzi commented Mar 29, 2021

It's important to understand that the CF-for-VMs behaviour is just not achievable on Kubernetes. Kubernetes will never surge during node drains, so even being able to set minAvailable/maxUnavailable wouldn't get you anything that lines up with how Diego behaves. maxUnavailable: 1 would get you close, but you could do that with 3 right? What makes option 3 unsuitable for your use case?

@braunsonm
Author

@gcapizzi Option 3 would allow us to keep maxUnavailable: 1 only on applications that need a specific uptime guarantee or have significant load, which is the closest we will ever get to the Diego behaviour on K8s. This gives us the best of both worlds: deployment updates are still fast for apps that can withstand more instance interruptions, while apps under significant load remain highly available.

This is also better than 2 because we do not need to waste compute capacity in anticipation of an upgrade.

@gcapizzi
Contributor

Sorry, wrong number! I meant:
you could do that with 4 right? What makes option 4 unsuitable for your use case?

@braunsonm
Author

braunsonm commented Mar 29, 2021

@gcapizzi I suppose 4 would work. It just means all apps need to have a high guarantee of availability even if only one app in the whole foundation does! That is why 3 is ideal in my opinion. However, with 4 being an optional setting to override 2, the operator would be aware of upgrade timeouts if they specified maxUnavailable: 1.

I'm fine with this

@bkrannich

@gcapizzi, @braunsonm: In a recent discussion with @emalm I believe the general sense was that it might overall be beneficial to look into option 5, even if it means giving up a certain set of existing CF behavior (app instance index, etc.).

On the other hand, my gut feeling continues to be to aim for something as close as possible to today's Diego behavior when it comes to keeping app instances running (I realize that I'm advocating for dropping and keeping CF compatibility at the same time here).

Additionally, I wonder if it would help to increase the number of replicas when we know we are doing an update and decrease the number of replicas again once the update is done.

So, I guess my question would be: would the combination of the above maybe allow for a more Diego-like behavior?

@gcapizzi
Contributor

gcapizzi commented Mar 30, 2021

@bkrannich I think your proposal is essentially 6. We just don't think manipulating the instance count of a StatefulSet in the middle of a drain would be safe, and interfering with the standard upgrade process just feels dangerous.

The unfortunate truth is that there is no way to replicate the behaviour of Diego. This is because Diego can evacuate cells knowing exactly what it can and cannot do when rescheduling containers, while in Kubernetes the drain and upgrade procedure needs to work with any Pod management controller, be it Deployment or StatefulSet or even something else entirely. Surging (which is essentially what Diego does) is a Deployment specific behaviour, but the drain procedure is completely decoupled from any controller, so it can't leverage it. Diego can safely assume that surging is fine, Kubernetes can't afford that assumption.

I agree that switching to Deployments would help with upgrades, although not much: as I explained above, draining doesn't leverage surging. On the other hand, it would mean giving up features currently relied on by many (e.g. I suspect some Spring Cloud components use INSTANCE_INDEX). We are thinking about putting it behind a feature flag, but we still need to investigate thoroughly.

@loewenstein

A couple of questions/comments:

1&2
Setting PDBs too high can completely halt node updates, can't it?
Setting maxUnavailable to 1 will allow a full node update only if there is at most a single application instance on every node.
Setting maxUnavailable to 50% makes halting node updates less likely, but with e.g. 6 application instances on 2 nodes neither could be drained. Right?

3&4
Given the above, and assuming developers do not have access to Kubernetes directly, I'd say manipulating PDBs should be reserved for operators. They would have to detect problems in node updates, but they also have the means to fix them.

5&6
I didn't try it and didn't think it through completely, but I'd assume we could emulate Diego behavior a little more closely than what we've listed here, independently of the StatefulSet vs. Deployment discussion, like this:

  • set PDBs maxUnavailable to 0
  • watch Nodes for Draining and trigger rollouts of workloads that have pods on the draining nodes

The rollouts should eventually leave a draining node empty so that it can be shut down and replaced with an updated node.

@gcapizzi
Contributor

gcapizzi commented Mar 31, 2021

Setting maxUnavailable to 50% makes halting node updates less likely, but with e.g. 6 application instances on 2 nodes neither could be drained. Right?

I think there's a misconception that the Kubernetes draining procedure will either kill all Pods on the node or none of them, which is not true. Kubernetes will kill as many Pods as it can within all PDB constraints, and then stop, wait for 5 seconds and try again. This goes on until the node is completely drained.

The only way to make this procedure completely stuck is to have PDBs with maxUnavailable: 0, minAvailable: N (where N is the total number of replicas) or minAvailable: 100%. We never do this, although we need to prevent it if we make these numbers user-configurable. The problem is more with speed: maxUnavailable: 1 might result in very slow upgrades and even timeouts. Also some providers will stop following PDBs after a while to prevent upgrades from getting stuck (e.g. 1 hour for GCP).

5&6
I didn't try and didn't think it through completely, but I'd assume we could emulate Diego behavior a little closer to what we've listed here and independent from the StatefulSet vs. Deployment discussion like this:

  • set PDBs maxUnavailable to 0
  • watch Nodes for Draining and trigger rollouts of workloads that have pods on the draining nodes

This would only really improve things if we used Deployments, as StatefulSets won't surge anyway. Some people have asked to implement Deployment surging during upgrades, but that doesn't seem to be going anywhere. I'm uneasy with trying to hijack the upgrade procedure ourselves.

@bkrannich

@gcapizzi: I read the K8s 1.21 release notes and found https://kubernetes.io/docs/concepts/architecture/nodes/#graceful-node-shutdown - not sure if such a feature would be of any help, though.

@gcapizzi
Contributor

gcapizzi commented Apr 9, 2021

@gcapizzi: I read the K8s 1.21 release notes and found https://kubernetes.io/docs/concepts/architecture/nodes/#graceful-node-shutdown - not sure if such a feature would be of any help, though.

Yes, I think this relates more to how Pods are killed than to how they are rescheduled.
