Should we rely on StatefulSets? (Yes.) #1173
Contrary to many other stateful workloads on Kubernetes, we explicitly decided not to rely on StatefulSets to manage Kubernetes pods, for reasons highlighted in this document. This decision comes from the desire to get more flexibility around Elasticsearch node orchestration, and to avoid dealing with another abstraction layer that may not simplify our orchestration needs.
Since then (a few months later), we have evolved both our user API (the Elasticsearch CRD) and our codebase to support more and more features, in a way that feels native to Kubernetes. These recent improvements brought a few hard problems to solve (dealing with cache inconsistencies through expectations, comparing hashes of template specs, correctly reusing killed nodes' PVCs, reusing PVCs during rolling upgrades, and more). In a sense, we're getting closer to what StatefulSets provide, with more flexibility, at the cost of more work on our side.
Since this is a recurring topic in our team discussions, let's challenge this decision again by looking at what it would mean, right now, to switch over to managing StatefulSets instead of managing pods directly.
Elasticsearch CRD and StatefulSets mapping
Our Elasticsearch CRD schema already groups nodes into multiple sets of nodes:
Each group of nodes shares the same pod spec and configuration. Some use cases for splitting the cluster into multiple groups include dedicated masters, hot-warm topologies, availability zone awareness, etc.
There are no "fixed" StatefulSets we always create (for example, one StatefulSet for master nodes and another one for data nodes, as other k8s operators do). The cluster topology can be much larger than that, considering hot-warm architectures, availability zones, etc. It is entirely dynamic and user-driven.
No changes to the Elasticsearch CRD are required to rely on StatefulSets internally.
It is not possible to update the VolumeClaimTemplates of an existing StatefulSet; we must create a new one. If we apply a strict node-spec-group-to-StatefulSet mapping, I think we should reject any modification of the spec that would lead to modifying the underlying StatefulSet. The user would have to explicitly create a new group with a new name (replacing the existing one) to set different VolumeClaimTemplates.
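As a rough illustration, this check could boil down to a deep comparison of the claim templates. The sketch below uses simplified stand-in types rather than the real Kubernetes API structs:

```go
package main

import (
	"fmt"
	"reflect"
)

// ClaimTemplate is a simplified stand-in for a PersistentVolumeClaim
// template (name + requested storage); the real operator would compare
// the actual Kubernetes API structs.
type ClaimTemplate struct {
	Name    string
	Storage string
}

// needsNewStatefulSet reports whether the desired claim templates differ
// from the existing ones. Since VolumeClaimTemplates are immutable on an
// existing StatefulSet, any difference means the change must be rejected
// (or the user must create a new node group with a new name).
func needsNewStatefulSet(existing, desired []ClaimTemplate) bool {
	return !reflect.DeepEqual(existing, desired)
}

func main() {
	existing := []ClaimTemplate{{Name: "elasticsearch-data", Storage: "10Gi"}}
	desired := []ClaimTemplate{{Name: "elasticsearch-data", Storage: "50Gi"}}
	fmt.Println(needsNewStatefulSet(existing, desired)) // storage grew: true
}
```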
StatefulSets offer several upgrade strategies:
General approach with OnDelete
When a modification is applied to a StatefulSet (eg. changing pod resource limits), we end up with a new update revision of the StatefulSet.
The operator manages the deletion of outdated pods itself: with the OnDelete strategy, the StatefulSet controller never deletes pods on its own, but recreates any deleted pod from the latest revision.
To update a StatefulSet (eg. changing pod CPU resources) while reusing persistent volumes, we would update the StatefulSet template, then delete the outdated pods one by one: the StatefulSet controller recreates each of them with the same name and PVC, using the new spec.
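A minimal sketch of that OnDelete flow, with simplified stand-in types (Pod, revision strings) instead of the real Kubernetes objects:

```go
package main

import "fmt"

// Pod is a simplified stand-in for a pod managed by a StatefulSet,
// tracking which revision of the StatefulSet template it was built from.
type Pod struct {
	Name     string
	Revision string
}

// podsToDelete returns the pods whose revision does not match the
// StatefulSet's current update revision. With the OnDelete strategy the
// operator deletes these pods itself (respecting any change budget);
// the StatefulSet controller then recreates each one with the same name,
// reattaching the same PVC, but from the new template.
func podsToDelete(pods []Pod, updateRevision string) []Pod {
	var outdated []Pod
	for _, p := range pods {
		if p.Revision != updateRevision {
			outdated = append(outdated, p)
		}
	}
	return outdated
}

func main() {
	pods := []Pod{
		{Name: "es-data-0", Revision: "rev-1"},
		{Name: "es-data-1", Revision: "rev-2"}, // already upgraded
	}
	for _, p := range podsToDelete(pods, "rev-2") {
		fmt.Println("delete", p.Name) // prints: delete es-data-0
	}
}
```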
Grow and shrink upgrades
Applying grow-and-shrink upgrades gets a bit more complicated:
Note that, by moving forward with this approach, we may decide we don't need grow-and-shrink at all.
Increasing the number of nodes without changing the spec
Just increase the StatefulSet's replicas count.
Decreasing the number of nodes without changing the spec
Follow the steps highlighted at the end of the grow-and-shrink approach: prepare all extra nodes for deletion (allocation excludes), then update the StatefulSet's replicas count. We don't delete pods manually here, otherwise they would be recreated; decreasing the replicas count effectively removes the extra pods.
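Since StatefulSet pods are named `<ssetName>-<ordinal>` and a scale-down always removes the highest ordinals first, the set of nodes to prepare for deletion can be computed upfront. A small sketch (the helper name is illustrative):

```go
package main

import "fmt"

// nodesToVacate returns the names of the Elasticsearch nodes that will be
// removed when a StatefulSet is scaled from oldReplicas down to newReplicas.
// StatefulSet pods are named <ssetName>-<ordinal>, and a scale-down removes
// the highest ordinals first, so these are the nodes to add to the
// allocation excludes before lowering the replicas count.
func nodesToVacate(ssetName string, oldReplicas, newReplicas int) []string {
	var names []string
	for i := newReplicas; i < oldReplicas; i++ {
		names = append(names, fmt.Sprintf("%s-%d", ssetName, i))
	}
	return names
}

func main() {
	fmt.Println(nodesToVacate("es-data", 5, 3)) // [es-data-3 es-data-4]
}
```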
Changing a StatefulSet name
If we keep a strict binding between the name of a group of nodes in the Elasticsearch CRD and the name of a StatefulSet, any update to the Elasticsearch CRD should apply to the StatefulSet. In that particular case, we would need to create a second StatefulSet, migrate data away from nodes in the first StatefulSet to the second one, then delete the first one. In other terms, changing the name of a group of nodes in the ES spec is a very expensive operation, unless we find a (hacky) way to reuse PVCs across different StatefulSets.
With this strategy, we define a partition index in the StatefulSet's rollingUpdate spec: when the template changes, only pods with an ordinal greater than or equal to that index are updated.
For example, if we have a StatefulSet with 5 replicas:
If the partition index is 3, only the pods with ordinals 3 and 4 are updated to the new spec; pods 0, 1 and 2 keep the old one.
In this mode, the operator never deletes pods. All it does is:
To do a rolling upgrade of the StatefulSet above, we would start with a partition index of 5 (no pod updated), then decrement it down to 0, one step at a time, waiting for the cluster to be healthy again before each decrement.
The same general logic as for the OnDelete strategy applies otherwise.
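The decrement loop can be sketched as follows; the `nextPartition` helper and the health check are illustrative simplifications:

```go
package main

import "fmt"

// nextPartition computes the next value of rollingUpdate.Partition during a
// rolling upgrade. Pods with an ordinal >= partition are recreated by the
// StatefulSet controller, so starting at the replicas count (nothing
// upgraded) and decrementing towards 0 upgrades one pod at a time, highest
// ordinal first. The operator only decrements once the cluster is healthy.
func nextPartition(current int, clusterHealthy bool) int {
	if !clusterHealthy || current == 0 {
		return current // wait for health, or upgrade already complete
	}
	return current - 1
}

func main() {
	// 5 replicas: partition goes 5 -> 4 -> ... -> 0 as health permits.
	p := 5
	for p > 0 {
		p = nextPartition(p, true)
		fmt.Println("partition:", p)
	}
}
```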
Pod management policy
StatefulSets can have 2 different pod management policies: OrderedReady (the default: pods are created and terminated one at a time, in ordinal order) and Parallel (pods are created and terminated in parallel).
I think we don't care about the OrderedReady ordering guarantees: Elasticsearch handles nodes joining and leaving the cluster in any order.
So we would use the Parallel policy.
Right now, we manage some secrets per cluster, some secrets per group of nodes, but also some secrets per pod:
With StatefulSets, it's impossible to specify a secret name for a single pod. Secrets apply to all pods in the set.
Both are doable; option 2 feels more secure, but option 1 is simpler.
Do we actually need grow and shrink?
In the "grow-and-shrink" approach, we spin up additional nodes, migrate data over to them, then safely remove the nodes we don't need anymore.
Some reasons why we have it implemented right now:
It has some downsides:
Because of the last point, if we were to move towards using StatefulSets, we could also decide to drop support for grow-and-shrink entirely. This would simplify the StatefulSet-based implementation. It also means doubling down on persistent volumes instead of emptyDirs, which we consider unreliable.
Question mark on migrating small clusters (eg. single-node clusters).
Managing a set of StatefulSets and pods in those StatefulSets
We would add another abstraction layer with StatefulSets, but we would still need to manage the pods inside each StatefulSet (to compare revision hashes on deletion).
To deal with master nodes in cluster mutations, we need to capture the list of master nodes across all StatefulSets, in order to:
To take our general changeBudget into consideration, we need to coordinate changes across all StatefulSets rather than within a single one.
So our code needs to work not only with pods, but with the StatefulSets that contain these pods, for changes that span pods in multiple StatefulSets.
Extra things we'll need to deal with
If we go on with the StatefulSets implementation, here is a list of extra things we'll need to deal with that we didn't have to deal with before:
Things we have implemented that we can get rid of
If we go on with the StatefulSets implementation, here is a list of things in our current codebase that we would no longer need:
Overall B's and C's
The outcome of this issue should be a decision on whether it is worth switching our internal implementation to StatefulSets.
Small point of clarification: there is a work-in-progress proposal to support PVC resizes in stateful sets that is making active (though slow) progress:
We can also still resize the PVC directly, though any new pods spun up by the StatefulSet would still use the old spec. So it is pretty hacky, though "possible".
As a slight modification to option two: one (hacky) way to have different behavior between pods in a stateful set is to use the downward API. We can pass the pod name in as an env var to the pod, and then we could have an init container use the k8s API to load the per-pod secret. Kafka and some operators I've worked on have used that technique to provide a kind of per-pod identity.
I still think "one secret per sset" is probably Good Enough to start, but just wanted to say there is a way to use k8s secrets and have distinct certs per pod
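To make the downward-API trick above concrete, here is a hedged sketch of what the init container side could do. The `POD_NAME` env var (injected via `fieldRef: metadata.name`) is the standard downward-API mechanism, but the `<podName>-certs` secret naming scheme is purely an assumption for illustration:

```go
package main

import (
	"fmt"
	"os"
)

// perPodSecretName derives the name of a per-pod secret from the pod's own
// name, which the pod spec would inject through the downward API
// (fieldRef: metadata.name -> env var POD_NAME). The "<podName>-certs"
// scheme is illustrative, not an existing convention; an init container
// would then fetch that secret from the Kubernetes API.
func perPodSecretName(podName string) string {
	return fmt.Sprintf("%s-certs", podName)
}

func main() {
	podName := os.Getenv("POD_NAME")
	if podName == "" {
		podName = "es-data-0" // fallback when running outside a pod
	}
	fmt.Println(perPodSecretName(podName))
}
```

Note the RBAC concern raised below: the init container's service account would need read access to secrets, which technically also covers other pods' secrets.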
My vote is to accept downtime. If you're only running 1 node it seems like you've already accepted there will be downtime and it is (hopefully) only for non-production usage.
My first thought was to require a minimum of 3 nodes and never worry about this again, but I can see the value of single-node clusters for test environments.
IMO we should just leave this up to the storage class they choose, and our operator should stay out of it (assuming our operator is not manually spinning up PVs and instead relying on dynamically or pre-provisioned volumes). If they want the PV deleted when the instance is deleted, then they can set the reclaim policy to Delete.
I have some questions to flesh out some of the final concerns:
How often do we really think we will be needing to replace ssets? My guess is that by far the most common will be increasing data per node, but I wonder if I'm missing something.
So following on that: what is our expected path to scale up data size? If it is just adding more nodes, then I think this is relatively easy. If we need to let people change the size of the data per-node, then we would need to work out a way to migrate to a new sset. That might be something we can do during or after beta though.
I think there are legitimate search heavy use cases that rely on emptyDir w/ tmpfs that we would exclude when we require everyone to use persistent volumes.
I think that's an understatement. I would argue that it becomes no longer viable for non-toy dataset sizes.
If we retain the ability (via process manager or other means) to have in-place configuration changes without recreating the pod, then the absence of a 'grow and shrink' style update strategy might not be so dramatic.
Finally we could still retain the shrink bit of 'grow-and-shrink': safe draining/vacation of nodes would enable use cases that don't want to have persistent volumes (and you need the 'shrink' bit anyway for stateful set replacement I guess)
It feels like the main benefit would be better alignment with built-in Kubernetes controllers and the potential to benefit from any future improvements to stateful sets.
Here the benefits are not so clear. Probably a bit of simplification. But IIUC not a sea change: we would need to retain some change calculation code to express for example
I think it is less about size of the change (if it is completely transparent to users does it matter how big the change is?) but more about backwards compatibility/stability/maturity of what we consider a beta release. There would be no upgrade path from the individual pod management to a stateful set based implementation, and that is probably a good reason to have that in a beta. (I mean it's technically feasible but then we would need to support both orchestration approaches concurrently for at least one release etc.)
I think you're right. Updating a sset VolumeClaimTemplate section will probably be possible at some point, but until then, we can't explicitly do it.
I guess we have 2 options there:
The second approach looks like an optimisation we could do later on, that would make our life a bit harder but the user's life a bit simpler.
We currently don't have this problem while managing pods directly: a user can update the VolumeClaimTemplates section, we'll manage deleting old pods and creating new ones with the new volume size. In the end, that's pretty similar to what we'd do with StatefulSets (spin up pods with new volumes, migrate data, delete pods with old volumes).
Not sure I understand correctly, I might be misunderstanding the way StatefulSets & PVC deletion work. From what I read in the docs, when a sset is deleted, the PVCs are not deleted. That's up to the user.
Storage classes can have retain policies to specify how the PV deletion happens if the PVC is deleted. But this does not control PVC deletion, only PV deletion. What I'm concerned with here is the PVC deletion if a sset is scaled down or removed. Say you scale down from 5 nodes to 3 nodes: I don't see why we'd keep PVCs around for nodes 4 and 5. If we decide to delete PVC, the underlying PV deletion is not our concern though, as you said.
@jonsabo This would require one or more containers in the Pod to access the API, which is undesirable if it can be avoided. From an RBAC perspective, I understand it as the credentials would technically allow reading other nodes' certs as well? Something for a later discussion for sure though.
@jonsabo It is a regression from the status quo though. My vote is for accepting downtime as well fwiw.
It's important to note that it's not just about 1 node clusters. It's also about clusters that do not have replicas for all shards, e.g non-HA clusters etc.
@jonsabo Single node clusters for production can also work depending on the nines you're promising, and the specifics of the use case. Not saying it's necessarily advisable, but it's not categorically not doable. (splitting hair here probably)
@jonsabo I think lots of people will just change the size of the volume claim template. Migrating between statefulsets will be pretty much core to a lot of the configuration changes (e.g we want to support topology changes etc), and also ties into how we could do grow/shrink as well.
@pebrc We can detect the usage of emptyDirs and disable rolling changes for those (e.g. still support grow/shrink rather easily through option 2 in the suggestion below)
@pebrc As it would only work for a subset of changes, I'm becoming more and more lukewarm on that approach, and would prefer if we first solve for general applicability and optimize afterwards if the payoff is clearly worth the effort and complexity.
@pebrc This is a huge and important point to my mind as well. That, and I would be /very/ happy to see stable node names and IPs as well.
@sebgl Don't map nodes group names to sset names: we manage sset internally, decoupled from the ES spec name. We could base the sset name on
@sebgl To my mind this would be polyfilling for future K8s features and breaks stable naming/identity, so if K8s would disallow something, let's disallow it as well (but provide a path forward).
@sebgl Wouldn't the same arguments for statefulsets not deleting PVCs apply to our operator not deleting PVCs? What are they / what are different for us?
Suggestions: grow and shrink support:
I don't particularly like the idea of suffixing the replacement statefulset etc, seems a little messy.
(my current preference)
This makes our lives easier by not having to suffix statefulsets oddly and allows for nice and stable node names for the ES nodes.
This seems plausible to validate/verify/provide feedback on in a webhook as well?
Just so that I understand you are intending to transparently map from node set/spec to a suffixed stateful set and the node spec name would be stable?
Where generation identifies the change to the Elasticsearch spec we are trying to roll out (the actual number could come out of some variant of our current change calculation)
I think from technical POV it's all good, but the UX is just terrible: What name am I supposed to choose here, because I doubt that a user will call their node sets
So then what is the new name I am supposed to choose here?
What's the point of having an operator if I have to vacate my nodes manually? Sounds like a world of pain.
I'm not certain the UX is terrible. It's not perfect, but we're not trying to paper over shortcomings of the underlying features we're using (statefulsets), or the "no downtime" limitation (in the case of non-HA clusters or specific user config)
Only changes that are not possible to do without creating a new statefulset would require a new name. This part would be the same if you were doing statefulsets directly, except we can automate migrating from one statefulset to another, which is of great value.
:shrug:. It just seems like a worse version of option 2.
I'm trying to draft what StatefulSets reconciliation would look like in pseudo-code, taking changeBudget into account, covering both rolling upgrades and grow-and-shrink, and relying on sset rollingUpdate.Partition as a way to trigger pod replacement.
I'm pretty sure this is not 100% correct and there are some gaps. Among others, it's missing a few pieces about resetting zen1 and zen2 settings, allocation excludes, enabling/disabling shard allocations, and resetting rollingUpdate.Partition indices.
A few things I realised from this exercise:
Some other notes for future discussion:
One future feature we might want to have at some point, that will become harder with StatefulSets, is "temporarily pausing/stopping/removing a node from the cluster".
Use case: say you suspect a node is misbehaving for whatever reason (slow IO), which is slowing down the entire cluster. You want to temporarily remove that node from the cluster to see how it goes.
In the current implementation, users can update an annotation on the Elasticsearch resource, so the operator stops managing this cluster. Then manually delete the pod (remove the node) to see what happens. Once done, update that annotation again: the operator will recreate the pod.
With StatefulSets, "pausing" the operator does not prevent the StatefulSet controller from recreating any pod the user might delete. AFAIK there is nothing we can do about it, unless we go towards a direction where the pod is still running but the Elasticsearch process itself is paused in the pod.
If I'm understanding you correctly, you're worried about how we decide the sset name when we need to create multiple ssets? If so, I think we can take the user-provided sset name and append a hash of the values that require a new sset.
No this was about the interaction model where the human user would have to choose new sset names to enact configuration changes that cannot be accomplished by modifying an existing set. The mode of operation outlined by @nkvoll and @sebgl means that the user would have to choose the name themselves which I found problematic from a UX point of view because the logical identity of the set of nodes ('data-hot' for example) does not change.
What you are referring to is outlined above as Option 1 in Njal's plan.
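The "user-provided name plus a hash of the immutable values" idea could be sketched like this; the naming scheme and the choice of FNV hashing are purely illustrative, not a settled design:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// ssetNameFor appends a short hash of the spec values that cannot be
// changed in place (eg. the VolumeClaimTemplates) to the user-provided
// node group name. Any change to those values yields a different hash,
// hence a new StatefulSet name, while the logical group name stays stable.
func ssetNameFor(groupName, immutableSpec string) string {
	h := fnv.New32a()
	h.Write([]byte(immutableSpec))
	return fmt.Sprintf("%s-%x", groupName, h.Sum32())
}

func main() {
	fmt.Println(ssetNameFor("data-hot", "storage=10Gi"))
	fmt.Println(ssetNameFor("data-hot", "storage=50Gi")) // different suffix
}
```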
Summary of a discussion with @nkvoll and others: