
Should we rely on StatefulSets? (Yes.) #1173

Closed
sebgl opened this issue Jul 1, 2019 · 21 comments

Comments

@sebgl
Collaborator

commented Jul 1, 2019

Context

Contrary to many other stateful workloads on Kubernetes, we explicitly decided not to rely on StatefulSets to manage Kubernetes pods, for reasons highlighted in this document. This decision comes from the desire to get more flexibility around Elasticsearch nodes orchestration, and avoid dealing with another abstraction layer that may not simplify our orchestration needs.

Since then (we're a few months later), we have evolved both our user API (the Elasticsearch CRD) and our codebase to support more and more features, in a way that feels native to Kubernetes. These recent improvements brought a few hard problems to solve (dealing with cache inconsistencies through expectations, comparing hashes of template specs, correctly reusing killed nodes' PVCs, reusing PVCs during rolling upgrades, and more to solve). In a sense, we're getting closer to what StatefulSets provide, with more flexibility, at the cost of more work on our side.

Since this is a recurring topic in our team discussions, let's try to challenge this decision again by looking at what it would mean, right now, to switch over to managing StatefulSets instead of managing pods.

Overview

Elasticsearch CRD and StatefulSets mapping

Our Elasticsearch CRD schema already groups nodes into multiple sets of nodes:

apiVersion: elasticsearch.k8s.elastic.co/v1alpha1
kind: Elasticsearch
metadata:
  name: elasticsearch-sample
spec:
  version: "7.1.0"
  nodes:
  # first group of nodes
  - name: firstGroup
    config:
      node.attr.attr_name: attr_value
    podTemplate:
      spec:
        containers:
        - name: elasticsearch
          resources:
            limits:
              memory: 2Gi
              cpu: 1
    nodeCount: 3
    volumeClaimTemplates:
    - metadata:
        name: elasticsearch-data
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 2Gi
  # second group of nodes
  - name: secondGroup
    config:
      node.attr.attr_name: attr_value
    podTemplate:
      spec:
        containers:
        - name: elasticsearch
          resources:
            limits:
              memory: 4Gi
              cpu: 1
    nodeCount: 3
    volumeClaimTemplates:
    - metadata:
        name: elasticsearch-data
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 2Gi

Each group of nodes shares the same pod spec and configuration. Some use cases for splitting the cluster into multiple groups include dedicated masters, hot-warm topologies, availability zone awareness, etc.
This naturally maps to one StatefulSet per group. StatefulSets would be dynamically created based on the spec, with a mandatory name that we associate with a given group.
For a group of 3 nodes, instead of creating 3 pods, we (internally) create a StatefulSet with 3 replicas.
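For illustration, here is a sketch of the StatefulSet the operator might generate for firstGroup above. The object name, labels, and service name are assumptions, not actual operator conventions:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: elasticsearch-sample-es-firstgroup  # hypothetical naming scheme (k8s names must be lowercase)
spec:
  serviceName: elasticsearch-sample-es      # hypothetical headless service
  replicas: 3                               # matches the group's nodeCount
  podManagementPolicy: Parallel
  updateStrategy:
    type: OnDelete
  selector:
    matchLabels:
      elasticsearch.k8s.elastic.co/statefulset-name: elasticsearch-sample-es-firstgroup  # hypothetical label
  template:
    metadata:
      labels:
        elasticsearch.k8s.elastic.co/statefulset-name: elasticsearch-sample-es-firstgroup
    spec:
      containers:
      - name: elasticsearch
        image: docker.elastic.co/elasticsearch/elasticsearch:7.1.0
        resources:
          limits:
            memory: 2Gi
            cpu: 1
  volumeClaimTemplates:
  - metadata:
      name: elasticsearch-data
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 2Gi
```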

There are no "fixed" StatefulSets we always create (for example, one StatefulSet for master nodes and another one for data nodes, as other k8s operators do). The cluster topology can be much larger than that, considering hot-warm architectures, availability zones, etc. This is entirely dynamic and user-driven.

No changes to the Elasticsearch CRD are required to rely on StatefulSets internally.

VolumeClaimTemplates updates

It is not possible to update the VolumeClaimTemplates of an existing StatefulSet; we must create a new one. If we apply a strict mapping from node spec group to StatefulSet, I think we should reject any modification of the spec that would require modifying the underlying StatefulSet. The user would have to explicitly create a new group with a new name (replacing the existing one) to set different VolumeClaimTemplates.
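For illustration, a minimal sketch of such a validation, with simplified types (not the operator's actual validation code):

```go
package main

import (
	"fmt"
	"reflect"
)

// claimTemplate is a simplified stand-in for a PVC template:
// just the claim name and the requested storage.
type claimTemplate struct {
	Name    string
	Storage string
}

// validateGroupUpdate rejects any change that would require mutating
// the VolumeClaimTemplates of the StatefulSet backing an existing group.
func validateGroupUpdate(old, updated []claimTemplate) error {
	if !reflect.DeepEqual(old, updated) {
		return fmt.Errorf("volumeClaimTemplates are immutable: create a new node group instead")
	}
	return nil
}

func main() {
	current := []claimTemplate{{Name: "elasticsearch-data", Storage: "2Gi"}}
	// Unchanged templates: accepted (nil error).
	fmt.Println(validateGroupUpdate(current, []claimTemplate{{Name: "elasticsearch-data", Storage: "2Gi"}}))
	// Changed storage request: rejected with an error.
	fmt.Println(validateGroupUpdate(current, []claimTemplate{{Name: "elasticsearch-data", Storage: "5Gi"}}))
}
```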

Update strategy

StatefulSets offer several update strategies: OnDelete, RollingUpdate, and RollingUpdate with a partition. I think we can work with either OnDelete or RollingUpdate with a partition.

General approach with OnDelete

With OnDelete, pods are never deleted from the StatefulSet by the StatefulSet controller, unless their ordinal is above the expected number of replicas. We (in the operator) decide when we want to delete a pod. Once deleted, a pod is automatically recreated by the StatefulSet controller, with the same name but the most up-to-date spec. Our operator would never create pods, but it would take care of pod deletion, once we decide a pod is ready to be deleted.

When a modification is applied to a StatefulSet (e.g. a change in pod resource limits), we end up with a new revision (based on a hash of the template spec). Looking at the StatefulSet status, we can get the current revision (currentRevision: es-master-6596ffb49b), the update revision (updateRevision: es-master-696848f587), the number of pods using the update revision, and the number of pods still using an old revision. By listing pods in that StatefulSet, we can check the current revision of each pod (metadata.labels["controller-revision-hash"]: "es-master-696848f587").
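The revision comparison boils down to a label diff. A minimal sketch with simplified types (not actual v1.Pod objects):

```go
package main

import "fmt"

// pod is a minimal stand-in for a v1.Pod: its name plus the
// controller-revision-hash label set by the StatefulSet controller.
type pod struct {
	Name     string
	Revision string
}

// podsOnOldRevision returns the pods whose revision label differs from
// the StatefulSet's updateRevision; those are the ones left to roll.
func podsOnOldRevision(pods []pod, updateRevision string) []string {
	var stale []string
	for _, p := range pods {
		if p.Revision != updateRevision {
			stale = append(stale, p.Name)
		}
	}
	return stale
}

func main() {
	pods := []pod{
		{Name: "es-master-0", Revision: "es-master-696848f587"},
		{Name: "es-master-1", Revision: "es-master-6596ffb49b"},
		{Name: "es-master-2", Revision: "es-master-6596ffb49b"},
	}
	fmt.Println(podsOnOldRevision(pods, "es-master-696848f587")) // [es-master-1 es-master-2]
}
```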

The operator manages the replicas of each StatefulSet. When there are no upgrades to be made on the cluster, the replicas count matches the node count in the spec for that group of nodes. When there are upgrades to be made on the cluster, the replicas managed by the operator might be different, on purpose.

Rolling upgrades

To update a StatefulSet with, e.g., a pod CPU resource change while reusing persistent volumes, we would:

  • Update our internal StatefulSet resource for that group of nodes. This does not trigger any change, since we use OnDelete.
  • Fetch the list of pods in that StatefulSet, and notice which pods are using an old revision compared to the current StatefulSet revision. These pods need to be rolled to new ones.
  • Starting from the pod with the highest ordinal (eg. es-master-3), apply steps we already apply to make sure we safely evacuate the pod:
    • call the ES API to prepare for node restart: disable shard allocations, update zen1 & zen2 settings, etc.
    • once ready, delete the pod
    • the pod gets automatically recreated with the correct spec by the StatefulSet controller
    • enable shards allocations if all pods in the set are currently running
  • Once done, all pods in the StatefulSet should have the current StatefulSet revision.
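The "highest ordinal first" ordering above can be sketched as follows, assuming pod names follow the usual `<sset-name>-<ordinal>` pattern:

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// ordinal extracts the StatefulSet ordinal from a pod name like "es-master-3".
func ordinal(podName string) int {
	i := strings.LastIndex(podName, "-")
	n, _ := strconv.Atoi(podName[i+1:])
	return n
}

// deletionOrder sorts pods highest ordinal first: the order in which we
// would roll them. A sketch; the real operator works on v1.Pod objects.
func deletionOrder(names []string) []string {
	sorted := append([]string(nil), names...)
	sort.Slice(sorted, func(a, b int) bool { return ordinal(sorted[a]) > ordinal(sorted[b]) })
	return sorted
}

func main() {
	fmt.Println(deletionOrder([]string{"es-master-0", "es-master-2", "es-master-1"}))
	// [es-master-2 es-master-1 es-master-0]
}
```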
Grow and shrink upgrades

Applying grow-and-shrink upgrades gets a bit more complicated:

  • Using a similar mechanism as for rolling upgrades, notice 3 pods need to be replaced based on the StatefulSet revision.
  • Update the StatefulSet replicas count to ES spec NodeCount + changeBudget.maxSurge. That would create 4 nodes in that group instead of the 3 nodes requested.
  • Starting from the pod with the highest ordinal (eg. es-master-3), apply steps we already apply to make sure we safely evacuate the pod:
    • call the ES API to prepare for node removal: set allocation excludes.
    • once data migration is over, delete the pod
    • the pod gets automatically recreated with the correct spec by the StatefulSet controller
  • Once all 4 pods are updated with the new spec, we need to revert the 4 replicas to 3 replicas. Apply the same steps as above to migrate data away from the temporary extra nodes (here, the 4th one, not requested by the user).
  • Once data is migrated away from extra nodes, change the number of replicas back to 3: this effectively deletes the additional pods.
  • Compared to our current code base, this requires more data migration than what we do today, since we need to remove the nodes we temporarily created instead of using them directly.

Note that, by moving forward with that approach, we may decide we don't need grow-and-shrink at all.

Increasing the number of nodes without changing the spec

Just increase the StatefulSets replicas count.

Decreasing the number of nodes without changing the spec

Follow the steps highlighted at the end of the grow-and-shrink approach: prepare all extra-nodes for deletion (allocation excludes), then update the StatefulSets replicas count. We don't delete pods manually here, otherwise they would be recreated. Decreasing the replicas count effectively removes the extra pods.
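Building the allocation-excludes list for such a scale-down could look like this sketch (the helper name is an assumption; the resulting string would feed Elasticsearch's cluster.routing.allocation.exclude._name setting):

```go
package main

import (
	"fmt"
	"strings"
)

// excludedNodes returns the names of the pods that will disappear when
// scaling a StatefulSet from current down to target replicas: the pods
// with the highest ordinals.
func excludedNodes(ssetName string, current, target int) string {
	var names []string
	for i := target; i < current; i++ {
		names = append(names, fmt.Sprintf("%s-%d", ssetName, i))
	}
	return strings.Join(names, ",")
}

func main() {
	// Scaling es-data from 5 to 3 nodes: exclude the last two ordinals.
	fmt.Println(excludedNodes("es-data", 5, 3)) // es-data-3,es-data-4
}
```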

Changing a StatefulSet name

If we keep a strict binding between the name of a group of nodes in the Elasticsearch CRD and the name of a StatefulSet, any rename in the Elasticsearch CRD must apply to the StatefulSet. In that particular case, we would need to create a second StatefulSet, migrate data away from the nodes in the first StatefulSet to the second one, then delete the first one. In other terms, changing the name of a group of nodes in the ES spec is a very expensive operation. Unless we find a (hacky) way to reuse PVCs across different StatefulSets.

OnDelete: legacy?

In both documentation and code, the OnDelete strategy is marked as legacy. There is no clear statement that it is or will be deprecated though.

RollingUpdate.Partition approach

With this strategy, we define a partition index: pods whose ordinal is greater than or equal to this index may be replaced by the StatefulSet controller.

For example, if we have a StatefulSet with 5 replicas:

  • es-data-0
  • es-data-1
  • es-data-2
  • es-data-3
  • es-data-4

If the partition index is 3, then the StatefulSet controller is allowed to automatically delete then recreate pods es-data-3 and es-data-4.

In this mode, the operator never deletes pods. All it does is:

  • update the StatefulSets replicas when new pods should be added or pods should be removed
  • update the partition index when some pods should be replaced

To do a rolling upgrade of the StatefulSet above, we would start with an index of 5, make sure pod 4 can safely be replaced, then update the index to 4. This would trigger the pod replacement.

The same general logic as for OnDelete applies, except we don't delete pods explicitly; we manage the partition index instead.
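The partition semantics can be sketched as:

```go
package main

import "fmt"

// updatable returns the pod ordinals the StatefulSet controller may
// replace under RollingUpdate with the given partition: every ordinal
// greater than or equal to the partition index.
func updatable(replicas, partition int) []int {
	var ordinals []int
	for i := partition; i < replicas; i++ {
		ordinals = append(ordinals, i)
	}
	return ordinals
}

func main() {
	// partition == replicas: nothing may be replaced yet.
	fmt.Println(updatable(5, 5)) // []
	// partition 3 on 5 replicas: es-data-3 and es-data-4 may be replaced.
	fmt.Println(updatable(5, 3)) // [3 4]
}
```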

Pod management policy

StatefulSets can have 2 different pod management policies:

  • OrderedReady: guarantees pods are created and deleted in ordinal order, making sure pod N-1 is ready before creating pod N (and in the opposite order for deletions)
  • Parallel: create and delete all pods in parallel

I think we don't care about OrderedReady. It would just slow things down for us, especially for pod creation; we don't require any ordering of operations. It also relies on pod readiness checks to decide whether it can move on with pod creation/deletion, and we may not want to bind the readiness check to the same threshold as pod removal conditions. We want to manage deletion ourselves (either by deleting pods explicitly with OnDelete or by managing a partition index with RollingUpdate.Partition), and we're OK with creating all expected pods at once.

So we would use Parallel here.
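Concretely, both choices surface as two fields on the StatefulSet spec (a fragment, pairing Parallel with the OnDelete strategy discussed above):

```yaml
spec:
  podManagementPolicy: Parallel  # create and delete pods in parallel, no readiness ordering
  updateStrategy:
    type: OnDelete               # the operator decides when each pod gets rolled
```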

Secret volumes

Right now, we manage some secrets per cluster, some secrets per group of nodes, but also some secrets per pod:

  • TLS certificates: each pod gets a different one
  • Elasticsearch configuration: each pod gets one, but it should be the same for all pods in a group, so it can probably be ignored here (should be fixed in the current implementation?)

With StatefulSets, it's impossible to specify a secret name for a single pod. Secrets apply to all pods in the set.
Regarding TLS certificates, it means we need to either:

  1. Generate a single TLS secret for the entire StatefulSet, containing TLS certificates and private keys for all pods in that StatefulSet. This secret can be updated in real time as more pods are created in the StatefulSet. This is what our Helm Charts seem to do. It means there is an implicit trust between nodes of the same set.
  2. Generate the TLS certificate from within the pod at startup. Which is a feature we had implemented, then removed because of its additional complexity.

Both are doable, 2. feels more secure but 1. is simpler.

Do we actually need grow and shrink?

In the "grow-and-shrink" approach, we spin up additional nodes, migrate data over them, then safely remove nodes we don't need anymore.

Some reasons why we have it implemented right now:

  1. default in Elastic Cloud, favors data safety over fast upgrades
  2. it allows working with emptyDirs (no persistent volumes to reuse)
  3. it allows migration from a 1 node cluster to another 1 node cluster without downtime
  4. it keeps a cluster in a green state during the whole upgrade

It has some downsides:

  1. data migration can take a looong time on large clusters
  2. most people running Elasticsearch on bare metal would not do it, it's not even mentioned in Elasticsearch docs - they would go with rolling upgrades instead
  3. implementation in the operator is more complicated when using StatefulSets (see above)

Because of the last point, if we were to move towards using StatefulSets, we could also decide to remove support for grow-and-shrink in general. This would simplify the StatefulSet-based implementation. It also means doubling down on persistent volumes instead of emptyDirs, which we consider unreliable.

Question mark on migrating small clusters (eg. 1 node to 1 node or 1 master+data to 1 master + 1 data): do we accept downtime (with PV reuse), or do we require some grow-and-shrink upgrade with data migration in order to keep the cluster alive?

Managing a set of StatefulSets and pods in those StatefulSets

We add another abstraction layer on top of StatefulSets, but we still need to manage pods inside each StatefulSet (for revision hash comparison purpose on deletion).

To deal with master nodes in cluster mutations, we need to capture the list of master nodes across all StatefulSets, in order to:

  • deal with zen1 and zen2 settings
  • generally try not to remove more than half of the master nodes at once
  • make sure we don't remove the last living master node, if the next one is yet to be created

To take into consideration our general changeBudget across all nodes, we also need to calculate changes to apply to StatefulSets based on all pods in all StatefulSets. For instance if a user specifies changeBudget.MaxSurge: 1 with a grow-and-shrink approach, we want to allow only one extra node to be created across all StatefulSets. Same for deletions and changeBudget.MaxUnavailable.
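A sketch of the surge computation across all StatefulSets (a hypothetical helper, simplified to plain counts):

```go
package main

import "fmt"

// allowedSurge computes how many extra pods we may still create across
// all StatefulSets given changeBudget.MaxSurge. currentPods counts pods
// in every sset; desiredNodes is the node count summed over the ES spec.
func allowedSurge(currentPods, desiredNodes, maxSurge int) int {
	s := desiredNodes + maxSurge - currentPods
	if s < 0 {
		return 0
	}
	return s
}

func main() {
	// 6 pods running, 6 desired, MaxSurge 1: one extra pod may be created.
	fmt.Println(allowedSurge(6, 6, 1)) // 1
	// 7 pods running (surge already used): no further surge allowed.
	fmt.Println(allowedSurge(7, 6, 1)) // 0
}
```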

So our code needs to work not only with pods, but also with the StatefulSets that contain them, for changes that span pods in multiple StatefulSets.

Extra things we'll need to deal with

If we go on with the StatefulSets implementation, here is a list of extra things we'll need to deal with, that we didn't have to before:

  • Manage a dynamic set of StatefulSets and a set of pods across those sets
  • Deal with master nodes across all StatefulSets
  • Be able to completely replace a StatefulSet (node group changed in the spec) with data migration
  • Manage the replicas count of each StatefulSet, which may (on purpose) not be equal to the node count in the ES spec
  • Double data migration logic for grow-and-shrink
  • Manage pod deletion and/or rollingUpdate index
  • Manage PVC deletion: the StatefulSet controller does not delete PVCs. We might decide to leave that up to the user? Probably not.
  • Change the way we mount TLS secrets and ES config to a per-StatefulSet basis

Things we have implemented that we can get rid of

If we go on with the StatefulSets implementation, here is a list of things in our current codebase that we no longer need:

  • Pod expectations, since we now rely on stable ordinal-based naming
  • Pod and PVC comparison logic, we can update StatefulSets directly
  • PVC reuse logic, handled by the StatefulSet controller through stable naming
  • Pod and PVC creation, now managed by the StatefulSet controller
  • Some of mutation.CalculateChanges complexity that would focus more on StatefulSets replicas count instead of individual pods
  • Manipulating pods in many parts of the Elasticsearch reconciliation: we now manipulate a set of StatefulSets

Overall B's and C's

Benefits:

  • Simplifies our codebase (complexity added in some places, complexity removed in others... but I'd say the code becomes generally simpler)
  • PVC reuse for free
  • Feels like "the standard way" to manage stateful workloads on k8s
  • Open for future StatefulSet improvements in k8s

Concerns:

  • Some mutations are tricky (StatefulSet replacement, grow-and-shrink)
  • Extra data migration required (StatefulSet replacement, grow-and-shrink) where we could optimize differently
  • Subject to StatefulSet implementation changes in Kubernetes
  • Major "internal" breaking change

To decide

The outcome of this issue should be to decide whether it is worth it switching our internal implementation to StatefulSets.
If not, let's stop discussing it, and highlight the reasons why not by updating this document.

  1. Is it possible? (likely: yes)
  2. Does it bring us some value? (likely: no additional feature, but removes some complexity from our current code)
  3. Is it worth it?
  4. When? Before or after release 0.9? (likely: before, since it's big enough to deserve a place in a beta release?)
  5. Should we get rid of grow-and-shrink?
  6. OnDelete vs. RollingUpdate.Partition?

@sebgl sebgl added the discuss label Jul 1, 2019

@anyasabo

Collaborator

commented Jul 1, 2019

It is not possible to update the VolumeClaimTemplates of an existing StatefulSet. We must create a new one.

Small point of clarification: there is a work-in-progress proposal to support PVC resizes in StatefulSets that is making active (though slow) progress:
kubernetes/enhancements#660

We can also still resize the PVC directly, though any new pods spun up by the StatefulSet would still get claims at the old spec. So it is pretty hacky, though "possible".

@anyasabo

Collaborator

commented Jul 1, 2019

With StatefulSets, it's impossible to specify a secret name for a single pod. Secrets apply to all pods in the set.
Regarding TLS certificates, it means we need to either:

Generate a single TLS secret for the entire StatefulSet, containing TLS certificates and private keys for all pods in that StatefulSet. This secret can be updated realtime as there are more pods to be created in the StatefulSet. This is what our Helm Charts seem to do. It means there is an implicit trust between nodes of the same set.
Generate the TLS certificate from within the pod at startup. Which is a feature we had implemented, then removed because of its additional complexity.

As a slight modification to option two: one (hacky) way to have different behavior between pods in a stateful set is to use the Downward API. We can pass the pod name in as an env var, and then we could have an init container use the k8s API to load the per-pod secret. Kafka and some operators I've worked on have used that technique to provide a kind of per-pod identity.

I still think "one secret per sset" is probably Good Enough to start, but just wanted to say there is a way to use k8s secrets and have distinct certs per pod
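A sketch of the Downward API technique mentioned above (the init container image and fetch command are hypothetical; only the fieldRef mechanism is standard Kubernetes):

```yaml
initContainers:
- name: fetch-node-cert          # hypothetical init container
  image: cert-fetcher:latest     # hypothetical image with k8s API access tooling
  env:
  - name: POD_NAME
    valueFrom:
      fieldRef:
        fieldPath: metadata.name # Downward API: each pod sees its own name
  command: ["sh", "-c", "fetch-secret \"${POD_NAME}-certs\""]  # hypothetical helper
```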

@anyasabo

Collaborator

commented Jul 1, 2019

Question mark on migrating small clusters (eg. 1 node to 1 node or 1 master+data to 1 master + 1 data): do we accept downtime (with PV reuse), or do we require some grow-and-shrink upgrade with data migration in order to keep the cluster alive?

My vote is to accept downtime. If you're only running 1 node it seems like you've already accepted there will be downtime and it is (hopefully) only for non-production usage.

My first thought was to set minimum number of nodes >= 3 and never worry about this again, but I can see the value of single node clusters for test environments.

@anyasabo

Collaborator

commented Jul 1, 2019

Extra things we'll need to deal with
...
Manage PVC deletion: the StatefulSet controller does not delete PVCs. We might decide to leave that up to the user? Probably not.

IMO we should just leave this up to the storage class they choose, and our operator should stay out of it (assuming our operator is not manually spinning up PVs and instead relies on dynamically or pre-provisioned volumes). If they want the PV deleted when the instance is deleted, then they can set the reclaim policy to Delete.

@anyasabo

Collaborator

commented Jul 1, 2019

I have some questions to flesh out some of the final concerns:

Some mutations are tricky (StatefulSet replacement, grow-and-shrink)

How often do we really think we will need to replace ssets? My guess is that by far the most common change will be increasing data per node, but I wonder if I'm missing something.

So following on that: what is our expected path to scale up data size? If it is just adding more nodes, then I think this is relatively easy. If we need to let people change the size of the data per-node, then we would need to work out a way to migrate to a new sset. That might be something we can do during or after beta though.

@pebrc

Collaborator

commented Jul 1, 2019

This also means doubling down on using persistent volumes instead of emptyDirs, which we consider not reliable.

I think there are legitimate search-heavy use cases that rely on emptyDir w/ tmpfs that we would exclude if we required everyone to use persistent volumes.

Applying grow-and-shrink upgrades gets a bit more complicated:

I think that's an understatement. I would argue that it becomes no longer viable for non-toy dataset sizes.

If we retain the ability (via process manager or other means) to have in-place configuration changes without recreating the pod, then the absence of a 'grow and shrink' style update strategy might not be so dramatic.

Finally, we could still retain the shrink bit of 'grow-and-shrink': safe draining/vacation of nodes would enable use cases that don't want persistent volumes (and you need the 'shrink' bit anyway for stateful set replacement, I guess).

Overall B's and C's

It feels like the main benefit would be better alignment with built-in Kubernetes controllers and the potential to benefit from any improvements to stateful sets in the future.

but removes some complexity from our current code

Here the benefits are not so clear. Probably a bit of simplification, but IIUC not a sea change: we would need to retain some change calculation code, for example to express elasticsearch.yml content as a hash that ends up as a label somewhere in the stateful set template. At least, that is what I assume we need to do to drive changes when the user modifies the Elasticsearch CRD in a way that is not directly visible in the pod template. And any simplification is offset by the new complexity we add to orchestrate multiple stateful sets and changes like stateful set replacements (thinking of gradual vacation/scaling down of an existing sset and scaling up of a replacement sset while maintaining changeBudget limitations).

it's big enough to deserve a place in a beta release?

I think it is less about the size of the change (if it is completely transparent to users, does it matter how big the change is?) and more about the backwards compatibility/stability/maturity we expect of a beta release. There would be no upgrade path from individual pod management to a stateful set based implementation, and that is probably a good reason to have it in a beta. (I mean, it's technically feasible, but then we would need to support both orchestration approaches concurrently for at least one release, etc.)

@sebgl

Collaborator Author

commented Jul 2, 2019

My guess is that by far the most common will be increasing data per node

I think you're right. Updating a sset VolumeClaimTemplate section will probably be possible at some point, but until then, we can't explicitly do it.

I guess we have 2 options there:

  1. Explicitly map a group of nodes in the ES CRD to a StatefulSet (suggested in the first post): users simply can't update the volume size of an existing group; we reject it. In order to upgrade their volume size, they need to create a new sset. In practice: just rename the group of nodes, and we manage creating a new sset and shutting down the old one (with data migration). For us internally, this is not about increasing volume sizes: it's just about removing a stateful set and adding a new one because the ES spec has changed (same as a simple node group name change). Drawback: not being able to update volume sizes can be weird from a user point of view (especially if they're not familiar with sset limitations), so this needs to be clearly documented.
  2. Don't map nodes group names to sset names: we manage ssets internally, decoupled from the ES spec name. We could base the sset name on the nodes group name + a VolumeClaimTemplate hash. We still have to go through new sset creation + old sset shutdown, but from the user's point of view it's possible to upgrade the volume size of an existing group of nodes.
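A sketch of option 2's naming scheme (the hash function and name format are assumptions, not a settled design):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// ssetName derives a StatefulSet name from the node group name plus a
// hash of its VolumeClaimTemplates, so changing the claims deterministically
// yields a new StatefulSet name.
func ssetName(group, claimSpec string) string {
	h := fnv.New32a()
	h.Write([]byte(claimSpec))
	return fmt.Sprintf("%s-%x", group, h.Sum32())
}

func main() {
	a := ssetName("data", "storage: 2Gi")
	b := ssetName("data", "storage: 5Gi")
	// Different claim specs yield different names; the scheme is deterministic.
	fmt.Println(a != b, a == ssetName("data", "storage: 2Gi")) // true true
}
```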

The second approach looks like an optimisation we could do later on, that would make our life a bit harder but the user's life a bit simpler.

We currently don't have this problem while managing pods directly: a user can update the VolumeClaimTemplates section, we'll manage deleting old pods and creating new ones with the new volume size. In the end, that's pretty similar to what we'd do with StatefulSets (spin up pods with new volumes, migrate data, delete pods with old volumes).
Current issues in our PVC reuse approach seem to suggest we may want to explicitly name our group of nodes to restrict PVC reuse to a single named group. Which gets us closer to StatefulSets constraints, but we still don't have to restrict VolumeClaimTemplates updates in an existing group.

@sebgl

Collaborator Author

commented Jul 2, 2019

Manage PVC deletion: the StatefulSet controller does not delete PVCs. We might decide to leave that up to the user? Probably not.

IMO we should just leave this up to the storage class they choose and our operator should stay out of it (assuming our operator is not manually spinning up PVs and instead relying on dynamically or pre-provisioned volumes). If they want the PV deleted when the instances is deleted then they can set the reclaim policy to Delete.

Not sure I understand correctly, I might be misunderstanding the way StatefulSets & PVC deletion work. From what I read in the docs, when a sset is deleted, the PVCs are not deleted. That's up to the user.

Storage classes can have retain policies to specify how the PV deletion happens if the PVC is deleted. But this does not control PVC deletion, only PV deletion. What I'm concerned with here is the PVC deletion if a sset is scaled down or removed. Say you scale down from 5 nodes to 3 nodes: I don't see why we'd keep PVCs around for nodes 4 and 5. If we decide to delete PVC, the underlying PV deletion is not our concern though, as you said.

@nkvoll

Member

commented Jul 2, 2019

I still think "one secret per sset" is probably Good Enough to start, but just wanted to say there is a way to use k8s secrets and have distinct certs per pod

@jonsabo This would require one or more containers in the Pod to access the API, which is undesirable if it can be avoided. From an RBAC perspective, as I understand it, the credentials would technically allow reading other nodes' certs as well? Something for a later discussion for sure though.

Question mark on migrating small clusters (eg. 1 node to 1 node or 1 master+data to 1 master + 1 data): do we accept downtime (with PV reuse), or do we require some grow-and-shrink upgrade with data migration in order to keep the cluster alive?

My vote is to accept downtime. If you're only running 1 node it seems like you've already accepted there will be downtime and it is (hopefully) only for non-production usage.

@jonsabo It is a regression from the status quo though. My vote is for accepting downtime as well fwiw.

It's important to note that it's not just about 1 node clusters. It's also about clusters that do not have replicas for all shards, e.g non-HA clusters etc.

My first thought was to set minimum number of nodes >= 3 and never worry about this again, but I can see the value of single node clusters for test environments.

@jonsabo Single node clusters for production can also work, depending on the nines you're promising and the specifics of the use case. Not saying it's necessarily advisable, but it's not categorically undoable. (splitting hairs here, probably)

So following on that: what is our expected path to scale up data size? If it is just adding more nodes, then I think this is relatively easy. If we need to let people change the size of the data per-node, then we would need to work out a way to migrate to a new sset. That might be something we can do during or after beta though.

@jonsabo I think lots of people will just change the size of the volume claim template. Migrating between statefulsets will be pretty much core to a lot of the configuration changes (e.g we want to support topology changes etc), and also ties into how we could do grow/shrink as well.

This also means doubling down on using persistent volumes instead of emptyDirs, which we consider not reliable.
I think there are legitimate search heavy use cases that rely on emptyDir w/ tmpfs that we would exclude when we require everyone to use persistent volumes.

@pebrc We can detect the usage of emptyDirs and disable rolling changes for those (e.g still support grow/shrink rather easily through option2 in the suggestion below)

If we retain the ability (via process manager or other means) to have in-place configuration changes without recreating the pod, then the absence of a 'grow and shrink' style update strategy might not be so dramatic.

@pebrc As it would only work for a subset of changes, I'm becoming more and more lukewarm on that approach, and would prefer that we first solve for general applicability and optimize afterwards if the payoff is clearly worth the effort and complexity.

It feels like the main benefit would be better alignment with build-in Kubernetes controllers and the potential to benefit from any improvements on stateful sets in the future.

@pebrc This is a huge and important point to my mind as well. That, and I would be /very/ happy to see stable node names and IPs as well.

Don't map nodes group names to sset names: we manage sset internally, decoupled from the ES spec name. We could base the sset name on nodes group name + VolumeClaimTemplate hash. We still have to go through new sset creation + old sset shutdown, but from the user point of view it's possible to upgrade the volume size of an existing group of nodes.

@sebgl To my mind this would be polyfilling for future K8s features and breaks stable naming/identity, so if K8s would disallow something, let's disallow it as well (but provide a path forward).

Storage classes can have retain policies to specify how the PV deletion happens if the PVC is deleted. But this does not control PVC deletion, only PV deletion. What I'm concerned with here is the PVC deletion if a sset is scaled down or removed. Say you scale down from 5 nodes to 3 nodes: I don't see why we'd keep PVCs around for nodes 4 and 5. If we decide to delete PVC, the underlying PV deletion is not our concern though, as you said.

@sebgl Wouldn't the same arguments for StatefulSets not deleting PVCs apply to our operator not deleting PVCs? What are they / what is different for us?


Suggestions: grow and shrink support:

  • User has a nodeSpec with the name foo, 3 nodes, and we've created the statefulset foo.
  • If maxUnavailable > 0, can do rolling changes, no grow and shrink required.
  • If maxUnavailable = 0 (or maybe if we can reliably detect that a node in the statefulset has the single copy of a shard?), pick an option:

Option 1

  • Create a new statefulset foo-a
  • Migrate from statefulset foo to foo-a.

I don't particularly like the idea of suffixing the replacement statefulset etc, seems a little messy.

Option 2

(my current preference)

  • Indicate the change cannot be performed, and that a new statefulset is required. This should prompt the user to change the name of the nodespec to, let's say, bar.
  • Migrate from statefulset foo to bar.

This makes our lives easier by not having to suffix statefulsets oddly and allows for nice and stable node names for the ES nodes.

This seems plausible to validate/verify/provide feedback on in a webhook as well?

Option 3

  • Make grow/shrink even more explicit than option 2. The user needs to add the new nodespec, update allocation excludes and eventually delete the old nodespec.
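All three options hinge on the same decision: does the change fit the existing StatefulSet or force a replacement? A rough Python sketch of that check (the field names are hypothetical and simplified; on a real StatefulSet most of the spec other than replicas, template and updateStrategy is immutable):

```python
# Hypothetical, simplified spec fields for illustration only.
IN_PLACE_FIELDS = {"replicas", "pod_template"}  # mutable on an existing sset

def requires_new_sset(old_spec: dict, new_spec: dict) -> bool:
    """True when a changed field (e.g. volumeClaimTemplates) cannot be
    updated on the existing sset, forcing one of the options above."""
    changed = {k for k in old_spec if old_spec[k] != new_spec.get(k)}
    return any(k not in IN_PLACE_FIELDS for k in changed)

old = {"replicas": 3, "pod_template": "v1", "volume_claims": "10Gi"}
assert not requires_new_sset(old, {**old, "replicas": 5})        # scale in place
assert requires_new_sset(old, {**old, "volume_claims": "20Gi"})  # new sset
```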
@pebrc (Collaborator) commented Jul 2, 2019

> Suggestions: grow and shrink support:

Option 1

Just so that I understand: you are intending to transparently map from the node set/spec to a suffixed stateful set, and the node spec name would be stable?

| node set | generation | sset       | maxUnavailable |
|----------|------------|------------|----------------|
| data-hot | 1          | data-hot-1 | 0              |
| data-hot | 2          | data-hot-1 | 1              |
| data-hot | 3          | data-hot-3 | 0              |

Where generation is the change of the Elasticsearch spec we are trying to roll out (the actual number could come out of some variant of our current change calculation)
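Assuming this mapping, a minimal Python sketch (illustrative only, not operator code) of the generation-suffixed naming: a rolling (in-place) change keeps the previously suffixed sset name, anything else mints a new one.

```python
def sset_name(node_set: str, generation: int,
              in_place: bool, previous: str) -> str:
    """Option 1 naming: in-place (rolling) changes keep the previous
    suffixed sset name; replacements mint <nodeSet>-<generation>."""
    return previous if in_place else f"{node_set}-{generation}"

name = sset_name("data-hot", 1, in_place=False, previous="")    # data-hot-1
name = sset_name("data-hot", 2, in_place=True, previous=name)   # data-hot-1
name = sset_name("data-hot", 3, in_place=False, previous=name)  # data-hot-3
```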

Option 2

I think from a technical POV it's all good, but the UX is just terrible: what name am I supposed to choose here? I doubt that a user will call their node sets foo; it will more likely be master-data or data etc., typically reflecting the topology of the set of nodes.

So then what is the new name I am supposed to choose here? data-hot-2 through data-hot-47 as I make changes to my cluster, where every change I am attempting will be some form of stop-and-go process:

  1. kubectl apply
  2. validation: Sorry Dave, I can't do that. You have to rename your sset first.
  3. kubectl apply, this time with a new StatefulSet name
  4. OK

Option 3

What's the point of having an operator if I have to vacate my nodes manually? Sounds like a world of pain.

@nkvoll (Member) commented Jul 2, 2019

@pebrc

> Just so that I understand: you are intending to transparently map from the node set/spec to a suffixed stateful set, and the node spec name would be stable?

Yes.

> I think from a technical POV it's all good, but the UX is just terrible: what name am I supposed to choose here? I doubt that a user will call their node sets foo; it will more likely be master-data or data etc., typically reflecting the topology of the set of nodes.
>
> So then what is the new name I am supposed to choose here? data-hot-2 through data-hot-47 as I make changes to my cluster, where every change I am attempting will be some form of stop-and-go process

I'm not certain the UX is terrible. It's not perfect, but we're not trying to paper over shortcomings of the underlying features we're using (StatefulSets), or over the "no downtime" limitation (in the non-HA or user-config case).

Only changes that are not possible to do without creating a new statefulset would require a new name. This part would be the same if you were doing statefulsets directly, except we can automate migrating from one statefulset to another, which is of great value.

> What's the point of having an operator if I have to vacate my nodes manually? Sounds like a world of pain

:shrug:. It just seems like a worse version of option 2.

@sebgl (Collaborator, Author) commented Jul 2, 2019

I'm trying to draft what StatefulSets reconciliation would look like in pseudo-code, taking changeBudget into account, covering both rolling upgrades & grow-and-shrink, and relying on sset rollingUpdate.Partition as a way to trigger pod replacement.

This assumes we map names in the group spec to sset names (no internal sset names to deal with volume size updates) - Option 2 from Njal (which was my option 1 😄).

## Phase 1: apply sset definition to create missing nodes
for each node group spec
    if no corresponding sset: skip
    if expected nodeCount >= existing sset replicas:
        apply the corresponding sset resource spec (this does not replace/delete any pod, but may create missing ones)
    if expected nodeCount < existing sset replicas:
        apply the corresponding sset resource spec, with replicas = old replica (we don't want to remove nodes yet)

## Phase 2: handle sset replacement
currentSurge = (current sset replicas sum - nodes spec count sum)
currentUnavailable = (nodes spec count sum - number of pods running ready)
## add nodes 1 by 1 for each new sset to give them all a chance to be partially populated
for each *new* node group spec, until currentSurge >= maxSurge:
    create sset & increment sset replicas
    currentSurge++; currentUnavailable--
## remove nodes 1 by 1 for each deprecated sset to give them all a chance to migrate data away
for each *deprecated* sset, until currentUnavailable >= maxUnavailable:
    prepare cluster for node removal (allocation excludes)
    if data migration over: sset replicas count-- to remove the node
    currentUnavailable++; currentSurge--

## Phase 3: handle sset upgrades
for each node group spec (ordered)
    get sset status
    if all pods are running the last revision && expected replicas count: reset rollingUpdate.Partition ordinal.

    # rolling upgrade with pvc reuse
    if maxUnavailable > 0:
        for each pod with the wrong revision, starting from highest ordinal
            if currentUnavailable >= maxUnavailable: skip
            currentUnavailable++; currentSurge--
            prepare cluster for node restart (disable shard allocations, set zen1 zen2)
            shift sset partition index to replace the node

    # remove extra nodes (covers "shrink" from grow and shrink)
    if all pods are running the last revision && replica count > expected:
        make sure no allocation excludes are set for these pods
        for each pod with ordinal > expected, starting from highest ordinal:
            if currentUnavailable >= maxUnavailable: skip
            currentUnavailable++; currentSurge--
            prepare cluster for node removal (allocation excludes)
            if data migration over: sset replicas count-- to remove the node
    
# Phase 4: grow (from grow and shrink)
# grow each sset one by one to make sure each one gets a chance to migrate data away
for each sset if maxUnavailable == 0 && not all pods running last revision:
    get pod with highest ordinal
    # migrate data away from nodes to replace
    if currentUnavailable <= maxUnavailable:
        currentUnavailable++; currentSurge--
        prepare cluster for node removal (allocation excludes)
        if data migration over: sset partition ordinal-- to replace the node
    # add more nodes to hold data migration, will be deleted later
    if currentSurge < maxSurge:
        sset replicas++
        currentUnavailable--; currentSurge++
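A rough Python translation (not operator code) of the surge/unavailable bookkeeping defined at the start of phase 2:

```python
def change_budget(sset_replicas: dict, expected: dict, ready_pods: int):
    """Phase 2 bookkeeping: surge = actual replicas over the spec'd total,
    unavailable = spec'd total minus pods that are running and ready."""
    current_surge = sum(sset_replicas.values()) - sum(expected.values())
    current_unavailable = sum(expected.values()) - ready_pods
    return current_surge, current_unavailable

# Replacing sset "old" (3 nodes, all still ready) with "new" (3 nodes wanted,
# 1 created but not ready yet):
surge, unavailable = change_budget({"old": 3, "new": 1}, {"new": 3},
                                   ready_pods=3)
print(surge, unavailable)  # 1 0
```

The loops above would then compare these two counters against maxSurge/maxUnavailable before creating or removing a node.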

I'm pretty sure this is not 100% correct and there are some gaps here. Among other things, it's missing a few pieces about resetting zen1, zen2, allocation excludes, shards allocations enabled/disabled settings, resetting rollingUpdate.Partition indices.

A few things I realised from this exercise:

  • I feel like it's harder to 1. calculate changes, then 2. apply changes, in 2 different steps. We don't have pods to create and pods to delete anymore, but sset specs to apply (or not), replicas counts to increment/decrement (or not), rolling update partition ordinals to decrement. Is it OK to reconcile directly and skip our 2-step (plan, apply) approach to changes? In other words, get rid of our entire mutation pkg (4k lines of code) to replace it with this more direct reconciliation?
  • Using rollingUpdate.Partition brings some complexity, mostly around the question of when we should decrement it or reset it to be equal to the number of replicas. Also: what if we set it to 1 (out of 5 replicas) but suddenly the spec changes again? We need to make sure we reset it back to 5 before applying a new spec. I think starting with OnDelete would generally make things simpler. Also we could probably pre-compute the list of pods to delete before applying changes, helping with the above point.
  • First phase is interesting: we can blindly apply the sset spec in many cases to make sure new pods are created with the correct spec, without bothering with computing pods to create as we did before. This acts similarly as our current first phase of "disaster recovery" where we make sure to recover missing pods, except it's handled naturally by StatefulSets.
  • StatefulSet replacement (one sset name disappears, another one appears in the spec) can be tricky if there are several replacements to perform at once. For example, if we want to replace a "data nodes" sset and a "master nodes" sset, we probably need to make progress for both replacements (and not focus on a single replacement) to make sure we don't use our budget to create data nodes but remove master nodes.
  • Relying on sset revisions from the sset status directly is quite handy. One thing missing from this exercise is handling ES config changes (which do not appear in the sset): we probably want to include the ES config hash in the sset podTemplate labels.
  • It's interesting to see how the last phase of grow-and-shrink (removing the extra replicas we added for data migration) is just reusing the normal way we scale a StatefulSet down (see last part of phase 3): we don't do anything special here, and already need that piece working for normal cluster downscale.
  • As expected, things are probably simpler if we remove grow-and-shrink to keep only pvc reuse & shrink-and-grow: we can basically get rid of the last block. But we still need the maxSurge concept to deal with sset replacement (phase 2)?
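As a sketch of the config-hash idea from the notes above (the label key and hash length here are made up for illustration):

```python
import hashlib

# Illustrative label key only; the operator would pick its own.
CONFIG_HASH_LABEL = "elasticsearch.k8s.elastic.co/config-hash"

def config_hash_labels(es_config: str) -> dict:
    """Stamp a hash of the ES config onto the pod template labels, so that a
    config-only change still shows up as a template change on the sset and
    participates in sset revision tracking."""
    digest = hashlib.sha256(es_config.encode("utf-8")).hexdigest()[:16]
    return {CONFIG_HASH_LABEL: digest}

a = config_hash_labels("node.master: true")
b = config_hash_labels("node.master: false")
assert a != b                                        # config change -> new template
assert config_hash_labels("node.master: true") == a  # deterministic
```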

@sebgl sebgl self-assigned this Jul 3, 2019

@nkvoll (Member) commented Jul 3, 2019

Some other notes for future discussion:

## create missing statefulsets
(maxSurge needs to be respected)

## update statefulset specs
update Partition = #replicas

## delete statefulsets with replicas == 0
kind of odd to have here but hey

## rotating statefulset replicas
update Partition.

## scaling statefulset up
update replicas.

## scaling statefulset down
ensure no data is lost when scaling down (migration etc)
ensure master node invariants are not broken.

## grow shrink
for each sset if maxUnavailable == 0 && not all pods running last revision && has maxSurge headroom:
    # we need to grow/shrink this one

    if #replicas == nodeCount:
        increase replicas to hit maxSurge.
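The grow step of that last note could be sketched in Python as follows (a simplification under the stated assumptions, not operator code):

```python
def grow_target_replicas(node_count: int, replicas: int,
                         max_surge: int) -> int:
    """If the sset sits at its desired size but needs a grow/shrink style
    replacement, grow it by the available surge headroom first; the extra
    replicas are removed again once data migration completes."""
    if replicas == node_count:
        return node_count + max_surge
    return replicas  # already grown (or mid-shrink); leave as-is

assert grow_target_replicas(3, 3, 1) == 4  # grow to surge headroom
assert grow_target_replicas(3, 4, 1) == 4  # already grown
```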
@sebgl (Collaborator, Author) commented Jul 3, 2019

One future feature we might want to have at some point, that will become harder with StatefulSets, is "temporarily pausing/stopping/removing a node from the cluster".

Use case: say you suspect a node is misbehaving for whatever reason (slow IO), which is slowing down the entire cluster. You want to temporarily remove that node from the cluster to see how it goes.

In the current implementation, users can update an annotation on the Elasticsearch resource, so the operator stops managing this cluster. Then manually delete the pod (remove the node) to see what happens. Once done, update that annotation again: the operator will recreate the pod.

With StatefulSets, "pausing" the operator does not prevent the StatefulSet controller from recreating any pod the user might delete. AFAIK there is nothing we can do about it, unless we go in a direction where the pod is still running but the Elasticsearch process itself is paused in the pod.
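To make the limitation concrete, a Python sketch of the annotation check (the annotation key is illustrative; the real operator annotation may differ): with pod-level management, returning False here truly pauses the cluster, but with StatefulSets the sset controller recreates deleted pods regardless of what this returns.

```python
# Illustrative annotation key; the operator's actual key may differ.
PAUSE_ANNOTATION = "common.k8s.elastic.co/pause"

def should_reconcile(cluster_annotations: dict) -> bool:
    """With pod-level management, skipping reconciliation really pauses the
    cluster: nothing recreates a pod the user deletes. With StatefulSets,
    the sset controller keeps recreating deleted pods no matter what."""
    return cluster_annotations.get(PAUSE_ANNOTATION) != "true"

assert should_reconcile({})
assert not should_reconcile({PAUSE_ANNOTATION: "true"})
```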

@pebrc (Collaborator) commented Jul 3, 2019

> where the pod is still running but the Elasticsearch process itself is paused in the pod

That would mean we need to keep the process manager around

@anyasabo (Collaborator) commented Jul 3, 2019

> So then what is the new name I am supposed to choose here? data-hot-2 through data-hot-47 as I make changes to my cluster, where every change I am attempting will be some form of stop-and-go process...

If I'm understanding you correctly, you're worried about how we decide the sset name when we need to create multiple ssets? If so, I think we can take the user-provided sset name and append a hash of the values that require a new sset.
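A sketch of that naming scheme in Python (the hash length and the choice of hashed fields are illustrative only):

```python
import hashlib

def suffixed_sset_name(user_name: str, new_sset_fields: str) -> str:
    """User-provided name plus a short hash of the values whose change
    forces a new StatefulSet (e.g. the serialized volumeClaimTemplates)."""
    suffix = hashlib.sha256(new_sset_fields.encode("utf-8")).hexdigest()[:6]
    return f"{user_name}-{suffix}"

a = suffixed_sset_name("data-hot", "storage: 10Gi")
b = suffixed_sset_name("data-hot", "storage: 20Gi")
assert a != b and a.startswith("data-hot-")
```

The logical identity ("data-hot") stays stable in the spec, while the suffix changes only when a replacement is unavoidable.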

@sebgl (Collaborator, Author) commented Jul 3, 2019

After discussing this in Zoom with the entire team: yes, we want to use StatefulSets.
We'll probably not include it in the next release (0.9), but work on it in parallel.

@anyasabo (Collaborator) commented Jul 3, 2019

> Not sure I understand correctly, I might be misunderstanding the way StatefulSets & PVC deletion work. From what I read in the docs, when a sset is deleted, the PVCs are not deleted. That's up to the user.

You are correct, that's the default behavior; I was assuming we would set the PVC owner ref to the ES cluster. That handles when people delete the cluster entirely, but we would need to handle scale down / sset swap events ourselves. There are issues open about adding the sset as an owner of the PVC, but no actual implementation yet, it looks like.

@pebrc (Collaborator) commented Jul 3, 2019

> If I'm understanding you correctly, you're worried about how we decide the sset name when we need to create multiple ssets? If so, I think we can take the user-provided sset name and append a hash of the values that require a new sset.

No, this was about the interaction model where the human user would have to choose new sset names to enact configuration changes that cannot be accomplished by modifying an existing set. The mode of operation outlined by @nkvoll and @sebgl means that the user would have to choose the name themselves, which I find problematic from a UX point of view because the logical identity of the set of nodes ('data-hot' for example) does not change.

What you are referring to is outlined above as Option 1 in Njal's plan.

@sebgl (Collaborator, Author) commented Jul 4, 2019

Summary of a discussion with @nkvoll and others:

  • Let's create a branch from master dedicated to working on StatefulSets. The goal is to keep that feature branch around for a few weeks (but not too long, e.g. no more than 1-2 months), and do PRs against it for everything that concerns the StatefulSets refactoring.
  • Let's start working on this now.
  • We can divide the work in a few steps:
    • handle single TLS secret + cfg per cluster
      • initcontainer to symlink files
      • single secret containing one entry per node
    • basic nodespec to sset changes in the code
    • "basic" rolling upgrades with no budget & no zen1/zen2 concerns for a single sset
    • sset replacement (delete a sset needs to be safe)
    • introduce change budget surge and unavailability: it's OK for the operator to be "stuck" in certain conditions
    • zen1
    • zen2
    • e2e tests green

@sebgl sebgl changed the title Should we rely on StatefulSets? Should we rely on StatefulSets? (Yes.) Jul 8, 2019

@thbkrkr thbkrkr closed this Jul 11, 2019

@thbkrkr thbkrkr reopened this Jul 11, 2019

@sebgl (Collaborator, Author) commented Jul 19, 2019

Closing this issue (initial discussion, decision, design) in favour of keeping #1299 open (meta-issue of remaining tasks).
