Support smoother k8s nodes rotation when using local volumes #2806

Open
sebgl opened this issue Apr 2, 2020 · 5 comments
Labels
>feature Adds or discusses adding a feature to the product

Comments

sebgl (Contributor) commented Apr 2, 2020

When using local volumes, it can be quite complicated to handle Kubernetes node upgrades.
One common way to upgrade a k8s node is to take it out of the cluster and replace it with a fresh one, in which case the local volume is lost and the corresponding Elasticsearch Pod stays Pending forever.

When that happens, the only way out is to manually remove both the Pod and its PVC, so a new Pod gets created with a new volume (the manual steps are roughly sketched at the end of this comment).

In an ideal world, to simplify this, we would like to:

  • migrate data away from the ES node that will be removed (the k8s node is probably being drained at k8s level already)
  • once that node is removed, and the corresponding Pod becomes Pending, ECK would delete both Pod and PVC so they are recreated elsewhere
  • this is a mode of operation the user would probably have to indicate somewhere (in the Elasticsearch spec?). Doing it automatically feels complicated (how long should we wait? will the node come back?) and dangerous.

Related discuss issue: https://discuss.elastic.co/t/does-eck-support-local-persistent-disks-and-is-it-a-good-idea/223515/3
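
For reference, the manual recovery today looks roughly like the sketch below. All names are assumptions to adapt: the cluster is called cluster, the affected Pod cluster-es-data-0, and the PVC follows ECK's elasticsearch-data-<pod-name> convention; it also assumes the elastic user's password is exported as $PASSWORD and the cluster-es-http service is port-forwarded to localhost:9200.

# 1. Before the k8s node is rotated, route shards away from the ES node that lives on it:
curl -sk -u "elastic:$PASSWORD" -H 'Content-Type: application/json' \
  -X PUT "https://localhost:9200/_cluster/settings" \
  -d '{"persistent":{"cluster.routing.allocation.exclude._name":"cluster-es-data-0"}}'

# 2. Once the k8s node is gone and the Pod is stuck Pending, delete the PVC and the Pod
#    so the StatefulSet recreates both on another node with a fresh volume:
kubectl delete pvc elasticsearch-data-cluster-es-data-0
kubectl delete pod cluster-es-data-0

# 3. Remove the allocation exclusion once the cluster is green again.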

sebgl added the >feature label on Apr 2, 2020

weeco commented Apr 4, 2020

I am not sure whether this is helpful at all because it's at such a high level, but I think the ECK operator could watch the ES data nodes and their corresponding Kubernetes nodes.

The moment, say, data-node-0 is no longer scheduled to k8s-node-abc but to another node (for whatever reason), you can assume that this Elasticsearch node has lost its data. If that is the case, the ECK operator can delete/recreate the PVC so that the Pod is no longer Pending.
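
A rough sketch of that kind of check from a shell (the PVC name is a placeholder, and the jsonpath assumes the usual local-volume setup where the PV pins itself to one node via required nodeAffinity on kubernetes.io/hostname):

PVC=elasticsearch-data-cluster-es-data-0
PV=$(kubectl get pvc "$PVC" -o jsonpath='{.spec.volumeName}')
NODE=$(kubectl get pv "$PV" -o jsonpath='{.spec.nodeAffinity.required.nodeSelectorTerms[0].matchExpressions[0].values[0]}')
# If the node backing the local PV no longer exists, the Pod can never be scheduled again:
kubectl get node "$NODE" >/dev/null 2>&1 || echo "node $NODE is gone; PVC $PVC and its Pod need to be recreated"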

Does that make sense or am I missing something?

FingerLiu commented

This simple script seems to work, but it is not proven in prod:

# keep new Pods off the node that is about to be rotated
kubectl cordon k8s-node-abc

# delete the PVC bound to the local volume on that node (PVC name is a placeholder)
kubectl delete pvc -es-xxx --force --grace-period=0

# evict the remaining Pods and delete their local data
kubectl drain k8s-node-abc --delete-local-data --ignore-daemonsets

# make the node schedulable again once it comes back
kubectl uncordon k8s-node-abc

thbkrkr (Contributor) commented Mar 18, 2021

Relates to #2448.


Jacse commented Mar 11, 2022

We've run into this exact issue twice now. When we try to upgrade the k8s version in our node pool, we lose all our data and the cluster goes into a completely broken state.

I don't know how it works with other providers, but I can speak for GKE. We have a cluster with 3 nodes and an index with 2 shards and 1 replica per shard.

What I believe happens is the following:

  1. GKE initiates a node pool version upgrade
  2. A node is drained and its pod is deleted along with local data
  3. A new node spins up with a new pod
  4. The new pod starts receiving data from replica shards stored on the other nodes
  5. GKE respects the Pod disruption budget of max 1 unavailable Pod, but only for up to 1 hour; after that it continues the upgrade with the next node ("Note: During automatic or manual node upgrades, PDBs are respected for a maximum of 1 hour. If Pods running on a node cannot be scheduled onto new nodes within 1 hour, the upgrade is initiated, regardless.", from the GKE documentation)
  6. The next node is drained (now 2 of 3 nodes are unhealthy) and everything is a mess

The logs from GKE show that almost exactly one hour passes between each node teardown.
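
If it helps to see what GKE is waiting on, the PDB that ECK manages for the cluster can be inspected directly (the label below is ECK's cluster-name label; the cluster name is a placeholder). By default ECK only allows a disruption while the cluster health is green, so ALLOWED DISRUPTIONS should drop to 0 as soon as the cluster starts recovering:

kubectl get pdb -l elasticsearch.k8s.elastic.co/cluster-name=<cluster-name>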


othmane399 commented Sep 5, 2022

Hello,

Here's our approach to upgrading k8s on local-storage node groups:

  • create an upgraded k8s node group with a dedicated label (let's say group: beta)
  • change the name and the nodeSelector label of the nodeSet to upgrade, and patch the updateStrategy as follows:
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch

metadata:
  name: cluster

spec:
  version: 8.3.3

  # Add before removing, ensuring no data is ever lost
  updateStrategy:
    changeBudget:
      maxSurge: 1
      maxUnavailable: 0

  nodeSets:
  
  # Change name to beta (required)
  - name: alpha
    count: 2

    podTemplate:
      spec:
        # Pin the Pods to one node group
        # Change to beta
        nodeSelector:
          group: alpha
  • delete the old node group once all shards have been migrated (a quick check is sketched below), and revert the updateStrategy
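
A quick way to confirm the old nodeSet is empty before deleting the node group (assumed names: the elastic password exported as $PASSWORD, cluster-es-http port-forwarded to localhost:9200, and the old nodeSet's Pods named cluster-es-alpha-*):

# No output means no shards are left on the old alpha nodes and the group can be removed:
curl -sk -u "elastic:$PASSWORD" "https://localhost:9200/_cat/shards?h=index,shard,prirep,node" | grep cluster-es-alpha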
