
Shutdown records left standing after node has been restarted reduce number of replicas for auto_expand_replicas #5628

Closed
pebrc opened this issue Apr 28, 2022 · 4 comments · Fixed by #5629
Labels
>bug Something isn't working

Comments

@pebrc
Collaborator

pebrc commented Apr 28, 2022

Nodes with node shutdown records reduce the number of available nodes that the auto_expand_replicas mechanism in Elasticsearch takes into account when adjusting the replicas, which happens on each re-route task, e.g. after a node has left the cluster.

If we have a three node cluster with two shutdown records, the number of replicas calculated for auto_expand_replicas is 0. If the remaining replica then happens to lie on a node scheduled for shutdown, the cluster goes RED.
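
To make the failure mode concrete, here is a minimal sketch of how the replica count collapses. It only models the behaviour described above; the function and the numbers are illustrative, not Elasticsearch's actual implementation:

package main

import "fmt"

// effectiveReplicas is a simplified model of auto_expand_replicas ("min-max"):
// replicas are bounded by the number of eligible data nodes minus one (the
// primary), and nodes with a shutdown record are not counted as eligible.
func effectiveReplicas(minReplicas, maxReplicas, dataNodes, shutdownRecords int) int {
	eligible := dataNodes - shutdownRecords
	replicas := eligible - 1 // one copy is always the primary
	if replicas > maxReplicas {
		replicas = maxReplicas
	}
	if replicas < minReplicas {
		replicas = minReplicas
	}
	return replicas
}

func main() {
	// Three data nodes, "auto_expand_replicas": "0-1", two shutdown records
	// left standing: only one node counts as eligible, so the index is
	// auto-expanded down to 0 replicas.
	fmt.Println(effectiveReplicas(0, 1, 3, 2)) // prints 0
}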

Console grab from a manual reproduction:

~/s/e/shutdown (master)> k8surl PUT _ _nodes/aCDmIS53SV6kIW4kLUdkrA/shutdown request.json
{"acknowledged":true}
~/s/e/shutdown (master)> k8surl PUT _ _nodes/Kt6rd7L4Qpe7s8Z8n-VT2A/shutdown request.json
{"acknowledged":true}
~/s/e/shutdown (master)> k8surl GET _ _nodes/shutdown | jq
{
  "nodes": [
    {
      "node_id": "aCDmIS53SV6kIW4kLUdkrA",
      "type": "RESTART",
      "reason": "manual test",
      "allocation_delay": "5m",
      "shutdown_startedmillis": 1651149105690,
      "status": "COMPLETE",
      "shard_migration": {
        "status": "COMPLETE",
        "shard_migrations_remaining": 0,
        "explanation": "no shard relocation is necessary for a node restart"
      },
      "persistent_tasks": {
        "status": "COMPLETE"
      },
      "plugins": {
        "status": "COMPLETE"
      }
    },
    {
      "node_id": "Kt6rd7L4Qpe7s8Z8n-VT2A",
      "type": "RESTART",
      "reason": "manual test",
      "allocation_delay": "5m",
      "shutdown_startedmillis": 1651149113387,
      "status": "COMPLETE",
      "shard_migration": {
        "status": "COMPLETE",
        "shard_migrations_remaining": 0,
        "explanation": "no shard relocation is necessary for a node restart"
      },
      "persistent_tasks": {
        "status": "COMPLETE"
      },
      "plugins": {
        "status": "COMPLETE"
      }
    }
  ]
}
~/s/e/shutdown (master)> k8surl GET _ _cat/shards
data-integrity-check 2 p STARTED 1 3.9kb 10.73.50.59 test-mutation-http-to-https-64nh-es-masterdata-2
data-integrity-check 2 r STARTED 1 3.9kb 10.73.49.45 test-mutation-http-to-https-64nh-es-masterdata-1
data-integrity-check 1 p STARTED 4 3.9kb 10.73.50.59 test-mutation-http-to-https-64nh-es-masterdata-2
data-integrity-check 1 r STARTED 4 3.9kb 10.73.48.36 test-mutation-http-to-https-64nh-es-masterdata-0
data-integrity-check 0 r STARTED 0  225b 10.73.48.36 test-mutation-http-to-https-64nh-es-masterdata-0
data-integrity-check 0 p STARTED 0  225b 10.73.49.45 test-mutation-http-to-https-64nh-es-masterdata-1
.geoip_databases     0 r STARTED         10.73.48.36 test-mutation-http-to-https-64nh-es-masterdata-0
.geoip_databases     0 p STARTED         10.73.49.45 test-mutation-http-to-https-64nh-es-masterdata-1
~/s/e/shutdown (master)> k get pods
NAME                                               READY   STATUS    RESTARTS   AGE
test-mutation-http-to-https-64nh-es-masterdata-0   1/1     Running   0          65s
test-mutation-http-to-https-64nh-es-masterdata-1   1/1     Running   0          22m
test-mutation-http-to-https-64nh-es-masterdata-2   1/1     Running   0          24m
~/s/e/shutdown (master)> k delete pod test-mutation-http-to-https-64nh-es-masterdata-0
pod "test-mutation-http-to-https-64nh-es-masterdata-0" deleted
~/s/e/shutdown (master)> k8surl GET _ _cat/shards
data-integrity-check 2 p STARTED 1 3.9kb 10.73.50.59 test-mutation-http-to-https-64nh-es-masterdata-2
data-integrity-check 2 r STARTED 1 3.9kb 10.73.49.45 test-mutation-http-to-https-64nh-es-masterdata-1
data-integrity-check 1 p STARTED 4 3.9kb 10.73.50.59 test-mutation-http-to-https-64nh-es-masterdata-2
data-integrity-check 1 r STARTED 4 3.9kb 10.73.48.37 test-mutation-http-to-https-64nh-es-masterdata-0
data-integrity-check 0 r STARTED 0  225b 10.73.48.37 test-mutation-http-to-https-64nh-es-masterdata-0
data-integrity-check 0 p STARTED 0  225b 10.73.49.45 test-mutation-http-to-https-64nh-es-masterdata-1
.geoip_databases     0 p STARTED         10.73.49.45 test-mutation-http-to-https-64nh-es-masterdata-1

Unfortunately, our current implementation sometimes has transient states in which two shutdown records exist, because we only clean up old records after creating new ones, and we sometimes even short-circuit the reconciliation, which prolongs the transient state even further.
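
For illustration, a rough sketch of the ordering that would avoid the double-record state: stale records are deleted before a new one is registered. ShutdownAPI and both of its methods are placeholder names, not the real operator or Elasticsearch client API:

package shutdown

import "context"

// ShutdownAPI is an illustrative placeholder for the node shutdown API client
// used by the operator.
type ShutdownAPI interface {
	PutShutdown(ctx context.Context, nodeID, shutdownType string) error
	DeleteShutdown(ctx context.Context, nodeID string) error
}

// reconcileShutdownRecords sketches the ordering that avoids two shutdown
// records coexisting: records for nodes that have already been restarted are
// removed before a new record is registered for the next node.
func reconcileShutdownRecords(ctx context.Context, es ShutdownAPI, restarting string, restarted []string) error {
	// 1. Clean up records for nodes that have already come back.
	for _, node := range restarted {
		if err := es.DeleteShutdown(ctx, node); err != nil {
			return err
		}
	}
	// 2. Only then create a RESTART record for the node about to be restarted.
	return es.PutShutdown(ctx, restarting, "RESTART")
}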

@pebrc pebrc added the >bug Something isn't working label Apr 28, 2022
@barkbay
Contributor

barkbay commented Apr 28, 2022

IIUC we should not have two shutdown records at the same time (assuming we only restart one Pod at a time, which I guess is the case here).
I wonder if we shouldn't try to clean up existing shutdown records at the beginning of the upgrade logic, right after the expectations are checked:

// We need to check that all the expectations are satisfied before continuing.
// This is to be sure that none of the previous steps has changed the state and
// that we are not running with a stale cache.
ok, reason, err := d.expectationsSatisfied()
if err != nil {
	return results.WithError(err)
}
if !ok {
	reason := fmt.Sprintf("Nodes upgrade: %s", reason)
	return results.WithReconciliationState(defaultRequeue.WithReason(reason))
}
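
Something along those lines could look like the following fragment, placed right after the expectations check above; cleanUpStaleShutdownRecords is a hypothetical helper name, not an existing method in the operator:

// Hypothetical: delete shutdown records of nodes that have already been
// restarted and rejoined the cluster, so stale records no longer skew
// auto_expand_replicas while the upgrade proceeds.
if err := d.cleanUpStaleShutdownRecords(ctx); err != nil {
	return results.WithError(err)
}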

@colings86

I think this should be solved by Elasticsearch rather than in the operator. The interaction between node shutdown and auto expand replicas doesn't seem right here. I'd chat to the es-distributed team about this if you haven't already.

@pebrc
Collaborator Author

pebrc commented Apr 28, 2022

Good point @colings86, I will reach out once more. We still want this fix on the operator side for the time being, though.

@pebrc
Collaborator Author

pebrc commented Apr 29, 2022

The underlying issue with auto_expand_replicas is addressed in Elasticsearch 7.17.2, 8.1.2, and 8.2.0 through elastic/elasticsearch#85306
