
Shutdown records left standing after node has been restarted reduce number of replicas for auto_expand_replicas #5628

Closed
pebrc opened this issue Apr 28, 2022 · 4 comments · Fixed by #5629
Labels
>bug Something isn't working

Comments

@pebrc
Collaborator

pebrc commented Apr 28, 2022

Nodes with node shutdown records reduce the number of available nodes that the auto_expand_replicas mechanism in Elasticsearch takes into account when adjusting the replicas, which happens on each re-route task, e.g. after a node has left the cluster.

If we have a three node cluster with two shutdown records, the number of replicas calculated for auto_expand_replicas is 0. If the remaining replica then happens to lie on a node scheduled for shutdown, the cluster goes RED.
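
To make the failure mode concrete, here is a minimal sketch of how the replica count collapses. It only models the behaviour described above; the function and the numbers are illustrative, not Elasticsearch's actual implementation:

package main

import "fmt"

// effectiveReplicas is a simplified model of auto_expand_replicas ("min-max"):
// replicas are bounded by the number of eligible data nodes minus one (the
// primary), and nodes with a shutdown record are not counted as eligible.
func effectiveReplicas(minReplicas, maxReplicas, dataNodes, shutdownRecords int) int {
	eligible := dataNodes - shutdownRecords
	replicas := eligible - 1 // one copy is always the primary
	if replicas > maxReplicas {
		replicas = maxReplicas
	}
	if replicas < minReplicas {
		replicas = minReplicas
	}
	return replicas
}

func main() {
	// Three data nodes, "auto_expand_replicas": "0-1", two shutdown records
	// left standing: only one node counts as eligible, so the index is
	// auto-expanded down to 0 replicas.
	fmt.Println(effectiveReplicas(0, 1, 3, 2)) // prints 0
}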

Console grab from a manual reproduction:

~/s/e/shutdown (master)> k8surl PUT _ _nodes/aCDmIS53SV6kIW4kLUdkrA/shutdown request.json
{"acknowledged":true}
~/s/e/shutdown (master)> k8surl PUT _ _nodes/Kt6rd7L4Qpe7s8Z8n-VT2A/shutdown request.json
{"acknowledged":true}
~/s/e/shutdown (master)> k8surl GET _ _nodes/shutdown | jq
{
  "nodes": [
    {
      "node_id": "aCDmIS53SV6kIW4kLUdkrA",
      "type": "RESTART",
      "reason": "manual test",
      "allocation_delay": "5m",
      "shutdown_startedmillis": 1651149105690,
      "status": "COMPLETE",
      "shard_migration": {
        "status": "COMPLETE",
        "shard_migrations_remaining": 0,
        "explanation": "no shard relocation is necessary for a node restart"
      },
      "persistent_tasks": {
        "status": "COMPLETE"
      },
      "plugins": {
        "status": "COMPLETE"
      }
    },
    {
      "node_id": "Kt6rd7L4Qpe7s8Z8n-VT2A",
      "type": "RESTART",
      "reason": "manual test",
      "allocation_delay": "5m",
      "shutdown_startedmillis": 1651149113387,
      "status": "COMPLETE",
      "shard_migration": {
        "status": "COMPLETE",
        "shard_migrations_remaining": 0,
        "explanation": "no shard relocation is necessary for a node restart"
      },
      "persistent_tasks": {
        "status": "COMPLETE"
      },
      "plugins": {
        "status": "COMPLETE"
      }
    }
  ]
}
~/s/e/shutdown (master)> k8surl GET _ _cat/shards
data-integrity-check 2 p STARTED 1 3.9kb 10.73.50.59 test-mutation-http-to-https-64nh-es-masterdata-2
data-integrity-check 2 r STARTED 1 3.9kb 10.73.49.45 test-mutation-http-to-https-64nh-es-masterdata-1
data-integrity-check 1 p STARTED 4 3.9kb 10.73.50.59 test-mutation-http-to-https-64nh-es-masterdata-2
data-integrity-check 1 r STARTED 4 3.9kb 10.73.48.36 test-mutation-http-to-https-64nh-es-masterdata-0
data-integrity-check 0 r STARTED 0  225b 10.73.48.36 test-mutation-http-to-https-64nh-es-masterdata-0
data-integrity-check 0 p STARTED 0  225b 10.73.49.45 test-mutation-http-to-https-64nh-es-masterdata-1
.geoip_databases     0 r STARTED         10.73.48.36 test-mutation-http-to-https-64nh-es-masterdata-0
.geoip_databases     0 p STARTED         10.73.49.45 test-mutation-http-to-https-64nh-es-masterdata-1
~/s/e/shutdown (master)> k get pods
NAME                                               READY   STATUS    RESTARTS   AGE
test-mutation-http-to-https-64nh-es-masterdata-0   1/1     Running   0          65s
test-mutation-http-to-https-64nh-es-masterdata-1   1/1     Running   0          22m
test-mutation-http-to-https-64nh-es-masterdata-2   1/1     Running   0          24m
~/s/e/shutdown (master)> k delete pod test-mutation-http-to-https-64nh-es-masterdata-0
pod "test-mutation-http-to-https-64nh-es-masterdata-0" deleted
~/s/e/shutdown (master)> k8surl GET _ _cat/shards
data-integrity-check 2 p STARTED 1 3.9kb 10.73.50.59 test-mutation-http-to-https-64nh-es-masterdata-2
data-integrity-check 2 r STARTED 1 3.9kb 10.73.49.45 test-mutation-http-to-https-64nh-es-masterdata-1
data-integrity-check 1 p STARTED 4 3.9kb 10.73.50.59 test-mutation-http-to-https-64nh-es-masterdata-2
data-integrity-check 1 r STARTED 4 3.9kb 10.73.48.37 test-mutation-http-to-https-64nh-es-masterdata-0
data-integrity-check 0 r STARTED 0  225b 10.73.48.37 test-mutation-http-to-https-64nh-es-masterdata-0
data-integrity-check 0 p STARTED 0  225b 10.73.49.45 test-mutation-http-to-https-64nh-es-masterdata-1
.geoip_databases     0 p STARTED         10.73.49.45 test-mutation-http-to-https-64nh-es-masterdata-1

Unfortunately, our current implementation sometimes has transient states in which two shutdown records exist, because we only clean up old records after creating new ones, and we sometimes even short-circuit the reconciliation, which prolongs the transient state even further.
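
For illustration, a rough sketch of the ordering that would avoid the double-record state: stale records are deleted before a new one is registered. ShutdownAPI and both of its methods are placeholder names, not the real operator or Elasticsearch client API:

package shutdown

import "context"

// ShutdownAPI is an illustrative placeholder for the node shutdown API client
// used by the operator.
type ShutdownAPI interface {
	PutShutdown(ctx context.Context, nodeID, shutdownType string) error
	DeleteShutdown(ctx context.Context, nodeID string) error
}

// reconcileShutdownRecords sketches the ordering that avoids two shutdown
// records coexisting: records for nodes that have already been restarted are
// removed before a new record is registered for the next node.
func reconcileShutdownRecords(ctx context.Context, es ShutdownAPI, restarting string, restarted []string) error {
	// 1. Clean up records for nodes that have already come back.
	for _, node := range restarted {
		if err := es.DeleteShutdown(ctx, node); err != nil {
			return err
		}
	}
	// 2. Only then create a RESTART record for the node about to be restarted.
	return es.PutShutdown(ctx, restarting, "RESTART")
}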

@pebrc pebrc added the >bug Something isn't working label Apr 28, 2022
@barkbay
Contributor

barkbay commented Apr 28, 2022

IIUC we should not have two shutdown records at the same time (assuming we only restart one Pod at a time, which I guess is the case here).
I wonder if we shouldn't try to clean up existing shutdown records at the beginning of the upgrade logic, right after the expectations are checked:

// We need to check that all the expectations are satisfied before continuing.
// This is to be sure that none of the previous steps has changed the state and
// that we are not running with a stale cache.
ok, reason, err := d.expectationsSatisfied()
if err != nil {
	return results.WithError(err)
}
if !ok {
	reason := fmt.Sprintf("Nodes upgrade: %s", reason)
	return results.WithReconciliationState(defaultRequeue.WithReason(reason))
}
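
Something along those lines could look like the following fragment, placed right after the expectations check above; cleanUpStaleShutdownRecords is a hypothetical helper name, not an existing method in the operator:

// Hypothetical: delete shutdown records of nodes that have already been
// restarted and rejoined the cluster, so stale records no longer skew
// auto_expand_replicas while the upgrade proceeds.
if err := d.cleanUpStaleShutdownRecords(ctx); err != nil {
	return results.WithError(err)
}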

@colings86

I think this should be solved by Elasticsearch rather than in the operator. The interaction between node shutdown and auto expand replicas doesn't seem right here. I'd chat to the es-distributed team about this if you haven't already.

@pebrc
Collaborator Author

pebrc commented Apr 28, 2022

Good point @colings86, I will reach out once more. We still want this fix on the operator side for the time being, though.

@pebrc
Collaborator Author

pebrc commented Apr 29, 2022

The underlying issue with auto_expand_replicas is addressed in Elasticsearch 7.17.2, 8.1.2, and 8.2.0 through elastic/elasticsearch#85306
