Shutdown records left standing after a node has been restarted reduce the number of replicas for auto_expand_replicas
#5628
IIUC we should not have two shutdown records at the same time (assuming we only restart one Pod at a time, which I believe is the case here). See cloud-on-k8s/pkg/controller/elasticsearch/driver/upgrade.go, lines 36 to 47 at commit 6badd6d.
I think this should be solved by Elasticsearch rather than in the operator. The interaction between node shutdown and auto-expand replicas doesn't seem right here. I'd chat to the es-distributed team about this if you haven't already.
Good point. @colings86 I will reach out once more. We still want this fix on the operator side for the time being, though.
The underlying issue: nodes with node shutdown records reduce the number of available nodes that the auto_expand_replicas mechanism in Elasticsearch takes into account when adjusting the replicas, which happens on each re-route task, e.g. after a node has left the cluster. If we have a three-node cluster with two shutdown records, then the number of replicas calculated for auto_expand_replicas is 0; if the remaining replica comes to lie on a node scheduled for shutdown, the cluster goes RED.
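The replica calculation described above can be sketched as follows. This is a simplified illustration of the interaction, not Elasticsearch's actual implementation; the function name and structure are assumptions made for the example.

```go
package main

import "fmt"

// autoExpandReplicas sketches how Elasticsearch derives the replica count for
// an index with auto_expand_replicas set to "0-all": the upper bound is the
// number of data nodes NOT marked for shutdown, minus one for the primary.
// Simplified for illustration; the real logic lives inside Elasticsearch.
func autoExpandReplicas(totalNodes, shutdownRecords int) int {
	available := totalNodes - shutdownRecords
	replicas := available - 1
	if replicas < 0 {
		replicas = 0
	}
	return replicas
}

func main() {
	// Healthy three-node cluster: two replicas per shard.
	fmt.Println(autoExpandReplicas(3, 0)) // 2
	// One shutdown record (a normal rolling restart): one replica.
	fmt.Println(autoExpandReplicas(3, 1)) // 1
	// Two stale shutdown records: zero replicas. If the remaining copy sits
	// on a node scheduled for shutdown, the cluster goes RED.
	fmt.Println(autoExpandReplicas(3, 2)) // 0
}
```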
Console grab from a manual reproduction:
Our current implementation unfortunately sometimes produces transient states in which two records exist, because we clean up old records only after creating new ones; we even sometimes short-circuit the reconciliation, which prolongs the transient state further.
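The transient two-record state could be avoided by deleting the stale record before creating the new one. A minimal sketch of the two orderings, using a hypothetical in-memory record store rather than the operator's real Elasticsearch client (all names here are illustrative):

```go
package main

import "fmt"

// records is a hypothetical stand-in for the cluster's node shutdown records,
// keyed by node name. The operator's real code talks to the Elasticsearch
// node shutdown API instead.
type records map[string]bool

// createThenCleanup mirrors the current behavior: the new record is created
// first, so until cleanup runs (and reconciliation may be short-circuited
// before it does) two records coexist. Returns the peak record count.
func createThenCleanup(r records, oldNode, newNode string) int {
	r[newNode] = true
	peak := len(r) // transient state visible to auto_expand_replicas
	delete(r, oldNode)
	return peak
}

// cleanupThenCreate deletes the stale record first, so at most one record
// exists at any point in time.
func cleanupThenCreate(r records, oldNode, newNode string) int {
	delete(r, oldNode)
	r[newNode] = true
	return len(r)
}

func main() {
	fmt.Println(createThenCleanup(records{"node-0": true}, "node-0", "node-1")) // 2
	fmt.Println(cleanupThenCreate(records{"node-0": true}, "node-0", "node-1")) // 1
}
```

The ordering matters because auto_expand_replicas reacts to whatever set of records is visible at each re-route, not to the operator's eventual steady state.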