This repository has been archived by the owner on May 16, 2023. It is now read-only.

[elasticsearch] Allow for configurable health status value in readinessProbe #60

Closed
rwittrick opened this issue Feb 12, 2019 · 11 comments
Comments

@rwittrick
Contributor

In the elasticsearch statefulset readinessProbe command, the cluster health status has a hardcoded value of green. It's not always the case that, as with coordinator nodes, green is the minimum acceptable state for cluster health.
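For context, the kind of probe in question looks roughly like the sketch below (illustrative only, not the chart's exact script; the point is the status value fixed to green inside the command):

readinessProbe:
  exec:
    command:
      - sh
      - -c
      - |
        # Hypothetical sketch: fails until the cluster reports green,
        # even on coordinator-only nodes that are otherwise healthy.
        curl -sf "http://127.0.0.1:9200/_cluster/health?wait_for_status=green&timeout=1s"
  initialDelaySeconds: 10
  periodSeconds: 10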

@tgruenert

I've run into the same issue just now. It would be nice to be able to configure the required level in the values file; "green" could stay the default to keep things backwards compatible, e.g.:

readinessProbe.level: green

@rwittrick
Contributor Author

rwittrick commented Feb 14, 2019

Doing that would add level to the configuration of the readinessProbe itself and require awkward logic in the template to filter it out. A few suggestions for the values file:

  1. Put the value at the top level outside of the readinessProbe
minClusterHealthStatus: green
  2. Move the readinessProbe configuration fields under config, and the status field under readinessProbe
readinessProbe:
  config:
    failureThreshold: 3
    initialDelaySeconds: 10
    periodSeconds: 10
    successThreshold: 3
    timeoutSeconds: 5
  minClusterHealthStatus: green
  3. Similar to above, but put the status value more explicitly in the command key to differentiate it from the k8s config
readinessProbe:
  config:
    failureThreshold: 3
    initialDelaySeconds: 10
    periodSeconds: 10
    successThreshold: 3
    timeoutSeconds: 5
  command:
    minClusterHealthStatus: green

I'll work on a PR using option 3 for now.

@rwittrick
Contributor Author

Actually, option 3 above would be a breaking change for anyone who has already set values for the readinessProbe. The safest option is 1.
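For illustration, option 1 might be wired up something like this (names and paths follow the proposal above and are only a sketch, not necessarily what the final PR does):

# values.yaml
minClusterHealthStatus: green

# templates/statefulset.yaml (readinessProbe fragment)
readinessProbe:
  exec:
    command:
      - sh
      - -c
      - |
        # The target status now comes from values instead of being hardcoded
        curl -sf "http://127.0.0.1:9200/_cluster/health?wait_for_status={{ .Values.minClusterHealthStatus }}&timeout=1s"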

@tgruenert

Thank you for your feedback. After thinking about it, my first suggestion was not the best one.
Because I needed a quick fix for my problem, I also forked the project and made a change similar to your option 3.
It would be nice to see your PR for this topic. Thank you!

@rwittrick
Contributor Author

@tgruenert - see #62 :)

@jordansissel
Contributor

the cluster health status has a hardcoded value of green

My reading of the readinessProbe is that it checks for green once and, once green, only checks the node's root path (GET /) after that. It doesn't require the cluster to always be green.

I know it doesn't require green all the time because my cluster goes yellow (normal) during maintenance and none of my Elasticsearch nodes fail their readinessProbe.

@rwittrick
Contributor Author

rwittrick commented Feb 14, 2019

This issue was raised in relation to the coordinator node issue and this comment: #58 (comment).

I found this issue because our data nodes restarted and some shards were left unassigned. In the interim, our coordinator nodes restarted and, because the cluster was not green, their readiness probes failed. That meant Kibana, which was pointing at the coordinator nodes, could not connect to Elasticsearch, and we could not do index management through the UI to get the cluster back to green.

In conjunction with defining how coordinator nodes will work, this issue is about allowing those types of nodes (i.e. non-mutating ones) to still be accessible and become ready in this scenario.

@jordansissel
Contributor

Ahh. Makes sense.

I think the broader problem is that the readinessProbe currently evaluates a cluster-wide value (cluster state color) when readinessProbes should probably focus on individual pod health, not the health of the whole Elasticsearch cluster.

A red cluster state doesn't imply the cluster can't serve traffic, but that's what the readinessProbe is inferring today. Let's think about a broader per-node health check that doesn't look at the cluster health?
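A purely node-local probe could be as simple as the sketch below (just an illustration of the idea, with the caveats discussed in the next comment):

readinessProbe:
  exec:
    command:
      - sh
      - -c
      - |
        # Hypothetical node-local check: succeeds as soon as this node's HTTP
        # layer answers on its root path, regardless of cluster health color.
        curl -sf "http://127.0.0.1:9200/"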

@rwittrick
Contributor Author

rwittrick commented Feb 15, 2019

I think the broader problem is that the readinessProbe currently evaluates a cluster-wide value

agree

readinessProbes should probably focus on individual pod health

There is no real node health API for individual nodes, and I think it's still valid to check that the instance is part of the cluster before it's deemed ready. The closest thing I can think of is the Node Stats API, but that (given the query itself is valid) always returns success, and there are no "wait_for" parameters, so using it would involve a further step of interpreting the JSON response.
e.g.

curl -s elasticsearch-master:9200/_nodes/$(hostname -i),master:false,data:false,ingest:false/stats/http | grep -q '"successful":1'

And even then, to know you are part of a cluster, it's probably better to query a service that is not localhost.

FYI, the PR for the original issue that I raised has been merged. Once this discussion ends I am happy to look at this again.

@Crazybus
Contributor

I think the broader problem is that the readinessProbe currently evaluates a cluster-wide value (cluster state color) when readinessProbes should probably focus on individual pod health, not the health of the whole Elasticsearch cluster.

This would be great, but unfortunately statefulsets also use the readinessProbe to determine whether or not the next node in the statefulset can be upgraded. Using a purely node-level health check here would mean that rolling upgrades of Elasticsearch or Kubernetes would take down the Elasticsearch cluster. If anyone has a better idea of how to accomplish this in a statefulset without using a "cluster aware on startup" readinessProbe, I would love to hear it!
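For readers unfamiliar with that pattern, a "cluster aware on startup" probe can be approximated with a marker file: run the strict cluster-health check until it passes once, then fall back to a node-local check. A rough sketch (marker-file path and details are hypothetical, not the chart's exact implementation):

readinessProbe:
  exec:
    command:
      - bash
      - -c
      - |
        START_FILE=/tmp/.es_started   # hypothetical marker path
        if [ -f "${START_FILE}" ]; then
          # Strict check already passed once: only verify the local node responds
          curl -sf "http://127.0.0.1:9200/"
        else
          # First time: require the desired cluster health before becoming ready
          curl -sf "http://127.0.0.1:9200/_cluster/health?wait_for_status=green&timeout=1s" \
            && touch "${START_FILE}"
        fi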

I found this issue because our data nodes restarted and some shards were left unassigned. In the interim, our coordinator nodes restarted

Just to clarify what happened here. Did all of these nodes get restarted at the same time? Or was this a controlled rolling upgrade scenario where Kubernetes was doing a planned rolling restart?

If all nodes do get restarted at exactly the same time, it is indeed a shame that the readinessProbe means no traffic is served until the cluster is green. So having coordinating nodes that always serve traffic does make sense in that situation.

@Crazybus
Contributor

Crazybus commented Mar 11, 2019

Closing this as it was fixed by @rwittrick in #62. Thank you so much for opening and fixing the issue :)
