Provide better indication of the cluster state - is it writable (has quorum) #4755

rore · 2014-01-16T10:36:22Z

When Elasticsearch backs a critical application, you need to be able to monitor the cluster health.
The current health indication is not enough, since the "yellow" state doesn't have a single meaning.
It means that all primaries are up but some replica shards are not allocated. On an operational level, though, this has two possible implications:
Some shards may be missing while every index still has a quorum, so the cluster is "writable".
Or some indexes may lack a quorum, in which case writes to them will fail.

This is a huge difference on an operational level.

We need a way to know (and monitor) the real state of the cluster: not only that all primaries are up, but also whether there is a quorum.

One possible way is to add another color, for instance:
yellow - some replicas are missing but you have a quorum
orange - primary up but no quorum

Or use another indicator altogether.
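
A minimal sketch of how such a split could be computed for a single shard, assuming quorum means a simple majority of all shard copies (primary plus replicas). The function name `proposed_color` and the majority rule are illustrative assumptions, not what Elasticsearch does today:

```python
def proposed_color(primary_active: bool, active_copies: int, total_copies: int) -> str:
    """Classify one shard under the proposed yellow/orange split.

    Assumption: quorum is a simple majority of all copies (primary + replicas).
    """
    if not primary_active:
        return "red"      # primary is down
    if active_copies == total_copies:
        return "green"    # fully allocated
    quorum = total_copies // 2 + 1
    # yellow: replicas missing but quorum holds; orange: primary up, no quorum
    return "yellow" if active_copies >= quorum else "orange"
```

With one primary and two replicas, losing one replica would still report yellow under this sketch, while losing both would report orange.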

dakrone · 2015-04-10T15:50:23Z

I think we can add something like this to the indices level of the cluster health API, so something like:

{
  "active_primary_shards": 1,
  "active_shards": 1,
  "cluster_name": "elasticsearch",
  "indices": {
    "test": {
      "active_primary_shards": 1,
      "active_shards": 1,
      "initializing_shards": 0,
      "number_of_replicas": 1,
      "number_of_shards": 1,
      "relocating_shards": 0,
      "status": "yellow",
      "unassigned_shards": 1,
      "has_quorum": true
    }
  },
  "initializing_shards": 0,
  "number_of_data_nodes": 1,
  "number_of_nodes": 1,
  "number_of_pending_tasks": 0,
  "relocating_shards": 0,
  "status": "yellow",
  "timed_out": false,
  "unassigned_shards": 1
}

(see the has_quorum key)
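
For monitoring, a per-index flag like this could be rolled up on the client side. A rough sketch, assuming the response carried the proposed `has_quorum` key (it is only a suggestion in this thread, not part of the released API; `indices_without_quorum` is a hypothetical helper):

```python
def indices_without_quorum(health: dict) -> list:
    """Return the sorted names of indices whose hypothetical `has_quorum` flag is false.

    Indices that do not report the flag at all are treated as unknown and skipped.
    """
    indices = health.get("indices", {})
    return sorted(
        name for name, info in indices.items()
        if info.get("has_quorum") is False
    )
```

An empty result would mean every index that reports the flag is writable; a monitoring check could alert on any non-empty list.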

rore · 2015-04-11T19:15:57Z

Having it per index is still problematic to monitor. The idea here was that the global cluster health indicator would reflect the state. Like today: if only one index isn't fully allocated, the cluster health indicates it. But there's a big difference between a "yellow" that is caused by some unallocated replicas but still allows writes, and a "yellow" where there's no quorum and writes will fail. The cluster health currently doesn't differentiate between those states.

bleskes · 2015-04-12T19:04:39Z

To me, yellow means “cluster is fully functional albeit not conforming to the required replication factor”. Red means that some operations cannot be performed. With this definition in mind, I think we should signal a shard (and thus index & cluster) as red if one cannot index into it due to the write_consistency settings (which can also be all, and not just quorum).

javanna · 2015-04-15T13:18:56Z

I tend to agree with @rore here. We could improve the colors that we use depending on the cluster level action.write_consistency setting (note that it can be overridden per request though). I am not a big fan of returning red for an index that would be yellow but cannot be written into. Maybe a new color would help. I don't think this is ready to be an adoptme though, given that we are still discussing it? Maybe we should mark for discussion instead? :)

jpountz · 2015-11-13T10:34:59Z

We just discussed this in Fixit Friday. Having indices that don't have a quorum reported RED or YELLOW seems problematic, so it appears we would either need to:

add a new color (ORANGE?) (@javanna's suggestion)
add a new flag (has_quorum?) on the index and cluster levels (@dakrone's suggestion)

dakrone · 2016-09-27T11:04:47Z

@abeyad I think this is no longer applicable with the quorum changes you added at index creation?

abeyad · 2016-09-27T12:16:07Z

The default as of 5.0 is to only require the primary shard to be active before indexing, which is the same as != RED. The only time we can't index in the YELLOW cluster state is if wait_for_active_shards has been explicitly changed to a different value. In this situation, someone can just pass wait_for_active_shards as a request parameter to the indexing operation, and the indexing operation will wait for the requisite number of shards to be active before proceeding with the write (or it will time out). So I don't believe there is anything more to do for this.

dakrone · 2016-09-27T12:18:22Z

Great! I'm going to close this then, we can always re-open if needed

javanna · 2016-09-27T21:45:32Z

This issue is about getting info through cluster health about whether indexing would pass our "consistency" checks. We have now renamed write consistency to wait for active shards and made things much better, but do we have that piece of info now in cluster health? Or maybe we don't need to add it anymore? Sorry if I misunderstood or I am missing anything ;)

abeyad · 2016-09-27T22:20:13Z

@javanna in 5.0, by default, if the cluster health is YELLOW, it means the "consistency" check will pass for write operations. If a user wants to increase the number of active shards to wait on before indexing proceeds, they can do so by setting the wait_for_active_shards parameter on the indexing operation, so at that point, they are in control of the value they give. If a user wants to make sure that the value they give for the indexing operation would proceed, they can always do a health check first e.g. /_cluster/health/{index_name}?wait_for_active_shards={n}. Not sure if that answers your question.. feel free to ask again if I wasn't clear :)
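
A minimal sketch of that pre-write health check from a client's point of view. It only builds the request URL and interprets the `timed_out` field of the response; the helper names, the base URL, and the `30s` timeout are illustrative assumptions, and actually sending the request is left to whatever HTTP client is in use:

```python
from urllib.parse import quote, urlencode


def health_check_url(base_url: str, index_name: str, active_shards, timeout: str = "30s") -> str:
    """Build /_cluster/health/{index_name}?wait_for_active_shards={n}&timeout=..."""
    query = urlencode({"wait_for_active_shards": active_shards, "timeout": timeout})
    return f"{base_url.rstrip('/')}/_cluster/health/{quote(index_name, safe='')}?{query}"


def write_would_proceed(health_response: dict) -> bool:
    """True if the health call returned before timing out.

    A missing `timed_out` field is treated conservatively as a timeout.
    """
    return not health_response.get("timed_out", True)
```

For example, `health_check_url("http://localhost:9200", "test", 2)` yields a URL that waits for two active shard copies of the `test` index; if the parsed response has `"timed_out": false`, a write with the same wait_for_active_shards value should not be blocked on that check.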

clintongormley added the discuss label Dec 24, 2014

dakrone added help wanted adoptme and removed discuss labels Apr 10, 2015

javanna added >enhancement :Cluster labels Apr 15, 2015

javanna added discuss and removed help wanted adoptme labels Apr 15, 2015

imotov mentioned this issue May 15, 2015

Cluster not indexing new documents when in "yellow" state. #11173

Closed

javanna mentioned this issue Jul 9, 2015

Add quorum_active flag to cluster health and waitForQuorumActive #12159

Closed

dakrone closed this as completed Sep 27, 2016

clintongormley added :Distributed/Distributed A catch all label for anything in the Distributed Area. If you aren't sure, use this one. and removed :Cluster labels Feb 13, 2018
