
Shards from _cat/shards may contain stale information #3070

Closed
sebgl opened this issue May 14, 2020 · 2 comments · Fixed by #3195
Labels: discuss (We need to figure this out), >enhancement (Enhancement of existing functionality)

sebgl commented May 14, 2020

We regularly call _cat/shards to retrieve the list of shards (a sample of the output we parse is shown below the list) in order to:

  • ensure a node is safe to remove (no more shards assigned to it)
  • ensure there are STARTED replicas on other nodes when restarting a node containing primary shards
  • ensure there are no initializing or relocating shards during a version upgrade
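For reference (an illustrative sample, not taken from a real cluster), GET _cat/shards?v returns tabular output along these lines:

index    shard prirep state        docs  store ip        node
my-index 0     p      STARTED      3014 31.1mb 10.0.0.10 node-0
my-index 0     r      STARTED      3014 31.1mb 10.0.0.11 node-1
my-index 1     r      INITIALIZING             10.0.0.12 node-2

The checks above amount to filtering these rows by node and by state.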

_cat/shards does not provide real-time information and may contain stale data.

We should investigate what can be used instead.

Example:

GET _cluster/health?wait_for_no_initializing_shards&wait_for_events=languid&wait_for_nodes=N

seems to be a candidate for waiting for shards to initialize.
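For reference, the relevant fields of the cluster health response look like this (abbreviated; values illustrative):

{
  "cluster_name": "my-cluster",
  "status": "green",
  "timed_out": false,
  "number_of_nodes": 3,
  "initializing_shards": 0,
  "relocating_shards": 0,
  "number_of_pending_tasks": 0,
  "number_of_in_flight_fetch": 0
}

timed_out and number_of_in_flight_fetch are the two fields the discussion below relies on.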

To be discussed with the es-distrib and cloud-orchestration teams.


sebgl commented May 26, 2020

I'm putting together a document to cover every single Elasticsearch API request ECK does, so we can review them all.


pebrc commented May 26, 2020

"real-time information" confused me here in the issue description and might not be the right word. I managed to recover some additional detail from an out-of-band discussion between @jhalterman @yuri-tceretian and @DaveCTurner, which I try to summarize below

Problems with relying on _cat/shards

  • it works off cluster state, but there may be things happening concurrently that are not yet reflected in cluster state, such as shard fetching and pending tasks leading to additional shard initialisations
  • it works off the last applied cluster state, which might already be stale: a new state may be in the process of being applied, since the _cat/shards request does not run on the same thread as the one that applies cluster state changes
  • the rolling upgrade docs say to check cluster health between node restarts (n.b. they merely mention _cat/health)

Advantages of using _cluster/health

  • it runs on the same thread as cluster state updates, which means it can wait_for_events=languid, i.e. wait until there are no events left in the queue for the master to handle (languid is the lowest priority, so checking for it effectively means no events in the queue at all)
  • in combination with wait_for_no_initializing_shards it closes the gap _cat/shards has: it ensures that there are no currently running initialisations and also no unexecuted tasks that could turn into pending initialisations
  • wait_for_nodes=N, where N is the number of started nodes we expect after rolling, makes sure the cluster has the expected number of nodes before the check succeeds
  • even with these three query params in place there is still a possible transient state where all three conditions hold but it is not yet safe to continue rolling nodes: between rolling a node and it starting to initialise shards, there is a phase where it potentially fetches shards asynchronously, which _cat/shards does not account for. Even with the _cluster/health API we have to inspect the returned response to ensure number_of_in_flight_fetch is 0

Reviewing the ECK usage of _cat/shards in the light of this information

  • ensure a node is safe to remove (no more shards assigned to it)

I think this use case should be fine as is. We use an allocation filter to move shards away, and we can assume that once the node has been evacuated it is safe to remove, initialisations or pending tasks notwithstanding. The main motivation here is to avoid data loss.

  • ensure there are STARTED replica on other nodes when restarting a node containing primary shards

We actually combine that with a predicate that ensures we don't roll nodes holding copies of the same shards at the same time, even if a user specifies a change budget that would allow it. So I am thinking we are OK here too.
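For illustration, a hypothetical sketch of such a predicate (names and data shapes are invented here; this is not ECK's actual code): a set of candidate nodes may only be rolled together if no two of them hold a copy of the same shard.

package main

import "fmt"

// safeToRollTogether reports whether the candidate nodes can be restarted at
// the same time, i.e. no two of them hold a copy of the same shard.
// shardsByNode maps a node name to the shard copies it holds ("index/shardNum").
func safeToRollTogether(candidates []string, shardsByNode map[string][]string) bool {
	holder := map[string]string{} // shard ID -> candidate node holding it
	for _, node := range candidates {
		for _, shard := range shardsByNode[node] {
			if other, ok := holder[shard]; ok && other != node {
				return false // two candidates share a shard copy
			}
			holder[shard] = node
		}
	}
	return true
}

func main() {
	shards := map[string][]string{
		"node-0": {"logs/0", "logs/1"},
		"node-1": {"logs/0", "metrics/0"},
		"node-2": {"metrics/1"},
	}
	fmt.Println(safeToRollTogether([]string{"node-0", "node-1"}, shards)) // false: both hold logs/0
	fmt.Println(safeToRollTogether([]string{"node-0", "node-2"}, shards)) // true
}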

  • ensure there are no initializing or relocating shards during a version upgrade

This is where we currently have a gap: because we only look at _cat/shards, we don't account for pending initialisations and asynchronous shard fetching.

Implementation considerations

The way _cluster/health is supposed to be used is that it takes a timeout query param, and the response contains a timed_out field indicating whether the timeout expired before the conditions mentioned above were met.

We want to avoid a long timeout value here, as long-running requests have a negative impact on ECK's ability to process changes for the other Elasticsearch clusters it manages.

So we could:

GET _cluster/health?wait_for_no_initializing_shards&wait_for_events=languid&wait_for_nodes=N&timeout=1s

and, if timed_out == false and number_of_in_flight_fetch == 0, we would allow the Pod to be rolled (i.e. deleted).
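A minimal sketch of that check in Go (standard library only; this is not ECK's actual client code, and the URL and helper names are assumptions based on the discussion above):

package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// clusterHealth is the subset of the _cluster/health response we inspect.
type clusterHealth struct {
	TimedOut              bool `json:"timed_out"`
	NumberOfInFlightFetch int  `json:"number_of_in_flight_fetch"`
}

// safeToRoll calls _cluster/health with the query params discussed above and
// reports whether it is safe to roll (i.e. delete) the next Pod.
func safeToRoll(esURL string, expectedNodes int) (bool, error) {
	url := fmt.Sprintf(
		"%s/_cluster/health?wait_for_no_initializing_shards=true&wait_for_events=languid&wait_for_nodes=%d&timeout=1s",
		esURL, expectedNodes,
	)
	resp, err := http.Get(url)
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()

	// The health API reports an expired timeout both via the HTTP status code
	// and the timed_out field in the body, so we decode the body either way.
	var health clusterHealth
	if err := json.NewDecoder(resp.Body).Decode(&health); err != nil {
		return false, err
	}
	return !health.TimedOut && health.NumberOfInFlightFetch == 0, nil
}

func main() {
	ok, err := safeToRoll("http://localhost:9200", 3)
	fmt.Println(ok, err)
}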
