
Shards from _cat/shards may contain stale information #3070

Closed
sebgl opened this issue May 14, 2020 · 2 comments · Fixed by #3195
Labels: discuss (We need to figure this out), >enhancement (Enhancement of existing functionality)

sebgl commented May 14, 2020

We regularly call _cat/shards to retrieve the list of shards (a sample of the output we parse is shown below the list) in order to:

  • ensure a node is safe to remove (no more shards assigned to it)
  • ensure there are STARTED replicas on other nodes when restarting a node containing primary shards
  • ensure there are no initializing or relocating shards during a version upgrade
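For reference (an illustrative sample, not taken from a real cluster), GET _cat/shards?v returns tabular output along these lines:

index    shard prirep state        docs  store ip        node
my-index 0     p      STARTED      3014 31.1mb 10.0.0.10 node-0
my-index 0     r      STARTED      3014 31.1mb 10.0.0.11 node-1
my-index 1     r      INITIALIZING             10.0.0.12 node-2

The checks above amount to filtering these rows by node and by state.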

_cat/shards does not provide real-time information and may contain stale data.

We should investigate what can be used instead.

Example:

GET _cluster/health?wait_for_no_initializing_shards&wait_for_events=languid&wait_for_nodes=N

seems to be a candidate for waiting for shards to initialize.
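For reference, the relevant fields of the cluster health response look like this (abbreviated; values illustrative):

{
  "cluster_name": "my-cluster",
  "status": "green",
  "timed_out": false,
  "number_of_nodes": 3,
  "initializing_shards": 0,
  "relocating_shards": 0,
  "number_of_pending_tasks": 0,
  "number_of_in_flight_fetch": 0
}

timed_out and number_of_in_flight_fetch are the two fields the discussion below relies on.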

To be discussed with the es-distrib and cloud-orchestration teams.


sebgl commented May 26, 2020

I'm putting together a document to cover every single Elasticsearch API request ECK does, so we can review them all.


pebrc commented May 26, 2020

"real-time information" confused me here in the issue description and might not be the right word. I managed to recover some additional detail from an out-of-band discussion between @jhalterman @yuri-tceretian and @DaveCTurner, which I try to summarize below

Problems with relying on _cat/shards

  • it works off cluster state, but there may be things happening concurrently that are not yet reflected in cluster state, such as shard fetching and pending tasks leading to additional shard initialisations
  • it works off the last applied cluster state, which might already be stale: a new state may be in the process of being applied, since the _cat/shards request does not run on the same thread as the one that applies cluster state changes
  • the rolling upgrade docs say to check cluster health between node restarts (n.b. they merely mention _cat/health)

Advantages of using _cluster/health

  • it runs on the same thread as cluster state updates, which means it can wait_for_events=languid, i.e. wait until there are no events left in the queue for the master to handle (languid is the lowest priority, so checking for it effectively means no events in the queue at all)
  • in combination with wait_for_no_initializing_shards it closes the gap _cat/shards has: it ensures that there are no currently running initialisations and also no unexecuted tasks that could turn into pending initialisations
  • wait_for_nodes=N, where N is the number of started nodes we expect after rolling, makes sure the cluster has the expected number of nodes before the check succeeds
  • even with these three query params in place there is still a possible transient state where all three conditions hold but it is not yet safe to continue rolling nodes: between rolling a node and it starting to initialise shards, there is a phase where it potentially fetches shards asynchronously, which _cat/shards does not account for. Even with the _cluster/health API we have to inspect the returned response to ensure number_of_in_flight_fetch is 0

Reviewing the ECK usage of _cat/shards in the light of this information

  • ensure a node is safe to remove (no more shards assigned to it)

I think this use case should be fine as is. We use an allocation filter to move shards away, and we can assume that once the node has been evacuated it is safe to remove, initialisations or pending tasks notwithstanding. The main motivation here is to avoid data loss.

  • ensure there are STARTED replica on other nodes when restarting a node containing primary shards

We actually combine that with a predicate that ensures we don't roll nodes holding copies of the same shards at the same time, even if a user specifies a change budget that would allow it. So I am thinking we are OK here too.
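For illustration, a hypothetical sketch of such a predicate (names and data shapes are invented here; this is not ECK's actual code): a set of candidate nodes may only be rolled together if no two of them hold a copy of the same shard.

package main

import "fmt"

// safeToRollTogether reports whether the candidate nodes can be restarted at
// the same time, i.e. no two of them hold a copy of the same shard.
// shardsByNode maps a node name to the shard copies it holds ("index/shardNum").
func safeToRollTogether(candidates []string, shardsByNode map[string][]string) bool {
	holder := map[string]string{} // shard ID -> candidate node holding it
	for _, node := range candidates {
		for _, shard := range shardsByNode[node] {
			if other, ok := holder[shard]; ok && other != node {
				return false // two candidates share a shard copy
			}
			holder[shard] = node
		}
	}
	return true
}

func main() {
	shards := map[string][]string{
		"node-0": {"logs/0", "logs/1"},
		"node-1": {"logs/0", "metrics/0"},
		"node-2": {"metrics/1"},
	}
	fmt.Println(safeToRollTogether([]string{"node-0", "node-1"}, shards)) // false: both hold logs/0
	fmt.Println(safeToRollTogether([]string{"node-0", "node-2"}, shards)) // true
}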

  • ensure there are no initializing or relocating shards during a version upgrade

This is where we currently have a gap: because we only look at _cat/shards, we don't account for pending initialisations and asynchronous shard fetching.

Implementation considerations

The way _cluster/health is supposed to be used is that it takes a timeout query param, and the response contains a timed_out field indicating whether the timeout expired before the conditions mentioned above were met.

We want to avoid a long timeout value here, as long-running requests have a negative impact on ECK's ability to process changes for the other Elasticsearch clusters it manages.

So we could:

GET _cluster/health?wait_for_no_initializing_shards&wait_for_events=languid&wait_for_nodes=N&timeout=1s

and, if timed_out == false and number_of_in_flight_fetch == 0, we would allow the Pod to be rolled (i.e. deleted).
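A minimal sketch of that check in Go (standard library only; this is not ECK's actual client code, and the URL and helper names are assumptions based on the discussion above):

package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// clusterHealth is the subset of the _cluster/health response we inspect.
type clusterHealth struct {
	TimedOut              bool `json:"timed_out"`
	NumberOfInFlightFetch int  `json:"number_of_in_flight_fetch"`
}

// safeToRoll calls _cluster/health with the query params discussed above and
// reports whether it is safe to roll (i.e. delete) the next Pod.
func safeToRoll(esURL string, expectedNodes int) (bool, error) {
	url := fmt.Sprintf(
		"%s/_cluster/health?wait_for_no_initializing_shards=true&wait_for_events=languid&wait_for_nodes=%d&timeout=1s",
		esURL, expectedNodes,
	)
	resp, err := http.Get(url)
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()

	// The health API reports an expired timeout both via the HTTP status code
	// and the timed_out field in the body, so we decode the body either way.
	var health clusterHealth
	if err := json.NewDecoder(resp.Body).Decode(&health); err != nil {
		return false, err
	}
	return !health.TimedOut && health.NumberOfInFlightFetch == 0, nil
}

func main() {
	ok, err := safeToRoll("http://localhost:9200", 3)
	fmt.Println(ok, err)
}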
