During shard relocation some requests might fail to be sent to a shard #13719

Open
brwe opened this issue Sep 22, 2015 · 2 comments
Labels
>bug · :Distributed/Distributed · help wanted · adoptme · resiliency · Team:Distributed

Comments

brwe (Contributor) commented Sep 22, 2015

When shards relocate, there may be a small window of time in which requests fail to reach the relocating shard. This happens when a node that lags one cluster state behind has not yet realized that the shard has relocated, while the relocation source has already been removed.
Below is a graphical representation of an example course of events. This has already caused us trouble in tests because results were unexpected, see #13266. It affects all actions that inherit from TransportBroadcastAction or TransportBroadcastByNodeAction, and might also be problematic for others. For example, an optimize request might never reach a shard if it is relocating, indices stats may report wrong statistics (see #13266 (comment)), and so on.

We should check whether we can get away with simply sending requests to the relocation targets as well for the affected actions, or whether we need to implement this kind of request as a replication action, as we did for refresh and flush.

[diagram: recovery-issues]

I: The shard is relocating from n2 to n3.
II: CS2 signals that the shard on n3 has started and that n2 can remove its own copy; the shard on n2 is therefore closed. But n1 lags one cluster state behind and still expects an up-and-running primary on n2.
III: If n1 now sends an optimize request, an indices stats request, or the like, it will send the request to n2 (based on CS1), but n2 no longer has the shard. The toy sketch below makes this concrete.
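To make the race concrete, here is a toy model in plain Java. This is not Elasticsearch code; the maps simply stand in for the shard routing seen in CS1 and CS2, and the node and index names are purely illustrative. It shows n1 resolving the shard location from its stale state and therefore targeting a node that has already dropped its copy.

```java
import java.util.HashMap;
import java.util.Map;

public class StaleClusterStateDemo {
    public static void main(String[] args) {
        // CS1, still held by n1: shard 0 of index "test" is assigned to n2.
        Map<String, String> cs1 = new HashMap<>();
        cs1.put("test[0]", "n2");

        // CS2, the current state: the shard has relocated to n3 and n2 has closed its copy.
        Map<String, String> cs2 = new HashMap<>();
        cs2.put("test[0]", "n3");

        // n1 routes a broadcast request (optimize, indices stats, ...) from its local, stale state.
        String target = cs1.get("test[0]");
        System.out.println("n1 sends the request for test[0] to " + target);

        // The copy actually lives on the node named in CS2, so the request arrives
        // at a node that no longer has the shard and is lost for that copy.
        String actualHolder = cs2.get("test[0]");
        System.out.println("the shard copy actually lives on " + actualHolder
                + "; the request to " + target + " cannot be executed there");
    }
}
```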

@rufftruffles
any update on this?

rjernst added the Team:Distributed label on May 4, 2020
henningandersen added a commit to henningandersen/elasticsearch that referenced this issue Dec 30, 2021
Add test demonstrating that if force merge runs while a copy is
unavailable, it is silently ignored.

Relates elastic#13719
henningandersen (Contributor) commented Dec 30, 2021

Reexamined this today and there are still issues to tackle here. I opened #82143, which demonstrates another way in which force merge can be silently ignored/forgotten.
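For reference, this is the kind of interaction where the silent skip is invisible to the caller: the force-merge response reports shard totals, but a copy that was unavailable or relocating may still hold many segments afterwards. A minimal sketch using the REST endpoints over plain Java HTTP; the host localhost:9200 and the index name my-index are placeholders:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ForceMergeCheck {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();

        // Ask for a merge down to a single segment. The response contains shard
        // totals only; it does not say whether an unavailable or relocating copy
        // skipped the merge.
        HttpRequest forceMerge = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:9200/my-index/_forcemerge?max_num_segments=1"))
                .POST(HttpRequest.BodyPublishers.noBody())
                .build();
        System.out.println(client.send(forceMerge, HttpResponse.BodyHandlers.ofString()).body());

        // Inspect segment counts per shard copy afterwards; a copy that missed
        // the merge will still report more than one segment.
        HttpRequest segments = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:9200/_cat/segments/my-index?v"))
                .GET()
                .build();
        System.out.println(client.send(segments, HttpResponse.BodyHandlers.ofString()).body());
    }
}
```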

A few options we can consider for this specific case:

  1. Make force-merge best effort and let it report in the response whether anything could have been missed (like a relocating shard or an unavailable copy).
  2. Let force-merge be a write action, marking unavailable shards as not in-sync and somehow ensuring that we recover files. This will also require us to accumulate force-merges during a recovery/relocation and rerun them when finalizing recovery.
  3. Register force-merges in the translog (or Lucene, or similar) and rerun them on recovery finalization.
  4. Make "force-merged to 1 segment" a property of the index.

The forget follower API has a similar need (it removes retention leases).

Other broadcast-by-node actions are either read actions, reload actions, or cache-clear actions. It seems likely that we can find a different route for those.
