Delayed allocation causing partial allocation of shards on allocation awareness #14010

Closed
ppf2 opened this issue Oct 7, 2015 · 1 comment

@ppf2
Member

commented Oct 7, 2015

A full repro is hard to describe in words, so I recorded a video of it (linked below).

The test uses the latest 1.7.2 release.

In short: a 6-node cluster with 1 index that has 4 shards and 2 replicas (3 copies of each shard).
Each node has 2 awareness attributes set (updateDomain and faultDomain), both forced. 3 nodes are in one updateDomain and the other 3 are in the other updateDomain; the nodes are also in different faultDomains. Delayed allocation is set to 10s so that allocation kicks in sooner during the test.
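For readers without access to the node setup document, here is a rough Python sketch of what settings matching the description above could look like. The setting keys are the standard Elasticsearch 1.x ones, but the domain labels (`ud0`, `fd0`, ...) and the exact spread of nodes over fault domains are assumptions for illustration; the actual values are only in the linked setup document.

```python
import json

# Hypothetical domain labels: the issue text does not list the real values.
UPDATE_DOMAINS = ["ud0", "ud1"]
FAULT_DOMAINS = ["fd0", "fd1", "fd2"]


def node_settings(update_domain, fault_domain):
    """Per-node attributes, as they might appear in elasticsearch.yml (1.x style)."""
    return {
        "node.updateDomain": update_domain,
        "node.faultDomain": fault_domain,
    }


def cluster_awareness_settings():
    """Awareness on both attributes, each with forced values."""
    prefix = "cluster.routing.allocation.awareness"
    return {
        prefix + ".attributes": "updateDomain,faultDomain",
        prefix + ".force.updateDomain.values": ",".join(UPDATE_DOMAINS),
        prefix + ".force.faultDomain.values": ",".join(FAULT_DOMAINS),
    }


def index_settings():
    """4 primaries, 2 replicas each, and the 10s delay used in the test."""
    return {
        "index.number_of_shards": 4,
        "index.number_of_replicas": 2,
        "index.unassigned.node_left.delayed_timeout": "10s",
    }


# Assumed layout: three nodes per update domain, one per fault domain.
nodes = [node_settings(ud, fd) for ud in UPDATE_DOMAINS for fd in FAULT_DOMAINS]

if __name__ == "__main__":
    print(json.dumps(nodes, indent=2))
    print(json.dumps(cluster_awareness_settings(), indent=2))
    print(json.dumps(index_settings(), indent=2))
```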

When an updateDomain is killed (3 nodes gone), the cluster shows only partial allocation of shards until it is prodded, either by running a manual _cluster/reroute command (with no POST body) or by issuing any command that updates the cluster state (e.g. creating an index). Once that happens, the remaining shards are immediately allocated successfully based on the awareness settings, even though the manual reroute itself does not change anything.
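The manual prod can be reproduced with an empty-bodied POST to the reroute endpoint. The sketch below uses only the Python standard library; the base URL `http://localhost:9200` is an assumption, so point it at whichever node you are testing against.

```python
import urllib.request


def build_reroute_request(base_url="http://localhost:9200"):
    """Build (but do not send) a POST to _cluster/reroute with no body."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/_cluster/reroute",
        method="POST",
    )


def trigger_reroute(base_url="http://localhost:9200"):
    """Send the empty reroute and return the HTTP status code."""
    request = build_reroute_request(base_url)
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

Calling `trigger_reroute()` against a live node issues the same no-op reroute described above; `build_reroute_request()` is kept separate so the request can be inspected without a running cluster.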

If delayed allocation is turned off entirely, then everything works fine and there is no need to manually prod it to complete the rest of the allocation.
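The issue does not say how delayed allocation was turned off. One common way, assumed here, is to set `index.unassigned.node_left.delayed_timeout` to `0` through the index settings API. A minimal sketch that builds such a request (the base URL and the `_all` target are assumptions):

```python
import json
import urllib.request


def build_delay_update(timeout, base_url="http://localhost:9200", index="_all"):
    """Build (but do not send) a PUT that changes the delayed allocation timeout."""
    body = {"settings": {"index.unassigned.node_left.delayed_timeout": timeout}}
    return urllib.request.Request(
        url="{}/{}/_settings".format(base_url.rstrip("/"), index),
        data=json.dumps(body).encode("utf-8"),
        method="PUT",
    )


# A timeout of "0" removes the delay; "10s" matches the value used in the test.
disable_request = build_delay_update("0")
```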

Note that with delayed allocation on, it sometimes does the right thing, but if you retest a few times by stopping and restarting the 3 nodes, you will see that it does not do so consistently.

Repro video:
https://drive.google.com/file/d/0B1rxJ0dAZbQvRUE0SlVxT2pOZFE/view?usp=sharing

Node setup:
https://docs.google.com/document/d/1J5FPSvIA5U41Ou1BNpEN9P7q2L8e7KMxM69IG4dGMkk/edit?usp=sharing

@clintongormley

Member

commented Oct 8, 2015

Also see #14011

@ywelsch ywelsch closed this in 0220e45 Nov 12, 2015

ywelsch added a commit that referenced this issue Nov 12, 2015

Fix missing reroute in case of multiple delayed shards
After a delayed reroute of a shard, RoutingService misses to schedule a new delayed reroute of other delayed shards.

Closes #14494
Closes #14010
Closes #14445
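To illustrate the failure mode named in the commit message, here is a toy Python model. It is not Elasticsearch code and makes no claim about how RoutingService is implemented; the single-timer design, the shard names, and the expiry times are all assumptions chosen to show why a missing reschedule leaves some delayed shards waiting for an external prod.

```python
def simulate(expiries, reschedule):
    """Return the set of shards that get rerouted without outside help.

    expiries:   dict of shard name -> time at which its allocation delay ends.
    reschedule: whether a new timer is armed after each delayed reroute fires.
    """
    pending = dict(expiries)
    rerouted = set()
    # Toy assumption: one timer at a time, armed for the earliest expiry.
    timer = min(pending.values()) if pending else None
    while timer is not None:
        now = timer
        timer = None
        # The delayed reroute handles every shard whose delay has expired.
        for shard, expiry in list(pending.items()):
            if expiry <= now:
                rerouted.add(shard)
                del pending[shard]
        # The fixed variant arms a new timer for the shards still delayed.
        if reschedule and pending:
            timer = min(pending.values())
    return rerouted


# Hypothetical delays that end at slightly different times.
delays = {"shard-0": 10, "shard-1": 11, "shard-2": 12}
buggy = simulate(delays, reschedule=False)
fixed = simulate(delays, reschedule=True)
```

In this model the variant without rescheduling reroutes only the shard with the earliest expiry, and the rest stay delayed until something else triggers a reroute, which resembles the partial allocation reported above.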
