Forced awareness fails to balance shards #13667

klahnakoski opened this Issue Sep 20, 2015 · 14 comments


@klahnakoski

I have installed ES 1.7.1 on all nodes. I have three zones:

  • A single node in primary zone
  • A single node in secondary zone
  • Four nodes in spot zone

Indexes with "number_of_replicas": 2 do not place one shard copy per zone; instead they place multiple copies (perhaps even the primary?) in the spot zone. The spot zone is very unstable, but very cheap, and since a quorum of the replicas can end up in the spot zone, the cluster is mostly unusable.
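
(For reference, one way to check where each copy has landed is the cat shards API; since the node names here match the zones, the node column shows the zone directly:)

GET _cat/shards?v&h=index,shard,prirep,state,node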

Here is the config for the primary node:

cluster.name: active-data
node.zone: primary
node.name: primary
node.master: false
node.data: true

cluster.routing.allocation.awareness.force.zone.values: primary,secondary,spot
cluster.routing.allocation.awareness.attributes: zone

Here is a sample of my config file for the secondary node:

cluster.name: active-data
node.zone: secondary    
node.name: secondary
node.master: false
node.data: true

cluster.routing.allocation.awareness.force.zone.values: primary,secondary,spot
cluster.routing.allocation.awareness.attributes: zone

...and here are the configs for nodes in the spot zone:

node.zone: spot
node.name: spot_{{id}}
node.master: false
node.data: true

cluster.name: active-data
cluster.routing.allocation.awareness.force.zone.values: primary,secondary,spot
cluster.routing.allocation.awareness.attributes: zone

{{id}} is replaced with a unique hex UID for each node.

There is one more node, in the spot zone, which is master (but has no shards):

cluster.name: active-data
node.zone: spot
node.name: coordinator
node.master: true
node.data: false

cluster.routing.allocation.awareness.force.zone.values: primary,secondary,spot
cluster.routing.allocation.awareness.attributes: zone
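
(For what it's worth, my understanding is that these awareness settings are dynamic, so the same thing could also be applied at runtime via the cluster settings API, roughly like this:)

PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.awareness.attributes": "zone",
    "cluster.routing.allocation.awareness.force.zone.values": "primary,secondary,spot"
  }
}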

When I have two zones (primary and spot), with one replica, the cluster is stable; the primary zone always has a copy of every shard, and loss of spot nodes does not cause loss of quorum.

@clintongormley
Member

Indexes with "number_of_replicas": 2 do not place one shard copy per zone; instead they place multiple copies (perhaps even the primary?) in the spot zone.

I've tried this out on 1.7.1 and 2.0.0-beta2 and it behaves correctly. At least one shard copy is placed in each zone, and a second copy is only placed in a zone if more replicas than zones are specified. A primary could well be placed in the spot zone, but that shouldn't be an issue.

Can you provide some more info about what you're seeing, because it seems to be working correctly to me?

That said, awareness should really only be used with zones of the same size (see #12431) and having shard copies on mixed box types drags the performance of the expensive boxes down to the level of the cheap boxes.

@klahnakoski

Thank you for looking into this.

I looked at #12431; I cannot say I understand the code completely, but it may be the cause: I did bring the node in the secondary zone down for a while. During that downtime, both replicas moved to the spot zone. When the secondary was brought back up, it never got its replicas back. But now I am speculating.

Thank you for noting the degradation issues I may experience with cheap nodes: I am aware of this; the "cheap" nodes actually have more memory, more CPU, and more space than my "expensive" nodes; they are just unreliable.

Last night I turned off all machines in the spot zone and reduced to one replica so that I would have quorum and bulk indexing could proceed. I have turned the spot machines back on and set replicas to 2, but it will be a few hours for the shards to copy over, and then I can get back to you.

@clintongormley
Member

During that downtime, both replicas moved to the spot zone. When the secondary was brought back up, it never got its replicas back.

This sounds like you were using ordinary awareness instead of forced awareness. With forced awareness, the shards in the secondary zone become unassigned, rather than being assigned to a different zone.
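
In config terms, the difference is just the presence of the force values (sketched here with your zone attribute):

# ordinary awareness: copies from a lost zone are reallocated to the remaining zones
cluster.routing.allocation.awareness.attributes: zone

# forced awareness: copies reserved for a missing zone stay unassigned instead
cluster.routing.allocation.awareness.attributes: zone
cluster.routing.allocation.awareness.force.zone.values: primary,secondary,spot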

@klahnakoski

I think I am using forced awareness (as per my config file)

cluster.routing.allocation.awareness.force.zone.values: primary,secondary,spot

I continue to look at #12431, and I do not see how it manages balance properly when it never calculates/uses the quorum size. Looking at [1], I am more convinced it is the cause: when the secondary is missing, we have:

shardCount==5
averagePerAttribute==1
leftoverPerAttribute==1
currentNodeCount==2

Which allocates a shard to the spot zone. When secondary comes back I have

shardCount==6
averagePerAttribute==2
leftoverPerAttribute==0
currentNodeCount==3

which prevents allocation to the secondary node.

I am still waiting for my replicas to recover.

[1] https://github.com/elastic/elasticsearch/blob/c57951780e0132c50b723d78038ab73e10d176c5/core/src/main/java/org/elasticsearch/cluster/routing/allocation/decider/AwarenessAllocationDecider.java#L222

@klahnakoski

I am having a tough time replicating the issue. The large index takes too long, and the small indexes behave just fine. I am still looking for the sequence of actions that causes the imbalance, and will get back to you when I can, or when I have given up.

@klahnakoski

I gave up on the unittest index, where I saw the problem, and I set {"index.routing.allocation.exclude.zone" : "spot"} on it to prevent it from moving shards.
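
(That exclude rule was applied as an index settings update, roughly like this, give or take the exact index name:)

PUT /unittest/_settings
{
  "index.routing.allocation.exclude.zone": "spot"
}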

For the saved_queries, I set the replicas to 3

[screenshot: resumed secondary]

Then I set the replicas back down to 2. I would expect the shards to properly balance, like they usually do, but this time they do not.

[screenshot: replicas back to 2]

and you see the two replicas stick to the spot zone. Now I kill the spot instances

[screenshot: no spot]

The problem now is that there is no quorum, and I cannot index more documents. Also, this state seems to persist; the two replicas never come back, even after a long while.

[screenshot: after a while]

Assuming a sequence like this does not lose all copies of a shard, I can fix this by setting replicas to zero, and then back to the value I wish.
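
(For the record, that reset amounts to two index settings updates, dropping the replicas and then restoring them:)

PUT /saved_queries20150510_160318/_settings
{ "index.number_of_replicas": 0 }

PUT /saved_queries20150510_160318/_settings
{ "index.number_of_replicas": 2 }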

@bleskes
Member
bleskes commented Sep 22, 2015

I think I understand what’s going on. The forced awareness and the exclude rules are conflicting. The forced awareness tells ES it must spread shards evenly across the awareness values, using the spot zone and not assigning shards if the spot zone is not there. The exclude rule prevents the spot zone from being used. The reason this only kicks in once the replica count is reduced and then increased again is that exclude rules will try to move existing shards, but will not unassign them if that is impossible. Once the replica count is decreased and increased, the exclude does prevent the allocation of the new shards in the spot zone, and the forced awareness prevents them from being allocated anywhere else. Makes sense?
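
(Sketching the conflict with the settings from this thread: the cluster-level forced awareness reserves a copy for the spot zone, while the index-level exclude forbids any copy from living there, so a newly created replica has nowhere legal to go:)

cluster.routing.allocation.awareness.force.zone.values: primary,secondary,spot   # cluster-wide: spread copies across all three zones
index.routing.allocation.exclude.zone: spot                                      # per-index: but no copy may be placed on a spot node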


@klahnakoski

Sorry for the confusion: only the unittest index has an exclude rule, but my comment was about the lifecycle of the saved_queries index. I only mentioned unittest to explain why the shards were not moving over time.

@clintongormley
Member

@klahnakoski try using the cluster reroute api to assign a replica for shard 0 to the secondary node, with the "explain" parameter. That'll tell us why Elasticsearch doesn't want to assign the shard to that node.

@klahnakoski

Where does the "explain" parameter go? As a sibling property to "command"? or sibling of "allocate"? or sibling of "index"? [1] is vague.

[1] https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-reroute.html

@clintongormley
Member

This should work:

POST _cluster/reroute?explain
{
  "commands": [
    {
      "allocate": {
        "index": "saved_queries20150510_160318",
        "shard": 0,
        "node": "secondary"
      }
    }
  ]
}

@klahnakoski

Sorry for the delay, it took a while for me to get back to the mis-allocated state. When I ran the command, the shard moved back to the secondary. I have a copy of the explain, but it does not have any complaints in it.

@nikonyrh
nikonyrh commented Oct 7, 2015

I have another scenario (if it is too different, maybe a new issue or the mailing list would be more suitable?). Help or links to documentation would be greatly appreciated.

I have a few hundred GB of indexes on my "production" node (just one machine in the cluster, as it is for my hobby projects). I am trying to run benchmarks for five of the indexes, each of which is about 18 GB in 4 shards. I would like to test how much faster queries and aggregations would be if I joined my gaming laptop to the cluster, but I don't want it to receive any primary shards.

In my understanding, I would lose data if the only existing copy of a shard (at the moment all indexes have zero replicas) is transferred to the laptop and I then stop that node and delete its data folder. Am I correct? New documents are constantly being indexed into other indexes which aren't part of this experiment.

I thought I could set "node.master" to false to block primary shards from being transferred to the node, but clearly this isn't the case; transferring starts immediately when the laptop joins the cluster.

Simply put: how do I prevent the laptop node from receiving primary shards? I only want to transfer replica shards to the laptop and keep all primary shards safely on the current master node.

@clintongormley
Member

@nikonyrh please ask questions like these in the forum: https://discuss.elastic.co/
