Forced awareness fails to balance shards #3580
I built a cluster for Amazon zone awareness as follows:
Background: I've set the number of shards to 3, and I have allocation awareness set to match the `awszone` property, which I'm setting in the config.
There are 6 nodes total:
2 in us-west-2a
2 in us-west-2b
1 in each of the two us-east zones
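For anyone reproducing this, here's a minimal sketch of the setup described above, using the `awszone` attribute and 0.90-era settings; the us-east zone names are placeholders, since only the us-west zones are named above:

```yaml
# elasticsearch.yml (per node; the awszone value differs on each node)
node.awszone: us-west-2a

# Default shard count for new indices (3 shards, as described above)
index.number_of_shards: 3

# Balance shard copies across values of the awszone attribute
cluster.routing.allocation.awareness.attributes: awszone

# Forced awareness: list every zone so capacity is reserved per zone,
# even when a zone currently has no live nodes (east names illustrative)
cluster.routing.allocation.awareness.force.awszone.values: us-west-2a,us-west-2b,us-east-1a,us-east-1b
```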
So in theory, I should be able to lose an entire side of the country and stay up. I would also expect the nodes in us-west to balance the shards among themselves, not just allocate a single shard to each machine.
However, the shards in the western zones aren't balancing beyond moving 1 shard onto each of the other hosts in the zone.
S1monw is on the case.
Discussion here: https://groups.google.com/forum/m/#!topic/elasticsearch/9yZw7sryFb4
Maybe I'll just do it manually on my laptop.
On Aug 30, 2013, at 12:23 PM, Simon Willnauer firstname.lastname@example.org wrote:
Sorry, out with the flu.
I tried to confirm it immediately after release, but it didn't quite behave like I expected.
Basically, the problem is that since I initially found this, I had set the number of replicas such that every node has a copy of each shard, to work around the problem (i.e. 5 replicas, so 1 primary + 5 copies covers all 6 nodes: 2 in West1, 2 in West2, 1 in east1, 1 in east2, across 4 zones).
After I upgraded to 0.90.5, I thought, ok, and set the number of replicas to 4.
It removed all shards from one of the nodes, which was not what I expected. And it didn't rebalance after that.
So I set it back to 5 and then had to wait while the shards got recopied back to that node.
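For reference, the replica changes above were presumably made through the index settings API; a minimal sketch, with `myindex` as a placeholder index name:

```sh
# Drop replicas from 5 to 4: 1 primary + 4 replicas = 5 copies across 6 nodes
curl -XPUT 'localhost:9200/myindex/_settings' -d '{
  "index": { "number_of_replicas": 4 }
}'

# Setting it back to 5 makes ES recopy a replica of every shard onto the sixth node
curl -XPUT 'localhost:9200/myindex/_settings' -d '{
  "index": { "number_of_replicas": 5 }
}'
```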
Additionally, this didn't seem to affect primary allocation; I still have one node that's primary for all of the shards.
Would that be what you expected to happen?
What's the cost of a relocation? Isn't it kind of just a "blessing"?
Here's what I did:
ES had run out of memory on the primary. That caused the primary to switch to the 10.100 network, which only has two nodes (they're basically live backups). I wanted to see whether the nodes would come back immediately after a restart, plus I wanted to check the balancing.
I locked allocation based on a mailing-list thread where someone said nodes restart faster if you lock allocation before restarting them.
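For anyone following along: "locking" allocation here presumably means the dynamic cluster setting that disables shard allocation. A minimal sketch using the 0.90-era setting name, assuming a node reachable at localhost:9200:

```sh
# Disable shard allocation before restarting nodes
curl -XPUT 'localhost:9200/_cluster/settings' -d '{
  "transient": { "cluster.routing.allocation.disable_allocation": true }
}'

# ... restart the nodes and wait for them to rejoin the cluster ...

# Re-enable allocation so shards can recover and rebalance again
curl -XPUT 'localhost:9200/_cluster/settings' -d '{
  "transient": { "cluster.routing.allocation.disable_allocation": false }
}'
```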