auto_expand_replicas causing very large amount of cluster state changes when a node joins or leaves the cluster - causing the master to become unresponsive #3399
In our cluster we have 8 nodes and about 100 indices. Each index has one shard and is replicated to every node using the setting "auto_expand_replicas" set to "0-all".
We observed that when a node leaves the cluster, the master node becomes unresponsive for some time, and the more indices we added, the longer it stayed unresponsive. During this window, restarted nodes were sometimes unable to rejoin the cluster, causing a split-brain scenario, or simply hung at startup.
Looking at the source code for how the node leave and join events are handled, I think I have identified the bug. The ClusterChangedEvent is propagated to MetaDataUpdateSettingsService#clusterChanged, which loops through every index and, if the number of nodes has changed, fires updateSettings for that index. When a node joins or leaves while the auto_expand_replicas setting is in use, every index is affected, so for 100 indices it fires off 100 updateSettings calls.
The problem is that each call to updateSettings results in a new cluster state, which again propagates back to MetaDataUpdateSettingsService#clusterChanged, resulting in an exponential number of cluster state changes. This fills the log with messages like this:
[2013-07-26 20:55:45,726][INFO ][cluster.metadata ] [master] [index1] auto expanded replicas to 
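To make the fan-out concrete, here is a minimal model of the loop described above. The class and method names are hypothetical (this is not the actual Elasticsearch code): clusterChanged() is invoked for every published cluster state, and when the node count changed it submits one updateSettings task per index, each of which publishes a new cluster state that is fed back into clusterChanged(). The report describes the growth as exponential; this sketch only captures the linear fan-out of one state per index, which for 100 indices is already 100 extra cluster state publications per node event.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Simplified model of the per-index settings-update cascade.
// Assumed/hypothetical names throughout; not the Elasticsearch implementation.
public class CascadeModel {
    static final int INDICES = 100;

    // Returns how many cluster states get published after a single
    // node-membership change, when updateSettings fires once per index.
    public static int statesPublished(int oldNodeCount, int newNodeCount) {
        int published = 0;
        // Each pending "cluster state" carries the node count of the state
        // it replaced and its own node count.
        Deque<int[]> pending = new ArrayDeque<>();
        pending.add(new int[] {oldNodeCount, newNodeCount}); // the node-leave event
        while (!pending.isEmpty()) {
            int[] event = pending.poll();
            boolean nodesChanged = event[0] != event[1];
            if (nodesChanged) {
                // clusterChanged() fires one updateSettings per index; each
                // publishes a new state whose node count no longer differs
                // from its predecessor, so these do not re-trigger the loop.
                for (int i = 0; i < INDICES; i++) {
                    published++;
                    pending.add(new int[] {event[1], event[1]});
                }
            }
        }
        return published;
    }

    public static void main(String[] args) {
        System.out.println(statesPublished(8, 7)); // prints 100
    }
}
```

Each of those 100 published states must be serialized and acknowledged by every node, which is where the master's time goes.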
My proposed fix is to group the updates together by fNumberOfReplicas and trigger only one update per distinct fNumberOfReplicas value. In our case, with "auto_expand_replicas" set to "0-all", this results in one cluster state change instead of a flood of changes.
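The grouping idea can be sketched as follows. This is an illustrative model, not the actual patch: it assumes a helper that, given each index's target replica count, buckets indices by that count so one batched update can be submitted per bucket.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of the proposed batching (not the real patch):
// group indices by the replica count they should auto-expand to, then
// submit a single cluster state update per distinct count.
public class ReplicaUpdateBatcher {

    /** Buckets index names by their target replica count. */
    public static Map<Integer, List<String>> groupByTargetReplicas(
            Map<String, Integer> targetReplicasPerIndex) {
        Map<Integer, List<String>> groups = new TreeMap<>();
        for (Map.Entry<String, Integer> e : targetReplicasPerIndex.entrySet()) {
            groups.computeIfAbsent(e.getValue(), k -> new ArrayList<>())
                  .add(e.getKey());
        }
        return groups;
    }

    public static void main(String[] args) {
        // With auto_expand_replicas "0-all" on a cluster shrinking from
        // 8 to 7 nodes, every index needs 6 replicas, so all 100 indices
        // land in one group and one update suffices.
        Map<String, Integer> targets = new HashMap<>();
        for (int i = 1; i <= 100; i++) {
            targets.put("index" + i, 6);
        }
        Map<Integer, List<String>> groups = groupByTargetReplicas(targets);
        System.out.println("cluster state updates: " + groups.size()); // prints 1
    }
}
```

With "0-all" every index shares the same target, so the map has exactly one key; even with mixed auto-expand ranges, the number of updates is bounded by the number of distinct replica counts rather than the number of indices.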
The fix passes all the tests and solves the problem we have been observing in production and have been able to reproduce in our development environment.
Will update the ticket ASAP with a link to the commit for the fix.
Batching the cluster events will definitely help. Btw, in the 0.90 branch (upcoming 0.90.3) we fixed the part about the cluster not being responsive due to a large number of cluster change events. The cluster state publishing and the ping requests were using the same