Cluster hangs after setting "cluster.routing.allocation.node_concurrent_recoveries" to 100. #36195
Elasticsearch version: 6.4.3/6.5.1
JVM version: 126.96.36.199
OS version: CentOS 7.4
Description of the problem including expected versus actual behavior:
Steps to reproduce:
We could see that each node's generic thread pool was fully used, with all 128 threads active.
jstack output for a hung node; all generic threads are waiting on txGet:
So the cluster appears to be hung in a distributed deadlock.
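The exhaustion pattern described above can be reproduced in miniature. The following is only a sketch, not Elasticsearch code: the class and method names are hypothetical, a pool of 4 threads stands in for the 128-thread generic pool, and it assumes that each blocked thread is waiting for follow-up work that itself needs a thread from the same, already exhausted pool. A timeout is used so the demo reports the deadlock instead of hanging.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical demo class; it only illustrates the pool-exhaustion pattern.
class PoolExhaustionDemo {

    /**
     * Submits `outerTasks` tasks to a fixed pool of `poolSize` threads.
     * Each task submits a follow-up task to the same pool and blocks on it,
     * loosely mimicking a generic thread parked in a blocking get().
     * Returns true if every outer task finished within the timeout.
     */
    public static boolean completes(int poolSize, int outerTasks, long timeoutMillis)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        CountDownLatch done = new CountDownLatch(outerTasks);
        try {
            for (int i = 0; i < outerTasks; i++) {
                pool.execute(() -> {
                    // The follow-up work needs a thread from the same pool.
                    Future<?> inner = pool.submit(() -> { });
                    try {
                        inner.get();       // blocks while holding a pool thread
                        done.countDown();  // reached only if the follow-up ran
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    } catch (ExecutionException e) {
                        // The follow-up task is empty and cannot fail.
                    }
                });
            }
            return done.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } finally {
            // Interrupt any blocked threads so the JVM can exit.
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Fewer blocking tasks than threads: one thread stays free for follow-ups.
        System.out.println("3 tasks on 4 threads complete: " + completes(4, 3, 2000));
        // As many blocking tasks as threads: every thread waits, none can run follow-ups.
        System.out.println("4 tasks on 4 threads complete: " + completes(4, 4, 500));
    }
}
```

With fewer blocking tasks than pool threads, the follow-ups still find a free thread. Once the number of blocking tasks reaches the pool size, every thread is parked and no follow-up can ever run, which matches the symptom of all generic threads waiting at once.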
@howardhuanghua thanks for reporting this. This is indeed an issue if the number of concurrent recoveries from a node is higher than the max size of the GENERIC thread pool (which is some value >=128, depending on the number of processors). That said, typically you should not have so many shards per node, and allowing such a high number of
@ywelsch, thanks for your comment. Currently, we limit the node_concurrent_recoveries setting to <=50 in our production environment, which runs a version based on 6.4.3, as follows:
Please let us know if you have any suggestions. Thanks a lot.