
Cluster gets hanged after setting "cluster.routing.allocation.node_concurrent_recoveries" up to 100. #36195

Open
howardhuanghua opened this Issue Dec 4, 2018 · 5 comments


howardhuanghua commented Dec 4, 2018

Elasticsearch version: 6.4.3/6.5.1

JVM version: 1.8.0_181

OS version: CentOS 7.4

Description of the problem including expected versus actual behavior:
Production environment: 15 nodes, 2700+ indices, 15000+ shards.
The cluster hangs after setting "cluster.routing.allocation.node_concurrent_recoveries" up to 100.

Steps to reproduce:

  1. Set up a three-node cluster, with 1 core and 2 GB of RAM per node.
  2. Create 300 indices (3000 shards in total), each index with 100 documents.
  3. Apply these cluster settings dynamically:
    curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
    {
      "persistent": {
        "cluster.routing.allocation.node_concurrent_recoveries": 100,
        "indices.recovery.max_bytes_per_sec": "400mb"
      }
    }'
  4. Stop one of the nodes and delete the data from its data path.
  5. Start the stopped node again.
  6. After a while, the cluster hangs.

We could see that each node's generic thread pool has all 128 threads active, i.e. it is completely full:
[c_log@VM_128_27_centos ~/elasticsearch-6.4.3/bin]$ curl localhost:9200/_cat/thread_pool/generic?v
node_name name active queue rejected
node-3 generic 128 949 0
node-2 generic 128 1093 0
node-1 generic 128 1076 0

Lots of peer recoveries are waiting:
[screenshot: many peer recoveries in a waiting state]

jstack output for a hung node; all generic threads are waiting in txGet:
"elasticsearch[node-3][generic][T#128]" #179 daemon prio=5 os_prio=0 tid=0x00007fa8980c8800 nid=0x3cb9 waiting on condition [0x00007fa86ca0a000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for <0x00000000fbef56f0> (a org.elasticsearch.common.util.concurrent.BaseFuture$Sync) at java.util.concurrent.locks.LockSupport.park(Unknown Source) at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(Unknown Source) at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(Unknown Source) at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(Unknown Source) at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.get(BaseFuture.java:251) at org.elasticsearch.common.util.concurrent.BaseFuture.get(BaseFuture.java:94) at org.elasticsearch.transport.PlainTransportFuture.txGet(PlainTransportFuture.java:44) at org.elasticsearch.transport.PlainTransportFuture.txGet(PlainTransportFuture.java:32) at org.elasticsearch.indices.recovery.RemoteRecoveryTargetHandler.receiveFileInfo(RemoteRecoveryTargetHandler.java:133) at org.elasticsearch.indices.recovery.RecoverySourceHandler.lambda$phase1$6(RecoverySourceHandler.java:387) at org.elasticsearch.indices.recovery.RecoverySourceHandler$$Lambda$3071/1370938617.run(Unknown Source) at org.elasticsearch.common.util.CancellableThreads.executeIO(CancellableThreads.java:105) at org.elasticsearch.common.util.CancellableThreads.execute(CancellableThreads.java:86) at org.elasticsearch.indices.recovery.RecoverySourceHandler.phase1(RecoverySourceHandler.java:386) at org.elasticsearch.indices.recovery.RecoverySourceHandler.recoverToTarget(RecoverySourceHandler.java:172) at org.elasticsearch.indices.recovery.PeerRecoverySourceService.recover(PeerRecoverySourceService.java:98) at org.elasticsearch.indices.recovery.PeerRecoverySourceService.access$000(PeerRecoverySourceService.java:50) at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:107) at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:104) at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:30) at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler$1.doRun(SecurityServerTransportInterceptor.java:251) at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler.messageReceived(SecurityServerTransportInterceptor.java:309) at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:66) at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1605) at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:723) at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.lang.Thread.run(Unknown Source)

So the cluster appears to be stuck in a distributed deadlock.

Thanks,
Howard

howardhuanghua changed the title from "ES Cluster gets hanged after setting "cluster.routing.allocation.node_concurrent_recoveries" up to 100." to "Cluster gets hanged after setting "cluster.routing.allocation.node_concurrent_recoveries" up to 100." Dec 4, 2018

howardhuanghua commented Dec 4, 2018

As a workaround, we have patched the _cluster/settings REST-level API to reject values of "cluster.routing.allocation.node_concurrent_recoveries" above 50.

Ethan-Zhang commented Dec 11, 2018

Shall we use org.elasticsearch.transport.PlainTransportFuture#txGet(long timeout, TimeUnit unit) instead of txGet() to avoid the deadlock?
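
For illustration, here is a small self-contained sketch of the difference a bounded wait would make, using plain java.util.concurrent futures rather than Elasticsearch's TransportFuture (the class name and the 30-second timeout below are made up for the example):

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.TimeoutException;

    // Sketch only: a bounded wait releases the blocked thread instead of parking it forever.
    public class BoundedWaitSketch {
        public static void main(String[] args) throws Exception {
            ExecutorService pool = Executors.newSingleThreadExecutor();
            Future<String> reply = pool.submit(() -> {
                Thread.sleep(Long.MAX_VALUE); // simulate a response that never arrives
                return "file info received";
            });
            try {
                // Analogous to txGet(long timeout, TimeUnit unit): the caller gives up
                // after 30 seconds instead of parking indefinitely like the txGet()
                // frames in the jstack output above.
                reply.get(30, TimeUnit.SECONDS);
            } catch (TimeoutException e) {
                System.out.println("timed out waiting for the reply; thread released");
            } finally {
                pool.shutdownNow();
            }
        }
    }

Note that a timeout only turns an indefinite hang into a failed recovery; it would not remove the underlying contention on the generic thread pool.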

ywelsch added the >bug label Dec 11, 2018

ywelsch commented Dec 11, 2018

@howardhuanghua thanks for reporting this. This is indeed an issue if the number of concurrent recoveries from a node is higher than the maximum size of the GENERIC thread pool (which is some value >= 128, depending on the number of processors). That said, you should typically not have so many shards per node, and allowing such a high number of node_concurrent_recoveries will also not play well with other parts of the system (e.g. the shard balancer). Fixing this will require making this code asynchronous, which is not a small thing to do. In the meantime, we can think about adding a soft limit to the node_concurrent_recoveries setting.
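
To make the failure mode concrete, here is a minimal, self-contained Java sketch of the deadlock pattern described above (illustrative only, not Elasticsearch code; the pool size of 4 and the task count of 8 are arbitrary stand-ins for the GENERIC pool and node_concurrent_recoveries). Running it hangs, just like the cluster:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    // Tasks on a bounded pool block while waiting for other tasks that need a
    // thread from the same pool -- once every thread is parked, nothing can progress.
    public class ThreadPoolDeadlockDemo {
        public static void main(String[] args) throws Exception {
            ExecutorService generic = Executors.newFixedThreadPool(4); // stand-in for the GENERIC pool

            List<Future<?>> recoveries = new ArrayList<>();
            for (int i = 0; i < 8; i++) { // stand-in for node_concurrent_recoveries
                recoveries.add(generic.submit(() -> {
                    // The "reply" also needs a generic thread. Once all threads are
                    // occupied by blocked "recoveries", no thread is left to run the
                    // replies, so every get() below waits forever.
                    Future<String> reply = generic.submit(() -> "file info received");
                    return reply.get(); // analogous to PlainTransportFuture#txGet()
                }));
            }

            for (Future<?> r : recoveries) {
                System.out.println(r.get()); // never completes: the pool is wedged
            }
            generic.shutdown();
        }
    }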

howardhuanghua commented Dec 13, 2018

@ywelsch, thanks for your comment. Currently we limit the node_concurrent_recoveries setting to <= 50 in our production build (based on 6.4.3), as follows:

In RestClusterUpdateSettingsAction.java, in the prepareRequest method:

        Settings settings = EMPTY_SETTINGS;
        if (source.containsKey(TRANSIENT)) {
            clusterUpdateSettingsRequest.transientSettings((Map) source.get(TRANSIENT));
            settings = clusterUpdateSettingsRequest.transientSettings();
        }
        if (source.containsKey(PERSISTENT)) {
            clusterUpdateSettingsRequest.persistentSettings((Map) source.get(PERSISTENT));
            settings = clusterUpdateSettingsRequest.persistentSettings();
        }
        
        // Limit node concurrent recoveries: if incoming + outgoing recoveries occupy the whole
        // generic thread pool, peer recovery can deadlock and the cluster hangs.
        if (settings.getAsInt(ThrottlingAllocationDecider.CLUSTER_ROUTING_ALLOCATION_NODE_CONCURRENT_RECOVERIES_SETTING.getKey(), 0) > MAX_CONCURRENT_RECOVERIES
                || settings.getAsInt(ThrottlingAllocationDecider.CLUSTER_ROUTING_ALLOCATION_NODE_CONCURRENT_INCOMING_RECOVERIES_SETTING.getKey(), 0) > MAX_CONCURRENT_RECOVERIES
                || settings.getAsInt(ThrottlingAllocationDecider.CLUSTER_ROUTING_ALLOCATION_NODE_CONCURRENT_OUTGOING_RECOVERIES_SETTING.getKey(), 0) > MAX_CONCURRENT_RECOVERIES) {
            throw new IllegalArgumentException("Can't set node concurrent recoveries greater than " + MAX_CONCURRENT_RECOVERIES + ".");
        }

Please let us know if you have any suggestions. Thanks a lot.

s1monw added a commit to s1monw/elasticsearch that referenced this issue Jan 2, 2019

Don't block on peer recovery on the target side
Today we block using the generic threadpool on the target side
until the source side has fully executed the recovery. We still
block on the source side executing the recovery in a blocking fashion
but there is no reason to block on the target side. This will
release generic threads early if there are many concurrent recoveries
happening.

Relates to elastic#36195
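
At a high level, the change described in this commit message replaces a blocking wait on a pool thread with a callback. A generic sketch of that pattern with plain CompletableFuture (illustrative only; this is not the actual code from the referenced commit):

    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    public class AsyncRecoverySketch {
        static final ExecutorService GENERIC = Executors.newFixedThreadPool(4);

        // Blocking style: the generic thread is parked until the remote side finishes.
        static void recoverBlocking(CompletableFuture<Void> remoteDone) {
            remoteDone.join(); // like the txGet() frames in the jstack output above
            System.out.println("recovery finished (blocking)");
        }

        // Non-blocking style: register a callback and return the thread immediately.
        static void recoverAsync(CompletableFuture<Void> remoteDone) {
            remoteDone.thenRunAsync(() -> System.out.println("recovery finished (async)"), GENERIC);
        }

        public static void main(String[] args) throws Exception {
            CompletableFuture<Void> remoteDone = new CompletableFuture<>();
            recoverAsync(remoteDone);   // returns right away; no generic thread is held
            remoteDone.complete(null);  // the "remote side" finishes later
            GENERIC.shutdown();
            GENERIC.awaitTermination(5, TimeUnit.SECONDS);
        }
    }

With the callback style, a generic thread is only used when there is actual work to do, so many concurrent recoveries no longer pin the whole pool.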

s1monw added a commit that referenced this issue Jan 4, 2019

Don't block on peer recovery on the target side (#37076)

dnhatn added a commit that referenced this issue Jan 12, 2019

Make recovery source partially non-blocking (#37291)
Today a peer-recovery may run into a deadlock if the value of
node_concurrent_recoveries is too high. This happens because the
peer-recovery is executed in a blocking fashion. This commit attempts
to make the recovery source partially non-blocking. I will make three
follow-ups to make it fully non-blocking: (1) send translog operations,
(2) primary relocation, (3) send commit files.

Relates #36195

dnhatn added a commit that referenced this issue Jan 13, 2019

Make recovery source partially non-blocking (#37291)