Unbound threadpools considered harmful #5152

tkurki · 2014-02-18T07:53:20Z

ElasticSearch server and java client both have an unbound, undocumented threadpool that can cause lockup of the whole JVM process and take down the whole server machine: #5151.

The minimum change is documenting the unbound generic pool & instructions for reconfiguring it at http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-threadpool.html.

IMHO unbound threadpool is a Bad Idea from the start. Bound pool with some reasonable size queue and abort policy would not cause jvm/server lockup in a runaway case, instead you would get some meaningful information in the log.

tkurki · 2014-02-18T07:53:50Z

https://groups.google.com/forum/#!msg/elasticsearch/oKQMoOKhSU8/09-P9bU4t_wJ

scottbessler · 2014-05-21T18:51:40Z

This appears to have bit us as well. We noticed it when we had an ActionListener on a ListenableActionFuture<BulkResponse> who took perhaps too long to execute and/or had concurrency issues (it was actually firing a Guava Eventbus event where the handlers are synchronized). We started seeing exceptions like:

org.elasticsearch.client.transport.NoNodeAvailableException: No node available
    at org.elasticsearch.client.transport.TransportClientNodesService$RetryListener.onFailure(TransportClientNodesService.java:249)

And then with Yourkit attached we saw the same symptom of 7000+ of these threads being spun up, and hanging the jvm.

clintongormley · 2014-05-22T08:30:21Z

@scottbessler What version of Elasticsearch are you using?

kimchy · 2014-05-22T12:45:20Z

The idea of the generic thread pool is to have an option to fork to a thread without having to block on it or get rejected, which is needed in some cases. In places where it is used, its good, but there are places where it simply should not be used, specifically, the listening thread pool when the pool is threaded, we should have a dedicated thread pool for it that is bounded.

scottbessler · 2014-05-22T13:26:19Z

0.90.6

clintongormley · 2014-05-22T13:33:00Z

@scottbessler It appears to be fixed in 0.90.8 onwards #5151 #4162

scottbessler · 2014-05-22T16:44:10Z

The race condition may be fixed, but the unbounded undocumented thread pool lurks waiting for a similar issue to crush JVMs. If nothing else it seems that at least letting users configure this threadpool (even if the default is unbounded) would at least let them choose their poison.

kimchy · 2014-07-11T08:45:00Z

@scottbessler yea, though in all places (except for the listeners today), we are very careful at when we use the generic thread pool. We need to look into potentially changing the generic thread pool to be bounded, but have unlimited queue, since we rely on the fact that it won't block when trying to add something to it. Regardless, the listeners callback should have their own thread pool.

mgreene · 2014-09-04T04:48:03Z

@kimchy I'm curious as to why you suggest a bounded thread pool but unbounded queue. In essence, a unbounded thread pool will thrash CPU and potentially blow out the stack space on the JVM. An unbounded queue can blow out the heap.

I think the great thing (and also a PITA) about the Java concurrency semantics is that w/r/t to queues and thread pools, they encourage you to deal with the back pressure. You can elect to simplify your code by adding an unbounded queue or pool but in turn are allowing back pressure to manifest itself in weird and cryptic ways that's harder to troubleshoot in production.

kamaradclimber · 2014-09-04T12:19:35Z

I think we are hurt by one of this unbounded thread pool (called "generic").
Here is a thread dump:

"elasticsearch[Trump][generic][T#62615]" daemon prio=10 tid=0x00007f23f9db7800 nid=0x9b0c waiting on condition [0x00007f1e0f82a000]
   java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for  <0x00000006cbd1e030> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:867)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1197)
    at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:214)
    at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:290)
    at java.util.concurrent.CopyOnWriteArrayList.remove(CopyOnWriteArrayList.java:507)
    at org.elasticsearch.cluster.service.InternalClusterService.remove(InternalClusterService.java:182)
    at org.elasticsearch.action.support.master.TransportMasterNodeOperationAction$3.onTimeout(TransportMasterNodeOperationAction.java:179)
    at org.elasticsearch.cluster.service.InternalClusterService$NotifyTimeout.run(InternalClusterService.java:492)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)

we currently have 28k threads with the same stack

The original reason is probably elsewhere (I should file a separate ticket for this as well) but the unbound nature of the threadpool bites us here.

Today, when executing an action (mainly when using the Java API), a listener threaded flag can be set to true in order to execute the listener on a different thread pool. Today, this thread pool is the generic thread pool, which is cached. This can create problems for Java clients (mainly) around potential thread explosion. Introduce a new thread pool called listener, that is fixed sized and defaults to the half the cores maxed at 10, and use it where listeners are executed. relates to elastic#5152 closes elastic#7837

Today, when executing an action (mainly when using the Java API), a listener threaded flag can be set to true in order to execute the listener on a different thread pool. Today, this thread pool is the generic thread pool, which is cached. This can create problems for Java clients (mainly) around potential thread explosion. Introduce a new thread pool called listener, that is fixed sized and defaults to the half the cores maxed at 10, and use it where listeners are executed. relates to #5152 closes #7837

ghost · 2014-09-30T02:04:07Z

Hi, I am hitting this issue in our servers. Is this included in latest version 1.3.3? If not, while I will wait for version with fix, is there anything I can do to reduce the probability of hitting it in my client as it brings my server machine down?

"elasticsearch[Horus][generic][T#16356]" daemon prio=10 tid=0x00007f27ca746800 nid=0x5cd8 runnable [0x00007f25a4444000]
java.lang.Thread.State: RUNNABLE
at java.util.Arrays.fill(Arrays.java:1921)
at org.elasticsearch.common.io.stream.HandlesStreamOutput$HandleTable.clear(HandlesStreamOutput.java:192)
at org.elasticsearch.common.io.stream.HandlesStreamOutput$HandleTable.(HandlesStreamOutput.java:149)
at org.elasticsearch.common.io.stream.HandlesStreamOutput.(HandlesStreamOutput.java:39)
at org.elasticsearch.common.io.stream.HandlesStreamOutput.(HandlesStreamOutput.java:43)
at org.elasticsearch.transport.netty.NettyTransport.sendRequest(NettyTransport.java:548)
at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:189)
at org.elasticsearch.action.TransportActionNodeProxy.execute(TransportActionNodeProxy.java:68)
at org.elasticsearch.client.transport.support.InternalTransportClient$2.doWithNode(InternalTransportClient.java:109)
at org.elasticsearch.client.transport.TransportClientNodesService$RetryListener.onFailure(TransportClientNodesService.java:259)
at org.elasticsearch.client.transport.TransportClientNodesService$RetryListener.onFailure(TransportClientNodesService.java:262)
at org.elasticsearch.action.TransportActionNodeProxy$1.handleException(TransportActionNodeProxy.java:89)
at org.elasticsearch.transport.TransportService$2.run(TransportService.java:206)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

clintongormley · 2014-10-14T12:02:56Z

Hi @jfiedler

PR #7837 adds a listener thread pool, which should fix the issue that you are seeing (in 1.4). The generic threadpool is still cached and unbounded, but we are very careful about what gets run on these threads.

Closing this ticket.

dernasherbrezon · 2015-04-03T12:12:19Z

There are still unbounded thread pools exist at NettyTransport (

elasticsearch/src/main/java/org/elasticsearch/transport/netty/NettyTransport.java

Line 312 in 1032429

private ClientBootstrap createClientBootstrap() {

)

I would expect them configured the same way index/search/&etc threadpools. Could create pull request if needed

Today, when executing an action (mainly when using the Java API), a listener threaded flag can be set to true in order to execute the listener on a different thread pool. Today, this thread pool is the generic thread pool, which is cached. This can create problems for Java clients (mainly) around potential thread explosion. Introduce a new thread pool called listener, that is fixed sized and defaults to the half the cores maxed at 10, and use it where listeners are executed. relates to elastic#5152 closes elastic#7837

kimchy added enhancement labels May 22, 2014

clintongormley added v1.4.0 and removed v1.4.0 labels Jul 11, 2014

clintongormley added adoptme and removed discuss labels Jul 11, 2014

jpountz removed the adoptme label Aug 1, 2014

jpountz assigned dakrone Aug 1, 2014

kamaradclimber mentioned this issue Sep 4, 2014

~28k threads on es java client #7592

Closed

clintongormley added v1.5.0 and removed v1.4.0.Beta1 labels Sep 8, 2014

nfx mentioned this issue Oct 4, 2014

Failed to deserialize exception response from stream #7823

Closed

clintongormley closed this as completed Oct 14, 2014

clintongormley removed the v1.5.0 label Mar 19, 2015

dernasherbrezon mentioned this issue Apr 15, 2015

Unbounded threadpools in Netty #10616

Closed

This was referenced Oct 28, 2015

Forbid cached thread pools #14336

Closed

Forbid changing thread pool types #14367

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Unbound threadpools considered harmful #5152

Unbound threadpools considered harmful #5152

tkurki commented Feb 18, 2014

tkurki commented Feb 18, 2014

scottbessler commented May 21, 2014

clintongormley commented May 22, 2014

kimchy commented May 22, 2014

scottbessler commented May 22, 2014

clintongormley commented May 22, 2014

scottbessler commented May 22, 2014

kimchy commented Jul 11, 2014

mgreene commented Sep 4, 2014

kamaradclimber commented Sep 4, 2014

ghost commented Sep 30, 2014

clintongormley commented Oct 14, 2014

dernasherbrezon commented Apr 3, 2015

Unbound threadpools considered harmful #5152

Unbound threadpools considered harmful #5152

Comments

tkurki commented Feb 18, 2014

tkurki commented Feb 18, 2014

scottbessler commented May 21, 2014

clintongormley commented May 22, 2014

kimchy commented May 22, 2014

scottbessler commented May 22, 2014

clintongormley commented May 22, 2014

scottbessler commented May 22, 2014

kimchy commented Jul 11, 2014

mgreene commented Sep 4, 2014

kamaradclimber commented Sep 4, 2014

ghost commented Sep 30, 2014

clintongormley commented Oct 14, 2014

dernasherbrezon commented Apr 3, 2015