Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unbound threadpools considered harmful #5152

Closed
tkurki opened this issue Feb 18, 2014 · 13 comments
Closed

Unbound threadpools considered harmful #5152

tkurki opened this issue Feb 18, 2014 · 13 comments
Assignees

Comments

@tkurki
Copy link

tkurki commented Feb 18, 2014

ElasticSearch server and java client both have an unbound, undocumented threadpool that can cause lockup of the whole JVM process and take down the whole server machine: #5151.

The minimum change is documenting the unbound generic pool & instructions for reconfiguring it at http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-threadpool.html.

IMHO unbound threadpool is a Bad Idea from the start. Bound pool with some reasonable size queue and abort policy would not cause jvm/server lockup in a runaway case, instead you would get some meaningful information in the log.

@tkurki
Copy link
Author

tkurki commented Feb 18, 2014

@scottbessler
Copy link

This appears to have bit us as well. We noticed it when we had an ActionListener on a ListenableActionFuture<BulkResponse> who took perhaps too long to execute and/or had concurrency issues (it was actually firing a Guava Eventbus event where the handlers are synchronized). We started seeing exceptions like:

org.elasticsearch.client.transport.NoNodeAvailableException: No node available
    at org.elasticsearch.client.transport.TransportClientNodesService$RetryListener.onFailure(TransportClientNodesService.java:249)

And then with Yourkit attached we saw the same symptom of 7000+ of these threads being spun up, and hanging the jvm.

@clintongormley
Copy link

@scottbessler What version of Elasticsearch are you using?

@kimchy
Copy link
Member

kimchy commented May 22, 2014

The idea of the generic thread pool is to have an option to fork to a thread without having to block on it or get rejected, which is needed in some cases. In places where it is used, its good, but there are places where it simply should not be used, specifically, the listening thread pool when the pool is threaded, we should have a dedicated thread pool for it that is bounded.

@scottbessler
Copy link

0.90.6

@clintongormley
Copy link

@scottbessler It appears to be fixed in 0.90.8 onwards #5151 #4162

@scottbessler
Copy link

The race condition may be fixed, but the unbounded undocumented thread pool lurks waiting for a similar issue to crush JVMs. If nothing else it seems that at least letting users configure this threadpool (even if the default is unbounded) would at least let them choose their poison.

@kimchy
Copy link
Member

kimchy commented Jul 11, 2014

@scottbessler yea, though in all places (except for the listeners today), we are very careful at when we use the generic thread pool. We need to look into potentially changing the generic thread pool to be bounded, but have unlimited queue, since we rely on the fact that it won't block when trying to add something to it. Regardless, the listeners callback should have their own thread pool.

@mgreene
Copy link
Contributor

mgreene commented Sep 4, 2014

@kimchy I'm curious as to why you suggest a bounded thread pool but unbounded queue. In essence, a unbounded thread pool will thrash CPU and potentially blow out the stack space on the JVM. An unbounded queue can blow out the heap.

I think the great thing (and also a PITA) about the Java concurrency semantics is that w/r/t to queues and thread pools, they encourage you to deal with the back pressure. You can elect to simplify your code by adding an unbounded queue or pool but in turn are allowing back pressure to manifest itself in weird and cryptic ways that's harder to troubleshoot in production.

@kamaradclimber
Copy link

I think we are hurt by one of this unbounded thread pool (called "generic").
Here is a thread dump:

"elasticsearch[Trump][generic][T#62615]" daemon prio=10 tid=0x00007f23f9db7800 nid=0x9b0c waiting on condition [0x00007f1e0f82a000]
   java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for  <0x00000006cbd1e030> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:867)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1197)
    at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:214)
    at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:290)
    at java.util.concurrent.CopyOnWriteArrayList.remove(CopyOnWriteArrayList.java:507)
    at org.elasticsearch.cluster.service.InternalClusterService.remove(InternalClusterService.java:182)
    at org.elasticsearch.action.support.master.TransportMasterNodeOperationAction$3.onTimeout(TransportMasterNodeOperationAction.java:179)
    at org.elasticsearch.cluster.service.InternalClusterService$NotifyTimeout.run(InternalClusterService.java:492)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)

we currently have 28k threads with the same stack

The original reason is probably elsewhere (I should file a separate ticket for this as well) but the unbound nature of the threadpool bites us here.

kimchy added a commit to kimchy/elasticsearch that referenced this issue Sep 23, 2014
Today, when executing an action (mainly when using the Java API), a listener threaded flag can be set to true in order to execute the listener on a different thread pool. Today, this thread pool is the generic thread pool, which is cached. This can create problems for Java clients (mainly) around potential thread explosion.
Introduce a new thread pool called listener, that is fixed sized and defaults to the half the cores maxed at 10, and use it where listeners are executed.
relates to elastic#5152
closes elastic#7837
kimchy added a commit that referenced this issue Sep 25, 2014
Today, when executing an action (mainly when using the Java API), a listener threaded flag can be set to true in order to execute the listener on a different thread pool. Today, this thread pool is the generic thread pool, which is cached. This can create problems for Java clients (mainly) around potential thread explosion.
Introduce a new thread pool called listener, that is fixed sized and defaults to the half the cores maxed at 10, and use it where listeners are executed.
relates to #5152
closes #7837
kimchy added a commit that referenced this issue Sep 25, 2014
Today, when executing an action (mainly when using the Java API), a listener threaded flag can be set to true in order to execute the listener on a different thread pool. Today, this thread pool is the generic thread pool, which is cached. This can create problems for Java clients (mainly) around potential thread explosion.
Introduce a new thread pool called listener, that is fixed sized and defaults to the half the cores maxed at 10, and use it where listeners are executed.
relates to #5152
closes #7837
kimchy added a commit that referenced this issue Sep 25, 2014
Today, when executing an action (mainly when using the Java API), a listener threaded flag can be set to true in order to execute the listener on a different thread pool. Today, this thread pool is the generic thread pool, which is cached. This can create problems for Java clients (mainly) around potential thread explosion.
Introduce a new thread pool called listener, that is fixed sized and defaults to the half the cores maxed at 10, and use it where listeners are executed.
relates to #5152
closes #7837
@ghost
Copy link

ghost commented Sep 30, 2014

Hi, I am hitting this issue in our servers. Is this included in latest version 1.3.3? If not, while I will wait for version with fix, is there anything I can do to reduce the probability of hitting it in my client as it brings my server machine down?

"elasticsearch[Horus][generic][T#16356]" daemon prio=10 tid=0x00007f27ca746800 nid=0x5cd8 runnable [0x00007f25a4444000]
java.lang.Thread.State: RUNNABLE
at java.util.Arrays.fill(Arrays.java:1921)
at org.elasticsearch.common.io.stream.HandlesStreamOutput$HandleTable.clear(HandlesStreamOutput.java:192)
at org.elasticsearch.common.io.stream.HandlesStreamOutput$HandleTable.(HandlesStreamOutput.java:149)
at org.elasticsearch.common.io.stream.HandlesStreamOutput.(HandlesStreamOutput.java:39)
at org.elasticsearch.common.io.stream.HandlesStreamOutput.(HandlesStreamOutput.java:43)
at org.elasticsearch.transport.netty.NettyTransport.sendRequest(NettyTransport.java:548)
at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:189)
at org.elasticsearch.action.TransportActionNodeProxy.execute(TransportActionNodeProxy.java:68)
at org.elasticsearch.client.transport.support.InternalTransportClient$2.doWithNode(InternalTransportClient.java:109)
at org.elasticsearch.client.transport.TransportClientNodesService$RetryListener.onFailure(TransportClientNodesService.java:259)
at org.elasticsearch.client.transport.TransportClientNodesService$RetryListener.onFailure(TransportClientNodesService.java:262)
at org.elasticsearch.action.TransportActionNodeProxy$1.handleException(TransportActionNodeProxy.java:89)
at org.elasticsearch.transport.TransportService$2.run(TransportService.java:206)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

@clintongormley
Copy link

Hi @jfiedler

PR #7837 adds a listener thread pool, which should fix the issue that you are seeing (in 1.4). The generic threadpool is still cached and unbounded, but we are very careful about what gets run on these threads.

Closing this ticket.

@dernasherbrezon
Copy link

There are still unbounded thread pools exist at NettyTransport (

private ClientBootstrap createClientBootstrap() {
)

I would expect them configured the same way index/search/&etc threadpools. Could create pull request if needed

mute pushed a commit to mute/elasticsearch that referenced this issue Jul 29, 2015
Today, when executing an action (mainly when using the Java API), a listener threaded flag can be set to true in order to execute the listener on a different thread pool. Today, this thread pool is the generic thread pool, which is cached. This can create problems for Java clients (mainly) around potential thread explosion.
Introduce a new thread pool called listener, that is fixed sized and defaults to the half the cores maxed at 10, and use it where listeners are executed.
relates to elastic#5152
closes elastic#7837
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

9 participants