
Don't allow idle threads to live forever #612

Merged · 1 commit into apache:main on May 2, 2023
Conversation

@mjgp2 (Contributor) commented Apr 27, 2023

From https://bz.apache.org/bugzilla/show_bug.cgi?id=66581

Currently, the ExecutorService used for this is created as follows:

```java
ExecutorService executorService = new ThreadPoolExecutor(
        0, Integer.MAX_VALUE,
        Long.MAX_VALUE, TimeUnit.MILLISECONDS,
        new SynchronousQueue<>(),
        new AsyncIOThreadFactory());
```

This means a new thread is spawned whenever no idle thread is available, and those threads never terminate. For a server under a spike of load, this spawns many threads which never exit, consuming system resources indefinitely.

I would suggest using a saner idle timeout than Long.MAX_VALUE, such as 60 seconds, unless there is a good reason I am unaware of why such threads should never die.
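
For illustration, a minimal sketch of the change I have in mind (the 60-second value is just the suggestion above, not a tuned default):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Same configuration as today, except idle threads now terminate
// after 60 seconds instead of living forever.
ExecutorService executorService = new ThreadPoolExecutor(
        0,                    // corePoolSize: keep no threads when idle
        Integer.MAX_VALUE,    // maximumPoolSize: unbounded, as before
        60, TimeUnit.SECONDS, // keepAliveTime: the suggested idle timeout
        new SynchronousQueue<>(),
        new AsyncIOThreadFactory());
```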

@rmaucher (Contributor)
This seems like a good idea (although I have no idea about the best default idle value). It has been like this since the code was initially added.

@markt-asf (Contributor)
My reading of the code is that the AsynchronousChannelProvider submits tasks that never end to the executor, so this patch has no effect. The purpose of the executor is to replace any threads that fail. Local testing appears to confirm this. Unless I've missed something (always possible), I think this PR (and the associated BZ issue) needs to be closed without being merged.
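
To illustrate: keepAliveTime only reclaims threads that are idle waiting for work, so a thread executing a never-returning task is unaffected. A self-contained sketch (the endless task is hypothetical, standing in for the provider's event-handler loop):

```java
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class KeepAliveDemo {
    public static void main(String[] args) throws InterruptedException {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                0, Integer.MAX_VALUE, 1, TimeUnit.SECONDS,
                new SynchronousQueue<>());

        // A never-returning task, standing in for the event-handler
        // loop the channel provider submits to the group's executor.
        pool.execute(() -> {
            while (!Thread.currentThread().isInterrupted()) {
                try {
                    Thread.sleep(1_000);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });

        Thread.sleep(5_000);
        // The 1-second keep-alive expired long ago, but the thread is
        // still busy running the task, so it is never reclaimed.
        System.out.println("active threads: " + pool.getActiveCount()); // 1
        pool.shutdownNow(); // interrupt the task so the JVM can exit
    }
}
```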

@mjgp2 (Contributor, Author) commented Apr 27, 2023

I'll have a further look at this tomorrow. The situation I have is hundreds of idle threads. Potentially a configurable solution using virtual threads would mitigate the issue.

@markt-asf (Contributor)
A thread dump would be helpful.

@rmaucher (Contributor)
Hmm, this is the WebSocket client and it uses NIO2, so that executor is used internally for everything that runs asynchronously for I/O, such as calling completion handlers and so on.

@mjgp2 (Contributor, Author) commented Apr 28, 2023

I've done a smaller thread dump (attached).

I found this documentation on the AsynchronousChannelGroup.withCachedThreadPool call made within this WebSocket code:

•  An asynchronous channel group associated with a cached thread pool submits events to the thread pool that simply invoke the user's completion handler.
•  The kernel's internal I/O operations are handled by one or more internal threads that are not visible to the user application.
•  That means you have one hidden thread pool that dispatches events to a cached thread pool, which in turn invokes the completion handler.
•  The probability of suffering the hang problem seen with the FixedThreadPool is lower.
•  It still might grow infinitely (those infinite thread pools should never have existed anyway!).
•  At least you guarantee that the kernel will be able to complete its I/O operations (like reading bytes).
•  Oops... a CachedThreadPool must support unbounded queueing to work properly (grrr, not sure I like that!!!).
•  So you can possibly lock up all the threads and feed the queue forever until an OOM happens. Great!
•  Of course, nobody will ever do that!

So in short, this thread pool is only used for completion callbacks.

It seems the decision was made to have a synchronous queue with an unbounded number of threads, rather than a bounded number of threads with an unbounded queue. To me this seems the wrong way round; interested to hear thoughts.
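
For concreteness, the two shapes look something like this (the pool size of 16 is purely illustrative, not a proposed value):

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Current shape: no queueing; every hand-off that finds no idle
// thread spawns a new one. Thread count is unbounded.
ThreadPoolExecutor unboundedThreads = new ThreadPoolExecutor(
        0, Integer.MAX_VALUE, 60, TimeUnit.SECONDS,
        new SynchronousQueue<>());

// Alternative shape: cap the threads and queue the excess. With an
// unbounded queue, memory rather than thread count becomes the limit.
ThreadPoolExecutor boundedThreads = new ThreadPoolExecutor(
        16, 16, 60, TimeUnit.SECONDS,
        new LinkedBlockingQueue<>());
boundedThreads.allowCoreThreadTimeOut(true); // let even core threads idle out
```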

Based on the documentation, and on other implementations that call the withCachedThreadPool method, I can see no good reason to keep the threads running forever.

These threads are globally shared; however, looking at the dump, two SecureIO threads are created for every WebSocket session. This seems very resource-hungry if you have hundreds of WebSocket connections. I imagine this could be solved by virtual threads, or by refactoring to use a shared thread pool.
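
A minimal sketch of the virtual-thread idea, assuming Java 21+ (just the shape of it, not a drop-in for the existing code):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Java 21+: each submitted task runs on its own cheap virtual thread,
// so hundreds of mostly-idle sessions no longer pin OS threads.
ExecutorService executorService = Executors.newVirtualThreadPerTaskExecutor();
```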

Thanks all

threaddump.txt

@markt-asf (Contributor)

The thread dump confirms the earlier analysis. The sun.nio.ch.KQueuePort$EventHandlerTask never exits, so the proposed change will not have the desired impact.
It would be worth exploring whether the default group configuration has changed since the code was first written, since the intention was to mimic it, with the added benefits that multiple WsWebSocketContainer instances would share a single pool and that the pool would be destroyed when no longer required.

@markt-asf (Contributor)

The default AsynchronousChannelGroup has adopted a keep-alive of 60s at some point but, since the task never exits, it makes no difference in this case.
We could make a similar change in Tomcat but, until sun.nio.ch.KQueuePort$EventHandlerTask changes, it will have no effect.
`KQueuePort` appears to be macOS-specific (at least in the latest source code), so there may be benefits from this change on other platforms.
Overall, I think I have convinced myself this change makes sense, even if it doesn't solve the issue the OP is seeing.

@markt-asf merged commit cd25be5 into apache:main on May 2, 2023