This repository has been archived by the owner on Jan 23, 2023. It is now read-only.

[RFC] Reusable ThreadPool workitems #20936

Closed

Conversation

benaadams
Member

For the global threadpool queue, workitems can be reused.

This introduces a threadpool thread local pool (64 items) and a global pool (256 items).

There are two types of pools: one for WaitCallback and one for Action&lt;TState&gt;.

Items are returned to the threadpool-local pool of the thread they execute on; when that is full, they are returned to the global pool; when that is also full, they are discarded.

Queueing from a threadpool thread first attempts to get an item from the thread's local pool, then the global pool; otherwise a new item is created.

Queueing from a non-threadpool thread first attempts to get an item from the global pool; otherwise a new item is created.

Items queued to the thread-local work queues, rather than the global queue, do not use reusable items.

Generic Action&lt;TState&gt; items are only pooled when TState is a reference type; they are stored as object and pass through a type conversion to execute, which would box value types.
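The rent/return policy above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the type and member names (`WorkItemPool`, `Rent`, `Return`) are invented, and the capacity checks are deliberately simple (and racy on the global pool, which a sketch can tolerate since over- or under-filling a pool is harmless).

```csharp
using System;
using System.Collections.Concurrent;

// Illustrative two-level pool: thread-local first, then global, then allocate.
public static class WorkItemPool<T> where T : class, new()
{
    private const int LocalCapacity = 64;    // per-thread pool size from the description above
    private const int GlobalCapacity = 256;  // global pool size from the description above

    [ThreadStatic]
    private static T[] t_local;              // thread-local pool, used as a stack
    [ThreadStatic]
    private static int t_count;

    private static readonly ConcurrentQueue<T> s_global = new ConcurrentQueue<T>();

    public static T Rent()
    {
        // 1. Try the current thread's local pool first (no contention).
        T[] local = t_local;
        if (local != null && t_count > 0)
        {
            T item = local[--t_count];
            local[t_count] = null;
            return item;
        }
        // 2. Fall back to the shared global pool.
        if (s_global.TryDequeue(out T pooled))
            return pooled;
        // 3. Nothing pooled: allocate a new item.
        return new T();
    }

    public static void Return(T item)
    {
        // 1. Return to the local pool of the thread the item executed on.
        T[] local = t_local ?? (t_local = new T[LocalCapacity]);
        if (t_count < LocalCapacity)
        {
            local[t_count++] = item;
            return;
        }
        // 2. Local pool full: spill to the global pool (capacity check is approximate).
        if (s_global.Count < GlobalCapacity)
            s_global.Enqueue(item);
        // 3. Both full: drop the item and let the GC collect it.
    }
}
```

A non-threadpool thread in the PR skips step 1 of Rent and goes straight to the global pool; the sketch collapses the two cases for brevity.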

Before
image

After
image

/cc @stephentoub @davidfowl

@stephentoub
Member

Can you share a macro benchmark (e.g. some real-ish asp.net workload) that demonstrates this is actually a net win on throughput/load/etc.? I get nervous any time we introduce a global pool like this. It's only going to help with the first N items queued to execute concurrently (where N is however many you're willing to cache), it introduces contention, these are otherwise relatively small objects that are inexpensive to create, etc.

@stephentoub
Member

cc: @kouvel, @jkotas

@benaadams
Member Author

I get nervous any time we introduce a global pool like this.

Me too; that's why I marked it as RFC 😄

Can you share a macro benchmark (e.g. some real-ish asp.net workload) that demonstrates this is actually a net win on throughout/load/etc.?

Working on it. Current allocations for 816,628 requests in 2.2 for the aspnet/KestrelHttpServer/.../benchmarkapps/PlatformBenchmarks

image

@benaadams
Member Author

In 3.0 with this change for 948,734 requests

image

Though I need to compare 3.0 before and after for perf.

@benaadams
Member Author

Hmm, looking at the allocations; wasn't this ValueTaskAwaiter&lt;TResult&gt; interface-cast boxing elided by the JIT?

```csharp
else if ((null != (object)default(TAwaiter)) && (awaiter is IStateMachineBoxAwareAwaiter))
{
    try
    {
        ((IStateMachineBoxAwareAwaiter)awaiter).AwaitUnsafeOnCompleted(box);
        // ...
```

image
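For context on why that cast can be free: when the generic method is specialized for a value-type awaiter, both the `(object)default(TAwaiter) != null` test and the interface type check become compile-time constants, so optimized codegen can devirtualize the call and elide the box. A minimal sketch of the same pattern, with invented names (`IFastPath`, `FastAwaiter`, `TryFastPath`) rather than the runtime's actual types:

```csharp
using System;

interface IFastPath { void Run(); }

// A value-type "awaiter" implementing the fast-path interface.
struct FastAwaiter : IFastPath { public void Run() { } }

static class Demo
{
    public static bool TryFastPath<T>(T awaiter)
    {
        // For a value-type T, the JIT specializes this method per T:
        // `(object)default(T) != null` is a constant `true`, and the
        // `is` check resolves statically, so tier-1 code can make the
        // interface call without boxing `awaiter`.
        if ((null != (object)default(T)) && (awaiter is IFastPath))
        {
            ((IFastPath)awaiter).Run();
            return true;
        }
        return false;
    }
}
```

In unoptimized (tier-0) code the box may still happen, which is one reason tiering shows up in allocation traces like the one above.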

/cc @AndyAyersMS @stephentoub

@jkotas
Member

jkotas commented Nov 11, 2018

it introduces contention, these are otherwise relatively small objects that are inexpensive to create, etc.

Just to add one more, less obvious, item to the list: more cards with Gen2 -> Gen0 pointers that make Gen0 pauses longer.

Can you share a macro benchmark

Agreed, the allocation rate does not really matter for these kinds of optimizations. What matters is throughput, length of GC pauses, peak memory consumption, etc.

We had a number of these custom allocators to solve various problems, and they pretty much always turned out to be a bad idea in the end, even after being fine-tuned to work well under different conditions (example: #18360). It is hard to beat the GC for small objects.

@benaadams
Member Author

benaadams commented Nov 11, 2018

Added an issue for the allocations of ValueTaskAwaiter&lt;T&gt; and TaskAwaiter: https://github.com/dotnet/coreclr/issues/20937

@benaadams
Member Author

Those allocations might be from the tiered JIT; but alas, a Windows update (insider build) broke my WSL test bed after reporting, so I can't re-test in the same way 😢

@AndyAyersMS
Member

Yes, tiering is a possible explanation for the allocations, but I expect to switch to tier 1 code quickly enough that in realistic tests it wouldn't be noticeable.

@benaadams
Member Author

benaadams commented Nov 11, 2018

Yes, tiering is a possible explanation for the allocations, but I expect to switch to tier 1 code quickly enough that in realistic tests it wouldn't be noticeable.

Manually triggering 20-ish requests produces the improved asm. The thousands of allocations from starting with the load test are probably more a demonstration of the benefits of tier 0, in that it was able to handle the load much earlier (as it was running a checked build).

@benaadams
Member Author

benaadams commented Nov 11, 2018

Allocations after a few warm-up rounds, for 1,254,606 requests against the PR (149.4k allocated objects, 12.67MB; 144.4k collected objects, 11.13MB)

image

GC Heap: 366 new objects (17.3kB); 1.6k dead objects (89.1kB); 4.6k surviving objects (1.53MB)

image

@benaadams
Member Author

benaadams commented Nov 12, 2018

Not worth it; Kestrel's SocketTransport can eliminate almost all of its QueueUserWorkItem callbacks by implementing IThreadPoolWorkItem: aspnet/KestrelHttpServer#3095

image

Pipes can equally do the same (when scheduler == ThreadPoolScheduler), though the Pipe may need 4 sub-objects so it can implement the interface 4 times, or a shared state-carrier type.

image
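The IThreadPoolWorkItem approach referenced above avoids per-dispatch allocations by making a long-lived object *be* the work item: `System.Threading.IThreadPoolWorkItem` (public in .NET Core 3.0) has a single `Execute` method, and such an instance can be queued via `ThreadPool.UnsafeQueueUserWorkItem(IThreadPoolWorkItem, bool)` with no delegate or boxed-state allocation. A minimal sketch; `SocketReceiver` is an illustrative name, not Kestrel's actual type:

```csharp
using System;
using System.Threading;

sealed class SocketReceiver : IThreadPoolWorkItem
{
    private int _executions;
    public int Executions => Volatile.Read(ref _executions);

    // Called by the thread pool directly; nothing is allocated per dispatch
    // because this long-lived instance is itself the queued work item.
    public void Execute() => Interlocked.Increment(ref _executions);

    // Re-queue this same instance each time work (e.g. a socket receive)
    // needs to run on the thread pool.
    public void Schedule() =>
        ThreadPool.UnsafeQueueUserWorkItem(this, preferLocal: false);
}
```

Compare with `ThreadPool.QueueUserWorkItem(state => ...)`, which allocates a work-item wrapper (and possibly a closure) on every call; that per-call allocation is exactly what this PR's pooling tried to recycle, and what the interface approach removes outright.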
