Enable optimized single-proc allocation helpers for single-proc x86/x64 systems only #27014

jkotas · 2019-10-03T18:54:07Z

Use the maximum number of processors the process may run on to determine whether it is ok to use
single-proc allocation helpers. It is not sufficient to depend on the current process affinity since
that can change during the process lifetime.

Also, the single-proc allocation helpers work well on x86/x64 systems only because they depend
on an atomic non-interlocked increment instruction for good performance. Such an instruction is available
on x86/x64 only. Disable them everywhere else.
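
As a rough illustration of the rule described above (not the actual runtime change), the decision could be sketched in C++ as follows. The `Arch` enum, the function name, and the `maxProcessors` parameter are hypothetical and exist only for this example; `maxProcessors` stands for the maximum number of processors the process may ever run on, not the current affinity mask.

```cpp
// Hypothetical sketch; none of these names come from the runtime sources.
enum class Arch { X86, X64, Arm, Arm64 };

// Returns true only when both conditions from the description hold:
//  1. the target is x86/x64, where a plain (non-interlocked) increment
//     is a single instruction, and
//  2. the process can never run on more than one processor.
constexpr bool UseSingleProcAllocHelpers(Arch arch, unsigned maxProcessors)
{
    return (arch == Arch::X86 || arch == Arch::X64) && maxProcessors == 1;
}

// A process that is merely affinitized to one core of an 8-proc machine
// does not qualify, because the affinity can be widened later.
static_assert(UseSingleProcAllocHelpers(Arch::X64, 1), "single-proc x64 qualifies");
static_assert(!UseSingleProcAllocHelpers(Arch::X64, 8), "multi-proc x64 does not");
static_assert(!UseSingleProcAllocHelpers(Arch::Arm64, 1), "non-x86 never qualifies");
```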

Fixes #26990

Maoni0 · 2019-10-03T19:43:43Z

I have mixed feelings about this - this stops using the global alloc context even when the process is affinitized to a single proc, which I would think is a way more common scenario than process affinity changing while the process is running...

jkotas · 2019-10-03T22:39:43Z

I have built a small micro-benchmark to see the difference in allocation rate between the global alloc context and per-thread allocation contexts. The micro-benchmark runs the loop for (int i = 0; i < 1000000000; i++) GC.KeepAlive(new object()); and computes the allocation rate from it.

The average results that I see on my machine (Xeon E5, Windows x64) show:

Global allocation context: ~4.7 GB/s
Per-thread allocation context: ~4.8 GB/s

The helper for the global allocation context is 12 instructions, while the helper for the per-thread allocation context is 14 instructions. However, the global allocation context requires 4 memory writes per allocation (lock, MethodTable*, allocptr, unlock), whereas the per-thread allocation context requires only 2 (MethodTable*, allocptr). I think that explains why the per-thread allocation context is slightly faster even though it executes more instructions.
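
To make the write counts concrete, here is a minimal C++ model of the two fast paths. It is a sketch under stated assumptions, not the runtime's helpers (which are hand-written assembly): all type and function names are hypothetical, the lock word convention (-1 means free) is an assumption made for this example, and the plain `++` only behaves like an atomic acquire if it compiles to a single increment instruction on a single-processor x86/x64 machine, which C++ does not guarantee.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>

// Hypothetical, simplified model of an allocation context.
struct AllocContext {
    std::uint8_t* alloc_ptr;    // next free byte in the buffer
    std::uint8_t* alloc_limit;  // end of the buffer
};

struct ObjectHeader {
    const void* method_table;   // stands in for MethodTable*
};

// One context shared by all threads, guarded by a lock word.
// Assumption for this sketch: -1 means free, 0 means held.
struct GlobalAllocContext {
    int lock;
    AllocContext ctx;
};

// Fast path with 4 memory writes: lock, MethodTable*, alloc_ptr, unlock.
// Returns nullptr when the caller would have to take a slow path.
void* AllocGlobal(GlobalAllocContext& g, const void* mt, std::size_t size)
{
    assert(size >= sizeof(ObjectHeader));
    if (++g.lock != 0) {                 // write 1: non-interlocked increment
        --g.lock;                        // already held: undo and bail out
        return nullptr;
    }
    std::uint8_t* p = g.ctx.alloc_ptr;
    if (size > static_cast<std::size_t>(g.ctx.alloc_limit - p)) {
        g.lock = -1;                     // buffer exhausted: release and bail out
        return nullptr;
    }
    ::new (static_cast<void*>(p)) ObjectHeader{mt};  // write 2: MethodTable*
    g.ctx.alloc_ptr = p + size;                      // write 3: alloc_ptr
    g.lock = -1;                                     // write 4: unlock
    return p;
}

// Fast path with 2 memory writes: MethodTable*, alloc_ptr.
// The context belongs to the calling thread, so no lock is needed.
void* AllocPerThread(AllocContext& ctx, const void* mt, std::size_t size)
{
    assert(size >= sizeof(ObjectHeader));
    std::uint8_t* p = ctx.alloc_ptr;
    if (size > static_cast<std::size_t>(ctx.alloc_limit - p)) {
        return nullptr;                  // buffer exhausted: slow path
    }
    ::new (static_cast<void*>(p)) ObjectHeader{mt};  // write 1: MethodTable*
    ctx.alloc_ptr = p + size;                        // write 2: alloc_ptr
    return p;
}
```

The per-thread variant spends its extra instructions locating the context for the current thread (modeled here by simply passing it in), but it touches half as many memory locations per allocation.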

@Maoni0 Do you have any benchmarks for which you would like to keep the global alloc contexts? Ideally, I would love to get rid of them to make everything simpler.

Maoni0 · 2019-10-04T05:35:11Z

@jkotas I would love to make things simpler too, but I'm pretty sure some perf benefit was seen with the global alloc context in this scenario - of course that was a long time ago (before my time), so it wouldn't be surprising if things have changed. So I would like to at least see some perf investigation done with GCPerfSim. @andy-ms can fill you in on how to run it.

another thing is how this performs on Linux as getting to the per thread alloc context is more expensive on Linux.

I'll be OOF starting tomorrow for a week. also CC-ing @sergiy-k and @PeterSolMS who can assist with the perf investigation in my absence.

jkotas · 2019-10-04T13:55:38Z

Thanks. @andy-ms Could you please send me instructions for how to run GCPerfSim?

> another thing is how this performs on Linux as getting to the per thread alloc context is more expensive on Linux.

We have not implemented the UP allocation helpers at all on Unix. This change affects Windows only.

jkotas · 2019-10-08T02:04:12Z

GCPerfSim has not identified any issues.

…64 systems only (dotnet#27014)

…64 systems only (#27014) (#27080)

jkotas requested review from janvorli and Maoni0 October 3, 2019 18:54

jkotas added the area-GC label Oct 4, 2019

jkotas merged commit 4e702da into dotnet:master Oct 8, 2019

jkotas deleted the fix-26990 branch October 9, 2019 17:09
