This repository has been archived by the owner on Jan 23, 2023. It is now read-only.

Enable optimized single-proc allocation helpers for single-proc x86/x64 systems only #27014

Merged
merged 1 commit into from Oct 8, 2019

Conversation

jkotas
Member

@jkotas jkotas commented Oct 3, 2019

Use maximum number of processors the process may run on to determine whether it is ok to use
single-proc allocation helpers. It is not sufficient to depend on current process affinity since
that can change during the process lifetime.

Also, the single-proc allocation helpers work well on x86/x64 systems only because they depend
on an atomic non-interlocked increment instruction for good performance. Such an instruction is available
on x86/x64 only. Disable them everywhere else.

Fixes #26990
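
A minimal sketch of the decision described above, not the actual CoreCLR code. The `Arch` enum, the `ProcessorInfo` struct, and both function names are hypothetical, and the "old" check is an assumption inferred from the description (it trusts the current affinity only).

```cpp
#include <cassert>

// Hypothetical target architectures for this sketch.
enum class Arch { X86, X64, Arm, Arm64 };

// Hypothetical snapshot of what a runtime might know at startup.
struct ProcessorInfo {
    Arch arch;
    int currentAffinityCount; // processors in the current affinity mask (can change later)
    int maxProcessorCount;    // upper bound on processors the process may ever run on
};

// Assumed previous behavior: decide from the current affinity alone.
inline bool UseSingleProcHelpersOld(const ProcessorInfo& info) {
    return info.currentAffinityCount == 1;
}

// Behavior described in this PR: require x86/x64 and a maximum of one processor.
inline bool UseSingleProcHelpersNew(const ProcessorInfo& info) {
    assert(info.maxProcessorCount >= 1);
    const bool isX86OrX64 = info.arch == Arch::X86 || info.arch == Arch::X64;
    return isX86OrX64 && info.maxProcessorCount == 1;
}
```

With these definitions, a process affinitized to one processor on an 8-processor x64 machine would pass the old check but fail the new one, because its affinity could later be widened.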

@jkotas jkotas requested review from janvorli and Maoni0 October 3, 2019 18:54
@Maoni0
Member

Maoni0 commented Oct 3, 2019

I have mixed feelings about this - this stops using the global alloc context even when the process is affinitized to a single proc which I would think is a way more common scenario than process affinity changing while the process is running....

@jkotas
Member Author

jkotas commented Oct 3, 2019

I have built a small micro-benchmark to compare the allocation rate of the global alloc context with that of per-thread allocation contexts. The micro-benchmark runs the loop for (int i = 0; i < 1000000000; i++) GC.KeepAlive(new object()); and computes the allocation rate from that.

The average results that I see on my machine (Xeon E5, Windows x64) show:

  • Global allocation context: ~4.7GB/s
  • Per-thread allocation context: ~4.8GB/s
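
For reference, a rate like the ones above can be derived from the loop count, the object size, and the elapsed time. The sketch below is only an illustration: the function name is hypothetical, 1 GB is taken as 1e9 bytes, and the 24-byte size for a plain object on x64 is an assumption, not something measured in this thread.

```cpp
#include <cassert>
#include <cstdint>

// Allocation rate in GB/s, with 1 GB assumed to be 1e9 bytes.
// All inputs are supplied by the caller; nothing here is measured.
inline double AllocationRateGBps(std::uint64_t objectCount,
                                 std::uint64_t bytesPerObject,
                                 double elapsedSeconds) {
    assert(elapsedSeconds > 0.0);
    const double totalBytes =
        static_cast<double>(objectCount) * static_cast<double>(bytesPerObject);
    return totalBytes / elapsedSeconds / 1e9;
}
```

Under those assumptions, 1,000,000,000 objects of 24 bytes allocated in 5 seconds would correspond to roughly 4.8 GB/s.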

The helper for the global allocation context is 12 instructions, while the helper for the per-thread allocation context is 14 instructions. However, the global allocation context requires 4 memory writes per allocation (lock, MethodTable*, allocptr, unlock), whereas the per-thread allocation context requires only 2 (MethodTable*, allocptr). I think this explains why the per-thread allocation context is slightly faster even though it executes more instructions.
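
The write counts above can be illustrated with a simplified bump-allocator sketch. This is not the real helper code (the actual helpers are hand-written and architecture-specific). All names here are hypothetical, the lock is modeled as a plain store (standing in for the non-interlocked instruction used on a single processor), and the slow path simply returns nullptr.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical allocation context: a bump pointer and its limit.
struct AllocContext {
    std::uint8_t* allocPtr;
    std::uint8_t* allocLimit;
};

// Counts the memory writes performed on the fast path.
struct WriteCounter { int writes = 0; };

// Shared (global) context: lock, MethodTable*, allocptr, unlock = 4 writes.
inline void* AllocGlobal(AllocContext& ctx, int& lockWord, const void* methodTable,
                         std::size_t size, WriteCounter& wc) {
    lockWord = 1; ++wc.writes;                              // write 1: take the lock
    std::uint8_t* obj = ctx.allocPtr;
    if (static_cast<std::size_t>(ctx.allocLimit - obj) < size) {
        lockWord = 0; ++wc.writes;                          // release before the slow path
        return nullptr;
    }
    std::memcpy(obj, &methodTable, sizeof methodTable); ++wc.writes; // write 2: MethodTable*
    ctx.allocPtr = obj + size; ++wc.writes;                 // write 3: bump allocptr
    lockWord = 0; ++wc.writes;                              // write 4: release the lock
    return obj;
}

// Per-thread context: MethodTable*, allocptr = 2 writes, no lock needed.
inline void* AllocPerThread(AllocContext& ctx, const void* methodTable,
                            std::size_t size, WriteCounter& wc) {
    std::uint8_t* obj = ctx.allocPtr;
    if (static_cast<std::size_t>(ctx.allocLimit - obj) < size) return nullptr;
    std::memcpy(obj, &methodTable, sizeof methodTable); ++wc.writes; // write 1: MethodTable*
    ctx.allocPtr = obj + size; ++wc.writes;                 // write 2: bump allocptr
    return obj;
}
```

The sketch only counts stores; it says nothing about instruction counts or actual timings, which depend on the real helpers and hardware.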

@Maoni0 Do you have any benchmarks for which you would like to keep the global alloc contexts? Ideally, I would love to get rid of them to make everything simpler.

@Maoni0
Member

Maoni0 commented Oct 4, 2019

@jkotas I would love to make things simpler too but I'm pretty sure some perf benefit was seen with the global alloc context in this scenario - of course that was a long time ago (before my time) so it wouldn't be surprising if things have changed. so I would like to at least see some perf investigation done with GCPerfSim. @andy-ms can fill you in on how to run it.

another thing is how this performs on Linux as getting to the per thread alloc context is more expensive on Linux.

I'll be OOF starting tomorrow for a week. also CC-ing @sergiy-k and @PeterSolMS who can assist with the perf investigation in my absence.

@jkotas
Member Author

jkotas commented Oct 4, 2019

Thanks. @andy-ms Could you please send me instructions for how to run GCPerfSim?

> another thing is how this performs on Linux as getting to the per thread alloc context is more expensive on Linux.

We have not implemented the UP allocation helpers at all on Unix. This change affects Windows only.

@jkotas jkotas added the area-GC label Oct 4, 2019
@jkotas
Member Author

jkotas commented Oct 8, 2019

GCPerfSim has not identified any issues

@jkotas jkotas merged commit 4e702da into dotnet:master Oct 8, 2019
jkotas added a commit to jkotas/coreclr that referenced this pull request Oct 8, 2019
…64 systems only (dotnet#27014)

@jkotas jkotas deleted the fix-26990 branch October 9, 2019 17:09
jkotas added a commit that referenced this pull request Oct 14, 2019
…64 systems only (#27014) (#27080)
