[NativeAOT] Simplifying access to thread static variables #84566

VSadov · 2023-04-10T15:36:10Z

separate "fast" inlinable case . (used for singlemodule, not dynamic cases, when optimizing)
make the storage for fast threadstatics a single "combo" instance instead of array of instances.

ghost · 2023-04-10T15:36:15Z

Tagging subscribers to this area: @agocke, @MichalStrehovsky, @jkotas
See info in area-owners.md if you want to be subscribed.

Issue Details

Fixes: #84373

separate "fast" inlinable case. (no multimodule, not dynamic)
preinitialize storage for fast threadstatics on thread attach
consider making storage for fast threadstatics a single "combo" instance instead of array of instances.

Author:	VSadov
Assignees:	-
Labels:	`area-NativeAOT-coreclr`
Milestone:	-

VSadov · 2023-04-10T15:52:10Z

src/coreclr/nativeaot/Runtime/amd64/MiscStubs.asm

@@ -41,37 +39,18 @@ LEAF_END RhpStackProbe, _TEXT

 LEAF_ENTRY RhpGetThreadStaticBaseForType, _TEXT


The "fast" access got a lot simpler here. RhpGetThreadStaticBaseForType is what we want eventually not called, but directly inlined by the JIT into callers.

We could yet make this a bit simpler by removing an indirection into the array of storage instances. It may be somewhat challenging though.

Is there any other technical reason why we don't have such assembly helper for Arm64 apart that it is not implemented?

The helper for Arm64 is now implemented

src/coreclr/nativeaot/Runtime/amd64/MiscStubs.S

src/coreclr/nativeaot/Runtime/amd64/MiscStubs.asm

src/coreclr/nativeaot/System.Private.CoreLib/src/Internal/Runtime/ThreadStatics.cs

kunalspathak · 2023-04-11T14:31:22Z

src/coreclr/nativeaot/Runtime/amd64/MiscStubs.asm

@@ -41,37 +39,18 @@ LEAF_END RhpStackProbe, _TEXT

 LEAF_ENTRY RhpGetThreadStaticBaseForType, _TEXT


Is there any other technical reason why we don't have such assembly helper for Arm64 apart that it is not implemented?

VSadov · 2023-04-11T14:59:37Z

Is there any other technical reason why we don't have such assembly helper for Arm64 apart that it is not implemented?

no reason. Also no reason for the helper to be in assembly.
I plan to replace this with c++ and call on all architectures. Until eventually it is inlined.

kunalspathak · 2023-04-11T15:00:22Z

I plan to replace this with c++ and call on all architectures

are you thinking about it as part of this PR?

VSadov · 2023-04-11T15:04:53Z

are you thinking about it as part of this PR?

yes. It would also make it easier to do further changes - no need to fix multiple helpers.

kunalspathak · 2023-04-11T15:16:02Z

What is the speedup from just this change?

VSadov · 2023-04-11T15:26:54Z

What is the speedup from just this change?

i think the cost of the call is still there, so this might not be a huge improvement.
it would be interesting to measure after couple more “easy” changes like moving to c++ and not passing unused arguments.

src/coreclr/nativeaot/Runtime/thread.cpp

src/coreclr/nativeaot/Runtime/threadstore.cpp

VSadov · 2023-04-12T22:12:50Z

What is the speedup from just this change?

I see about 15% speedup on x64. That is with both old and new implementations still making a call to the helper.

With a simple microbenchmark which calls a method accessing bunch of threadstatics in a loop I see:
(time, lower is better)

=== before the change:
4745
4747
4749

=== after the change:
4068
4082
4074

=== coreclr (JIT) 
7250
7272
7262

the benchmark:

using System.Diagnostics;
using System.Runtime.CompilerServices;

namespace ConsoleApp22
{
    class C00 {[ThreadStatic] public static int t_i00;}
    class C01 {[ThreadStatic] public static int t_i01;}
    class C02 {[ThreadStatic] public static int t_i02;}
    class C03 {[ThreadStatic] public static int t_i03;}
    class C04 {[ThreadStatic] public static int t_i04;}
    class C05 {[ThreadStatic] public static int t_i05;}
    class C06 {[ThreadStatic] public static int t_i06;}
    class C07 {[ThreadStatic] public static int t_i07;}
    class C08 {[ThreadStatic] public static int t_i08;}
    class C09 {[ThreadStatic] public static int t_i09;}
    class C10 {[ThreadStatic] public static int t_i10;}
    class C11 {[ThreadStatic] public static int t_i11;}
    class C12 {[ThreadStatic] public static int t_i12;}
    class C13 {[ThreadStatic] public static int t_i13;}
    class C14 {[ThreadStatic] public static int t_i14;}
    class C15 {[ThreadStatic] public static int t_i15;}
    class C16 {[ThreadStatic] public static int t_i16;}
    class C17 {[ThreadStatic] public static int t_i17;}
    class C18 {[ThreadStatic] public static int t_i18;}
    class C19 {[ThreadStatic] public static int t_i19;}
    class C20 {[ThreadStatic] public static int t_i20;}
    class C21 {[ThreadStatic] public static int t_i21;}
    class C22 {[ThreadStatic] public static int t_i22;}
    class C23 {[ThreadStatic] public static int t_i23;}
    class C24 {[ThreadStatic] public static int t_i24;}
    class C25 {[ThreadStatic] public static int t_i25;}
    class C26 {[ThreadStatic] public static int t_i26;}
    class C27 {[ThreadStatic] public static int t_i27;}
    class C28 {[ThreadStatic] public static int t_i28;}
    class C29 {[ThreadStatic] public static int t_i29;}

    internal class Program
    {
        const int iters = 1000000;
        static void Main(string[] args)
        {
            for (; ; )
            {
                Time(AccessTLS);
            }
        }

        static void Time(Action a)
        {
            var sw = Stopwatch.StartNew();
            for (int i = 0; i < 100; i++)
            {
                a();
            }

            sw.Stop();
            System.Console.WriteLine(sw.ElapsedMilliseconds);
        }

        static void AccessTLS()
        {
            for (int i = 0; i < iters; i++)
            {
                OneTLSAccess();
            }
        }

        [MethodImpl(MethodImplOptions.NoInlining)]
        private static void OneTLSAccess()
        {
            C00.t_i00 = C00.t_i00 +C01.t_i01 + C02.t_i02 +C03.t_i03 + C04.t_i04 +C05.t_i05 + C06.t_i06 +C07.t_i07 +
                        C08.t_i08 +C09.t_i09 + C10.t_i10 +C11.t_i11 + C12.t_i12 +C13.t_i13 + C14.t_i14 +C15.t_i15 +
                        C16.t_i16 +C17.t_i17 + C18.t_i18 +C19.t_i19 + C20.t_i20 +C21.t_i21 + C22.t_i22 +C23.t_i23 +
                        C24.t_i24 +C25.t_i25 + C26.t_i26 +C27.t_i27 + C28.t_i28 +C29.t_i29;
        }
    }
}

VSadov · 2023-04-12T22:22:21Z

The codegen looks like the folllowing:

before the changes:

=== thunk
00007ff6`20391907 4c8d0512720f00 lea     r8, [7FF620488B20h]
00007ff6`2039190e 498b08         mov     rcx, qword ptr [r8]
00007ff6`20391911 418b5008       mov     edx, dword ptr [r8+8]
00007ff6`20391915 e9c6050000     jmp     00007FF620391EE0
 
=== helper
00007ff6`20391ee0 8b0512f41900         mov     eax, dword ptr [7FF6205312F8h]
00007ff6`20391ee6 654c8b042558000000   mov     r8, qword ptr gs:[58h]
00007ff6`20391eef 4d8b04c0             mov     r8, qword ptr [r8+rax*8]
00007ff6`20391ef3 b810000000           mov     eax, 10h
00007ff6`20391ef8 4903c0               add     rax, r8
00007ff6`20391efb 448b4108             mov     r8d, dword ptr [rcx+8]
00007ff6`20391eff 488b8090000000       mov     rax, qword ptr [rax+90h]
00007ff6`20391f06 4885c0               test    rax, rax
00007ff6`20391f09 0f84717a0b00         je      00007FF620449980
00007ff6`20391f0f 443b4008             cmp     r8d, dword ptr [rax+8]
00007ff6`20391f13 0f83677a0b00         jae     00007FF620449980
00007ff6`20391f19 4a8b44c010           mov     rax, qword ptr [rax+r8*8+10h]
00007ff6`20391f1e 4885c0               test    rax, rax
00007ff6`20391f21 0f84597a0b00         je      00007FF620449980
00007ff6`20391f27 3b5008               cmp     edx, dword ptr [rax+8]
00007ff6`20391f2a 0f83507a0b00         jae     00007FF620449980
00007ff6`20391f30 488b44d010           mov     rax, qword ptr [rax+rdx*8+10h]
00007ff6`20391f35 4885c0               test    rax, rax
00007ff6`20391f38 0f84427a0b00         je      00007FF620449980
00007ff6`20391f3e c3                   ret

after the changes:

=== thunk:
00007ff6`aa91198e 488d15db710f00 lea     rdx, [7FF6AAA08B70h]
00007ff6`aa911995 8b4a08         mov     ecx, dword ptr [rdx+8]
00007ff6`aa911998 e903110000     jmp     00007FF6AA912AA0
 
=== helper
00007ff6`aa912aa0 65488b042558000000 mov     rax, qword ptr gs:[58h]
00007ff6`aa912aa9 83c102             add     ecx, 2
00007ff6`aa912aac 8b1526f81900       mov     edx, dword ptr [7FF6AAAB22D8h]
00007ff6`aa912ab2 41b910000000       mov     r9d, 10h
00007ff6`aa912ab8 4c8b04d0           mov     r8, qword ptr [rax+rdx*8]
00007ff6`aa912abc 4b8b840898000000   mov     rax, qword ptr [r8+r9+98h]
00007ff6`aa912ac4 488b04c8           mov     rax, qword ptr [rax+rcx*8]
00007ff6`aa912ac8 c3                 ret

jkotas · 2023-04-13T02:54:02Z

src/coreclr/nativeaot/Runtime/threadstore.cpp

+    TypeManager* pSingleTypeManager = GetRuntimeInstance()->GetSingleTypeManager();
+    if (pSingleTypeManager != NULL)
+    {
+        InitInlineThreadStatics(pSingleTypeManager);


This will make thread attach a potential source of unhandled OOM that leads to fail fast. This fail fast is impossible for user code to catch or recover from. I am not sure whether it is a good trade-off to make for native AOT. It will make native AOT less suitable for system-programming like tasks where uncatchable OOMs are a problem.

One possible solution is to ignore the OOM on thread attach and live the storage as NULL and turn the AV on the first use of the threadstatic into an OOM exception. That could be inconvenient though when the access is JIT-inlined.

Or we can do one null-check for the whole thing and call the initializer on the first use.

I think I will switch this to "allocate on first use" pattern. In such case the OOM is most likely to happen while creating managed Thread, either directly or indirectly by accessing managed thread ID.
It would still be possible to access a random threadstatic and get an unexpected OOM, but I think it would be always catcheable.

VSadov · 2023-04-13T05:13:10Z

@MichalStrehovsky The codegen for the helper on Linux is indeed not very good. Unlike on Windows where it seems better than hand-written assembly.

System.Collections.Tests`::RhpGetThreadStaticBaseForType(uint32_t):
    0x55555542c040 <+0>:  push   rbp
    0x55555542c041 <+1>:  mov    rbp, rsp
    0x55555542c044 <+4>:  push   rbx
    0x55555542c045 <+5>:  push   rax
    0x55555542c046 <+6>:  mov    ebx, edi
->  0x55555542c048 <+8>:  mov    rax, qword ptr fs:[0x0]
    0x55555542c051 <+17>: lea    rax, [rax - 0xf0]
    0x55555542c058 <+24>: mov    rax, qword ptr [rax + 0x98]
    0x55555542c05f <+31>: add    ebx, 0x2
    0x55555542c062 <+34>: mov    rax, qword ptr [rax + 8*rbx]
    0x55555542c066 <+38>: add    rsp, 0x8
    0x55555542c06a <+42>: pop    rbx
    0x55555542c06b <+43>: pop    rbp
    0x55555542c06c <+44>: ret

This looks pretty bad. I wonder what makes the compiler confused.

VSadov · 2023-04-13T05:14:38Z

Are we forcing frame pointers somehow?

MichalStrehovsky · 2023-04-13T05:19:12Z

The frame is there on Linux because this is generated as a call. It gets turned into a mov by linker magic during linking. But linker magic is not able to mop up things around it. Linker magic can be only done when we produce executable. Not when we produce a shared library.

This was a good article on Linux TLS I read some time ago: https://maskray.me/blog/2021-02-14-all-about-thread-local-storage

VSadov · 2023-04-13T05:24:05Z

Call to __tls_get_addr@PLT should not be so detrimental. We should see roughly the same codegen as hand-written assembly.

VSadov · 2023-04-13T05:27:37Z

Right, we do disable frame pointer optimizations and that is likely causing this kind of code:

runtime/eng/native/configurecompiler.cmake

Line 407 in 0ac097f

add_compile_options(-fno-omit-frame-pointer)

The reason is "to make it easier to profile".
It seems a bit strange reason - to disable optimizations to make it easier to find perf problems...

Perhaps CoreClr has other reasons for requiring frame pointers (but why on Unix only?)
I think we should consider not doing that, at least for NativeAOT.

jkotas · 2023-04-13T05:36:10Z

The reason is "to make it easier to profile".

Profiling tools on Linux want to have RBP chain. We can double check whether it is still the case - I would expect it to be.

why on Unix only?

We do that on Windows too (look for /Oy-) and then flip it around perf critical methods using #include <optsmallperfcritical.h>. optsmallperfcritical.h does not do anything on non-Windows. Is it possible to fix that?

MichalStrehovsky · 2023-04-13T05:38:25Z

It seems a bit strange reason - to disable optimizations to make it easier to find perf problems...

Fedora had a big discussion about it last year (didn't follow where that went but it is likely still an issue if they were having heated discussions about it): https://www.phoronix.com/news/Fedora-37-No-Omit-Frame-Pointer

VSadov · 2023-04-13T05:40:50Z

Not suppressing that optimization results in as good codegen as there could be:

System.Collections.Tests`::RhpGetThreadStaticBaseForType(uint32_t):
    0x55555542be20 <+0>:  push   rbx
    0x55555542be21 <+1>:  mov    ebx, edi
->  0x55555542be23 <+3>:  mov    rax, qword ptr fs:[0x0]
    0x55555542be2c <+12>: lea    rax, [rax - 0xf0]
    0x55555542be33 <+19>: mov    rax, qword ptr [rax + 0x98]
    0x55555542be3a <+26>: add    ebx, 0x2
    0x55555542be3d <+29>: mov    rax, qword ptr [rax + 8*rbx]
    0x55555542be41 <+33>: pop    rbx
    0x55555542be42 <+34>: ret

VSadov · 2023-04-13T06:09:59Z

Profiling tools on Linux want to have RBP chain. We can double check whether it is still the case - I would expect it to be.

perf allows to choose DWARF, FP chain (a bit more lightweight), or LBR (very lightweight, but Intel-specific).

--call-graph
           Setup and enable call-graph (stack chain/backtrace) recording, implies -g. Default is
           "fp".

               Allows specifying "fp" (frame pointer) or "dwarf"
               (DWARF's CFI - Call Frame Information) or "lbr"
               (Hardware Last Branch Record facility) as the method to collect
               the information used to show the call graphs.

               In some systems, where binaries are build with gcc
               --fomit-frame-pointer, using the "fp" method will produce bogus
               call graphs, using "dwarf", if available (perf tools linked to
               the libunwind or libdw library) should be used instead.
               Using the "lbr" method doesn't require any compiler options. It
               will produce call graphs from the hardware LBR registers. The
               main limitation is that it is only available on new Intel
               platforms, such as Haswell. It can only get user call chain. It
               doesn't work with branch stack sampling at the same time.

               When "dwarf" recording is used, perf also records (user) stack dump
               when sampled.  Default size of the stack dump is 8192 (bytes).
               User can change the size by passing the size after comma like
               "--call-graph dwarf,4096".

link for the above

flip it around perf critical methods using #include <optsmallperfcritical.h>. optsmallperfcritical.h does not do anything on non-Windows. Is it possible to fix that?

This would probably be the best option, if possible.

MichalStrehovsky · 2023-04-13T09:27:44Z

The reason is "to make it easier to profile".

Profiling tools on Linux want to have RBP chain. We can double check whether it is still the case - I would expect it to be.

why on Unix only?

We do that on Windows too (look for /Oy-) and then flip it around perf critical methods using #include <optsmallperfcritical.h>. optsmallperfcritical.h does not do anything on non-Windows. Is it possible to fix that?

Thinking about this - does RyuJIT support the equivalent of -fno-omit-frame-pointers? Or do we just want it for native code?

jkotas · 2023-04-13T14:33:33Z

Thinking about this - does RyuJIT support the equivalent of -fno-omit-frame-pointers?

On platforms that can use frame chains (all platforms except Win x64), RyuJIT is smart about adding the frame to more complex methods only, so that the trivial methods do not pay for the frame. Also, it does not allocate the frame pointer register so that methods with omitted frame do not break the frame chain. For the frame chain walking profiler, it looks as if there was more method inlining - some methods are omitted in the frame chain, but the frame chain still gives you pretty accurate picture of the callstack. Look for rpMustCreateEBPFrame and ETW_EBP_FRAMED for details.

Also, there is JitFramed config that will force the frames on all methods in case somebody needs it.

VSadov · 2023-04-13T17:26:43Z

flip it around perf critical methods using #include <optsmallperfcritical.h>. optsmallperfcritical.h does not do anything on non-Windows. Is it possible to fix that?

GCC has multiple ways to do this (attributes, pragmas), but clang has none.

There is also -momit-leaf-frame-pointer, which I think we can use, but it does not help RhpGetThreadStaticBaseForType case since it is not technically a leaf.

VSadov · 2023-05-04T02:12:49Z

/azp run runtime-extra-platforms

azure-pipelines · 2023-05-04T02:13:00Z

Azure Pipelines successfully started running 1 pipeline(s).

VSadov · 2023-05-04T16:35:37Z

NativeAOT OSX failures are #85600

VSadov · 2023-05-04T16:42:16Z

mono failures appear to be unrelated. Either device connection, build or test failures that happen in a noop test run as well. #85694

VSadov added the area-NativeAOT-coreclr label Apr 10, 2023

ghost assigned VSadov Apr 10, 2023

VSadov mentioned this pull request Apr 10, 2023

[NativeAOT] Optimize the access pattern to threadstatics. #84373

Closed

VSadov commented Apr 10, 2023

View reviewed changes

kunalspathak reviewed Apr 11, 2023

View reviewed changes

MichalStrehovsky reviewed Apr 12, 2023

View reviewed changes

src/coreclr/nativeaot/Runtime/thread.cpp Outdated Show resolved Hide resolved

MichalStrehovsky reviewed Apr 12, 2023

View reviewed changes

src/coreclr/nativeaot/Runtime/threadstore.cpp Outdated Show resolved Hide resolved

VSadov force-pushed the tlsRef branch 2 times, most recently from db1c5d2 to 723e2ec Compare April 12, 2023 16:26

jkotas reviewed Apr 13, 2023

View reviewed changes

VSadov added 21 commits May 3, 2023 18:52

report offsets to JIT

c649f9d

reflection

ce5ba65

add root, layout tweaks, couple fixes

f2c993a

all pass

9e60d58

more compact storage

530e7b3

unix x64 asm helper

18395e9

arm64

ad25223

tweak arm64

2671c9d

simplify GetThreadStaticBaseForType

13270c6

renames

383961c

Undo unnecessary changes

b802437

a few tweaks

d88d024

add some comments and asserts

79cfc7e

INLINE_GET_TLS_VAR for win-x64

1802dd7

moved the managed tls root off the thread

76b4884

unix x64

39cdd4c

unix arm64

b81666f

win-arm64

567f5eb

adjust dwarf test for fewer warnings

cfc81f0

remove unused asm offsets

1ee3b96

PR feedback

419af38

VSadov force-pushed the tlsRef branch from a854ec8 to 419af38 Compare May 4, 2023 02:10

VSadov merged commit 36c507f into dotnet:main May 4, 2023
156 of 170 checks passed

VSadov deleted the tlsRef branch May 4, 2023 16:45

dotnet locked as resolved and limited conversation to collaborators Jun 3, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[NativeAOT] Simplifying access to thread static variables #84566

[NativeAOT] Simplifying access to thread static variables #84566

VSadov commented Apr 10, 2023 •

edited

ghost commented Apr 10, 2023

VSadov Apr 10, 2023

kunalspathak Apr 11, 2023

VSadov Apr 23, 2023

kunalspathak Apr 11, 2023

VSadov commented Apr 11, 2023

kunalspathak commented Apr 11, 2023

VSadov commented Apr 11, 2023

kunalspathak commented Apr 11, 2023

VSadov commented Apr 11, 2023

VSadov commented Apr 12, 2023

VSadov commented Apr 12, 2023

jkotas Apr 13, 2023

VSadov Apr 13, 2023

VSadov Apr 13, 2023

VSadov commented Apr 13, 2023 •

edited

VSadov commented Apr 13, 2023

MichalStrehovsky commented Apr 13, 2023

VSadov commented Apr 13, 2023

VSadov commented Apr 13, 2023

jkotas commented Apr 13, 2023

MichalStrehovsky commented Apr 13, 2023

VSadov commented Apr 13, 2023

VSadov commented Apr 13, 2023 •

edited

MichalStrehovsky commented Apr 13, 2023

jkotas commented Apr 13, 2023 •

edited

VSadov commented Apr 13, 2023

VSadov commented May 4, 2023

azure-pipelines bot commented May 4, 2023

VSadov commented May 4, 2023

VSadov commented May 4, 2023

		@@ -41,37 +39,18 @@ LEAF_END RhpStackProbe, _TEXT

		LEAF_ENTRY RhpGetThreadStaticBaseForType, _TEXT

[NativeAOT] Simplifying access to thread static variables #84566

[NativeAOT] Simplifying access to thread static variables #84566

Conversation

VSadov commented Apr 10, 2023 • edited

ghost commented Apr 10, 2023

VSadov Apr 10, 2023

Choose a reason for hiding this comment

kunalspathak Apr 11, 2023

Choose a reason for hiding this comment

VSadov Apr 23, 2023

Choose a reason for hiding this comment

kunalspathak Apr 11, 2023

Choose a reason for hiding this comment

VSadov commented Apr 11, 2023

kunalspathak commented Apr 11, 2023

VSadov commented Apr 11, 2023

kunalspathak commented Apr 11, 2023

VSadov commented Apr 11, 2023

VSadov commented Apr 12, 2023

VSadov commented Apr 12, 2023

jkotas Apr 13, 2023

Choose a reason for hiding this comment

VSadov Apr 13, 2023

Choose a reason for hiding this comment

VSadov Apr 13, 2023

Choose a reason for hiding this comment

VSadov commented Apr 13, 2023 • edited

VSadov commented Apr 13, 2023

MichalStrehovsky commented Apr 13, 2023

VSadov commented Apr 13, 2023

VSadov commented Apr 13, 2023

jkotas commented Apr 13, 2023

MichalStrehovsky commented Apr 13, 2023

VSadov commented Apr 13, 2023

VSadov commented Apr 13, 2023 • edited

MichalStrehovsky commented Apr 13, 2023

jkotas commented Apr 13, 2023 • edited

VSadov commented Apr 13, 2023

VSadov commented May 4, 2023

azure-pipelines bot commented May 4, 2023

VSadov commented May 4, 2023

VSadov commented May 4, 2023

VSadov commented Apr 10, 2023 •

edited

VSadov commented Apr 13, 2023 •

edited

VSadov commented Apr 13, 2023 •

edited

jkotas commented Apr 13, 2023 •

edited