Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[NativeAOT] Simplifying access to thread static variables #84566

Merged
merged 30 commits into from May 4, 2023

Conversation

VSadov
Copy link
Member

@VSadov VSadov commented Apr 10, 2023

Fixes: #84373

  • separate "fast" inlinable case . (used for singlemodule, not dynamic cases, when optimizing)
  • make the storage for fast threadstatics a single "combo" instance instead of array of instances.

@ghost
Copy link

ghost commented Apr 10, 2023

Tagging subscribers to this area: @agocke, @MichalStrehovsky, @jkotas
See info in area-owners.md if you want to be subscribed.

Issue Details

Fixes: #84373

  • separate "fast" inlinable case. (no multimodule, not dynamic)
  • preinitialize storage for fast threadstatics on thread attach
  • consider making storage for fast threadstatics a single "combo" instance instead of array of instances.
Author: VSadov
Assignees: -
Labels:

area-NativeAOT-coreclr

Milestone: -

@@ -41,37 +39,18 @@ LEAF_END RhpStackProbe, _TEXT

LEAF_ENTRY RhpGetThreadStaticBaseForType, _TEXT
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The "fast" access got a lot simpler here. RhpGetThreadStaticBaseForType is what we want eventually not called, but directly inlined by the JIT into callers.

We could yet make this a bit simpler by removing an indirection into the array of storage instances. It may be somewhat challenging though.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any other technical reason why we don't have such assembly helper for Arm64 apart that it is not implemented?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The helper for Arm64 is now implemented

src/coreclr/nativeaot/Runtime/amd64/MiscStubs.S Outdated Show resolved Hide resolved
src/coreclr/nativeaot/Runtime/amd64/MiscStubs.asm Outdated Show resolved Hide resolved
@@ -41,37 +39,18 @@ LEAF_END RhpStackProbe, _TEXT

LEAF_ENTRY RhpGetThreadStaticBaseForType, _TEXT
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any other technical reason why we don't have such assembly helper for Arm64 apart that it is not implemented?

@VSadov
Copy link
Member Author

VSadov commented Apr 11, 2023

Is there any other technical reason why we don't have such assembly helper for Arm64 apart that it is not implemented?

no reason. Also no reason for the helper to be in assembly.
I plan to replace this with c++ and call on all architectures. Until eventually it is inlined.

@kunalspathak
Copy link
Member

I plan to replace this with c++ and call on all architectures

are you thinking about it as part of this PR?

@VSadov
Copy link
Member Author

VSadov commented Apr 11, 2023

are you thinking about it as part of this PR?

yes. It would also make it easier to do further changes - no need to fix multiple helpers.

@kunalspathak
Copy link
Member

What is the speedup from just this change?

@VSadov
Copy link
Member Author

VSadov commented Apr 11, 2023

What is the speedup from just this change?

i think the cost of the call is still there, so this might not be a huge improvement.
it would be interesting to measure after couple more “easy” changes like moving to c++ and not passing unused arguments.

@VSadov VSadov force-pushed the tlsRef branch 2 times, most recently from db1c5d2 to 723e2ec Compare April 12, 2023 16:26
@VSadov
Copy link
Member Author

VSadov commented Apr 12, 2023

What is the speedup from just this change?

I see about 15% speedup on x64. That is with both old and new implementations still making a call to the helper.

With a simple microbenchmark which calls a method accessing bunch of threadstatics in a loop I see:
(time, lower is better)

=== before the change:
4745
4747
4749

=== after the change:
4068
4082
4074

=== coreclr (JIT) 
7250
7272
7262

the benchmark:

using System.Diagnostics;
using System.Runtime.CompilerServices;

namespace ConsoleApp22
{
    class C00 {[ThreadStatic] public static int t_i00;}
    class C01 {[ThreadStatic] public static int t_i01;}
    class C02 {[ThreadStatic] public static int t_i02;}
    class C03 {[ThreadStatic] public static int t_i03;}
    class C04 {[ThreadStatic] public static int t_i04;}
    class C05 {[ThreadStatic] public static int t_i05;}
    class C06 {[ThreadStatic] public static int t_i06;}
    class C07 {[ThreadStatic] public static int t_i07;}
    class C08 {[ThreadStatic] public static int t_i08;}
    class C09 {[ThreadStatic] public static int t_i09;}
    class C10 {[ThreadStatic] public static int t_i10;}
    class C11 {[ThreadStatic] public static int t_i11;}
    class C12 {[ThreadStatic] public static int t_i12;}
    class C13 {[ThreadStatic] public static int t_i13;}
    class C14 {[ThreadStatic] public static int t_i14;}
    class C15 {[ThreadStatic] public static int t_i15;}
    class C16 {[ThreadStatic] public static int t_i16;}
    class C17 {[ThreadStatic] public static int t_i17;}
    class C18 {[ThreadStatic] public static int t_i18;}
    class C19 {[ThreadStatic] public static int t_i19;}
    class C20 {[ThreadStatic] public static int t_i20;}
    class C21 {[ThreadStatic] public static int t_i21;}
    class C22 {[ThreadStatic] public static int t_i22;}
    class C23 {[ThreadStatic] public static int t_i23;}
    class C24 {[ThreadStatic] public static int t_i24;}
    class C25 {[ThreadStatic] public static int t_i25;}
    class C26 {[ThreadStatic] public static int t_i26;}
    class C27 {[ThreadStatic] public static int t_i27;}
    class C28 {[ThreadStatic] public static int t_i28;}
    class C29 {[ThreadStatic] public static int t_i29;}

    internal class Program
    {
        const int iters = 1000000;
        static void Main(string[] args)
        {
            for (; ; )
            {
                Time(AccessTLS);
            }
        }

        static void Time(Action a)
        {
            var sw = Stopwatch.StartNew();
            for (int i = 0; i < 100; i++)
            {
                a();
            }

            sw.Stop();
            System.Console.WriteLine(sw.ElapsedMilliseconds);
        }

        static void AccessTLS()
        {
            for (int i = 0; i < iters; i++)
            {
                OneTLSAccess();
            }
        }

        [MethodImpl(MethodImplOptions.NoInlining)]
        private static void OneTLSAccess()
        {
            C00.t_i00 = C00.t_i00 +C01.t_i01 + C02.t_i02 +C03.t_i03 + C04.t_i04 +C05.t_i05 + C06.t_i06 +C07.t_i07 +
                        C08.t_i08 +C09.t_i09 + C10.t_i10 +C11.t_i11 + C12.t_i12 +C13.t_i13 + C14.t_i14 +C15.t_i15 +
                        C16.t_i16 +C17.t_i17 + C18.t_i18 +C19.t_i19 + C20.t_i20 +C21.t_i21 + C22.t_i22 +C23.t_i23 +
                        C24.t_i24 +C25.t_i25 + C26.t_i26 +C27.t_i27 + C28.t_i28 +C29.t_i29;
        }
    }
}

@VSadov
Copy link
Member Author

VSadov commented Apr 12, 2023

The codegen looks like the folllowing:

before the changes:

=== thunk
00007ff6`20391907 4c8d0512720f00 lea     r8, [7FF620488B20h]
00007ff6`2039190e 498b08         mov     rcx, qword ptr [r8]
00007ff6`20391911 418b5008       mov     edx, dword ptr [r8+8]
00007ff6`20391915 e9c6050000     jmp     00007FF620391EE0
 
=== helper
00007ff6`20391ee0 8b0512f41900         mov     eax, dword ptr [7FF6205312F8h]
00007ff6`20391ee6 654c8b042558000000   mov     r8, qword ptr gs:[58h]
00007ff6`20391eef 4d8b04c0             mov     r8, qword ptr [r8+rax*8]
00007ff6`20391ef3 b810000000           mov     eax, 10h
00007ff6`20391ef8 4903c0               add     rax, r8
00007ff6`20391efb 448b4108             mov     r8d, dword ptr [rcx+8]
00007ff6`20391eff 488b8090000000       mov     rax, qword ptr [rax+90h]
00007ff6`20391f06 4885c0               test    rax, rax
00007ff6`20391f09 0f84717a0b00         je      00007FF620449980
00007ff6`20391f0f 443b4008             cmp     r8d, dword ptr [rax+8]
00007ff6`20391f13 0f83677a0b00         jae     00007FF620449980
00007ff6`20391f19 4a8b44c010           mov     rax, qword ptr [rax+r8*8+10h]
00007ff6`20391f1e 4885c0               test    rax, rax
00007ff6`20391f21 0f84597a0b00         je      00007FF620449980
00007ff6`20391f27 3b5008               cmp     edx, dword ptr [rax+8]
00007ff6`20391f2a 0f83507a0b00         jae     00007FF620449980
00007ff6`20391f30 488b44d010           mov     rax, qword ptr [rax+rdx*8+10h]
00007ff6`20391f35 4885c0               test    rax, rax
00007ff6`20391f38 0f84427a0b00         je      00007FF620449980
00007ff6`20391f3e c3                   ret   

after the changes:  

=== thunk:
00007ff6`aa91198e 488d15db710f00 lea     rdx, [7FF6AAA08B70h]
00007ff6`aa911995 8b4a08         mov     ecx, dword ptr [rdx+8]
00007ff6`aa911998 e903110000     jmp     00007FF6AA912AA0
 
=== helper
00007ff6`aa912aa0 65488b042558000000 mov     rax, qword ptr gs:[58h]
00007ff6`aa912aa9 83c102             add     ecx, 2
00007ff6`aa912aac 8b1526f81900       mov     edx, dword ptr [7FF6AAAB22D8h]
00007ff6`aa912ab2 41b910000000       mov     r9d, 10h
00007ff6`aa912ab8 4c8b04d0           mov     r8, qword ptr [rax+rdx*8]
00007ff6`aa912abc 4b8b840898000000   mov     rax, qword ptr [r8+r9+98h]
00007ff6`aa912ac4 488b04c8           mov     rax, qword ptr [rax+rcx*8]
00007ff6`aa912ac8 c3                 ret     

TypeManager* pSingleTypeManager = GetRuntimeInstance()->GetSingleTypeManager();
if (pSingleTypeManager != NULL)
{
InitInlineThreadStatics(pSingleTypeManager);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will make thread attach a potential source of unhandled OOM that leads to fail fast. This fail fast is impossible for user code to catch or recover from. I am not sure whether it is a good trade-off to make for native AOT. It will make native AOT less suitable for system-programming like tasks where uncatchable OOMs are a problem.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One possible solution is to ignore the OOM on thread attach and live the storage as NULL and turn the AV on the first use of the threadstatic into an OOM exception. That could be inconvenient though when the access is JIT-inlined.

Or we can do one null-check for the whole thing and call the initializer on the first use.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I will switch this to "allocate on first use" pattern. In such case the OOM is most likely to happen while creating managed Thread, either directly or indirectly by accessing managed thread ID.
It would still be possible to access a random threadstatic and get an unexpected OOM, but I think it would be always catcheable.

@VSadov
Copy link
Member Author

VSadov commented Apr 13, 2023

@MichalStrehovsky The codegen for the helper on Linux is indeed not very good. Unlike on Windows where it seems better than hand-written assembly.

System.Collections.Tests`::RhpGetThreadStaticBaseForType(uint32_t):
    0x55555542c040 <+0>:  push   rbp
    0x55555542c041 <+1>:  mov    rbp, rsp
    0x55555542c044 <+4>:  push   rbx
    0x55555542c045 <+5>:  push   rax
    0x55555542c046 <+6>:  mov    ebx, edi
->  0x55555542c048 <+8>:  mov    rax, qword ptr fs:[0x0]
    0x55555542c051 <+17>: lea    rax, [rax - 0xf0]
    0x55555542c058 <+24>: mov    rax, qword ptr [rax + 0x98]
    0x55555542c05f <+31>: add    ebx, 0x2
    0x55555542c062 <+34>: mov    rax, qword ptr [rax + 8*rbx]
    0x55555542c066 <+38>: add    rsp, 0x8
    0x55555542c06a <+42>: pop    rbx
    0x55555542c06b <+43>: pop    rbp
    0x55555542c06c <+44>: ret

This looks pretty bad. I wonder what makes the compiler confused.

@VSadov
Copy link
Member Author

VSadov commented Apr 13, 2023

Are we forcing frame pointers somehow?

@MichalStrehovsky
Copy link
Member

The frame is there on Linux because this is generated as a call. It gets turned into a mov by linker magic during linking. But linker magic is not able to mop up things around it. Linker magic can be only done when we produce executable. Not when we produce a shared library.

This was a good article on Linux TLS I read some time ago: https://maskray.me/blog/2021-02-14-all-about-thread-local-storage

@VSadov
Copy link
Member Author

VSadov commented Apr 13, 2023

Call to __tls_get_addr@PLT should not be so detrimental. We should see roughly the same codegen as hand-written assembly.

@VSadov
Copy link
Member Author

VSadov commented Apr 13, 2023

Right, we do disable frame pointer optimizations and that is likely causing this kind of code:

add_compile_options(-fno-omit-frame-pointer)

The reason is "to make it easier to profile".
It seems a bit strange reason - to disable optimizations to make it easier to find perf problems...

Perhaps CoreClr has other reasons for requiring frame pointers (but why on Unix only?)
I think we should consider not doing that, at least for NativeAOT.

@jkotas
Copy link
Member

jkotas commented Apr 13, 2023

The reason is "to make it easier to profile".

Profiling tools on Linux want to have RBP chain. We can double check whether it is still the case - I would expect it to be.

why on Unix only?

We do that on Windows too (look for /Oy-) and then flip it around perf critical methods using #include <optsmallperfcritical.h>. optsmallperfcritical.h does not do anything on non-Windows. Is it possible to fix that?

@MichalStrehovsky
Copy link
Member

It seems a bit strange reason - to disable optimizations to make it easier to find perf problems...

Fedora had a big discussion about it last year (didn't follow where that went but it is likely still an issue if they were having heated discussions about it): https://www.phoronix.com/news/Fedora-37-No-Omit-Frame-Pointer

@VSadov
Copy link
Member Author

VSadov commented Apr 13, 2023

Not suppressing that optimization results in as good codegen as there could be:

System.Collections.Tests`::RhpGetThreadStaticBaseForType(uint32_t):
    0x55555542be20 <+0>:  push   rbx
    0x55555542be21 <+1>:  mov    ebx, edi
->  0x55555542be23 <+3>:  mov    rax, qword ptr fs:[0x0]
    0x55555542be2c <+12>: lea    rax, [rax - 0xf0]
    0x55555542be33 <+19>: mov    rax, qword ptr [rax + 0x98]
    0x55555542be3a <+26>: add    ebx, 0x2
    0x55555542be3d <+29>: mov    rax, qword ptr [rax + 8*rbx]
    0x55555542be41 <+33>: pop    rbx
    0x55555542be42 <+34>: ret

@VSadov
Copy link
Member Author

VSadov commented Apr 13, 2023

Profiling tools on Linux want to have RBP chain. We can double check whether it is still the case - I would expect it to be.

perf allows to choose DWARF, FP chain (a bit more lightweight), or LBR (very lightweight, but Intel-specific).

--call-graph
           Setup and enable call-graph (stack chain/backtrace) recording, implies -g. Default is
           "fp".

               Allows specifying "fp" (frame pointer) or "dwarf"
               (DWARF's CFI - Call Frame Information) or "lbr"
               (Hardware Last Branch Record facility) as the method to collect
               the information used to show the call graphs.

               In some systems, where binaries are build with gcc
               --fomit-frame-pointer, using the "fp" method will produce bogus
               call graphs, using "dwarf", if available (perf tools linked to
               the libunwind or libdw library) should be used instead.
               Using the "lbr" method doesn't require any compiler options. It
               will produce call graphs from the hardware LBR registers. The
               main limitation is that it is only available on new Intel
               platforms, such as Haswell. It can only get user call chain. It
               doesn't work with branch stack sampling at the same time.

               When "dwarf" recording is used, perf also records (user) stack dump
               when sampled.  Default size of the stack dump is 8192 (bytes).
               User can change the size by passing the size after comma like
               "--call-graph dwarf,4096".

link for the above

flip it around perf critical methods using #include <optsmallperfcritical.h>. optsmallperfcritical.h does not do anything on non-Windows. Is it possible to fix that?

This would probably be the best option, if possible.

@MichalStrehovsky
Copy link
Member

The reason is "to make it easier to profile".

Profiling tools on Linux want to have RBP chain. We can double check whether it is still the case - I would expect it to be.

why on Unix only?

We do that on Windows too (look for /Oy-) and then flip it around perf critical methods using #include <optsmallperfcritical.h>. optsmallperfcritical.h does not do anything on non-Windows. Is it possible to fix that?

Thinking about this - does RyuJIT support the equivalent of -fno-omit-frame-pointers? Or do we just want it for native code?

@jkotas
Copy link
Member

jkotas commented Apr 13, 2023

Thinking about this - does RyuJIT support the equivalent of -fno-omit-frame-pointers?

On platforms that can use frame chains (all platforms except Win x64), RyuJIT is smart about adding the frame to more complex methods only, so that the trivial methods do not pay for the frame. Also, it does not allocate the frame pointer register so that methods with omitted frame do not break the frame chain. For the frame chain walking profiler, it looks as if there was more method inlining - some methods are omitted in the frame chain, but the frame chain still gives you pretty accurate picture of the callstack. Look for rpMustCreateEBPFrame and ETW_EBP_FRAMED for details.

Also, there is JitFramed config that will force the frames on all methods in case somebody needs it.

@VSadov
Copy link
Member Author

VSadov commented Apr 13, 2023

flip it around perf critical methods using #include <optsmallperfcritical.h>. optsmallperfcritical.h does not do anything on non-Windows. Is it possible to fix that?

GCC has multiple ways to do this (attributes, pragmas), but clang has none.

There is also -momit-leaf-frame-pointer, which I think we can use, but it does not help RhpGetThreadStaticBaseForType case since it is not technically a leaf.

@VSadov
Copy link
Member Author

VSadov commented May 4, 2023

/azp run runtime-extra-platforms

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@VSadov
Copy link
Member Author

VSadov commented May 4, 2023

NativeAOT OSX failures are #85600

@VSadov
Copy link
Member Author

VSadov commented May 4, 2023

mono failures appear to be unrelated. Either device connection, build or test failures that happen in a noop test run as well. #85694

@VSadov VSadov merged commit 36c507f into dotnet:main May 4, 2023
156 of 170 checks passed
@VSadov VSadov deleted the tlsRef branch May 4, 2023 16:45
@dotnet dotnet locked as resolved and limited conversation to collaborators Jun 3, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[NativeAOT] Optimize the access pattern to threadstatics.
4 participants