
[NativeAOT] ThreadStatics part 2 #87148

Merged: 24 commits into dotnet:main on Jun 23, 2023
Conversation

@VSadov (Member) commented Jun 5, 2023

The change allows inlining of the access to thread static variables in the accessing module.
The optimization is no longer affected by whether we have a single-module or multi-module scenario. (A conceptual sketch of the inlined accessor is included after the checklist below.)

We only do win-x64 and linux-x64 here. That is enough to discuss and test the design.
The other platform combinations may come as a separate change and will simply follow the same plan.

The change requires dotnet/llvm-project#429 and dotnet/llvm-project#438 in order to pass tests.


  • remove GetSingleTypeManager
  • remove RhGetInlinedThreadStaticStorage
  • remove RhpGetInlinedThreadStaticBase
  • remove tls_InlinedThreadStatics from the native runtime (C++)
  • make tls_InlinedThreadStatics emitted by ILC and inline the access code into the thunks
    • Linux x64
    • Linux arm64
    • Windows x64
    • Windows arm64 - similar to platforms above, but harder to test. Will likely be a separate PR.
    • MacOS - there are some issues with convincing ObjWriter to emit appropriate relocs. Will likely be a separate PR.
  • instruct JIT to inline TLS base access into the actual callsites - will be a separate PR.
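
For orientation, here is a rough C# sketch of what the inlined accessor conceptually does. This is an illustration only: the real fast path is emitted directly as machine code by ILC, and the field and method names below (t_inlinedThreadStatics, GetInlinedThreadStaticBase) are placeholders rather than actual generated symbols.

    // Placeholder for the ILC-emitted TLS slot (tls_InlinedThreadStatics): one storage
    // object per thread per module, holding all inlined thread static fields.
    [ThreadStatic]
    private static object? t_inlinedThreadStatics;

    private static object GetInlinedThreadStaticBase()
    {
        // Fast path: the storage block for this thread already exists. This is the
        // part that the change inlines into the accessing module.
        object? storage = t_inlinedThreadStatics;
        if (storage != null)
            return storage;

        // Slow path, taken once per thread: allocate the block that represents all
        // inlined thread static fields (see GetInlinedThreadStaticBaseSlow quoted
        // later in this conversation).
        return GetInlinedThreadStaticBaseSlow(ref t_inlinedThreadStatics);
    }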

@ghost commented Jun 5, 2023

Tagging subscribers to this area: @agocke, @MichalStrehovsky, @jkotas
See info in area-owners.md if you want to be subscribed.

Issue Details

In progress:

  • remove GetSingleTypeManager
  • remove RhGetInlinedThreadStaticStorage
  • move tls_InlinedThreadStatics from c++ to be emitted by ILC
    • ELF
    • COFF - surprisingly nontrivial. (There is a hack that works for unknown reasons, but we need to do this right.)
    • Mach-O - any docs?
  • remove RhpGetInlinedThreadStaticBase helpers and inline TLS base access into the thunks
    (temporarily, just to prototype and test that it works)
    • ELF
    • COFF
  • instruct JIT to inline TLS base access into the actual callsites
Author: VSadov
Assignees: VSadov
Labels: area-NativeAOT-coreclr
Milestone: -

@filipnavara (Member) commented:
  • Mach-O - any docs on how to emit TLS variables?

https://www.jakubkonka.com/2021/01/21/llvm-tls-apple-silicon.html

The tricky part is that there's one thing emitted in the .o file, and the pattern is then recognized by the ld64 linker and rewritten. If done incorrectly you get garbage. I may have a look at it eventually if you skip over that part.

@VSadov (Member, Author) commented Jun 6, 2023

The tricky part is that there's one thing emitted in the .o file, and the pattern is then recognized by the ld64 linker and rewritten.

Yes. TLS variables need some degree of rewriting. What is stored in the object/executable file will need to be picked up by the linker/loader and become a "template" that is used at thread start (or first access) to initialize the actual TLS storage.

The patterns are wildly different:

  • ELF is the simplest: anything in .tdata and .tbss will end up in TLS.
  • Windows seems to require laying out the actual TLS template somewhere in the binary; the location does not matter much, it just needs to be contiguous. What makes it TLS is referencing it via _tls_used, _tls_start, _tls_end, etc.
    However, that is done by the linker. A compilation unit just needs to put the variables into the .tls$ section.
  • Mach-O - ???
    If it is just placing the variables into __thread_vars and __thread_bss, it would not be a lot different from ELF.

@filipnavara (Member) commented Jun 6, 2023

  • If it is just placing the variables into __thread_vars and __thread_bss, it would not be a lot different from ELF.

IIRC yes, that's one part of it. The other part is that the linker must recognize the code accessing it as a code section (which should already be the case for anything emitted in the managed code section), and it must use the exact code pattern, not just the correct relocation type.

{
    ISymbolNode getInlinedThreadStaticBaseSlow = factory.HelperEntrypoint(HelperEntrypoint.GetInlinedThreadStaticBaseSlow);
    ISymbolNode tlsRoot = factory.TlsRoot;
    bool singleFileExe = factory.CompilationModuleGroup.IsSingleFileCompilation;

@VSadov (Member, Author) commented Jun 11, 2023:

We can emit more compact code when we know it is going to be a single exe. This is particularly useful on Windows, since there are no linker relaxations, but it is mildly advantageous on Linux as well.

Member:

Is this assuming that _tls_index is going to be zero when IsSingleFileCompilation is set? I do not think it is a safe assumption to make. IsSingleFileCompilation does not guarantee that the output is an .exe.

I am not sure whether this is a safe assumption to make even for .exes - is that documented somewhere, or have you seen any Windows C++ compiler doing this optimization?

@VSadov (Member, Author) commented Jun 12, 2023:

_tls_index would be 0 in the most common case that we support right now, where the TLS variable is used from the same module that defines it and that is known at link time. I am not sure if IsSingleFileCompilation guarantees that, but it would be useful if we could know when we have such a case, as inlining shorter code is attractive.
Overall, since in such cases _tls_index will be a constant 0 at link time, I think this is a minor optimization: it mostly just makes the code shorter. It does not reduce the number of memory accesses or indirections. It is not a big deal if we can't do this.

The C++ compiler does this optimization when the /GA switch (optimize for application) is used, but it is also known that it may cause trouble in rare cases:

It’s worth noting that in the (extremely rare) case that a .exe exports functions and is imported by another .exe, this optimization will cause random corruption if the imported .exe happens to use __declspec(thread).

Source: http://www.nynaeve.net/?p=185
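
To make the difference concrete, here is an illustrative comparison of the two standard win-x64 TLS access sequences being discussed, written out as raw bytes in the style the helpers in this PR use. This is not code from the PR; the class and field names are placeholders, and relocation/offset fields are left as zeros.

    // Illustrative only: standard win-x64 TLS access sequences, not code from this PR.
    static class WindowsTlsAccessSequences
    {
        // General form: load this module's TLS slot index, then index the per-thread
        // TLS array that the TEB keeps at gs:[0x58] (ThreadLocalStoragePointer).
        public static readonly byte[] General =
        {
            0x8B, 0x0D, 0x00, 0x00, 0x00, 0x00,                   // mov ecx, dword ptr [rip + _tls_index]
            0x65, 0x48, 0x8B, 0x04, 0x25, 0x58, 0x00, 0x00, 0x00, // mov rax, qword ptr gs:[0x58]
            0x48, 0x8B, 0x04, 0xC8,                               // mov rax, qword ptr [rax + rcx*8]
        };

        // /GA-style form: assume _tls_index == 0, so the module's block is simply the
        // first entry of the array. Shorter, but only valid when the module's TLS index
        // really is 0 (e.g. not a dynamically loaded library).
        public static readonly byte[] AssumeTlsIndexZero =
        {
            0x65, 0x48, 0x8B, 0x04, 0x25, 0x58, 0x00, 0x00, 0x00, // mov rax, qword ptr gs:[0x58]
            0x48, 0x8B, 0x00,                                     // mov rax, qword ptr [rax]
        };
    }

In both cases rax ends up pointing at the module's TLS block, and the thread static is then reached at a fixed offset from it.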

Member:

I did not know about the /GA switch. IsSingleFileCompilation does not have the right semantics for this. We would need a new compiler switch that says we are building an application.

@VSadov (Member, Author) commented Jun 12, 2023:

The same can be said about the Linux case. The GD form is the safest to emit, and the linker will relax it to the LE form when that is OK.

Emitting the Initial Exec model is mostly a size saving, but it could be attractive, considering that it is a very common case for us and that smaller code is nice when inlining.
The applicability of the model is defined slightly differently than on Windows, but I think it means roughly the same thing: not dynamically loaded and not accessed cross-module.

A more restrictive optimization is usable if the variables accessed are known to be in one of the modules available at program start and if the programmer selects to use the static access model.

It is a good option if we can do this, but emitting the GD form is not so much worse that we should be too worried about it.

Member:

Yes, this optimization can be enabled on both Windows and Unix, via a new compiler switch.

@VSadov (Member, Author) commented Jun 12, 2023:

To put things in perspective: on Linux the Initial Exec sequence is the same 16 bytes as GD, but it does not need an additional mov rdi, rax after it, since there is more flexibility in where the result can land when a possible method call is not implied.

It saves some, but it is not a huge saving overall.

@VSadov (Member, Author):

IsSingleFileCompilation does not guarantee that the output is .exe.

Just confirmed that this is correct. IsSingleFileCompilation is not enough and can break easily on Windows in the case of dynamically loaded .dlls. We really need a switch analogous to /GA in order to do Initial Exec optimizations.

It does not seem to break on Linux, but the docs say it must be "modules available at program start", so it could be just luck or some limited tolerance to abuse. We still need a switch.

@VSadov (Member, Author) commented Jun 14, 2023:

I have short-circuited the "Initial Exec" optimizations, but left the code in place, assuming we will implement a compiler switch similar to /GA.

@VSadov (Member, Author):

TLS relocations are more complex on linux-arm64 for the dynamic case, while the LE case is still quite simple. That makes support for a switch like /GA / -fno-pic / -fpie a lot more interesting.

@VSadov (Member, Author) commented Jun 14, 2023

I think this is ready for a review.

@MichalStrehovsky (Member) commented:
Do we still need the global pre-pass to give indices to things? The C++ compiler doesn't need a global optimization pass to access thread locals.

@VSadov (Member, Author) commented Jun 14, 2023

Do we still need the global pre-pass to give indices to things?

That was needed because the compiler is concurrent and we may need to emit field accesses before we know whether they are inlined in the TLS block and in what position. I think this reason is still there.

The C++ compiler doesn't need a global optimization pass to access thread locals

Likely because every C++ thread-static variable can be referenced symbolically.

We could have such a model if we had a TLS root per type, not per module. There was a concern that such an approach would not scale:
#84566 (comment)

For example, we would need to walk the list of such roots when reporting to the GC. If you have 100 thread statics, you might end up with a 100-item linked list to walk per thread per GC (twice if there is a background marking pass, three times if pointers need to be updated due to compaction).
Reporting one block per thread would be cheaper. In particular, if most of these thread statics are not objects, then no further tracing is required on the GC side other than marking the containing block as alive.

@@ -101,7 +101,7 @@ public static GCPointerMap FromStaticLayout(MetadataType type)
             return builder.ToGCMap();
         }

-        private static void MapThreadStaticsForType(GCPointerMapBuilder builder, MetadataType type, int baseOffset)
+        private static void MapThreadStaticsForType(ref GCPointerMapBuilder builder, MetadataType type, int baseOffset)

@VSadov (Member, Author):

The ref part is a follow-up change that somehow missed the previous PR.
#84566 (comment)

@MichalStrehovsky (Member) commented:
Likely because every c++ threadstatic variable can be referenced symbolically.

I don't think there's a reason to allocate thread statics that don't have GC references on the GC heap, other than ".NET Native did it this way and that's what we inherited". If we wanted thread statics to be as fast as C++, we'd update this:

protected override bool ComputeHasGCStaticBase(FieldDesc field)
{
    Debug.Assert(field.IsStatic);

    if (field.IsThreadStatic)
        return true;

    TypeDesc fieldType = field.FieldType;
    if (fieldType.IsValueType)
        return ((DefType)fieldType).ContainsGCPointers;
    else
        return fieldType.IsGCPointer;
}
The idea would be to not place them on the GC static base in the type system, and to emit them like the C++ compiler would.
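
A minimal sketch of the direction being suggested (this is a paraphrase of the suggestion, not code from this PR): treat thread statics by the same GC-pointer criteria as other statics, so purely unmanaged thread statics would no longer be forced onto the GC static base.

    protected override bool ComputeHasGCStaticBase(FieldDesc field)
    {
        Debug.Assert(field.IsStatic);

        // Thread statics are no longer unconditionally routed to the GC static base;
        // only fields that actually contain GC pointers end up there.
        TypeDesc fieldType = field.FieldType;
        if (fieldType.IsValueType)
            return ((DefType)fieldType).ContainsGCPointers;
        else
            return fieldType.IsGCPointer;
    }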

In particular if most these threadstatics are not objects, then no further tracing is required on the GC side other than marking the containing block as alive.

IIRC this is how thread statics worked before. We could group the individual roots together using linker magic: emitting things into a .foo$i section, and something else into .foo$a and .foo$z sections, places everything into the .foo section, but the linker sorts the pieces alphabetically, so if one refers to something in .foo$a, one is guaranteed that it comes before everything in .foo$i. I think we could use this to group all TLS symbols that refer to things on the GC heap into a single block that we could address as one.

@VSadov (Member, Author) commented Jun 15, 2023

I don't think there's a reason to allocate threadstatics that don't have GC references on the GC heap

I think we have already touched on the idea of placing unmanaged thread statics separately from managed ones. It is technically possible and the current scheme does not close the door to that, but it is unlikely we will have time for that before 9.0.

For managed thread statics the current scheme is nearly optimal.
Since the roots live in native memory, they are not served by GC barriers/write-watch and cannot participate in incremental marking, and thus must be reported during STW pauses. Time is more precious when the user's code is not running, so lots of roots will hurt at some point, and the fewer roots we have to report the better. One root per individually compiled module is the minimum we can have without introducing indirection and indexing by module on every thread-static access.

Whether linker tricks can be used to predict/guarantee the location of the fields within the managed storage block and remove the need for pre-scanning, I do not know. I think this can also be left as a future improvement.

@MichalStrehovsky (Member) commented:
current scheme does not close the door to that, but it is unlikely we will have time for that before 9.0

The question is whether this is the thing we want to teach the JIT to inline. I believe the reason for this work was that we need this to be as efficient as possible before the JIT starts inlining it, because it will be a lot more difficult to change later. This is still not as efficient as C++. I still think the global prepass is the wrong direction.

For managed threadstatics the current scheme is nearly optimal.

I think managed thread statics could work the same way as normal GC statics: normal GC statics get a pointer to the base in the .data segment. We dereference this pointer to get to the actual data on the GC heap (this data is per type, so no global prepass). But we do not scan each type's static base individually when the GC runs. Instead we have a single spine that we report to the GC as an array like any other; we could have the same spine for thread statics.

I think we could have a similar scheme for thread statics where the pointer to the type's base is in the .tls region (instead of .data) and gets accessed similarly (an indirection to get to the object on the GC heap that represents this type's base).

@VSadov (Member, Author) commented Jun 15, 2023

I think managed threadstatics could work the same way as normal GC statics

Normal statics are much more numerous than thread-local statics, so the motivation for allocating storage per type is higher.
On the other hand, regular statics are very static: there is only one copy of the storage per type, with convenient points at which to initialize the storage and the spine, and with trivial failure modes. Since the lifetime of regular statics is so simple, you can reference them indirectly via handles without worrying about deallocation/leaks.

The only "convenient" part about thread-local variables is that there are typically far fewer of them in a program than regular statics, but there could easily be hundreds of threads with separate instances. There is no nice point at which to allocate the supporting storage upfront; a thread attaching to the runtime can't handle allocation failures. Deallocation could also get tricky.

Once you start adjusting the design to these details (let's avoid handles and point to the bases directly, but merge the bases into one block to have fewer roots, . . .), it may end up not far from the approach in this PR.

I do not think there is a shortage of ideas for how to deal with TLS.
However, once we try to prototype them, I think the tradeoffs and implementation details will likely push most ideas towards a design similar to what we have here.

@VSadov (Member, Author) commented Jun 22, 2023

ObjectWriter changes are in. We are passing tests now!!!

@VSadov (Member, Author) commented Jun 22, 2023

/azp run runtime-extra-platforms

@azure-pipelines commented:

Azure Pipelines successfully started running 1 pipeline(s).

@VSadov (Member, Author) commented Jun 22, 2023

The extra-platforms run has a bunch of failures like:

 Unhandled exception. Internal.TypeSystem.TypeSystemException+FileNotFoundException: Failed to load assembly 'SQLitePCLRaw.core'

It does not look related. (If TLS were broken, we would not get this far.)

It also seems vaguely familiar. I think this has happened before.

 internal static unsafe object GetInlinedThreadStaticBaseSlow(ref object? threadStorage)
 {
     Debug.Assert(threadStorage == null);
     // Allocate an object that will represent a memory block for all thread static fields
-    TypeManagerHandle typeManager = RuntimeImports.RhGetSingleTypeManager();
+    TypeManagerHandle typeManager = (new object()).GetMethodTable()->TypeManager;

Member:

Suggested change:

-    TypeManagerHandle typeManager = (new object()).GetMethodTable()->TypeManager;
+    TypeManagerHandle typeManager = EETypePtr.EETypePtrOf<object>().ToPointer()->TypeManager;

?

Member:

(Also, I have noticed that we have an unused, broken MethodTableOf<T> - you can delete it while you are at it.)

@VSadov (Member, Author):

I thought there might be a way to just fetch the method table of the type, but I did not want to involve anything reflection-like by accident. This code will run once per thread, so allocating one object seemed both simple and affordable.
I did not know about EETypePtrOf.

@jkotas (Member) left a comment:

LGTM. Thank you!

Comment on lines +304 to +305
// mov rax, qword ptr[rdi]
encoder.Builder.EmitBytes(new byte[] { 0x48, 0x8B, 0x07 });

Member:

We can express a lot of these with the existing EmitMOV methods (I'm not going to comment on all of them).

@VSadov (Member, Author) commented Jun 23, 2023:

These patterns are fragile. We want to be sure we emit exactly these bytes.

When we emit instructions with zero offsets, there may be more than one way to encode them. That is especially true on arm64: if you have a zero offset, there could be a different instruction format that simply does not have an offset field. I had a bit of trouble with this in regular assembly in #85689 - masm tries to be helpful and picks a nicer instruction, but that does not work, since a specific encoding is important for the relocation to be applicable.
And some instructions are not expressible with EmitXXX - for example, on Linux we need to insert some junk prefixes that make no semantic difference but ensure the code is big enough to be patched.

In many cases I ended up looking at what the C++ compiler emits and copying the exact instruction bytes, to be sure.
EmitBytes was an easier API to work with than trying to figure out whether and how the same could be done with EmitMOV and the like while being sure the emitted bytes are exactly as needed.
Perhaps in a few places we could use EmitMOV. I do not think it is worth finding and switching them.

Also, this codegen will eventually (and hopefully soon) move into the JIT and will use whatever the JIT uses to emit code anyway.
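
As a concrete example of those "junk prefixes", this is the canonical x86-64 General Dynamic TLS access sequence defined by the psABI, written out here as raw bytes for illustration (not code from this PR; relocation fields are left as zeros, and the class name is a placeholder). The data16 (0x66) and rex64 (0x48) prefixes on the call change nothing semantically; they exist so that the sequence is exactly 16 bytes and the linker can relax it in place by overwriting those same bytes.

    static class LinuxTlsSequences
    {
        // Canonical x86-64 General Dynamic sequence (16 bytes total):
        //   data16 lea rdi, [rip + x@TLSGD]
        //   data16 data16 rex64 call __tls_get_addr@PLT
        public static readonly byte[] GeneralDynamic =
        {
            0x66, 0x48, 0x8D, 0x3D, 0x00, 0x00, 0x00, 0x00, // data16 lea rdi, [rip + x@TLSGD]
            0x66, 0x66, 0x48, 0xE8, 0x00, 0x00, 0x00, 0x00, // data16 data16 rex64 call __tls_get_addr@PLT
        };
    }

This is also why the Initial Exec replacement discussed earlier is "the same 16 bytes": the relaxed sequences have to fit exactly over the bytes of the original one.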

// IsSingleFileCompilation is not enough to guarantee that we can use "Initial Executable" optimizations.
// we need a special compiler flag analogous to /GA. Just assume "false" for now.
// bool singleFileExe = factory.CompilationModuleGroup.IsSingleFileCompilation;
bool singleFileExe = false;

Member:

If we already know isSingleFileExe is not what we want, can we call this what it is? (Same for ARM64.)

@VSadov (Member, Author):

This will need to be changed anyway once the "optimize for application" switch is implemented; I assume we will be doing that.

But I can change this. Would isInitialExecutable look better?

Member:

I would name it after the expected optimization (canAssumeZeroOffset or whatever). singleFileExe is unlikely to be the thing we end up keying on, because I don't expect we'd enable it whenever we build a single-file exe (I don't see this option enabled by default for EXE projects in VC++ VS projects, and I don't even see a UI option to enable it in VS settings besides typing /GA manually - both of those feel like red flags we'd need to understand).

@VSadov (Member, Author) commented Jun 26, 2023:

I am not sure about Windows. Perhaps the difference on win-x64 is not big enough and building an ".exe" is not common enough. On win-arm64 there would be more of a difference, but perhaps they chose to be consistent with x64, or just did not think about the flag.

Both GCC and Clang assume single-executable mode by default (-fpie). Dynamic TLS mode is basically a de-optimization that happens when building an .so (i.e. when -fpic is used).
The differences are more significant than on Windows.

I do not think the flag should describe the difference in the codegen, because what happens is specific to the OS and architecture.
The flag should simply specify what the user intends to produce - an executable or a dynamic library.

{
    public void AppendMangledName(NameMangler nameMangler, Utf8StringBuilder sb)
    {
        // tls_InlinedThreadStatics

Member:

Is this comment useful?

@VSadov (Member, Author):

Not any more. I will remove it.

@VSadov (Member, Author) commented Jun 23, 2023

Thanks!!!

@VSadov merged commit 0f71880 into dotnet:main on Jun 23, 2023
105 of 108 checks passed
@VSadov deleted the thrSt branch on June 23, 2023 16:59
@VSadov (Member, Author) commented Jun 28, 2023

CC @kunalspathak - I think this is the state from which JIT work can happen.

@NCLnclNCL commented:
Ok

@ghost locked as resolved and limited conversation to collaborators on Aug 14, 2023