
Allow async interruptions on safepoints #95565

Merged: 44 commits from the spInterrupt branch into dotnet:main on May 6, 2024

Conversation

@VSadov (Member) commented Dec 4, 2023

TL;DR

== What is going on here:

Currently we have roughly two kinds of methods - fully and partially interruptible. In fully interruptible methods we can initiate stackwalks as well as stackwalk through them. Partially interruptible methods allow only stackwalking through them, since only call-return sites have info about the GC content of the stack and nonvolatile registers (and volatile ones are conveniently dead at return sites).
The advantage of partially interruptible methods is that they carry less info, and thus everything that can be partially interruptible is emitted as such. The ratio of partially to fully interruptible code varies, but partially interruptible is generally the vast majority.

A good question to ask is: can we initiate a stack walk when stopped at a call-return site in partially interruptible code?
The current answer is "almost". We already store the info about nonvolatile registers and stack locations for these sites, so in theory we have enough information to start a stackwalk. Almost enough.
The only missing piece is the content of the return registers. Unlike the case of walking through virtually unwound callers, where return registers are not live yet, in the case of an actual return they contain return values, and if those are object refs, they must be reported to the GC. Once the GC info can tell whether the return registers are live after a call and interesting to the GC, call-return sites become interruptible.
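
To make the missing piece concrete, here is a minimal, hypothetical C# illustration (not code from this PR) of a reference that may exist only in a return register at the call-return site:

    using System;

    class ReturnRegisterExample
    {
        // Returns a freshly allocated heap object; right after the call,
        // the only root for it is the return register (e.g. RAX on x64).
        static string Greet() => new string("hello".ToCharArray());

        static void Caller()
        {
            // If the thread is stopped at the instruction just past the call,
            // the GC must report the return register as a live object ref,
            // or the new string could be collected or not relocated correctly.
            string s = Greet();
            Console.WriteLine(s);
        }

        static void Main() => Caller();
    }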

== What is gained:

  1. If we find a thread at an interruptible point, we do not need to hijack/release it and wait for the thread to self-suspend when it eventually returns from the call - we are done! We have already caught the thread where we want it to be.
  2. This adds a lot of additional interruptibility points to a program, which improves the robustness of suspension.

#2 is actually the more interesting one. There are known cases that present a challenge to suspension. One of them is the "short call in a loop". This happens when we have a relatively tight loop that performs calls to short methods. At compile time we only know that there are calls, so we do not put GC polls in the loop. As a result we end up with a loop that can be suspended only when we manage to catch the thread within the few instructions of the callee.
Superscalar, multiple-issue CPUs can process several instructions at a time, which makes this situation worse, as it effectively reduces the number of possible hijackable points in the callee. We may see the IP of a stopped thread at the call-return site far more often than inside the callee's body.
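
For illustration, a minimal sketch of the "short call in a loop" shape (hypothetical C# code, not from this PR):

    using System.Runtime.CompilerServices;

    class ShortCallInLoop
    {
        // The "short callee": only a few instructions long, so a suspending
        // thread rarely catches the worker's IP inside its body.
        [MethodImpl(MethodImplOptions.NoInlining)]
        static int Add(int a, int b) => a + b;

        static int Sum(int[] items)
        {
            int total = 0;
            // The loop body contains a call, so the JIT emits no GC poll here.
            // Before this change, suspension had to catch the thread inside
            // Add; with it, the call-return site in Sum is interruptible too.
            for (int i = 0; i < items.Length; i++)
                total = Add(total, items[i]);
            return total;
        }
    }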

== Possible further improvements:

Once we have this, hijacking could be simplified as well.
Currently, while installing a hijack, we need to consult the GC info about the GC-refness of the method's returns and stash that information in a transition frame, so that, if the hijack is hit, the stack crawler knows if and how to report the return registers.

All this special casing would be unnecessary, as we could put the synchronous interruption case (hijack) and the asynchronous interruption case (signals or OS suspension) on the same plan with respect to GC reporting: just ask the decoder to report volatile registers for the leaf frame, of which only the return registers would be live at a return site - the same as in the asynchronous case. The only difference would be how the interruption happened and how the register display was formed (i.e. pushed by the hijack probe or by the OS).
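
A hedged sketch of that unified shape, with hypothetical stand-in names (this is not the actual runtime API):

    // Hypothetical stand-ins for runtime structures (illustration only).
    interface IGcInfoDecoder { void ReportVolatileRegisters(RegisterSet regs); }
    class RegisterSet { /* captured register values for the leaf frame */ }
    enum InterruptKind { Hijack, AsyncSignal }

    static class LeafFrameReporting
    {
        // One reporting path for both interruption kinds: regardless of how
        // the thread was stopped, report volatile registers for the leaf
        // frame; at a call-return site only the return registers among them
        // can hold live GC references.
        static void ReportLeafFrame(IGcInfoDecoder decoder, RegisterSet regs, InterruptKind kind)
        {
            decoder.ReportVolatileRegisters(regs);
            // "kind" only affects how "regs" was captured (hijack probe vs.
            // the OS), not how GC reporting is done.
        }
    }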

A sizable chunk of architecture- and platform-specific C++/asm code that deals with return flags could go away. Like a good part of this file:

// All values but GCRK_Unknown must correspond to MethodReturnKind enumeration in gcinfo.h
enum GCRefKind : unsigned char
{
    GCRK_Scalar = 0x00,
    GCRK_Object = 0x01,
    GCRK_Byref  = 0x02,
#ifdef TARGET_64BIT
    // Composite return kinds for value types returned in two registers (encoded with two bits per register)
    GCRK_Scalar_Obj   = (GCRK_Object << 2) | GCRK_Scalar,
    GCRK_Obj_Obj      = (GCRK_Object << 2) | GCRK_Object,
    GCRK_Byref_Obj    = (GCRK_Object << 2) | GCRK_Byref,
    GCRK_Scalar_Byref = (GCRK_Byref << 2) | GCRK_Scalar,
    GCRK_Obj_Byref    = (GCRK_Byref << 2) | GCRK_Object,
    GCRK_Byref_Byref  = (GCRK_Byref << 2) | GCRK_Byref,

Also, we may be able to drop the return kind bits from the GC info entirely; I think there are no other uses.

@ghost commented Dec 4, 2023

Tagging subscribers to this area: @agocke, @MichalStrehovsky, @jkotas
See info in area-owners.md if you want to be subscribed.

Issue Details
Author: VSadov
Assignees: VSadov
Labels: area-NativeAOT-coreclr

Milestone: -

@dotnet deleted a comment from azure-pipelines bot Dec 5, 2023
@dotnet deleted a comment from azure-pipelines bot Dec 5, 2023
@VSadov force-pushed the spInterrupt branch 2 times, most recently from c10522c to d12c536 on December 8, 2023 at 19:33
@dotnet deleted a comment from azure-pipelines bot Dec 8, 2023
@dotnet deleted a comment from azure-pipelines bot Dec 9, 2023
@VSadov force-pushed the spInterrupt branch 3 times, most recently from 2c94646 to d569006 on December 28, 2023 at 22:17
@ryujit-bot
Diff results for #95565

Throughput diffs

Throughput diffs for linux/arm64 ran on windows/x64

Overall (+0.12% to +0.38%)
Collection PDIFF
benchmarks.run.linux.arm64.checked.mch +0.14%
benchmarks.run_pgo.linux.arm64.checked.mch +0.12%
benchmarks.run_tiered.linux.arm64.checked.mch +0.27%
coreclr_tests.run.linux.arm64.checked.mch +0.28%
libraries.crossgen2.linux.arm64.checked.mch +0.38%
libraries.pmi.linux.arm64.checked.mch +0.22%
libraries_tests.run.linux.arm64.Release.mch +0.28%
libraries_tests_no_tiered_compilation.run.linux.arm64.Release.mch +0.30%
realworld.run.linux.arm64.checked.mch +0.22%
smoke_tests.nativeaot.linux.arm64.checked.mch +0.20%
MinOpts (+0.02% to +1.26%)
Collection PDIFF
benchmarks.run.linux.arm64.checked.mch +0.80%
benchmarks.run_pgo.linux.arm64.checked.mch +0.44%
benchmarks.run_tiered.linux.arm64.checked.mch +0.41%
coreclr_tests.run.linux.arm64.checked.mch +0.37%
libraries.crossgen2.linux.arm64.checked.mch +0.12%
libraries.pmi.linux.arm64.checked.mch +0.02%
libraries_tests.run.linux.arm64.Release.mch +0.78%
libraries_tests_no_tiered_compilation.run.linux.arm64.Release.mch +1.26%
realworld.run.linux.arm64.checked.mch +0.57%
smoke_tests.nativeaot.linux.arm64.checked.mch +1.22%
FullOpts (+0.08% to +0.38%)
Collection PDIFF
benchmarks.run.linux.arm64.checked.mch +0.14%
benchmarks.run_pgo.linux.arm64.checked.mch +0.08%
benchmarks.run_tiered.linux.arm64.checked.mch +0.09%
coreclr_tests.run.linux.arm64.checked.mch +0.22%
libraries.crossgen2.linux.arm64.checked.mch +0.38%
libraries.pmi.linux.arm64.checked.mch +0.22%
libraries_tests.run.linux.arm64.Release.mch +0.10%
libraries_tests_no_tiered_compilation.run.linux.arm64.Release.mch +0.27%
realworld.run.linux.arm64.checked.mch +0.22%
smoke_tests.nativeaot.linux.arm64.checked.mch +0.20%

Throughput diffs for linux/x64 ran on windows/x64

Overall (+0.12% to +0.35%)
Collection PDIFF
benchmarks.run.linux.x64.checked.mch +0.14%
benchmarks.run_pgo.linux.x64.checked.mch +0.12%
benchmarks.run_tiered.linux.x64.checked.mch +0.35%
coreclr_tests.run.linux.x64.checked.mch +0.28%
libraries.crossgen2.linux.x64.checked.mch +0.35%
libraries.pmi.linux.x64.checked.mch +0.21%
libraries_tests.run.linux.x64.Release.mch +0.28%
libraries_tests_no_tiered_compilation.run.linux.x64.Release.mch +0.29%
realworld.run.linux.x64.checked.mch +0.22%
smoke_tests.nativeaot.linux.x64.checked.mch +0.16%
MinOpts (+0.02% to +1.36%)
Collection PDIFF
benchmarks.run.linux.x64.checked.mch +1.36%
benchmarks.run_pgo.linux.x64.checked.mch +0.57%
benchmarks.run_tiered.linux.x64.checked.mch +0.68%
coreclr_tests.run.linux.x64.checked.mch +0.41%
libraries.crossgen2.linux.x64.checked.mch +0.15%
libraries.pmi.linux.x64.checked.mch +0.02%
libraries_tests.run.linux.x64.Release.mch +0.94%
libraries_tests_no_tiered_compilation.run.linux.x64.Release.mch +0.44%
realworld.run.linux.x64.checked.mch +0.72%
smoke_tests.nativeaot.linux.x64.checked.mch +1.25%
FullOpts (+0.07% to +0.35%)
Collection PDIFF
benchmarks.run.linux.x64.checked.mch +0.13%
benchmarks.run_pgo.linux.x64.checked.mch +0.07%
benchmarks.run_tiered.linux.x64.checked.mch +0.08%
coreclr_tests.run.linux.x64.checked.mch +0.20%
libraries.crossgen2.linux.x64.checked.mch +0.35%
libraries.pmi.linux.x64.checked.mch +0.21%
libraries_tests.run.linux.x64.Release.mch +0.10%
libraries_tests_no_tiered_compilation.run.linux.x64.Release.mch +0.28%
realworld.run.linux.x64.checked.mch +0.21%
smoke_tests.nativeaot.linux.x64.checked.mch +0.16%

Throughput diffs for osx/arm64 ran on windows/x64

Overall (+0.15% to +0.38%)
Collection PDIFF
benchmarks.run_pgo.osx.arm64.checked.mch +0.15%
benchmarks.run_tiered.osx.arm64.checked.mch +0.29%
coreclr_tests.run.osx.arm64.checked.mch +0.29%
libraries.crossgen2.osx.arm64.checked.mch +0.38%
libraries.pmi.osx.arm64.checked.mch +0.22%
libraries_tests.run.osx.arm64.Release.mch +0.33%
libraries_tests_no_tiered_compilation.run.osx.arm64.Release.mch +0.30%
realworld.run.osx.arm64.checked.mch +0.22%
MinOpts (+0.01% to +1.29%)
Collection PDIFF
benchmarks.run_pgo.osx.arm64.checked.mch +0.51%
benchmarks.run_tiered.osx.arm64.checked.mch +0.55%
coreclr_tests.run.osx.arm64.checked.mch +0.36%
libraries.crossgen2.osx.arm64.checked.mch +0.12%
libraries.pmi.osx.arm64.checked.mch +0.01%
libraries_tests.run.osx.arm64.Release.mch +0.77%
libraries_tests_no_tiered_compilation.run.osx.arm64.Release.mch +1.29%
realworld.run.osx.arm64.checked.mch +0.57%
FullOpts (+0.05% to +0.38%)
Collection PDIFF
benchmarks.run_pgo.osx.arm64.checked.mch +0.05%
benchmarks.run_tiered.osx.arm64.checked.mch +0.08%
coreclr_tests.run.osx.arm64.checked.mch +0.23%
libraries.crossgen2.osx.arm64.checked.mch +0.38%
libraries.pmi.osx.arm64.checked.mch +0.22%
libraries_tests.run.osx.arm64.Release.mch +0.11%
libraries_tests_no_tiered_compilation.run.osx.arm64.Release.mch +0.28%
realworld.run.osx.arm64.checked.mch +0.22%

Throughput diffs for windows/arm64 ran on windows/x64

Overall (+0.15% to +0.38%)
Collection PDIFF
benchmarks.run.windows.arm64.checked.mch +0.15%
benchmarks.run_pgo.windows.arm64.checked.mch +0.15%
benchmarks.run_tiered.windows.arm64.checked.mch +0.29%
coreclr_tests.run.windows.arm64.checked.mch +0.28%
libraries.crossgen2.windows.arm64.checked.mch +0.38%
libraries.pmi.windows.arm64.checked.mch +0.22%
libraries_tests.run.windows.arm64.Release.mch +0.34%
libraries_tests_no_tiered_compilation.run.windows.arm64.Release.mch +0.30%
realworld.run.windows.arm64.checked.mch +0.22%
smoke_tests.nativeaot.windows.arm64.checked.mch +0.19%
MinOpts (+0.02% to +1.29%)
Collection PDIFF
benchmarks.run.windows.arm64.checked.mch +0.66%
benchmarks.run_pgo.windows.arm64.checked.mch +0.51%
benchmarks.run_tiered.windows.arm64.checked.mch +0.55%
coreclr_tests.run.windows.arm64.checked.mch +0.36%
libraries.crossgen2.windows.arm64.checked.mch +0.12%
libraries.pmi.windows.arm64.checked.mch +0.02%
libraries_tests.run.windows.arm64.Release.mch +0.78%
libraries_tests_no_tiered_compilation.run.windows.arm64.Release.mch +1.29%
realworld.run.windows.arm64.checked.mch +0.57%
smoke_tests.nativeaot.windows.arm64.checked.mch +1.17%
FullOpts (+0.08% to +0.38%)
Collection PDIFF
benchmarks.run.windows.arm64.checked.mch +0.15%
benchmarks.run_pgo.windows.arm64.checked.mch +0.09%
benchmarks.run_tiered.windows.arm64.checked.mch +0.08%
coreclr_tests.run.windows.arm64.checked.mch +0.22%
libraries.crossgen2.windows.arm64.checked.mch +0.38%
libraries.pmi.windows.arm64.checked.mch +0.22%
libraries_tests.run.windows.arm64.Release.mch +0.11%
libraries_tests_no_tiered_compilation.run.windows.arm64.Release.mch +0.27%
realworld.run.windows.arm64.checked.mch +0.22%
smoke_tests.nativeaot.windows.arm64.checked.mch +0.19%

Throughput diffs for windows/x64 ran on windows/x64

Overall (+0.13% to +0.36%)
Collection PDIFF
aspnet.run.windows.x64.checked.mch +0.13%
benchmarks.run.windows.x64.checked.mch +0.15%
benchmarks.run_pgo.windows.x64.checked.mch +0.14%
benchmarks.run_tiered.windows.x64.checked.mch +0.36%
coreclr_tests.run.windows.x64.checked.mch +0.28%
libraries.crossgen2.windows.x64.checked.mch +0.34%
libraries.pmi.windows.x64.checked.mch +0.19%
libraries_tests.run.windows.x64.Release.mch +0.33%
libraries_tests_no_tiered_compilation.run.windows.x64.Release.mch +0.26%
realworld.run.windows.x64.checked.mch +0.20%
smoke_tests.nativeaot.windows.x64.checked.mch +0.14%
MinOpts (+0.02% to +1.24%)
Collection PDIFF
aspnet.run.windows.x64.checked.mch +0.71%
benchmarks.run.windows.x64.checked.mch +1.09%
benchmarks.run_pgo.windows.x64.checked.mch +0.65%
benchmarks.run_tiered.windows.x64.checked.mch +0.73%
coreclr_tests.run.windows.x64.checked.mch +0.41%
libraries.crossgen2.windows.x64.checked.mch +0.15%
libraries.pmi.windows.x64.checked.mch +0.02%
libraries_tests.run.windows.x64.Release.mch +0.94%
libraries_tests_no_tiered_compilation.run.windows.x64.Release.mch +0.45%
realworld.run.windows.x64.checked.mch +0.70%
smoke_tests.nativeaot.windows.x64.checked.mch +1.24%
FullOpts (+0.06% to +0.34%)
Collection PDIFF
aspnet.run.windows.x64.checked.mch +0.07%
benchmarks.run.windows.x64.checked.mch +0.15%
benchmarks.run_pgo.windows.x64.checked.mch +0.06%
benchmarks.run_tiered.windows.x64.checked.mch +0.12%
coreclr_tests.run.windows.x64.checked.mch +0.20%
libraries.crossgen2.windows.x64.checked.mch +0.34%
libraries.pmi.windows.x64.checked.mch +0.19%
libraries_tests.run.windows.x64.Release.mch +0.10%
libraries_tests_no_tiered_compilation.run.windows.x64.Release.mch +0.26%
realworld.run.windows.x64.checked.mch +0.20%
smoke_tests.nativeaot.windows.x64.checked.mch +0.14%

Details here


Throughput diffs for linux/arm ran on windows/x86

Overall (+0.14% to +0.42%)
Collection PDIFF
benchmarks.run.linux.arm.checked.mch +0.14%
benchmarks.run_pgo.linux.arm.checked.mch +0.17%
benchmarks.run_tiered.linux.arm.checked.mch +0.23%
coreclr_tests.run.linux.arm.checked.mch +0.31%
libraries.crossgen2.linux.arm.checked.mch +0.42%
libraries.pmi.linux.arm.checked.mch +0.22%
libraries_tests.run.linux.arm.Release.mch +0.32%
libraries_tests_no_tiered_compilation.run.linux.arm.Release.mch +0.31%
realworld.run.linux.arm.checked.mch +0.18%
MinOpts (+0.02% to +1.66%)
Collection PDIFF
benchmarks.run.linux.arm.checked.mch +0.90%
benchmarks.run_pgo.linux.arm.checked.mch +0.73%
benchmarks.run_tiered.linux.arm.checked.mch +0.76%
coreclr_tests.run.linux.arm.checked.mch +0.45%
libraries.crossgen2.linux.arm.checked.mch +0.14%
libraries.pmi.linux.arm.checked.mch +0.02%
libraries_tests.run.linux.arm.Release.mch +1.00%
libraries_tests_no_tiered_compilation.run.linux.arm.Release.mch +1.66%
realworld.run.linux.arm.checked.mch +0.71%
FullOpts (+0.10% to +0.42%)
Collection PDIFF
benchmarks.run.linux.arm.checked.mch +0.14%
benchmarks.run_pgo.linux.arm.checked.mch +0.13%
benchmarks.run_tiered.linux.arm.checked.mch +0.10%
coreclr_tests.run.linux.arm.checked.mch +0.21%
libraries.crossgen2.linux.arm.checked.mch +0.42%
libraries.pmi.linux.arm.checked.mch +0.22%
libraries_tests.run.linux.arm.Release.mch +0.12%
libraries_tests_no_tiered_compilation.run.linux.arm.Release.mch +0.26%
realworld.run.linux.arm.checked.mch +0.18%

Details here


Throughput diffs for linux/arm64 ran on linux/x64

Overall (+0.11% to +0.33%)
Collection PDIFF
libraries.pmi.linux.arm64.checked.mch +0.19%
realworld.run.linux.arm64.checked.mch +0.20%
benchmarks.run.linux.arm64.checked.mch +0.13%
libraries.crossgen2.linux.arm64.checked.mch +0.33%
libraries_tests.run.linux.arm64.Release.mch +0.22%
smoke_tests.nativeaot.linux.arm64.checked.mch +0.17%
benchmarks.run_tiered.linux.arm64.checked.mch +0.22%
libraries_tests_no_tiered_compilation.run.linux.arm64.Release.mch +0.26%
coreclr_tests.run.linux.arm64.checked.mch +0.23%
benchmarks.run_pgo.linux.arm64.checked.mch +0.11%
MinOpts (+0.05% to +0.90%)
Collection PDIFF
libraries.pmi.linux.arm64.checked.mch +0.05%
realworld.run.linux.arm64.checked.mch +0.42%
benchmarks.run.linux.arm64.checked.mch +0.63%
libraries.crossgen2.linux.arm64.checked.mch +0.09%
libraries_tests.run.linux.arm64.Release.mch +0.58%
smoke_tests.nativeaot.linux.arm64.checked.mch +0.90%
benchmarks.run_tiered.linux.arm64.checked.mch +0.33%
libraries_tests_no_tiered_compilation.run.linux.arm64.Release.mch +0.81%
coreclr_tests.run.linux.arm64.checked.mch +0.26%
benchmarks.run_pgo.linux.arm64.checked.mch +0.35%
FullOpts (+0.07% to +0.33%)
Collection PDIFF
libraries.pmi.linux.arm64.checked.mch +0.19%
realworld.run.linux.arm64.checked.mch +0.19%
benchmarks.run.linux.arm64.checked.mch +0.12%
libraries.crossgen2.linux.arm64.checked.mch +0.33%
libraries_tests.run.linux.arm64.Release.mch +0.10%
smoke_tests.nativeaot.linux.arm64.checked.mch +0.17%
benchmarks.run_tiered.linux.arm64.checked.mch +0.08%
libraries_tests_no_tiered_compilation.run.linux.arm64.Release.mch +0.24%
coreclr_tests.run.linux.arm64.checked.mch +0.20%
benchmarks.run_pgo.linux.arm64.checked.mch +0.07%

Throughput diffs for linux/x64 ran on linux/x64

Overall (+0.10% to +0.31%)
Collection PDIFF
benchmarks.run.linux.x64.checked.mch +0.13%
realworld.run.linux.x64.checked.mch +0.19%
libraries.crossgen2.linux.x64.checked.mch +0.31%
benchmarks.run_tiered.linux.x64.checked.mch +0.27%
libraries_tests.run.linux.x64.Release.mch +0.23%
libraries_tests_no_tiered_compilation.run.linux.x64.Release.mch +0.26%
benchmarks.run_pgo.linux.x64.checked.mch +0.10%
libraries.pmi.linux.x64.checked.mch +0.19%
coreclr_tests.run.linux.x64.checked.mch +0.23%
smoke_tests.nativeaot.linux.x64.checked.mch +0.13%
MinOpts (+0.06% to +1.02%)
Collection PDIFF
benchmarks.run.linux.x64.checked.mch +1.02%
realworld.run.linux.x64.checked.mch +0.54%
libraries.crossgen2.linux.x64.checked.mch +0.11%
benchmarks.run_tiered.linux.x64.checked.mch +0.53%
libraries_tests.run.linux.x64.Release.mch +0.71%
libraries_tests_no_tiered_compilation.run.linux.x64.Release.mch +0.33%
benchmarks.run_pgo.linux.x64.checked.mch +0.45%
libraries.pmi.linux.x64.checked.mch +0.06%
coreclr_tests.run.linux.x64.checked.mch +0.30%
smoke_tests.nativeaot.linux.x64.checked.mch +0.92%
FullOpts (+0.06% to +0.31%)
Collection PDIFF
benchmarks.run.linux.x64.checked.mch +0.12%
realworld.run.linux.x64.checked.mch +0.19%
libraries.crossgen2.linux.x64.checked.mch +0.31%
benchmarks.run_tiered.linux.x64.checked.mch +0.07%
libraries_tests.run.linux.x64.Release.mch +0.09%
libraries_tests_no_tiered_compilation.run.linux.x64.Release.mch +0.25%
benchmarks.run_pgo.linux.x64.checked.mch +0.06%
libraries.pmi.linux.x64.checked.mch +0.19%
coreclr_tests.run.linux.x64.checked.mch +0.18%
smoke_tests.nativeaot.linux.x64.checked.mch +0.13%

Details here


@VSadov marked this pull request as ready for review January 27, 2024 01:05
@VSadov (Member, Author) commented Jan 27, 2024

I think this is ready for a discussion.

@jkotas (Member) commented Jan 27, 2024

  • Why is this specific to native AOT? It should be equally applicable to CoreCLR too.
  • Is this a breaking change in the GCInfo format or semantics?

@jkotas added the area-CodeGen-coreclr label Jan 27, 2024
@ghost commented Jan 27, 2024

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

Author: VSadov
Assignees: VSadov
Labels: area-CodeGen-coreclr, area-NativeAOT-coreclr

Milestone: -

@VSadov (Member, Author) commented Jan 27, 2024

The reason this is NativeAOT-only is just that NativeAOT is more straightforward in this area and as such more approachable for experimenting.
This will work for CoreCLR as well, but I figured I should do one step at a time.

I did experiment with CoreCLR locally. It worked and passed the libraries tests, but I was not confident enough to include those changes in the same PR.
It is not a lot of additional change, but it is more scattered than I'd like, so I was unsure about the completeness of the change.

@VSadov (Member, Author) commented May 5, 2024

Rebased onto recent main, resolved conflicts.

@AndyAyersMS (Member)

Seems like Bruce is out for a while longer still.

@kunalspathak are you going to review?

@kunalspathak (Member)

> Seems like Bruce is out for a while longer still.
> @kunalspathak are you going to review?

I asked @jakobbotsch last week to take a look.

@VSadov (Member, Author) commented May 6, 2024

> I asked @jakobbotsch last week to take a look.

@jakobbotsch looked through the changes after that and signed off. (Thanks, Jakob!!)
So I wonder whether we should wait for someone else or are good to go.

@AndyAyersMS (Member)

I think we are good.

@VSadov (Member, Author) commented May 6, 2024

I will update the JIT GUID and proceed to merging this, if nothing comes up.

@VSadov (Member, Author) commented May 6, 2024

The failure in Loader/binding/tracing/BinderTracingTest.ResolutionFlow/BinderTracingTest.ResolutionFlow.cmd is #97735.

The rest is green.

@VSadov merged commit 8608181 into dotnet:main May 6, 2024
104 of 106 checks passed
@VSadov deleted the spInterrupt branch May 6, 2024 23:12
@VSadov (Member, Author) commented May 6, 2024

Thanks everybody for reviewing and helping with this!!!

michaelgsharp pushed a commit to michaelgsharp/runtime that referenced this pull request May 9, 2024
* allow async interruptions on safepoints

* ARM64 TODO

* report GC ref/byref returns at partially interruptible callsites

* enable on all platforms

* tweak

* fix after rebasing

* do not record tailcalls

* IsInterruptibleSafePoint

* update gccover

* turn on new behavior on a gcinfo version

* tailcalls tweak

* do not report unused returns

* CORINFO_HELP_FAIL_FAST  should not be a safepoint

* treat tailcalls as emitNoGChelper

* versioning tweak

* enable in CoreCLR (not just for GC stress scenarios)

* fix x86 build

* other architectures

* added a knob DOTNET_InterruptibleCallSites

* moved DOTNET_InterruptibleCallSites check to the code manager

* JIT_StackProbe should not be a safepoint (stack is not cleaned yet)

* Hooked up GCInfo version to R2R file version

* formatting

* GCStress support for RISC architectures

* Update src/coreclr/inc/gcinfo.h

Co-authored-by: Jan Kotas <jkotas@microsoft.com>

* make InterruptibleSafePointsEnabled static

* fix linux-x86 build.

* ARM32 actually can't return 2 GC references, so can filter out R1 early

* revert unnecessary change

* Update src/coreclr/tools/aot/ILCompiler.Reflection.ReadyToRun/Amd64/GcInfo.cs

Co-authored-by: Filip Navara <filip.navara@gmail.com>

* removed GCINFO_VERSION cons from GcInfo.cs

* Use RBM_INTRET/RBM_INTRET_1

* Update src/coreclr/tools/aot/ILCompiler.Reflection.ReadyToRun/Amd64/GcInfo.cs

Co-authored-by: Jan Kotas <jkotas@microsoft.com>

* do not skip safe points twice (stress failure)

* revert unnecessary change in gccover.

* fix after rebase

* make sure to check `idIsNoGC` on all codepaths in `emitOutputInstr`

* make CORINFO_HELP_CHECK_OBJ a no-gc helper (because we can)

* mark a test that tests WaitForPendingFinalizers as GCStressIncompatible

* NOP

* these helpers should not form GC safe points

* require that the new block has BBF_HAS_LABEL

* Apply suggestions from code review

Co-authored-by: Jakob Botsch Nielsen <Jakob.botsch.nielsen@gmail.com>

* updated JITEEVersionIdentifier GUID

---------

Co-authored-by: Jan Kotas <jkotas@microsoft.com>
Co-authored-by: Filip Navara <filip.navara@gmail.com>
Co-authored-by: Jakob Botsch Nielsen <Jakob.botsch.nielsen@gmail.com>
@VSadov (Member, Author) commented May 11, 2024

For the final effects on suspension robustness and outliers in CoreCLR from this change plus the ported NativeAOT thread suspension algorithm (re: #101782):

I tried the current bits from the main branch with the same repro as used above (#95565 (comment)).

The benchmark prints out GC pauses in milliseconds. Smaller is better.
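
The actual repro is in the linked comment; as a rough, hypothetical sketch of this kind of measurement (assuming pauses are approximated by timing induced GCs while workers spin in hard-to-suspend loops):

    using System;
    using System.Diagnostics;
    using System.Runtime.CompilerServices;
    using System.Threading;

    class GcPauseBench
    {
        [MethodImpl(MethodImplOptions.NoInlining)]
        static int Add(int a, int b) => a + b;

        static void Main()
        {
            // Workers run the "short call in a loop" pattern described above.
            for (int t = 0; t < Environment.ProcessorCount; t++)
            {
                var th = new Thread(() => { int acc = 0; while (true) acc = Add(acc, 1); });
                th.IsBackground = true;
                th.Start();
            }

            var sw = new Stopwatch();
            for (int i = 0; i < 14; i++)
            {
                sw.Restart();
                GC.Collect();   // induces a suspension of all managed threads
                Console.WriteLine(sw.ElapsedMilliseconds);
                Thread.Sleep(200);
            }
        }
    }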

CoreCLR on Windows 10
(AMD Ryzen 9 7950X)

Before this PR and #101782 we saw multi-minute pauses, as measured in #95565 (comment):

0
26
15
31
47
46
47
0
30
96160 (I guess the code got tiered up here)
1175
2556
18854
351623
96504
623567
50948
295441
9274
174658

With bits from current main I see:

0
0
0
0
0
0
0
0
0
0
0
0
0
0

The suspension happens in the sub-millisecond range and is thus below the sensitivity of the benchmark.

Ruihan-Yin pushed a commit to Ruihan-Yin/runtime that referenced this pull request May 30, 2024
@github-actions bot locked and limited conversation to collaborators Jun 11, 2024