Use canonical method in TransitionFrames whenever we parse signatures #33733

fadimounir · 2020-03-18T20:32:19Z

I've been investigating the crash dump from #32848, and the main issue is that the GC is calling into the TypeLoader to construct new types. This should never be allowed, and what made it worse in that crash dump is the fact that server GC was being used, so the call to GetThread() was returning NULL, causing an AV during type construction.

The AV needs the following conditions to repro:

The use of server GC. I also believe it should be possible to use regular GC, and craft a repro test that would hit an assert in the type loader, but may not necessarily AV.
Some luck with inlining heuristics
Virtual calls into generic methods (or static methods on generic types)
The generic method must be using reference types as instantiation arguments (i.e. have a valid canonical method)
For that generic method to be executing its PreStub for the first time.
And finally, for GC to trigger on another thread, and enumerate that method that is currently executing its Prestub.

cc @dotnet/crossgen-contrib

Dotnet-GitSync-Bot · 2020-03-18T20:32:23Z

I couldn't add an area label to this PR.

Checkout this page to find out which area owner to ping, or please add exactly one area label to help train me in the future.

src/coreclr/src/vm/frames.h

src/coreclr/src/vm/frames.cpp

jkotas · 2020-03-18T20:47:48Z

I believe that this is expected to be handled by passing in CLASS_LOAD_APPROXPARENTS. See https://github.com/dotnet/runtime/blob/master/src/coreclr/src/vm/siginfo.cpp#L1057 and related code. Do you see CLASS_LOAD_APPROXPARENTS not being passed in when you see the crash?

fadimounir · 2020-03-18T22:05:51Z

Do you see CLASS_LOAD_APPROXPARENTS not being passed in when you see the crash?

CLASS_LOAD_APPROXPARENTS is passed correctly. I'm also able to repro this locally pretty easily with gcstress and server gc (see attached regression test)

fadimounir · 2020-03-18T22:12:38Z

@jkotas to make sure I understand the issue correctly, is this supposed to be handled on the TypeLoader side, to ensure we do not construct new types?

davidwrighton · 2020-03-19T02:47:07Z

@fadimounir I understand that approxparents is being passed correctly, but why isn't it preventing the exact type load? It seems that should be fixed instead of this.

fadimounir · 2020-03-19T20:44:00Z

I understand that approxparents is being passed correctly, but why isn't it preventing the exact type load?

This is a type that has never been loaded before, and approxparents doesn't help in this case of server GC (and AFAIK, not even with regular GC). The issue with server GC is that it doesn't have a valid Thread object associated with its thread, and the TypeLoader uses that Thread object in its locking logic. Without one, we AV right now. Even without server GC, in a checked build, we will hit this debug only code here: https://github.com/dotnet/runtime/blob/master/src/coreclr/src/vm/siginfo.cpp#L1104, which uses the DontLoadTypes flag, and we'll hit the assert right after this call because we'll get back a null TypeHandle.

My understanding is that approxparents won't help if the type has not been previously loaded to at least approxparents, because we should never be allowed to create types (not even to the level of approxparents) during GC. Please correct me if my understanding is wrong.

One possible easy way to fix this is to handle the null Thread* case, to avoid the AV, but I believe this would be a hack to mask one of the symptoms, and the bottom line is, we should still never load any type during GC. So I don't think this would be a correct fix.

I currently see 2 patterns here where we could be calling into a MethodDesc's entry point without potentially loading the valuetypes in the signature:

Entry points returned from dictionary lookups that we directly call into (Ex: delegates. I haven't verified that the bug exists in this scenario though).
Entry points returned by the JIT_VirtualFunctionPointer helper (the case of this bug here)

In these two cases, when jitting the callers of such methods, the JIT is dealing with the canonical version of things, so all the places where the JIT/JIT interface calls into the classMustBeLoadedBeforeCodeIsRun won't help.

The possible fix I see here is to call the MethodDesc::WalkValueTypeParameters() on these MethodDescs for which we return an entry point out of the two lookups I mentioned above, and load valuetypes in the signature to at least approxparents level.

If this doesn't feel like the right thing to do, this means that when we load these MethodDescs on the TypeLoader side, we should perform this step of walking the parameters and loading them, but this could be an overkill, and I think it might be better to perform that kind of loading only when we're about to execute the methods.

@jkotas, @davidwrighton I'd like to get your thoughts on this proposed fix. If you'd like, I can post an updated PR with what I'm proposing for clarification.

davidwrighton · 2020-03-20T01:07:36Z

@fadimounir My understanding from our offline conversations is that there is some attempt to load a type like SomeGenericStruct<object>, and we are forcing that actual type to be loaded. The dropGenericArgumentLevel stuff is supposed to force us to attempt to load SomeGenericStruct<__Canon>, which was already loaded. If there isn't a canonical type already loaded, that's a different problem, but if the issue is parsing through a signature that specifies SomeGenericStruct<object> and we're actually attempting to load that type (at all), we shouldn't be doing that, and the fix isn't to use a canonical method, its to have the signature loader trigger the attempted load of the __Canon variant of the struct.

fadimounir · 2020-03-20T18:18:54Z

The dropGenericArgumentLevel stuff is supposed to force us to attempt to load SomeGenericStruct<__Canon>, which was already loaded

@davidwrighton thanks for bringing that to my attention. Let me examine this further and see why we're still attempting to load the fully instantiated type even though the dropGenericArgumentLevel is set to true

fadimounir · 2020-03-20T20:01:58Z

Looks like the issue here is that dropGenericArgumentLevel is not handled correctly. We only attempt to canonicalize some variants of the signature (ex: if the signature contains ref type arguments), but we're not handling things like ELEMENT_TYPE_MVAR correctly.

First of all, is there a reason why ELEMENT_TYPE_VAR is treated differently than ELEMENT_TYPE_MVAR or is that just a bug that hasn't been noticed before? See: https://github.com/dotnet/runtime/blob/master/src/coreclr/src/vm/siginfo.cpp#L1238

Also, would it make sense to add something like this for each instantiation type argument we load, before attempting to load the instantiation itself (to be on the safe side)?:

if (dropGenericArgumentLevel && level == CLASS_LOAD_APPROXPARENTS)
{
    typeHnd = ClassLoader::CanonicalizeGenericArg(typeHnd);
}

davidwrighton · 2020-03-20T21:52:32Z

@fadimounir that sounds like the right fix. I'm wresting with 2 questions as I struggle to give you advice.

Should this cover both VAR and MVAR cases?
Should this change be restricted to only the stackwalking and GC scenarios for dropGenericArgumentLevel or not? dropGenericArgumentLevel is used in a bunch of places in the typeloader, and affecting those cases might not be desireable.

I'd like to see the results of a full outerloop (including some GCSTRESS) with the change applied to VAR and MVAR, and not protected to only the stackwalking/gc scenarios.

jkotas · 2020-03-20T22:18:44Z

is there a reason why ELEMENT_TYPE_VAR is treated differently than ELEMENT_TYPE_MVAR

I do not think you can ever get substitutions with MVARs today. Substituations are only used for method overriding that makes them applicable to type VARs only.

fadimounir · 2020-03-20T22:52:16Z

Should this cover both VAR and MVAR cases?

We don't have the same problem with VARs. I couldn't repro any failures there, not even with static methods on generic types.

If there's a risk of breaking change here, we can limit it to MVAR and only to the GC/stackwalk scenario:

thRet = (psig.GetTypeVariableThrowing(pModule, typ, fLoadTypes, pTypeContext));
if (fLoadTypes == ClassLoader::LoadTypes)
    ClassLoader::EnsureLoaded(thRet, level);

if ((IsGCThread() || IsStackWalkerThread()) && dropGenericArgumentLevel && level == CLASS_LOAD_APPROXPARENTS)
    thRet = ClassLoader::CanonicalizeGenericArg(thRet);

However, i'll submit a new commit to this PR where the changes are not just scoped to GC/stackwalking, and run an outerloop GCSTRESS job to see if that causes any failures

davidwrighton · 2020-03-20T23:39:56Z

The reason I wanted the run on the VAR cases is that we have much better coverage for odd things happening to generic types than to methods. I'm not entirely confident it won't break anything, and I'm not even sure that if it doesn't we shouldn't only fix MVAR, but having some confidence that this change doesn't negatively impact even when applied to both VAR and MVAR would be good.

fadimounir · 2020-03-20T23:41:58Z

but having some confidence that this change doesn't negatively impact even when applied to both VAR and MVAR would be good.

Sounds good. I pushed a commit that applies canonicalization after the call to GetTypeHandleThrowing() returns, so this covers VAR, MVAR, and more. I'm running GCSTRESS with outerloop right now, and we'll see how this goes

fadimounir · 2020-03-31T01:06:19Z

The gcstress results are not as clean as baseline. I'll investigate further

fadimounir · 2020-04-06T20:25:58Z

@davidwrighton I reran the gcstress tests after submitting a fix around pinvokes, and the results are now comparable to the baseline (modulo the failures related to #33366).

I believe the changes should be ready to merge, although I have seen the regression test that I added fail non-deterministically on Windows arm configurations. It's not an AV or anything, but for some reason the test script returns 3 as an exit code, and the test does not emit any stdout. No idea what's going on there. I've been running the test constantly in a loop on a separate Windows arm device, using the same payload from the CI, and haven't seen any crashes. Not sure how to investigate this issue further. There are no dumps from the CI, and it doesn't seem like the test crashed really...
cc @jkotas

Here's the test output:

    GC\Regressions\Github\runtime_32848\runtime_32848\runtime_32848.cmd [FAIL]
      
      Return code:      1
      Raw output file:      D:\h\w\B2610973\w\AF970944\e\GC\Regressions\Reports\GC.Regressions\Github\runtime_32848\runtime_32848\runtime_32848.output.txt
      Raw output:
      BEGIN EXECUTION
       "D:\h\w\B2610973\p\corerun.exe" runtime_32848.dll 
      Expected: 100
      Actual: 3
      END EXECUTION - FAILED
      FAILED
      Test Harness Exitcode is : 1
      To run the test:
      > set CORE_ROOT=D:\h\w\B2610973\p
      > D:\h\w\B2610973\w\AF970944\e\GC\Regressions\Github\runtime_32848\runtime_32848\runtime_32848.cmd
      Expected: True
      Actual:   False
      Stack Trace:
        F:\workspace\_work\1\s\artifacts\tests\coreclr\Windows_NT.arm.Checked\TestWrappers\GC.Regressions\GC.Regressions.XUnitWrapper.cs(140,0): at GC_Regressions._Github_runtime_32848_runtime_32848_runtime_32848_._Github_runtime_32848_runtime_32848_runtime_32848_cmd()
      Output:
        
        Return code:      1
        Raw output file:      D:\h\w\B2610973\w\AF970944\e\GC\Regressions\Reports\GC.Regressions\Github\runtime_32848\runtime_32848\runtime_32848.output.txt
        Raw output:
        BEGIN EXECUTION
         "D:\h\w\B2610973\p\corerun.exe" runtime_32848.dll 
        Expected: 100
        Actual: 3
        END EXECUTION - FAILED
        FAILED
        Test Harness Exitcode is : 1
        To run the test:
        > set CORE_ROOT=D:\h\w\B2610973\p
        > D:\h\w\B2610973\w\AF970944\e\GC\Regressions\Github\runtime_32848\runtime_32848\runtime_32848.cmd

davidwrighton · 2020-04-07T00:34:18Z

I suspect something very fishy is happening with test startup somewhere. Could you modify the test to always print out a message at the start of Main? That might help show the problem.
The other detail is that error code 3 in windows is ERROR_PATH_NOT_FOUND which indicates to me that something might be wrong with distributing binaries to this test hardware.

src/coreclr/tests/src/GC/Regressions/Github/runtime_32848/runtime_32848.csproj

fadimounir · 2020-04-07T16:42:42Z

@davidwrighton Same issue with the test failure here. I added a Console.WriteLine at the beginning of Main, but still that is not showing in the stdout.
@MattGal Would you have any suggestions on how to fix this, or what the underlying issue is?

MattGal · 2020-04-07T16:53:38Z

@davidwrighton Same issue with the test failure here. I added a Console.WriteLine at the beginning of Main, but still that is not showing in the stdout.
@MattGal Would you have any suggestions on how to fix this, or what the underlying issue is?

Taking a look

MattGal · 2020-04-07T17:21:54Z

I had a quick phone chat with Fadi. I'm going to take one of our machines that ran this earlier offline and try out these bits on it, and synch up with Fadi later on. My naive guess here is that the precommands being included might be messing up the current path.

MattGal · 2020-04-07T19:03:12Z

I had a quick phone chat with Fadi. I'm going to take one of our machines that ran this earlier offline and try out these bits on it, and synch up with Fadi later on. My naive guess here is that the precommands being included might be messing up the current path.

The issue is a CHK assert coming out of the runtime, which when dismissed by the dialog handler on the machine manifests as exit code 3. I'm working with Fadi on ideas but it's clear the machines we have have a newer OS and different hardware, so that is likely the source of the difference between local and remote repro of the bug.

fadimounir · 2020-04-07T19:22:38Z

I'm working with Fadi on ideas

I changed all calls to assert() in the problematic file to _ASSERTE, which first of all does not show the assert GUI dialog, and has a better chance of causing the infrastructure to save a crash dump and associate that with the test results. Hopefully we get a crash dump this time, so we're able to figure out what's going on.

fadimounir · 2020-04-08T21:49:30Z

I changed all calls to assert() in the problematic file to _ASSERTE

No crash dumps with _ASSERTE. I'll add some dummy code for now that will cause an AV, so I can get a dump for the failure

…rgumentLevel is TRUE

fadimounir · 2020-04-09T21:00:37Z

With Matt's help, I was able to get my hands on a crash dump and look at the callstack:

0:000> !clrstack -f
OS Thread Id: 0x2b3c (0)
Child SP       IP Call Site
04E0C998 68446D20 coreclr!common_assert_to_message_box + 120 at minkernel\crts\ucrt\src\appcrt\startup\assert.cpp:389
04E0CE50 68446C9F coreclr!common_assert + 79 at minkernel\crts\ucrt\src\appcrt\startup\assert.cpp:424
04E0CE70 68447BEB coreclr!_wassert + 27 at minkernel\crts\ucrt\src\appcrt\startup\assert.cpp:444
04E0CE90 68359225 coreclr!GCToOSInterface::GetCurrentProcessorNumber + 57 at F:\workspace\_work\1\s\src\coreclr\src\vm\gcenv.os.cpp:274
04E0CEA8 6839D0C3 coreclr!SVR::heap_select::select_heap + 19 at F:\workspace\_work\1\s\src\coreclr\src\gc\gc.cpp:5267
04E0CED0 683866CF coreclr!SVR::gc_heap::balance_heaps + 31 at F:\workspace\_work\1\s\src\coreclr\src\gc\gc.cpp:14174
04E0CF28 6838281B coreclr!SVR::gc_heap::allocate_more_space + 31 at F:\workspace\_work\1\s\src\coreclr\src\gc\gc.cpp:14533
04E0CF50 6837D731 coreclr!SVR::GCHeap::Alloc + 273 at F:\workspace\_work\1\s\src\coreclr\src\gc\gc.cpp:37197
04E0CF90 68236FC5 coreclr!Alloc + 205 at F:\workspace\_work\1\s\src\coreclr\src\vm\gchelpers.cpp:242
04E0CFB8 682377B1 coreclr!AllocateObject + 157 at F:\workspace\_work\1\s\src\coreclr\src\vm\gchelpers.cpp:1024
04E0CFE8 681CF3F1 coreclr!JIT_New + 209 at F:\workspace\_work\1\s\src\coreclr\src\vm\jithelpers.cpp:2313
04E0D024          [HelperMethodFrame: 04e0d024] 
04E0D0C0 299ad68a system.console.dll!System.Console..cctor() + 30
04E0DBFC          [HelperMethodFrame: 04e0dbfc] 
04E0DC98 299ad590 system.console.dll!System.Console.get_Out() + 40
04E0DCC0 299ad140 system.console.dll!System.Console.WriteLine(System.String) + 24
04E0DCD8 299acdbe runtime_32848.dll!Test.Main() + 34

There's nothing special here: it's a simple Console.WriteLine that is triggering the assert, and the only special thing is that server GC is enabled. I also couldn't find any evidence of heap corruption. There is a pretty large number of threads though (27 threads). It could be the case that this is just hitting a corner case bug in the OS. The processor number returned by the GetCurrentProcessorNumberEx() API is 0x70, which doesn't make sense.

So here's what i'm going to do:

Package a repro and as much info and send that to some contact on the Windows team (@davidwrighton you mentioned you knew someone there, could you please send me some contact info)?
Remove complus_gcserver from the test. The bug exists with non-server gc, and we have asserts in place on debug builds that would catch the issue (no crashes like the server GC case).

Also changing a bunch of assert() calls to _ASSERTE. Usually when _ASSERTE fails in CI lab runs, we tend to get crash dumps associated with test results, unlike assert() which shows a GUI dialog that DHandler dismisses by clicking on the Abort button.

The test will run with gcstress as part of the gcstress lab runs

fadimounir requested a review from davidwrighton March 18, 2020 20:32

fadimounir added area-GC-coreclr tenet-reliability Reliability/stability related issue (stress, load problems, etc.) labels Mar 18, 2020

jkotas reviewed Mar 18, 2020

View reviewed changes

src/coreclr/src/vm/frames.h Outdated Show resolved Hide resolved

jkotas reviewed Mar 18, 2020

View reviewed changes

src/coreclr/src/vm/frames.cpp Outdated Show resolved Hide resolved

fadimounir changed the title ~~Use canonical method in TransitionFrames whenever we parse signatures~~ [WIP] Use canonical method in TransitionFrames whenever we parse signatures Mar 18, 2020

jaredpar mentioned this pull request Mar 23, 2020

Builds failing with "remote provider unable to process the request" #33965

Closed

fadimounir closed this Mar 25, 2020

fadimounir reopened this Mar 25, 2020

This was referenced Apr 3, 2020

Exception handling not working correctly with InlinedCallFrames #34524

Closed

Fix incorrect assumption around the presence of ICF frames in EH codebase for 64-bit targets #34526

Merged

fadimounir force-pushed the Fix32848 branch from a0ea184 to 92737bd Compare April 6, 2020 03:09

fadimounir changed the title ~~[WIP] Use canonical method in TransitionFrames whenever we parse signatures~~ Use canonical method in TransitionFrames whenever we parse signatures Apr 6, 2020

fadimounir force-pushed the Fix32848 branch from 92737bd to dafa0b0 Compare April 7, 2020 02:57

janvorli reviewed Apr 7, 2020

View reviewed changes

src/coreclr/tests/src/GC/Regressions/Github/runtime_32848/runtime_32848.csproj Outdated Show resolved Hide resolved

riarenas mentioned this pull request Apr 8, 2020

Failures to restore from Azure artifacts feeds due to throttling dotnet/arcade#4190

Closed

Canonicalize arguments before loading instantiation when dropGenericA…

a04b965

…rgumentLevel is TRUE

fadimounir force-pushed the Fix32848 branch from 1bfdcb7 to b2b45f7 Compare April 9, 2020 07:49

fadimounir force-pushed the Fix32848 branch from b2b45f7 to a26ac47 Compare April 9, 2020 21:05

davidwrighton approved these changes Apr 10, 2020

View reviewed changes

Remove gcstress switch.

5853df3

The test will run with gcstress as part of the gcstress lab runs

fadimounir merged commit 7fe4e5b into dotnet:master Apr 11, 2020

fadimounir mentioned this pull request Apr 11, 2020

Application exception exit: segfault #32848

Closed

am11 mentioned this pull request Jul 28, 2020

Redirect DllImport of hostpolicy to the main executable when it's embedded #40014

Merged

ghost locked as resolved and limited conversation to collaborators Dec 10, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use canonical method in TransitionFrames whenever we parse signatures #33733

Use canonical method in TransitionFrames whenever we parse signatures #33733

fadimounir commented Mar 18, 2020

Dotnet-GitSync-Bot commented Mar 18, 2020

jkotas commented Mar 18, 2020

fadimounir commented Mar 18, 2020

fadimounir commented Mar 18, 2020

davidwrighton commented Mar 19, 2020

fadimounir commented Mar 19, 2020

davidwrighton commented Mar 20, 2020

fadimounir commented Mar 20, 2020

fadimounir commented Mar 20, 2020

davidwrighton commented Mar 20, 2020

jkotas commented Mar 20, 2020

fadimounir commented Mar 20, 2020 •

edited

Loading

davidwrighton commented Mar 20, 2020

fadimounir commented Mar 20, 2020

fadimounir commented Mar 31, 2020

fadimounir commented Apr 6, 2020 •

edited

Loading

davidwrighton commented Apr 7, 2020

fadimounir commented Apr 7, 2020

MattGal commented Apr 7, 2020

MattGal commented Apr 7, 2020

MattGal commented Apr 7, 2020

fadimounir commented Apr 7, 2020

fadimounir commented Apr 8, 2020

fadimounir commented Apr 9, 2020

Use canonical method in TransitionFrames whenever we parse signatures #33733

Use canonical method in TransitionFrames whenever we parse signatures #33733

Conversation

fadimounir commented Mar 18, 2020

Dotnet-GitSync-Bot commented Mar 18, 2020

jkotas commented Mar 18, 2020

fadimounir commented Mar 18, 2020

fadimounir commented Mar 18, 2020

davidwrighton commented Mar 19, 2020

fadimounir commented Mar 19, 2020

davidwrighton commented Mar 20, 2020

fadimounir commented Mar 20, 2020

fadimounir commented Mar 20, 2020

davidwrighton commented Mar 20, 2020

jkotas commented Mar 20, 2020

fadimounir commented Mar 20, 2020 • edited Loading

davidwrighton commented Mar 20, 2020

fadimounir commented Mar 20, 2020

fadimounir commented Mar 31, 2020

fadimounir commented Apr 6, 2020 • edited Loading

davidwrighton commented Apr 7, 2020

fadimounir commented Apr 7, 2020

MattGal commented Apr 7, 2020

MattGal commented Apr 7, 2020

MattGal commented Apr 7, 2020

fadimounir commented Apr 7, 2020

fadimounir commented Apr 8, 2020

fadimounir commented Apr 9, 2020

fadimounir commented Mar 20, 2020 •

edited

Loading

fadimounir commented Apr 6, 2020 •

edited

Loading