Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SOS exception -- g_ExtControl is null #5193

Closed
bendono opened this issue Feb 26, 2016 · 14 comments
Closed

SOS exception -- g_ExtControl is null #5193

bendono opened this issue Feb 26, 2016 · 14 comments
Assignees
Milestone

Comments

@bendono
Copy link
Contributor

bendono commented Feb 26, 2016

I synced and rebuilt both coreclr and corefx today. While testing a scenario, my app crashed. I opened up the dump for analysis and loaded SOS. I noticed that no matter what the first SOS command is, this results in the following error:

c0000005 Exception in D:\git\coreclr-demo\runtime\sos.threads debugger extension.
PC: 00007ffec8b2a504 VA: 0000000000000000 R/W: 0 Parameter: 00000000`00000000

I have seen this error many times in full framework as well, but this time I had the full private symbols, so I decided to analyze it.
I attached another season of WinDbg to the current one and set a break on c0000005. (sxe c0000005)
Reproducing the issue, it stopped as follows.

(2c88.2bf0): Access violation - code c0000005 (first chance)
First chance exceptions are reported before any exception handling.
This exception may be expected and handled.
sos!IsInterrupt+0x14:
00007ffec8b2a504 488b00 mov rax,qword ptr [rax] ds:0000000000000000=????????????????`

The stack trace for this is as follows:

0:004> k
Child-SP RetAddr Call Site
0000000a9c2fd2e0 00007ffec8bd147e sos!IsInterrupt+0x14 [d:\git\coreclr\src\toolbox\sos\strike\exts.h @ 170]
0000000a9c2fd320 00007ffec8bbf462 sos!GetCMDOption+0x6e [d:\git\coreclr\src\toolbox\sos\strike\util.cpp @ 3948]
0000000a9c2fd400 00007ffed0a0353b sos!Threads+0x262 [d:\git\coreclr\src\toolbox\sos\strike\strike.cpp @ 5983]
0000000a9c2fd510 00007ffed0a03718 dbgeng!ExtensionInfo::CallA+0x233
0000000a9c2fd5d0 00007ffed0a037f8 dbgeng!ExtensionInfo::Call+0x16c
0000000a9c2fd7d0 00007ffed0a02689 dbgeng!ExtensionInfo::CallAny+0x78
0000000a9c2fd810 00007ffed0a3a89b dbgeng!ParseBangCmd+0x4a9
0000000a9c2fdcd0 00007ffed0a3b6ab dbgeng!ProcessCommands+0xa8f
0000000a9c2fdda0 00007ffed0995fe8 dbgeng!ProcessCommandsAndCatch+0x8f
0000000a9c2fde10 00007ffed099628f dbgeng!Execute+0x24c
0000000a9c2fe2e0 00007ff7329c5c72 dbgeng!DebugClient::ExecuteWide+0x83
0000000a9c2fe340 00007ff7329c60d5 windbg!ProcessCommand+0x2b2
0000000a9c2fe760 00007ff7329c7c17 windbg!ProcessEngineCommands+0x185
0000000a9c2ff7a0 00007ffef8408102 windbg!EngineLoop+0x3e3
0000000a9c2ff7e0 00007ffefc23c574 KERNEL32!BaseThreadInitThunk+0x22
0000000a9c2ff810 0000000000000000 ntdll!RtlUserThreadStart+0x34

The source code window popped up for coreclr\src\toolbox\sos\strike\exts.h:

inline BOOL IsInterrupt()
{
if (!ControlC && g_ExtControl->GetInterrupt() == S_OK)
{
ExtOut("Command cancelled at the user's request.\n");
ControlC = TRUE;
}

return ControlC;

}

Checking the variables, apparently g_ExtControl is null.
I do not know enough about SOS internals, but this only occurs once.
If you repeat the same command, it no longer occurs, so it appears to be an initialization issue.

A search online reveals that many other people have encountered this exception as well. (Though in full.)
Could this be resolved by 1) initializing g_ExtControl earlier and/or 2) adding a null check here?

@bendono
Copy link
Contributor Author

bendono commented Feb 26, 2016

CC @mikem8361 You seem to be the SOS expert. When you have time, would you be able to take a look at this? Thanks.

@mikem8361
Copy link
Member

I've tried to repro this and no luck. Looking at the stack trace you included tells me (and looking at my recent changes to see if there is any possible way they broke something) that g_ExtControl should have been properly initialized when we entered "sos!threads" (or any other command). Adding a null check in is IsInterrupt would just hide/delay the crash until the next access of g_ExtControl (happens in a lot of the other sos code).

Can I get more details about your repro, version of windbg and anything else that might help me repro this? It is currently a mystery to me.

@mikem8361
Copy link
Member

If you had time, could you debug it yourself a little? If you do, set a breakpoint at the beginning of the "Threads" command on or before the INIT_API macro in strike.cpp and a bp on IsInterrupt(). When you hit the one in Threads, step into the call to ExtQuery. You should see/step over the code that initializes g_ExtControl. Now maybe it is being trashed/cleared by some code that I'm not aware so setting a watchpoint on it and continuing may tell us what is going on. Thanks.

@bendono
Copy link
Contributor Author

bendono commented Feb 27, 2016

@mikem8361, thank you for taking a look at it. Based on your comments, I have attempted to debug this further. It appears that g_ExtControl is being cleared in ExtQuery:
SOS_ExtQueryFailGo(g_ExtControl, IDebugControl2);

The callstack is as follows:

0:004> k
Child-SP RetAddr Call Site
000000d5fc37bd10 00007ffec78c4529 sos!ExtQuery+0x18 [d:\git\coreclr\src\toolbox\sos\strike\exts.cpp @ 81]
000000d5fc37bd90 00007ffec8e3695b sos!DebugExtensionInitialize+0xf9 [d:\git\coreclr\src\toolbox\sos\strike\exts.cpp @ 225]
000000d5fc37be50 00007ffec8e37e96 dbgeng!DebugCreate+0x8355b
000000d5fc37c340 00007ffec8e8797b dbgeng!DebugCreate+0x84a96
000000d5fc37c380 00007ffec8e88758 dbgeng!DebugCreate+0xd457b
000000d5fc37c5e0 00007ffec8e33a43 dbgeng!DebugCreate+0xd5358
000000d5fc37ce30 00007ffec79482da dbgeng!DebugCreate+0x80643
000000d5fc37d390 00007ffec792f2e6 sos!LoadClrDebugDll+0x3a [d:\git\coreclr\src\toolbox\sos\strike\util.cpp @ 4194]
000000d5fc37d3e0 00007ffec8e37203 sos!Threads+0xe6 [d:\git\coreclr\src\toolbox\sos\strike\strike.cpp @ 5969]
000000d5fc37d4f0 00007ffec8e37400 dbgeng!DebugCreate+0x83e03
000000d5fc37d5b0 00007ffec8e374d4 dbgeng!DebugCreate+0x84000
000000d5fc37d7b0 00007ffec8e361a3 dbgeng!DebugCreate+0x840d4
000000d5fc37d7f0 00007ffec8e7f0d0 dbgeng!DebugCreate+0x82da3
000000d5fc37dcb0 00007ffec8e801c5 dbgeng!DebugCreate+0xcbcd0
000000d5fc37de00 00007ffec8dbc5a5 dbgeng!DebugCreate+0xccdc5
000000d5fc37de70 00007ffec8dbc8b3 dbgeng!DebugCreate+0x91a5
000000d5fc37e350 00007ff60103a0cc dbgeng!DebugCreate+0x94b3
000000d5fc37e3b0 00007ff60103a5cc windbg+0x1a0cc
000000d5fc37e7e0 00007ff60103caf9 windbg+0x1a5cc
000000d5fc37f830 00007ffef8408102 windbg+0x1caf9
000000d5fc37f8e0 00007ffefc23c574 KERNEL32!BaseThreadInitThunk+0x22
000000d5fc37f910 0000000000000000 ntdll!RtlUserThreadStart+0x34

You'll notice from the line numbers that this is at the start of Threads, well before g_ExtControl is needed for GetCMDOption --> IsInterrupt.

Digging a little further, SOS_ExtQueryFailGo appears to be just trying to get the interface, IDebugControl2 in this case. Since it is setting g_ExtControl to NULL, I would assume that it is failing. However, Status is 0 (S_OK), so the jump to Fail never occurs and it continues on to the following interfaces.

By the way, this is not due to a recent change. I have been seeing it on a weekly basis for 3-4 years. However, this is the first time that I have been able to confirm it in CoreCLR. On CoreCLR, I am on Windows 10 (upgraded from Windows 7). In full, I am usually working on Windows 7 or Server 2008 R2 or 2012 R2 environments, but the results are the same. WinDbg version is the latest from the Windows 10 SDK: 10.0.10075.9. I have also seen it in the earlier 6.3.9600 (Windows 7? 8?) versions as well.

@mikem8361
Copy link
Member

That means windbg/dbgeng isn't returning the IDebugControl2 interface and it is really broken that it isn't returning an error from QueryInterface. You could try upgrading to a newer version windbg. sos isn't doing anything wrong as far as I can tell.

@bendono
Copy link
Contributor Author

bendono commented Feb 27, 2016

Experimenting with this a little more, I noticed the following:
If I load sos with ".loadby sos coreclr", the first attempt at (for example) !sos.threads fails.
If I load sos with ".loadby sos.dll coreclr", the above error does not occur at the first attempt at (for example) !sos.threads..
I typically prefer the former (without the extension) as it is shorter, which is perhaps why I so frequently encounter this.
I'll check this hypothesis on a few other machines at work next week to see if it is not just a coincidence.

@mikem8361
Copy link
Member

The only thing I could do is to also check that the IDebugControl2 interface returned from the QI is null and return an error. This doesn’t actually fix the problem (which seems to be in Windbg which I have no control over) but your would get an error message instead of crashing.

@bendono
Copy link
Contributor Author

bendono commented Feb 28, 2016

Hi @mikem8361,

I do not think that checking the interface for NULL would be a worthwhile change. I'm more interested in a solution where it does not need to fail (even gracefully) during initialization. I am still looking into this, but first a few comments:

  • I updated WinDbg to the latest public release 10.0.10586.567. No change.
  • Threads has a client (IDebugClient*) argument. Rather than initializing IDebugControl2 from the client returned from DebugCreate, querying the client argument for IDebugControl2 succeeds on the first try for me.
  • Defrag Tools dotnet/runtime#3902 - Writing a Debugger Extension Part 3 indicates 1) that the WdbgExt API (DebugCreate etc) is deprecated and 2) not to save the interfaces returned from DebugCreate in global variables (around 10:30 mark).

Initialization is done via deprecated DebugCreate and the required interfaces are derived from it and saved globally. I wonder if this could be the result of the combination of the two points above. The fact that querying for IDebugControl2 from the client argument rather than the global client suggests that this may the right track. However, removing the global and passing the client argument around would be a significant change. Do you have any thoughts on this?

@mikem8361
Copy link
Member

Actually the Threads command and all sos commands do use the IDebugClient that is passed in from windbg to query IDebugControl2 in ExtQuery(). DebugCreate is only used in to get the WdbgExts API to evaluate expressions used by a few commands and I’m sure that DebugCreate hasn’t been deprecated and pretty sure the WdbgExt API hasn’t either.

I’m not sure what is going on. I looked at some older windbg code I have lying around (not sure what version and I currently don’t have access to the latest) for the IDebugClient QueryInterface and there is no way it could return NULL for the interface and S_OK for the return code.

@bendono
Copy link
Contributor Author

bendono commented Mar 4, 2016

I have taken a more extended look into this issue. It seems that DebugExtensionInitialize is being called twice: once when sos is loaded (.loadby sos coreclr)--which is to be expected--, and once again by the first sos commandlet. It is during this second call that g_ExtControl has a valid address but is later reset to null. The callstack for the second call is as follows:

0:003> k

Child-SP RetAddr Call Site

00 000000f40b2fc1f0 00007ffec8654529 sos!ExtQuery+0x23 [d:\git\coreclr\src\toolbox\sos\strike\exts.cpp @ 81]
01 000000f40b2fc270 00007ffec8accf64 sos!DebugExtensionInitialize+0xf9 [d:\git\coreclr\src\toolbox\sos\strike\exts.cpp @ 232]
02 000000f40b2fc330 00007ffec8ace4a2 dbgeng!ExtensionInfo::Load+0x5d0
03 000000f40b2fc820 00007ffec8b16e77 dbgeng!ExtensionInfo::CheckAdd+0x6e
04 000000f40b2fc860 00007ffec8b17ce4 dbgeng!LoadSOSAndCheckVer+0x3b
05 000000f40b2fcac0 00007ffec8ac9ee2 dbgeng!ProcessInfo::LoadClrDebugDllForExt+0xc84
06 000000f40b2fd310 00007ffec86d82da dbgeng!ExtIoctl+0xd52
07 000000f40b2fd870 00007ffec86bf2e6 sos!LoadClrDebugDll+0x3a [d:\git\coreclr\src\toolbox\sos\strike\util.cpp @ 4194]
08 000000f40b2fd8c0 00007ffec8acd82d sos!Threads+0xe6 [d:\git\coreclr\src\toolbox\sos\strike\strike.cpp @ 5969]
09 000000f40b2fd9d0 00007ffec8acda54 dbgeng!ExtensionInfo::CallA+0x2a5
0a 000000f40b2fda90 00007ffec8acdb38 dbgeng!ExtensionInfo::Call+0x16c
0b 000000f40b2fdc90 00007ffec8acc6d5 dbgeng!ExtensionInfo::CallAny+0x78
0c 000000f40b2fdcd0 00007ffec8b0eb28 dbgeng!ParseBangCmd+0x4d1
0d 000000f40b2fe190 00007ffec8b0fd68 dbgeng!ProcessCommands+0xf48
0e 000000f40b2fe2c0 00007ffec8a412e4 dbgeng!ProcessCommandsAndCatch+0xfc
0f 000000f40b2fe340 00007ffec8a41603 dbgeng!Execute+0x2d4
10 000000f40b2fe820 00007ff7869ba44a dbgeng!DebugClient::ExecuteWide+0x83
11 000000f40b2fe880 00007ff7869ba954 windbg!ProcessCommand+0x306
12 000000f40b2feca0 00007ff7869bcdfd windbg!ProcessEngineCommands+0x16c
13 000000f40b2ffcf0 00007ffef8408102 windbg!EngineLoop+0x5ed
14 000000f40b2ffd60 00007ffefc23c574 KERNEL32!BaseThreadInitThunk+0x22
15 000000f40b2ffd90 0000000000000000 ntdll!RtlUserThreadStart+0x34

It looks like the call to LoadClrDebugDll is what ultimately forces a second initialization. It is the following line that sets this off:
if (!Ioctl(IG_GET_CLR_DATA_INTERFACE, &Query, sizeof(Query)))

Is there anyway to prevent this second initialization?

There is a simplistic work around for this.
Add the following global:
bool g_isIntialized = false;
Then at the top of DebugExtensionInitialize I added the following:
if (g_isIntialized)
{
return S_OK;
}
Finally, at the end of the function, just before return S_OK, I set it true:
g_isIntialized = true;

I recompiled and tested again.
The access violation no longer occurs for me, and all future commandlets seem fine as well.

Ideally DebugExtensionInitialize should only be called once, I think.
But if that cannot be fixed, the workaround is simple enough.
Would such a PR be accepted?

@bendono
Copy link
Contributor Author

bendono commented Mar 4, 2016

And now that I know what to look for, it seems that someone else has also come to same conclusion. But they have provided much fuller analysis into why DebugExtensionInitialize is being called twice.

Link: https://social.msdn.microsoft.com/Forums/en-us/46723fca-8d7c-4d61-b447-14e3bef01df8/windbg-access-violation-exception-0xc0000005-when-running-the-clrstack-command?forum=windbg

@mikem8361
Copy link
Member

@bendono Thanks for getting to the bottom of this. I didn't think that sos would be reentered like that because of the IOCtl call.

Your solution of adding an flag to DebugExtensionInitialize sounds great. Please submit a PR or if you want I came get this change in.

@mikem8361
Copy link
Member

PR dotnet/coreclr#3513 fixes this issue.

@bendono
Copy link
Contributor Author

bendono commented Mar 4, 2016

@mikem8361 In conclusion, this looks like a dbgeng issue. Perhaps a proper fix can be done there someday, but in the mean time this workaround resolves the issue for me. I suggest adding a small comment to explain why the code is necessary, but otherwise it LGTM.

@msftgits msftgits transferred this issue from dotnet/coreclr Jan 30, 2020
@msftgits msftgits added this to the 1.0.0-rc2 milestone Jan 30, 2020
@dotnet dotnet locked as resolved and limited conversation to collaborators Jan 2, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants