SOS exception -- g_ExtControl is null #5193
Comments
CC @mikem8361 You seem to be the SOS expert. When you have time, would you be able to take a look at this? Thanks.
I've tried to repro this and had no luck. The stack trace you included (together with a review of my recent changes, to see if there is any possible way they broke something) tells me that g_ExtControl should have been properly initialized when we entered "sos!threads" (or any other command). Adding a null check in IsInterrupt would just hide or delay the crash until the next access of g_ExtControl, which happens in a lot of the other sos code. Can I get more details about your repro, your version of windbg, and anything else that might help me repro this? It is currently a mystery to me.
If you have time, could you debug it yourself a little? If you do, set a breakpoint at the beginning of the "Threads" command, on or before the INIT_API macro in strike.cpp, and a bp on IsInterrupt(). When you hit the one in Threads, step into the call to ExtQuery. You should see/step over the code that initializes g_ExtControl. Maybe it is being trashed or cleared by some code that I'm not aware of, so setting a watchpoint on it and continuing may tell us what is going on. Thanks.
@mikem8361, thank you for taking a look at it. Based on your comments, I have attempted to debug this further. It appears that g_ExtControl is being cleared in ExtQuery. The callstack is as follows: 0:004> k You'll notice from the line numbers that this is at the start of Threads, well before g_ExtControl is needed for GetCMDOption --> IsInterrupt. Digging a little further, SOS_ExtQueryFailGo appears to be just trying to get the interface, IDebugControl2 in this case. Since it is setting g_ExtControl to NULL, I would assume that it is failing. However, Status is 0 (S_OK), so the jump to Fail never occurs and it continues on to the following interfaces. By the way, this is not due to a recent change. I have been seeing it on a weekly basis for 3-4 years. However, this is the first time that I have been able to confirm it in CoreCLR. On CoreCLR, I am on Windows 10 (upgraded from Windows 7). On the full framework, I am usually working in Windows 7, Server 2008 R2, or 2012 R2 environments, but the results are the same. The WinDbg version is the latest from the Windows 10 SDK: 10.0.10075.9. I have also seen it in the earlier 6.3.9600 (Windows 7? 8?) versions.
That means windbg/dbgeng isn't returning the IDebugControl2 interface, and it is really broken that it isn't returning an error from QueryInterface. You could try upgrading to a newer version of windbg. sos isn't doing anything wrong as far as I can tell.
Experimenting with this a little more, I noticed the following:
The only thing I could do is to also check whether the IDebugControl2 interface returned from the QI is null and return an error. This doesn’t actually fix the problem (which seems to be in Windbg, which I have no control over), but you would get an error message instead of crashing.
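To make the proposed check concrete, here is a minimal sketch of treating "S_OK with a null interface pointer" as a failure. It is not the actual sos code: `HResult`, `FakeControl`, `BrokenQuery`, `WorkingQuery`, and `CheckedQuery` are hypothetical stand-ins so the example compiles without the Windows SDK, and only the two constant values mirror the real S_OK and E_NOINTERFACE.

```cpp
#include <cstdint>

// Stand-ins for the Windows types so the sketch compiles anywhere.
// Only the numeric values mirror the real S_OK / E_NOINTERFACE.
using HResult = std::int32_t;
constexpr HResult kOk = 0;
constexpr HResult kNoInterface = static_cast<HResult>(0x80004002u);

struct FakeControl {};  // hypothetical placeholder for IDebugControl2

// Hypothetical model of the misbehaving query described in this thread:
// it reports success but leaves the out-pointer null.
inline HResult BrokenQuery(FakeControl** out) {
    *out = nullptr;
    return kOk;
}

// Hypothetical well-behaved query, for comparison.
inline HResult WorkingQuery(FakeControl** out) {
    static FakeControl control;
    *out = &control;
    return kOk;
}

// Wraps a query so that "success with a null interface" becomes an error
// the caller can report, instead of a later null dereference.
template <typename QueryFn>
HResult CheckedQuery(QueryFn query, FakeControl** out) {
    HResult status = query(out);
    if (status >= 0 && *out == nullptr) {
        return kNoInterface;
    }
    return status;
}
```

With this wrapper, `CheckedQuery(BrokenQuery, &ptr)` returns the error code, so a command could bail out with a message. As noted above, that only changes how the failure surfaces; it does not address why the pointer ends up null.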
Hi @mikem8361, I do not think that checking the interface for NULL would be a worthwhile change. I'm more interested in a solution where it does not need to fail (even gracefully) during initialization. I am still looking into this, but first a few comments:
Initialization is done via the deprecated DebugCreate, and the required interfaces are derived from it and saved globally. I wonder if this could be the result of the combination of the two points above. The fact that querying for IDebugControl2 from the client argument rather than the global client suggests that this may be the right track. However, removing the global and passing the client argument around would be a significant change. Do you have any thoughts on this?
Actually, the Threads command and all sos commands do use the IDebugClient that is passed in from windbg to query IDebugControl2 in ExtQuery(). DebugCreate is only used to get the WdbgExts API to evaluate expressions used by a few commands. I’m sure that DebugCreate hasn’t been deprecated, and pretty sure the WdbgExts API hasn’t either. I’m not sure what is going on. I looked at some older windbg code I have lying around (not sure what version, and I currently don’t have access to the latest) for the IDebugClient QueryInterface, and there is no way it could return NULL for the interface and S_OK for the return code.
I have taken a more extended look into this issue. It seems that DebugExtensionInitialize is being called twice: once when sos is loaded (.loadby sos coreclr), which is to be expected, and once again by the first sos command. It is during this second call that g_ExtControl has a valid address but is later reset to null. The callstack for the second call is as follows: 0:003> k Child-SP RetAddr Call Site00 000000f4 It looks like the call to LoadClrDebugDll is what ultimately forces a second initialization. It is the following line that sets this off: Is there any way to prevent this second initialization? There is a simplistic workaround for this. I recompiled and tested again. Ideally, DebugExtensionInitialize should only be called once, I think.
And now that I know what to look for, it seems that someone else has also come to the same conclusion. But they have provided a much fuller analysis of why DebugExtensionInitialize is being called twice.
@bendono Thanks for getting to the bottom of this. I didn't think that sos would be reentered like that because of the IOCtl call. Your solution of adding a flag to DebugExtensionInitialize sounds great. Please submit a PR, or if you want, I can get this change in.
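For readers following along, the flag-based workaround discussed above can be sketched roughly as follows. This is an illustration only, not the code from the PR: `MockControl`, `g_control`, `g_initialized`, `g_setupRuns`, `ExtensionInitialize`, and `RunCommand` are all hypothetical names that model the behavior described in this thread (a nested initialization resetting a global that a command has already set).

```cpp
// Hypothetical stand-ins; none of these are the real sos/dbgeng symbols.
struct MockControl {};
static MockControl g_realControl;
static MockControl* g_control = nullptr;  // plays the role of g_ExtControl
static bool g_initialized = false;        // the guard flag
static int g_setupRuns = 0;               // counts how often setup really ran

// Models DebugExtensionInitialize: the one-time setup resets global state,
// so running it a second time would wipe whatever a command already acquired.
int ExtensionInitialize() {
    if (g_initialized) {
        return 0;  // already initialized: leave existing state untouched
    }
    g_initialized = true;
    g_control = nullptr;
    ++g_setupRuns;
    return 0;
}

// Models a command that acquires the interface and then indirectly
// re-enters initialization before using the pointer.
bool RunCommand() {
    g_control = &g_realControl;  // like ExtQuery acquiring the interface
    ExtensionInitialize();       // nested call; the guard makes it a no-op
    return g_control != nullptr; // would be false without the guard
}
```

Calling `ExtensionInitialize()` once at load time and then `RunCommand()` leaves `g_control` intact, because the nested call returns early. A plain bool is enough for this single-threaded model; whether the real extension needs anything stronger is not something this thread establishes.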
PR dotnet/coreclr#3513 fixes this issue.
@mikem8361 In conclusion, this looks like a dbgeng issue. Perhaps a proper fix can be done there someday, but in the meantime this workaround resolves the issue for me. I suggest adding a small comment to explain why the code is necessary, but otherwise it LGTM.
I synced and rebuilt both coreclr and corefx today. While testing a scenario, my app crashed. I opened up the dump for analysis and loaded SOS. I noticed that no matter what the first SOS command is, this results in the following error:
I have seen this error many times in full framework as well, but this time I had the full private symbols, so I decided to analyze it.
I attached another session of WinDbg to the current one and set a break on c0000005. (sxe c0000005)
Reproducing the issue, it stopped as follows.
The stack trace for this is as follows:
The source code window popped up for coreclr\src\toolbox\sos\strike\exts.h:
Checking the variables, apparently g_ExtControl is null.
I do not know enough about SOS internals, but this only occurs once.
If you repeat the same command, it no longer occurs, so it appears to be an initialization issue.
A search online reveals that many other people have encountered this exception as well. (Though in the full framework.)
Could this be resolved by 1) initializing g_ExtControl earlier and/or 2) adding a null check here?