-
Notifications
You must be signed in to change notification settings - Fork 2.6k
Fix debugger launch race hitting breakpoints in startup code. #8951
Conversation
FYI @gregg-miskelly. If you could somehow verify this fixes your race, that would be great. The change isn't that big (only two files and localized); you could apply it to your VS copy/branch/version of coreclr. |
@dotnet-bot test Ubuntu x64 Checked Build and Test |
@dotnet-bot test OSX x64 Checked Build and Test |
@mikem8361 It will take me a little while to get setup again, but I will try to do that next week. |
@noahfalk could you review? |
I haven't had time to give this deep scrutiny, but it looks like a correct implementation of what we discussed and I'd say go ahead and check it in. |
@gregg-miskelly @noahfalk I just pushed another commit that addresses the launch race condition pointed out by Gregg for another issue they ran into. Could you both look at the new changes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Otherwise LGTM
{ | ||
break; | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we don't find anything, would it make more sense to wait for the startup event to get set? (ex:
CloseCLREnumeration(*ppHandleArrayOut, *ppStringArrayOut, *pdwArrayLengthOut);
*ppHandleArrayOut = NULL;
*ppStringArrayOut = NULL;
*pdwArrayLengthOut = 0;
return S_OK;
If we do decide to retry, we need to free the memory before doing so or we will leak.
BTW: Not sure if it matters, but we might want to port the dbgshim part of the fix to some release branch to produce a new dbgshim that we can ship with VS 2017 RTW. |
I kind of assumed that you may want other fix too in the 1.1 release branch at some point. The breakpoint unreliability seems as bad as this launch race.
|
Definitely. If you can port this to whole PR to the release branch, I will definitely not complain :) But since the ship vehicles are somewhat different (.NET Core vs. VS) figured I would call it out. |
If EnumerateCLRs/InternalEnumerateCLRs returns that it found no runtimes, InvokeStartupCallback will return coreclr doesn’t exist back to the StartupHelperThread which will block on the startup event.
|
StartupHelperThread callings InvokeStartupCallback up to twice. The first call is good as you said -- it will just wait for the startup event. |
I did make the change to turn that assert into a check if coreclrExists and fail if it doesn’t exist on the second call and we will invoke the callback with an error.
|
src/dlls/dbgshim/dbgshim.cpp
Outdated
HANDLE h = (*ppHandleArray)[i]; | ||
if (h != NULL && h != INVALID_HANDLE_VALUE) | ||
{ | ||
break; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you intending to break out of the while loop? This is going to break out of the for loop only. This version looks like it is looping 25 times and then giving S_OK assuming it was successful on the last iteration?
I think you also wanted to retry if any handle was NULL, but if the break applied to the outer while loop this code retries only if all handles are NULL.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for catching that. I've redone the changes and should be correct now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for catching that. I've redone the changes and should be correct now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for catching that. I've redone the changes and should be correct now.
531562a
to
dcc26c0
Compare
Thanks for catching that. I push some changes fixes it.
|
src/dlls/dbgshim/dbgshim.cpp
Outdated
for (int i = 0; i < (int)arrayLength; i++) | ||
{ | ||
HANDLE h = handleArray[i]; | ||
if (h == NULL) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this code treat NULL and INVALID_HANDLE_VALUE the same?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wasn't really sure if it should check for INVALID_HANDLE_VALUE. It looks to me like the only way it could be INVALID_HANDLE_VALUE if the CreateEvent fails on the runtime side. For that unlikely condition (the runtime has a CONSISTENCY_CHECK for it which I think is DEBUG only), dbgshim should probably fail instead of sleep/retry.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure, but the comment says that it could. Fix the comment?
// fill in event handle -- if GetContinueStartupEvent fails, it will still return
// INVALID_HANDLE_VALUE in hContinueStartupEvent, which is what we want. we don't
// want to bail out of the enumeration altogether if we can't get an event from
// one telesto.
HANDLE hContinueStartupEvent = INVALID_HANDLE_VALUE;
HRESULT hr = GetContinueStartupEvent(debuggeePID, pStringArray[idx], &hContinueStartupEvent);
_ASSERTE(SUCCEEDED(hr) == (hContinueStartupEvent != INVALID_HANDLE_VALUE));
pEventArray[idx] = hContinueStartupEvent;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, it looks like there are at least some code paths where failure would be INVALID_HANDLE_VALUE (unless I am missing something)
src/dlls/dbgshim/dbgshim.cpp
Outdated
// If EnumerateCLRs succeeded but any of the handles are null, then sleep and retry also. This fixes a race | ||
// condition where dbgshim catches the coreclr module just being loaded but before g_hContinueStartupEvent | ||
// has been initialized. | ||
if (NoNullHandles(*ppHandleArray, *pdwArrayLength)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reading EnumerateCLRs, it looks like we just set the handles to NULL on Unix. So should this code be ifdefed to always return hr;
on Unix?
#ifndef FEATURE_PAL
// fill in event handle -- if GetContinueStartupEvent fails, it will still return
// INVALID_HANDLE_VALUE in hContinueStartupEvent, which is what we want. we don't
...
#else
pEventArray[idx] = NULL;
#endif // FEATURE_PAL
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On unix, this function isn't used by the RegisterForRuntimeStartup and used in general. The only thing we can return is the array of module name strings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, okay.
I was wrong (yet again). I didn’t look at GetContinueStartupEvent close enough. There are other conditions that would return INVALID_HANDLE_VALUE.
|
It looks like I have to back out the "if any handle is null, wait/retry" logic because it will get hit when for a late attach (the runtime NULLs the continue event after the check if launched/wait logic). This will cause delays for late attach. And it seems to be causing almost all of the debugger tests to fail. They are launching their debuggees and the logging I've added shows that everything is normal (no null handles/no retries). I'm still trying to come up with an explanation. |
{ | ||
HANDLE h = handleArray[i]; | ||
if (h == NULL) | ||
if (h == INVALID_HANDLE_VALUE) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the case of older CoreCLR's, this will still be NULL, so should we check both?
I should have commented on this better. The main problem with the original code that checked for NULL is that it causes “attach” problems. The g_hContinueStartupEvent is closed and set to NULL after the check for named startup event/wait on continue event logic. So, late attaches (as opposed to launch or attaches before coreclr is loaded) will go through the sleep/retry logic in InternalEnumerateCLRs and then after 41secs or so fail with the timeout code.
The launch race won’t be fixed if the new dbgshim is used against an older coreclr, but it will still work as before without any problems. I’m hoping to get all of these changes into our next service release.
|
Ah. Makes sense. Thanks for the clarification. |
400cac0
to
383cc0c
Compare
@noahfalk (and @gregg-miskelly optionally). This is ready for a final code review now. I decided to leave initializing g_hContinueStartupEvent = INVALID_HANDLE_VALUE (-1) because it looks like it does get loaded that way by the loader. It should be just all the changes in dbgshim.cpp and in debugger.cpp. |
src/dlls/dbgshim/dbgshim.cpp
Outdated
{ | ||
return hr; | ||
} | ||
// If EnumerateCLRs succeeded but any of the handles are null, then sleep and retry also. This fixes a race |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
null [](start = 73, length = 4)
I think you meant to say INVALID_HANDLE_VALUE
|
The attached flag was been set asynchronously relative to the DebugActiveProcess returning. This could cause a race where the initial module load notification being missed/not sent to the debugger. This fix sets the attached flag before any notifications sent during launch if the runtime was launched/attached using the startup handshake after dbgshim tells the runtime to "continue" when the runtime startup API callback returns. Also fixes another race condition in dbgshim where EnumerateCLRs returns a NULL continue event handle because the coreclr module was loaded but the g_hContinueStartupEvent wasn't initialized on the runtime side yet. Changed the static initialization of g_hContinueStartupEvent to INVALID_HANDLE_VALUE and the InternalEnumerateCLRs sleep/retry loop to retry when any of the handles are INVALID_HANDLE_VALUE. This fixes the race only when you have the latest dbgshim and coreclr binaries and the old/new mixes still function but don't fix the race.
@dotnet-bot test Windows_NT x64 Debug Build and Test |
383cc0c
to
e2cd068
Compare
…#8951) The attached flag was been set asynchronously relative to the DebugActiveProcess returning. This could cause a race where the initial module load notification being missed/not sent to the debugger. This fix sets the attached flag before any notifications sent during launch if the runtime was launched/attached using the startup handshake after dbgshim tells the runtime to "continue" when the runtime startup API callback returns. Also fixes another race condition in dbgshim where EnumerateCLRs returns a NULL continue event handle because the coreclr module was loaded but the g_hContinueStartupEvent wasn't initialized on the runtime side yet. Changed the static initialization of g_hContinueStartupEvent to INVALID_HANDLE_VALUE and the InternalEnumerateCLRs sleep/retry loop to retry when any of the handles are INVALID_HANDLE_VALUE. This fixes the race only when you have the latest dbgshim and coreclr binaries and the old/new mixes still function but don't fix the race.
…#8951) The attached flag was been set asynchronously relative to the DebugActiveProcess returning. This could cause a race where the initial module load notification being missed/not sent to the debugger. This fix sets the attached flag before any notifications sent during launch if the runtime was launched/attached using the startup handshake after dbgshim tells the runtime to "continue" when the runtime startup API callback returns. Also fixes another race condition in dbgshim where EnumerateCLRs returns a NULL continue event handle because the coreclr module was loaded but the g_hContinueStartupEvent wasn't initialized on the runtime side yet. Changed the static initialization of g_hContinueStartupEvent to INVALID_HANDLE_VALUE and the InternalEnumerateCLRs sleep/retry loop to retry when any of the handles are INVALID_HANDLE_VALUE. This fixes the race only when you have the latest dbgshim and coreclr binaries and the old/new mixes still function but don't fix the race.
…#8951) The attached flag was been set asynchronously relative to the DebugActiveProcess returning. This could cause a race where the initial module load notification being missed/not sent to the debugger. This fix sets the attached flag before any notifications sent during launch if the runtime was launched/attached using the startup handshake after dbgshim tells the runtime to "continue" when the runtime startup API callback returns. Also fixes another race condition in dbgshim where EnumerateCLRs returns a NULL continue event handle because the coreclr module was loaded but the g_hContinueStartupEvent wasn't initialized on the runtime side yet. Changed the static initialization of g_hContinueStartupEvent to INVALID_HANDLE_VALUE and the InternalEnumerateCLRs sleep/retry loop to retry when any of the handles are INVALID_HANDLE_VALUE. This fixes the race only when you have the latest dbgshim and coreclr binaries and the old/new mixes still function but don't fix the race.
…#8951) The attached flag was been set asynchronously relative to the DebugActiveProcess returning. This could cause a race where the initial module load notification being missed/not sent to the debugger. This fix sets the attached flag before any notifications sent during launch if the runtime was launched/attached using the startup handshake after dbgshim tells the runtime to "continue" when the runtime startup API callback returns. Also fixes another race condition in dbgshim where EnumerateCLRs returns a NULL continue event handle because the coreclr module was loaded but the g_hContinueStartupEvent wasn't initialized on the runtime side yet. Changed the static initialization of g_hContinueStartupEvent to INVALID_HANDLE_VALUE and the InternalEnumerateCLRs sleep/retry loop to retry when any of the handles are INVALID_HANDLE_VALUE. This fixes the race only when you have the latest dbgshim and coreclr binaries and the old/new mixes still function but don't fix the race.
…#8951) The attached flag was been set asynchronously relative to the DebugActiveProcess returning. This could cause a race where the initial module load notification being missed/not sent to the debugger. This fix sets the attached flag before any notifications sent during launch if the runtime was launched/attached using the startup handshake after dbgshim tells the runtime to "continue" when the runtime startup API callback returns. Also fixes another race condition in dbgshim where EnumerateCLRs returns a NULL continue event handle because the coreclr module was loaded but the g_hContinueStartupEvent wasn't initialized on the runtime side yet. Changed the static initialization of g_hContinueStartupEvent to INVALID_HANDLE_VALUE and the InternalEnumerateCLRs sleep/retry loop to retry when any of the handles are INVALID_HANDLE_VALUE. This fixes the race only when you have the latest dbgshim and coreclr binaries and the old/new mixes still function but don't fix the race.
…#9062) The attached flag was been set asynchronously relative to the DebugActiveProcess returning. This could cause a race where the initial module load notification being missed/not sent to the debugger. This fix sets the attached flag before any notifications sent during launch if the runtime was launched/attached using the startup handshake after dbgshim tells the runtime to "continue" when the runtime startup API callback returns. Also fixes another race condition in dbgshim where EnumerateCLRs returns a NULL continue event handle because the coreclr module was loaded but the g_hContinueStartupEvent wasn't initialized on the runtime side yet. Changed the static initialization of g_hContinueStartupEvent to INVALID_HANDLE_VALUE and the InternalEnumerateCLRs sleep/retry loop to retry when any of the handles are INVALID_HANDLE_VALUE. This fixes the race only when you have the latest dbgshim and coreclr binaries and the old/new mixes still function but don't fix the race.
…#9060) The attached flag was been set asynchronously relative to the DebugActiveProcess returning. This could cause a race where the initial module load notification being missed/not sent to the debugger. This fix sets the attached flag before any notifications sent during launch if the runtime was launched/attached using the startup handshake after dbgshim tells the runtime to "continue" when the runtime startup API callback returns. Also fixes another race condition in dbgshim where EnumerateCLRs returns a NULL continue event handle because the coreclr module was loaded but the g_hContinueStartupEvent wasn't initialized on the runtime side yet. Changed the static initialization of g_hContinueStartupEvent to INVALID_HANDLE_VALUE and the InternalEnumerateCLRs sleep/retry loop to retry when any of the handles are INVALID_HANDLE_VALUE. This fixes the race only when you have the latest dbgshim and coreclr binaries and the old/new mixes still function but don't fix the race.
…/coreclr#8951) The attached flag was been set asynchronously relative to the DebugActiveProcess returning. This could cause a race where the initial module load notification being missed/not sent to the debugger. This fix sets the attached flag before any notifications sent during launch if the runtime was launched/attached using the startup handshake after dbgshim tells the runtime to "continue" when the runtime startup API callback returns. Also fixes another race condition in dbgshim where EnumerateCLRs returns a NULL continue event handle because the coreclr module was loaded but the g_hContinueStartupEvent wasn't initialized on the runtime side yet. Changed the static initialization of g_hContinueStartupEvent to INVALID_HANDLE_VALUE and the InternalEnumerateCLRs sleep/retry loop to retry when any of the handles are INVALID_HANDLE_VALUE. This fixes the race only when you have the latest dbgshim and coreclr binaries and the old/new mixes still function but don't fix the race. Commit migrated from dotnet/coreclr@3d76888
The attached flag was been set asynchronously relative to the DebugActiveProcess
returning. This could cause a race where the initial module load notification being
missed/not sent to the debugger.
This fix sets the attached flag before any notifications sent during launch if the runtime was
launched/attached using the startup handshake after dbgshim tells the runtime to "continue"
when the runtime startup API callback returns.
Issue: https://github.com/dotnet/coreclr/issues/8606