Fix debugger launch race hitting breakpoints in startup code. #8951

mikem8361 · 2017-01-13T23:52:22Z

The attached flag was been set asynchronously relative to the DebugActiveProcess
returning. This could cause a race where the initial module load notification being
missed/not sent to the debugger.

This fix sets the attached flag before any notifications sent during launch if the runtime was
launched/attached using the startup handshake after dbgshim tells the runtime to "continue"
when the runtime startup API callback returns.

Issue: https://github.com/dotnet/coreclr/issues/8606

mikem8361 · 2017-01-13T23:54:59Z

FYI @gregg-miskelly. If you could somehow verify this fixes your race, that would be great. The change isn't that big (only two files and localized); you could apply it to your VS copy/branch/version of coreclr.

mikem8361 · 2017-01-14T01:23:49Z

@dotnet-bot test Ubuntu x64 Checked Build and Test

mikem8361 · 2017-01-14T01:24:03Z

@dotnet-bot test OSX x64 Checked Build and Test

gregg-miskelly · 2017-01-14T02:20:05Z

@mikem8361 It will take me a little while to get setup again, but I will try to do that next week.

mikem8361 · 2017-01-17T18:41:44Z

@noahfalk could you review?

noahfalk · 2017-01-17T21:59:02Z

I haven't had time to give this deep scrutiny, but it looks like a correct implementation of what we discussed and I'd say go ahead and check it in.

mikem8361 · 2017-01-18T00:53:27Z

@gregg-miskelly @noahfalk I just pushed another commit that addresses the launch race condition pointed out by Gregg for another issue they ran into. Could you both look at the new changes.

gregg-miskelly · 2017-01-18T01:13:06Z

                        _ASSERTE(coreclrExists);

Should we change this from an assert to some sort of real failure since this is what will happen presumably if we ever can't get the continue event for some reason.

Refers to: src/dlls/dbgshim/dbgshim.cpp:551 in 3eb0e03. [](commit_id = 3eb0e03, deletion_comment = False)

gregg-miskelly

Otherwise LGTM

gregg-miskelly · 2017-01-18T01:06:30Z

src/dlls/dbgshim/dbgshim.cpp

+                    {
+                        break;
+                    }
+                }


If we don't find anything, would it make more sense to wait for the startup event to get set? (ex:

CloseCLREnumeration(*ppHandleArrayOut, *ppStringArrayOut, *pdwArrayLengthOut); *ppHandleArrayOut = NULL; *ppStringArrayOut = NULL; *pdwArrayLengthOut = 0; return S_OK;

If we do decide to retry, we need to free the memory before doing so or we will leak.

gregg-miskelly · 2017-01-18T01:18:03Z

BTW: Not sure if it matters, but we might want to port the dbgshim part of the fix to some release branch to produce a new dbgshim that we can ship with VS 2017 RTW.

mikem8361 · 2017-01-18T01:27:20Z

I kind of assumed that you may want other fix too in the 1.1 release branch at some point. The breakpoint unreliability seems as bad as this launch race.

gregg-miskelly · 2017-01-18T01:34:54Z

Definitely. If you can port this to whole PR to the release branch, I will definitely not complain :) But since the ship vehicles are somewhat different (.NET Core vs. VS) figured I would call it out.

mikem8361 · 2017-01-18T01:42:30Z

If EnumerateCLRs/InternalEnumerateCLRs returns that it found no runtimes, InvokeStartupCallback will return coreclr doesn’t exist back to the StartupHelperThread which will block on the startup event.

gregg-miskelly · 2017-01-18T01:48:07Z

StartupHelperThread callings InvokeStartupCallback up to twice.

The first call is good as you said -- it will just wait for the startup event.
The second call is good as long as coreclrExists comes back true. But if we have some, yet unknown, bug where we can't get the continue event (or otherwise are unable to find a coreclr), currently dbgshim will assert, but otherwise just slightly exit without invoking the callback. See line 551.

mikem8361 · 2017-01-18T02:04:23Z

I did make the change to turn that assert into a check if coreclrExists and fail if it doesn’t exist on the second call and we will invoke the callback with an error.

noahfalk · 2017-01-18T07:28:51Z

src/dlls/dbgshim/dbgshim.cpp

+                    HANDLE h = (*ppHandleArray)[i];
+                    if (h != NULL && h != INVALID_HANDLE_VALUE)
+                    {
+                        break;


I think you intending to break out of the while loop? This is going to break out of the for loop only. This version looks like it is looping 25 times and then giving S_OK assuming it was successful on the last iteration?

I think you also wanted to retry if any handle was NULL, but if the break applied to the outer while loop this code retries only if all handles are NULL.

Thanks for catching that. I've redone the changes and should be correct now.

mikem8361 · 2017-01-18T08:17:43Z

Thanks for catching that. I push some changes fixes it.

gregg-miskelly · 2017-01-18T17:04:37Z

src/dlls/dbgshim/dbgshim.cpp

+        for (int i = 0; i < (int)arrayLength; i++)
+        {
+            HANDLE h = handleArray[i];
+            if (h == NULL)


Should this code treat NULL and INVALID_HANDLE_VALUE the same?

I wasn't really sure if it should check for INVALID_HANDLE_VALUE. It looks to me like the only way it could be INVALID_HANDLE_VALUE if the CreateEvent fails on the runtime side. For that unlikely condition (the runtime has a CONSISTENCY_CHECK for it which I think is DEBUG only), dbgshim should probably fail instead of sleep/retry.

I am not sure, but the comment says that it could. Fix the comment?

// fill in event handle -- if GetContinueStartupEvent fails, it will still return // INVALID_HANDLE_VALUE in hContinueStartupEvent, which is what we want. we don't // want to bail out of the enumeration altogether if we can't get an event from // one telesto. HANDLE hContinueStartupEvent = INVALID_HANDLE_VALUE; HRESULT hr = GetContinueStartupEvent(debuggeePID, pStringArray[idx], &hContinueStartupEvent); _ASSERTE(SUCCEEDED(hr) == (hContinueStartupEvent != INVALID_HANDLE_VALUE)); pEventArray[idx] = hContinueStartupEvent;

Actually, it looks like there are at least some code paths where failure would be INVALID_HANDLE_VALUE (unless I am missing something)

gregg-miskelly · 2017-01-18T17:11:35Z

src/dlls/dbgshim/dbgshim.cpp

+                // If EnumerateCLRs succeeded but any of the handles are null, then sleep and retry also. This fixes a race
+                // condition where dbgshim catches the coreclr module just being loaded but before g_hContinueStartupEvent 
+                // has been initialized.
+                if (NoNullHandles(*ppHandleArray, *pdwArrayLength))


Reading EnumerateCLRs, it looks like we just set the handles to NULL on Unix. So should this code be ifdefed to always return hr; on Unix?

#ifndef FEATURE_PAL // fill in event handle -- if GetContinueStartupEvent fails, it will still return // INVALID_HANDLE_VALUE in hContinueStartupEvent, which is what we want. we don't ... #else pEventArray[idx] = NULL; #endif // FEATURE_PAL

On unix, this function isn't used by the RegisterForRuntimeStartup and used in general. The only thing we can return is the array of module name strings.

mikem8361 · 2017-01-18T19:09:33Z

I was wrong (yet again). I didn’t look at GetContinueStartupEvent close enough. There are other conditions that would return INVALID_HANDLE_VALUE.

mikem8361 · 2017-01-19T18:44:55Z

It looks like I have to back out the "if any handle is null, wait/retry" logic because it will get hit when for a late attach (the runtime NULLs the continue event after the check if launched/wait logic). This will cause delays for late attach.

And it seems to be causing almost all of the debugger tests to fail. They are launching their debuggees and the logging I've added shows that everything is normal (no null handles/no retries). I'm still trying to come up with an explanation.

gregg-miskelly · 2017-01-19T22:35:14Z

src/dlls/dbgshim/dbgshim.cpp

        {
            HANDLE h = handleArray[i];
-            if (h == NULL)
+            if (h == INVALID_HANDLE_VALUE)


In the case of older CoreCLR's, this will still be NULL, so should we check both?

mikem8361 · 2017-01-19T22:42:11Z

I should have commented on this better. The main problem with the original code that checked for NULL is that it causes “attach” problems. The g_hContinueStartupEvent is closed and set to NULL after the check for named startup event/wait on continue event logic. So, late attaches (as opposed to launch or attaches before coreclr is loaded) will go through the sleep/retry logic in InternalEnumerateCLRs and then after 41secs or so fail with the timeout code. The launch race won’t be fixed if the new dbgshim is used against an older coreclr, but it will still work as before without any problems. I’m hoping to get all of these changes into our next service release.

gregg-miskelly · 2017-01-19T22:44:07Z

Ah. Makes sense. Thanks for the clarification.

mikem8361 · 2017-01-20T23:37:35Z

@noahfalk (and @gregg-miskelly optionally). This is ready for a final code review now. I decided to leave initializing g_hContinueStartupEvent = INVALID_HANDLE_VALUE (-1) because it looks like it does get loaded that way by the loader. It should be just all the changes in dbgshim.cpp and in debugger.cpp.

noahfalk · 2017-01-21T01:13:02Z

src/dlls/dbgshim/dbgshim.cpp

+                {
+                    return hr;
+                }
+                // If EnumerateCLRs succeeded but any of the handles are null, then sleep and retry also. This fixes a race


null [](start = 73, length = 4)

I think you meant to say INVALID_HANDLE_VALUE

noahfalk · 2017-01-21T01:19:58Z

        _ASSERTE(SUCCEEDED(hr) == (hContinueStartupEvent != INVALID_HANDLE_VALUE));

This assert is no longer true

Refers to: src/dlls/dbgshim/dbgshim.cpp:1166 in 383cc0c. [](commit_id = 383cc0c, deletion_comment = False)

noahfalk · 2017-01-21T01:23:29Z

The attached flag was been set asynchronously relative to the DebugActiveProcess returning. This could cause a race where the initial module load notification being missed/not sent to the debugger. This fix sets the attached flag before any notifications sent during launch if the runtime was launched/attached using the startup handshake after dbgshim tells the runtime to "continue" when the runtime startup API callback returns. Also fixes another race condition in dbgshim where EnumerateCLRs returns a NULL continue event handle because the coreclr module was loaded but the g_hContinueStartupEvent wasn't initialized on the runtime side yet. Changed the static initialization of g_hContinueStartupEvent to INVALID_HANDLE_VALUE and the InternalEnumerateCLRs sleep/retry loop to retry when any of the handles are INVALID_HANDLE_VALUE. This fixes the race only when you have the latest dbgshim and coreclr binaries and the old/new mixes still function but don't fix the race.

mikem8361 · 2017-01-21T01:31:14Z

@dotnet-bot test Windows_NT x64 Debug Build and Test

…#8951) The attached flag was been set asynchronously relative to the DebugActiveProcess returning. This could cause a race where the initial module load notification being missed/not sent to the debugger. This fix sets the attached flag before any notifications sent during launch if the runtime was launched/attached using the startup handshake after dbgshim tells the runtime to "continue" when the runtime startup API callback returns. Also fixes another race condition in dbgshim where EnumerateCLRs returns a NULL continue event handle because the coreclr module was loaded but the g_hContinueStartupEvent wasn't initialized on the runtime side yet. Changed the static initialization of g_hContinueStartupEvent to INVALID_HANDLE_VALUE and the InternalEnumerateCLRs sleep/retry loop to retry when any of the handles are INVALID_HANDLE_VALUE. This fixes the race only when you have the latest dbgshim and coreclr binaries and the old/new mixes still function but don't fix the race.

…#9062) The attached flag was been set asynchronously relative to the DebugActiveProcess returning. This could cause a race where the initial module load notification being missed/not sent to the debugger. This fix sets the attached flag before any notifications sent during launch if the runtime was launched/attached using the startup handshake after dbgshim tells the runtime to "continue" when the runtime startup API callback returns. Also fixes another race condition in dbgshim where EnumerateCLRs returns a NULL continue event handle because the coreclr module was loaded but the g_hContinueStartupEvent wasn't initialized on the runtime side yet. Changed the static initialization of g_hContinueStartupEvent to INVALID_HANDLE_VALUE and the InternalEnumerateCLRs sleep/retry loop to retry when any of the handles are INVALID_HANDLE_VALUE. This fixes the race only when you have the latest dbgshim and coreclr binaries and the old/new mixes still function but don't fix the race.

…#9060) The attached flag was been set asynchronously relative to the DebugActiveProcess returning. This could cause a race where the initial module load notification being missed/not sent to the debugger. This fix sets the attached flag before any notifications sent during launch if the runtime was launched/attached using the startup handshake after dbgshim tells the runtime to "continue" when the runtime startup API callback returns. Also fixes another race condition in dbgshim where EnumerateCLRs returns a NULL continue event handle because the coreclr module was loaded but the g_hContinueStartupEvent wasn't initialized on the runtime side yet. Changed the static initialization of g_hContinueStartupEvent to INVALID_HANDLE_VALUE and the InternalEnumerateCLRs sleep/retry loop to retry when any of the handles are INVALID_HANDLE_VALUE. This fixes the race only when you have the latest dbgshim and coreclr binaries and the old/new mixes still function but don't fix the race.

…/coreclr#8951) The attached flag was been set asynchronously relative to the DebugActiveProcess returning. This could cause a race where the initial module load notification being missed/not sent to the debugger. This fix sets the attached flag before any notifications sent during launch if the runtime was launched/attached using the startup handshake after dbgshim tells the runtime to "continue" when the runtime startup API callback returns. Also fixes another race condition in dbgshim where EnumerateCLRs returns a NULL continue event handle because the coreclr module was loaded but the g_hContinueStartupEvent wasn't initialized on the runtime side yet. Changed the static initialization of g_hContinueStartupEvent to INVALID_HANDLE_VALUE and the InternalEnumerateCLRs sleep/retry loop to retry when any of the handles are INVALID_HANDLE_VALUE. This fixes the race only when you have the latest dbgshim and coreclr binaries and the old/new mixes still function but don't fix the race. Commit migrated from dotnet/coreclr@3d76888

mikem8361 added the area-Diagnostics label Jan 13, 2017

mikem8361 added this to the 2.0.0 milestone Jan 13, 2017

mikem8361 self-assigned this Jan 13, 2017

dnfclas added the cla-already-signed label Jan 13, 2017

mikem8361 requested a review from noahfalk January 13, 2017 23:52

gregg-miskelly approved these changes Jan 18, 2017

View reviewed changes

noahfalk reviewed Jan 18, 2017

View reviewed changes

mikem8361 force-pushed the fixlaunchrace branch from 531562a to dcc26c0 Compare January 18, 2017 08:13

gregg-miskelly reviewed Jan 18, 2017

View reviewed changes

gregg-miskelly reviewed Jan 19, 2017

View reviewed changes

mikem8361 force-pushed the fixlaunchrace branch from 400cac0 to 383cc0c Compare January 20, 2017 23:34

gregg-miskelly approved these changes Jan 20, 2017

View reviewed changes

noahfalk reviewed Jan 21, 2017

View reviewed changes

mikem8361 force-pushed the fixlaunchrace branch from 383cc0c to e2cd068 Compare January 21, 2017 01:45

mikem8361 merged commit 3d76888 into dotnet:master Jan 21, 2017

mikem8361 deleted the fixlaunchrace branch January 21, 2017 01:46

Fix debugger launch race hitting breakpoints in startup code. #8951

Fix debugger launch race hitting breakpoints in startup code. #8951

Uh oh!

Conversation

mikem8361 commented Jan 13, 2017

Uh oh!

mikem8361 commented Jan 13, 2017

Uh oh!

mikem8361 commented Jan 14, 2017

Uh oh!

mikem8361 commented Jan 14, 2017

Uh oh!

gregg-miskelly commented Jan 14, 2017

Uh oh!

mikem8361 commented Jan 17, 2017

Uh oh!

noahfalk commented Jan 17, 2017

Uh oh!

mikem8361 commented Jan 18, 2017

Uh oh!

gregg-miskelly commented Jan 18, 2017

Uh oh!

gregg-miskelly left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

gregg-miskelly commented Jan 18, 2017

Uh oh!

mikem8361 commented Jan 18, 2017 via email

Uh oh!

gregg-miskelly commented Jan 18, 2017

Uh oh!

mikem8361 commented Jan 18, 2017 via email

Uh oh!

gregg-miskelly commented Jan 18, 2017

Uh oh!

mikem8361 commented Jan 18, 2017 via email

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mikem8361 commented Jan 18, 2017 via email

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mikem8361 commented Jan 18, 2017 via email

Uh oh!

mikem8361 commented Jan 19, 2017

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mikem8361 commented Jan 19, 2017 via email

Uh oh!

gregg-miskelly commented Jan 19, 2017

Uh oh!

mikem8361 commented Jan 20, 2017

Uh oh!

Choose a reason for hiding this comment

Uh oh!

noahfalk commented Jan 21, 2017

Uh oh!

noahfalk commented Jan 21, 2017