Fix debugger funceval deadlock #72179
Conversation
Fixes #60565

Visual Studio devs reported that debugger funcevals were deadlocking because the PinnedHeapHandleTableCrst was held while threads were suspended. This change refactors that code path so that the AllocateHandles() operation that holds the lock is split into potentially multiple iterations, and the GC allocation where the debugger could suspend is moved outside the region where the critical section is held. There is another path into AllocateHandles that takes a different lock and could theoretically hit a similar deadlock, but I didn't do all the refactoring necessary to cover that path.

I'm still testing to confirm this works as intended, but please take a look.

@davidwr @hoyosjs @mikem8361 @cshung

Tagging subscribers to this area: @tommcdon
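As a rough illustration of the shape described above (not the actual runtime code), here is a minimal standalone sketch: std::mutex stands in for PinnedHeapHandleTableCrst, a plain vector stands in for the pinned managed array, and every name is hypothetical. The point is only that the lock is held for bookkeeping, while the allocation the debugger could suspend in happens with the lock released, looping back to re-check the table state afterwards.

```cpp
#include <algorithm>
#include <cstddef>
#include <memory>
#include <mutex>
#include <vector>

struct Bucket {
    std::vector<void*> slots;   // stands in for the pinned object[] bucket
    size_t used = 0;
    size_t Free() const { return slots.size() - used; }
};

class PinnedHandleTableModel {
    std::mutex m_crst;                              // stands in for PinnedHeapHandleTableCrst
    std::vector<std::unique_ptr<Bucket>> m_buckets; // newest bucket is at the back
    size_t m_nextBucketSize = 256;                  // guarded by m_crst

public:
    // Reserves `count` contiguous slots, growing the table when needed.
    void** AllocateHandles(size_t count) {
        std::unique_ptr<Bucket> fresh;  // bucket built while the lock was released
        for (;;) {
            size_t bucketSize;
            {
                std::lock_guard<std::mutex> hold(m_crst);
                if (fresh) {
                    // Adopt the bucket we prepared outside the lock.
                    m_buckets.push_back(std::move(fresh));
                    m_nextBucketSize *= 2;
                }
                if (!m_buckets.empty() && m_buckets.back()->Free() >= count) {
                    Bucket& b = *m_buckets.back();
                    void** result = b.slots.data() + b.used;
                    b.used += count;
                    return result;
                }
                bucketSize = std::max(count, m_nextBucketSize); // size hint read under the lock
            }
            // Lock released here: a debugger-induced suspension during this
            // allocation cannot block other threads waiting on m_crst.
            fresh = std::make_unique<Bucket>();
            fresh->slots.resize(bucketSize);
        }
    }
};
```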
Sorry, I think you mistyped my user XD Maybe @davidwrighton
The tests are hitting intermittent crashes and asserts, e.g.:
I think that the logic in this PR is close to impossible to follow. Can this be refactored so that it is easier to follow? For example, can we have a method that looks like this?
and still resolve the handle from the cache on the 2nd attempt. Refactoring to use a NewHolder and initializing outsideLockWork at the top of the function reduces the logic complexity and makes it more robust to overlooked return points.
@jkotas - yeah I saw the test failure log earlier and recognized the case I missed after seeing my assert failure. I also refactored the logic to hopefully make it simpler for the reader. Let me know if you think I was successful or if you have other suggestions.
Unfortunately, two passes aren't guaranteed to be enough; we need a retry loop. It is possible that thread A does the first step, then thread B does both steps, and then when thread A returns to complete step 2 it discovers the internal state has changed and the preallocated bucket is no longer the correct size. Another alternative, if folks prefer, is that I can probably write a lock-free version. The downside is that it adds more complexity in terms of potential thread interleavings and memory-ordering issues.
The array that was allocated outside the lock should still be sufficient to fulfill the request on the current thread. I do not think we need any retry loops.
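Read that way, the retry loop collapses into a single locked step. A hedged standalone sketch of this no-retry shape (again with std::mutex, plain containers, and hypothetical names standing in for the runtime types): size the out-of-lock allocation for at least the current request, and under the lock either satisfy the request from the existing bucket or link the freshly built one in, but never loop back out.

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <memory>
#include <mutex>
#include <vector>

// Minimal stand-ins for the runtime types; names are illustrative only.
struct Bucket {
    std::vector<void*> slots;
    size_t used = 0;
    size_t Free() const { return slots.size() - used; }
};

std::mutex g_tableLock;                         // stands in for the handle table Crst
std::vector<std::unique_ptr<Bucket>> g_buckets; // guarded by g_tableLock
std::atomic<size_t> g_nextBucketSize{256};      // atomic: read outside the lock as a hint

void** AllocateHandlesNoRetry(size_t count) {
    // Unsynchronized, potentially suspending allocation happens first and is
    // sized for at least this request, so it can always satisfy `count` even
    // if other threads grow the table before we take the lock.
    auto fresh = std::make_unique<Bucket>();
    fresh->slots.resize(std::max(count, g_nextBucketSize.load()));

    std::lock_guard<std::mutex> hold(g_tableLock);
    if (g_buckets.empty() || g_buckets.back()->Free() < count) {
        // No room in the current bucket: link in the one we built.
        g_buckets.push_back(std::move(fresh));
        g_nextBucketSize.store(g_nextBucketSize.load() * 2);
    }
    // Otherwise `fresh` is simply dropped (or could be threaded onto the
    // list, as the follow-up comments below discuss).
    Bucket& b = *g_buckets.back();
    void** result = b.slots.data() + b.used;
    b.used += count;
    return result;
}
```

The trade-off, picked up in the next comments, is what to do with the redundant bucket when another thread wins the race.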
I can certainly thread them onto the linked list, but what are you thinking should happen afterwards? As a concrete example, two threads could concurrently each ask for 1 handle at a point where the NextBucketSize is 4000, and we'd thread both of those large arrays into the list. Currently there is a special case for allocations of size 1 where it searches beyond the head node, but for any other request size 3999 of the newly added array slots would be unusable. Were you thinking that we'd couple this with changes to the search algorithm that try harder to reuse previously allocated nodes, or that we'd just count on re-entrant operations being rare and accept the loss of efficiency if it happened?
Is the core of your concern the performance/correctness/efficiency of the algorithm with the loop, or just that the logic is hard for a reader to understand?
Yes, this change has a risk of pathological performance issues. The risk exists in both designs (abandon the redundant allocation, or add the redundant allocation to the list). The problem is that these allocations go to the pinned heap. The pinned heap is used rarely, and it cannot be compacted. It means that once you create a fragment on this heap, there is a good chance that it won't be re-used for anything. We ran into similar problems when we explored use of the pinned heap for string literals.

The core of my concern is that this is turning something simple (allocate a piece of managed memory under a lock) into something pretty complex, hard to reason about, and a potential source of pathological performance issues. Allocating managed memory under a lock is done all the time in C#, and I am sure there are a number of places where it is done in the VM. The solution implemented here won't scale as a fix.
@jkotas - I haven't tested this updated version beyond confirming that it builds, but feel free to take a look to see if it aligns with what we discussed. I'll ping again and convert back to a real PR once I've had a chance to better validate that it works as intended.
Yes, this looks good. Thanks!
Failure is a known issue: #70969
OK, this is looking good again.

In the old code the PinnedHeapHandleTable was synchronized by one of two different locks, either the AppDomainHandleTable lock or the GlobalStringLiteralMap lock. In the new code the AppDomainHandleTable lock is renamed to the PinnedHeapHandleTable lock, and this lock is always what synchronizes the PinnedHeapHandleTable code. In the string literal code path the GlobalStringLiteralMap lock is taken first and the PinnedHeapHandleTable lock is taken second, but the PinnedHeapHandleTable no longer relies on that outer GlobalStringLiteralMap lock for correctness.

In terms of testing, I can verify under a debugger that I can suspend at the GC allocation point and the PinnedHeapHandleTable lock isn't held. Of course this doesn't prevent other locks from being held, so at best it is a partial fix for the issue. Nobody had a known repro, so I wasn't able to verify anything more specific. I also confirmed that the race cases on the InterlockedExchange paths worked as intended by forcing them with a native debugger.
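The "race cases on the InterlockedExchange paths" mentioned here typically follow a publish-or-discard shape. Purely as a hedged illustration (std::atomic's compare-exchange standing in for InterlockedCompareExchange, and hypothetical types rather than the actual PinnedHeapHandleTable code):

```cpp
#include <atomic>
#include <cstddef>

// Publish-or-discard: build the object with no lock held, then try to install
// it with a single compare-exchange; the losing thread throws its copy away
// and adopts the winner's. Types and names here are illustrative stand-ins.
struct Segment {
    void** slots;   // assume allocated by the caller with new void*[size]
    size_t size;
};

std::atomic<Segment*> g_currentSegment{nullptr};

Segment* PublishOrAdopt(Segment* fresh) {
    Segment* expected = nullptr;
    if (g_currentSegment.compare_exchange_strong(expected, fresh)) {
        return fresh;               // we won the race; our segment is now live
    }
    delete[] fresh->slots;          // lost the race: discard our copy...
    delete fresh;
    return expected;                // ...and use the segment the winner installed
}
```

Forcing the compare-exchange down the losing branch with a native debugger, as described above, is a reasonable way to exercise both paths.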
I suspect that this lock is also exposed to the debugger deadlock.
Yeah I expect so as well. |
LGTM, thank you!
Merging, failure is a known issue: #70969