- The AV is fixed by stopping the iteration when seg1 becomes null.
- Don't count segments that have been decommitted.
- Stop iterating when seg1 becomes freeable_soh_segment. This takes care of the case where we have put a segment on the list, but the OS call to decommit hasn't come back.

These fixes will still leave a window where the result will be inaccurate. However, the window will be fairly small, and hopefully tolerable.
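The stopping and skipping conditions described above can be illustrated with a small standalone sketch. The `Segment` struct, its `decommitted` flag, and `approx_bytes_before` are hypothetical stand-ins invented for this example; they are not the runtime's `heap_segment` or the actual change in this PR.

```cpp
#include <cstddef>

// Hypothetical stand-in for a GC heap segment; only the fields the walk needs.
struct Segment {
    const char* mem;        // first usable byte of the segment
    const char* allocated;  // one past the last allocated byte
    bool decommitted;       // assumed flag: memory has already been decommitted
    Segment* next;          // next segment in the list, or nullptr
};

// Sum allocated bytes from `start` up to (not including) `eph_seg`.
// The walk ends early if the list runs out (nullptr) or if it reaches
// `freeable`, the head of the segments queued for release.
std::size_t approx_bytes_before(const Segment* start,
                                const Segment* eph_seg,
                                const Segment* freeable) {
    std::size_t total = 0;
    for (const Segment* seg = start;
         (seg != eph_seg) && (seg != nullptr) && (seg != freeable);
         seg = seg->next) {
        if (!seg->decommitted) {  // skip segments whose memory is gone
            total += static_cast<std::size_t>(seg->allocated - seg->mem);
        }
    }
    return total;
}
```

As in the description, this only narrows the race: a segment can still change state between the check and the read, so the result stays approximate.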
|
Tagging subscribers to this area: @dotnet/gc |
|
The test has been running for 3 hours without a crash - so reliability is much better. |
ChrisAhna
left a comment
Thanks Peter. Instead of ending the scan and proceeding, what do you think about doing "YieldThread(); goto RestartScanFromScratch;" whenever inconsistent list structure is detected? Inconsistency should only exist for a few instructions in generation_delete_heap_segment, so the scan should always succeed after a very small number of retries.
I'm not aware of other cases where the reported value will be "way off", so I think retrying would add some value (at the cost of some complexity) by preserving the guarantee that the reported value always comes from a successful scan of the full segment list (instead of a surprising partial scan).
|
Thanks @ChrisAhna - actually, I like this idea! We may think about retrying this only a limited number of times though, just to make sure we don't retry indefinitely. Say retry 3 times, if we succeed, great, otherwise report the possibly inaccurate result. |
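A bounded version of the retry idea might look like the sketch below. `ScanResult`, `scan_with_retry`, and the callback shape are assumptions made for illustration; the limit of 3 attempts is the number floated in the comment above, not a measured value.

```cpp
#include <cstddef>
#include <functional>
#include <thread>

// Hypothetical result type: the byte count plus whether the scan saw a
// consistent list all the way through.
struct ScanResult {
    std::size_t bytes;
    bool complete;  // false if the scan stopped early on an inconsistent list
};

// Run `scan` up to `max_attempts` times, yielding between failed attempts.
// Returns the first complete result, or the last partial one if every
// attempt hit an inconsistency.
ScanResult scan_with_retry(const std::function<ScanResult()>& scan,
                           int max_attempts = 3) {
    ScanResult last{0, false};
    for (int attempt = 0; attempt < max_attempts; ++attempt) {
        last = scan();
        if (last.complete) {
            return last;
        }
        std::this_thread::yield();  // give the mutating thread time to finish
    }
    return last;  // possibly inaccurate partial result
}
```

The caller can check `complete` to tell a full scan from a fallback, which keeps the "never hang" property while still preferring an accurate answer.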
|
I'm not sure what the best policy is if indefinite retry failures occur. Truly indefinite failures can only happen if the SOH segment list really is corrupt. This is a fatal error that should never occur in practice (i.e., it means CLR behavior will be undefined, which means CLR code is "allowed" to hang or failfast). I lean toward not returning to the caller in these cases. Failfast would be best for diagnosability, but I lean toward letting it hang (since that gets rid of any complexity and ambiguity around trying to distinguish "the list is corrupt" from "we're just in an unbelievably unlucky retry sequence"). |
|
there's another angle that allows us to not have to deal with the concurrency problem at all - we know the following -
this way we don't have to walk the gen2 segment list at all. |
|
to report the size it's actually much simpler to just not do all this walking segment list stuff - at the end of a blocking GC we already know the dd_current_size for all the condemned generations. we do want to handle gen0/LOH specially if we want to be accurate (BTW, it's not accurate for gen0 because it uses heap_segment_allocated instead of alloc_allocated which is where the actual end of ephemeral segment is) but we can keep track of how much we allocated and the free_list/free_obj is always updated as we go. |
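A very rough sketch of the bookkeeping idea, reporting from counters instead of walking segments, is shown below. `GenCounters` and its members are hypothetical names invented for this example; they do not correspond to `dd_current_size` or any other runtime field, and free-list accounting is left out.

```cpp
#include <atomic>
#include <cstddef>

// Hypothetical per-generation counters; not the runtime's data structures.
struct GenCounters {
    std::atomic<std::size_t> size_at_last_gc{0};     // recorded when a blocking GC ends
    std::atomic<std::size_t> allocated_since_gc{0};  // bumped on each allocation

    // Called at the end of a blocking GC with the surviving size.
    void on_gc_end(std::size_t surviving_bytes) {
        size_at_last_gc.store(surviving_bytes, std::memory_order_relaxed);
        allocated_since_gc.store(0, std::memory_order_relaxed);
    }

    // Called by the allocator whenever it hands out memory.
    void on_allocate(std::size_t bytes) {
        allocated_since_gc.fetch_add(bytes, std::memory_order_relaxed);
    }

    // Reader side: two loads, no list walk, so nothing to race with
    // segment deletion. The pair of loads is not one atomic snapshot,
    // so the value is still approximate.
    std::size_t approx_in_use() const {
        return size_at_last_gc.load(std::memory_order_relaxed) +
               allocated_since_gc.load(std::memory_order_relaxed);
    }
};
```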
…akes sense to retry and possibly get a better result.
|
@ChrisAhna : regarding retry, I am more comfortable reporting an inaccurate result rather than hanging, especially as I'm assuming we want to put this fix into .NET 5. @Maoni0 : What you are suggesting sounds better than what we are doing now, but it also means more changes and more potential bugs. How about we use the small changes I suggested for .NET 5 and implement the better solution for 6? |
|
@PeterSolMS I understand your concern about managing risks for checking into 5.0. what do you think of this change in my branch that only changes when we are in BGC sweep phase? you could make this a bit more accurate by counting gen2's |
|
@Maoni0 I don't see a problem with your solution, but I'm still more comfortable keeping changes to a minimum for .NET 5 - the runway is pretty short at this point. |
// Get small block heap size info
totsize = (pGenGCHeap->alloc_allocated - heap_segment_mem (eph_seg));
heap_segment* seg1 = generation_start_segment (pGenGCHeap->generation_of (max_generation));
while (seg1 != eph_seg && seg1 != nullptr && seg1 != pGenGCHeap->freeable_soh_segment)
nit -
while ((seg1 != eph_seg) && (seg1 != nullptr) && (seg1 != pGenGCHeap->freeable_soh_segment))
Maoni0
left a comment
other than the one nit, it LGTM
|
@PeterSolMS that's fine! I've approved. |
|
/backport to release/5.0 |
|
Started backporting to release/5.0: https://github.com/dotnet/runtime/actions/runs/234192066 |
Initial attempt to fix the AV issue in GCHeap::ApproxTotalBytesInUse:
These fixes will still leave a window where the result will be inaccurate. However, the window will be fairly small, and hopefully tolerable.
After rebuilding coreclr.dll with the fix, I haven't observed a crash yet - before, it was a matter of minutes.