Fix issue 41413#41445

Merged
PeterSolMS merged 6 commits into dotnet:master from PeterSolMS:Fix_issue_41413
Sep 1, 2020

Conversation

@PeterSolMS
Contributor

Initial attempt to fix the AV issue in GCHeap::ApproxTotalBytesInUse:

  • The AV is fixed by stopping the iteration when seg1 becomes null.
  • Don't count segments that have been decommitted.
  • Stop iterating when seg1 becomes freeable_soh_segment - this takes care of the case where we have put a segment on the list, but the OS call to decommit hasn't come back yet.

These fixes will still leave a window where the result will be inaccurate. However, the window will be fairly small, and hopefully tolerable.

After rebuilding coreclr.dll with the fix, I haven't observed a crash yet - before, it was a matter of minutes.

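The three guards above can be sketched as a single loop condition. This is a minimal stand-alone illustration, not the real coreclr code: the `Segment` struct and the `decommitted` flag are hypothetical stand-ins for `heap_segment` and its decommit state.

```cpp
#include <cstddef>

// Hypothetical minimal segment type (not the real coreclr heap_segment).
struct Segment {
    Segment* next;
    size_t   allocated;   // bytes in use on this segment
    bool     decommitted; // stand-in for the real decommit state
};

// Sum bytes in use, stopping at the ephemeral segment, at null (the AV fix),
// or at the segment queued for deletion; skip decommitted segments.
size_t ApproxBytesInUse(Segment* start, Segment* eph, Segment* freeable) {
    size_t total = 0;
    for (Segment* seg = start;
         (seg != eph) && (seg != nullptr) && (seg != freeable);
         seg = seg->next) {
        if (!seg->decommitted)
            total += seg->allocated;
    }
    return total;
}
```

The null check matters because a concurrently deleted segment can leave the walker holding a pointer whose `next` chain has been unlinked; terminating the walk trades a slightly inaccurate total for crash safety.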
@ghost

ghost commented Aug 27, 2020

Tagging subscribers to this area: @dotnet/gc
See info in area-owners.md if you want to be subscribed.

@PeterSolMS
Contributor Author

The test has been running for 3 hours without a crash - so reliability is much better.

Member

@ChrisAhna ChrisAhna left a comment


Thanks Peter. Instead of ending the scan and proceeding, what do you think about doing "YieldThread(); goto RestartScanFromScratch;" whenever inconsistent list structure is detected? Inconsistency should only exist for a few instructions in generation_delete_heap_segment, so the scan should always succeed after a very small number of retries.

I'm not aware of other cases where the reported value will be "way off", so I think retrying would add some value (at the cost of some complexity) by preserving the guarantee that the reported value always comes from a successful scan of the full segment list (instead of a surprising partial scan).

@PeterSolMS
Contributor Author

Thanks @ChrisAhna - actually, I like this idea! We may want to retry only a limited number of times though, just to make sure we don't retry indefinitely. Say retry 3 times; if we succeed, great, otherwise report the possibly inaccurate result.
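A rough sketch of this bounded-retry policy (the helper and its names are hypothetical, not the actual gc.cpp code): rescan up to three extra times when the scan reports an inconsistent list, then fall back to whatever partial value the last scan produced.

```cpp
#include <cstddef>

// scan(&result) returns true when it saw a consistent list and wrote the
// full total into result; false when it detected an inconsistency and
// wrote only a partial total.
template <typename ScanFn>
size_t ScanWithRetry(ScanFn scan) {
    const int kMaxAttempts = 4;    // the initial scan plus 3 retries
    size_t result = 0;
    for (int attempt = 0; attempt < kMaxAttempts; ++attempt) {
        if (scan(&result))
            return result;         // consistent scan of the full list
        // the real code would call YieldThread() here before retrying
    }
    return result;                 // retries exhausted: report the partial value
}
```

This keeps ChrisAhna's guarantee in the common case (the reported value comes from a full, consistent scan) while bounding the worst case so a corrupt list cannot hang the caller.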

@ChrisAhna
Member

I'm not sure what the best policy is if indefinite retry failures occur. Truly indefinite failures can only happen if the SOH segment list really is corrupt. This is a fatal error that should never occur in practice (i.e., it means CLR behavior will be undefined, which means CLR code is "allowed" to hang or failfast). I lean toward not returning to the caller in these cases. Failfast would be best for diagnosability, but I lean toward letting it hang (since that gets rid of any complexity and ambiguity around trying to distinguish "the list is corrupt" from "we're just in an unbelievably unlucky retry sequence").

@Maoni0
Member

Maoni0 commented Aug 27, 2020

there's another angle that allows us to not have to deal with the concurrency problem at all - we know the following -

  • BGC sweep only handles the gen2 portion of the heap at the end of BGC mark so we know exactly the size of gen2 at that time 'cause EE is stopped.

  • BGC sweep will update free_list_space/free_obj_space as it goes; it just doesn't deduct the size of the deleted segs but that's easy to do (just keep a size that indicates the accumulated size of (heap_segment_allocated (deleted_seg) - heap_segment_mem (deleted_seg)) - I'll call this accumulated_deleted. then for gen2 during BGC sweep you can do (size_of_gen2_at_end_of_BGC_mark - gen2_free_list_space - gen2_free_obj_space - accumulated_deleted)

  • for ephemeral generations we can use the dd_current_size (dynamic_data_of (gen)) which is the amount of space objects take up, i.e., total_gen_size - fragmentation. we can add this to the gen2 info and report that. the size info at this point does not have to be accurate (it's not accurate right now) - this is missing the portion of gen2 growth during BGC sweep but that's acceptable.

this way we don't have to walk the gen2 segment list at all.
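The gen2 bookkeeping above boils down to one subtraction. A sketch with hypothetical parameter names standing in for the gc.cpp counters (free_list_space, free_obj_space, and the proposed accumulated-deleted total):

```cpp
#include <cstddef>

// During BGC sweep, derive the gen2 in-use size without walking the
// segment list. All names are illustrative stand-ins, not real gc.cpp fields.
size_t Gen2BytesInUseDuringSweep(
    size_t sizeAtEndOfMark,    // gen2 size at end of BGC mark (EE was stopped)
    size_t freeListSpace,      // free_list_space, updated as sweep proceeds
    size_t freeObjSpace,       // free_obj_space, updated as sweep proceeds
    size_t accumulatedDeleted) // running sum of
                               // (heap_segment_allocated(seg) - heap_segment_mem(seg))
                               // over segments deleted during sweep
{
    return sizeAtEndOfMark - freeListSpace - freeObjSpace - accumulatedDeleted;
}
```

Because every term is maintained by the sweep itself, the value stays consistent without touching the list that generation_delete_heap_segment mutates concurrently.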

@Maoni0
Member

Maoni0 commented Aug 27, 2020

to report the size it's actually much simpler to just not do all this walking segment list stuff - at the end of a blocking GC we already know the dd_current_size for all the condemned generations. we do want to handle gen0/LOH specially if we want to be accurate (BTW, it's not accurate for gen0 because it uses heap_segment_allocated instead of alloc_allocated which is where the actual end of ephemeral segment is) but we can keep track of how much we allocated and the free_list/free_obj is always updated as we go.

@PeterSolMS
Contributor Author

@ChrisAhna : regarding retry, I am more comfortable reporting an inaccurate result rather than hanging, especially as I'm assuming we want to put this fix into .NET 5.

@Maoni0 : What you are suggesting sounds better than what we are doing now, but it also means more changes and more potential bugs. How about we use the small changes I suggested for .NET 5 and implement the better solution for 6?

@Maoni0
Member

Maoni0 commented Aug 31, 2020

@PeterSolMS I understand your concern about managing risks for checking into 5.0. what do you think of this change in my branch that only changes behavior when we are in the BGC sweep phase? you could make this a bit more accurate by counting gen2's generation_condemned_allocated and generation_end_seg_allocated etc., but I don't think that's worthwhile to do. if you are not comfortable with this I'm completely fine with your proposal.

@PeterSolMS
Contributor Author

@Maoni0 I don't see a problem with your solution, but I'm still more comfortable keeping changes to a minimum for .NET 5 - the runway is pretty short at this point.

Comment thread src/coreclr/src/gc/gc.cpp Outdated
// Get small block heap size info
totsize = (pGenGCHeap->alloc_allocated - heap_segment_mem (eph_seg));
heap_segment* seg1 = generation_start_segment (pGenGCHeap->generation_of (max_generation));
while (seg1 != eph_seg && seg1 != nullptr && seg1 != pGenGCHeap->freeable_soh_segment)
Member


nit -

while ((seg1 != eph_seg) && (seg1 != nullptr) && (seg1 != pGenGCHeap->freeable_soh_segment))

Contributor Author


Sure, will do.

Member

@Maoni0 Maoni0 left a comment


other than the one nit, LGTM

@Maoni0
Member

Maoni0 commented Aug 31, 2020

@PeterSolMS that's fine! I've approved.

@PeterSolMS PeterSolMS merged commit b240be8 into dotnet:master Sep 1, 2020
@PeterSolMS
Contributor Author

/backport to release/5.0

@github-actions
Contributor

github-actions bot commented Sep 1, 2020

Started backporting to release/5.0: https://github.com/dotnet/runtime/actions/runs/234192066

@ghost ghost locked as resolved and limited conversation to collaborators Dec 7, 2020
