fixed problems with how sampling is done and how we suspend to change heap count in DATAS #91712

Maoni0 · 2023-09-07T06:43:14Z

Moved the sample recording to when we are suspended. Previously the throughput cost was calculated in check_heap_count (which is called right after we restart EE on heap0), where we record the msl_wait_time (and reset it to 0 for soh/uoh). This is not synchronized with the allocating threads (which are already running at this point). So the allocating threads may have already accumulated more wait time, which gets attributed to this GC even though it's not within the period we are counting for this GC (and we lose this part for the next GC). For BGC this is incorrect. If an ephemeral GC did happen before the BGC starts, we'd be adding a sample for that GC, which is basically correct for that eph GC. But if an eph GC did not happen, we are just adding a random sample which calculates the tcp as (msl wait + whatever GC finished before this BGC), so it's obviously incorrect.
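To illustrate why recording during the suspension removes the race, here is a minimal, single-threaded sketch. It is not the code in gc.cpp: `SamplerModel`, `Sample`, and `throughput_cost_percent` are hypothetical names, and the cost formula (GC pause plus msl wait, divided by elapsed time) is an assumption made for illustration only.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical, simplified model; these are not the real gc.cpp structures.
struct Sample {
    uint64_t elapsed_us;   // length of the period this sample covers
    uint64_t gc_pause_us;  // time spent in the GC itself
    uint64_t msl_wait_us;  // allocator wait time accumulated in that period
};

struct SamplerModel {
    bool ee_suspended = false;
    uint64_t msl_wait_us = 0;  // accumulated by allocating threads
    std::vector<Sample> samples;

    // Allocating threads can only accumulate wait time while the EE is running.
    void allocator_waited(uint64_t us) {
        assert(!ee_suspended);
        msl_wait_us += us;
    }

    void suspend_ee() { ee_suspended = true; }
    void restart_ee() { ee_suspended = false; }

    // Record and reset while suspended, so no allocating thread can add wait
    // time between reading msl_wait_us and resetting it.
    void record_sample(uint64_t elapsed_us, uint64_t gc_pause_us) {
        assert(ee_suspended);
        samples.push_back({elapsed_us, gc_pause_us, msl_wait_us});
        msl_wait_us = 0;
    }
};

// Assumed formula for this sketch: cost as a percentage of elapsed time.
inline double throughput_cost_percent(const Sample& s) {
    assert(s.elapsed_us > 0);
    return 100.0 * static_cast<double>(s.gc_pause_us + s.msl_wait_us)
                 / static_cast<double>(s.elapsed_us);
}
```

In this model, any wait time accumulated after `restart_ee()` stays in `msl_wait_us` and is counted in the next sample instead of being lost or attributed to the wrong GC.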
Added gen2 sampling - this was adapted from Peter's gen2 sampling changes. This serves as a backstop in case the existing sampling doesn't ever pick gen2 GC costs. I made the following fixes -

1) changed the way we calculated the median
2) moved where this is calculated to again avoid timing issues
3) made the gen2 samples actually count instead of losing that info if we happen to sample when a gen2 didn't just occur.
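The idea of a separate gen2 sample array can be sketched roughly as follows. This is only an illustration under stated assumptions: `Gen2Samples` is a hypothetical type, the buffer size is arbitrary, and the median here uses `std::nth_element` on a copy, which is not necessarily how the PR computes it.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical ring buffer of gen2 cost samples, kept apart from the main samples.
template <std::size_t N>
struct Gen2Samples {
    std::array<float, N> values{};
    std::size_t count = 0;  // number of valid entries, at most N
    std::size_t next = 0;   // slot that the next sample overwrites

    // Called only when a gen2 GC just finished. Other GCs leave the buffer
    // alone, so earlier gen2 costs are not lost when sampling after a non-gen2 GC.
    void add(float cost_percent) {
        values[next] = cost_percent;
        next = (next + 1) % N;
        if (count < N) ++count;
    }

    // Median of the valid entries (the upper middle one when count is even).
    // Works on a copy so the insertion order of the samples is preserved.
    float median() const {
        assert(count > 0);
        std::array<float, N> tmp = values;
        auto mid = tmp.begin() + count / 2;
        std::nth_element(tmp.begin(), mid, tmp.begin() + count);
        return *mid;
    }
};
```

With `Gen2Samples<3>`, adding 5, 1, 9 gives a median of 5; adding 7 next overwrites the oldest entry (5) and the median becomes 7.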

Changed when check_heap_count is called - the previous place was right after a suspension, which does not help with spacing the suspensions out (it was "suspend for GC" then "immediately suspend to change heap count"). It also caused a problem with BGC: it always tried to change the heap count when it couldn't because a BGC was in progress. I changed this to be on a timeout to intentionally space the suspensions out. Now most of the time, heap count changes happen due to this timeout. If we are really in a situation where GCs are happening too quickly and we return from waiting on the ee_suspend_event because a GC started, we change the heap count right before we do that GC. So this also helps with the BGC problem.
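The decision described above can be modeled as a small pure function. This is a sketch, not the actual gc.cpp logic: `WaitResult`, `Plan`, and `plan_after_wait` are hypothetical names, and skipping the change whenever a BGC is in progress is an assumption of this model.

```cpp
#include <cassert>

// Hypothetical outcome of waiting on ee_suspend_event with a timeout.
enum class WaitResult { TimedOut, GcRequested };

struct Plan {
    bool suspend_to_change_heap_count = false;  // standalone suspension, spaced out by the timeout
    bool change_heap_count_before_gc = false;   // done right before the GC that was requested
    bool run_gc = false;
};

// Assumption for this sketch: a pending change is skipped while a BGC is in progress.
inline Plan plan_after_wait(WaitResult result, bool change_pending, bool bgc_in_progress) {
    Plan plan;
    const bool can_change = change_pending && !bgc_in_progress;
    if (result == WaitResult::TimedOut) {
        // Common case: no GC was requested during the timeout, so the heap
        // count change gets its own suspension, away from any GC pause.
        plan.suspend_to_change_heap_count = can_change;
    } else {
        // GCs are arriving faster than the timeout: change the heap count
        // right before doing the GC instead of right after it.
        plan.change_heap_count_before_gc = can_change;
        plan.run_gc = true;
    }
    // The two ways of changing the heap count are mutually exclusive.
    assert(!(plan.suspend_to_change_heap_count && plan.change_heap_count_before_gc));
    return plan;
}
```

Either way the change never lands in the old spot, immediately after a GC's own suspension.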


ghost · 2023-09-07T06:43:26Z

Tagging subscribers to this area: @dotnet/gc
See info in area-owners.md if you want to be subscribed.

Issue Details

Currently the samples for DATAS are taken in check_heap_count, which is executed when EE is not suspended. This causes races with both BGC and allocating threads.

Also adding a separate array of samples for gen2 GCs. They serve as a backstop if the gen2 GCs keep not getting picked by the main sample array.

This is WIP - I'm working on more changes to make this more robust.

Author:	Maoni0
Assignees:	Maoni0
Labels:	`area-GC-coreclr`
Milestone:	-



src/coreclr/gc/gc.cpp

mangod9 · 2023-09-18T22:32:07Z

/backport to release/8.0

github-actions · 2023-09-18T22:32:20Z

Started backporting to release/8.0: https://github.com/dotnet/runtime/actions/runs/6228726593

github-actions · 2023-09-18T22:34:41Z

@mangod9 backporting to release/8.0 failed, the patch most likely resulted in conflicts:

$ git am --3way --ignore-whitespace --keep-non-patch changes.patch

Applying: this was based on Peter's gen2 sampling changes but I changed the way median is calculated
.git/rebase-apply/patch:178: trailing whitespace.
        now, saved_last_sample_time, 
warning: 1 line adds whitespace errors.
Using index info to reconstruct a base tree...
M	src/coreclr/gc/gc.cpp
M	src/coreclr/gc/gcpriv.h
Falling back to patching base and 3-way merge...
Auto-merging src/coreclr/gc/gcpriv.h
Auto-merging src/coreclr/gc/gc.cpp
CONFLICT (content): Merge conflict in src/coreclr/gc/gc.cpp
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 this was based on Peter's gen2 sampling changes but I changed the way median is calculated
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".
Error: The process '/usr/bin/git' failed with exit code 128

Please backport manually!

github-actions · 2023-09-18T22:34:42Z

@mangod9 an error occurred while backporting to release/8.0, please check the run log for details!

Error: git am failed, most likely due to a merge conflict.


* new synchronization mechanism for DATAS (#90726)

The current mechanism has a fundamental flaw: the idling threads can start running at unpredictable times when they are woken up. This causes all sorts of problems.

For example, when a thread gets here in gc_thread_function - `if (n_heaps <= heap_number)` - if it's true the thread is supposed to wait. But its execution could be delayed, so after it reads n_heaps it can stop for a while since no thread is waiting on this thread anyway... till some time later when a heap count change happens again and requires this thread to participate. And now this thread does the comparison and discovers that it needs to wait, so it goes idle and all other threads will just be waiting for this thread to join.

Another example is that it's not safe to change the heap count for a join from a larger one to a smaller one. It's fine to change from a smaller one to a larger one because all the participating threads will have to run in order for a join to finish. But if no one is waiting on a thread, it could just wake up from the event being set by the last thread joining and not run for a while, then go back to the respin loop at a point where the color was changed and changed again! So now it thinks it can proceed with a join it does not belong to. And of course that wouldn't work.

The way threads are going idle/waking up is hard to keep track of - not only does it involve the gc_start_event and gc_idle_thread_event, it also uses WaitForGCEvent, which is used by SuspendEE/RestartEE, which in turn means whenever we want to call these we'd need to care about how that would affect this.

The new mechanism only uses gc_start_event and gc_idle_thread_event, but I changed gc_idle_thread_event to a per heap event. We can easily track which threads are going idle - whenever a thread is about to wait on the idle event, we increase the current idle_thread_count. And when we increase the heap count we only set the gc_idle_thread_event for the new heaps that are about to participate, so we can deduct that many from idle_thread_count. There's a much simpler code path between "we know we don't need these threads anymore" and "these threads are at a known point" because the next time gc_start_event is set (ie, a GC is requested) we make sure to get these threads to a good known point, ie, we wait till all of them have completed increasing idle_thread_count.

Also fixed a couple of other problems that I hit while testing the new mechanism -

+ We are setting freeable_uoh_segment and freeable_soh_segment in decommission_heap to DECOMMISSIONED_REGION_P. And this causes us to simply lose the value for them. We should make sure we do push these to the free regions before we start changing the heap count. We should also call background_delay_delete_uoh_segments before we start changing the heap count so we can get rid of the regions marked with heap_segment_flags_uoh_delete. If we allow these to be rearranged in equalize_promoted_bytes, the order can change, so the invariant of the first region never being deleted no longer holds and we can AV in this method. I added a new method delay_free_segments to perform both tasks.
+ The accounting of generation_free_list_space is slightly off for LOH, which causes us to hit assert (gen_size >= dd_fragmentation (dd)); in change_heap_count because we were not counting the loh_pad size. I also disabled assert (free_list_space_decrease <= dd_fragmentation (dd)); for gen2 since I'm seeing this fire while I'm doing stress runs. I have yet to investigate this since I didn't want to add yet more changes to this PR.

* fixed problems with how sampling is done and how we suspend to change heap count in DATAS (#91712)

+ Moved the sample recording to when we are suspended. Previously the throughput cost was calculated in check_heap_count (which is called right after we restart EE on heap0), where we record the msl_wait_time (and reset it to 0 for soh/uoh). This is not synchronized with the allocating threads (which are already running at this point). So the allocating threads may have already accumulated more wait time, which gets attributed to this GC even though it's not within the period we are counting for this GC (and we lose this part for the next GC). For BGC this is incorrect. If an ephemeral GC did happen before the BGC starts, we'd be adding a sample for that GC, which is basically correct for that eph GC. But if an eph GC did not happen, we are just adding a random sample which calculates the tcp as (msl wait + whatever GC finished before this BGC), so it's obviously incorrect.
+ Added gen2 sampling - this was adapted from Peter's gen2 sampling changes. This serves as a backstop in case the existing sampling doesn't ever pick gen2 GC costs. I made the following fixes - 1) changed the way we calculated the median 2) moved where this is calculated to again avoid timing issues 3) made the gen2 samples actually count instead of losing that info if we happen to sample when a gen2 didn't just occur.
+ Changed when check_heap_count is called - the previous place was right after a suspension, which does not help with spacing the suspensions out (it was "suspend for GC" then "immediately suspend to change heap count"). It also caused a problem with BGC: it always tried to change the heap count when it couldn't because a BGC was in progress. I changed this to be on a timeout to intentionally space the suspensions out. Now most of the time, heap count changes happen due to this timeout. If we are really in a situation where GCs are happening too quickly and we return from waiting on the ee_suspend_event because a GC started, we change the heap count right before we do that GC. So this also helps with the BGC problem.

* gen0_bricks_cleared flag needs to be propagated when we change heap count (#90457)

When we change the heap count, in heap X we get a region from heap Y and the gen0_bricks_cleared flag from Y says false but heap X says true. So when we check the bricks on heap X, we assume it’s true but it’s not. The fix is to detect if any heap has this flag as false and if so make all heaps’ flag false (tracking which region is moved from which other heap is something we need additional recording for and it’s not really worth doing just for this).

* a logging change

---------

Co-authored-by: Maoni0 <maoni@microsoft.com>
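The gen0_bricks_cleared fix described for #90457 amounts to a conservative "any false makes all false" propagation. A minimal sketch, assuming a hypothetical `HeapModel` struct in place of the real per-heap state:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical stand-in for the per-heap state; only the flag matters here.
struct HeapModel {
    bool gen0_bricks_cleared = true;
};

// If any heap says its gen0 bricks are not cleared, mark every heap as not
// cleared. This avoids tracking which region moved from which heap.
inline void propagate_gen0_bricks_cleared(std::vector<HeapModel>& heaps) {
    assert(!heaps.empty());
    const bool all_cleared = std::all_of(
        heaps.begin(), heaps.end(),
        [](const HeapModel& h) { return h.gen0_bricks_cleared; });
    if (!all_cleared) {
        for (HeapModel& h : heaps) {
            h.gen0_bricks_cleared = false;
        }
    }
}
```

A false flag only costs extra work the next time the bricks are checked, while a wrongly true flag is a correctness bug, which is why over-approximating towards false is the safe direction.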

this was based on Peter's gen2 sampling changes but I changed the way median is calculated

009f787

dotnet-issue-labeler bot added the area-GC-coreclr label Sep 7, 2023

ghost assigned Maoni0 Sep 7, 2023

Maoni0 added NO-MERGE The PR is not ready for merge yet (see discussion for detailed reasons) and removed area-GC-coreclr labels Sep 7, 2023

Maoni0 force-pushed the datas_sampling branch from f399a9c to 9d4830e Compare September 7, 2023 08:16

build-analysis bot mentioned this pull request Sep 7, 2023

Tracking issue for CI build timeouts #76454

Closed

reworked when the samples are taken - now they are taken always during STW which eliminates races with BGC and allocating threads. fix

ebb00a7

Maoni0 force-pushed the datas_sampling branch from 9d4830e to ebb00a7 Compare September 7, 2023 18:51

Maoni0 added 3 commits September 11, 2023 22:01

reworked when we actually change the heap count - now we mostly change using a timeout so we can space out the suspension time; if we don't get to do that (eg, GC happens too frequently or it's mostly BGCs) we change it before we start a GC

db9114d

logging changes

51d544c

fixes and cleanup

fb527f5

Maoni0 force-pushed the datas_sampling branch from c86d4ee to fb527f5 Compare September 17, 2023 05:52

Maoni0 changed the title ~~[WIP] changes in how sampling is done for DATAS~~ changes in how sampling is done and how we suspend to change heap count for DATAS Sep 17, 2023

Maoni0 changed the title ~~changes in how sampling is done and how we suspend to change heap count for DATAS~~ fixed problems with how sampling is done and how we suspend to change heap count in DATAS Sep 17, 2023

Maoni0 removed the NO-MERGE The PR is not ready for merge yet (see discussion for detailed reasons) label Sep 17, 2023

mangod9 reviewed Sep 18, 2023: src/coreclr/gc/gc.cpp (resolved)

mangod9 reviewed Sep 18, 2023: src/coreclr/gc/gc.cpp (resolved)

mangod9 reviewed Sep 18, 2023: src/coreclr/gc/gc.cpp (resolved)

mangod9 approved these changes Sep 18, 2023

Maoni0 merged commit e1ca02f into dotnet:main Sep 18, 2023
109 checks passed

Maoni0 mentioned this pull request Sep 20, 2023

[release/8.0-rc2] porting DATAS change back to RC2 #92323

Merged

ghost locked as resolved and limited conversation to collaborators Oct 19, 2023
