Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fixed problems with how sampling is done and how we suspend to change heap count in DATAS #91712

Merged
merged 5 commits into from
Sep 18, 2023

Conversation

Maoni0
Copy link
Member

@Maoni0 Maoni0 commented Sep 7, 2023

  • Moved the sample recording into when we are suspended. The way we were calculating the throughput cost was in check_heap_count (which is called right after we restart EE on heap0), we record the msl_wait_time (and reset it to 0 for soh/uoh). This is not synchronized with the allocating threads (which are already running at this point). So what can happen is the allocating threads are already accumulated more wait time which is attributed to this GC but it's not within the period we are counting for this GC (and we lose this part for the next GC). For BGC this is incorrect. If an ephemeral GC did happen before the BGC starts, we'd be adding a sample for that GC which is basically correct for that eph GC. But if an eph GC did not happen, we are just adding a random sample which is calculating the tcp as (msl wait + whatever GC that was finished before this BGC) so obviously incorrect.

  • Added gen2 sampling - this was adapted from Peter's gen2 sampling changes. This serves as a backstop in case the existing sampling doesn't ever pick gen2 GC costs. I made the following fixes -

  1. changed the way we calculated the median
  2. moved where this is calculated to again avoid timing issues
  3. made the gen2 samples actually count instead of losing that info if we happen to sample when a gen2 didn't just occur.
  • Changed when check_heap_count is called - the previous place is right after a suspension which does not help with spacing the suspension time out (it was "suspend for GC" then "immediately suspend to change heap count"). And it caused a problem with BGC which was it always tried to change heap count when it couldn't because BGC was in progress. I changed this to be on a timeout to intentionally space the suspensions out. Now most of the time, heap count changes happen due to this time out. If we are really in a situation where GCs are happening too quickly and we return from waiting on the ee_suspend_event due to a GC started, we change the heap count right before we do a GC. So this also helps with the BGC problem.

@ghost
Copy link

ghost commented Sep 7, 2023

Tagging subscribers to this area: @dotnet/gc
See info in area-owners.md if you want to be subscribed.

Issue Details

currently the samples for DATAS are taken in check_heap_count which is executed when EE is not suspended. this causes races with both BGC and allocating threads.

also adding a separate array of samples for gen2 GCs. they serve as a backstop if the gen2 GCs keep not getting picked by the main sample array.

this is WIP - I'm working on more changes to make this more robust.

Author: Maoni0
Assignees: Maoni0
Labels:

area-GC-coreclr

Milestone: -

@Maoni0 Maoni0 added NO-MERGE The PR is not ready for merge yet (see discussion for detailed reasons) and removed area-GC-coreclr labels Sep 7, 2023
…g STW which eliminates races with BGC and allocating threads.

fix
Maoni0 added 3 commits September 11, 2023 22:01
…e using a timeout so we can space out the suspension time; if we don't get to do that (eg, GC happens too frequently or it's mostly BGCs) we change it before we start a GC
@Maoni0 Maoni0 changed the title [WIP] changes in how sampling is done for DATAS changes in how sampling is done and how we suspend to change heap count for DATAS Sep 17, 2023
@Maoni0 Maoni0 changed the title changes in how sampling is done and how we suspend to change heap count for DATAS fixed problems with how sampling is done and how we suspend to change heap count in DATAS Sep 17, 2023
@Maoni0 Maoni0 removed the NO-MERGE The PR is not ready for merge yet (see discussion for detailed reasons) label Sep 17, 2023
@Maoni0 Maoni0 merged commit e1ca02f into dotnet:main Sep 18, 2023
109 checks passed
@mangod9
Copy link
Member

mangod9 commented Sep 18, 2023

/backport to release/8.0

@github-actions
Copy link
Contributor

Started backporting to release/8.0: https://github.com/dotnet/runtime/actions/runs/6228726593

@github-actions
Copy link
Contributor

@mangod9 backporting to release/8.0 failed, the patch most likely resulted in conflicts:

$ git am --3way --ignore-whitespace --keep-non-patch changes.patch

Applying: this was based on Peter's gen2 sampling changes but I changed the way median is calculated
.git/rebase-apply/patch:178: trailing whitespace.
        now, saved_last_sample_time, 
warning: 1 line adds whitespace errors.
Using index info to reconstruct a base tree...
M	src/coreclr/gc/gc.cpp
M	src/coreclr/gc/gcpriv.h
Falling back to patching base and 3-way merge...
Auto-merging src/coreclr/gc/gcpriv.h
Auto-merging src/coreclr/gc/gc.cpp
CONFLICT (content): Merge conflict in src/coreclr/gc/gc.cpp
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 this was based on Peter's gen2 sampling changes but I changed the way median is calculated
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".
Error: The process '/usr/bin/git' failed with exit code 128

Please backport manually!

@github-actions
Copy link
Contributor

@mangod9 an error occurred while backporting to release/8.0, please check the run log for details!

Error: git am failed, most likely due to a merge conflict.

Maoni0 added a commit to Maoni0/runtime that referenced this pull request Sep 20, 2023
… heap count in DATAS (dotnet#91712)

+ Moved the sample recording into when we are suspended. The way we were calculating the throughput cost was in check_heap_count (which is called right after we restart EE on heap0), we record the msl_wait_time (and reset it to 0 for soh/uoh). This is not synchronized with the allocating threads (which are already running at this point). So what can happen is the allocating threads are already accumulated more wait time which is attributed to this GC but it's not within the period we are counting for this GC (and we lose this part for the next GC). For BGC this is incorrect. If an ephemeral GC did happen before the BGC starts, we'd be adding a sample for that GC which is basically correct for that eph GC. But if an eph GC did not happen, we are just adding a random sample which is calculating the tcp as (msl wait + whatever GC that was finished before this BGC) so obviously incorrect.

+ Added gen2 sampling - this was adapted from Peter's gen2 sampling changes. This serves as a backstop in case the existing sampling doesn't ever pick gen2 GC costs. I made the following fixes -

1) changed the way we calculated the median

2) moved where this is calculated to again avoid timing issues

3) made the gen2 samples actually count instead of losing that info if we happen to sample when a gen2 didn't just occur.

+ Changed when check_heap_count is called - the previous place is right after a suspension which does not help with spacing the suspension time out (it was "suspend for GC" then "immediately suspend to change heap count"). And it caused a problem with BGC which was it always tried to change heap count when it couldn't because BGC was in progress. I changed this to be on a timeout to intentionally space the suspensions out. Now most of the time, heap count changes happen due to this time out. If we are really in a situation where GCs are happening too quickly and we return from waiting on the ee_suspend_event due to a GC started, we change the heap count right before we do a GC. So this also helps with the BGC problem.
carlossanlop pushed a commit that referenced this pull request Sep 20, 2023
* new synchronization mechanism for DATAS (#90726)

The current mechanism has a fundamental flaw which is the idling threads can start running at unpredictable times when they are woken up. This causes all sorts of problems. For example, when a thread gets here in gc_thread_function -

`if (n_heaps <= heap_number)`

if it's true it's supposed to wait. But its execution could be delayed so after it reads n_heaps it can stop for a while since no thread is waiting on this thread anyway... till some time later when a heap count changes happens again and it requires this thread to participating. And now this thread does the comparison and discovers that it needs to wait so it goes idle and all other threads will just be waiting for this thread to join.

Another example is it's not safe to change the heap count for a join from a larger one to a smaller one. It's fine to change from a smaller one to a larger one because all the threads participating will have to run in order for a join to finish. But if no one is waiting on a thread, it could just wake up from the event being set by the last thread joining and not run for a while. Then go back to the respin loop at a point where the color was changed and changed again! So now it thinks it can proceed with a join it does not belong to. And of course that wouldn't work.

The way threads are going idle/waking up is hard to keep track of - not only does it involve the gc_start_event and gc_idle_thread_event, it also uses WaitForGCEvent which is used by SuspendEE/RestartEE which in turn means whenever we want to call these we'd need to care about how that would affect this.

The new mechanism only uses gc_start_event and gc_idle_thread_event, but I changed gc_idle_thread_event to a per heap event. We can easily track which threads are going idling easily - whenever a thread is about to wait on the idle event, we increase the current idle_thread_count. And when we increase the heap count we only set the gc_idle_thread_event for the new heaps that are about to participate so we can deduct that many from idle_thread_count. There's a much simpler code path between "we know we don't need these threads anymore" to "these threads are at a known point" because the next time gc_start_event is set (ie, a GC is requested) we make sure to get these threads to a good known point, ie, we wait till all of them have completed increasing idle_thread_count.

Also fixed a couple of other problems that I hit while testing the new mechanism -

We are setting freeable_uoh_segment and freeable_soh_segment in decommission_heap to DECOMMISSIONED_REGION_P. And this causes us to simply lose the value for them. We should make sure we do push these to the free regions before we start changing the heap count.

We should also call background_delay_delete_uoh_segments before we start changing the heap count so we can get rid of the regions marked with heap_segment_flags_uoh_delete. If we allow these to be rearranged in equalize_promoted_bytes it means the order can change the invariant of the first region never being deleted no longer holds true and we can AV in this method.

I added an new method delay_free_segments to perform both tasks.

The accounting of generation_free_list_space is slightly off for LOH which causes us to hit assert (gen_size >= dd_fragmentation (dd)); in change_heap_count because we were not counting the loh_pad size.
I also disabled assert (free_list_space_decrease <= dd_fragmentation (dd)); for gen2 since I'm seeing this fired while I'm doing stress runs. I have yet to investigate this since I didn't want to add yet more changes to this PR.

* fixed problems with how sampling is done and how we suspend to change heap count in DATAS (#91712)

+ Moved the sample recording into when we are suspended. The way we were calculating the throughput cost was in check_heap_count (which is called right after we restart EE on heap0), we record the msl_wait_time (and reset it to 0 for soh/uoh). This is not synchronized with the allocating threads (which are already running at this point). So what can happen is the allocating threads are already accumulated more wait time which is attributed to this GC but it's not within the period we are counting for this GC (and we lose this part for the next GC). For BGC this is incorrect. If an ephemeral GC did happen before the BGC starts, we'd be adding a sample for that GC which is basically correct for that eph GC. But if an eph GC did not happen, we are just adding a random sample which is calculating the tcp as (msl wait + whatever GC that was finished before this BGC) so obviously incorrect.

+ Added gen2 sampling - this was adapted from Peter's gen2 sampling changes. This serves as a backstop in case the existing sampling doesn't ever pick gen2 GC costs. I made the following fixes -

1) changed the way we calculated the median

2) moved where this is calculated to again avoid timing issues

3) made the gen2 samples actually count instead of losing that info if we happen to sample when a gen2 didn't just occur.

+ Changed when check_heap_count is called - the previous place is right after a suspension which does not help with spacing the suspension time out (it was "suspend for GC" then "immediately suspend to change heap count"). And it caused a problem with BGC which was it always tried to change heap count when it couldn't because BGC was in progress. I changed this to be on a timeout to intentionally space the suspensions out. Now most of the time, heap count changes happen due to this time out. If we are really in a situation where GCs are happening too quickly and we return from waiting on the ee_suspend_event due to a GC started, we change the heap count right before we do a GC. So this also helps with the BGC problem.

* gen0_bricks_cleared flag needs to be propagated when we change heap count (#90457)

when we change the heap count, in heap X we get a region from heap Y and the gen0_bricks_cleared flag from Y says false but heap X says true. So when we check the bricks on heap X, we assume it’s true but it’s not.

the fix is to detect if any heap has this flag as false and if so make all heaps’ flag false (tracking which region is moved from which other heap is something we need additional recording for and it’s not really worth doing just for this)

* a logging change

---------

Co-authored-by: Maoni0 <maoni@microsoft.com>
@ghost ghost locked as resolved and limited conversation to collaborators Oct 19, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants