Skip to content

Commit

Permalink
[release/8.0-rc2] porting DATAS change back to RC2 (#92323)
Browse files Browse the repository at this point in the history
* new synchronization mechanism for DATAS (#90726)

The current mechanism has a fundamental flaw which is the idling threads can start running at unpredictable times when they are woken up. This causes all sorts of problems. For example, when a thread gets here in gc_thread_function -

`if (n_heaps <= heap_number)`

if it's true it's supposed to wait. But its execution could be delayed so after it reads n_heaps it can stop for a while since no thread is waiting on this thread anyway... till some time later when a heap count changes happens again and it requires this thread to participating. And now this thread does the comparison and discovers that it needs to wait so it goes idle and all other threads will just be waiting for this thread to join.

Another example is it's not safe to change the heap count for a join from a larger one to a smaller one. It's fine to change from a smaller one to a larger one because all the threads participating will have to run in order for a join to finish. But if no one is waiting on a thread, it could just wake up from the event being set by the last thread joining and not run for a while. Then go back to the respin loop at a point where the color was changed and changed again! So now it thinks it can proceed with a join it does not belong to. And of course that wouldn't work.

The way threads are going idle/waking up is hard to keep track of - not only does it involve the gc_start_event and gc_idle_thread_event, it also uses WaitForGCEvent which is used by SuspendEE/RestartEE which in turn means whenever we want to call these we'd need to care about how that would affect this.

The new mechanism only uses gc_start_event and gc_idle_thread_event, but I changed gc_idle_thread_event to a per heap event. We can easily track which threads are going idling easily - whenever a thread is about to wait on the idle event, we increase the current idle_thread_count. And when we increase the heap count we only set the gc_idle_thread_event for the new heaps that are about to participate so we can deduct that many from idle_thread_count. There's a much simpler code path between "we know we don't need these threads anymore" to "these threads are at a known point" because the next time gc_start_event is set (ie, a GC is requested) we make sure to get these threads to a good known point, ie, we wait till all of them have completed increasing idle_thread_count.

Also fixed a couple of other problems that I hit while testing the new mechanism -

We are setting freeable_uoh_segment and freeable_soh_segment in decommission_heap to DECOMMISSIONED_REGION_P. And this causes us to simply lose the value for them. We should make sure we do push these to the free regions before we start changing the heap count.

We should also call background_delay_delete_uoh_segments before we start changing the heap count so we can get rid of the regions marked with heap_segment_flags_uoh_delete. If we allow these to be rearranged in equalize_promoted_bytes it means the order can change the invariant of the first region never being deleted no longer holds true and we can AV in this method.

I added an new method delay_free_segments to perform both tasks.

The accounting of generation_free_list_space is slightly off for LOH which causes us to hit assert (gen_size >= dd_fragmentation (dd)); in change_heap_count because we were not counting the loh_pad size.
I also disabled assert (free_list_space_decrease <= dd_fragmentation (dd)); for gen2 since I'm seeing this fired while I'm doing stress runs. I have yet to investigate this since I didn't want to add yet more changes to this PR.

* fixed problems with how sampling is done and how we suspend to change heap count in DATAS (#91712)

+ Moved the sample recording into when we are suspended. The way we were calculating the throughput cost was in check_heap_count (which is called right after we restart EE on heap0), we record the msl_wait_time (and reset it to 0 for soh/uoh). This is not synchronized with the allocating threads (which are already running at this point). So what can happen is the allocating threads are already accumulated more wait time which is attributed to this GC but it's not within the period we are counting for this GC (and we lose this part for the next GC). For BGC this is incorrect. If an ephemeral GC did happen before the BGC starts, we'd be adding a sample for that GC which is basically correct for that eph GC. But if an eph GC did not happen, we are just adding a random sample which is calculating the tcp as (msl wait + whatever GC that was finished before this BGC) so obviously incorrect.

+ Added gen2 sampling - this was adapted from Peter's gen2 sampling changes. This serves as a backstop in case the existing sampling doesn't ever pick gen2 GC costs. I made the following fixes -

1) changed the way we calculated the median

2) moved where this is calculated to again avoid timing issues

3) made the gen2 samples actually count instead of losing that info if we happen to sample when a gen2 didn't just occur.

+ Changed when check_heap_count is called - the previous place is right after a suspension which does not help with spacing the suspension time out (it was "suspend for GC" then "immediately suspend to change heap count"). And it caused a problem with BGC which was it always tried to change heap count when it couldn't because BGC was in progress. I changed this to be on a timeout to intentionally space the suspensions out. Now most of the time, heap count changes happen due to this time out. If we are really in a situation where GCs are happening too quickly and we return from waiting on the ee_suspend_event due to a GC started, we change the heap count right before we do a GC. So this also helps with the BGC problem.

* gen0_bricks_cleared flag needs to be propagated when we change heap count (#90457)

when we change the heap count, in heap X we get a region from heap Y and the gen0_bricks_cleared flag from Y says false but heap X says true. So when we check the bricks on heap X, we assume it’s true but it’s not.

the fix is to detect if any heap has this flag as false and if so make all heaps’ flag false (tracking which region is moved from which other heap is something we need additional recording for and it’s not really worth doing just for this)

* a logging change

---------

Co-authored-by: Maoni0 <maoni@microsoft.com>
  • Loading branch information
Maoni0 and Maoni0 committed Sep 20, 2023
1 parent 09429dc commit b0b595f
Show file tree
Hide file tree
Showing 3 changed files with 746 additions and 349 deletions.
Loading

0 comments on commit b0b595f

Please sign in to comment.