Problems with TryStartNoGCRegion #6809

Open
mattwarren opened this Issue Aug 19, 2016 · 11 comments

Projects

None yet

4 participants

@mattwarren
Contributor

As part of writing a blog post ("Preventing .NET Garbage Collections with the TryStartNoGCRegion API"), I came across some strange behaviour when using TryStartNoGCRegion.

With test code like this:

try
{
    Console.WriteLine("\nCalling TryStartNoGCRegion(..) with totalSize = {0:N0} MB ({1:N0} K or {2:N0} bytes)",
                        size / 1024.0 / 1024.0, size / 1024.0, size);
    var succeeded = GC.TryStartNoGCRegion(size);
    Console.WriteLine("{0:N0} MB ({1:N0} K or {2:N0} bytes) {3}",
                        size / 1024.0 / 1024.0, size / 1024.0, size, succeeded ? "SUCCEEDED" : "FAILED");
}
catch (Exception ex)
{
    Console.WriteLine("{0:N0} MB ({1:N0} K or {2:N0} bytes) - {3} {4}",
                      size / 1024.0 / 1024.0, size / 1024.0, size, ex.GetType().Name, ex.Message);
}

I get the following results:

totalSize Result
240 * 1024 * 1024 Returns true
242 * 1024 * 1024 Hangs
244 * 1024 * 1024 Hangs
246 * 1024 * 1024 Hangs
248 * 1024 * 1024 Hangs
250 * 1024 * 1024 Hangs
252 * 1024 * 1024 Hangs
254 * 1024 * 1024 Hangs
256 * 1024 * 1024 Throws ArgumentOutOfRangeException("totalSize is too large")

It seems that there are some values of totalSize that cause TryStartNoGCRegion to never return? On some runs, the values that hang will instead return false, but more often than not it just hangs?

Note: There's tests were all done on .NET 4.6,1, 32-bit, Server GC (<gcServer enabled="true"/>).

@swgillespie
Member
swgillespie commented Aug 19, 2016 edited

Thanks for the report! cc @Maoni0

I have a repro, here are the interesting threads:

   0  Id: 45d8.47b8 Suspend: 1 Teb: 006e7000 Unfrozen
 # ChildEBP RetAddr  
00 008feb38 766ce201 ntdll!NtWaitForSingleObject+0xc
01 008febac 741ecf67 KERNELBASE!WaitForSingleObjectEx+0x91
02 008febdc 741ecfae clr!CLREventWaitHelper2+0x31
03 008fec2c 741ecf31 clr!CLREventWaitHelper+0x2a
04 008fec64 7436a29f clr!CLREventBase::WaitEx+0x152
05 008fec88 7436a3a1 clr!SVR::gc_heap::wait_for_gc_done+0x66
06 008fecb4 742f8a0b clr!SVR::GCHeap::GarbageCollectGeneration+0x1c2
07 008fecc4 742f8a76 clr!SVR::GCHeap::GarbageCollectTry+0x56
08 008fecf0 745d9333 clr!SVR::GCHeap::GarbageCollect+0x103
09 008fed24 744e1ead clr!SVR::GCHeap::StartNoGCRegion+0x43
0a 008fed80 72d35d7f clr!GCInterface::StartNoGCRegion+0x6c
0b 008fedcc 72dc307d mscorlib_ni+0x9a5d7f
0c 008fedec 72dc30c0 mscorlib_ni+0xa3307d
0d 008fee04 010705c1 mscorlib_ni+0xa330c0
WARNING: Frame IP not in any known module. Following frames may be wrong.
0e 008feee8 741e1376 0x10705c1
0f 008feef4 741e366f clr!CallDescrWorkerInternal+0x34
10 008fef48 741ed376 clr!CallDescrWorkerWithHandler+0x6b
11 008fefb0 742fa204 clr!MethodDescCallSite::CallTargetWorker+0x158
12 008ff0d4 742fa40b clr!RunMain+0x1aa
13 008ff348 742faa33 clr!Assembly::ExecuteMainMethod+0x124
14 008ff850 742faad9 clr!SystemDomain::ExecuteMainMethod+0x651
15 008ff8a8 742fa2da clr!ExecuteEXE+0x4c
16 008ff8e8 7431599f clr!_CorExeMainInternal+0xdc
17 008ff924 749dcccb clr!_CorExeMain+0x4d
18 008ff964 74a5dd05 mscoreei!_CorExeMain+0x10a
19 008ff974 767b38f4 MSCOREE!_CorExeMain_Exported+0xa5
1a 008ff988 77df5de3 KERNEL32!BaseThreadInitThunk+0x24
1b 008ff9d0 77df5dae ntdll!__RtlUserThreadStart+0x2f
1c 008ff9e0 00000000 ntdll!_RtlUserThreadStart+0x1b

   3  Id: 45d8.213c Suspend: 1 Teb: 006f9000 Unfrozen
 # ChildEBP RetAddr  
00 1b45f7e0 766ce201 ntdll!NtWaitForSingleObject+0xc
01 1b45f854 741ecf67 KERNELBASE!WaitForSingleObjectEx+0x91
02 1b45f884 741ecfae clr!CLREventWaitHelper2+0x31
03 1b45f8d4 741ecf31 clr!CLREventWaitHelper+0x2a
04 1b45f90c 742bac82 clr!CLREventBase::WaitEx+0x152
05 1b45f920 7436ad0d clr!CLREventBase::Wait+0x1a
06 1b45f948 7431fa3a clr!SVR::gc_heap::gc_thread_function+0x129
07 1b45fa60 767b38f4 clr!SVR::gc_heap::gc_thread_stub+0x71
08 1b45fa74 77df5de3 KERNEL32!BaseThreadInitThunk+0x24
09 1b45fabc 77df5dae ntdll!__RtlUserThreadStart+0x2f
0a 1b45facc 00000000 ntdll!_RtlUserThreadStart+0x1b

   4  Id: 45d8.2588 Suspend: 1 Teb: 006fc000 Unfrozen
 # ChildEBP RetAddr  
00 1b51f4a8 766ce201 ntdll!NtWaitForSingleObject+0xc
01 1b51f51c 741ecf67 KERNELBASE!WaitForSingleObjectEx+0x91
02 1b51f54c 741ecfae clr!CLREventWaitHelper2+0x31
03 1b51f59c 741ecf31 clr!CLREventWaitHelper+0x2a
04 1b51f5d4 7436a6b1 clr!CLREventBase::WaitEx+0x152
05 1b51f5fc 745cf390 clr!SVR::t_join::join+0x87
06 1b51f614 745cf88e clr!SVR::gc_heap::allocate_for_no_gc_after_gc+0x1c3
07 1b51f668 7436a104 clr!SVR::gc_heap::garbage_collect+0x773
08 1b51f688 7431fa3a clr!SVR::gc_heap::gc_thread_function+0x13a
09 1b51f89c 767b38f4 clr!SVR::gc_heap::gc_thread_stub+0x71
0a 1b51f8b0 77df5de3 KERNEL32!BaseThreadInitThunk+0x24
0b 1b51f8f8 77df5dae ntdll!__RtlUserThreadStart+0x2f
0c 1b51f908 00000000 ntdll!_RtlUserThreadStart+0x1b

   5  Id: 45d8.3130 Suspend: 1 Teb: 006ff000 Unfrozen
 # ChildEBP RetAddr  
00 1b5df638 766ce201 ntdll!NtWaitForSingleObject+0xc
01 1b5df6ac 741ecf67 KERNELBASE!WaitForSingleObjectEx+0x91
02 1b5df6dc 741ecfae clr!CLREventWaitHelper2+0x31
03 1b5df72c 741ecf31 clr!CLREventWaitHelper+0x2a
04 1b5df764 7436a6b1 clr!CLREventBase::WaitEx+0x152
05 1b5df78c 745cf390 clr!SVR::t_join::join+0x87
06 1b5df7a4 745cf88e clr!SVR::gc_heap::allocate_for_no_gc_after_gc+0x1c3
07 1b5df7f8 7436a104 clr!SVR::gc_heap::garbage_collect+0x773
08 1b5df818 7431fa3a clr!SVR::gc_heap::gc_thread_function+0x13a
09 1b5dfb2c 767b38f4 clr!SVR::gc_heap::gc_thread_stub+0x71
0a 1b5dfb40 77df5de3 KERNEL32!BaseThreadInitThunk+0x24
0b 1b5dfb88 77df5dae ntdll!__RtlUserThreadStart+0x2f
0c 1b5dfb98 00000000 ntdll!_RtlUserThreadStart+0x1b

   6  Id: 45d8.32f4 Suspend: 1 Teb: 00702000 Unfrozen
 # ChildEBP RetAddr  
00 1b69f908 766ce201 ntdll!NtWaitForSingleObject+0xc
01 1b69f97c 741ecf67 KERNELBASE!WaitForSingleObjectEx+0x91
02 1b69f9ac 741ecfae clr!CLREventWaitHelper2+0x31
03 1b69f9fc 741ecf31 clr!CLREventWaitHelper+0x2a
04 1b69fa34 7436a6b1 clr!CLREventBase::WaitEx+0x152
05 1b69fa5c 745cf390 clr!SVR::t_join::join+0x87
06 1b69fa74 745cf88e clr!SVR::gc_heap::allocate_for_no_gc_after_gc+0x1c3
07 1b69fac8 7436a104 clr!SVR::gc_heap::garbage_collect+0x773
08 1b69fae8 7431fa3a clr!SVR::gc_heap::gc_thread_function+0x13a
09 1b69ff00 767b38f4 clr!SVR::gc_heap::gc_thread_stub+0x71
0a 1b69ff14 77df5de3 KERNEL32!BaseThreadInitThunk+0x24
0b 1b69ff5c 77df5dae ntdll!__RtlUserThreadStart+0x2f
0c 1b69ff6c 00000000 ntdll!_RtlUserThreadStart+0x1b

   7  Id: 45d8.3378 Suspend: 1 Teb: 00705000 Unfrozen
 # ChildEBP RetAddr  
00 1b75f358 766ce201 ntdll!NtWaitForSingleObject+0xc
01 1b75f3cc 741ecf67 KERNELBASE!WaitForSingleObjectEx+0x91
02 1b75f3fc 741ecfae clr!CLREventWaitHelper2+0x31
03 1b75f44c 741ecf31 clr!CLREventWaitHelper+0x2a
04 1b75f484 7436a6b1 clr!CLREventBase::WaitEx+0x152
05 1b75f4ac 745cf390 clr!SVR::t_join::join+0x87
06 1b75f4c4 745cf88e clr!SVR::gc_heap::allocate_for_no_gc_after_gc+0x1c3
07 1b75f518 7436a104 clr!SVR::gc_heap::garbage_collect+0x773
08 1b75f538 7431fa3a clr!SVR::gc_heap::gc_thread_function+0x13a
09 1b75fa54 767b38f4 clr!SVR::gc_heap::gc_thread_stub+0x71
0a 1b75fa68 77df5de3 KERNEL32!BaseThreadInitThunk+0x24
0b 1b75fab0 77df5dae ntdll!__RtlUserThreadStart+0x2f
0c 1b75fac0 00000000 ntdll!_RtlUserThreadStart+0x1b

   8  Id: 45d8.49cc Suspend: 1 Teb: 00708000 Unfrozen
 # ChildEBP RetAddr  
00 1b81f5a0 766ce201 ntdll!NtWaitForSingleObject+0xc
01 1b81f614 741ecf67 KERNELBASE!WaitForSingleObjectEx+0x91
02 1b81f644 741ecfae clr!CLREventWaitHelper2+0x31
03 1b81f694 741ecf31 clr!CLREventWaitHelper+0x2a
04 1b81f6cc 742bac82 clr!CLREventBase::WaitEx+0x152
05 1b81f6e0 7436ad0d clr!CLREventBase::Wait+0x1a
06 1b81f708 7431fa3a clr!SVR::gc_heap::gc_thread_function+0x129
07 1b81fd24 767b38f4 clr!SVR::gc_heap::gc_thread_stub+0x71
08 1b81fd38 77df5de3 KERNEL32!BaseThreadInitThunk+0x24
09 1b81fd80 77df5dae ntdll!__RtlUserThreadStart+0x2f
0a 1b81fd90 00000000 ntdll!_RtlUserThreadStart+0x1b

   9  Id: 45d8.1748 Suspend: 1 Teb: 0070b000 Unfrozen
 # ChildEBP RetAddr  
00 1b8df1b8 766ce201 ntdll!NtWaitForSingleObject+0xc
01 1b8df22c 741ecf67 KERNELBASE!WaitForSingleObjectEx+0x91
02 1b8df25c 741ecfae clr!CLREventWaitHelper2+0x31
03 1b8df2ac 741ecf31 clr!CLREventWaitHelper+0x2a
04 1b8df2e4 7436a6b1 clr!CLREventBase::WaitEx+0x152
05 1b8df30c 745cf390 clr!SVR::t_join::join+0x87
06 1b8df324 745cf88e clr!SVR::gc_heap::allocate_for_no_gc_after_gc+0x1c3
07 1b8df378 7436a104 clr!SVR::gc_heap::garbage_collect+0x773
08 1b8df398 7431fa3a clr!SVR::gc_heap::gc_thread_function+0x13a
09 1b8dfaac 767b38f4 clr!SVR::gc_heap::gc_thread_stub+0x71
0a 1b8dfac0 77df5de3 KERNEL32!BaseThreadInitThunk+0x24
0b 1b8dfb08 77df5dae ntdll!__RtlUserThreadStart+0x2f
0c 1b8dfb18 00000000 ntdll!_RtlUserThreadStart+0x1b

The deadlock is caused by threads 3 and 8 waiting on gc_start_event while the rest of the threads are waiting for threads 3 and 8 to catch up (SVR::t_join::join).

CoreCLR does not currently expose this feature but it will in the very near future, so CoreCLR will also be affected by this issue.

My repro program's output (the same as @mattwarren 's with some extra logging):

Server GC: True
Pointer size: 4
Version: 4.0.30319.42000

Calling TryStartNoGCRegion(..) with totalSize = 244 MB (249,856 K or 255,852,544 bytes)
@mattwarren
Contributor

Cool, glad you were able to repo it, maybe there's similar values that would trigger the hang for 64-bit and/or Workstation mode, but I haven't had time to figure them out yet.


@swgillespie @Maoni0 I have one related question about TryStartNoGCRegion, the MSDN docs say the following (emphasis mine):

The amount of memory in bytes to allocate without triggering a garbage collection. It must be less than or equal to the size of an ephemeral segment. For information on the size of an ephemeral segment, see the "Ephemeral generations and segments" section in the Fundamentals of Garbage Collection article.

However if you visit the linked article on GC Fundamentals, it has no exact figure for the size of an ephemeral segment, it does however have this stark warning:

Important
The size of segments allocated by the garbage collector is implementation-specific and is subject to change at any time, including in periodic updates. Your app should never make assumptions about or depend on a particular segment size, nor should it attempt to configure the amount of memory available for segment allocations.

So how can you determine whether a value of totalSize is going to succeed? Obviously you can:

  1. At run-time call TryStartNoGCRegion, catch ArgumentOutOfRangeException and then adjust totalSize accordingly (i.e trial-and-error)
  2. Experiment beforehand, until you find a known value for a given setup (32/64-bit, Workstation/Server, # of CPU Cores, .NET Framework version). Then repeat the experiment if you change any part of your setup

Is there another option I'm missing?

@mattwarren
Contributor

One last thing, unless I'm mistaken these functions are declared in gcpriv.h but not implemented in gc.cpp:

Could they be removed?

@Maoni0 Maoni0 was assigned by gkhanna79 Aug 20, 2016
@gkhanna79 gkhanna79 added this to the 1.1.0 milestone Aug 20, 2016
@Maoni0
Member
Maoni0 commented Aug 21, 2016 edited

Thanks for reporting this.

Exact segment sizes are documented in "the "Ephemeral generations and segments" section in the Fundamentals of Garbage Collection article." as MSDN says. I just looked and verified that it's there.

@mattwarren
Contributor

Exact segment sizes are documented in "the "Ephemeral generations and segments" section in the Fundamentals of Garbage Collection article." as MSDN says. I just looked and verified that it's there.

I'm so sorry, I can't believe that missed that, you're correct it's right there on the page! (I feel a bit stupid now)

@mattwarren mattwarren added a commit to mattwarren/mattwarren.github.io that referenced this issue Aug 22, 2016
@mattwarren mattwarren Correcting my mistake!! 9c432c3
@mattwarren
Contributor

@Maoni0, one other question, mostly just for my own curiosity.

Why does the GC suspend the runtime even if a GC Collection doesn't happen?

In gc_heap::gc_thread_function() there is code like this, where it calls GCToEEInterface::SuspendEE(..) first and then decides if it should run a GC or not (should_proceed_with_gc()).

while (1)
{
    assert (!gc_t_join.joined());

    if (heap_number == 0)
    {
        gc_heap::ee_suspend_event.Wait(INFINITE, FALSE);

        BEGIN_TIMING(suspend_ee_during_log);
        GCToEEInterface::SuspendEE(GCToEEInterface::SUSPEND_FOR_GC);
        END_TIMING(suspend_ee_during_log);

        proceed_with_gc_p = TRUE;

        if (!should_proceed_with_gc())
        {
            update_collection_counts_for_no_gc();
            proceed_with_gc_p = FALSE;
        }
        else
            settings.init_mechanisms();
        dprintf (3, ("%d gc thread waiting...", heap_number));
        gc_start_event.Set();
    }
    else
    {
        gc_start_event.Wait(INFINITE, FALSE);
        dprintf (3, ("%d gc thread waiting... Done", heap_number));
    }

    if (proceed_with_gc_p)
        garbage_collect (GCHeap::GcCondemnedGeneration);

   ....

Isn't it a waste to suspend and resume the EE if no GC happens? Or is a suspend and/or resume considered so cheap that it's not a problem? Is there a technical reason why it can't call should_proceed_with_gc() first and then only suspend the EE if a GC is needed?

@Maoni0
Member
Maoni0 commented Aug 23, 2016

First of all, people are not calling the NoGCRegion API all the time so we can be afford to sacrifice a bit of perf in exchange of code simplicity. Secondly, we might need to do what I call a "minimal gc" which is just to add a new segment. For that we do want to suspend threads.

@mattwarren
Contributor
mattwarren commented Aug 23, 2016 edited

Thanks for the info, I've recently been digging into EE suspension for the GC (http://mattwarren.org/2016/08/08/GC-Pauses-and-Safe-Points/) and this is good to know

@swgillespie swgillespie modified the milestone: 1.2.0, 1.1.0 Oct 25, 2016
@swgillespie
Member

Moving this out to 1.2.0 since this API isn't exposed in 1.1.0.

@mattwarren
Contributor

Just out of interest, is this going to be fixed in Desktop/Full CLR as well in CoreCLR?

@Maoni0
Member
Maoni0 commented Oct 26, 2016

yes, it will be fixed in both....I should be able to get to this soon :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment