High gen0 collect overhead with (suppressed) finalizer objects #48937

Open
peppy opened this issue Mar 2, 2021 · 44 comments
Labels
area-GC-coreclr tenet-performance Performance related issue
Milestone

Comments

peppy commented Mar 2, 2021

Overview

When allocating many objects with a finalizer present, there is a non-negligible overhead on GC gen0 collects, even if the finalizer has been suppressed via GC.SuppressFinalize. The hypothesis is that this is due to the emptying of the finalizer queue.

This is causing our real-time application to hitch on every gen0 collection (around 5-10ms pause time). These gen0 collections only happen every 20-60 seconds.

Use Case

A bit more information on our situation, in case it helps to put things into perspective.

Our team develops a rhythm game/framework which requires sustained very low latency execution. We have recently been tracking user reports of occasional high frame times which align with GC invocation, specifically gen0 collections.

Having worked with .NET (Framework and Core) for nearly two decades, I have a general idea of what to expect in terms of gen0 collection performance, and the numbers we are seeing are much higher than expected: in the range of 5-15ms per collection with low (<1MB/s) allocation throughput and near-zero promotion.

One cause turned out to be a texture upload class we have, which rents memory from ArrayPool and returns on disposal. This class may be constructed every frame for streaming texture data. While we do our best to explicitly dispose after consumption, it has a finalizer implemented as a safety measure, to ensure the memory is returned to the ArrayPool no matter what (I think this is a pretty common practice).
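
To make the pattern concrete, here is a minimal sketch of roughly what such a class looks like (hypothetical names, not our actual implementation): it rents from ArrayPool on construction, returns the buffer on Dispose and suppresses the finalizer, and keeps the finalizer purely as a safety net.

using System;
using System.Buffers;

// Hypothetical sketch of the pattern described above: rent from ArrayPool,
// return on Dispose, keep a finalizer only as a safety net.
public sealed class TextureUploadBuffer : IDisposable
{
    private byte[] data;

    public TextureUploadBuffer(int size)
    {
        data = ArrayPool<byte>.Shared.Rent(size);
    }

    public Span<byte> Data => data;

    public void Dispose()
    {
        ReturnBuffer();
        // Suppressing the finalizer is the common path, but as this issue shows,
        // the object still contributes to finalization-related scanning at GC time.
        GC.SuppressFinalize(this);
    }

    // Safety net: return the buffer even if Dispose was missed.
    ~TextureUploadBuffer() => ReturnBuffer();

    private void ReturnBuffer()
    {
        byte[] buffer = data;

        if (buffer != null)
        {
            data = null;
            ArrayPool<byte>.Shared.Return(buffer);
        }
    }
}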

With our findings here, it seems that finalizers should be avoided in such cases, where objects are constructed in abundance. This is our general direction to resolve this issue, for what it's worth.

Reproduction

using System;

namespace TestBasicAllocs
{
    public static class Class1
    {
        public static void Main(string[] args)
        {
            bool finalizers = args[0] == "1";

            for (int i = 0; i < 10000000; i++)
            {
                if (finalizers)
                {
                    var thing = new FinalizingThing();
                    GC.SuppressFinalize(thing);
                }
                else
                    new NonFinalizingThing();
            }
        }
    }

    public class NonFinalizingThing
    {
        public NonFinalizingThing()
        {
        }
    }

    public class FinalizingThing
    {
        public FinalizingThing()
        {
        }

        ~FinalizingThing()
        {
        }
    }
}

Results

dotnet-trace collect -- .\bin\Debug\net5.0\TestBasicAllocs.exe 0

[screenshot: GC stats from the trace with finalizers disabled (argument 0)]

dotnet-trace collect -- .\bin\Debug\net5.0\TestBasicAllocs.exe 1

[screenshot: GC stats from the trace with suppressed finalizers (argument 1)]

I am writing this issue up without being sure whether it should be considered a bug, a performance issue, or an accepted (and potentially better-documented) hidden overhead of finalizers. I have read through the .NET memory performance analysis documentation, which does mention that finalizers should be avoided, but also suggests that calling GC.SuppressFinalize should recover all but the allocation overhead. Similar information is present in "official" documentation and user comments, but I have been unable to find anything referencing a gen0 collection-time overhead.

Also worth noting: while memory analysis guides and profilers like JetBrains dotMemory will highlight finalizers that were not suppressed, they provide no visibility into, and raise no issue with, large numbers of allocations of objects that merely have finalizers present. Maybe in the majority of cases this pause overhead is considered acceptable, but do consider that the above benchmarks are cases where gen0 collections happen quite regularly. In our actual usage we have seen pause times as long as 30-50ms due to the same underlying issue, which implies that this overhead is not factored into how often gen0 collections are run.

I have tested against net472 and .NET Core 3.1/3.2/5.0, and this is not a regression.

@peppy peppy added the tenet-performance Performance related issue label Mar 2, 2021
@dotnet-issue-labeler dotnet-issue-labeler bot added the untriaged New issue has not been triaged by the area owner label Mar 2, 2021
@dotnet-issue-labeler

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

ghost commented Mar 2, 2021

Tagging subscribers to this area: @dotnet/gc
See info in area-owners.md if you want to be subscribed.


jkotas (Member) commented Mar 2, 2021

it has a finalizer implemented as a safety measure, to ensure the memory is returned to the ArrayPool no matter what (I think this is a pretty common practice).

The typical pattern that we use in the core libraries is to just let the garbage collector take care of cleaning up after rare error cases. I do not think we have any place in the core libraries where we have a finalizer just to return memory to the array pool.

peppy (Author) commented Mar 2, 2021

@jkotas That is what I could see in my investigation (the lack of a finalizer), but could you explain how the clean-up is done in this case? Will the ArrayPool eventually assume the array isn't being returned and stop tracking it? Figured this out (memory is only added back on Return).

We have a few other cases where we do use finalizers for non-ArrayPool.Return reasons (although we are currently looking to remove them as a result of this investigation). Another example where the finalizer is there for safety is a texture upload which handles unmanaged (ffmpeg) memory handles.

Is it safe to say that your advice here is that finalizers should be avoided at all costs, even when SuppressFinalize is correctly used? Is this referenced somewhere in documentation or usage guidelines that I have somehow missed? If not, it might be a worthwhile addition to warn others who see finalizers as a method of running clean-up logic and expect negligible overhead with correct (seldom-invoked) usage.

jkotas (Member) commented Mar 2, 2021

The pooled arrays are just regular arrays. Once they become unreachable, they will get collected just like any regular array. Of course, you do not want to depend on the pooled arrays getting collected all the time, since that would effectively disable pooling. It is fine to depend on the GC to collect a pooled array in rare circumstances.
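
A rough sketch of that approach (illustrative, not actual core-library code): rent and return around the work with try/finally and no finalizer; if a rare exception path ever skips the Return, the array is simply reclaimed by the GC like any other object.

using System;
using System.Buffers;

static class UploadHelper
{
    // Sketch: no finalizer involved. If an exceptional path ever skips Return,
    // the rented array is just a regular object and the GC reclaims it eventually.
    public static void Upload(ReadOnlySpan<byte> source, Action<byte[], int> consume)
    {
        byte[] buffer = ArrayPool<byte>.Shared.Rent(source.Length);
        try
        {
            source.CopyTo(buffer);
            consume(buffer, source.Length);
        }
        finally
        {
            ArrayPool<byte>.Shared.Return(buffer);
        }
    }
}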

Finalizers make sense on types that hold unmanaged resources if you would like to make those types robust against incorrect use. For public types, the .NET design guidelines recommend using SafeHandles for unmanaged resource lifetime management. SafeHandles take care of finalization if the resource is not disposed properly, and they also protect against race conditions between use and disposal.
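
A minimal sketch of that guidance (hypothetical native function; the actual signatures depend on the library being wrapped): the SafeHandle subclass owns the unmanaged handle, and the wrapping type only needs IDisposable with no finalizer of its own.

using System;
using System.Runtime.InteropServices;
using Microsoft.Win32.SafeHandles;

// Hypothetical sketch. "NativeMethods.FreeBuffer" stands in for whatever release
// function the wrapped native library actually exposes.
internal sealed class NativeBufferHandle : SafeHandleZeroOrMinusOneIsInvalid
{
    public NativeBufferHandle() : base(ownsHandle: true) { }

    protected override bool ReleaseHandle()
    {
        // Runs on disposal or on finalization of the SafeHandle, whichever comes first.
        return NativeMethods.FreeBuffer(handle);
    }
}

internal static class NativeMethods
{
    // Placeholder for the real P/Invoke into the native library.
    [DllImport("nativelib")]
    public static extern bool FreeBuffer(IntPtr buffer);
}

public sealed class NativeBuffer : IDisposable
{
    private readonly NativeBufferHandle handle;

    internal NativeBuffer(NativeBufferHandle handle) => this.handle = handle;

    // No finalizer here: the SafeHandle's critical finalizer covers the leak case.
    public void Dispose() => handle.Dispose();
}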

En3Tho (Contributor) commented Mar 2, 2021

Having worked with wincrypt and other unmanaged apis, my 2 cents is that you should only use finalizers when it's absolutely necessary.

For example, when an array is not returned to the pool there is no risk of the application blowing up, only a risk of a slowdown, and that can be detected by some kind of script runner collecting GC/FPS/other info in your build system. But when unmanaged resources are not released, it can lead to memory leaks, dangling handles, and unexpected failures. Those are much harder to detect and debug afterwards. And on top of that, I understand gamedev imposes very harsh requirements on latency and execution speed.

I would recommend using some kind of ref struct wrapper plus a few self-made analyzers to detect proper usage.
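
One possible shape of that suggestion, as an illustrative sketch (names are made up): a ref struct that rents on construction and returns on Dispose, so it can only live on the stack and never reaches the finalizer queue at all; a using statement (or an analyzer) keeps usage correct.

using System;
using System.Buffers;

// Sketch of the ref struct wrapper idea: stack-only, no finalizer,
// with `using` (or analyzers) enforcing that Dispose is actually called.
public ref struct RentedBuffer
{
    private byte[] array;

    public RentedBuffer(int minimumLength)
    {
        array = ArrayPool<byte>.Shared.Rent(minimumLength);
    }

    public Span<byte> Span => array;

    public void Dispose()
    {
        if (array != null)
        {
            ArrayPool<byte>.Shared.Return(array);
            array = null;
        }
    }
}

// Usage: the buffer cannot escape to the heap, and `using` guarantees the return.
// using (var buffer = new RentedBuffer(1024))
// {
//     FillTextureData(buffer.Span);
// }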

smoogipoo commented Mar 2, 2021

Hi, I'm working on the same project referenced in the OP.

One case that we've recently discovered is WeakReference<T>. We're using ImageSharp's memory allocator to allocate buffers, in the hope of keeping texture-related data in the same pool.
As it turns out, ImageSharp's ArrayPoolMemoryAllocator allocates Buffer<T>s which allocate WeakReference<T>s, which have a finalizer and trigger the same case as in the OP.

This was being done for graphing runtime work diagnostics, and thus we'd like to keep this a real-time/per-frame process.

The biggest problem we're having is that there's very little visibility into this, besides the raw number of objects allocated. We can get close by using ETW events and inspecting the types to see if they have finalizers, but that's by no means an exhaustive solution considering GCAllocationTick only fires every ~100KB (this is our current path). We could write a compile-time analyzer, but that's also pretty non-trivial if we have to inspect other assemblies and .NET internals like this. Finally, we could have an automated test harness inspect GC pause times for weirdness across a bunch of scenarios, but that's tricky to do for various reasons, though it is a long-term goal for general regression tests.

For example, if PerfView provided a count of finalizable objects at GC time, it would help track these issues down. If GCFinalizersEnd provided the type name (or just the last type name, as per the other ETW events), that might be even better, since we could analyse logs received from users and see if this is the cause. Those are just a few ideas that would help, if this overhead is otherwise unavoidable.
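
In the meantime, one rough stopgap we could script ourselves (an illustrative sketch, not a replacement for the event data asked for above) is to reflect over loaded assemblies and list every type that declares its own finalizer, so hot allocation paths can be cross-checked against that list:

using System;
using System.Linq;
using System.Reflection;

static class FinalizerAudit
{
    // Lists types that declare a finalizer themselves (DeclaredOnly skips the
    // inherited Object.Finalize), as a rough way to know which allocations to watch.
    public static void DumpFinalizableTypes()
    {
        foreach (Assembly assembly in AppDomain.CurrentDomain.GetAssemblies())
        {
            Type[] types;

            try
            {
                types = assembly.GetTypes();
            }
            catch (ReflectionTypeLoadException e)
            {
                types = e.Types.Where(t => t != null).ToArray();
            }

            foreach (Type type in types)
            {
                MethodInfo finalizer = type.GetMethod(
                    "Finalize",
                    BindingFlags.NonPublic | BindingFlags.Instance | BindingFlags.DeclaredOnly);

                if (finalizer != null)
                    Console.WriteLine($"{assembly.GetName().Name}: {type.FullName}");
            }
        }
    }
}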

@mangod9 mangod9 removed the untriaged New issue has not been triaged by the area owner label Mar 2, 2021
@mangod9 mangod9 added this to the Future milestone Mar 2, 2021
Maoni0 (Member) commented Mar 3, 2021

hi @peppy, thanks for your report. a comment -

I have read through .NET memory performance analysis documentation, which does mention that finalizers should be avoided, but also that calling GC.SuppressFinalize should recover all but the allocation overhead.

so that's not exactly what I said in the doc 😃 this is what I said in the Finalizers section -

However, if you suppress the finalizer with GC.SuppressFinalize, what you tell the GC is you don't need to run the finalizer of this object. So GC would have no reason to promote it. It will be reclaimed when GC finds it dead.

it just says GC will not need to promote it. it doesn't say GC would not need to scan it, and this text is about scanning -

When a GC happens, it will discover the objects that are still live and promote them. Then it will check with objects on the finalize queue to see if they are promoted - if an object is not promoted it means it's dead, even though it can't be reclaimed (see why in the next paragraph). If you have tons of finalizable objects in the generations being collected, this cost alone could be noticeable.

having said that, I completely agree we should make this more diagnosable. I did talk about the internals of finalization in this blog post, where the method names related to scanning are mentioned - you could look those up before we provide an easier solution. we can provide this time via events like we already do in the mark phase for other types of scanning (stack, handle table, etc). would that be sufficient for you?

hi @smoogipoo, right now we do log the type of the object that's being finalized in the FinalizeObject event if you have the Type keyword enabled on informational level. can you please try that and see if the types for those objects show up for you? of course for your specific case, you will not see these events for WeakReference, because for WeakReference we don't actually run their finalizers - the GC just takes a shortcut and frees the handle right away. what I mentioned above was we could fire an event that tells us how much time we spent in finalization-related scanning. that's another way. we could also include how many objects we scanned.

peppy (Author) commented Mar 3, 2021

@Maoni0 Thanks for your reply; glad you found this issue (and thanks for your guide, it is a great read). Also, apparently I am blind since I missed that specific part about scanning. I think the structure of the document may have played a part in that – the "Finalizer" section header is a weird bullet point and is also smaller than the sub-headers following it, making them look like a new section.

Maoni0 (Member) commented Mar 3, 2021

@peppy I agree - I really struggled with the formatting since markdown offers pretty primitive options. I tried to make the line that says Finalizer part be more prominent by adding some kind of symbol but then it wouldn't let me make it a link sigh :) maybe I'll get one of the folks on my team who's better at this stuff to help me with formatting this better.

@smoogipoo

@Maoni0 Thanks for letting me know about the FinalizeObject event, it's already come in handy.

we could fire an event that tells us how much time we spent in the finalization related scanning

This would definitely help, as we're still trying to determine whether all of the remaining overhead is related to finalizers or not - they're just the biggest clue we have right now. Part of it is that we don't have full coverage due to WeakReference not being output, as you mentioned, which is unfortunate given that one of the core parts of the project is data bindings via WeakReference. Having the time spent in finalization scanning would tell us if we're being led astray by this.
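
For context, our bindings look roughly like this (heavily simplified, not the real osu!framework types): each binding tracks its peers through WeakReference<T>, so unbinding isn't required for collection, but every bind allocates a finalizable object.

using System;
using System.Collections.Generic;

// Simplified sketch of a weak data-binding. Each bound peer is tracked weakly so
// that discarding a Bindable does not require explicit unbinding, at the cost of
// one finalizable WeakReference per bind.
public class Bindable<T>
{
    private readonly List<WeakReference<Bindable<T>>> bindings = new();
    private T currentValue;

    public T Value
    {
        get => currentValue;
        set
        {
            currentValue = value;

            foreach (var weak in bindings)
            {
                // Propagate only to peers that are still alive and not already up to date.
                if (weak.TryGetTarget(out var other) && !Equals(other.currentValue, value))
                    other.Value = value;
            }
        }
    }

    public void BindTo(Bindable<T> other)
    {
        bindings.Add(new WeakReference<Bindable<T>>(other));
        other.bindings.Add(new WeakReference<Bindable<T>>(this));
        currentValue = other.currentValue;
    }
}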

Maoni0 (Member) commented Mar 6, 2021

@smoogipoo, @ivdiazsa and @cshung will be working on adding this info to the GC events.

smoogipoo commented May 13, 2021

@Maoni0 Just coming back to this issue, I've tried quite a few things to get an idea of what's going wrong.

One thing that I found was your blog post which mentioned using PerfView: https://devblogs.microsoft.com/dotnet/you-should-never-see-this-callstack-in-production/
Unfortunately I can't seem to get this to work with our game (osu), I'm running:

.\PerfView64.exe EnableKernelStacks

.\PerfView64.exe /nogui /accepteula /BufferSizeMB:4096 /CircularMB:2000 /CollectMultiple:3 /StopOnGcOverMsec:10 /DelayAfterTriggerSec:0 /KernelEvents:Memory,VirtualAlloc,Default /Process:"osu!" /OnlyProviders:"ClrPrivate:1:5,Clr:0x40000001:5" /NoRundown /NoNGenRundown /NoClrRundown /Merge:true /Zip:true collect

The most I can drill down to is coreclr!JIT_New, but I assume there's some blocking GC operation inside it that doesn't show up:
image

I suspect this may be because we're getting 97% broken stacks, with the tool suggesting to either:

  • NGEN the exe
    • This doesn't exist for CoreCLR does it? I've tried to publish R2R with no improvement.
  • Run with x86
    • Reduces broken stacks to 22% (also running with non-x64 PerfView).
    • It seems like we don't get the same GC spikes of 28+ms Gen0 - all GC pause times are around 4ms which is admittedly still quite high for us but more acceptable.
    • It still doesn't give any GC stacks:
      image
  • Run with Windows 8
    • I haven't tried this yet, is this a valid solution if I'm on Windows 10?

However, PerfView does give GC stats when run as above on just the game framework itself (osu.framework), even with 97% broken stacks and without R2R:
image
Note that the above stacks are useless in this context, because I've induced this by intentionally allocating, and it doesn't represent the real-world case we're seeing with the game itself.
There's nothing different between the runtime configuration of the game and the game framework, so I don't know why this difference exists.

Last time I tried it, dotnet-trace doesn't give native callstacks so it can't be used to dig into GC internals like this.

Am I missing something with PerfView? Is there an issue that I can report somewhere?

Maoni0 (Member) commented May 13, 2021

what are you trying to do exactly? if you want to see the CPU samples while a GC is happening broken stacks shouldn't matter since you just care about the native part of the stacks. if perfview doesn't show the native stacks maybe @brianrob could take a look but he's OOF this week. also if possible, it would help if you could share the trace.

smoogipoo commented May 13, 2021

what are you trying to do exactly?

The primary question I'm trying to answer is why there's a large spike in Gen1 time in very similar application states. If I can answer this then I can start digging into the GC myself and check for reasons why that path may be getting hotter:
[screenshots: Gen1 pause time spikes across similar application states]

... Or why it's even hot in the first place (see below). I can't repro these results in an isolated context no matter how many Gen0/1 allocations I do, so I don't think it's related to the small 10MB/s being allocated.

Additionally, I've had a suspicion for some time that it's due to our extensive use of WeakReferences, because finalizables are promoted directly into Gen1 and we've previously discovered finalizables as a cause of issues (since mostly removed). There are two remaining hot places where WeakReferences are used:

  1. Deep data bindings between objects (I can theorise why this would cause problems with finding GC roots).
  2. Some unoptimised logic that currently uses ImageSharp array pools (which use WeakReference internally) and has relatively high churn rate.

This is all theoretical at the moment because I don't know where to look without the native stacks.

if you want to see the CPU samples while a GC is happening broken stacks shouldn't matter

Right. So then the issue is just that I'm not getting GC native stacks. Here's the traces I used for this post: PerfViewData.zip

Maoni0 (Member) commented May 13, 2021

puzzling... I'm looking at your perfviewdata.etl.zip and I can see the callstacks just fine - this is showing the stacks for GC#158, a gen1 GC that took 24.5ms -

image

in general your gen1 GCs are taking longer because they simply survived a lot more. it's interesting that find_first_object is taking a significant amount of time. another puzzling thing with the trace is I don't see a FileVersion event for the coreclr.dll you are using in the osu! process, could you please let me know which version you are using? the trace shows that you are using this version of coreclr.dll in one of the dotnet processes (pid: 7776) -

Event Name: KernelTraceControl/ImageID/FileVersion
Time MSec: 127.470
Process Name: dotnet (7776)
Rest: ThreadID="-1" ProcessorNumber="4" ImageSize="5,255,168" TimeDateStamp="1,607,740,589" BuildTime="12/11/2020 6:36:29 PM" OrigFileName="CoreCLR.dll" FileDescription="Microsoft .NET Runtime" FileVersion="5,0,220,61120 @Commit: cb5f173" BinFileVersion="5.0.220.61120" VerLanguage="1033" ProductName="Microsoft® .NET" CompanyName="Microsoft Corporation" ProductVersion="5,0,220,61120 @Commit: cb5f173" FileId="" ProgramId=""

of course you could be using a different version in the osu! process. we did do some improvements in 5.0 that help with the perf of find_first_object, and we could do more, but the biggest reason is, as I pointed out above, that the GC simply needed to do more work.

to make progress on this it would be very useful to step back a bit and help me understand what your perf goals are - do you care mostly about the GCs that are >10ms which really means your gen1 GCs, or do you care about the long gen1 GCs? also another question is would it be possible to run your workload on the current 6.0 build? that would allow us to do experiments a lot more easily if we need to.

peppy (Author) commented May 14, 2021

to make progress on this it would be very useful to step back a bit and help me understand what your perf goals are - do you care mostly about the GCs that are >10ms which really means your gen1 GCs, or do you care about the long gen1 GCs?

Taking a look from a high level:

  • We have internal goals of (as close to) zero allocs during a normal gameplay frame loop. Large allocations generally occur during loading sequences or "break" time (for incremental loading).
  • Throughout the game we have no blocking load screens, so we hope to place as little load on the GC as possible to avoid user-identifiable "stutter" frames across the complete application lifetime. That is what we are attempting to address in this issue. With users running 60-240Hz displays, this can mean a single blocking GC longer than 4-16ms is noticeable.

So I guess you could say we care about any single GC operation that runs longer than 4ms, optimally. Not specifically gen1, but any blocking GC calls.

also another question is would it be possible to run your workload on the current 6.0 build? that would allow us to do experiments a lot more easily if we need to.

Yes, we can run on 6.0 releases without issue.

smoogipoo commented May 14, 2021

@Maoni0 It looks like the reason I wasn't seeing the GC stacks is an issue on my side - PerfView automatically populates the "GroupPats" textbox which filters them out.

I'm now able to see the GC stacks after clearing that textbox :)

could you please let me know which version you are using

Looks to match the event you have there:
[screenshot: coreclr.dll FileVersion event]

also another question is would it be possible to run your workload on the current 6.0 build

I've done some more profiling. As requested, I've done this also for net6.0. This time I have two versions:

  • A normal version, taken similarly to the above.
  • A "+mania-pooling" version which reduces the overall gen2 size and potentially some allocations throughout the process.

net5.zip
net5+mania-pooling.zip
net6.zip
net6+mania-pooling.zip

All captures were taken with roughly the same reproduction steps, so they should be comparable. There doesn't seem to be much of a difference between net5 and net6 as far as I'm seeing, but I'll defer to you on that.

I've written a small test program which allocates ~10MB/sec with varying numbers of finalizable objects and posted the results here: https://gist.github.com/smoogipoo/4c44af65bbf6fb4cbaea3fae29bef504
It seems like with 10MB/sec of plain old objects, you're very hard pressed to get below 16ms GC time. As you add finalizable objects, the gen1 time starts to get spiky and shows large variance beyond around ~1MB/sec of finalizable objects (n=10), even with very low survival rates.
So it seems like the path forward for us is to make sure no finalizable objects are churned through and that allocations are kept very low in general; around 3MB/sec looks to be the maximum we can sustain while meeting our performance goal of 4ms GCs.
Does that sound about right, or am I completely off the mark?
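
For reference, the harness is along these lines (a rough sketch, not the exact code in the gist): allocate a steady ~10MB/sec of plain objects plus a configurable number of finalizable objects per batch, and observe GC pause times externally (e.g. with dotnet-trace or PerfView).

using System;
using System.Threading;

class AllocationHarness
{
    private const int ObjectSize = 1024;
    private const int BytesPerSecond = 10 * 1024 * 1024;
    private const int BatchesPerSecond = 60;

    class Plain { public byte[] Payload = new byte[ObjectSize]; }
    class Finalizable { public byte[] Payload = new byte[ObjectSize]; ~Finalizable() { } }

    static void Main(string[] args)
    {
        int finalizablePerBatch = args.Length > 0 ? int.Parse(args[0]) : 0;
        int plainPerBatch = BytesPerSecond / ObjectSize / BatchesPerSecond;

        while (true)
        {
            // One "frame" worth of allocations; almost nothing survives the batch.
            for (int i = 0; i < plainPerBatch; i++)
                GC.KeepAlive(new Plain());

            for (int i = 0; i < finalizablePerBatch; i++)
                GC.KeepAlive(new Finalizable());

            Thread.Sleep(1000 / BatchesPerSecond);
        }
    }
}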

Maoni0 (Member) commented May 14, 2021

it looks like you have concurrent GC disabled, is that the case?

smoogipoo commented May 14, 2021

I thought that concurrent GC was turned on by default (that's what this doc says: https://docs.microsoft.com/en-us/dotnet/core/run-time-config/garbage-collector). Is that not the case anymore?

We don't turn it off, the only setting we change is GCSettings.LatencyMode = GCLatencyMode.SustainedLowLatency;.

I've forced <ConcurrentGarbageCollection>true</ConcurrentGarbageCollection> in our exe csproj, does this look more like what you expect?: with_concurrent.zip

Edit: As for the latency mode, we're currently discussing it, because it looks like LowLatency may be better suited for us during important sections. That said, we see practically no difference between SustainedLowLatency and Interactive, and the setting is mostly historical.

Maoni0 (Member) commented May 14, 2021

it is on by default. but somehow it's turned off for you. the gen0 budget is supposed to be 6MB max for workstation GC, but yours is 32MB. do you have any GC configs set? can you take a look under the debugger whether wks::gc_heap::gc_can_use_concurrent is true or false?

smoogipoo commented May 14, 2021

It looks to be turned on:
image
No GC configs set.

If it makes any difference, I'm running this under a VMWare VM:

OS Name	Microsoft Windows 10 Home
Version	10.0.19041 Build 19041
Other OS Description 	Not Available
OS Manufacturer	Microsoft Corporation
System Name	DESKTOP-4EBGNSC
System Manufacturer	VMware, Inc.
System Model	VMware7,1
System Type	x64-based PC
System SKU	
Processor	AMD Ryzen 9 3950X 16-Core Processor, 3700 Mhz, 8 Core(s), 8 Logical Processor(s)
BIOS Version/Date	VMware, Inc. VMW71.00V.16722896.B64.2008100651, 8/10/2020
SMBIOS Version	2.7
Embedded Controller Version	255.255
BIOS Mode	UEFI
BaseBoard Manufacturer	Intel Corporation
BaseBoard Product	440BX Desktop Reference Platform
BaseBoard Version	None
Platform Role	Desktop
Secure Boot State	Off
PCR7 Configuration	Binding Not Possible
Windows Directory	C:\Windows
System Directory	C:\Windows\system32
Boot Device	\Device\HarddiskVolume1
Locale	United States
Hardware Abstraction Layer	Version = "10.0.19041.906"
User Name	DESKTOP-4EBGNSC\smgi
Time Zone	Tokyo Standard Time
Installed Physical Memory (RAM)	8.00 GB
Total Physical Memory	8.00 GB
Available Physical Memory	4.04 GB
Total Virtual Memory	9.25 GB
Available Virtual Memory	4.79 GB
Page File Space	1.25 GB
Page File	C:\pagefile.sys
Kernel DMA Protection	Off
Virtualization-based security	Not enabled
Device Encryption Support	Reasons for failed automatic device encryption: TPM is not usable, PCR7 binding is not supported, Hardware Security Test Interface failed and device is not Modern Standby, Un-allowed DMA capable bus/device(s) detected, TPM is not usable
A hypervisor has been detected. Features required for Hyper-V will not be displayed.	

Host:

Kernel: x86_64 Linux 5.10.30-1-MANJARO

Maoni0 (Member) commented May 14, 2021

my bad - I'm doing this

    gen0_max_size = max (gen0_min_size, gen0_max_size);

I mistakenly thought I was capping this the other way (max to min). usually folks run client apps on machines with small cache sizes, so that means your gen0_min_size is larger than 6MB. one thing you could do is to limit the gen0 max budget with the COMPlus_GCGen0MaxBudget env var (note that it reads the value as a hex number, so if you specify it as 600000 it would be 6MB). would you mind trying this and capturing another trace? it's not gonna help with gen1 much (if at all), but it'd be useful to see the result of that before digging more into gen1s.

smoogipoo commented May 17, 2021

@Maoni0: Sorry for the wait, it looks much better with 6MB max Gen0. Here's a log (GDrive link because GitHub seems to not like the 28MB zip): https://drive.google.com/file/d/1CcGcbHCD4a4Rx727InxzFnHbXkmTiBY1/view?usp=sharing

I did some frame time analysis with different budgets:
image
It definitely looks like 6MB is where we'd like to be.

I'm not sure if this is related, but I noticed that up to 32MB Gen0 is allocated from the very start of the application during which only some EFCore population + loading and JIT compilation is taking place (checked via dotMemory/dotTrace):
image
Edit: Slightly incorrect - there seems to also be a few other processes using the TPL and larger reflection logic.

Maoni0 (Member) commented May 17, 2021

It definitely looks like 6MB is where we'd like to be.

that's great to hear! I do have a couple of other observations -

  • your trace doesn't contain any gen2 GCs so I can't tell how big the gen2 budget is - I can see that gen2's free list is basically already used up (by the end it's 9% and I'm sure most of the space on the free list would be quite small so not very useful to fit gen1 survivors). this in itself is not necessarily a bad thing - if you have memory GC could choose to use it for a client app we'd like to be not so aggressive about using memory. however, client apps also have their own perf heuristics that differ from server which is they may not need to stay up for that long. I see your trace is about a min long. how long do you think your process usually stays up for? if it's much longer then you may want to capture a GCCollectOnly trace (which is very lightweight and can be kept on all the time) for the whole duration so you don't get surprising behavior later. the trace you shared last time was not merged so I can't look at the CPU samples anyway.

  • it'd be interesting to know what gets promoted to gen2 'cause that becomes your long lived data. you could use the generational aware analysis we introduced in .NET 5 (explained here). you don't need to do the Analyzing part the same way - it uses our perf infra, but you already have a trace so you could just follow the Capturing part and specify bytes/index based on what you see in GCStats: essentially you want to specify the promoted bytes of a gen1 GC (which is a lot larger than in gen0 GCs) as COMPLUS_GCGenAnalysisBytes, and an index that's large enough to get into steady state as COMPLUS_GCGenAnalysisIndex. then you can open it up in PerfView like it says.

smoogipoo commented Jun 1, 2021

I've attempted to do the gen aware analysis, but I've run into a few problems. I also tested with the blog post's code and here's what I found:

  1. Is this supposed to work on linux? I've tested with net5.0 and net6.0 and the .nettrace files don't get generated at all.
  2. (Testing on Windows now) I've had to change the analysis index from 3E8 to 64 as it wouldn't complete the analysis.
  3. I don't see the "walkable objects" listing in PerfView:
    image
    How do I drill down into the heap snapshot item? Here's the trace from the screenshot: gcgenaware.nettrace.zip

I see your trace is about a min long. how long do you think your process usually stays up for? if it's much longer then you may want to capture a GCCollectOnly trace (which is very lightweight and can be kept on all the time) for the whole duration so you don't get surprising behavior later

This is quite hard to answer because the application has many stages. It's a game so the time that any individual user keeps it open is variable, but generally not longer than 48hrs.
However there's many parts to the game such as menus and overlays, which aren't "gameplay" but in which users could still spend a lot of time, equal to that and maybe exceeding the time spent in gameplay.
Gameplay itself generally lasts anywhere from 30 seconds to 10 minutes, with exceptions. Users can (and generally do) jump in and out of gameplay sessions very often, even so far as to spend as little as 5 seconds in a session at any one time. All of this without closing the application itself.

Stutters in menus and overlays are not desirable but understandable. For example, when loading online data to display in an overlay, it's understandable that the GC would collect/promote more.
For this reason, the traces above were captured only during gameplay which is the single area that we require 0 stutters and where we're trying really hard to remove all allocations in an effort to achieve this.

Regardless, here's a few GCCollectOnly traces I captured:

  • trace.nettrace.zip: 6 minutes of gameplay only
  • trace.nettrace.zip: A bit longer, including multiple gameplay sessions and menus, with the last 6 minutes of gameplay as per the trace above. Gameplay is slightly different though, so it may not be directly comparable.
  • trace.nettrace.zip: Multiple gameplay sessions and menus as per the above, except with a 6MB gen0 limit. The last 6 minutes match the above.

Not sure what I should be looking at here to move forward.

Maoni0 (Member) commented Jun 1, 2021

@cshung could you please help @smoogipoo?

cshung (Member) commented Jun 8, 2021

@smoogipoo I think I figured out what is going on with the generational aware analysis. It looks like the view is guarded by the AppLog.InternalUsers check here.

https://github.com/microsoft/perfview/blob/06942acdd711b5156b671f5f0528d437c748e865/src/PerfView/PerfViewData.cs#L8247-L8277

And the InternalUser flag itself is determined by whether a machine named clrMain exists

https://github.com/microsoft/perfview/blob/06942acdd711b5156b671f5f0528d437c748e865/src/PerfView/App.cs#L1184-L1195

For the short term, we should be able to work around that limitation. A cleaner approach is to build your own PerfView with that check removed. If building is too much work, we might be able to fool PerfView by adding clrMain to the hosts file so that it believes that machine exists. Either way would work.

Longer term, @brianrob, do we really need to guard those views under AppLog.InternalUsers? Can we just expose them? The next release of PerfView will work as expected.

brianrob (Member) commented Jun 8, 2021

@cshung, no I don't think that these views need to be under this check. Feel free to post a PR to remove the check.

smoogipoo commented Jun 9, 2021

First of all, thanks everyone for your continued interest and support in this issue. I have compiled PerfView myself and looked into the gen-aware analysis. This is going to be a long one - I have some promising and instructive results but still not 100% sure what the path forward is.

(note: In all of the following, the Gen1 with large Pause Time is the one that was gen-aware analysed)

Data

Traces: https://drive.google.com/file/d/1mJUfpnmxswdejP5HaYpsPPnpRDITyVQ9/view?usp=sharing
(contains ~500MB of traces analysed below)

Relevant osu-framework branch: https://github.com/smoogipoo/osu-framework/tree/custom-vbo-array-pool
Relevant osu branch: https://github.com/smoogipoo/osu/tree/gc-debug

Initial gen-aware analysis

First I did a gc-gen aware analysis on our master/mainline branches:
Screenshot_2021-06-09_18-28-31
Screenshot_2021-06-09_18-29-05

$env:COMPLUS_GCGenAnalysisIndex="14"
$env:COMPLUS_GCGenAnalysisBytes="7A1200"

(data: mainline)

This makes sense; the top two entries are related to QuadVertexBuffer, which I'd previously mentioned in this thread is a suboptimal piece of our code that's using ImageSharp's pools. Optimising this has been on our mind for quite a long time, so I thought to start there.

Perhaps instructive: Note the difference in magnitudes of the Inc Ct. column at this point.

Replacing ImageSharp pools with unmanaged memory

Initial replacement of the ImageSharp pools with .NET pools didn't lead to a noteworthy improvement, so I took it a step further and converted this particular code to use unmanaged memory that's completely hidden away from the GC (apart from add/remove memory pressure, which my experience tells me is good practice when using Marshal.AllocHGlobal() like this; a rough sketch of the pattern follows below).
Screenshot_2021-06-09_18-34-01
Screenshot_2021-06-09_18-34-42

$env:COMPLUS_GCGenAnalysisIndex="14"
$env:COMPLUS_GCGenAnalysisBytes="5B8D80"

(data: mainline+unmanaged-vbo)

Nothing much seems to have changed here. We're promoting much less now (less than half of mainline), but still have very high gen1 numbers.

I also tested with not applying memory pressure, which didn't lead to any Gen1 difference other than induced Gen2s not occurring (which seem to be costly here) (data: mainline+unmanaged-vbo-no-memory-pressure).
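
For reference, the unmanaged-memory approach is roughly the following (an illustrative sketch, not the actual osu-framework code): allocate with Marshal.AllocHGlobal, report the unmanaged size with GC.AddMemoryPressure, and undo both on disposal.

using System;
using System.Runtime.InteropServices;

// Illustrative sketch of unmanaged buffer storage that is invisible to the GC.
public sealed unsafe class UnmanagedBuffer : IDisposable
{
    private IntPtr pointer;
    private readonly int length;

    public UnmanagedBuffer(int length)
    {
        this.length = length;
        pointer = Marshal.AllocHGlobal(length);

        // Tell the GC this object keeps `length` bytes of native memory alive,
        // so it can factor that into its collection scheduling.
        GC.AddMemoryPressure(length);
    }

    public Span<byte> Span => new Span<byte>((void*)pointer, length);

    public void Dispose()
    {
        if (pointer != IntPtr.Zero)
        {
            Marshal.FreeHGlobal(pointer);
            GC.RemoveMemoryPressure(length);
            pointer = IntPtr.Zero;
        }
    }
}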

After reaching this dead end, I decided to look into the root System.Object[] which includes PoolableSkinnableSample.

Disabling PoolableSkinnableSample

It's not quite easy for us to "fix" PoolableSkinnableSample - in fact, I'm not sure what the fix there is yet. To move things along, I decided to test with it completely disabled.
Screenshot_2021-06-09_18-53-45
Screenshot_2021-06-09_18-54-14

$env:COMPLUS_GCGenAnalysisIndex="14"
$env:COMPLUS_GCGenAnalysisBytes="3D0900"

(data: mainline+no-poolable-samples)

This looks promising. The Gen1 times are getting very close to acceptable levels (keep in mind we're aiming for 4ms). And combining this change with the unmanaged arrays from above:
Screenshot_2021-06-09_19-03-44
Screenshot_2021-06-09_19-04-11

$env:COMPLUS_GCGenAnalysisIndex="14"
$env:COMPLUS_GCGenAnalysisBytes="1E8480"

(data: mainline+unmanaged-vbo-no-memory-pressure+no-poolable-samples)

There's still a few spikes - the above traces show Gen1s taking >8ms later in the snapshots, but this is definitely better.

Conclusion

This is where my analysis stops, anything more from here is extremely challenging and requires significant re-architecting of our code to test.

But the main takeaway from the above is that, if I'm understanding the data correctly, the count of objects promoted (Inc Ct. column) seems to be a better indicator than the amount of data promoted (Inc column).

Am I correct in thinking that Inc Ct. (and Exc Ct., for that matter) indicates the count of objects promoted? This is quite important, because it's very weird that almost 1700 PoolableSkinnableSample objects are promoted (see the initial analysis) - there should be two orders of magnitude fewer than that in Gen1 at any point in time. I'll have to look into this to see if it's possible from our side.
Also, if these are counts, does the count include objects alive outside of the walkable tree seen there? I.e. if there are already objects in Gen2, would they be included in the counts?

I'll definitely report back on our progress and with further questions and details - I'm not entirely sure how to fix even PoolableSkinnableSample right now, and the DrawNode[]/Drawable[] structures at the bottom of those traces are not insignificant (count-wise) and are probably the cause of the remaining issues, but they are even more challenging to test.

cshung (Member) commented Jun 9, 2021

These Action objects are curious:
image

I suspect somewhere in the code we have a delegate (which could look like a lambda) being assigned to a static field; the delegate targets an instance method, so the system also roots the object hosting that method as its closure. This is a typical way memory gets leaked.
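
In other words, roughly this shape (a hypothetical sketch, not code taken from the game):

using System;

// A static event holds a delegate whose target is an instance, so that instance -
// and everything it references - stays reachable for as long as the subscription exists.
public static class SampleEvents
{
    public static event Action SourceChanged;

    public static void RaiseSourceChanged() => SourceChanged?.Invoke();
}

public class PooledSample
{
    public void Subscribe()
    {
        // The delegate created here captures `this` as its target; while it remains
        // subscribed, this object is rooted through the static event's invocation list.
        SampleEvents.SourceChanged += OnSourceChanged;
    }

    private void OnSourceChanged()
    {
        // React to the change.
    }
}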

This is quite important because it's very weird that almost 1700 PoolableSkinnableSample objects are promoted (see the initial analysis) - there should only be 2 magnitudes less than that in Gen1 at any point in time

With your comment, I suspect the whole pool is leaked into gen2 through the above mechanism.

If you could take a crash dump, using !DumpHeap to find some random PoolableSkinnableSample objects and trace its !gcroots, you might be able to figure out what delegate that is.

We could just bet on a random dump, but if we want to increase the odds of finding it, here is the perfect time to capture that dump: the point shown below, where the gen-aware analysis captures the spiking GC. If we have a leaking delegate, the delegate is set right before this GC.

if (gcGenAnalysisState == GcGenAnalysisState::Enabled)
{
#ifndef GEN_ANALYSIS_STRESS
    if ((condemnedGeneration == gcGenAnalysisGen) && (promoted_bytes > (uint64_t)gcGenAnalysisBytes) && (gcIndex > (uint64_t)gcGenAnalysisIndex))
#endif
    {
        EventPipeAdapter::ResumeSession(gcGenAnalysisEventPipeSession);
        FireEtwGenAwareBegin((int)gcIndex, GetClrInstanceId());
        s_forcedGCInProgress = true;
        GCProfileWalkHeap(true);
        s_forcedGCInProgress = false;
        reportGenerationBounds();
        FireEtwGenAwareEnd((int)gcIndex, GetClrInstanceId());
        EventPipeAdapter::PauseSession(gcGenAnalysisEventPipeSession);
        gcGenAnalysisState = GcGenAnalysisState::Done;
        EnableFinalization(true);
    }
}

smoogipoo commented Jun 10, 2021

You're right, this is just an event handler local to a parenting class.

These PoolableSkinnableSample objects are absolutely being leaked into gen2, but I can guarantee that we're not leaking them in the traditional sense as we regularly go in with tools like JetBrains dotMemory to check the state of the application/check for anything that's live longer than intended. I've verified this again by taking a snapshot around the same time as the traces above:

image
image
81 total objects, with 6 of them in gen1. The pools also only contain these objects at most, not more. Internal logging that we have set up confirms these counts.

The event handler is this one: https://github.com/ppy/osu/blob/2b4649a3ea2a919f8869a2e66854448b05780d01/osu.Game/Skinning/SkinProvidingContainer.cs#L22

Here's a simplified form of what's going on:

using System;
using System.Collections.Generic;
using System.Linq;

namespace Demo
{
    class Program
    {
        static void Main(string[] args)
        {
            var samplePool = new DrawablePool<PoolableSkinnableSample>(0);
            var skinProvidingContainer = new SkinProvidingContainer();

            // We have a recursive data structure of Drawable objects.
            var hierarchy = new Drawable
            {
                Children =
                {
                    new Drawable
                    {
                        Children =
                        {
                            //... This continues on nesting with different Drawable types for a while, and at some point:
                            skinProvidingContainer,
                            samplePool,
                        }
                    }
                }
            };

            // Some time passes...
            // Note: At this point you can assume that everything above here is guaranteed to be in Gen2.

            // Objects are retrieved from the pool and added to the hierarchy.
            skinProvidingContainer.Children.Add(samplePool.Get());
            skinProvidingContainer.Children.Add(samplePool.Get());

            // Within a few seconds afterwards, they're removed from the hierarchy and returned to the pool.
            // Note: event subscriptions aren't removed here. This is intentional.
            //   We have methods on get/return where we could move the subscriptions/unsubscriptions to. However, since there are very few objects in total, we think it's better to
            //   forego this (and any additional handling required as a result, such as the event being raised before it is re-bound) for simplicity.
            var children = skinProvidingContainer.Children.OfType<PoolableSkinnableSample>().ToArray();
            skinProvidingContainer.Children.Clear();
            foreach (var c in children)
                samplePool.Return(c);

            // Note that there are two ways which the samples could have leaked into Gen2 here:
            //   1. Via the event handler subscription.
            //   2. Via them being returned to the pool.

            // Immediately, or within a few seconds afterwards, they're retrieved from the pool and added back to the hierarchy.
            skinProvidingContainer.Children.Add(samplePool.Get());
            skinProvidingContainer.Children.Add(samplePool.Get());

            // The above process repeats many times...

            // And much, much later on everything clears up and all events unbound/etc.
            // Note: This never occurs in the traces I provided. It's intentional as I'm not really interested this far out (only within the 3 minute windows provided).
            hierarchy.Dispose();
        }
    }

    public class Drawable : IDisposable
    {
        public readonly List<Drawable> Children = new List<Drawable>();

        public void Dispose()
        {
            GC.SuppressFinalize(this);
            Dispose(true);

            foreach (var child in Children)
                child.Dispose();
            Children.Clear();
        }

        protected virtual void Dispose(bool isDisposing)
        {
        }
    }

    public class DrawablePool<T> : Drawable
        where T : Drawable, new()
    {
        private readonly Stack<T> pool = new Stack<T>();

        public DrawablePool(int initialSize)
        {
            for (int i = 0; i < initialSize; i++)
                pool.Push(new T());
        }

        public T Get()
        {
            if (pool.TryPop(out var obj))
                return obj;

            return new T();
        }

        public void Return(T obj) => pool.Push(obj);

        protected override void Dispose(bool isDisposing)
        {
            base.Dispose(isDisposing);

            foreach (var pooledChild in pool)
                pooledChild.Dispose();
        }
    }

    public class SkinProvidingContainer : Drawable
    {
        public event Action SourceChanged;

        // ... Other logic in this class
    }

    public class PoolableSkinnableSample : Drawable
    {
        private SkinProvidingContainer parentSkinProvidingContainer;

        // Assume this is automatically called sometime after ctor() and definitely before Dispose(), via our dependency-injection mechanism.
        private void load(SkinProvidingContainer parentSkinProvidingContainer)
        {
            this.parentSkinProvidingContainer = parentSkinProvidingContainer;
            parentSkinProvidingContainer.SourceChanged += sourceChanged;
        }

        private void sourceChanged()
        {
            // Do something.
        }

        protected override void Dispose(bool isDisposing)
        {
            base.Dispose(isDisposing);

            if (parentSkinProvidingContainer != null)
                parentSkinProvidingContainer.SourceChanged -= sourceChanged;
        }
    }
}

I don't know how to improve this process, because there are two leaks into Gen2:

  1. Via the event handler.
    • This could be avoided.
  2. Via them being added to a Gen2 hierarchy.
    • This can't be avoided.

Options I'm seeing right now include:

  1. Reduce the size of these objects.
    • Very hard to do.
    • I'm unsure whether this would resolve the issue - these objects aren't the largest seen in that trace, and removal of the largest ones (the QuadVertexBuffers) didn't prove to be fruitful. If it's the count of objects that's more important, then this won't really achieve that much.
  2. Pool these objects more aggressively ahead of time.
    • This only really works around the problem - can't really assume perfect pooling in all scenarios and there's limits we want to apply anyway to not have the pools be too large on mobile platforms.
    • This is a pretty common scenario we have in general, where such objects can spontaneously get added to a Gen2 hierarchy. It sounds infeasible to have this never be a thing.

Realistically I don't think we can resolve this at our end with local optimisations without considerable effort or resorting to the LowLatency GC mode.

cshung (Member) commented Jun 14, 2021

@Maoni0 spotted that in the mainline+unmanaged-vbo-no-memory-pressure+no-poolable-sample case we have two interesting GCs (id = 24, id = 34) to compare against. The XML below is obtained through the Raw Data XML file (for debugging) link at the top of the GCStats page in PerfView; the attribute MarkStack ="0.199(56)" means 0.199 milliseconds was spent marking 56 bytes from the stack, and the other values are interpreted similarly.

id = 20, PauseTime = 2.382
<PerHeapHistory MarkStack ="0.199(56)" MarkFQ ="0.001(0)" MarkHandles ="0.005(0)" MarkOldGen ="0.801(588210)">

id = 24 PauseTime = 8.395
<PerHeapHistory MarkStack ="5.351(420)" MarkFQ ="0.001(0)" MarkHandles ="0.005(0)" MarkOldGen ="1.154(1233772)">

id = 34 PauseTime = 201.724
<PerHeapHistory MarkStack ="0.230(208)" MarkFQ ="0.000(0)" MarkHandles ="0.009(1136)" MarkOldGen ="2.476(3892528)">

id = 46 PauseTime = 5.463
<PerHeapHistory MarkStack ="0.190(0)" MarkFQ ="0.000(0)" MarkHandles ="0.008(568)" MarkOldGen ="2.092(2000712)">

GC 34 is an obvious outlier. It is true that we have about 3 times more objects and 2 times more bytes than in the previous GC, but that doesn't explain why we are pausing for 30x the time.

The rest look more normal, although we see there can be an occasional spike in MarkStack as well.

If we see cases like id = 34 happen consistently, we might want to drill deeper to see what happens there. Maybe we should consider taking CPU samples to figure out what happened exactly during that 200ms.

smoogipoo commented Jun 15, 2021

That pause (id = 34) is so long only because it's the one which gen aware analysis triggered on.

Maoni0 (Member) commented Jun 15, 2021

@smoogipoo what is your current goal? are you aiming to make the other gen1 GCs with higher than usual pause times shorter? or just all gen1 GCs shorter? right now you have at least 3 gen0 GCs for 1 gen1 GC. you could experiment with making the gen1 budget smaller (we don't expose this as a config right now) but I think you are already building the runtime right? so you could experiment with this. I don't generally advise folks to mess with this but if you are in a very controlled environment you can experiment (essentially the LowLatency mode is setting gen0/gen1 budgets very small but they are too small for you). in init_static_data you could set static_data_table[i][1].max_size to say 2mb.

smoogipoo commented Jun 17, 2021

what is your current goal?

The goal is for GC pauses to not take longer than 4ms, regardless of collected generation. The lower we can go without completely crippling the GC, the better. I've focused a lot on Gen1 because those collections are the outliers, but the Gen0 times we're seeing are also quite significant.

Ideally we'd like to have a completely time-bounded GC. For example, Unity recently released an "incremental GC" where you can set a time slice for each increment, however I understand that this is a huge feat and (probably?) not in the scope of .NET in the short or long-term future.

So instead, I want to understand why GC pauses are taking so long in the first place, and whether it's something we can improve on from our side:

  • Should we be doing even fewer allocations?
  • Should we be reducing individual object size?
  • Is there a disproportionate effect from a deeply nested object graph as opposed to a more linear list containing the same objects?
    • For example, describing a house as a nested relationship of objects, as opposed to simply just listing all objects without any relations. Is there a significant difference between these two models in terms of GC pause time?

As you've said and we've discovered previously, it does seem like the bulk of our issues come down to gen0/gen1 budgets. The problem is that this isn't something we can set at runtime, and as this is a user-facing game, we can't tell each and every user to adjust their GCGen0MaxBudget. Our only path forward would be to create a custom loader for our application, which feels a bit ugly.

The Java G1 GC has a MaxGCPauseMillis knob for a soft-target pause time. It sounds like the reverse concept to GCGen0MaxBudget but instead gives the GC the responsibility to get to that value however it wants. I assume the GC is a lot more flexible this way - our application goes through many different stages and it's possible (actually, definitely the case) that a pause time of 4ms or a max budget of 6MB is not optimal at all times.
Is this something that may be possible with the .NET GC?

but I think you are already building the runtime right?

I'm not currently building the runtime, but I'll look into it.

Maoni0 (Member) commented Jun 17, 2021

Unity recently released an "incremental GC" where you can set a time slice for each increment, however I understand that this is a huge feat and (probably?) not in the scope of .NET in the short or long-term future.

Unity's GC is non generational and non compacting. doing incremental on such a GC isn't that much work. our GC is both generational and compacting which makes doing incremental a much, much harder job (a mark and sweep only GC wouldn't be able to handle the kinds of workload we get on .NET). having said that, it doesn't mean we don't have a goal to do it. you are correct of course in saying that specifying a max budget is not as flexible as specifying a pause limit. our regions work in .NET 6 is to build a foundation for future work like providing a soft target pause time. G1 uses regions.

to understand the cost of a GC, I recommend reading this and this section in the mem-doc.

Our only path forward would be to create a custom loader for our application, which feels a bit ugly.

is there a reason why you can't use configs to specify these? (right now they are not specifiable with runtimeconfig, but we can make it so you can specify them that way - it's a trivial amount of work, and that's the standard way you add configs to .NET Core apps)

@smoogipoo

Great to hear progress is being made in this area! Using runtimeconfig sounds like a good solution.

Maoni0 (Member) commented Jun 17, 2021

@smoogipoo let me know if you hit problems building coreclr yourself. feel free to send me email if you need help (my email alias at work is maonis).

@smoogipoo

Just to confirm one last thing that's been on our minds - if we have 2 million objects in gen2, could this in any way affect the performance of promotions from gen0/gen1 to gen2 (via the leaking seen above), just due to the count of objects already in gen2 alone?

smoogipoo commented Jun 22, 2021

Here's more data as requested. These measurements are a little different from the above in terms of the state of the game and pressure on the GC, but the same results are exhibited. The first and last 5 seconds of the traces can be mostly discarded as load times.

Base: base.nettrace.zip
Limiting Gen0 to 6MB: gen0_6MB.nettrace.zip
Limiting Gen1 to 2MB: gen1_2MB.nettrace.zip
Limiting Gen0 to 6MB and Gen1 to 2MB: gen0_6MB_+_gen1_2MB.nettrace.zip

All above are with SustainedLowLatency.

Base with LowLatency: lowlatency.nettrace.zip

  • Limiting Gen1 by itself has very little/no effect.
  • Limiting Gen0 has the greatest effect.
  • Limiting Gen0 + Gen1 results in slightly smaller Gen1 pause times, at the expense of slightly longer Gen2 times as expected (but also much less frequent, so a worthwhile tradeoff). It looks to be fine in this scenario.
    • I say "in this scenario" because we recently tried to use LowLatency (trace provided above) and encountered large Gen2 pauses exceeding 100ms at times. This is just as you said, G0/G1 budgets are too small, and (I assume?) a lot more data is promoting to G2 that would have otherwise have died earlier.

Lastly, I also gave regions a shot, and you probably don't want to hear about it yet since it's still WIP, but I'll document my findings anyway:

  • I encountered a few segfaults/unexpected nullrefs. Let me know if this is something I should be reporting.
  • In the few tests I managed to run, I was still getting large G1 pauses, but I don't have any traces to show. Again, let me know if this is actually of interest at this point and I'll see if I can get a traceable repro.

cshung (Member) commented Jun 23, 2021

  • I encountered a few segfaults/unexpected nullrefs. Let me know if this is something I should be reporting.

This is definitely interesting, I would love to be able to reproduce these failures and get them fixed.

cshung (Member) commented Jun 24, 2021

  1. Is this supposed to work on Linux? I've tested with net5.0 and net6.0 and the .nettrace files don't get generated at all.

I finally figured out a potential reason why it doesn't work for you. On Linux, the environment variables are case-sensitive; the right casing is COMPlus instead of COMPLUS. On Windows, it doesn't matter.

Although my blog post is Windows-centric, I updated the environment variable part so that anyone following along who wants to experiment on Linux will copy the right casing.

smoogipoo commented Aug 25, 2021

Recently, I've noticed that we're getting different performance characteristics on Linux and Windows on identical hardware (AMD Ryzen 3950x).

I've run tests similar to the above in an even more isolated environment to attempt to reproduce the results. The most telling (to me) was the difference in frame times (how many milliseconds a game's render loop takes):

[screenshots: frame time comparisons for linux-master, linux-6mb, win-master, win-6mb]

As above, the win-6mb and linux-6mb variants are the result of COMPlus_GCGen0MaxBudget=600000. I mentioned in previous testing that this 6MB Gen0 limit is where we'd like to sit, and it looks to bring Linux performance in line with Windows (a good thing).

Here's the traces:
win-master: trace.nettrace.zip
win-6mb: trace.nettrace.zip
linux-master: trace.nettrace.zip
linux-6mb: trace.nettrace.zip

Collected with dotnet trace collect --profile gc-collect. I can re-run with other profiles/more info if needed, as I'm unsure of how to read the PerfView logs to find out the exact Gen0 size to confirm things.

Is this a potential issue with the GC tuning too aggressively on Linux?

This is definitely interesting, I would love to be able to reproduce these failures and get them fixed.

I took another look into this since I've been following the recent GC-related PRs, and can still repro it. Here's a script to reproduce with: https://gist.github.com/smoogipoo/1b87b3b518095199db0993133550abbe

It should crash within 30 seconds - I've had it both segfault and exception at multiple areas.
