Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Question] GC pause time is too large for 60 fps #65850

Open
gbalykov opened this issue Feb 24, 2022 · 20 comments
Open

[Question] GC pause time is too large for 60 fps #65850

gbalykov opened this issue Feb 24, 2022 · 20 comments

Comments

@gbalykov
Copy link
Member

We are investigating GC pause time in order to potentially achieve 60 fps, but we observe long GC EE suspend pauses.

For example, on next benchmark max GC pause on arm64(rpi4) can get as high as 55 ms with default GC setup (i.e. default latency mode, default max gen0/gen1 size). This is a huge pause, because 1 frame in 60 fps mode should fit in ~16 ms.

Some details:

  • GC pause time is computed as diff between GCRestartEEEnd and GCSuspendEEBegin events
  • SustainedLowLatency mode doesn't improve max GC pause time
  • LowLatency mode reduces max GC pause time, but max RSS increases signigicantly (on benchmark above ~10x increase in max RSS)
  • different values of COMPlus_GCGen0MaxBudget and COMPlus_GCGen1MaxBudget do not improve max GC pause time

Additionally, we've found the next measurements on coreclr/mono and results are pretty much the same: https://gist.github.com/jechter/2730225240163a806fcc15c44c5ac2d6.

Are there any plans to support GC mode with limited pause time?

using System;
using System.Threading;
using System.Collections.Generic;
using System.Diagnostics.Tracing;

namespace gctest2
{
    class Test
    {
        static void Main()
        {
            GCEventListener.Start();
            Test t = new Test();
            t.mGCTest();
        }

        struct Log
        {
            public String log;
        }

        private static char[] base_string =  "GC  TEST TEST TEST TEST TEST TEST TEST TEST TEST TEST TEST TEST COUNT".ToCharArray();

        public void mGCTest()
        {
            int count = 0;
            int max = 102400;
            
            List<Log> list = new List<Log>();

            for (int index1 = 0; index1 < 2000; ++index1)
			{
				for (int index2 = 0; index2 < 1000; ++index2)
				{
					String sd = $"GC  TEST TEST TEST TEST TEST TEST TEST TEST TEST TEST TEST TEST COUNT {count} / {max}";
					Log log = new Log();
					log.log = sd;
					list.Add(log);

					if (list.Count == max)
					{
						while (list.Count > 0)
						{
							list.RemoveAt(0);
						}
					}
					count++;
				}
            }
        }
    }

    public sealed class GCEventListener : EventListener
    {
        private static GCEventListener instance = null;
        private long timeGCStart = 0;
        private bool verbose = false;

        private GCEventListener()
        {
            Console.WriteLine("GCEventListener Created");
        }

        public static void Start(bool verbose = false)
        {
            if(instance == null)
                instance = new GCEventListener();
        }

        // Called whenever an EventSource is created.
        protected override void OnEventSourceCreated(EventSource eventSource)
        {
            // Watch for the .NET runtime EventSource and enable all of its events.
            if (eventSource.Name.Equals("Microsoft-Windows-DotNETRuntime"))
            {
                EnableEvents(eventSource, EventLevel.Informational, (EventKeywords)0x1);
            }
        }

        private void PrintEventData(EventWrittenEventArgs eventData)
        {
            Console.WriteLine($"ThreadID = {eventData.OSThreadId} ID = {eventData.EventId} Name = {eventData.EventName} Tick = {eventData.TimeStamp.Ticks/10.0/1000.0}");

            for (var i = 0; i < eventData.Payload.Count; i++)
            {
                string payloadString = eventData.Payload[i] != null ? eventData.Payload[i].ToString() : string.Empty;
                Console.WriteLine($"\tName = \"{eventData.PayloadNames[i]}\" Value = \"{payloadString}\"");
            }

            Console.WriteLine("\n");
        }

        // Called whenever an event is written.
        protected override void OnEventWritten(EventWrittenEventArgs eventData)
        {
			PrintEventData(eventData);
            if (eventData.EventName.Contains("Concurrent"))
            {
                PrintEventData(eventData);
            }
            else if (eventData.EventName.Contains("GCStart"))
            {
                timeGCStart = eventData.TimeStamp.Ticks;
                //PrintEventData(eventData);
            }
            else if (eventData.EventName.Contains("GCEnd"))
            {
                long timeGCEnd = eventData.TimeStamp.Ticks;
                long gcIndex = long.Parse(eventData.Payload[0].ToString());
                Console.WriteLine("GC#{0} took {1:f3}ms for generation {2}", gcIndex, (double) (timeGCEnd - timeGCStart)/10.0/1000.0, eventData.Payload[1]);
                //PrintEventData(eventData);
            }
        }
	}
}

cc @alpencolt @jkotas

@dotnet-issue-labeler
Copy link

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

@dotnet-issue-labeler dotnet-issue-labeler bot added the untriaged New issue has not been triaged by the area owner label Feb 24, 2022
@ghost
Copy link

ghost commented Feb 24, 2022

Tagging subscribers to this area: @dotnet/gc
See info in area-owners.md if you want to be subscribed.

Issue Details

We are investigating GC pause time in order to potentially achieve 60 fps, but we observe long GC EE suspend pauses.

For example, on next benchmark max GC pause on arm64(rpi4) can get as high as 55 ms with default GC setup (i.e. default latency mode, default max gen0/gen1 size). This is a huge pause, because 1 frame in 60 fps mode should fit in ~16 ms.

Some details:

  • GC pause time is computed as diff between GCRestartEEEnd and GCSuspendEEBegin events
  • SustainedLowLatency mode doesn't improve max GC pause time
  • LowLatency mode reduces max GC pause time, but max RSS increases signigicantly (on benchmark above ~10x increase in max RSS)
  • different values of COMPlus_GCGen0MaxBudget and COMPlus_GCGen1MaxBudget do not improve max GC pause time

Additionally, we've found the next measurements on coreclr/mono and results are pretty much the same: https://gist.github.com/jechter/2730225240163a806fcc15c44c5ac2d6.

Are there any plans to support GC mode with limited pause time?

using System;
using System.Threading;
using System.Collections.Generic;
using System.Diagnostics.Tracing;

namespace gctest2
{
    class Test
    {
        static void Main()
        {
            GCEventListener.Start();
            Test t = new Test();
            t.mGCTest();
        }

        struct Log
        {
            public String log;
        }

        private static char[] base_string =  "GC  TEST TEST TEST TEST TEST TEST TEST TEST TEST TEST TEST TEST COUNT".ToCharArray();

        public void mGCTest()
        {
            int count = 0;
            int max = 102400;
            
            List<Log> list = new List<Log>();

            for (int index1 = 0; index1 < 2000; ++index1)
			{
				for (int index2 = 0; index2 < 1000; ++index2)
				{
					String sd = $"GC  TEST TEST TEST TEST TEST TEST TEST TEST TEST TEST TEST TEST COUNT {count} / {max}";
					Log log = new Log();
					log.log = sd;
					list.Add(log);

					if (list.Count == max)
					{
						while (list.Count > 0)
						{
							list.RemoveAt(0);
						}
					}
					count++;
				}
            }
        }
    }

    public sealed class GCEventListener : EventListener
    {
        private static GCEventListener instance = null;
        private long timeGCStart = 0;
        private bool verbose = false;

        private GCEventListener()
        {
            Console.WriteLine("GCEventListener Created");
        }

        public static void Start(bool verbose = false)
        {
            if(instance == null)
                instance = new GCEventListener();
        }

        // Called whenever an EventSource is created.
        protected override void OnEventSourceCreated(EventSource eventSource)
        {
            // Watch for the .NET runtime EventSource and enable all of its events.
            if (eventSource.Name.Equals("Microsoft-Windows-DotNETRuntime"))
            {
                EnableEvents(eventSource, EventLevel.Informational, (EventKeywords)0x1);
            }
        }

        private void PrintEventData(EventWrittenEventArgs eventData)
        {
            Console.WriteLine($"ThreadID = {eventData.OSThreadId} ID = {eventData.EventId} Name = {eventData.EventName} Tick = {eventData.TimeStamp.Ticks/10.0/1000.0}");

            for (var i = 0; i < eventData.Payload.Count; i++)
            {
                string payloadString = eventData.Payload[i] != null ? eventData.Payload[i].ToString() : string.Empty;
                Console.WriteLine($"\tName = \"{eventData.PayloadNames[i]}\" Value = \"{payloadString}\"");
            }

            Console.WriteLine("\n");
        }

        // Called whenever an event is written.
        protected override void OnEventWritten(EventWrittenEventArgs eventData)
        {
			PrintEventData(eventData);
            if (eventData.EventName.Contains("Concurrent"))
            {
                PrintEventData(eventData);
            }
            else if (eventData.EventName.Contains("GCStart"))
            {
                timeGCStart = eventData.TimeStamp.Ticks;
                //PrintEventData(eventData);
            }
            else if (eventData.EventName.Contains("GCEnd"))
            {
                long timeGCEnd = eventData.TimeStamp.Ticks;
                long gcIndex = long.Parse(eventData.Payload[0].ToString());
                Console.WriteLine("GC#{0} took {1:f3}ms for generation {2}", gcIndex, (double) (timeGCEnd - timeGCStart)/10.0/1000.0, eventData.Payload[1]);
                //PrintEventData(eventData);
            }
        }
	}
}

cc @alpencolt @jkotas

Author: gbalykov
Assignees: -
Labels:

area-GC-coreclr, untriaged

Milestone: -

@jkotas
Copy link
Member

jkotas commented Feb 24, 2022

The current CoreCLR GC does not have guaranteed upper bound on the pause times. The pause times can the significant if the application creates patterns unfriendly to generational GC like long linked list or a lot of Gen2 -> Gen0 references.

Have you analyzed where the time is spent in the GC for this microbenchmark? Looking at the microbenchmark code, I think large number of Gen2 -> Gen0 references may be contributing be the significant pause times.

Are there any plans to support GC mode with limited pause time?

I am not aware of plans to build a GC mode with max pause time that is guaranteed even when the application uses patterns that are unfriendly to GC. It would likely need to be a very different GC from what we have today. I think we would be happy to accept a GC implementation like that into dotnet/runtime if somebody builds it.

@EgorBo
Copy link
Member

EgorBo commented Feb 24, 2022

different values of COMPlus_GCGen0MaxBudget and COMPlus_GCGen1MaxBudget do not improve max GC pause time

weird, I'd expect smaller gen0 to lead to more frequent GCs but smaller pauses overall (disclaimer: not a gc expert)
However, due to the L3 issue I assume the default Gen0 for your Pi4 is already set to some really small value like 256kb (see #64645). Have you tried to play with DOTNET_GCgen0size= instead? (just in case - these config switches accept numbers in HEX format)

@jkotas
Copy link
Member

jkotas commented Feb 24, 2022

weird, I'd expect smaller gen0 to lead to more frequent GCs but smaller pauses overall (disclaimer: not a gc expert)

That's true only for programs that behave according to the simplifying assumptions that the generational GC designs are based on. The example above violates these simplifying assumptions. It ends up with a large Gen2 array that has to be scanned by Gen0 GC every time since it may contain Gen0 back references.

Here is a simpler example that amplifies the problem. It has Gen0 GC pause times around 0.3 seconds on my desktop machine. Changing the Gen0 budged does not have affect on the GC pause times.

object[] a = new object[100_000_000];

for (;;)
{
    // Create a lot of Gen2 -> Gen0 references to keep the GC busy
    object o = new object();
    for (int i = 0; i < a.Length; i++)
    {
        a[i] = o;
    }
    // Create a lot of short lived objects to trigger Gen0 GC
    for (int i = 0; i < 1000; i++)
    {
        GC.KeepAlive(new string('a', 10000));
    }
}

BTW: It is not unusual to see these kind of anti-patterns in "high-performance" code that tries to cache and reuse data structures to "save" the GC work.

@Maoni0
Copy link
Member

Maoni0 commented Feb 25, 2022

before we provide soft real time guarantee (which is a lot of work as you can imagine), I was actually thinking of provide a mode for our GC with no generations just for workloads with no generational behavior at all. so every GC is a full GC. I hacked up a version to do this tonight 😄 with @jkotas's code I get this

comparison

the pause time in single gen is actually very tiny, but there's one BGC that has that max pause which made the avg pause 8.1ms instead of < 1ms -

image

what I have right now is very hacky. but if anyone is interested, I can see if I could get some time in the not too distant future make it reasonable quality to become an experimental feature (ie, can be turned on with a config).

@huoyaoyuan
Copy link
Member

GC pause time should be related to the CPU power. On PC processor, getting 600fps with SubstainedLowLatency is achievable.
Anyway, reducing GC pressure should be beneficial.

@gbalykov
Copy link
Member Author

Thanks for all the answers! Benchmark that I was testing is a sample GC stress test benchmark and I agree that it may not represent optimized behavior in terms of GC, but we wanted to see how GC behaves in this mode too.

Have you tried to play with DOTNET_GCgen0size= instead?

I've only tested different values of gen0_max_size and gen1_max_size (with which static_data_table[i] is filled), and on rpi4 gen0_min_size=min(1Mb, gen0_max_size). Also I've been testing from COMPlus_GCGen0MaxBudget=10000 (64 Kb) to COMPlus_GCGen0MaxBudget=10000000 (256 Mb) with some step, and max GC pause time varies but always stays above ~30 ms.

I'd expect smaller gen0 to lead to more frequent GCs but smaller pauses overall

This is true that average GC pause reduces significantly (to ~1-2ms with COMPlus_GCGen0MaxBudget=10000) and frequency of GCs increases significantly (more than x100), but max GC pause time still remains above ~30 ms.

what I have right now is very hacky. but if anyone is interested, I can see if I could get some time in the not too distant future make it reasonable quality to become an experimental feature (ie, can be turned on with a config).

@Maoni0 This looks very interesting, can you share more details on how to make this hack myself or maybe even patch, so that I can experiment with it?

@Maoni0
Copy link
Member

Maoni0 commented Feb 25, 2022

@gbalykov I need to do a bit more work on it - if I can get some time to work on this next week I'll give you a commit to try.

@Maoni0
Copy link
Member

Maoni0 commented Feb 25, 2022

@gbalykov also for the occasional long pause you are seeing right now, I would suggest 2 things -

does it show that the Promote MB is also high for those GCS? if so could you please try the generation aware analysis that we introduced in .NET 5? it would help us figure out what's causing the more promoted bytes.

if the Promoted is the same yet the GC time is longer, you could also capture a trace with CPU samples with perfview if you are on windows. I have a commandline in mem-doc (search for StopOnGCOverMSec).

@gbalykov
Copy link
Member Author

On the benchmark that I've shared above TotalPromotedSize0, TotalPromotedSize1 and TotalPromotedSize2 can get quite large, however, TotalPromotedSize0 is large almost for all GCs, even with small pauses. As far as I can tell large pauses mostly come from large TotalPromotedSize1 and TotalPromotedSize2. I think that @jkotas is right that this benchmark is GC unfriendly. I'll try to check generation aware analysis, thanks.

@mangod9 mangod9 removed the untriaged New issue has not been triaged by the area owner label Feb 28, 2022
@mangod9 mangod9 added this to the Future milestone Feb 28, 2022
@smoogipoo
Copy link

Also going to link #48937 here, where we've experienced Gen1 collection times greater than the 16ms window of 60fps, which is a bare minimum for us.

In my testing there, we didn't see what I would consider a "large amount" of promotion: https://user-images.githubusercontent.com/1329837/118099109-c0f93a00-b40f-11eb-80ae-db3ccf4a39a0.png as what's shown in this issue. I've forgotten how many times I've read through @Maoni0 's GC doc all to come to the conclusion that we're not doing anything out of the ordinary.

I was actually thinking of provide a mode for our GC with no generations just for workloads with no generational behavior at all

I would also be very interested and am willing to test this even off the main branch.

@aaron-manning
Copy link

aaron-manning commented Sep 18, 2022

+1

The lack of a low-latency GC option makes C#/F# sub-optimal for low-latency projects, eg high-frequency-trading.
Seems like a big hole in an otherwise great runtime.

@gbalykov
Copy link
Member Author

@Maoni0 did you have a chance to work on GC with no generations?

@Maoni0
Copy link
Member

Maoni0 commented Nov 16, 2022

I did, but only a little bit :( I got it to kind of run but clearly there were bugs. I'm hoping to get more time to work on this now that we finished .net 7. this seems like a good project for the holidays...

@roozbehid
Copy link
Contributor

+1

I also need a consistent 20ms processing and was doing a research on if it is achievable with c# or not.
It seems it is not yet. Hopefully. Net 8!

@gbalykov
Copy link
Member Author

@Maoni0 hi, did you have a chance to continue your work on GC with no generations?

@AlgorithmsAreCool
Copy link
Contributor

Adding another voice, I am currently developing a latency sensitive application and I'm also interested in the above discussion of a generationless GC as a potential option.

@Yen
Copy link

Yen commented Oct 8, 2023

Not experienced with the inner workings of GCs and can't talk to the complexities to such an approach, but I would like a way to have more control over the GC without requiring hard guarantees. I am building a UI platform (not a game engine where a dropped frame is the end of the world) and would like to have ways to potentially help avoid large GC pauses if I can instead perform smaller ones more often. For example, a call to GC.Collect with a target max duration would seem like an idea API for these scenarios where I can calculate through runtime heuristics how long I have free before a next frame might want to be displayed, and as such I can tell the garbage collector it has some time to do anything it might want to. I understand there is a lot going on in a full GC pause that can't be easily broken up, and as such for something like a game engine solving this is hard, but having the option get some smaller bits done when I know there is time to do so, in hopes that it would make a full GC pause slightly less regular, would be a nice to have for any user interaction dependent workload.

I am aware that this is a difficult problem to solve and, in many ways, goes against the point of the GC to try make it do things it doesn't want to do "optimally" from the GCs perspective. However, I think providing more primitives, even if it's easy to misuse them, that allow a developer in a high-performance scenario to dictate information to the garbage collector in hopes of avoiding a large GC pause would be appreciated.

@denis-paranichev
Copy link
Contributor

Hi, @Maoni0, did you have a chance to continue your work on generationless GC?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests