Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Memory diagnoser fix for Tiered Compilation #1543

Closed
wants to merge 3 commits into from

Conversation

adamsitnik
Copy link
Member

@adamsitnik adamsitnik commented Oct 1, 2020

In #1542 @ronbrogan has reported a very unusual bug - a code that was CPU bound and not allocating at all was reporting allocations for .NET Core 3.1 (it works fine for 2.1 as there we use GC.GetAllocatedBytesForCurrentThread instead GC.GetTotalAllocatedBytes):

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Runtime.CompilerServices;

namespace BenchmarkAllocationDebug
{
    class Program
    {
        static void Main() => BenchmarkRunner.Run<Bench>();
    }

    [KeepBenchmarkFiles]
    public class Bench
    {
        [Benchmark] public ulong Benchmark1() => TestMethod();
        [Benchmark] public ulong Benchmark2() => TestMethod();
        [Benchmark] public ulong Benchmark3() => TestMethod();
        [Benchmark] public ulong Benchmark4() => TestMethod();
        [Benchmark] public ulong Benchmark5() => TestMethod();

        [MethodImpl(MethodImplOptions.NoInlining)]
        public static ulong TestMethod()
        {
            var r = 1ul;
            for (var i = 0; i < 50_000_000; i++)
            {
                r /= 1;
            }
            return r;
        }
    }
}

After some investigation, I've narrowed down the problem to Tiered JIT thread that from time to time would be promoting methods from Tier 0 to Tier 1 and allocating memory during the iteration where we call GC.GetTotalAllocatedBytes.

obraz

I had few ideas, but the only one worked was putting the thread to sleep for 250ms before we call GC.GetTotalAllocatedBytes. In this time TC thread kicks-in and promotes the methods. It's of course far from perfect as TC might not finish the promotion before we make the first call to GC.GetTotalAllocatedBytes. I don't want to prolong the sleeping period because it would increase the time we need to run the benchmarks.

@kouvel do you have a better idea of how we could prevent TC from working at a given point of time?

I've confirmed that it works as expected by modifying the GetExtraStats method to emit some extra events and filtering the TC events in PerfView to this particular period of time:

private (GcStats, ThreadingStats) GetExtraStats(IterationData data)
{
    if (EngineEventSource.Log.IsEnabled())
        EngineEventSource.Log.IterationStart(data.IterationMode, data.IterationStage, -2);

    if (RuntimeInformation.IsNetCore)
    {
        // we put the current thread to sleep so Tiered Compiler can kick in, compile it's stuff
        // and NOT allocate anything on the background thread when we are measuring allocations
        System.Threading.Thread.Sleep(TimeSpan.FromMilliseconds(250));
    }

    if (EngineEventSource.Log.IsEnabled())
        EngineEventSource.Log.IterationStop(data.IterationMode, data.IterationStage, -2);

    if (EngineEventSource.Log.IsEnabled())
        EngineEventSource.Log.IterationStart(data.IterationMode, data.IterationStage, -1);

    // we enable monitoring after main target run, for this single iteration which is executed at the end
    // so even if we enable AppDomain monitoring in separate process
    // it does not matter, because we have already obtained the results!
    EnableMonitoring();

    IterationSetupAction(); // we run iteration setup first, so even if it allocates, it is not included in the results

    var initialThreadingStats = ThreadingStats.ReadInitial(); // this method might allocate
    var initialGcStats = GcStats.ReadInitial();

    WorkloadAction(data.InvokeCount / data.UnrollFactor);

    var finalGcStats = GcStats.ReadFinal();
    var finalThreadingStats = ThreadingStats.ReadFinal();

    IterationCleanupAction(); // we run iteration cleanup after collecting GC stats

    GcStats gcStats = (finalGcStats - initialGcStats).WithTotalOperations(data.InvokeCount * OperationsPerInvoke);
    ThreadingStats threadingStats = (finalThreadingStats - initialThreadingStats).WithTotalOperations(data.InvokeCount * OperationsPerInvoke);

    if (EngineEventSource.Log.IsEnabled())
        EngineEventSource.Log.IterationStop(data.IterationMode, data.IterationStage, -1);

    return (gcStats, threadingStats);
}

obraz

obraz

Fixes #1542

…iteration, the Tiered JIT might kick-in and allocate some memory and affect the results

as a workaround, we can put the thread to sleep for more than 200ms to TC thread kicks in before we start memory measurements

it's far from perfect but it works

fixes #1542
@kouvel
Copy link
Member

kouvel commented Oct 1, 2020

Hmm you had mentioned on the other issue that 5.0 is free of this issue, is this really specific to 3.1? Timings have changed between 3.1 and 5.0 such that the rejits would happen sooner in 5.0 but it seems unlikely that that would affect microbenchmarks (the change is easily visible in real-world cases with many thousands of methods but there would only be a few methods here). For allocation measurements especially with long-running methods and with a strong assertion about allocation (that there shouldn't be any) generally my suggestion is to disable tiered compilation. Tier 0 jitted code may allocate where optimized code would not, and the former may allocate much more than the latter. An alternative to the delay may be to set COMPlus_TC_AggressiveTiering=1, which should tier-up more quickly, but it would still not guarantee that things would get tiered-up in time, if the involved methods are very long-running some methods may not reach tier 1.

Aside from that, I wonder what is actually causing allocation during jitting. There are some small allocation-dependent things I'm aware of regarding virtual methods that can be fixed, but otherwise on a rejit since all static construction, etc. should have already been done I'm not sure what would be allocating. Do you have an idea?

@kouvel
Copy link
Member

kouvel commented Oct 1, 2020

Also I don't think there should be any guarantee that there would not be any allocation happening in the background. As we move more stuff to managed code like the thread pool, things may happen in the background unrelated to the app's code that may cause allocation and that should be ok.

@adamsitnik
Copy link
Member Author

Hmm you had mentioned on the other issue that 5.0 is free of this issue, is this really specific to 3.1?

Yes, it looks like it's specific to 3.1 (I did not test 2.2 and 3.0 as they are not supported anymore)

BenchmarkDotNet=v0.12.1.20201002-develop, OS=Windows 10.0.18363.1082 (1909/November2019Update/19H2)
Intel Xeon CPU E5-1650 v4 3.60GHz, 1 CPU, 12 logical and 6 physical cores
.NET SDK=5.0.100-rc.1.20452.10
  [Host]     : .NET Core 2.1.21 (CoreCLR 4.6.29130.01, CoreFX 4.6.29130.02), X64 RyuJIT
  Job-WTPPRD : .NET 5.0.0 (5.0.20.45114), X64 RyuJIT
  Job-MUUDGX : .NET Core 2.1.21 (CoreCLR 4.6.29130.01, CoreFX 4.6.29130.02), X64 RyuJIT
  Job-TXYPKS : .NET Core 3.1.7 (CoreCLR 4.700.20.36602, CoreFX 4.700.20.37001), X64 RyuJIT
  Job-IEOLEB : .NET Framework 4.8 (4.8.4220.0), X64 RyuJIT


|     Method |              Runtime | Allocated |
|----------- |--------------------- |----------:|
| Benchmark1 |             .NET 5.0 |         - |
| Benchmark1 |        .NET Core 2.1 |         - |
| Benchmark1 |        .NET Core 3.1 |       9 B |
| Benchmark1 |   .NET Framework 4.8 |         - |
|            |                      |           |
| Benchmark2 |             .NET 5.0 |         - |
| Benchmark2 |        .NET Core 2.1 |         - |
| Benchmark2 |        .NET Core 3.1 |      20 B |
| Benchmark2 |   .NET Framework 4.8 |         - |
|            |                      |           |
| Benchmark3 |             .NET 5.0 |         - |
| Benchmark3 |        .NET Core 2.1 |         - |
| Benchmark3 |        .NET Core 3.1 |      21 B |
| Benchmark3 |   .NET Framework 4.8 |         - |
|            |                      |           |
| Benchmark4 |             .NET 5.0 |         - |
| Benchmark4 |        .NET Core 2.1 |         - |
| Benchmark4 |        .NET Core 3.1 |     178 B |
| Benchmark4 |   .NET Framework 4.8 |         - |
|            |                      |           |
| Benchmark5 |             .NET 5.0 |         - |
| Benchmark5 |        .NET Core 2.1 |         - |
| Benchmark5 |        .NET Core 3.1 |     101 B |
| Benchmark5 |   .NET Framework 4.8 |         - |

@kouvel
Copy link
Member

kouvel commented Oct 2, 2020

Ok I think it is the virtual slot backpatching storage. It's probably the rejit timing that is causing the allocation to happen earlier and not show up in the benchmark in 5.0.

@adamsitnik
Copy link
Member Author

An alternative to the delay may be to set COMPlus_TC_AggressiveTiering=1

I am afraid that this could lead to BDN reporting "too perfect" results that could differ from what end-users with default settings are experiencing

Aside from that, I wonder what is actually causing allocation during jitting

I was also curious and tried to use VS Memory Profiler to find out, but I've failed.

In this particular case the VS Profiler shows me memory allocated for JITTing of the methods that are executed for the first time:

obraz

But it does not show me anything attributed to the TP Thread and TieredCompilationManager.OptimizeMethods

@adamsitnik
Copy link
Member Author

Also I don't think there should be any guarantee that there would not be any allocation happening in the background.

I totally agree. But I also expect the users to be quite suprised when they see that BDN reports allocated memory for a code that clearly does not allocate anything in explicit way.

@kouvel
Copy link
Member

kouvel commented Oct 2, 2020

An alternative to the delay may be to set COMPlus_TC_AggressiveTiering=1

I am afraid that this could lead to BDN reporting "too perfect" results that could differ from what end-users with default settings are experiencing

Possibly. With aggressive tiering, it still goes through the normal tiering stages, just more quickly. It would change rejit timings and more methods may be rejitted, though likely the code quality would be similar to default mode (no guarantee though since it can depend on timing and code paths hit in the calls before rejit). The other option is to give the test more warmup time to stabilize.

@@ -183,6 +183,13 @@ public Measurement RunIteration(IterationData data)
// it does not matter, because we have already obtained the results!
EnableMonitoring();

if (RuntimeInformation.IsNetCore)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this problem is relevant only for .NET Core 3.x, could we add an additional check?

{
// we put the current thread to sleep so Tiered Compiler can kick in, compile it's stuff
// and NOT allocate anything on the background thread when we are measuring allocations
System.Threading.Thread.Sleep(TimeSpan.FromMilliseconds(250));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's make it 500 (just to be extra sure).

@timcassell
Copy link
Collaborator

timcassell commented Nov 30, 2020

What if you just run GetExtraStats twice, only using the results from the second run (or the lesser of both runs)?

@timcassell timcassell mentioned this pull request Dec 1, 2020
@adamsitnik
Copy link
Member Author

What if you just run GetExtraStats twice, only using the results from the second run (or the lesser of both runs)?

@timcassell this would increase the time it takes to get the stats which would be bad for long-running benchmarks

@timcassell
Copy link
Collaborator

@adamsitnik I noticed you showed that the issue is not apparent in .NET 5.0, which I find strange because I posted an issue in runtime (dotnet/runtime#45446) that demonstrates memory measurement issues in both .NET 5.0 and .NET Core 3.1 runtimes. If that same thing affects this issue, this may not be solvable until the fix is in the runtime (it was placed on the 6.0.0 milestone).

@timcassell
Copy link
Collaborator

timcassell commented Dec 2, 2020

Also @adamsitnik I was looking at RuntimeHelpers.PrepareDelegate and RuntimeHelpers.PrepareMethod which I was hoping would help me with my PR. Sadly, it didn't, but it might help here if it really is a JIT issue, you think? https://www.liranchen.com/2010/08/forcing-jit-compilation-during-runtime.html

@adamsitnik
Copy link
Member Author

@timcassell The list of methods that get promoted to Tier 1 in the meantime is not fixed. It's not just the benchmarked code, it's also BDN Engine code and the BCL types that it's using. For example in my investigation I've found the following methods to be "problematic" (the TC marked with yellow):

obraz

There is no way to ensure that all of them are going to get promoted to Tier 1 before we start measuring GC stats.

@timcassell
Copy link
Collaborator

I can confirm that I am seeing the same issue (sometimes shows allocations, sometimes doesn't) in .NET 5.0, so it's not isolated to Core 3.1.

BenchmarkDotNet=v0.12.1.20201202-develop, OS=Windows 7 SP1 (6.1.7601.0)
AMD Phenom(tm) II X6 1055T Processor, 1 CPU, 6 logical and 6 physical cores
Frequency=2746679 Hz, Resolution=364.0760 ns, Timer=TSC
  [Host]     : .NET Framework 4.8 (4.8.4110.0), X64 RyuJIT
  Job-LYPETN : .NET Core 5.0.0 (CoreCLR 5.0.20.51904, CoreFX 5.0.20.51904), X64 RyuJIT

Runtime=.NET Core 5.0  IterationCount=3  LaunchCount=1
RunStrategy=Throughput  WarmupCount=3

|          Method |     Mean |    Error |   StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---------------- |---------:|---------:|---------:|------:|------:|------:|----------:|
| ZeroAllocations | 71.57 us | 0.187 us | 0.010 us |     - |     - |     - |       2 B |
[MemoryDiagnoser]
public class Memory
{
    [Benchmark]
    public void ZeroAllocations()
    {
        for (int i = 0; i < 100_000; i++) { }
    }
}

@adamsitnik
Copy link
Member Author

@kouvel is dotnet/runtime#45901 going to affect this behavior? will Tiered JIT stop allocating managed memory in the background thread?

@kouvel
Copy link
Member

kouvel commented Jan 29, 2021

@kouvel is dotnet/runtime#45901 going to affect this behavior? will Tiered JIT stop allocating managed memory in the background thread?

No it wouldn't affect the allocation. The allocation is from virtual slot backpatching, which currently uses some GC memory for tracking virtual slots and is somewhat separate from tiering. An eventual goal is to remove the dependency on the GC for that tracking, as it has other issues, but it's lower on the priority list at the moment.

@adamsitnik
Copy link
Member Author

@kouvel thank you for a very detailed answer!

@AndreyAkinshin AndreyAkinshin modified the milestones: v0.13.0, v0.13.x May 19, 2021
@timcassell
Copy link
Collaborator

Looks like this was fixed in .Net 6 preview 6. dotnet/runtime#45446 (comment)

@kouvel
Copy link
Member

kouvel commented Jul 24, 2021

@timcassell the underlying issue is not fixed, it just may not be reproducing anymore by chance. When measuring GC allocations I would for now still suggest turning off tiered compilation to avoid the noise.

@timcassell
Copy link
Collaborator

@kouvel Curious if this was actually fixed in dotnet/runtime#67160?

@kouvel
Copy link
Member

kouvel commented Jul 11, 2022

Yes this should be fixed now in .NET 7.0 by dotnet/runtime#67160

@adamsitnik
Copy link
Member Author

I've used the latest .NET Runtime (7.0.100-rc.1.22364.28) and BDN still reports that some memory is being allocated. Most likely there are still some background allocations. I currently don't have the time to dig in deeper into that.

Method Runtime Allocated
Benchmark1 .NET 5.0 -
Benchmark1 .NET 6.0 18 B
Benchmark1 .NET 7.0 15 B
Benchmark2 .NET 5.0 -
Benchmark2 .NET 6.0 15 B
Benchmark2 .NET 7.0 15 B
Benchmark3 .NET 5.0 -
Benchmark3 .NET 6.0 16 B
Benchmark3 .NET 7.0 15 B
Benchmark4 .NET 5.0 -
Benchmark4 .NET 6.0 16 B
Benchmark4 .NET 7.0 15 B
Benchmark5 .NET 5.0 -
Benchmark5 .NET 6.0 18 B
Benchmark5 .NET 7.0 15 B

@timcassell
Copy link
Collaborator

Superseded by #2562.

@timcassell timcassell closed this Apr 23, 2024
@timcassell timcassell removed this from the v0.14.x milestone Apr 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Unexpected allocations reported in CPU-bound code
4 participants