Reduce contention in UnwindInfoTable seen during high volume LCG method creation #128619

Open
eduardo-vp wants to merge 7 commits into dotnet:main from eduardo-vp:xunit-reg-windows

Member

@eduardo-vp eduardo-vp commented May 27, 2026

Contributes to #123124.

Changes:

  • Adds an initial code pointer to m_DynamicCodePointers in LCGMethodResolver. In most cases the list contains exactly one pointer, but when it starts empty the first insertion performs a chunk allocation while holding a lock. The initial pointer avoids that allocation in the common case.
  • Adds a publish lock and a pending lock per unwind info table rather than using global locks.
  • Reduces contention on the publish lock by letting a thread detect whether another thread is already flushing entries. If so, the thread does not block on the publish lock; it returns, and the flushing thread handles its entries on another iteration of the flushing loop.
  • Bumps the pending table size to 128 and uses qsort instead of a quadratic sort.
  • Removes entries using binary search instead of linear search. Removal now also has to check both the pending and the published tables, because this change removes the guarantee that an entry is published by the time AddToUnwindInfoTable returns (it may stay in the pending buffer and get picked up by another thread).
  • Adds a registration-failed flag so AddToUnwindInfoTable can return early without taking locks unnecessarily.
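
The flush-gate behavior described in the third bullet can be sketched roughly as follows. This is a hypothetical toy model built on standard C++ primitives, not the code in this PR: `ToyUnwindTable`, `Add`, `FlushPending`, and the `std::mutex`/`std::atomic` members are illustrative names, and entries are reduced to plain start addresses.

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <mutex>
#include <vector>

// Hypothetical toy model of a per-table flush gate; not the CoreCLR implementation.
class ToyUnwindTable {
public:
    // Queue an entry, then try to become the sole flusher for this table.
    void Add(std::uint32_t startAddress) {
        {
            std::lock_guard<std::mutex> hold(m_pendingLock);
            m_pending.push_back(startAddress);
        }
        for (;;) {
            bool expected = false;
            if (!m_flushInProgress.compare_exchange_strong(expected, true)) {
                // Another thread holds the gate. Do not block: that thread
                // re-checks the pending buffer after it releases the gate.
                return;
            }
            FlushPending();
            m_flushInProgress.store(false);

            // Entries may have arrived while we were flushing; loop if so.
            std::lock_guard<std::mutex> hold(m_pendingLock);
            if (m_pending.empty())
                return;
        }
    }

    bool IsPublished(std::uint32_t startAddress) {
        std::lock_guard<std::mutex> hold(m_publishLock);
        return std::binary_search(m_published.begin(), m_published.end(), startAddress);
    }

    std::size_t PublishedCount() {
        std::lock_guard<std::mutex> hold(m_publishLock);
        return m_published.size();
    }

private:
    // Move the pending batch out, sort it, and merge it into the published table.
    void FlushPending() {
        std::vector<std::uint32_t> batch;
        {
            std::lock_guard<std::mutex> hold(m_pendingLock);
            batch.swap(m_pending);
        }
        std::sort(batch.begin(), batch.end());

        std::lock_guard<std::mutex> hold(m_publishLock);
        std::vector<std::uint32_t> merged;
        merged.reserve(m_published.size() + batch.size());
        std::merge(m_published.begin(), m_published.end(),
                   batch.begin(), batch.end(), std::back_inserter(merged));
        m_published.swap(merged);
    }

    std::mutex m_pendingLock;                    // guards m_pending
    std::mutex m_publishLock;                    // guards m_published
    std::atomic<bool> m_flushInProgress{false};  // the flush gate
    std::vector<std::uint32_t> m_pending;        // unsorted, recently added
    std::vector<std::uint32_t> m_published;      // sorted, visible to lookups
};
```

In this model, a thread that takes the early `return` leaves its entry in the pending buffer until the gate holder loops again, so `IsPublished` can briefly report `false` for an address whose `Add` call has already returned.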

I tested these changes with both the original xUnit test with 30k methods mentioned in the issue and a separate benchmark that stresses LCG method creation. I mostly used the LCG benchmark for the investigation: .NET 10 runs ~150% slower than .NET 9 on it, so it was easier to tell whether a change helped.

For the measurements, I built from release/9.0, release/10.0, and release/10.0 plus the changes in this PR, since the goal is to backport to .NET 10.

LCG Stress Benchmark (300K methods)

LCG stress benchmark code
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq.Expressions;
using System.Threading;
using System.Threading.Tasks;

internal static class Program
{
    private static int Main(string[] args)
    {
        int totalMethods   = args.Length > 0 ? int.Parse(args[0])   : 300_000;
        int testsPerClass  = args.Length > 1 ? int.Parse(args[1])   : 3;
        int threads        = Environment.ProcessorCount;

        Console.WriteLine($"totalMethods={totalMethods} testsPerClass={testsPerClass} threads={threads}");

        var sw = Stopwatch.StartNew();
        Run(totalMethods, threads, testsPerClass);
        sw.Stop();

        Console.WriteLine($"Elapsed: {sw.Elapsed.TotalSeconds:F3} s ");
        return 0;
    }

    private static void Run(int totalMethods, int threads, int testsPerClass)
    {
        Func<int, Func<int, int>> compileBody = k => CompileTestBody(k);
        int classCount = Math.Max(1, totalMethods / testsPerClass);
        var queue = new ConcurrentQueue<int>();
        for (int i = 0; i < classCount; i++) queue.Enqueue(i);

        var workers = new Task[threads];
        for (int w = 0; w < threads; w++)
        {
            workers[w] = Task.Factory.StartNew(() =>
            {
                int seed = 0;
                var classScope = new List<Func<int, int>>(capacity: testsPerClass);

                while (queue.TryDequeue(out int classIndex))
                {
                    classScope.Clear();

                    for (int i = 0; i < testsPerClass; i++)
                    {
                        Func<int, int> formatter = CompileFormatter(seed++);
                        int arg = formatter(classIndex + i);

                        Func<int, int> testBody = compileBody(seed++);
                        testBody(arg);
                        // testBody.DynamicInvoke(arg);

                        classScope.Add(testBody);
                    }

                }
            }, TaskCreationOptions.LongRunning);
        }

        Task.WaitAll(workers);
    }

    private static Func<int, int> CompileFormatter(int k)
    {
        var x = Expression.Parameter(typeof(int), "x");
        Expression body = Expression.ExclusiveOr(x, Expression.Constant(k));
        return Expression.Lambda<Func<int, int>>(body, x).Compile();
    }

    private static Func<int, int> CompileTestBody(int k)
    {
        var x = Expression.Parameter(typeof(int), "x");
        Expression body = Expression.Add(
            Expression.Multiply(x, Expression.Constant(k | 1)),
            Expression.Constant(k));
        body = Expression.Condition(
            Expression.GreaterThan(x, Expression.Constant(0)),
            body,
            Expression.Negate(body));
        return Expression.Lambda<Func<int, int>>(body, x).Compile();
    }
}
| Workload | .NET 9 | .NET 10 | .NET 10 + this PR |
|---|---|---|---|
| Time (s) | 5.178 | 13.116 | 5.677 |
| vs .NET 9 | - | +153 % | +10 % |
| vs .NET 10 | - | - | -57 % |

The original regression was ~150%. With this PR the benchmark is still ~10% slower than .NET 9, though I doubt this is a realistic scenario.
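
For reference, the percentages above follow from the raw timings as (value - baseline) / baseline. A small sketch that recomputes them; the helper name `PercentChange` is illustrative and not part of the benchmark:

```cpp
#include <cmath>

// Relative change of `value` versus `baseline`, rounded to a whole percent.
inline long PercentChange(double baseline, double value) {
    return std::lround((value - baseline) / baseline * 100.0);
}

// PercentChange(5.178, 13.116) ->  153   (.NET 10 vs .NET 9)
// PercentChange(5.178, 5.677)  ->   10   (.NET 10 + this PR vs .NET 9)
// PercentChange(13.116, 5.677) ->  -57   (.NET 10 + this PR vs .NET 10)
```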

Original xUnit Benchmark (30K methods)

| Workload | .NET 9 | .NET 10 | .NET 10 + this PR |
|---|---|---|---|
| Time (s) | 2.111 | 2.802 | 2.115 |
| vs .NET 9 | - | +33 % | +0 % |
| vs .NET 10 | - | - | -25 % |

Original regression was ~33%. The original benchmark now runs essentially as fast as .NET 9.

@eduardo-vp eduardo-vp self-assigned this May 27, 2026
Copilot AI review requested due to automatic review settings May 27, 2026 03:36
@eduardo-vp eduardo-vp added tenet-performance Performance related issue area-VM-coreclr labels May 27, 2026
@dotnet-policy-service
Contributor

Tagging subscribers to this area: @agocke
See info in area-owners.md if you want to be subscribed.

Contributor

Copilot AI left a comment


Pull request overview

This PR targets performance/scalability under high-volume dynamic method (LCG) creation by reducing lock contention and avoiding avoidable allocations in CoreCLR’s dynamic method and unwind-info publishing paths.

Changes:

  • Pre-seeds LCGMethodResolver::m_DynamicCodePointers with an inline “first” node to avoid chunk allocation in the common single-pointer case.
  • Reworks UnwindInfoTable synchronization to use per-table locks plus a per-table “flush gate” to reduce contention when flushing pending entries.
  • Increases pending buffer capacity and switches pending sorting and removal operations to more efficient algorithms.
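
The "inline first node" idea in the first bullet can be illustrated with a minimal sketch. Everything here is hypothetical (`ToyResolver`, `CodePointerNode`, and plain `new` standing in for the runtime's chunk allocation under a lock); it only shows why the common single-pointer case needs no allocation when the first node is embedded in the owner.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical list node; the real record type in the runtime differs.
struct CodePointerNode {
    std::uintptr_t code;
    CodePointerNode* next;
};

class ToyResolver {
public:
    ToyResolver() : m_head(nullptr), m_firstInUse(false), m_heapNodes(0) {
        m_first.code = 0;
        m_first.next = nullptr;
    }
    ToyResolver(const ToyResolver&) = delete;
    ToyResolver& operator=(const ToyResolver&) = delete;

    ~ToyResolver() {
        // Free only heap nodes; the inline node lives inside this object.
        for (CodePointerNode* n = m_head; n != nullptr;) {
            CodePointerNode* next = n->next;
            if (n != &m_first)
                delete n;
            n = next;
        }
    }

    // Records a code pointer. Returns true if a heap allocation was needed.
    bool AddCodePointer(std::uintptr_t code) {
        if (!m_firstInUse) {
            // Common case: use the embedded node, no allocation.
            m_first.code = code;
            m_first.next = nullptr;
            m_firstInUse = true;
            m_head = &m_first;
            return false;
        }
        // Rare case: stand-in for the allocation that would happen under a lock.
        CodePointerNode* node = new CodePointerNode();
        node->code = code;
        node->next = m_head;
        m_head = node;
        ++m_heapNodes;
        return true;
    }

    std::size_t Count() const {
        std::size_t n = 0;
        for (const CodePointerNode* p = m_head; p != nullptr; p = p->next)
            ++n;
        return n;
    }

    std::size_t HeapNodeCount() const { return m_heapNodes; }

private:
    CodePointerNode m_first;   // inline "first" node
    CodePointerNode* m_head;   // newest node, or nullptr when empty
    bool m_firstInUse;
    std::size_t m_heapNodes;
};
```

With this layout, a resolver that only ever records one code pointer never touches the allocator; only the second and later pointers pay for an allocation.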

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

| File | Description |
|---|---|
| src/coreclr/vm/dynamicmethod.h | Adds an inline initial DynamicCodePointer node and initializes m_DynamicCodePointers to it. |
| src/coreclr/vm/dynamicmethod.cpp | Resets/allocates code pointer records using the inline initial node to avoid allocations under hot locks. |
| src/coreclr/vm/codeman.h | Expands pending buffer and introduces per-table locks, flush gate, and registration-failed state. |
| src/coreclr/vm/codeman.cpp | Implements per-table locking/flush gating, pending sort changes, binary-search removal, and lock-free table installation via CAS. |
Comments suppressed due to low confidence (1)

src/coreclr/vm/codeman.cpp:500

  • RemoveFromUnwindInfoTable can incorrectly return after matching a soft-deleted (UnwindData==0) entry in the published table. This is problematic now that we also search the pending buffer: if code addresses are reused, a deleted published entry may still cover the new method’s Begin/End range, causing this path to return early and skip removing the real pending entry. Consider requiring UnwindData != 0 for the published-table match (or otherwise treating deleted matches as not-found so the pending buffer check can run).
        if (lo > 0)
        {
            ULONG i = lo - 1;
            if (relativeEntryPoint < RUNTIME_FUNCTION__EndAddress(&unwindInfo->pTable[i], unwindInfo->iRangeStart))
            {
                if (unwindInfo->pTable[i].UnwindData != 0)
                    unwindInfo->cDeletedEntries++;
                unwindInfo->pTable[i].UnwindData = 0;        // Mark the entry for deletion
                STRESS_LOG1(LF_JIT, LL_INFO100, "RemoveFromUnwindInfoTable Removed entry 0x%x\n", i);
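
The suggestion above (treat a soft-deleted published match as not found so the pending buffer is still checked) could look roughly like the following. This is a simplified, hypothetical model, not the code in codeman.cpp: `ToyEntry`, `ToyTable`, and `Remove` are illustrative names, and locking is omitted.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical simplified entry; unwindData == 0 means soft-deleted.
struct ToyEntry {
    std::uint32_t begin;
    std::uint32_t end;
    std::uint32_t unwindData;
};

struct ToyTable {
    std::vector<ToyEntry> published;  // sorted by begin
    std::vector<ToyEntry> pending;    // not yet published, unsorted
    unsigned deletedCount = 0;

    // Returns true if a live entry covering entryPoint was removed.
    bool Remove(std::uint32_t entryPoint) {
        // Binary search for the last published entry with begin <= entryPoint.
        auto it = std::upper_bound(
            published.begin(), published.end(), entryPoint,
            [](std::uint32_t addr, const ToyEntry& e) { return addr < e.begin; });
        if (it != published.begin()) {
            ToyEntry& candidate = *(it - 1);
            // Only a live match ends the search; a soft-deleted entry may be
            // stale and still cover an address range that has been reused.
            if (entryPoint < candidate.end && candidate.unwindData != 0) {
                candidate.unwindData = 0;  // mark the entry for deletion
                ++deletedCount;
                return true;
            }
        }
        // Fall through to the pending buffer.
        auto p = std::find_if(
            pending.begin(), pending.end(),
            [entryPoint](const ToyEntry& e) {
                return e.begin <= entryPoint && entryPoint < e.end;
            });
        if (p != pending.end()) {
            pending.erase(p);
            return true;
        }
        return false;
    }
};
```

In this model, a stale published entry for a reused address range no longer hides the live pending entry for the new method.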


// This thread attempts to become the sole flusher for this table by taking
// the flush gate. If it wins, publish the pending entries, then release the gate,
// re-check if more entries arrived and loop if so.
Member

Do the other threads need to wait for their entries to be flushed?

I do not see any code that does that.

Member Author

@eduardo-vp eduardo-vp May 27, 2026

After AddToUnwindInfoTable finishes, they may need to wait a minimum amount of time. A thread that finds m_flushInProgress set to 1 (there's a flushing thread) just returns and waits for the flushing thread to just loop again and pick up its entries. Some threads essentially defer the work to another thread in this version as opposed to every thread taking the lock and publishing (which generated a lot of contention).

Member

> A thread that finds m_flushInProgress set to 1 (there's a flushing thread) just returns and waits for the flushing thread to just loop again

What's the line of code that makes it wait?

Member Author

Ah it doesn't actually wait/block - I meant there's some time until it gets published. The thread just continues.

Member

It means that the tracing stacktraces on the thread that continues are going to be broken until the flushing thread catches up. This change is regressing tracing reliability.

@korchak-aleksandr

@eduardo-vp do you still need a minimal repro from us on Linux, as mentioned in your comment?

