Fix KeyRingProvider thread pool starvation on cold start #66683

Merged
DeagleGross merged 6 commits into dotnet:main from DeagleGross:dmkorolev/keyringprovider-threadpool-starvation
May 19, 2026

@DeagleGross
Member

Background

KeyRingProvider.GetCurrentKeyRingCoreNew handles two states with one mechanism:

  • State A — stale ring exists. The cached ring expired but a previous value is still in the field. Refresh work is dispatched onto TaskScheduler.Default; every caller takes the early-return and immediately gets the stale ring. Nobody blocks.
  • State B — no ring at all (cold start). Same dispatch path runs, but now there is no stale ring to fall back on, so every caller falls through to existingTask.Wait() — pinning a thread-pool thread on a task that needs a free thread-pool thread to run. With a constrained pool (e.g. ThreadPool.SetMaxThreads(16, …) and 16 concurrent Protect calls — exactly the issue's repro), every worker is parked waiting for a worker that doesn't exist. The runtime's hill climber eventually injects extra threads (~118 s in the report) and the app recovers, but during the freeze nothing makes progress.

The fix splits out the cold-start case (no stale cacheableKeyRing yet) and performs the load synchronously on the first thread to acquire the lock. Other callers wait on the lock, as in the old implementation.
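
To make the split concrete, here is a minimal sketch of the idea. It is not the actual KeyRingProvider code: SketchRingCache, GetCurrent, the string "ring" and the lifetime parameter are all hypothetical stand-ins, and for simplicity the sketch takes the lock on every call, which the real fast path presumably avoids.

```csharp
using System;
using System.Threading.Tasks;

// Illustrative stand-in, NOT the real KeyRingProvider. All names are hypothetical.
public sealed class SketchRingCache
{
    private readonly object _gate = new object();
    private readonly Func<string> _load;   // stands in for the expensive key ring load
    private readonly TimeSpan _lifetime;   // assumed cache lifetime
    private string _ring;                  // null until the first load completes
    private DateTimeOffset _expiresAt;
    private Task _refresh;                 // in-flight background refresh, if any

    public SketchRingCache(Func<string> load, TimeSpan lifetime)
    {
        _load = load;
        _lifetime = lifetime;
    }

    public string GetCurrent(DateTimeOffset now)
    {
        lock (_gate)
        {
            if (_ring == null)
            {
                // Cold start: nothing to fall back on, so load inline on the thread
                // holding the lock. Other callers block on the lock, not on a queued task.
                _ring = _load();
                _expiresAt = now + _lifetime;
                return _ring;
            }

            if (now >= _expiresAt && (_refresh == null || _refresh.IsCompleted))
            {
                // Stale ring: refresh on the thread pool; nobody waits for the result.
                var expires = now + _lifetime;
                _refresh = Task.Run(() =>
                {
                    var fresh = _load();
                    lock (_gate)
                    {
                        _ring = fresh;
                        _expiresAt = expires;
                    }
                });
            }

            // Fresh or stale, return what we have without waiting.
            return _ring;
        }
    }
}
```

The point of the sketch is that the cold-start branch never queues work to the thread pool, so it cannot end up waiting for a worker thread that is itself blocked in the same call.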

Related #54675
Fixes #66380

@DeagleGross DeagleGross self-assigned this May 14, 2026
Copilot AI review requested due to automatic review settings May 14, 2026 18:05
@DeagleGross DeagleGross added the area-dataprotection Includes: DataProtection label May 14, 2026
@DeagleGross
Member Author

As a local check, I ran the repro below. It starved the thread pool on code from main and completed instantly on the updated code:

// Licensed to the .NET Foundation under one or more agreements.
// The .NET Foundation licenses this file to you under the MIT license.

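using System;
using System.Diagnostics;
using System.IO;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.AspNetCore.DataProtection;
using Microsoft.Extensions.DependencyInjection;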

namespace NonDISample;

public class Program
{
    public static async Task Main(string[] args)
    {
        // Build the DI container manually so we can avoid the two main-thread warming calls:
        //   1. AddDataProtection's IDataProtectionProvider factory calls
        //      CreateProtector(ApplicationDiscriminator) - skipped here because we don't
        //      set SetApplicationName / ApplicationDiscriminator.
        //   2. CreateProtector("repro") on the main thread - we move that into Task.Run.
        var services = new ServiceCollection();
        services
            .AddDataProtection()
            .PersistKeysToFileSystem(new DirectoryInfo(Path.Combine(Path.GetTempPath(), Guid.NewGuid().ToString("N"))));

        var serviceProvider = services.BuildServiceProvider();
        var provider = serviceProvider.GetRequiredService<IDataProtectionProvider>();

        // DO NOT call provider.CreateProtector here. That would warm the cache.

        const int threadCount = 16;
        var minOk = ThreadPool.SetMinThreads(threadCount, threadCount);
        var maxOk = ThreadPool.SetMaxThreads(threadCount, threadCount);
        ThreadPool.GetMaxThreads(out var actualMax, out _);

        Console.WriteLine($"ProcessorCount = {Environment.ProcessorCount}");
        Console.WriteLine($"SetMinThreads({threadCount}) = {minOk}, SetMaxThreads({threadCount}) = {maxOk}, actualMax = {actualMax}");
        if (actualMax > threadCount)
        {
            Console.WriteLine();
            Console.WriteLine("WARNING: pool max larger than caller count - bug cannot trigger.");
            Console.WriteLine("Re-run with: $env:DOTNET_PROCESSOR_COUNT = \"4\"");
            return;
        }
        Console.WriteLine();

        var barrier = new Barrier(threadCount);
        var stopwatch = Stopwatch.StartNew();

        var tasks = Enumerable.Range(0, threadCount)
            .Select(i => Task.Run(() =>
            {
                barrier.SignalAndWait();
                // First touch of the key ring happens here, on 16 pool threads concurrently.
                // CreateProtector internally calls _keyRingProvider.GetCurrentKeyRing() - that
                // is the cold-start race we want to trigger.
                var protector = provider.CreateProtector("repro");
                return protector.Protect($"hello {i}");
            }))
            .ToArray();

        var allDone = Task.WhenAll(tasks);
        var winner = await Task.WhenAny(allDone, Task.Delay(TimeSpan.FromSeconds(180)));
        stopwatch.Stop();

        var finishedCount = tasks.Count(t => t.IsCompleted);
        Console.WriteLine($"{finishedCount}/{threadCount} callers finished in {stopwatch.ElapsedMilliseconds} ms.");

        if (!ReferenceEquals(winner, allDone))
        {
            Console.WriteLine("STARVED: timed out before all callers finished (#66380).");
        }
        else if (stopwatch.ElapsedMilliseconds > 1000)
        {
            Console.WriteLine("DEGRADED: hill climber rescued the deadlock (#66380).");
        }
        else
        {
            Console.WriteLine("OK: no thread-pool starvation observed.");
        }
    }
}

Contributor

Copilot AI left a comment

Pull request overview

This PR addresses Data Protection cold-start thread pool starvation by performing the initial key ring load inline instead of dispatching it to the thread pool when no cached key ring exists.

Changes:

  • Splits cold-start behavior from stale-cache refresh behavior in KeyRingProvider.
  • Keeps async refresh for stale cached rings while loading the first ring synchronously.
  • Adds a regression test verifying cold-start refresh runs on the calling thread.
| File | Description |
| --- | --- |
| src/DataProtection/DataProtection/src/KeyManagement/KeyRingProvider.cs | Updates key ring refresh logic to avoid queuing cold-start work to the thread pool. |
| src/DataProtection/DataProtection/test/Microsoft.AspNetCore.DataProtection.Tests/KeyManagement/KeyRingProviderTests.cs | Adds regression coverage for the cold-start inline refresh invariant. |

Copilot's findings

  • Files reviewed: 2/2 changed files
  • Comments generated: 2

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Member

@halter73 halter73 left a comment

Thanks! I agree this might be worth backporting.

Development

Successfully merging this pull request may close these issues.

KeyRingProvider causes thread pool starvation on cold start in .NET 10
