Fix KeyRingProvider thread pool starvation on cold start#66683
Merged
DeagleGross merged 6 commits intoMay 19, 2026
Merged
Conversation
Member
Author
|
As a local proof I tried this repro which starved threads on code from // Licensed to the .NET Foundation under one or more agreements.
// The .NET Foundation licenses this file to you under the MIT license.
using System.Diagnostics;
using Microsoft.AspNetCore.DataProtection;
using Microsoft.Extensions.DependencyInjection;
namespace NonDISample;
public class Program
{
public static async Task Main(string[] args)
{
// Build the DI container manually so we can avoid the two main-thread warming calls:
// 1. AddDataProtection's IDataProtectionProvider factory calls
// CreateProtector(ApplicationDiscriminator) - skipped here because we don't
// set SetApplicationName / ApplicationDiscriminator.
// 2. CreateProtector("repro") on the main thread - we move that into Task.Run.
var services = new ServiceCollection();
services
.AddDataProtection()
.PersistKeysToFileSystem(new DirectoryInfo(Path.Combine(Path.GetTempPath(), Guid.NewGuid().ToString("N"))));
var serviceProvider = services.BuildServiceProvider();
var provider = serviceProvider.GetRequiredService<IDataProtectionProvider>();
// DO NOT call provider.CreateProtector here. That would warm the cache.
const int threadCount = 16;
var minOk = ThreadPool.SetMinThreads(threadCount, threadCount);
var maxOk = ThreadPool.SetMaxThreads(threadCount, threadCount);
ThreadPool.GetMaxThreads(out var actualMax, out _);
Console.WriteLine($"ProcessorCount = {Environment.ProcessorCount}");
Console.WriteLine($"SetMinThreads({threadCount}) = {minOk}, SetMaxThreads({threadCount}) = {maxOk}, actualMax = {actualMax}");
if (actualMax > threadCount)
{
Console.WriteLine();
Console.WriteLine("WARNING: pool max larger than caller count - bug cannot trigger.");
Console.WriteLine("Re-run with: $env:DOTNET_PROCESSOR_COUNT = \"4\"");
return;
}
Console.WriteLine();
var barrier = new Barrier(threadCount);
var stopwatch = Stopwatch.StartNew();
var tasks = Enumerable.Range(0, threadCount)
.Select(i => Task.Run(() =>
{
barrier.SignalAndWait();
// First touch of the key ring happens here, on 16 pool threads concurrently.
// CreateProtector internally calls _keyRingProvider.GetCurrentKeyRing() - that
// is the cold-start race we want to trigger.
var protector = provider.CreateProtector("repro");
return protector.Protect($"hello {i}");
}))
.ToArray();
var allDone = Task.WhenAll(tasks);
var winner = await Task.WhenAny(allDone, Task.Delay(TimeSpan.FromSeconds(180)));
stopwatch.Stop();
var finishedCount = tasks.Count(t => t.IsCompleted);
Console.WriteLine($"{finishedCount}/{threadCount} callers finished in {stopwatch.ElapsedMilliseconds} ms.");
if (!ReferenceEquals(winner, allDone))
{
Console.WriteLine("STARVED: timed out before all callers finished (#66380).");
}
else if (stopwatch.ElapsedMilliseconds > 1000)
{
Console.WriteLine("DEGRADED: hill climber rescued the deadlock (#66380).");
}
else
{
Console.WriteLine("OK: no thread-pool starvation observed.");
}
}
} |
Contributor
There was a problem hiding this comment.
Pull request overview
This PR addresses Data Protection cold-start thread pool starvation by performing the initial key ring load inline instead of dispatching it to the thread pool when no cached key ring exists.
Changes:
- Splits cold-start behavior from stale-cache refresh behavior in
KeyRingProvider. - Keeps async refresh for stale cached rings while loading the first ring synchronously.
- Adds a regression test verifying cold-start refresh runs on the calling thread.
Show a summary per file
| File | Description |
|---|---|
src/DataProtection/DataProtection/src/KeyManagement/KeyRingProvider.cs |
Updates key ring refresh logic to avoid queuing cold-start work to the thread pool. |
src/DataProtection/DataProtection/test/Microsoft.AspNetCore.DataProtection.Tests/KeyManagement/KeyRingProviderTests.cs |
Adds regression coverage for the cold-start inline refresh invariant. |
Copilot's findings
- Files reviewed: 2/2 changed files
- Comments generated: 2
This was referenced May 14, 2026
Open
3 tasks
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
3 tasks
halter73
approved these changes
May 19, 2026
Member
halter73
left a comment
There was a problem hiding this comment.
Thanks! I agree this might be worth backporting.
…ovider-threadpool-starvation
This was referenced May 19, 2026
3 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Background
KeyRingProvider.GetCurrentKeyRingCoreNewhandles two states with one mechanism:TaskScheduler.Default; every caller takes the early-return and immediately gets the stale ring. Nobody blocks.existingTask.Wait()— pinning a thread-pool thread on a task that needs a free thread-pool thread to run. With a constrained pool (e.g.ThreadPool.SetMaxThreads(16, …)and 16 concurrentProtectcalls — exactly the issue's repro), every worker is parked waiting for a worker that doesn't exist. The runtime's hill climber eventually injects extra threads (~118 s in the report) and the app recovers, but during the freeze nothing makes progress.Fix is to split up cold-start (no stale
cacheableKeyRingyet) and do the synchronous load on the first thread acquiring lock. Others will be waiting on lock as in the old implementation.Related #54675
Fixes #66380