
Possible deadlock in ConfigurationManager in .NET 6 #61747

Closed
martincostello opened this issue Nov 17, 2021 · 24 comments · Fixed by #62209

Comments

@martincostello
Member

martincostello commented Nov 17, 2021

Description

We have a production HTTP application that we recently updated from .NET 6.0.0-rc.2 to .NET 6.0.0, and we have observed a number of incidents where the application suddenly became unresponsive to HTTP requests. This caused the application's health checks to fail and the instances to be taken out of service.

Having dug into this over the last day or so (#60654 (comment)), I think I've tracked this down to a deadlock that occurs in ConfigurationManager if the application's configuration is manually reloaded at runtime.

Overall, the issue appears to be that if an options class is bound to configuration via a type such as IOptionsMonitor<T>, and a change callback is triggered by IConfigurationRoot.Reload(), then the application deadlocks while trying to get the configuration values to bind to the options class: the lock taken around getting an option's value ends up waiting for the lock acquired during the reload.

I've captured a memory dump from the application after triggering the issue in our staging environment, and a screenshot of the Parallel Stacks window from Visual Studio taken from inspecting the memory dump is below.

[Screenshot: Visual Studio Parallel Stacks view from the memory dump, showing the deadlocked threads]

Thread 852 has called IConfigurationRoot.Reload(), which is blocked on thread 3516 waiting on an options monitor callback for an options class.

Thread 3516 is deadlocked on a call to IConfiguration[string] to create an options class.

IConfiguration and IConfigurationRoot are both the same instance of ConfigurationManager.

I haven't ruled out this being a latent bug in our application that .NET 6 has brought to the surface, but we've only seen the issue with .NET 6.0.0. We've reverted the application to .NET 6.0.0-rc.2 for the time being, and the problem has gone away.

I figured I would log the issue now in case someone looking at it can quickly find the root cause, while I continue working to repro this independently or determine that it's actually a bug in our app.

Reproduction Steps

To reproduce this issue, follow the instructions in this repo: https://github.com/martincostello/ConfigurationManagerDeadlock

A conceptual repro is to perform the following two actions concurrently in an app using WebApplicationBuilder, so that ConfigurationManager is the app's IConfigurationRoot (a sketch follows below):

  1. In a loop, reload the IConfigurationRoot from an HTTP request;
  2. In a loop, issue an HTTP request that resolves an IOptionsMonitor<T> or IOptionsSnapshot<T> (bound to configuration) from the service provider.

After a period of time (in testing I found this happened within 10 minutes), the application will deadlock.
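
A minimal sketch of what those two endpoints might look like (hedged: MyOptions, the configuration section name, and the routes are placeholder names, not the actual repro's code):

using Microsoft.Extensions.Options;

var builder = WebApplication.CreateBuilder(args);

// Bind an options class to configuration so that building MyOptions reads values
// back through the app's ConfigurationManager.
builder.Services.Configure<MyOptions>(builder.Configuration.GetSection("MyOptions"));

var app = builder.Build();

// 1. Hit this endpoint in a loop to reload the IConfigurationRoot.
app.MapGet("/reload", (IConfiguration configuration) =>
{
    ((IConfigurationRoot)configuration).Reload();
    return Results.Ok();
});

// 2. Hit this endpoint in a loop to resolve options bound to configuration.
app.MapGet("/value", (IOptionsMonitor<MyOptions> options) => options.CurrentValue.Value);

app.Run();

public class MyOptions
{
    public string? Value { get; set; }
}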

Expected behavior

Configuration reloads successfully and does not deadlock requests in flight.

Actual behavior

The application deadlocks the thread reloading the configuration and other threads accessing the configuration to bind options.

Regression?

Compared to using IConfigurationRoot directly with Program/Startup (i.e. a non-Minimal API), yes.

Known Workarounds

I'm not aware of any workarounds at this point, other than not using Minimal APIs when doing configuration reloading at runtime.

Configuration

  • .NET 6.0.0-rtm.21522.10
  • Microsoft Windows 10.0.17763
  • Runtime version 6.0.0-rtm.21522.10+4822e3c3aa77eb82b2fb33c9321f923cf11ddde6
  • ASP.NET Core version 6.0.0+ae1a6cbe225b99c0bf38b7e31bf60cb653b73a52

Other information

No response

dotnet-issue-labeler bot added the area-System.Configuration and untriaged labels Nov 17, 2021
@ghost

ghost commented Nov 17, 2021

Tagging subscribers to this area: @safern
See info in area-owners.md if you want to be subscribed.

@ghost

ghost commented Nov 17, 2021

Tagging subscribers to this area: @maryamariyan, @safern
See info in area-owners.md if you want to be subscribed.

@vcsjones
Member

To reproduce this issue, follow the instructions in this repo: #61747

Did you mean to link to a repository here? This appears to link to the current issue.

@martincostello
Member Author

Thanks Kevin - fat-fingered copy-paste.

@slang25
Contributor

slang25 commented Nov 18, 2021

I've been having a play with this; the repro repo is super useful for recreating it.

There are 2 resources involved in this deadlock:

  • ConfigurationManager's _providerLock object field, which is used to guard most ConfigurationManager methods.
  • The Lazy<MyOptions> stored in OptionsCache. Its .Value property uses the default ExecutionAndPublication behavior, where the first thread creates the value and other threads wait for that value to become available.

The deadlock

  • /reload route:
    ConfigurationManager.Reload() → lock (_providerLock) → OptionsCache → Lazy<MyOptions>.Value + wait
  • /value route:
    OptionsCache → Lazy<MyOptions>.Value + execute → ConfigurationManager["MyOptions"] → lock (_providerLock)

I presume the reload path is fine when Lazy<MyOptions>.Value is executed on the reloading thread itself, since the lock is reentrant on the same thread.

Perhaps the Lazy<TOptions> in OptionsCache should use the PublicationOnly behaviour on this line. The advantage is that no threads will be left waiting while the value is uninitialized, so no deadlock; the disadvantage is that we'd potentially create multiple instances of TOptions while the value is uninitialized, which is wasteful but seems acceptable given it should only happen briefly after startup or a config reload.
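
For reference, a minimal sketch of the difference between the two Lazy<T> thread-safety modes (illustrative only; the real OptionsCache code is more involved):

using System;
using System.Threading;

class LazyModesSketch
{
    // Stand-in for the factory that binds TOptions from configuration.
    static MyOptions CreateOptions() => new MyOptions();

    static void Main()
    {
        // ExecutionAndPublication (the current behaviour): the first thread runs the factory
        // and every other thread blocks until it finishes -- which is what allows the deadlock
        // when the factory itself needs a lock held by one of the waiting threads.
        var blocking = new Lazy<MyOptions>(CreateOptions, LazyThreadSafetyMode.ExecutionAndPublication);

        // PublicationOnly (the suggested alternative): a thread that finds the value
        // uninitialized runs the factory itself; the first published result wins and any
        // extra instances are discarded, so no thread ever waits on another thread's factory.
        var nonBlocking = new Lazy<MyOptions>(CreateOptions, LazyThreadSafetyMode.PublicationOnly);

        Console.WriteLine(blocking.Value);
        Console.WriteLine(nonBlocking.Value);
    }
}

class MyOptions { }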

@martincostello
Member Author

Nice investigation, Stu. I wonder, though, whether there's something in the lock implementation in ConfigurationManager that should be reworked.

The class summary says it should be frozen once Build() is called, but it doesn't seem to do that. Also, having all access to the configuration values guarded by a lock seems like a performance issue for an app just using vanilla IConfiguration.

Any thoughts on this @halter73 ?

@martincostello
Member Author

From a customer/production perspective, this is an example of the behaviour we saw when the issue first arose.

[Screenshot: request-throughput graph from monitoring, showing traffic flatline after the configuration reload]

The application was in a steady state serving requests (each coloured line is a different HTTP endpoint). At the point marked by the red arrow, the application's configuration was reloaded in each of the 3 AWS EC2 instances in service at the time, in response to a change made in our remote configuration store.

The application then very quickly deadlocked in each instance, with health checks eventually also deadlocking, leading to our load balancer marking all the instances as unhealthy and taking them out of service. Traffic then flatlined until new instances came into service to take up the load.

@halter73
Member

It looks like it's very easy to have application code run while the _providerLock is held in Reload()/Load(), since many IConfigurationProviders will trigger IChangeToken callbacks inline. We might not have seen this deadlock before because, without something like OptionsCache's Lazy<TOptions>, the lock would be reentrant.

I think the best solution might be to make a copy-on-write version of the _providers list that's used by both the indexer and Reload() without locking. We would then only lock over the real _providers list for configuration source changes and disposal. Once the source modifications complete, we could update the copy.
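
A rough sketch of that copy-on-write idea (hypothetical type and member names; this is not the actual change made in #62209):

using System;
using System.Collections.Generic;
using Microsoft.Extensions.Configuration;

internal sealed class CopyOnWriteProvidersSketch
{
    private readonly object _providerLock = new();
    private readonly List<IConfigurationProvider> _providers = new();

    // Snapshot used by the indexer and Reload() without taking the lock.
    private volatile IConfigurationProvider[] _providersSnapshot = Array.Empty<IConfigurationProvider>();

    public void AddProvider(IConfigurationProvider provider)
    {
        // Only source changes (and dispose) take the lock; once the mutation is done,
        // a fresh copy is published for lock-free readers.
        lock (_providerLock)
        {
            _providers.Add(provider);
            _providersSnapshot = _providers.ToArray();
        }
    }

    public string? GetValue(string key)
    {
        // No lock here, so change callbacks triggered during a reload can read values
        // concurrently. Later providers override earlier ones, so iterate in reverse.
        var providers = _providersSnapshot;
        for (int i = providers.Length - 1; i >= 0; i--)
        {
            if (providers[i].TryGet(key, out var value))
            {
                return value;
            }
        }

        return null;
    }

    public void Reload()
    {
        foreach (var provider in _providersSnapshot)
        {
            provider.Load();
        }
    }
}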

@halter73
Member

halter73 commented Nov 18, 2021

Thanks for the great repro @martincostello . One possible workaround for now if you control the code calling Reload() would be to change the call to Reload() to the following:

foreach (var provider in ((IConfigurationRoot)Configuration).Providers)
{
    provider.Load();
}

This isn't exactly the same as Reload() since it doesn't trigger the reload token directly. But if the providers trigger their own reload tokens, it should be fairly equivalent.

@martincostello
Member Author

Thanks @halter73 - I'll try that workaround tomorrow and see if it provides an alternative.

In the meantime, today we backed out Minimal Hosting and went back to Program/Startup instead, which was a relatively minor change given how our code is structured.

halter73 self-assigned this Nov 18, 2021
@martincostello
Member Author

Yep, the workaround works for our use case. Thanks!

ghost added the in-pr label Nov 30, 2021
eerhardt removed the untriaged label Dec 1, 2021
eerhardt added this to the 7.0.0 milestone Dec 1, 2021
ghost removed the in-pr label Jan 14, 2022
@martincostello
Member Author

Will the fix be backported to 6.0?

@halter73
Member

I opened a backport PR at #63816. If you're aware of anyone else who has run into this issue, that might make it easier to pass the servicing bar.

@halter73
Member

@martincostello Is it possible for you to verify the ConfigurationManager changes made in main, either by trying SDK version 7.0.100-alpha.1.22068.3 or later, or by using the package from https://pkgs.dev.azure.com/dnceng/public/_packaging/dotnet7/nuget/v3/index.json?

@martincostello
Member Author

Sure, I'll try this out tomorrow in our staging environment.

@martincostello
Member Author

martincostello commented Jan 19, 2022

Looks like the installer repo still isn't producing the zip packages for the SDK on main for v7 😞

The required version of the .NET Core SDK is not installed. Expected 7.0.100-alpha.1.22068.3.
dotnet-install: Note that the intended use of this script is for Continuous Integration (CI) scenarios, where:
dotnet-install: - The SDK needs to be installed without user interaction and without admin rights.
dotnet-install: - The SDK installation doesn't need to persist across multiple CI runs.
dotnet-install: To set up a development environment or to run apps, use installers rather than this script. Visit https://dotnet.microsoft.com/download to get the installer.

dotnet-install: Downloading primary link https://dotnetcli.azureedge.net/dotnet/Sdk/7.0.100-alpha.1.22068.3/dotnet-sdk-7.0.100-alpha.1.22068.3-win-x64.zip
dotnet-install: The resource at https://dotnetcli.azureedge.net/dotnet/Sdk/7.0.100-alpha.1.22068.3/dotnet-sdk-7.0.100-alpha.1.22068.3-win-x64.zip is not available.
dotnet-install: Downloading legacy link https://dotnetcli.azureedge.net/dotnet/Sdk/7.0.100-alpha.1.22068.3/dotnet-dev-win-x64.7.0.100-alpha.1.22068.3.zip
dotnet-install: The resource at https://dotnetcli.azureedge.net/dotnet/Sdk/7.0.100-alpha.1.22068.3/dotnet-dev-win-x64.7.0.100-alpha.1.22068.3.zip is not available.

I can test this out locally with the installer exe, but our build-deploy process relies on acquiring the SDK using the dotnet-install scripts, so until that's resolved I won't be able to put it into one of our dev/QA/staging envs for a more detailed exercise.

@martincostello
Member Author

Using SDK version 7.0.100-alpha.1.22068.17, I ran a variant of the steps from the repro app against the real application that surfaced this issue locally for 15 minutes, with no issues.

If necessary I can do some further validation of the fix in our staging environment under more load when the issues with the v7 ZIP availability in the dotnet/installer repo are resolved.

@halter73
Member

Thanks @martincostello. If you could do further validation that would be great. It would help give us more confidence that it fixes the deadlock without causing other regressions.

@martincostello
Member Author

No problem. Is the installer team looking into the issue? It's been broken since before the holidays.

There was an issue open about it, but it got closed once the exe started working: dotnet/installer#12850

@halter73
Member

Is the installer at https://aka.ms/dotnet/7.0.1xx/daily/dotnet-sdk-win-x64.exe not up to date? I'm seeing build number 7.0.100-alpha.1.22069.1 which looks recent.

@martincostello
Member Author

The exes are fine, but the zips 404 when using the CI scripts.

@halter73
Member

Oh! I should have read your previous comments more carefully. If you want to download 7.x previews using the dotnet-install scripts, you need to add -AzureFeed https://dotnetbuilds.azureedge.net/public.

This was fixed by dotnet/install-scripts#233, but it doesn't look like the scripts at dot.net/v1/dotnet-install.sh or dot.net/v1/dotnet-install.ps1 have been updated yet. @bekir-ozturk do you have any idea how long it will take to get the official scripts updated?

@martincostello
Member Author

Aha! Cool, I'll add that argument to our bootstrapper script tomorrow morning, then I should be able to get something up into our staging environment for a load/soak test.

@martincostello
Member Author

@halter73 I deployed a .NET 7 build of our application with the fix to our staging environment today for 2 hours, and there were no functional or performance issues observed.

Our load test sends constant synthetic load at the application, and in addition I fired requests at it to reload the configuration continuously in a loop for an hour. We didn't observe any deadlocks during that period, whereas with .NET 6 we could reproduce the deadlock under these circumstances within a few minutes.

ghost locked as resolved and limited conversation to collaborators Feb 23, 2022