
Intermittent 'IIS AspNetCore Module V2 throws failed to load coreclr' #50317

Closed
1 task done
indy-singh opened this issue Aug 24, 2023 · 16 comments
Labels
area-networking Includes servers, yarp, json patch, bedrock, websockets, http client factory, and http abstractions
feature-iis Includes: IIS, ANCM

Comments

@indy-singh

Is there an existing issue for this?

  • I have searched the existing issues

Describe the bug

Every now and then we are seeing a dotnet6 app fail to start for no apparent reason.

The first exception we saw is this:-

HTTP Error 500.37 - ASP.NET Core app failed to start within start-up time limit

So we naturally assumed that our app was taking a long time to start up, but this is not the case. We ruled that out by adding a logger to Program.cs and Startup.cs and waiting for the next time it happened. When it did happen, we could see the log line was never reached (despite it being the first thing the app should hit). This also happens on applications that have an established history of starting up within 10 seconds.
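For concreteness, the breadcrumb is roughly this (a minimal sketch - the log path is illustrative and the real code uses our logging library rather than File.AppendAllText):

using System;
using System.IO;
using Microsoft.AspNetCore.Hosting;
using Microsoft.Extensions.Hosting;

public static class Program
{
    public static void Main(string[] args)
    {
        // Hypothetical first-line breadcrumb: if Main is entered at all,
        // this line is written before anything else runs.
        File.AppendAllText(@"C:\logs\boot.log",
            $"{DateTime.UtcNow:O} entered Main{Environment.NewLine}");

        Host.CreateDefaultBuilder(args)
            .ConfigureWebHostDefaults(web => web.UseStartup<Startup>())
            .Build()
            .Run();
    }
}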

To be clear we are seeing this sporadically across different production servers and different applications, so that rules out it being related to one server or one application.

We are using in-process hosting on IIS.

We've turned on:-

<handlerSetting name="debugLevel" value="EVENTLOG,INFO"/>
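For context, that handler setting sits under the aspNetCore element in web.config, roughly like this (processPath/arguments are illustrative):

<aspNetCore processPath="dotnet" arguments=".\MyApp.dll" hostingModel="inprocess">
  <handlerSettings>
    <handlerSetting name="debugLevel" value="EVENTLOG,INFO" />
    <!-- a debugFile handler setting can also be added to route the same output to a file -->
  </handlerSettings>
</aspNetCore>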

And today it happened again and we were able to collect more information. But it still isn't clear why this is happening.

I've attached the files I've referenced from today's failure; all the apps are identical in configuration and all the servers are managed via DSC, so there are no configuration differences.

As far as we can tell, "something" is preventing the app from booting, but there is no evidence of what that is.

I used this to collect the relevant logs:-

Get-WinEvent -FilterHashtable @{ProviderName= "IIS AspNetCore Module V2"; LogName = "Application"; StartTime = (Get-Date -Date '2023-08-24');} | where {$_.message -like "*PID_HERE*"} | Select-Object -Property TimeCreated, Message | FL *

I've appended .txt to make these file uploadable.

dotnet-info.txt
Program.cs.txt
Startup.cs.txt
systeminfo.txt
Web.LIVE.config.txt

If you extract the message from the raw logs:-

IIS107.MAN.txt
IIS120.MAN.txt

107 = successful deploy (good.log)
120 = failed deploy (bad.log)

GOOD.log
BAD.log

You can see BAD.log has extra output at the top of the log (most recent).

We are unsure what the extra logs mean other than that the 2 min threshold was reached, because we know the app isn't taking more than 2 mins to boot (there are no logs on app start). As far as we can tell something "just" hangs.

If you grep for Setting current directory in both logs you'll see that bad.log just shows a 2 minute hang, but in good.log it continues as normal.
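(The PowerShell equivalent of that grep, if you're comparing the attached logs:)

Select-String -Path .\GOOD.log, .\BAD.log -Pattern 'Setting current directory'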

Any ideas?

Expected Behavior

Should be able to deploy without experiencing this error.

Steps To Reproduce

Unable to reproduce at all; it only happens in production, very intermittently/sporadically.

Exceptions (if any)

Message : [2023-08-24T14:23:26.309Z, PID: 35868] [aspnetcorev2.dll] ASPNET_CORE_GLOBAL_MODULE::Terminate
Message : [2023-08-24T14:23:25.291Z, PID: 35868] [aspnetcorev2_inprocess.dll] Stopping file watcher thread
Message : [2023-08-24T14:23:25.291Z, PID: 35868] [aspnetcorev2_inprocess.dll] Stopping file watching.
Message : [2023-08-24T14:23:25.291Z, PID: 35868] [aspnetcorev2.dll] Stopping application '/LM/W3SVC/1382/ROOT'
Message : [2023-08-24T14:23:25.290Z, PID: 35868] [aspnetcorev2.dll] ASPNET_CORE_GLOBAL_MODULE::OnGlobalStopListening
Message : [2023-08-24T14:23:25.250Z, PID: 35868] [aspnetcorev2_inprocess.dll] Starting file watcher thread
Message : [2023-08-24T14:23:25.249Z, PID: 35868] [aspnetcorev2_inprocess.dll] Starting app_offline monitoring in application 'c:\wwwroot\vehicledatalogic\blue\'
Message : [2023-08-24T14:23:25.249Z, PID: 35868] [aspnetcorev2_inprocess.dll] Failed HRESULT returned: 0x8007023e at 
          D:\a\_work\1\s\src\Servers\IIS\AspNetCoreModuleV2\InProcessRequestHandler\dllmain.cpp:131 
Message : [2023-08-24T14:23:25.249Z, PID: 35868] [aspnetcorev2_inprocess.dll] InvalidOperationException 'Managed server didn't initialize after 120000 ms.' caught at 
          D:\a\_work\1\s\src\Servers\IIS\AspNetCoreModuleV2\InProcessRequestHandler\inprocessapplication.cpp:489 
Message : [2023-08-24T14:23:25.248Z, PID: 35868] [aspnetcorev2_inprocess.dll] Event Log: 'Application '/LM/W3SVC/1382/ROOT' with physical root 
          'c:\wwwroot\vehicledatalogic\blue\' failed to load coreclr. Exception message:
          Managed server didn't initialize after 120000 ms.' 
          End Event Log Message.
Message : [2023-08-24T14:23:25.230Z, PID: 35868] [aspnetcorev2_inprocess.dll] AspNetCore InProcess Recycle Process on Demand
Message : [2023-08-24T14:23:25.229Z, PID: 35868] [aspnetcorev2_inprocess.dll] Stopping in-process worker thread
Message : [2023-08-24T14:23:25.215Z, PID: 35868] [aspnetcorev2_inprocess.dll] Canceling standard stream pipe reader
Message : [2023-08-24T14:23:25.215Z, PID: 35868] [aspnetcorev2_inprocess.dll] Event Log: 'Failed to gracefully shutdown application 
          'MACHINE/WEBROOT/APPHOST/VEHICLEDATALOGIC.BLUE'.' 
          End Event Log Message.
Message : [2023-08-24T14:23:25.214Z, PID: 35868] [aspnetcorev2_inprocess.dll] Clr thread wait ended: clrThreadExited: 0
Message : [2023-08-24T14:23:25.214Z, PID: 35868] [aspnetcorev2_inprocess.dll] Starting shutdown sequence 0
Message : [2023-08-24T14:23:25.214Z, PID: 35868] [aspnetcorev2_inprocess.dll] Stopping CLR
Message : [2023-08-24T14:23:25.214Z, PID: 35868] [aspnetcorev2_inprocess.dll] Stopping file watcher thread
Message : [2023-08-24T14:23:25.213Z, PID: 35868] [aspnetcorev2_inprocess.dll] Stopping file watching.

.NET Version

6.0.13

Anything else?

No response

@BrennanConroy
Member

It looks like we start the CLR thread

m_clrThread = std::thread(ClrThreadEntryPoint, context);

Which calls "main" (hostfxr_main or hostfxr_run_app)
auto const exitCode = context->m_hostFxr.Main(context->m_argc, context->m_argv.get());

That is what launches your app's dll as well as loads all the other dlls needed by your app.
The next log you should see from IIS is "In-process callbacks set", which comes from the managed IIS Server component calling back into native code:

_nativeApplication.RegisterCallbacks(

This would run after your logs, since you log before starting the app.

This implies something in hostfxr is taking a long time. You can enable tracing at that layer with COREHOST_TRACE=1
https://github.com/dotnet/runtime/blob/main/docs/design/features/host-tracing.md#trace-routing

@amcasey amcasey added area-networking Includes servers, yarp, json patch, bedrock, websockets, http client factory, and http abstractions and removed area-runtime labels Aug 25, 2023
@indy-singh
Author

@BrennanConroy ah great - we've made that change, it's just the waiting game now.

Cheers,
Indy

@indy-singh
Author

It just happened again. We had this config deployed:-

<environmentVariables>
	<environmentVariable name="COREHOST_TRACE" value="1"/>
	<environmentVariable name="COREHOST_TRACE_VERBOSITY" value="3"/>
	<environmentVariable name="COREHOST_TRACEFILE" value=".\is-jw-hostfxr.log"/>
</environmentVariables>

The logs are taken from the same host.

Unfortunately, it looks like there is nothing obviously wrong between GOOD vs BAD. We've bumped up COREHOST_TRACE_VERBOSITY to 4.

eventlog_2023-08-25T11-44-07.558Z.BAD.log
eventlog_2023-08-25T11-52-46.693Z.GOOD.log
is-jw-hostfxr.BAD.log
is-jw-hostfxr.GOOD.log

Commands used:-

PS C:\Windows\system32> Get-WinEvent -FilterHashtable @{ProviderName= "IIS AspNetCore Module V2"; LogName = "Application"; StartTime = (Get-Date -Date '2023-08-25');} | where {$_.message -like "*648*"} | Select-Object -Property Message | FL * | clip
PS C:\Windows\system32> Get-WinEvent -FilterHashtable @{ProviderName= "IIS AspNetCore Module V2"; LogName = "Application"; StartTime = (Get-Date -Date '2023-08-25');} | where {$_.message -like "*24192*"} | Select-Object -Property Message | FL * | clip

Cheers,
Indy

@indy-singh
Author

It just happened again, but what is interesting is that COREHOST_TRACEFILE wasn't overwritten as we'd expect during a successful deploy:-

demo102.man.is-jw-hostfxr.log

The failure logs in the event viewer are present as ever:-

logs.txt

Is there something before COREHOST_TRACEFILE gets written that could be hanging?

This particular failure happened on our DEMO env which has close to zero traffic, so we can rule out things like load and pending/long running requests.

I also tried running:-

Get-WinEvent -ListLog * `
| Where-Object {$_.RecordCount -gt 0} `
| Where-Object {$_.LogName -ne "Microsoft-Windows-SystemDataArchiver/Diagnostic"} `
| Where-Object {$_.LogName -ne "Microsoft-IIS-Configuration/Administrative"} `
| Where-Object {$_.LogName -ne "Microsoft-IIS-Configuration/Operational"} `
| Where-Object {$_.LastWriteTime -ge (Get-Date -Date '2023-08-28 19:10')} `
| Get-WinEvent `
| Where-Object {$_.TimeCreated -ge (Get-Date -Date '2023-08-28 19:10')} `
| Where-Object {$_.TimeCreated -lt (Get-Date -Date '2023-08-28 19:13')}

To see if anything else happens during that failure but nothing obvious jumps out.

Is there more logging I can turn on?

Cheers,
Indy

@amcasey amcasey added the feature-iis Includes: IIS, ANCM label Aug 28, 2023
@indy-singh
Author

Ah, the log files (is-jw-hostfxr.log etc.) I've been posting so far are from the build/release step and not from when the actual failure occurs.

I've just pushed a change that will ensure we get a separate log file from the one that is built on CI - we warm up the application as part of the CI steps.

Sorry for the confusion; back to the waiting game!

Cheers,
Indy

@indy-singh
Author

Just happened again.

event_logs-2023-08-29T09-31-08.545Z.log
iis102.man.LIVE-is-jw-hostfxr.log
iis112.man.LIVE-is-jw-hostfxr.log

iis102 = failed deploy
iis112 = successful deploy (so we can compare hostfxr logs)

They are both identical save for the timestamps.

So this points to something after hostfxr - is there more logging I can turn on after that has finished its job?

Also, do you think it is worth upgrading to the latest dotnet6 release? I had a brief look through the release notes and couldn't see anything related to this issue that might resolve it.

Cheers,
Indy

@BrennanConroy
Member

is there more logging I can turn on after that has finished it's job?

I'm not aware of any "logs". You could start looking into collecting event counters. But before that, there are a couple things you could try, assuming they are viable in your repro environment:

  • Set startupTimeLimit to a larger value than 120 (max 3600), and see if the app starts
  • Get a dump file of the application during the startup hang, see previous bullet for making it hang longer so you have time to get a dump
  • Verify your logger is logging and doesn't have any potentially long running operations (database access etc.)
  • Add a try catch around all your code in Main and make sure no exceptions are being thrown (see the sketch below)

A similar issue occurred in #48665 and it turned out to be their code occasionally throwing a SQL exception on startup.
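For that last bullet, something along these lines inside Main would do (a rough sketch - where you write the exception is up to you, as long as it survives the worker process being recycled):

try
{
    Host.CreateDefaultBuilder(args)
        .ConfigureWebHostDefaults(web => web.UseStartup<Startup>())
        .Build()
        .Run();
}
catch (Exception ex)
{
    // Persist the failure somewhere that outlives the w3wp process,
    // then rethrow so ANCM still sees the startup failure.
    File.AppendAllText(@"C:\logs\startup-crash.log",
        $"{DateTime.UtcNow:O} {ex}{Environment.NewLine}");
    throw;
}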

@indy-singh
Author

Assuming they are viable in your repro environment

So this is made difficult by the fact we don't have a repro. We only know after it has happened. And because we deploy ~200 times a day, there are many opportunities to experience this intermittent error.

You could start looking into collecting event counters.

We do monitor and collect event counters - is there one in particular you are thinking of?

Set startupTimeLimit to a larger value than 120 (max 3600), and see if the app starts

We've mulled over this, and we are against the idea because we think the app will continue to hang no matter what. But happy to bump it to, say, 10 minutes?
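(For reference, that would just be a one-attribute change on the aspNetCore element in our web.config - processPath/arguments illustrative here, and startupTimeLimit is in seconds:)

<aspNetCore processPath="dotnet" arguments=".\MyApp.dll" hostingModel="inprocess" startupTimeLimit="600" />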

Get a dump file of the application during the startup hang, see previous bullet for making it hang longer so you have time to get a dump

I'm not sure how to do that, given we only know after the error has occurred. We don't know when it is happening, only that it has happened (e.g. the deploy fails). We could set up DebugDiag or dotnet-dump and have it wait, I guess...
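Roughly what I have in mind, assuming dotnet-dump is installed on the box and we run this manually inside the ~2 minute hang window (picking the newest w3wp may not always be the hung one):

# Grab the most recently started IIS worker process and dump it.
$wpid = (Get-Process -Name w3wp | Sort-Object StartTime -Descending | Select-Object -First 1).Id
dotnet-dump collect -p $wpid -o "C:\dumps\w3wp_$wpid.dmp"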

Verify your logger is logging and doesn't have any potentially long running operations (database access etc.)

The log is 100% firing, because we can see it fire during successful deploys, and as we deploy ~200 times a day we have plenty of evidence:-

[screenshot attached]

So for our LIVE environment we have 14 boxes, and when it happens we can see the log on 13 of the 14 boxes in the deploy window.

Same for our DEMO environment, but there we only have 4 boxes. And given the logger is the first thing in Main, we are fairly sure it is something between dotnet and IIS. Do you know if there is specific IIS logging I can turn on?

Add a try catch around all your code in main and make sure no exceptions are being thrown

We've had that in the past but there is usually a breadcrumb to indicate that it is happening. We've never had an exception silently take down an app before. Happy to add a try/catch.

Cheers,
Indy

@BrennanConroy
Member

Do you know if there is specific IIS logging I can turn on?

From the logs it looks like there isn't anything else IIS is doing (except the timeout of course).

We do monitor and collect event counters - is there one in particular you are thinking of?

There are some startup counters:
https://learn.microsoft.com/dotnet/framework/performance/runtime-information-etw-events
Probably mostly just want to check whether or not they are present in the hung case.

@indy-singh
Author

indy-singh commented Aug 31, 2023

Just had it happen again, and this was with the try catch added in the main entry point. Absolutely nothing. It seems like it hangs before it even gets to invoke/start up/boot the app in question. Very bizarre. It's going to be a little while until we get more information. We need to set up a continuous ETW capture that might shed light on which events are actually firing.

Cheers,
Indy

@indy-singh
Author

Just a small update: this is still happening, but we haven't yet set up a continuous ETW trace.

Cheers,
Indy

@indy-singh
Author

Alright - before we invoke the app, we now kick off a PerfView trace:-

$dateTimeString = (Get-Date -Format o | ForEach-Object { $_ -replace ":", "." });
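# Circular 256 MB, 150 s max trace focused on w3wp.exe, with only the selected
# private-CLR startup events enabled (event IDs explained below).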
& C:\PerfView\PerfView.exe "/DataFile:C:\PerfView\$dateTimeString.etl" "/LogFile:C:\PerfView\$dateTimeString.log" "/BufferSizeMB:256" "/StackCompression" "/CircularMB:256" "/MaxCollectSec:150" "/KernelEvents:None" "/NoGui" "/FocusProcess:`"w3wp.exe`"" "/NoNGenRundown" "/onlyProviders=clrPrivate:0:Informational:@EventIDsToEnable=`"80 88 112`"" "/Zip:false" "/AcceptEULA" "collect"

The event IDs are taken from:-

https://github.com/microsoft/perfview/blob/b01f733927bb022b0d606064cf657f99d2453fd9/src/TraceEvent/Parsers/ClrPrivateTraceEventParser.cs#L1814

https://github.com/microsoft/perfview/blob/b01f733927bb022b0d606064cf657f99d2453fd9/src/TraceEvent/Parsers/ClrPrivateTraceEventParser.cs#L1822

https://github.com/microsoft/perfview/blob/b01f733927bb022b0d606064cf657f99d2453fd9/src/TraceEvent/Parsers/ClrPrivateTraceEventParser.cs#L1846

Back to the waiting game.

Cheers,
Indy

@indy-singh
Author

It finally happened again!

There are four files in the attached ZIP.

allium.green_2023-09-26T10.12.50.9728076+01.00.etl
allium.green_2023-09-26T10.12.50.9728076+01.00.log
organami.blue_2023-09-26T10.12.16.4892863+01.00.etl
organami.blue_2023-09-26T10.12.16.4892863+01.00.log

The files starting with allium are the bad example, and the organami files are the good example.

The PerfView trace shows that in the bad example the main method is never invoked.

To be clear, we only tell PerfView to look for @EventIDsToEnable="80 88 112 191"

80 = StartupEEStartupStartEventID
88 = StartupMainStartEventID
112 = StartupBindingStartEventID
191 = FailFastFailFastEventID

[screenshot attached]

To us that suggests the main method is never being invoked and that something before it hangs.

perfview_stuff.zip

Not really sure where to proceed from here - any ideas?

Cheers,
Indy

@BrennanConroy
Member

Asked around and the recommendation is to try and get a dump during the hang.

And for events:

I'd try everything in the StartupKeyword (0x80000000) for the private CLR provider and then the Loader (0x8) and Binder (0x4) keywords for the public CLR provider.

@indy-singh
Author

Yeah, happy to do that.

Is this the correct modification to the command line?

"/onlyProviders=clrPrivate:0x80000000:Informational" "/ClrEvents=Loader,Binder"

This would be the entire command:-

& C:\PerfView\PerfView.exe "/DataFile:C:\PerfView\$dateTimeString.etl" "/LogFile:C:\PerfView\$dateTimeString.log" "/BufferSizeMB:256" "/StackCompression" "/CircularMB:256" "/MaxCollectSec:150" "/KernelEvents:None" "/NoGui" "/FocusProcess:`"w3wp.exe`"" "/NoNGenRundown" "/onlyProviders=clrPrivate:0x80000000:Informational" "/ClrEvents=Loader,Binder" "/Zip:false" "/AcceptEULA" "collect";

I would love to get a dump, but it happens so infrequently that we can't take dumps as a matter of course. As we deploy 200+ times a day, the prod boxes would fill up with dumps pretty quickly.

Unfortunately, we only know after the app has taken 2+ mins to start.

Cheers,
Indy

@indy-singh
Author

indy-singh commented Dec 2, 2023

We are giving up on this issue; it still happens, but it occurs so infrequently and sporadically that it is borderline impossible to diagnose, especially without a repeatable reproduction.

Apologies for any wasted time.

Cheers,
Indy

@ghost ghost locked as resolved and limited conversation to collaborators Feb 7, 2024