
Intermittent 'IIS AspNetCore Module V2 throws failed to load coreclr' #50317

Closed
1 task done
indy-singh opened this issue Aug 24, 2023 · 16 comments
Labels
area-networking Includes servers, yarp, json patch, bedrock, websockets, http client factory, and http abstractions
feature-iis Includes: IIS, ANCM

Comments

@indy-singh

Is there an existing issue for this?

  • I have searched the existing issues

Describe the bug

Every now and then we are seeing a dotnet6 app fail to start for no apparent reason.

The first exception we saw is this:-

HTTP Error 500.37 - ASP.NET Core app failed to start within start-up time limit

So we naturally assumed that our app was taking a long time to start up, but this is not the case. We ruled that out by adding a logger to Program.cs and Startup.cs and waiting for the next time it happened. When it did happen, we could see the log line was never reached (despite it being the first thing the app should hit). This also happens on applications that have an established history of starting up within 10 seconds.
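For concreteness, the breadcrumb is roughly this (a minimal sketch - the log path is illustrative and the real code uses our logging library rather than File.AppendAllText):

using System;
using System.IO;
using Microsoft.AspNetCore.Hosting;
using Microsoft.Extensions.Hosting;

public static class Program
{
    public static void Main(string[] args)
    {
        // Hypothetical first-line breadcrumb: if Main is entered at all,
        // this line is written before anything else runs.
        File.AppendAllText(@"C:\logs\boot.log",
            $"{DateTime.UtcNow:O} entered Main{Environment.NewLine}");

        Host.CreateDefaultBuilder(args)
            .ConfigureWebHostDefaults(web => web.UseStartup<Startup>())
            .Build()
            .Run();
    }
}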

To be clear we are seeing this sporadically across different production servers and different applications, so that rules out it being related to one server or one application.

We are using in-process hosting on IIS.

We've turned on:-

<handlerSetting name="debugLevel" value="EVENTLOG,INFO"/>
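For context, that handler setting sits under the aspNetCore element in web.config, roughly like this (processPath/arguments are illustrative):

<aspNetCore processPath="dotnet" arguments=".\MyApp.dll" hostingModel="inprocess">
  <handlerSettings>
    <handlerSetting name="debugLevel" value="EVENTLOG,INFO" />
    <!-- a debugFile handler setting can also be added to route the same output to a file -->
  </handlerSettings>
</aspNetCore>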

And today it happened again and we were able to collect more information. But it still isn't clear why this is happening.

I've attached the files I've referenced from today's failure; all the apps are identical in configuration and all the servers are managed via DSC, so there are no configuration differences.

As far as we can tell, "something" is preventing the app from booting, but there is no evidence of what that is.

I used this to collect the relevant logs:-

Get-WinEvent -FilterHashtable @{ProviderName= "IIS AspNetCore Module V2"; LogName = "Application"; StartTime = (Get-Date -Date '2023-08-24');} | where {$_.message -like "*PID_HERE*"} | Select-Object -Property TimeCreated, Message | FL *

I've appended .txt to make these file uploadable.

dotnet-info.txt
Program.cs.txt
Startup.cs.txt
systeminfo.txt
Web.LIVE.config.txt

If you extract the message from the raw logs:-

IIS107.MAN.txt
IIS120.MAN.txt

107 = successful deploy (good.log)
120 = failed deploy (bad.log)

GOOD.log
BAD.log

You can see BAD.log has extra output at the top of the log (most recent).

We are unsure what the extra logs mean other than that the 2 min threshold was reached, because we know the app isn't taking more than 2 mins to boot (there are no logs on app start). As far as we can tell something "just" hangs.

If you grep for Setting current directory in both logs you'll see that bad.log just shows a 2 minute hang, but in good.log it continues as normal.
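(The PowerShell equivalent of that grep, if you're comparing the attached logs:)

Select-String -Path .\GOOD.log, .\BAD.log -Pattern 'Setting current directory'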

Any ideas?

Expected Behavior

Should be able to deploy without experiencing this error.

Steps To Reproduce

Unable to reproduce at all; it only happens in production, very intermittently/sporadically.

Exceptions (if any)

Message : [2023-08-24T14:23:26.309Z, PID: 35868] [aspnetcorev2.dll] ASPNET_CORE_GLOBAL_MODULE::Terminate
Message : [2023-08-24T14:23:25.291Z, PID: 35868] [aspnetcorev2_inprocess.dll] Stopping file watcher thread
Message : [2023-08-24T14:23:25.291Z, PID: 35868] [aspnetcorev2_inprocess.dll] Stopping file watching.
Message : [2023-08-24T14:23:25.291Z, PID: 35868] [aspnetcorev2.dll] Stopping application '/LM/W3SVC/1382/ROOT'
Message : [2023-08-24T14:23:25.290Z, PID: 35868] [aspnetcorev2.dll] ASPNET_CORE_GLOBAL_MODULE::OnGlobalStopListening
Message : [2023-08-24T14:23:25.250Z, PID: 35868] [aspnetcorev2_inprocess.dll] Starting file watcher thread
Message : [2023-08-24T14:23:25.249Z, PID: 35868] [aspnetcorev2_inprocess.dll] Starting app_offline monitoring in application 'c:\wwwroot\vehicledatalogic\blue\'
Message : [2023-08-24T14:23:25.249Z, PID: 35868] [aspnetcorev2_inprocess.dll] Failed HRESULT returned: 0x8007023e at 
          D:\a\_work\1\s\src\Servers\IIS\AspNetCoreModuleV2\InProcessRequestHandler\dllmain.cpp:131 
Message : [2023-08-24T14:23:25.249Z, PID: 35868] [aspnetcorev2_inprocess.dll] InvalidOperationException 'Managed server didn't initialize after 120000 ms.' caught at 
          D:\a\_work\1\s\src\Servers\IIS\AspNetCoreModuleV2\InProcessRequestHandler\inprocessapplication.cpp:489 
Message : [2023-08-24T14:23:25.248Z, PID: 35868] [aspnetcorev2_inprocess.dll] Event Log: 'Application '/LM/W3SVC/1382/ROOT' with physical root 
          'c:\wwwroot\vehicledatalogic\blue\' failed to load coreclr. Exception message:
          Managed server didn't initialize after 120000 ms.' 
          End Event Log Message.
Message : [2023-08-24T14:23:25.230Z, PID: 35868] [aspnetcorev2_inprocess.dll] AspNetCore InProcess Recycle Process on Demand
Message : [2023-08-24T14:23:25.229Z, PID: 35868] [aspnetcorev2_inprocess.dll] Stopping in-process worker thread
Message : [2023-08-24T14:23:25.215Z, PID: 35868] [aspnetcorev2_inprocess.dll] Canceling standard stream pipe reader
Message : [2023-08-24T14:23:25.215Z, PID: 35868] [aspnetcorev2_inprocess.dll] Event Log: 'Failed to gracefully shutdown application 
          'MACHINE/WEBROOT/APPHOST/VEHICLEDATALOGIC.BLUE'.' 
          End Event Log Message.
Message : [2023-08-24T14:23:25.214Z, PID: 35868] [aspnetcorev2_inprocess.dll] Clr thread wait ended: clrThreadExited: 0
Message : [2023-08-24T14:23:25.214Z, PID: 35868] [aspnetcorev2_inprocess.dll] Starting shutdown sequence 0
Message : [2023-08-24T14:23:25.214Z, PID: 35868] [aspnetcorev2_inprocess.dll] Stopping CLR
Message : [2023-08-24T14:23:25.214Z, PID: 35868] [aspnetcorev2_inprocess.dll] Stopping file watcher thread
Message : [2023-08-24T14:23:25.213Z, PID: 35868] [aspnetcorev2_inprocess.dll] Stopping file watching.

.NET Version

6.0.13

Anything else?

No response

@BrennanConroy
Member

It looks like we start the CLR thread

m_clrThread = std::thread(ClrThreadEntryPoint, context);

Which calls "main" (hostfxr_main or hostfxr_run_app)
auto const exitCode = context->m_hostFxr.Main(context->m_argc, context->m_argv.get());

That is what launches your app's dll as well as loads all the other dlls needed by your app.
The next log you should see from IIS is "In-process callbacks set", which comes from the managed IIS Server component calling back into native code:

_nativeApplication.RegisterCallbacks(

This would run after your logs, since you log before starting the app.

This implies something in hostfxr is taking a long time. You can enable tracing at that layer with COREHOST_TRACE=1
https://github.com/dotnet/runtime/blob/main/docs/design/features/host-tracing.md#trace-routing

@amcasey amcasey added area-networking Includes servers, yarp, json patch, bedrock, websockets, http client factory, and http abstractions and removed area-runtime labels Aug 25, 2023
@indy-singh
Author

@BrennanConroy ah great - we've made that change, it's just the waiting game now.

Cheers,
Indy

@indy-singh
Author

It just happened again. We had this config deployed:-

<environmentVariables>
	<environmentVariable name="COREHOST_TRACE" value="1"/>
	<environmentVariable name="COREHOST_TRACE_VERBOSITY" value="3"/>
	<environmentVariable name="COREHOST_TRACEFILE" value=".\is-jw-hostfxr.log"/>
</environmentVariables>

The logs are taken from the same host.

Unfortunately, it looks like there is nothing obviously wrong between GOOD vs BAD. We've bumped up COREHOST_TRACE_VERBOSITY to 4.

eventlog_2023-08-25T11-44-07.558Z.BAD.log
eventlog_2023-08-25T11-52-46.693Z.GOOD.log
is-jw-hostfxr.BAD.log
is-jw-hostfxr.GOOD.log

Commands used:-

PS C:\Windows\system32> Get-WinEvent -FilterHashtable @{ProviderName= "IIS AspNetCore Module V2"; LogName = "Application"; StartTime = (Get-Date -Date '2023-08-25');} | where {$_.message -like "*648*"} | Select-Object -Property Message | FL * | clip
PS C:\Windows\system32> Get-WinEvent -FilterHashtable @{ProviderName= "IIS AspNetCore Module V2"; LogName = "Application"; StartTime = (Get-Date -Date '2023-08-25');} | where {$_.message -like "*24192*"} | Select-Object -Property Message | FL * | clip

Cheers,
Indy

@indy-singh
Author

It just happened again, but what is interesting is that COREHOST_TRACEFILE wasn't overwritten as we'd expect during a successful deploy:-

demo102.man.is-jw-hostfxr.log

The failure logs in the event viewer are present as ever:-

logs.txt

Is there something before COREHOST_TRACEFILE gets written that could be hanging?

This particular failure happened on our DEMO env which has close to zero traffic, so we can rule out things like load and pending/long running requests.

I also tried running:-

Get-WinEvent -ListLog * `
| Where-Object {$_.RecordCount -gt 0} `
| Where-Object {$_.LogName -ne "Microsoft-Windows-SystemDataArchiver/Diagnostic"} `
| Where-Object {$_.LogName -ne "Microsoft-IIS-Configuration/Administrative"} `
| Where-Object {$_.LogName -ne "Microsoft-IIS-Configuration/Operational"} `
| Where-Object {$_.LastWriteTime -ge (Get-Date -Date '2023-08-28 19:10')} `
| Get-WinEvent `
| Where-Object {$_.TimeCreated -ge (Get-Date -Date '2023-08-28 19:10')} `
| Where-Object {$_.TimeCreated -lt (Get-Date -Date '2023-08-28 19:13')}

To see if anything else happens during that failure but nothing obvious jumps out.

Is there more logging I can turn on?

Cheers,
Indy

@amcasey amcasey added the feature-iis Includes: IIS, ANCM label Aug 28, 2023
@indy-singh
Author

Ah, the log files (is-jw-hostfxr.log etc.) I've been posting so far are from the build/release step and not from when the actual failure occurs.

I've just pushed a change that will ensure we get a separate log file from the one that is built on CI - we warm up the application as part of the CI steps.

Sorry for the confusion; back to the waiting game!

Cheers,
Indy

@indy-singh
Author

Just happened again.

event_logs-2023-08-29T09-31-08.545Z.log
iis102.man.LIVE-is-jw-hostfxr.log
iis112.man.LIVE-is-jw-hostfxr.log

iis102 = failed deploy
iis112 = successful deploy (so we can compare hostfxr logs)

They are both identical save for the timestamps.

So this points to something after hostfxr - is there more logging I can turn on after that has finished its job?

Also, do you think it is worth upgrading to the latest dotnet6 release? I had a brief look through the release notes and couldn't see anything related to this issue that might resolve it.

Cheers,
Indy

@BrennanConroy
Member

is there more logging I can turn on after that has finished it's job?

I'm not aware of any "logs". You could start looking into collecting event counters. But before that, there are a couple things you could try, assuming they are viable in your repro environment:

  • Set startupTimeLimit to a larger value than 120 (max 3600), and see if the app starts
  • Get a dump file of the application during the startup hang, see previous bullet for making it hang longer so you have time to get a dump
  • Verify your logger is logging and doesn't have any potentially long running operations (database access etc.)
  • Add a try catch around all your code in Main and make sure no exceptions are being thrown (see the sketch below)

A similar issue occurred in #48665 and it turned out to be their code occasionally throwing a SQL exception on startup.
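For that last bullet, something along these lines inside Main would do (a rough sketch - where you write the exception is up to you, as long as it survives the worker process being recycled):

try
{
    Host.CreateDefaultBuilder(args)
        .ConfigureWebHostDefaults(web => web.UseStartup<Startup>())
        .Build()
        .Run();
}
catch (Exception ex)
{
    // Persist the failure somewhere that outlives the w3wp process,
    // then rethrow so ANCM still sees the startup failure.
    File.AppendAllText(@"C:\logs\startup-crash.log",
        $"{DateTime.UtcNow:O} {ex}{Environment.NewLine}");
    throw;
}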

@indy-singh
Author

Assuming they are viable in your repro environment

So this is made difficult by the fact we don't have a repro. We only know after it has happened. And because we deploy ~200 times a day, there are many opportunities to experience this intermittent error.

You could start looking into collecting event counters.

We do monitor and collect event counters - is there one in particular you are thinking of?

Set startupTimeLimit to a larger value than 120 (max 3600), and see if the app starts

We've mulled over this, and we are against the idea because we think the app will continue to hang no matter what. But happy to bump it to, say, 10 minutes?
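(For reference, that would just be a one-attribute change on the aspNetCore element in our web.config - processPath/arguments illustrative here, and startupTimeLimit is in seconds:)

<aspNetCore processPath="dotnet" arguments=".\MyApp.dll" hostingModel="inprocess" startupTimeLimit="600" />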

Get a dump file of the application during the startup hang, see previous bullet for making it hang longer so you have time to get a dump

I'm not sure how to do that, given we only know after the error has occurred. We don't know when it is happening, only that it has happened (e.g. the deploy fails). We could set up DebugDiag or dotnet-dump and have it wait, I guess...
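Roughly what I have in mind, assuming dotnet-dump is installed on the box and we run this manually inside the ~2 minute hang window (picking the newest w3wp may not always be the hung one):

# Grab the most recently started IIS worker process and dump it.
$wpid = (Get-Process -Name w3wp | Sort-Object StartTime -Descending | Select-Object -First 1).Id
dotnet-dump collect -p $wpid -o "C:\dumps\w3wp_$wpid.dmp"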

Verify your logger is logging and doesn't have any potentially long running operations (database access etc.)

The log is 100% firing, because we can see it fire during successful deploys, and as we deploy ~200 times a day we have plenty of evidence:-

[screenshot attached]

So for our LIVE environment we have 14 boxes, and when it happens we can see the log on 13 of the 14 boxes in the deploy window.

Same for our DEMO environment, but there we only have 4 boxes. And given the logger is the first thing in Main, we are fairly sure it is something between dotnet and IIS. Do you know if there is specific IIS logging I can turn on?

Add a try catch around all your code in main and make sure no exceptions are being thrown

We've had that in the past but there is usually a breadcrumb to indicate that it is happening. We've never had an exception silently take down an app before. Happy to add a try/catch.

Cheers,
Indy

@BrennanConroy
Member

Do you know if there is specific IIS logging I can turn on?

From the logs it looks like there isn't anything else IIS is doing (except the timeout of course).

We do monitor and collect event counters - is there one in particular you are thinking of?

There are some startup counters:
https://learn.microsoft.com/dotnet/framework/performance/runtime-information-etw-events
Probably mostly just want to check whether or not they are present in the hung case.

@indy-singh
Author

indy-singh commented Aug 31, 2023

Just had it happen again, and this was with the try catch added in the main entry point. Absolutely nothing. It seems like it hangs before it even gets to invoke/start up/boot the app in question. Very bizarre. It's going to be a little while until we get more information. We need to set up a continuous ETW capture that might shed light on which events are actually firing.

Cheers,
Indy

@indy-singh
Author

Just a small update: this is still happening, but we haven't yet set up a continuous ETW trace.

Cheers,
Indy

@indy-singh
Author

Alright - before we invoke the app, we now kick off a PerfView trace:-

$dateTimeString = (Get-Date -Format o | ForEach-Object { $_ -replace ":", "." });
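# Circular 256 MB, 150 s max trace focused on w3wp.exe, with only the selected
# private-CLR startup events enabled (event IDs explained below).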
& C:\PerfView\PerfView.exe "/DataFile:C:\PerfView\$dateTimeString.etl" "/LogFile:C:\PerfView\$dateTimeString.log" "/BufferSizeMB:256" "/StackCompression" "/CircularMB:256" "/MaxCollectSec:150" "/KernelEvents:None" "/NoGui" "/FocusProcess:`"w3wp.exe`"" "/NoNGenRundown" "/onlyProviders=clrPrivate:0:Informational:@EventIDsToEnable=`"80 88 112`"" "/Zip:false" "/AcceptEULA" "collect"

The event IDs are taken from:-

https://github.com/microsoft/perfview/blob/b01f733927bb022b0d606064cf657f99d2453fd9/src/TraceEvent/Parsers/ClrPrivateTraceEventParser.cs#L1814

https://github.com/microsoft/perfview/blob/b01f733927bb022b0d606064cf657f99d2453fd9/src/TraceEvent/Parsers/ClrPrivateTraceEventParser.cs#L1822

https://github.com/microsoft/perfview/blob/b01f733927bb022b0d606064cf657f99d2453fd9/src/TraceEvent/Parsers/ClrPrivateTraceEventParser.cs#L1846

Back to the waiting game.

Cheers,
Indy

@indy-singh
Author

It finally happened again!

There are four files in the attached ZIP.

allium.green_2023-09-26T10.12.50.9728076+01.00.etl
allium.green_2023-09-26T10.12.50.9728076+01.00.log
organami.blue_2023-09-26T10.12.16.4892863+01.00.etl
organami.blue_2023-09-26T10.12.16.4892863+01.00.log

The files starting with allium are the bad example, and the organami files are the good example.

The PerfView trace shows that in the bad example the main method is never invoked.

To be clear, we only tell PerfView to look for @EventIDsToEnable="80 88 112 191"

80 = StartupEEStartupStartEventID
88 = StartupMainStartEventID
112 = StartupBindingStartEventID
191 = FailFastFailFastEventID

[screenshot attached]

To us that suggests the main method is never being invoked and that something before it hangs.

perfview_stuff.zip

Not really sure where to proceed from here - any ideas?

Cheers,
Indy

@BrennanConroy
Member

Asked around and the recommendation is to try and get a dump during the hang.

And for events:

I'd try everything in the StartupKeyword (0x80000000) for the private CLR provider and then the Loader (0x8) and Binder (0x4) keywords for the public CLR provider.

@indy-singh
Author

Yeah, happy to do that.

Is this the correct modification to the command line?

"/onlyProviders=clrPrivate:0x80000000:Informational" "/ClrEvents=Loader,Binder"

This would be the entire command:-

& C:\PerfView\PerfView.exe "/DataFile:C:\PerfView\$dateTimeString.etl" "/LogFile:C:\PerfView\$dateTimeString.log" "/BufferSizeMB:256" "/StackCompression" "/CircularMB:256" "/MaxCollectSec:150" "/KernelEvents:None" "/NoGui" "/FocusProcess:`"w3wp.exe`"" "/NoNGenRundown" "/onlyProviders=clrPrivate:0x80000000:Informational" "/ClrEvents=Loader,Binder" "/Zip:false" "/AcceptEULA" "collect";

I would love to get a dump, but it happens so infrequently that we can't take dumps as a matter of course. As we deploy 200+ times a day, the prod boxes would fill up with dumps pretty quickly.

Unfortunately, we only know after the app has taken 2+ mins to start.

Cheers,
Indy

@indy-singh
Author

indy-singh commented Dec 2, 2023

We are giving up on this issue; it still happens, but it occurs so infrequently and sporadically that it is borderline impossible to diagnose, especially without a repeatable reproduction.

Apologies for any wasted time.

Cheers,
Indy

@ghost ghost locked as resolved and limited conversation to collaborators Feb 7, 2024