Intermittent 'IIS AspNetCore Module V2 throws failed to load coreclr' #50317
Comments
It looks like we start the CLR thread at aspnetcore/src/Servers/IIS/AspNetCoreModuleV2/InProcessRequestHandler/inprocessapplication.cpp, line 357 in 5bdb8b4.
That calls "main" (hostfxr_main or hostfxr_run_app) at aspnetcore/src/Servers/IIS/AspNetCoreModuleV2/InProcessRequestHandler/inprocessapplication.cpp, line 511 in 5bdb8b4.
That is what launches your app's DLL and loads all the other DLLs your app needs.
That would run after your logs, since you have logs from before the app starts. This implies something in hostfxr is taking a long time. You can enable tracing at that layer with
@BrennanConroy ah great - we've made that change, so it's just a waiting game now. Cheers,
It just happened again. We had this config deployed:-
<environmentVariables>
  <environmentVariable name="COREHOST_TRACE" value="1"/>
  <environmentVariable name="COREHOST_TRACE_VERBOSITY" value="3"/>
  <environmentVariable name="COREHOST_TRACEFILE" value=".\is-jw-hostfxr.log"/>
</environmentVariables>
The logs are taken from the same host. Unfortunately, it looks like there is nothing obviously wrong between GOOD vs BAD. We've bumped up eventlog_2023-08-25T11-44-07.558Z.BAD.log
Commands used:-
PS C:\Windows\system32> Get-WinEvent -FilterHashtable @{ProviderName= "IIS AspNetCore Module V2"; LogName = "Application"; StartTime = (Get-Date -Date '2023-08-25');} | where {$_.message -like "*648*"} | Select-Object -Property Message | FL * | clip
PS C:\Windows\system32> Get-WinEvent -FilterHashtable @{ProviderName= "IIS AspNetCore Module V2"; LogName = "Application"; StartTime = (Get-Date -Date '2023-08-25');} | where {$_.message -like "*24192*"} | Select-Object -Property Message | FL * | clip
Cheers,
It just happened again, but what is interesting is that COREHOST_TRACEFILE wasn't overwritten as we'd expect during a successful deploy:-
The failure logs in the event viewer are present as ever:-
Is there something before COREHOST_TRACEFILE that could be hanging? This particular failure happened on our DEMO env, which has close to zero traffic, so we can rule out things like load and pending/long-running requests.
I also tried running:-
Get-WinEvent -ListLog * `
| Where-Object {$_.RecordCount -gt 0} `
| Where-Object {$_.LogName -ne "Microsoft-Windows-SystemDataArchiver/Diagnostic"} `
| Where-Object {$_.LogName -ne "Microsoft-IIS-Configuration/Administrative"} `
| Where-Object {$_.LogName -ne "Microsoft-IIS-Configuration/Operational"} `
| Where-Object {$_.LastWriteTime -ge (Get-Date -Date '2023-08-28 19:10')} `
| Get-WinEvent `
| Where-Object {$_.TimeCreated -ge (Get-Date -Date '2023-08-28 19:10')} `
| Where-Object {$_.TimeCreated -lt (Get-Date -Date '2023-08-28 19:13')}
to see if anything else happens during that failure, but nothing obvious jumps out. Is there more logging I can turn on?
Cheers,
Ah, the log files ( I've just pushed a change that will ensure that we get a separate log file from the one that is built on CI - we warm up the application as a part of the CI steps. Sorry for the confusion; back to the waiting game! Cheers,
Just happened again.
event_logs-2023-08-29T09-31-08.545Z.log
iis102 = failed deploy
They are both identical save for the timestamp. So this points to something after hostfxr - is there more logging I can turn on after it has finished its job? Also, do you think it is worth upgrading to the latest dotnet6 release? I had a brief look through the release notes and couldn't see anything related to this issue that might resolve it. Cheers,
I'm not aware of any "logs". You could start looking into collecting event counters. But before that, there are a couple things you could try, assuming they are viable in your repro environment:
A similar issue occurred in #48665 and it turned out to be their code occasionally throwing a SQL exception on startup.
So this is made difficult by the fact we don't have a repro. We only know after it has happened. And because we deploy ~200 times a day, there are many opportunities to experience this intermittent error.
We do monitor and collect event counters - is there one in particular you are thinking of?
We've mulled over this, and we are against the idea because we think the app will continue to hang no matter what. But happy to bump it to, say, 10 minutes?
I'm not sure how to do that, given we only know after the error has occurred. We don't know when it is happening, only that it has happened (e.g. the deploy fails). We could set up DebugDiag or dotnet-dump and have it wait, I guess...
The log is 100% firing, because we can see it fire during successful deploys, and as we deploy ~200 times a day we have plenty of evidence:- So for our LIVE environment we have 14 boxes, and when it happens we can see the log on 13 of the 14 boxes in the deploy window. Same for our DEMO environment, but there we only have 4 boxes. And given the logger is the first thing in Main.cs, we are fairly sure it is something between dotnet and IIS. Do you know if there is specific IIS logging I can turn on?
We've had that in the past, but there is usually a breadcrumb to indicate that is happening. We've never had an exception silently take down an app before. Happy to add a try/catch. Cheers,
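For reference, the 10-minute bump mentioned above might look like the following web.config fragment. This is a minimal sketch, assuming the limit under discussion is the ASP.NET Core Module's `startupTimeLimit` attribute (in seconds, default 120, which would match the 2-minute threshold seen in the event log). The `processPath` and `arguments` values are placeholders and are not taken from the attached Web.LIVE.config.

```xml
<!-- Hypothetical fragment: MyApp.dll is a placeholder, not the real app name -->
<system.webServer>
  <aspNetCore processPath="dotnet"
              arguments=".\MyApp.dll"
              hostingModel="inprocess"
              startupTimeLimit="600">
    <!-- 600 seconds = 10 minutes; the module default is 120 seconds -->
  </aspNetCore>
</system.webServer>
```

As noted in the comment above, a longer limit would only help if the app eventually starts. If the hang is permanent, it just delays the failure.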
From the logs it looks like there isn't anything else IIS is doing (except the timeout of course).
There are some startup counters:
Just had it happen again, and this was with the try/catch added in the main entry point. Absolutely nothing. It seems like it hangs before it gets to invoke/start up/boot the app in question. Very bizarre. It's going to be a little while until we get more information. We need to set up a continuous ETW capture that might shed light on what events are actually firing. Cheers,
Just a small update: this is still happening, but we haven't yet set up a continuous ETW trace. Cheers,
Alright, before we invoke the app, we now kick off a PerfView trace:-
$dateTimeString = (Get-Date -Format o | ForEach-Object { $_ -replace ":", "." });
& C:\PerfView\PerfView.exe "/DataFile:C:\PerfView\$dateTimeString.etl" "/LogFile:C:\PerfView\$dateTimeString.log" "/BufferSizeMB:256" "/StackCompression" "/CircularMB:256" "/MaxCollectSec:150" "/KernelEvents:None" "/NoGui" "/FocusProcess:`"w3wp.exe`"" "/NoNGenRundown" "/onlyProviders=clrPrivate:0:Informational:@EventIDsToEnable=`"80 88 112`"" "/Zip:false" "/AcceptEULA" "collect"
The event IDs are taken from:-
Back to the waiting game. Cheers,
It finally happened again! There are four files in the attached ZIP.
The files starting with
The PerfView trace shows that in the bad example the main method is never invoked. To be clear, we only tell PerfView to look for
80 = StartupEEStartupStartEventID
To us that suggests the main method is never being invoked and something before it hangs. Not really sure where to proceed from here - any ideas? Cheers,
Asked around and the recommendation is to try and get a dump during the hang. And for events:
Yeah, happy to do that. Is this the correct modification to the command line?
"/onlyProviders=clrPrivate:0x80000000:Informational" "/ClrEvents=Loader,Binder"
This would be the entire command:-
& C:\PerfView\PerfView.exe "/DataFile:C:\PerfView\$dateTimeString.etl" "/LogFile:C:\PerfView\$dateTimeString.log" "/BufferSizeMB:256" "/StackCompression" "/CircularMB:256" "/MaxCollectSec:150" "/KernelEvents:None" "/NoGui" "/FocusProcess:`"w3wp.exe`"" "/NoNGenRundown" "/onlyProviders=clrPrivate:0x80000000:Informational" "/ClrEvents=Loader,Binder" "/Zip:false" "/AcceptEULA" "collect";
I would love to get a dump, but it happens so infrequently that we can't take a dump as a matter of course. As we deploy 200+ times a day, the prod boxes would fill up with dumps pretty quickly. Unfortunately, we only know after the app has taken 2+ mins to start. Cheers,
We are giving up on this issue; it still happens, but it occurs so infrequently and sporadically that it is borderline impossible to diagnose, especially without a repeatable reproduction. Apologies for any wasted time. Cheers,
Is there an existing issue for this?
Describe the bug
Every now and then we are seeing a dotnet6 app fail to start for no apparent reason.
The first exception we saw is this:-
So we naturally assumed that our app was taking a long time to start up, but this is not the case. We ruled that out by adding a logger to Program.cs and Startup.cs and waiting for the next time it happened. When it did happen, we could see the log is never reached (despite being the first thing the app should hit). This also happens on applications that have an established history of starting up within 10 seconds.
To be clear we are seeing this sporadically across different production servers and different applications, so that rules out it being related to one server or one application.
We are using in-process hosting on IIS.
We've turned on:-
And today it happened again and we were able to collect more information. But it still isn't clear why this is happening.
I've attached the files I've referenced from today's failure, all the apps are identical in configuration and all the servers are managed via DSC so there are no configuration differences.
As far as we can tell "something" is preventing the app from being booted. But there is no evidence of that.
I used this to collect the relevant logs:-
I've appended .txt to make these files uploadable.
dotnet-info.txt
Program.cs.txt
Startup.cs.txt
systeminfo.txt
Web.LIVE.config.txt
If you extract the message from the raw logs:-
IIS107.MAN.txt
IIS120.MAN.txt
107 = successful deploy (good.log)
120 = failed deploy (bad.log)
GOOD.log
BAD.log
You can see BAD.log has extra output at the top of the log (most recent).
We are unsure what the extra logs mean other than that the 2 min threshold was reached, because we know the app isn't taking more than 2 mins to boot (there are no logs on app start). As far as we can tell something "just" hangs.
If you grep for "Setting current directory" in both logs you'll see that bad.log just shows a 2 minute hang. But in good.log it continues as normal.
Any ideas?
Expected Behavior
Should be able to deploy without experiencing this error.
Steps To Reproduce
Unable to reproduce at all; it only happens in production, very intermittently/sporadically.
Exceptions (if any)
.NET Version
6.0.13
Anything else?
No response