App Warmup failed under stress if backend process failed to start #260
Comments
I am seeing issues very consistent with this behavior. ASP.NET Core 2.0 app hosted on Azure App Service. Occasionally, any condition that results in a restart or recycle of the app causes it to stop responding, returning 502.3 results consistent with Kestrel timeouts. The app does not recover unless we manually restart the App Service, change the instance count, or perform a slot swap. |
We get 500s after a platform event has occurred, such as a restart or a new worker being assigned, and definitely not when the app is under load. The only way to resolve it is with a restart of the App Service. |
@nforysinski @dotdestroyer if you guys have any logs (application or ETW), feel free to share with me at my personal email. We are investigating this issue. |
@jkotalik I sent you some info, but wanted to ask if you thought rolling back to a previous version of .NET Core could help. We are running on 2.0.1 and don't mind rolling back to 1.x for the time being until there is a fix. This problem is super disruptive to us (and made worse by it being random and unexpected), so we wanted to see if there was any kind of workaround to get us through in the meantime. Also wanted to share that all of the error logging in Azure is not helpful, because it appears nothing is making it to our app pipeline. There are no event logs for the 502 that occurs, and there is nothing in App Insights for it. It's as if the requests never make it to our app at all, and the w3wp process keeps restarting over and over. This is often accompanied by extremely high thread counts. |
It is not related to the ASP.NET Core version. The issue is that the AppWarmUp module sends a fake request to warm up the application and waits on its IO completion synchronously. For some reason, the IO callback never happens (it could be a leak of a request in the ASP.NET Core Module, which is a global singleton and not controlled by the app, or a bug in the IO layer), and this causes the IIS worker process to hang. So far we cannot repro the issue on a local IIS server. The simple workaround is to remove the warmup module from your site (if your app does not use the AppWarmUp feature) by using a site extension. We are working on a change to the ASP.NET Core Module to simplify the process-start logic. It will help to reduce the chance of process-start failure. |
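As a toy illustration of why a lost IO completion hangs the worker process (Python, not the actual native module code; the event and `wait_for_completion` are hypothetical stand-ins for the warmup module's synchronous wait):

```python
import threading

# Stand-in for the IO completion callback the warmup module waits on.
completion = threading.Event()

def wait_for_completion(event: threading.Event, timeout=None) -> bool:
    """Block synchronously until the completion fires, the way the
    warmup module blocks the IIS pipeline on its fake request."""
    return event.wait(timeout)

# If the callback never fires (the "lost completion" case), a wait with
# no timeout would block this thread forever; a timeout at least lets
# the caller observe the stall instead of hanging the worker process.
stalled = not wait_for_completion(completion, timeout=0.1)
```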
@pan-wang we are hosting on azure app services so we don't have that level of control over IIS, unless im missing something. |
App Service does allow users to tune IIS settings via a site extension. One of our CCS did it for another customer. Let me ping him to get the whole solution and then update you. Please expect some delay due to the holiday. |
@nforysinski could you please try the steps in the attached doc to disable app warmup? |
@pan-wang we set this up a few days ago and everything seems to be working much better. several app restarts have come through fine and the application hasn't gone into the 502 issues we were seeing previously. |
Could you explain what preload is and why disabling it will likely fix this issue? What are the cons from disabling preload? Also what is the warmup module. How would I know if I am using it? |
@pan-wang : When I try to create the file applicationHost.xdt I get the following error:
Am I creating the file in the wrong folder? I assumed the file had to live next to the original. Also, what is the easiest way to test if preloading is disabled, so I can see whether the workaround is implemented successfully? |
@nphmuller - The .xdt file needs to go at the site root - d:\home\site. If you look at the applicationhost.config after a restart, you should see the change reflected in that file. |
@nphmuller See also https://github.com/projectkudu/kudu/wiki/Azure-Site-Extensions#debugging-private-extensions for more general info on using/debugging this xdt file. @pan-wang can you update your Word doc to indicate where the xdt should be created? |
@davidebbo I have updated the doc as you suggested. |
Faced with a similar problem. I noticed that this happens after restarting the server, not restarting the application; namely, a restart of the server itself. If you look at the "System up time" field in Kudu (scm), it correlates perfectly with the time the site was inaccessible. It may be linked with #226. |
I see this too - but with a ~30 second delay for some calls. All method calls are definitely affected - I had to write a script to call all my APIs once after a deployment so my users don't notice this slowness. "We get 500s after a platform event has occurred such as a restart or a new worker is assigned, and definitely not when the App is under load." In my case, under Service Fabric, the issue goes away after the first call. Azure ticket: 117121117302812. Note: for me, this problem was not there with the 1.1 or 2.0 release. It started happening out of the blue around the time the ticket was opened. I may have updated all my NuGet packages to the latest version around that time. |
@davidebbo and @pan-wang Is there an easy way to script this applicationhost.xdt file addition that appears to resolve this issue? We have numerous ASP.NET Core APIs that can benefit from this, and it seems like this could be generalized and scripted. |
@prejeanb you should be able to upload the file via the Kudu vfs API. |
Thanks @davidebbo. I did create a PowerShell script that automates this using the Kudu vfs API. Note: it does require Azure PowerShell to automate the restarting of the web app. |
This wasn't mentioned in the post above, but you can create a generic transform that can be applied to any site like so:

```xml
<?xml version="1.0"?>
<configuration xmlns:xdt="http://schemas.microsoft.com/XML-Document-Transform">
  <system.applicationHost>
    <sites>
      <site name="%XDT_SITENAME%" xdt:Locator="Match(name)">
        <application preloadEnabled="false" xdt:Transform="SetAttributes(preloadEnabled)">
        </application>
      </site>
    </sites>
  </system.applicationHost>
</configuration>
```
@xt0rted - Thanks for sharing! That simplifies the PowerShell script some. Here is the simplified script... |
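Such an upload can be sketched against the Kudu VFS API (a hedged sketch using only the Python standard library, not the actual script shared above; the site name, credentials, and the helper name are placeholders):

```python
import base64
import urllib.request

def kudu_vfs_put(site: str, user: str, password: str, xdt_path: str) -> urllib.request.Request:
    """Build a PUT request for the Kudu VFS API that uploads an
    applicationHost.xdt to the site root (/site on the scm share)."""
    url = f"https://{site}.scm.azurewebsites.net/api/vfs/site/applicationHost.xdt"
    with open(xdt_path, "rb") as f:
        body = f.read()
    req = urllib.request.Request(url, data=body, method="PUT")
    # Basic auth with the site's deployment credentials.
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req.add_header("Authorization", f"Basic {token}")
    req.add_header("If-Match", "*")  # allow overwriting an existing file
    return req

# urllib.request.urlopen(kudu_vfs_put(...)) would perform the upload;
# the web app then needs a restart for the transform to take effect.
```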
I just wanted to add my experience in case it helps in this. I'm working with a Microsoft Support employee who's helping me with the following exception that occurs periodically during slot swapping and during auto-scaling operations of my Azure App Service:
He suggested I try this solution. I haven't yet since I'm trying to understand how it applies, but maybe it'll help your investigation. |
@pan-wang I don't understand this issue particularly with the address in use case. Here's why:
Do we have any idea at which of the above steps things are falling apart? Based on other reports I've seen, the logs don't support the idea that there are 10 process start failures in a row. Do we even know if the dotnet process is crashing? I'm starting to wonder if a bunch of applications are out there swallowing exceptions from WebHost.Run/Start that's leading to this strange behavior. @JosephTremoulet Do you have any insight at which stage in the process crash/recovery cycle things are stalling? |
I was also sent here by a Microsoft support employee. My problem occurs during instance changes of my WebApp hardware, which result in random "AccessDenied" errors when accessing all of my DLLs. This locks the app, as none of the DLLs can be accessed until the entire WebApp is restarted (not just the process; the actual WebApp environment must be stopped, then started again). I've since done this and not seen the error again (it was sporadic and would happen twice a week, once a month, etc.), but I can't be sure it is fixed. Can anyone comment on how this issue would relate to all of my .NET DLLs returning access denied when they are loaded? Case number is 117120117249053 if any other employees want to have a look. |
The root cause is that the AppWarmUp module sends a fake request to start the IIS pipeline and waits for the IO completion of that fake request. ANCM receives the fake request and tries to start the backend process. If the start fails the first time, ANCM creates a child request to try again. Somehow, the IO completion for the parent/child request gets lost, and the AppWarmUp module holds the IIS pipeline forever. This error happens with the parent/child request pattern. I will simplify the start logic to abandon the child request in the new code. |
@halter It sounds like you are saying this exception can be ignored. Although I would wonder why a random port has a 99% chance of already being in use. |
@nforysinski do you have FREB logs? |
Guys, correct me if I'm wrong. P.S. I'm not sure that my problem is related to warmup, but the behavior of my application looks the same. |
@rolandh I looked at #2700. In your case, the Kestrel server hit some assembly-loading issue. It is unrelated to the AppWarmUp issue, in which requests get blocked by the AppWarmUp module. @Tratcher and @davidfowl, do you have any idea about the possible root cause of "Could not load DiscountRules.CustomerRoles.dll Access is Denied #2700"? |
@joshmouch in most cases the port-in-use error is ignorable, as the backend Kestrel exits on this error and ANCM retries on the next request. To fix this port-in-use issue, the new code will retry while serving the first request instead of waiting for a new request. |
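The described change (retrying inside the first request rather than deferring to the next one) can be sketched like this (Python pseudologic; `start_backend_with_retry` and `fake_start` are illustrative names, not ANCM's actual native code):

```python
import time

def start_backend_with_retry(start_fn, attempts=3, delay=0.0):
    """Retry the backend process start while still serving the first
    request, instead of failing the request and waiting for the next
    one (or spawning a child request, as the old logic did)."""
    last_error = None
    for _ in range(attempts):
        try:
            return start_fn()
        except OSError as err:  # e.g. the chosen port is already in use
            last_error = err
            time.sleep(delay)
    raise last_error

# Simulated start that fails twice with "port in use", then succeeds:
state = {"calls": 0}
def fake_start():
    state["calls"] += 1
    if state["calls"] < 3:
        raise OSError("address already in use")
    return "started"

result = start_backend_with_retry(fake_start)
```

With this pattern, transient port-in-use failures are absorbed inside the first request, so the client never sees them.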
@sergei66666 , could you please provide more info about your scenario. Is your application hosted on AppService? |
@pan-wang, yes, my app is in App Service. After it reboots (after an update, for example), I get a 502.5 error. I have logs, but they contain no errors. I can see my app in the Kudu panel, but I still get 502.5. It sometimes (but not always) starts working again after a plan or slot restart. P.S. As I said, I do not know if this is due to warmup. Maybe this is another issue, but the symptoms look similar, as far as I can judge. |
@nforysinski and @pan-wang: I have about three different 502s from yesterday. Yesterday, after one restart, all requests returned 502 until a second restart. That doesn't sound exactly like what you're talking about, though. I have FREB logs, but didn't want to throw too many things at @pan-wang at once until I had this port issue fixed, and I didn't want to confuse this thread too much. ;) Also, I wanted to see if I could find the cause myself before posting here. I'll open up some tickets on these different issues. |
@nforysinski I have two FREB logs here from yesterday showing your issue if I can post them somewhere. |
@joshmouch @pan-wang has shared his email earlier in the thread - I assume if you have FREB logs showing issues related to what I posted, you can send them over to him with a quick description that they are for an issue both of us are seeing. I believe they are related. The preload fix was working fine until two days ago, when at ~2pm our app stopped responding to requests, and again one day ago, when at ~1pm it also stopped responding. Only slot swaps after a deploy could get it back to working. |
@nforysinski slot swaps will lead to a w3wp recycle. What do you mean by "stopped responding" - 500 errors, or a hang? Did you notice anything abnormal about the memory usage of w3wp.exe or dotnet.exe? |
@pan-wang after two minutes, a 502, which seems to indicate an App Service timeout rather than a .NET Core timeout. Looking back at our logs, there were no abnormal or unexpected rises in memory or CPU usage on the App Service at all. |
@nforysinski @pan-wang I'm still not sure if my problem is the same, but just thought I'd chime in. When I see this timeout after two minutes, the FREB log shows a "2147954402" error in the AspNetCore module. For example:
|
@joshmouch based on the log and error code 2147954402 (WININET_E_TIMEOUT), I would say you hit a different issue than warmup. In your case, ANCM forwarded the request to the backend, but the backend never responds. You may want to turn on the stdout log in web.config to see what happened in the backend. In the past I saw a similar case where there was a deadlock in the backend during startup. |
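Turning on the stdout log means setting the `stdoutLogEnabled` attribute on the `aspNetCore` element in web.config; a minimal example, assuming the app DLL name and log path shown here are placeholders for your own:

```xml
<configuration>
  <system.webServer>
    <handlers>
      <add name="aspNetCore" path="*" verb="*" modules="AspNetCoreModule" resourceType="Unspecified" />
    </handlers>
    <!-- stdoutLogEnabled/stdoutLogFile capture the backend process's console output -->
    <aspNetCore processPath="dotnet" arguments=".\MyApp.dll"
                stdoutLogEnabled="true"
                stdoutLogFile=".\logs\stdout" />
  </system.webServer>
</configuration>
```

The log directory must already exist, or nothing will be written.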
@pan-wang I just emailed you an example of a failed request that occurred during this time. Spoiler alert: I am seeing the exact same error code as @joshmouch, but this occurred this morning at two different times about 10 minutes apart. After several interventions, the site was able to recover. Again, this bears much resemblance to the warmup issues we were seeing before. I have turned on stdout to see if there are any errors reported at the moment they occur. |
A little more info - in our event logs, every time this occurs, we see several events like this:
```xml
<Event>
  <System>
    <Provider Name="W3SVC-WP"/>
    <EventID>2303</EventID>
    <Level>2</Level>
    <Task>0</Task>
    <Keywords>Keywords</Keywords>
    <TimeCreated SystemTime="2018-02-06T22:30:25Z"/>
    <EventRecordID>32770765</EventRecordID>
    <Channel>Application</Channel>
    <Computer>RD00155D85A22C</Computer>
    <Security/>
  </System>
  <EventData>
    <Data>mongoose-sms-api__ed8f</Data>
    <Binary>02000780</Binary>
  </EventData>
</Event>
<Event>
  <System>
    <Provider Name="W3SVC-WP"/>
    <EventID>2303</EventID>
    <Level>2</Level>
    <Task>0</Task>
    <Keywords>Keywords</Keywords>
    <TimeCreated SystemTime="2018-02-06T22:30:25Z"/>
    <EventRecordID>32770781</EventRecordID>
    <Channel>Application</Channel>
    <Computer>RD00155D85A22C</Computer>
    <Security/>
  </System>
  <EventData>
    <Data>mongoose-sms-api__ed8f</Data>
    <Binary>02000780</Binary>
  </EventData>
</Event>
<Event>
  <System>
    <Provider Name="W3SVC-WP"/>
    <EventID>2303</EventID>
    <Level>2</Level>
    <Task>0</Task>
    <Keywords>Keywords</Keywords>
    <TimeCreated SystemTime="2018-02-06T22:30:26Z"/>
    <EventRecordID>28674375</EventRecordID>
    <Channel>Application</Channel>
    <Computer>RD00155D858F1B</Computer>
    <Security/>
  </System>
  <EventData>
    <Data>mongoose-sms-api__ed8f</Data>
    <Binary>02000780</Binary>
  </EventData>
</Event>
<Event>
  <System>
    <Provider Name="W3SVC-WP"/>
    <EventID>2303</EventID>
    <Level>2</Level>
    <Task>0</Task>
    <Keywords>Keywords</Keywords>
    <TimeCreated SystemTime="2018-02-06T22:30:26Z"/>
    <EventRecordID>28674390</EventRecordID>
    <Channel>Application</Channel>
    <Computer>RD00155D858F1B</Computer>
    <Security/>
  </System>
  <EventData>
    <Data>mongoose-sms-api__ed8f</Data>
    <Binary>02000780</Binary>
  </EventData>
</Event>
```
This event is W3_EVENT_APPLICATION_PRELOAD_ERROR_GENERIC, which is emitted by the warmup module. You definitely hit some warmup issue. I am not sure whether this is because the warmup request timed out, since ANCM never got a response from the backend. The stdout log will help. |
@pan-wang
|
It seems the backend had an unhandled exception and never returned a response to the forwarded request. That makes sense for the timeout and warmup failure. @Tratcher could you please help diagnose this runtime CompilerServices exception? |
@nforysinski that stack trace shows an exception inside your MVC action that gets caught and logged by Kestrel. Do you have any timestamps showing whether this exception happened before or after the ANCM timeout? What it looks like is happening: the action takes a long time due to a large/slow database operation; ANCM times out and closes the connection; Kestrel fires the RequestAborted cancellation token; and something null-refs inside the subsequent database operation cancellation. |
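As a rough analogy of observing the abort signal (Python asyncio here, not Kestrel/C#; all names are invented for illustration), racing the slow operation against the abort lets the action cancel its database work cleanly instead of letting cancellation surface as an unexpected exception:

```python
import asyncio

async def slow_db_operation():
    await asyncio.sleep(10)  # stands in for the large/slow database call
    return "rows"

async def action(request_aborted: asyncio.Event):
    """Race the slow operation against the abort signal, the way an
    MVC action can observe HttpContext.RequestAborted."""
    db = asyncio.create_task(slow_db_operation())
    abort = asyncio.create_task(request_aborted.wait())
    done, pending = await asyncio.wait({db, abort}, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()          # cancel the loser of the race
        try:
            await task         # and swallow its CancelledError
        except asyncio.CancelledError:
            pass
    return "aborted cleanly" if abort in done else db.result()

async def main():
    aborted = asyncio.Event()
    # Simulate ANCM timing out and closing the connection shortly after start.
    asyncio.get_running_loop().call_later(0.05, aborted.set)
    return await action(aborted)

result = asyncio.run(main())
```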
@Tratcher Unfortunately I do not as my logs were only set to warning. Will changing the log level to information catch the appropriate log you need? |
Yes it should help. Getting time stamps depends on the logger you use, but Hosting always logs the request elapsed time at the end of each request. See |
After some run-around with the Azure team, they recommended an adjustment to the xdt file. |
@nforysinski What is the adjustment? |
Our original file:
The adjusted file (note the wildcard in the site name portion):
|
@nforysinski have you seen the issue recur after implementing the preload disable you mention on Feb 13? We saw the issue recur again on Feb 27 after supposedly disabling preload using the explicit site_name value (not the XDT_SITENAME variable). |
@NoahStahl nope - not since adjusting the xdt file to use the variable - seems to be working well to prevent this issue. |
Reported from App Service that App WarmUp sometimes fails for a .NET Core app.
So far we cannot repro on a local server. One possibility relates to how ANCM handles dotnet.exe start failure, i.e., the child request used for retry. Will change the retry mechanism to avoid the child request.