This repository has been archived by the owner on Jun 20, 2019. It is now read-only.

App Warmup failed under stress if backend process failed to start #260

Open
pan-wang opened this issue Nov 23, 2017 · 60 comments

Comments

@pan-wang
Contributor

Reported from App Service that App Warmup sometimes fails for ASP.NET Core apps.
So far we cannot repro on a local server. One possibility is how ANCM handles a dotnet.exe start failure, i.e., the child request used for retry. We will change the retry mechanism so it does not use a child request.

@shirhatti shirhatti added this to the Backlog milestone Nov 29, 2017
@nforysinski

I am seeing issues very consistent with this behavior with an ASP.NET Core 2.0 app hosted on Azure App Service. Occasionally, any condition that results in a restart or recycle of the app causes it to stop responding and return 502.3 errors consistent with Kestrel timeouts. The app does not recover unless we manually restart the App Service, change the instance count, or perform a slot swap.

@olitomlinson

We get 500s after a platform event has occurred, such as a restart or a new worker being assigned, and definitely not when the app is under load. The only way to resolve it is with a restart of the App Service.

@jkotalik
Contributor

@nforysinski @dotdestroyer if you have any logs (application or ETW), feel free to share them with me at my personal email. We are investigating this issue.

@nforysinski

nforysinski commented Dec 23, 2017

@jkotalik I sent you some info, but wanted to ask whether you think rolling back to a previous version of .NET Core could help. We are running on 2.0.1 and don't mind rolling back to 1.x until there is a fix. This problem is super disruptive to us (and made worse by being random and unexpected), so we wanted to see if there is any kind of workaround to get us through in the meantime.

Also wanted to share that the error logging in Azure is not helpful, because it appears nothing is making it to our app pipeline. There are no event logs for the 502 that occurs, and there is nothing in App Insights for it. It's as if the requests never make it to our app at all and the w3wp process keeps restarting over and over. This is often accompanied by extremely high thread counts.

@pan-wang
Contributor Author

It is not related to the ASP.NET Core version. The issue is that the AppWarmup module sends a fake request to warm up the application and waits synchronously on its I/O completion. For some reason the I/O callback never happens (it could be a leaked request in the ASP.NET Core module, which is a global singleton and not controlled by the app, or a bug in the I/O layer), which causes the IIS worker process to hang. So far we cannot repro the issue on a local IIS server. The simple workaround is to remove the warmup module from your site (if your app does not use the AppWarmup feature) via a site extension. We are working on a change to the ASP.NET Core module to simplify the process start logic. It will help reduce the chance of process start failure.

@nforysinski

@pan-wang we are hosting on Azure App Service, so we don't have that level of control over IIS, unless I'm missing something.

@pan-wang
Contributor Author

App Service does allow users to tune IIS settings via a site extension. One of our CCS engineers did it for another customer. Let me ping him to get the whole solution and then update you. Please expect some delay due to the holidays.

@pan-wang
Contributor Author

pan-wang commented Dec 23, 2017

How.to.disable.preload.docx

@nforysinski could you please try the steps in the attached doc to disable app warmup?

@nforysinski

@pan-wang we set this up a few days ago and everything seems to be working much better. Several app restarts have come through fine, and the application hasn't gone into the 502 issues we were seeing previously.

@rollsch

rollsch commented Jan 2, 2018

Could you explain what preload is and why disabling it will likely fix this issue? What are the cons of disabling preload?

Also, what is the warmup module? How would I know if I am using it?

@nphmuller

@pan-wang: When I try to create the applicationHost.xdt file, I get the following error:

409 Conflict: Could not write to local resource 'D:\local\Config\applicationhost.xdt' due to error 'Access to the path 'D:\local\Config\applicationhost.xdt' is denied.'.

Am I creating the file in the wrong folder? I assumed the file had to live next to the original (applicationhost.config).

Also, what is the easiest way to test whether preloading is disabled, so I can verify the workaround was implemented successfully?

@veatcha

veatcha commented Jan 2, 2018

@nphmuller The .xdt file needs to go at the site root (d:\home\site). If you look at applicationhost.config after a restart, you should see the change reflected in that file.

@davidebbo

@nphmuller See also https://github.com/projectkudu/kudu/wiki/Azure-Site-Extensions#debugging-private-extensions for more general info on using/debugging this xdt file.

@pan-wang can you update your Word doc to indicate where the xdt should be created?

@pan-wang
Contributor Author

pan-wang commented Jan 2, 2018

@davidebbo I have updated the doc as you suggested.
@rolandh The AppWarmup module (aka Application Initialization) is a native IIS module. It is useful in scenarios where your managed application needs a long time to warm up (e.g., loading assemblies, setting up SQL connections, etc.) and you want your application to serve a custom page during warmup instead of just letting the client wait/spin. If you don't know about this module, most likely you are not using it.

@sergei66666

sergei66666 commented Jan 4, 2018

I'm facing a similar problem. I noticed that this happens after the server is restarted (not a restart of the application, but of the server itself). If you look at the "System up time" field in Kudu (SCM), it correlates perfectly with the time the site is unreachable.
To reproduce this, I tried to restart the server, but I could not find a way to do it. Is there some way to do this, or maybe to simulate it?

Maybe it is linked to #226.

@vdevappa

vdevappa commented Jan 4, 2018

I see this too, but with a ~30-second delay for some calls. All method calls are definitely affected; I had to write a script to call all my APIs once after a deployment so my users do not notice the slowness.

"We get 500s after a platform event has occurred such as a restart or a new worker Is assigned, and definitely not when the App is under load."

In my case, under Service Fabric, the issue goes away after the first call.

Azure ticket: 117121117302812

Note: for me, this problem was not there with the 1.1 or 2.0 release. It started happening out of the blue around the time the ticket was opened. I may have updated all my NuGet packages to the latest version around that time.

@prejeanb

prejeanb commented Jan 5, 2018

@davidebbo and @pan-wang Is there an easy way to script this applicationhost.xdt file addition that appears to resolve this issue? We have numerous ASP.NET Core APIs that can benefit from this, and it seems like this could be generalized and scripted.

@davidebbo

@prejeanb you should be able to upload the file via the Kudu vfs API.
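
Roughly, a minimal sketch of that upload in C# (assuming site-level publishing credentials and that the Kudu VFS API is rooted at d:\home; the site name and credentials below are placeholders):

using System;
using System.IO;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text;
using System.Threading.Tasks;

class UploadXdt
{
    static async Task Main()
    {
        // Placeholders: your site name and its publishing (deployment) credentials.
        var siteName = "your-site";
        var user = "$your-site";
        var password = "publish-password";

        using (var client = new HttpClient())
        {
            var creds = Convert.ToBase64String(Encoding.ASCII.GetBytes($"{user}:{password}"));
            client.DefaultRequestHeaders.Authorization = new AuthenticationHeaderValue("Basic", creds);
            // If-Match: * allows the PUT to overwrite an existing file via the VFS API.
            client.DefaultRequestHeaders.TryAddWithoutValidation("If-Match", "*");

            // The VFS API is rooted at d:\home, so "site/applicationHost.xdt" lands at d:\home\site\applicationHost.xdt.
            var url = $"https://{siteName}.scm.azurewebsites.net/api/vfs/site/applicationHost.xdt";
            var content = new StringContent(File.ReadAllText("applicationHost.xdt"));
            var response = await client.PutAsync(url, content);
            Console.WriteLine($"{(int)response.StatusCode} {response.ReasonPhrase}");
        }
    }
}

A restart of the web app is still needed afterwards for the transform to take effect.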

@prejeanb

prejeanb commented Jan 9, 2018

Thanks @davidebbo. I did create a PowerShell script that automates this using the Kudu vfs API. Note: it does require Azure PowerShell to automate the restarting of the web app.
aspnetcorefix.zip

@xt0rted

xt0rted commented Jan 12, 2018

This wasn't mentioned in the post above, but you can create a generic transform that can be applied to any site like so:

<?xml version="1.0"?>
<configuration xmlns:xdt="http://schemas.microsoft.com/XML-Document-Transform">
  <system.applicationHost>
    <sites>
      <site name="%XDT_SITENAME%" xdt:Locator="Match(name)">
        <application preloadEnabled="false" xdt:Transform="SetAttributes(preloadEnabled)">
        </application>
      </site>
    </sites>
  </system.applicationHost>
</configuration>

The %XDT_SITENAME% value will be pulled from Kudu. There are a few other variables you can use as well; they are listed in the Kudu wiki.

@prejeanb

prejeanb commented Jan 12, 2018

@xt0rted - Thanks for sharing! That simplifies the PowerShell script some. Here is the simplified script...
aspdotnetcorefixsimplified (2).zip

@joshmouch

I just wanted to add my experience in case it helps in this. I'm working with a Microsoft Support employee who's helping me with the following exception that occurs periodically during slot swapping and during auto-scaling operations of my Azure App Service:

Message: Failed to bind to address http://127.0.0.1:6629: address already in use.
Exception: Microsoft.AspNetCore.Server.Kestrel.Transport.Libuv.Internal.Networking.UvException

He suggested I try this solution. I haven't yet since I'm trying to understand how it applies, but maybe it'll help your investigation.

@halter73
Member

@pan-wang I don't understand this issue, particularly the address-in-use case. Here's why:

  1. Any time Kestrel logs "Failed to bind to address http://: address already in use.", it should always throw an exception from KestrelServer.Start and in turn WebHost.Run.
  2. This exception should bubble up through Program.Main and crash the dotnet process.
  3. ANCM should observe this process crash and attempt to restart the dotnet process with a new random port which should have a very good chance at succeeding in most scenarios.
  4. If it doesn't succeed a second time, ANCM should retry up to 10 times before ultimately failing.

Do we have any idea at which of the above steps things are falling apart? Based on other reports I've seen, the logs don't support the idea that there are 10 process start failures in a row. Do we even know if the dotnet process is crashing?

I'm starting to wonder if a bunch of applications out there are swallowing exceptions from WebHost.Run/Start, and that's leading to this strange behavior.
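
For illustration only, a minimal Program.Main sketch (hypothetical app code, not from any report in this thread; a Startup class is assumed) contrasting the behavior ANCM's retry relies on with the swallowing pattern:

using Microsoft.AspNetCore;
using Microsoft.AspNetCore.Hosting;

public class Program
{
    public static void Main(string[] args)
    {
        // Letting a bind failure ("address already in use") propagate out of Run()
        // crashes dotnet.exe, which is what lets ANCM notice the exit and retry on a new port.
        BuildWebHost(args).Run();

        // Anti-pattern: swallowing the exception keeps the process alive but not
        // listening, so ANCM never sees a crash and the warmup request just hangs.
        // try { BuildWebHost(args).Run(); } catch (System.Exception) { /* swallowed */ }
    }

    public static IWebHost BuildWebHost(string[] args) =>
        WebHost.CreateDefaultBuilder(args)
            .UseStartup<Startup>()
            .Build();
}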

@JosephTremoulet Do you have any insight at which stage in the process crash/recovery cycle things are stalling?

@rollsch

rollsch commented Jan 25, 2018

I just wanted to add my experience in case it helps in this. I'm working with a Microsoft Support employee who's helping me with the following exception that occurs periodically during slot swapping and during auto-scaling operations of my Azure App Service:

Message: Failed to bind to address http://127.0.0.1:6629: address already in use.
Exception: Microsoft.AspNetCore.Server.Kestrel.Transport.Libuv.Internal.Networking.UvException

He suggested I try this solution. I haven't yet since I'm trying to understand how it applies, but maybe it'll help your investigation.

I was also sent here by a Microsoft support employee. My problem occurs during instance changes of my WebApp hardware, which result in random "Access Denied" errors when accessing all of my DLLs. This locks up the app, since none of the DLLs can be accessed until the entire WebApp is restarted (not just the process; the actual WebApp environment must be stopped, then started again).

I've since done this and have not seen the error again (it was sporadic and would happen twice a week, once a month, etc.), but I can't be sure it is fixed.

Can anyone comment on how this issue would relate to all of my .NET DLLs returning access denied when they are loaded? Case number is 117120117249053 if any other employees want to have a look.

@pan-wang
Contributor Author

The root cause is that the AppWarmup module sends a fake request to start the IIS pipeline and waits for the I/O completion of that fake request. ANCM receives the fake request and tries to start the backend process. If the start fails the first time, ANCM creates a child request to try again. Somehow, the I/O completion for the parent/child request gets lost, and the AppWarmup module holds the IIS pipeline forever. This error happens with the parent/child request pattern, so I will simplify the start logic in the new code to abandon the child request.

@joshmouch

@halter73 It sounds like you are saying this exception can be ignored, although I would wonder why a random port has a 99% chance of already being in use.

@pan-wang
Contributor Author

pan-wang commented Feb 1, 2018

@nforysinski do you have a FREB log?

@sergei66666

Guys, correct me if I'm wrong.
The first thing we need to do is make the errors repeatable.
In my experience (in 100% of cases), and in the experience of some people in this discussion, one of the triggers is a hard reboot of the server.
Maybe we can try to find a way to reboot it manually, so we do not have to wait for the system events that lead to the problem.

P.S. I'm not sure that my problem is related to warmup, but the behavior of my application looks the same.

@pan-wang
Contributor Author

pan-wang commented Feb 1, 2018

@rolandh I looked at #2700. In your case, the Kestrel server hit an assembly loading issue. It is unrelated to the AppWarmup issue, in which requests get blocked by the AppWarmup module. @Tratcher and @davidfowl, do you have any idea about the possible root cause of "Could not load DiscountRules.CustomerRoles.dll Access is Denied #2700"?

@pan-wang
Contributor Author

pan-wang commented Feb 1, 2018

@joshmouch in most cases the port-in-use error is ignorable, as the backend Kestrel exits on this error and ANCM retries on the next request. To fix this port-in-use issue, the new code will retry while serving the first request instead of waiting for a new request.

@pan-wang
Contributor Author

pan-wang commented Feb 1, 2018

@sergei66666, could you please provide more info about your scenario? Is your application hosted on App Service?

@sergei66666

@pan-wang, yes, my app is in App Service. After it reboots (after an update, for example), I get a 502.5 error. I have logs, but they do not contain any errors. I see my app in the Kudu panel, but I get a 502.5. It can sometimes (but not always) start working again after a plan or slot restart.
If I remember correctly, it started in November.

P.S. As I said, I do not know if this is due to warmup. Maybe this is another issue, but the symptoms look similar as far as I can judge.

@joshmouch

joshmouch commented Feb 1, 2018

@nforysinski and @pan-wang: I have about three different 502s from yesterday. Yesterday, after one restart, all requests returned 502 until a second restart. That doesn't sound exactly like what you're talking about, though. I have FREB logs, but I didn't want to throw too many things at pan-wang at once until I had this port issue fixed, and I didn't want to confuse this thread too much. ;) Also, I wanted to see if I could find the cause myself before posting here. I'll open up some tickets on these different issues.

@joshmouch

@nforysinski I have two FREB logs from yesterday showing your issue, if I can post them somewhere.

@nforysinski

nforysinski commented Feb 1, 2018

@joshmouch @pan-wang has shared his email earlier in the thread - I assume if you have FREB logs showing issues related to what I posted, you can send them over to him with a quick description that they are for an issue both of us are seeing.

I believe they are related to the issue that preloadEnabled=false is supposed to fix, as it has all the same symptoms: instance count changes do not affect the outcome, restarts are usually unsuccessful, it is very random throughout the day, and it is accompanied by a log showing a sudden, unexpected app shutdown. Only a slot swap with a deploy seems to resolve it this time.

The preload fix was working fine until two days ago, when at ~2pm our app stopped responding to requests, and again one day ago at ~1pm. Only slot swaps after a deploy could get it working again.

@pan-wang
Contributor Author

pan-wang commented Feb 2, 2018

@nforysinski slot swaps will lead to a w3wp recycle. What do you mean by "stopped responding": 500 errors or a hang? Did you notice anything abnormal about the memory usage of w3wp.exe or dotnet.exe?

@nforysinski

@pan-wang after two minutes we get a 502, which seems to indicate an App Service timeout rather than a .NET Core timeout. Looking back at our logs, there were no abnormal or unexpected rises in memory or CPU usage on the App Service at all.

@joshmouch

@nforysinski @pan-wang I'm still not sure if my problem is the same, but just thought I'd chime in. When I see this timeout after two minutes, the FREB log shows a "2147954402" error in the AspNetCore module. For example:

No. | Event | Details | Time
-- | -- | -- | --
93 | MODULE_SET_RESPONSE_ERROR_STATUS (Warning) | ModuleName="AspNetCoreModule", Notification="EXECUTE_REQUEST_HANDLER", HttpStatus="502", HttpReason="Bad Gateway", HttpSubStatus="3", ErrorCode="2147954402", ConfigExceptionInfo="" | 06:48:58.315
94 | NOTIFY_MODULE_COMPLETION | ModuleName="AspNetCoreModule", Notification="EXECUTE_REQUEST_HANDLER", fIsPostNotificationEvent="false", CompletionBytes="0", ErrorCode="The operation completed successfully. (0x0)" | 06:48:58.315


@pan-wang
Contributor Author

pan-wang commented Feb 2, 2018

@joshmouch based on the log and error code 2147954402 (WININET_E_TIMEOUT), I would say you hit a different issue than warmup. In your case, ANCM forwarded the request to the backend but the backend never responded. You may want to turn on the stdout log in web.config to see what happened at the backend. In the past I saw one similar case where there was a deadlock in the backend during startup.

@nforysinski

@pan-wang I just emailed you an example of a failed request that occurred during this time. Spoiler alert: I am seeing the exact same error code as @joshmouch, but it occurred this morning at two different times about 10 minutes apart. After several interventions the site was able to recover. Again, this bears much resemblance to the warmup issues we were seeing before. I have turned on stdout logging to see if any errors are reported when this occurs.

@nforysinski

nforysinski commented Feb 6, 2018

A little more info: in our event logs, every time this occurs, we see several events like this:


    <Event>
        <System>
            <Provider Name="W3SVC-WP"/>
            <EventID>2303</EventID>
            <Level>2</Level>
            <Task>0</Task>
            <Keywords>Keywords</Keywords>
            <TimeCreated SystemTime="2018-02-06T22:30:25Z"/>
            <EventRecordID>32770765</EventRecordID>
            <Channel>Application</Channel>
            <Computer>RD00155D85A22C</Computer>
            <Security/>
        </System>
        <EventData>
            <Data>api__ed8f</Data>
            <Binary>02000780</Binary>
        </EventData>
    </Event>
    <Event>
        <System>
            <Provider Name="W3SVC-WP"/>
            <EventID>2303</EventID>
            <Level>2</Level>
            <Task>0</Task>
            <Keywords>Keywords</Keywords>
            <TimeCreated SystemTime="2018-02-06T22:30:25Z"/>
            <EventRecordID>32770781</EventRecordID>
            <Channel>Application</Channel>
            <Computer>RD00155D85A22C</Computer>
            <Security/>
        </System>
        <EventData>
            <Data>api__ed8f</Data>
            <Binary>02000780</Binary>
        </EventData>
    </Event>
    <Event>
        <System>
            <Provider Name="W3SVC-WP"/>
            <EventID>2303</EventID>
            <Level>2</Level>
            <Task>0</Task>
            <Keywords>Keywords</Keywords>
            <TimeCreated SystemTime="2018-02-06T22:30:26Z"/>
            <EventRecordID>28674375</EventRecordID>
            <Channel>Application</Channel>
            <Computer>RD00155D858F1B</Computer>
            <Security/>
        </System>
        <EventData>
            <Data>api__ed8f</Data>
            <Binary>02000780</Binary>
        </EventData>
    </Event>
    <Event>
        <System>
            <Provider Name="W3SVC-WP"/>
            <EventID>2303</EventID>
            <Level>2</Level>
            <Task>0</Task>
            <Keywords>Keywords</Keywords>
            <TimeCreated SystemTime="2018-02-06T22:30:26Z"/>
            <EventRecordID>28674390</EventRecordID>
            <Channel>Application</Channel>
            <Computer>RD00155D858F1B</Computer>
            <Security/>
        </System>
        <EventData>
            <Data>api__ed8f</Data>
            <Binary>02000780</Binary>
        </EventData>
    </Event>

@pan-wang
Contributor Author

pan-wang commented Feb 7, 2018 via email

@nforysinski

nforysinski commented Feb 7, 2018

@pan-wang
Here is an example of a failed log from the time of the event:

Connection id "0HLBDMOILAACT", Request id "0HLBDMOILAACT:00000001": An unhandled exception was thrown by the application.
System.NullReferenceException: Object reference not set to an instance of an object.
   at Custom.Domain.Query.Execute(IDbConnection connection) in Query.cs:line 30
   at Custom.Domain.Infrastructure.Queries.ListQueryHandler`1.Handle(IListQuery`1 query) in ListQueryHandler.cs:line 20
   at Custom.Domain.Controller.Search(StudentSearchRequest studentSearchRequest)
   at lambda_method(Closure , Object , Object[] )
   at Microsoft.AspNetCore.Mvc.Internal.ControllerActionInvoker.<InvokeActionMethodAsync>d__12.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Microsoft.AspNetCore.Mvc.Internal.ControllerActionInvoker.<InvokeNextActionFilterAsync>d__10.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at Microsoft.AspNetCore.Mvc.Internal.ControllerActionInvoker.Rethrow(ActionExecutedContext context)
   at Microsoft.AspNetCore.Mvc.Internal.ControllerActionInvoker.Next(State& next, Scope& scope, Object& state, Boolean& isCompleted)
   at Microsoft.AspNetCore.Mvc.Internal.ControllerActionInvoker.<InvokeInnerFilterAsync>d__14.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Microsoft.AspNetCore.Mvc.Internal.ResourceInvoker.<InvokeNextExceptionFilterAsync>d__23.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at Microsoft.AspNetCore.Mvc.Internal.ResourceInvoker.Rethrow(ExceptionContext context)
   at Microsoft.AspNetCore.Mvc.Internal.ResourceInvoker.Next(State& next, Scope& scope, Object& state, Boolean& isCompleted)
   at Microsoft.AspNetCore.Mvc.Internal.ResourceInvoker.<InvokeNextResourceFilter>d__22.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at Microsoft.AspNetCore.Mvc.Internal.ResourceInvoker.Rethrow(ResourceExecutedContext context)
   at Microsoft.AspNetCore.Mvc.Internal.ResourceInvoker.Next(State& next, Scope& scope, Object& state, Boolean& isCompleted)
   at Microsoft.AspNetCore.Mvc.Internal.ResourceInvoker.<InvokeFilterPipelineAsync>d__17.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Microsoft.AspNetCore.Mvc.Internal.ResourceInvoker.<InvokeAsync>d__15.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Microsoft.AspNetCore.Builder.RouterMiddleware.<Invoke>d__4.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Microsoft.AspNetCore.ResponseCompression.ResponseCompressionMiddleware.<Invoke>d__3.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Microsoft.AspNetCore.Authentication.AuthenticationMiddleware.<Invoke>d__6.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Microsoft.AspNetCore.Cors.Infrastructure.CorsMiddleware.<Invoke>d__7.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Microsoft.AspNetCore.Server.IISIntegration.IISMiddleware.<Invoke>d__11.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Microsoft.AspNetCore.Hosting.Internal.RequestServicesContainerMiddleware.<Invoke>d__3.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Http.Frame`1.<ProcessRequestsAsync>d__2.MoveNext()
fail: Microsoft.AspNetCore.Server.Kestrel[13]

@pan-wang
Contributor Author

pan-wang commented Feb 7, 2018

It seems the backend had an unhandled exception and never returned a response to the forwarded request. That makes sense for the timeout and the warmup failure. @Tratcher could you please help diagnose this System.Runtime.CompilerServices exception?

@Tratcher
Member

Tratcher commented Feb 7, 2018

@nforysinski that stack trace shows an exception inside your MVC Action that gets caught and logged by Kestrel. Do you have any timestamps showing if this exception happened before or after the ANCM timeout?

What it looks like is happening: the action takes a long time due to a large/slow database operation, ANCM times out and closes the connection, Kestrel fires the RequestAborted cancellation token, and something null-refs inside the subsequent database operation cancellation.
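
For illustration, a hypothetical action shaped like the one in that stack trace could observe the abort by taking a CancellationToken parameter (MVC binds it to HttpContext.RequestAborted) and flowing it into the query; the types below are illustrative stand-ins, not the actual app code:

using System.Threading;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Mvc;

// Illustrative stand-ins for the app's own types.
public class StudentSearchRequest { public string Name { get; set; } }

public interface IStudentSearch
{
    Task<object> ExecuteAsync(StudentSearchRequest request, CancellationToken cancellationToken);
}

public class StudentsController : Controller
{
    private readonly IStudentSearch _search;

    public StudentsController(IStudentSearch search) => _search = search;

    [HttpPost]
    public async Task<IActionResult> Search(StudentSearchRequest request, CancellationToken cancellationToken)
    {
        // MVC binds cancellationToken to HttpContext.RequestAborted; Kestrel signals it
        // when ANCM times out and closes the connection, so the slow database call can
        // stop cleanly instead of failing later during the abort.
        var results = await _search.ExecuteAsync(request, cancellationToken);
        return Ok(results);
    }
}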

@nforysinski

@Tratcher Unfortunately I do not, as my logs were only set to Warning. Will changing the log level to Information capture the log you need?

@Tratcher
Member

Tratcher commented Feb 7, 2018

Yes, it should help. Getting timestamps depends on the logger you use, but Hosting always logs the request elapsed time at the end of each request. See
https://github.com/aspnet/Hosting/blob/8765d55f9f97faa3e3df67743ae888d7c6557d22/src/Microsoft.AspNetCore.Hosting/Internal/HostingApplicationDiagnostics.cs#L186-L193
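
As a rough sketch (assuming a stock 2.0 WebHost setup with a Startup class; adjust to however logging is actually configured in the app), raising the minimum level would look something like:

using Microsoft.AspNetCore;
using Microsoft.AspNetCore.Hosting;
using Microsoft.Extensions.Logging;

public class Program
{
    public static void Main(string[] args) =>
        WebHost.CreateDefaultBuilder(args)
            // Raise the floor from Warning to Information so Hosting's end-of-request
            // entries (which include the elapsed time) are actually emitted.
            .ConfigureLogging(logging => logging.SetMinimumLevel(LogLevel.Information))
            .UseStartup<Startup>()
            .Build()
            .Run();
}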

@nforysinski

After some run-around with the Azure team, they recommended an adjustment to the applicationHost.xdt that we had set up. This seems to have resolved it for now, but it surely points to a continuing issue between IIS in Azure and ANCM with regard to app restarts and the preload module.

@Pluc15

Pluc15 commented Feb 13, 2018

@nforysinski What is the adjustment?

@nforysinski

nforysinski commented Feb 13, 2018

@Pluc15

Our original file:

<?xml version="1.0"?>
<configuration xmlns:xdt="http://schemas.microsoft.com/XML-Document-Transform">
    <system.applicationHost>
        <sites>
            <site name="site_name__01d" xdt:Locator="Match(name)">
                <application xdt:Transform="SetAttributes(preloadEnabled)" preloadEnabled="false"></application>
            </site>
        </sites>
    </system.applicationHost>
</configuration>

The adjusted file (note the variable in the site name portion):

<?xml version="1.0"?>
<configuration xmlns:xdt="http://schemas.microsoft.com/XML-Document-Transform">
    <system.applicationHost>
        <sites>
            <site name="%XDT_SITENAME%" xdt:Locator="Match(name)">
                <application xdt:Transform="SetAttributes(preloadEnabled)" preloadEnabled="false"></application>
            </site>
        </sites>
    </system.applicationHost>
</configuration>

@NoahStahl

@nforysinski have you seen the issue recur after implementing the preload disable you mention on Feb 13? We saw the issue recur again on Feb 27 after supposedly disabling preload using the explicit site_name value (not the XDT_SITENAME variable).

@nforysinski

@NoahStahl nope, not since adjusting the xdt file to use the variable; it seems to be working well to prevent this issue.
