
Make aspnetcore shutdown work by default on Kubernetes #30387

Open
jkotalik opened this issue Feb 23, 2021 · 25 comments
Labels
api-suggestion Early API idea and discussion, it is NOT ready for implementation area-networking Includes servers, yarp, json patch, bedrock, websockets, http client factory, and http abstractions

Comments

@jkotalik
Contributor

Right now, as soon as an ASP.NET Core app sees CTRL+C, it immediately starts shutting down the server, rejecting new connections and draining existing ones. This works well when running on the command line, but when you move into k8s, requests may be dropped.

This article gives a general overview of the problem. Kubernetes, the ingress controller, and CoreDNS all need time to remove the pod's IP address from their internal state.

There are some solutions where you could have a dotnet app listen for endpoint changes within the k8s cluster. The recommended approach, though, is simply to wait after receiving a SIGTERM signal rather than shutting down immediately.

The modification would change shutdown on k8s to:

  1. Wait a bit longer before exiting
  2. Continue to process incoming traffic, even after receiving SIGTERM
  3. Shut down (drain) all connections, especially long-lived ones like DB or WebSockets
  4. Exit the process

A recommended time to wait is 15 seconds.

I think the most difficult part of this will be detecting if you are running inside of k8s. Besides that, a Task.Delay seems like an appropriate solution.

@davidfowl
Member

I don't think we should detect if you're in k8s; it should just be a setting that we recommend people set in k8s. That would be the least surprising thing to do. Likely something akin to ShutdownTimeout, or maybe we can repurpose it?

Should we throw if this setting is < the shutdown timeout? Are those 2 things configured in the same place? I think this should be implemented in the GenericWebHostedService.StopAsync.
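For context, the existing ShutdownTimeout knob lives on HostOptions and can be configured like this (a minimal sketch; the 30-second value is arbitrary, and any future k8s delay would be a separate setting layered on top):

```csharp
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Hosting;

var builder = Host.CreateApplicationBuilder(args);

// ShutdownTimeout caps how long StopAsync may take once shutdown has begun.
// It does not delay the start of shutdown, which is what this issue proposes.
builder.Services.Configure<HostOptions>(options =>
    options.ShutdownTimeout = TimeSpan.FromSeconds(30));

var host = builder.Build();
host.Run();
```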

@BrennanConroy BrennanConroy added the api-suggestion Early API idea and discussion, it is NOT ready for implementation label Feb 24, 2021
@BrennanConroy BrennanConroy added this to the Next sprint planning milestone Feb 24, 2021
@ghost

ghost commented Feb 24, 2021

Thanks for contacting us.
We're moving this issue to the Next sprint planning milestone for future evaluation / consideration. We will evaluate the request when we are planning the work for the next milestone. To learn more about what to expect next and how this issue will be handled you can read more about our triage process here.

@halter73
Member

I don't think we should detect if you're in k8s, it should just be a setting that we recommend people set in k8s.

If we're going to require a specific action from developers to enable this, why not just recommend doing the following, as suggested in https://learnk8s.io/graceful-shutdown?

      lifecycle:
        preStop:
          exec:
            command: ["sleep", "15"]

From reading that article, this seems like a common problem. Do we know if any other frameworks continue processing new requests and accepting new connections after receiving a SIGTERM? Or if they even have the option?

@davidfowl
Member

If we're going to require a specific action from developers to enable this, why not just recommend doing the following, as suggested in https://learnk8s.io/graceful-shutdown?

We do right now.

From reading that article, this seems like a common problem. Do we know if any other frameworks continue processing new requests and accepting new connections after receiving a SIGTERM? Or if they even have the option?

Not sure, but I think we should build it in because it's simple and solves a common problem in a core scenario.

@ghost

ghost commented Aug 3, 2021

We've moved this issue to the Backlog milestone. This means that it is not going to be worked on for the coming release. We will reassess the backlog following the current release and consider this item at that time. To learn more about our issue management process, and to have better expectations regarding different types of issues, you can read our Triage Process.

@cdlliuy

cdlliuy commented Mar 29, 2022

From reading that article, this seems like a common problem. Do we know if any other frameworks continue processing new requests and accepting new connections after receiving a SIGTERM? Or if they even have the option?

Not sure, but I think we should build it in because it's simple and solves a common problem in a core scenario.

Per my knowledge of Go: when SIGTERM is handled, the Go web server stops accepting new connections and waits for existing connections to complete. Once there are no active connections, the server shuts down safely.

I think it would be helpful for ASP.NET customers if it provided a similar capability. Currently, in our own application, we plan to add custom middleware that counts active HTTP connections (count++ for an incoming request, count-- when the request is done) and holds the ApplicationStopping handler until the connection count is 0.
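A sketch of the counting middleware described above (all names here are invented for illustration; note that ASP.NET Core's built-in shutdown already drains in-flight requests):

```csharp
using Microsoft.AspNetCore.Http;

// Thread-safe counter of in-flight requests, shared as a singleton.
public sealed class ActiveRequestCounter
{
    private int _count;
    public int Count => Volatile.Read(ref _count);
    public void Increment() => Interlocked.Increment(ref _count);
    public void Decrement() => Interlocked.Decrement(ref _count);
}

public sealed class CountingMiddleware(RequestDelegate next, ActiveRequestCounter counter)
{
    public async Task InvokeAsync(HttpContext context)
    {
        counter.Increment();             // count++ for the incoming request
        try { await next(context); }
        finally { counter.Decrement(); } // count-- when the request is done
    }
}
```

An ApplicationStopping handler could then poll `counter.Count` until it reaches zero; the middleware would be registered with `AddSingleton<ActiveRequestCounter>()` and `UseMiddleware<CountingMiddleware>()`.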

@davidfowl
Member

I think it would be helpful for ASP.NET customers if it provided a similar capability. Currently, in our own application, we plan to add custom middleware that counts active HTTP connections (count++ for an incoming request, count-- when the request is done) and holds the ApplicationStopping handler until the connection count is 0.

ASP.NET Core does this already. You can verify this by having a connection that does a Task.Delay for 10 seconds and watching the behavior on shutdown.

The comment above describes introducing a small delay before we start the shutdown process, which you can do today in two ways: via a preStop hook in the pod spec, or by delaying in an ApplicationStopping handler.

@davidfowl davidfowl modified the milestones: Backlog, .NET 8 Planning Aug 31, 2022
@davidfowl
Member

Moving this into .NET 8 planning. We can make some small changes here to make this work a bit better than it does today in k8s (it can even be opt-in).

@davidfowl
Member

Graceful shutdown here is a bit of a misnomer because we already shut down gracefully today. This is about dealing with the lack of coordination between signaling that the pod should be shut down and spinning up a new pod. In a different reality, the new pod would be spun up before sending SIGTERM, but it happens in parallel, leading to this mess 😄

@andrey-noskov

Indeed, the mentioned delay is needed to bring some order to a process where two actions are executed in parallel: (1) k8s removes the pod from the service Endpoints (from the LoadBalancer), and (2) the pod is signaled.
In addition, while the delay is active, I think it would be better for the pod to start gracefully closing connections, as on Windows (as of today) the first action also results in blackholing the already-established connections to the pod.

@karolz-ms
Member

Graceful shutdown here is a bit of a misnomer because we already shut down gracefully today.

Well, not quite. When SIGTERM arrives, the app stops accepting new requests, but existing requests are allowed to run to completion. This makes sense, except that if the request handlers do not finish within the timeout period, they will be aborted. This also makes sense, but by default the request handlers are not "told" that the app is stopping. If you care about cancelling request work cleanly, you need to either:

  • handle ApplicationStopping explicitly, or
  • combine HttpContext.RequestAborted with ApplicationStopping, e.g. via custom middleware

Either way you have some work to do to be really graceful ☺️ Maybe there is something ASP.NET could do to make it easier?
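The second bullet could be sketched roughly like this (a hypothetical middleware, not an ASP.NET API; downstream code that honors HttpContext.RequestAborted then also observes app shutdown):

```csharp
using Microsoft.AspNetCore.Http;
using Microsoft.Extensions.Hosting;

public sealed class ShutdownAwareMiddleware(RequestDelegate next, IHostApplicationLifetime lifetime)
{
    public async Task InvokeAsync(HttpContext context)
    {
        // Cancels when either the client aborts OR the app begins stopping.
        using var linked = CancellationTokenSource.CreateLinkedTokenSource(
            context.RequestAborted, lifetime.ApplicationStopping);

        // Swap in the linked token so handlers that pass RequestAborted
        // to async work get cancelled promptly on shutdown.
        context.RequestAborted = linked.Token;
        await next(context);
    }
}
```

It would be registered early in the pipeline with `app.UseMiddleware<ShutdownAwareMiddleware>()`.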

This is about dealing with the lack of coordination between signaling that the pod should be shut down and spinning up a new pod. In a different reality, the new pod would be spun up before sending SIGTERM, but it happens in parallel, leading to this mess 😄

No, this is about the inevitable (as in "by design") delay between K8s signaling the pod to shut down via SIGTERM and the ingress stopping sending requests to that pod.

Handling this problem via preStop lifecycle hook has the downside that one needs to have this setting explicitly added to every pod spec. It is a hassle. Yes, you can get fancy with labels and mutating admission webhooks, but it is yet another thing you need to remember to do, and K8s is not exactly the simplest thing. It also means you need to make sure that the image contains the executable used for the hook (problematic if you are trying to keep the images minimal), or (in case of HTTP-based hooks) that you implement the relevant endpoint for the hook.

I'd argue a better option is to introduce a mechanism for a delay between SIGTERM arriving and ApplicationStopping being raised (it is the "sleep in ApplicationStopping" idea #30387 (comment), but better ☺️ ). I have a proof-of-concept for this in the shape of a custom IHostLifetime component: a clone of the default ConsoleLifetime that adds such a delay.
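For illustration only (this is not the PoC referenced above), a minimal delaying lifetime could look roughly like this, assuming .NET 6+ PosixSignalRegistration. The class name and the 15-second delay are invented for the sketch, and it handles only SIGTERM:

```csharp
using System.Runtime.InteropServices;
using Microsoft.Extensions.Hosting;

// Hypothetical sketch: delay the start of shutdown after SIGTERM.
public sealed class DelayedShutdownLifetime(IHostApplicationLifetime lifetime)
    : IHostLifetime, IDisposable
{
    private static readonly TimeSpan Delay = TimeSpan.FromSeconds(15); // illustrative value
    private PosixSignalRegistration? _sigterm;

    public Task WaitForStartAsync(CancellationToken cancellationToken)
    {
        _sigterm = PosixSignalRegistration.Create(PosixSignal.SIGTERM, context =>
        {
            context.Cancel = true; // suppress the default immediate termination
            // Keep serving while the rollout updates routing, then stop.
            Task.Delay(Delay).ContinueWith(_ => lifetime.StopApplication());
        });
        return Task.CompletedTask;
    }

    public Task StopAsync(CancellationToken cancellationToken) => Task.CompletedTask;

    public void Dispose() => _sigterm?.Dispose();
}
```

Registering it via `builder.Services.AddSingleton<IHostLifetime, DelayedShutdownLifetime>()` replaces the default ConsoleLifetime, so CTRL+C handling would need to be added back for local use.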

@tmds
Member

tmds commented Apr 13, 2023

Based on this issue and dotnet/dotnet-docker#4502.

Chronologically, this seems to be the desired behavior:

After SIGTERM:

  1. some time* to still accept requests, then stop accepting new requests
  2. some time to answer all on-going requests
  3. cancel the token for all still on-going requests
  4. wait for all requests to complete, or for ShutdownTimeout (whichever happens sooner)

Maybe: during steps 1 and 2, new requests should be answered immediately in a way that takes the SIGTERM into account, for example by replying 503 Service Unavailable. This could make it unnecessary to stop accepting new requests (step 1).

Then there are two timeouts: ShutdownAcceptConnectionsTimeout (for step 1) and ShutdownCancelRequestsTimeout (for step 3).

*: this time should be chosen based on when Kubernetes is expected to stop sending requests to this pod after SIGTERM.
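The "maybe" above could be sketched as middleware (hypothetical names; assumes the instance keeps accepting connections while draining):

```csharp
using Microsoft.AspNetCore.Http;
using Microsoft.Extensions.Hosting;

public sealed class DrainingMiddleware(RequestDelegate next, IHostApplicationLifetime lifetime)
{
    public async Task InvokeAsync(HttpContext context)
    {
        // Once shutdown has started, short-circuit new requests with 503
        // instead of refusing the connection outright.
        if (lifetime.ApplicationStopping.IsCancellationRequested)
        {
            context.Response.StatusCode = StatusCodes.Status503ServiceUnavailable;
            context.Response.Headers["Retry-After"] = "5"; // hint to retry elsewhere
            return;
        }
        await next(context);
    }
}
```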

@andrey-noskov

andrey-noskov commented Apr 13, 2023

I believe the application should continue to accept new requests as long as Kubernetes continues to send them. So I think the behaviour should be:

  1. give Kubernetes some time to update the routing layer while acting normally, i.e., add the shutdown delay
  2. then wait for all requests to be completed, or for ShutdownTimeout, i.e., preserve the current behaviour of the Host

@hwoodiwiss
Contributor

We've run into this recently on some high-volume applications; after moving to k8s, we noticed a small amount of traffic being lost during deployments.

We've implemented a fix based on recommendations from https://github.com/dotnet/dotnet-docker/blob/main/samples/kubernetes/graceful-shutdown/graceful-shutdown.md, which has us following the sequence of events that @tmds described.

My totally biased opinion is that this would be a good candidate for .NET 8, as it hit us rather unexpectedly; I think we had a bit of a preconception that this would "just work".

If this was something to be considered soon, I guess two options for implementation could be to either:

  • Add a new IHostLifetime implementation OOTB that can be opted into
  • OR add a TimeSpan? ShutdownDelay (or some other name) on HostOptions and make use of it in the various shutdown handlers in the netcoreapp and notnetcoreapp implementations, with no delay if ShutdownDelay is null

@davidfowl
Member

I still like the shutdown delay idea.

@hwoodiwiss
Contributor

hwoodiwiss commented May 31, 2023

I still like the shutdown delay idea.

FWIW, we essentially just reused ShutdownTimeout in what we've implemented, so you get one ShutdownTimeout of time still taking requests, and another for finishing in-flight requests.

TimeSpan? ShutdownDelay feels like a nice solution to me: it makes the behaviour opt-in and means it'll be configurable in a familiar way.

@andrey-noskov

I also like the nullable ShutdownDelay option

@hwoodiwiss
Contributor

I guess one question worth asking is where in the shutdown process the delay should be added.

The host lifetime feels like the most "natural" place, as there the shutdown of all services can be delayed uniformly, but that also has the most wide-reaching implications and would add a soft requirement on implementers of IHostLifetime to support the delay; otherwise consumers of custom IHostLifetime implementations would have unmet expectations. (I've PoC'd that here to help me think about it.)

Another option would be to delay in WebHost at the point in StopAsync where Server.StopAsync is called, but then, even if there aren't other IHostedServices that ASP.NET relies on that would be shut down, applications with their own hosted services would need to build the delay in as well.

@amcasey amcasey added area-networking Includes servers, yarp, json patch, bedrock, websockets, http client factory, and http abstractions and removed area-runtime labels Aug 24, 2023
@davidfowl
Member

@hwoodiwiss I love it.

@karolz-ms You wrote a doc about this recently right?

@karolz-ms
Member

Yes, https://github.com/dotnet/dotnet-docker/blob/main/samples/kubernetes/graceful-shutdown/graceful-shutdown.md is the guidance that @richlander and myself came up with.

@hwoodiwiss
Contributor

@karolz-ms My PoC implementation and the one we've used internally is heavily inspired by this guidance, thank you!

I'm happy to raise a PR based on my PoC, at least as a talking point, and I can flesh it out with tests and an implementation in ConsoleLifetime.notnetcoreapp.cs. This would be great to have OOTB for .NET 9.

@davidfowl
Member

I'm happy to raise a PR based on my PoC, at least as a talking point, and I can flesh it out with tests and an implementation in ConsoleLifetime.notnetcoreapp.cs. This would be great to have OOTB for .NET 9.

Let's get it into .NET 9!

@jorgepsmatos

jorgepsmatos commented Jan 16, 2024

I believe we are seeing this issue on our services, but would like a way to reproduce it consistently. Any suggestions on how to do it?

EDIT: Found this blog post that helps with reproducing this behaviour https://blog.markvincze.com/graceful-termination-in-kubernetes-with-asp-net-core/

@ArminShoeibi

Yes, https://github.com/dotnet/dotnet-docker/blob/main/samples/kubernetes/graceful-shutdown/graceful-shutdown.md is the guidance that @richlander and myself came up with.

Hi there @karolz-ms, does that method work better than this approach?

public class ApplicationLifetimeHostedService(
    IHostApplicationLifetime hostApplicationLifetime,
    ILogger<ApplicationLifetimeHostedService> logger) : BackgroundService
{
    protected override Task ExecuteAsync(CancellationToken stoppingToken)
    {
        // Block the ApplicationStopping callback so the rest of the
        // shutdown sequence is postponed by 90 seconds.
        hostApplicationLifetime.ApplicationStopping.Register(() =>
        {
            logger.LogInformation("SIGTERM received, waiting for 90 seconds");
            Thread.Sleep(TimeSpan.FromSeconds(90));
            logger.LogInformation("Termination delay complete, continuing stopping process");
        });
        return Task.CompletedTask;
    }
}

@karolz-ms
Member

@ArminShoeibi setting the ShutdownTimeout or having the ApplicationStopping handler block for a certain amount of time are essentially equivalent. Both will give in-flight requests extra time to finish. But they do not help with new requests that arrive AFTER Kubernetes sent the replica a SIGTERM, but BEFORE the ingress has updated its routing tables. To allow these “late” requests to be processed and avoid 502 errors, you need to apply a custom IHostLifetime implementation. This is all described in more detail in the guide you referenced, specifically in the https://github.com/dotnet/dotnet-docker/blob/main/samples/kubernetes/graceful-shutdown/graceful-shutdown.md#adding-a-shutdown-delay section.
