
Make aspnetcore shutdown work by default on Kubernetes #30387

Open
jkotalik opened this issue Feb 23, 2021 · 25 comments
Labels
api-suggestion Early API idea and discussion, it is NOT ready for implementation area-networking Includes servers, yarp, json patch, bedrock, websockets, http client factory, and http abstractions

Comments

@jkotalik
Contributor

Right now, as soon as an ASP.NET Core app sees CTRL+C, it immediately starts shutting down the server, rejecting new connections and draining existing ones. This works well when running on the command line, but when you move into k8s, requests may be dropped.

This article gives a general overview of the problem. Kubernetes, the ingress controller, and CoreDNS all need time to remove the pod's IP address from their internal state.

There are some solutions where you could have a dotnet app listen for endpoint changes within the k8s cluster. The recommended approach, though, is simply to wait after receiving a SIGTERM signal rather than shutting down immediately.

The modification would change shutdown on k8s to:

  1. Wait a bit longer before exiting
  2. Continue to process incoming traffic, even after receiving SIGTERM
  3. Shut down (drain) all connections, especially long-lived ones like DB or WebSockets
  4. Exit the process

A recommended time to wait is 15 seconds.

I think the most difficult part of this will be detecting if you are running inside of k8s. Besides that, a Task.Delay seems like an appropriate solution.

@davidfowl
Member

I don't think we should detect if you're in k8s; it should just be a setting that we recommend people set in k8s. That would be the least surprising thing to do. Likely something akin to ShutdownTimeout, or maybe we can repurpose it?

Should we throw if this setting is < the shutdown timeout? Are those 2 things configured in the same place? I think this should be implemented in the GenericWebHostedService.StopAsync.
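For context, the existing ShutdownTimeout knob lives on HostOptions and can be configured like this (a minimal sketch; the 30-second value is arbitrary, and any future k8s delay would be a separate setting layered on top):

```csharp
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Hosting;

var builder = Host.CreateApplicationBuilder(args);

// ShutdownTimeout caps how long StopAsync may take once shutdown has begun.
// It does not delay the start of shutdown, which is what this issue proposes.
builder.Services.Configure<HostOptions>(options =>
    options.ShutdownTimeout = TimeSpan.FromSeconds(30));

var host = builder.Build();
host.Run();
```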

@BrennanConroy BrennanConroy added the api-suggestion Early API idea and discussion, it is NOT ready for implementation label Feb 24, 2021
@BrennanConroy BrennanConroy added this to the Next sprint planning milestone Feb 24, 2021
@ghost

ghost commented Feb 24, 2021

Thanks for contacting us.
We're moving this issue to the Next sprint planning milestone for future evaluation / consideration. We will evaluate the request when we are planning the work for the next milestone. To learn more about what to expect next and how this issue will be handled you can read more about our triage process here.

@halter73
Member

I don't think we should detect if you're in k8s, it should just be a setting that we recommend people set in k8s.

If we're going to require a specific action from developers to enable this, why not just recommend doing the following, as suggested in https://learnk8s.io/graceful-shutdown?

      lifecycle:
        preStop:
          exec:
            command: ["sleep", "15"]

From reading that article, this seems like a common problem. Do we know if any other frameworks continue processing new requests and accepting new connections after receiving a SIGTERM? Or if they even have the option?

@davidfowl
Member

If we're going to require a specific action from developers to enable this, why not just recommend doing the following, as suggested in https://learnk8s.io/graceful-shutdown?

We do right now.

From reading that article, this seems like a common problem. Do we know if any other frameworks continue processing new requests and accepting new connections after receiving a SIGTERM? Or if they even have the option?

Not sure, but I think we should build it in because it's simple and solves a common problem in a core scenario.

@ghost

ghost commented Aug 3, 2021

We've moved this issue to the Backlog milestone. This means that it is not going to be worked on for the coming release. We will reassess the backlog following the current release and consider this item at that time. To learn more about our issue management process, and to have better expectations regarding different types of issues, you can read our Triage Process.

@cdlliuy

cdlliuy commented Mar 29, 2022

From reading that article, this seems like a common problem. Do we know if any other frameworks continue processing new requests and accepting new connections after receiving a SIGTERM? Or if they even have the option?

Not sure, but I think we should build it in because it's simple and solves a common problem in a core scenario.

Per my knowledge of Go: when SIGTERM is handled, the Go web server stops accepting new connections and waits for existing connections to complete. Once there are no active connections, the server shuts down safely.

I think it would be helpful for ASP.NET customers if it provided a similar capability. Currently, in our own application, we plan to add custom middleware that counts active HTTP connections (count++ for an incoming request, count-- when the request is done) and holds the ApplicationStopping handler until the connection count is 0.
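A sketch of the counting middleware described above (all names here are invented for illustration; note that ASP.NET Core's built-in shutdown already drains in-flight requests):

```csharp
using Microsoft.AspNetCore.Http;

// Thread-safe counter of in-flight requests, shared as a singleton.
public sealed class ActiveRequestCounter
{
    private int _count;
    public int Count => Volatile.Read(ref _count);
    public void Increment() => Interlocked.Increment(ref _count);
    public void Decrement() => Interlocked.Decrement(ref _count);
}

public sealed class CountingMiddleware(RequestDelegate next, ActiveRequestCounter counter)
{
    public async Task InvokeAsync(HttpContext context)
    {
        counter.Increment();             // count++ for the incoming request
        try { await next(context); }
        finally { counter.Decrement(); } // count-- when the request is done
    }
}
```

An ApplicationStopping handler could then poll `counter.Count` until it reaches zero; the middleware would be registered with `AddSingleton<ActiveRequestCounter>()` and `UseMiddleware<CountingMiddleware>()`.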

@davidfowl
Member

I think it would be helpful for ASP.NET customers if it provided a similar capability. Currently, in our own application, we plan to add custom middleware that counts active HTTP connections (count++ for an incoming request, count-- when the request is done) and holds the ApplicationStopping handler until the connection count is 0.

ASP.NET Core does this already. You can verify this by having a connection that does a Task.Delay for 10 seconds and watching the behavior on shutdown.

The comment above describes introducing a small delay before we start the shutdown process, which you can do today in two ways: via a preStop hook in the pod spec, or by delaying in an ApplicationStopping handler.

@davidfowl davidfowl modified the milestones: Backlog, .NET 8 Planning Aug 31, 2022
@davidfowl
Member

Moving this into .NET 8 planning. We can make some small changes here to make this work a bit better than it does today in k8s (it can even be opt-in).

@davidfowl
Member

Graceful shutdown here is a bit of a misnomer because we already shut down gracefully today. This is about dealing with the lack of coordination between signaling that the pod should be shut down and spinning up a new pod. In a different reality, the new pod would be spun up before sending SIGTERM, but it happens in parallel, leading to this mess 😄

@andrey-noskov

Indeed, the mentioned delay is needed to bring some order to a process where two actions are executed in parallel: (1) k8s removes the pod from the service Endpoints (from the LoadBalancer), and (2) the pod is signaled.
In addition, while the delay is active, I think it would be better for the pod to start gracefully closing connections, as on Windows (as of today) the first action also results in blackholing the already-established connections to the pod.

@karolz-ms
Member

Graceful shutdown here is a bit of a misnomer because we already shut down gracefully today.

Well, not quite. When SIGTERM arrives, the app stops accepting new requests, but existing requests are allowed to run to completion. This makes sense, except that if the request handlers do not finish within the timeout period, they will be aborted. This also makes sense, but by default the request handlers are not "told" that the app is stopping. If you care about cancelling request work cleanly, you need to either:

  • handle ApplicationStopping explicitly, or
  • combine HttpContext.RequestAborted with ApplicationStopping, e.g. via custom middleware

Either way you have some work to do to be really graceful ☺️ Maybe there is something ASP.NET could do to make it easier?
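The second bullet could be sketched roughly like this (a hypothetical middleware, not an ASP.NET API; downstream code that honors HttpContext.RequestAborted then also observes app shutdown):

```csharp
using Microsoft.AspNetCore.Http;
using Microsoft.Extensions.Hosting;

public sealed class ShutdownAwareMiddleware(RequestDelegate next, IHostApplicationLifetime lifetime)
{
    public async Task InvokeAsync(HttpContext context)
    {
        // Cancels when either the client aborts OR the app begins stopping.
        using var linked = CancellationTokenSource.CreateLinkedTokenSource(
            context.RequestAborted, lifetime.ApplicationStopping);

        // Swap in the linked token so handlers that pass RequestAborted
        // to async work get cancelled promptly on shutdown.
        context.RequestAborted = linked.Token;
        await next(context);
    }
}
```

It would be registered early in the pipeline with `app.UseMiddleware<ShutdownAwareMiddleware>()`.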

This is about dealing with the lack of coordination between signaling that the pod should be shut down and spinning up a new pod. In a different reality, the new pod would be spun up before sending SIGTERM, but it happens in parallel, leading to this mess 😄

No, this is about the inevitable (as in "by design") delay between K8s signaling the pod to shut down via SIGTERM and the ingress stopping sending requests to that pod.

Handling this problem via preStop lifecycle hook has the downside that one needs to have this setting explicitly added to every pod spec. It is a hassle. Yes, you can get fancy with labels and mutating admission webhooks, but it is yet another thing you need to remember to do, and K8s is not exactly the simplest thing. It also means you need to make sure that the image contains the executable used for the hook (problematic if you are trying to keep the images minimal), or (in case of HTTP-based hooks) that you implement the relevant endpoint for the hook.

I'd argue a better option is to introduce a mechanism for a delay between SIGTERM arriving and ApplicationStopping being raised (it is the "sleep in ApplicationStopping" idea #30387 (comment), but better ☺️ ). I have a proof-of-concept for this in the shape of a custom IHostLifetime component: a clone of the default ConsoleLifetime that adds such a delay.
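For illustration only (this is not the PoC referenced above), a minimal delaying lifetime could look roughly like this, assuming .NET 6+ PosixSignalRegistration. The class name and the 15-second delay are invented for the sketch, and it handles only SIGTERM:

```csharp
using System.Runtime.InteropServices;
using Microsoft.Extensions.Hosting;

// Hypothetical sketch: delay the start of shutdown after SIGTERM.
public sealed class DelayedShutdownLifetime(IHostApplicationLifetime lifetime)
    : IHostLifetime, IDisposable
{
    private static readonly TimeSpan Delay = TimeSpan.FromSeconds(15); // illustrative value
    private PosixSignalRegistration? _sigterm;

    public Task WaitForStartAsync(CancellationToken cancellationToken)
    {
        _sigterm = PosixSignalRegistration.Create(PosixSignal.SIGTERM, context =>
        {
            context.Cancel = true; // suppress the default immediate termination
            // Keep serving while the rollout updates routing, then stop.
            Task.Delay(Delay).ContinueWith(_ => lifetime.StopApplication());
        });
        return Task.CompletedTask;
    }

    public Task StopAsync(CancellationToken cancellationToken) => Task.CompletedTask;

    public void Dispose() => _sigterm?.Dispose();
}
```

Registering it via `builder.Services.AddSingleton<IHostLifetime, DelayedShutdownLifetime>()` replaces the default ConsoleLifetime, so CTRL+C handling would need to be added back for local use.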

@tmds
Member

tmds commented Apr 13, 2023

Based on this issue and dotnet/dotnet-docker#4502.

Chronologically, this seems to be the desired behavior:

After SIGTERM:

  1. some time* to still accept requests, then stop accepting new requests
  2. some time to answer all on-going requests
  3. cancel the token for all still on-going requests
  4. wait for all requests to complete, or for ShutdownTimeout (whichever happens sooner)

Maybe: during steps 1 and 2, new requests should be answered immediately in a way that takes the SIGTERM into account, for example by replying 503 Service Unavailable. This could make it unnecessary to stop accepting new requests (step 1).

Then there are two timeouts: ShutdownAcceptConnectionsTimeout (for step 1) and ShutdownCancelRequestsTimeout (for step 3).

*: this time should be chosen based on when Kubernetes is expected to stop sending requests to this pod after SIGTERM.
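The "maybe" above could be sketched as middleware (hypothetical names; assumes the instance keeps accepting connections while draining):

```csharp
using Microsoft.AspNetCore.Http;
using Microsoft.Extensions.Hosting;

public sealed class DrainingMiddleware(RequestDelegate next, IHostApplicationLifetime lifetime)
{
    public async Task InvokeAsync(HttpContext context)
    {
        // Once shutdown has started, short-circuit new requests with 503
        // instead of refusing the connection outright.
        if (lifetime.ApplicationStopping.IsCancellationRequested)
        {
            context.Response.StatusCode = StatusCodes.Status503ServiceUnavailable;
            context.Response.Headers["Retry-After"] = "5"; // hint to retry elsewhere
            return;
        }
        await next(context);
    }
}
```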

@andrey-noskov

andrey-noskov commented Apr 13, 2023

I believe the application should continue to accept new requests as long as Kubernetes continues to send them. So I think the behaviour should be:

  1. give Kubernetes some time to update the routing layer while acting normally, i.e., add the shutdown delay
  2. then wait for all requests to be completed, or for ShutdownTimeout, i.e., preserve the current behaviour of the Host

@hwoodiwiss
Contributor

We've run into this recently on some high-volume applications; after moving to k8s, we noticed a small amount of traffic being lost during deployments.

We've implemented a fix based on recommendations from https://github.com/dotnet/dotnet-docker/blob/main/samples/kubernetes/graceful-shutdown/graceful-shutdown.md, which has us following the sequence of events that @tmds described.

My totally biased opinion is that this would be a good candidate for .NET 8, as it hit us rather unexpectedly; I think we had a bit of a preconception that this would "just work".

If this was something to be considered soon, I guess two options for implementation could be to either:

  • Add a new IHostLifetime implementation OOTB that can be opted into
  • OR add a TimeSpan? ShutdownDelay (or some other name) on HostOptions and make use of it in the various shutdown handlers in the netcoreapp and notnetcoreapp implementations, with no delay if ShutdownDelay is null

@davidfowl
Member

I still like the shutdown delay idea.

@hwoodiwiss
Contributor

hwoodiwiss commented May 31, 2023

I still like the shutdown delay idea.

FWIW, we essentially just reused ShutdownTimeout in what we've implemented, so you get one ShutdownTimeout of time still taking requests, and another for finishing in-flight requests.

TimeSpan? ShutdownDelay feels like a nice solution to me: it makes the behaviour opt-in and means it'll be configurable in a familiar way.

@andrey-noskov

I also like the nullable ShutdownDelay option

@hwoodiwiss
Contributor

I guess one question worth asking is where in the shutdown process the delay should be added.

The host lifetime feels like the most "natural" place, as there the shutdown of all services can be delayed uniformly, but that also has the most wide-reaching implications and would add a soft requirement on implementers of IHostLifetime to support the delay; otherwise consumers of custom IHostLifetime implementations would have unmet expectations. (I've PoC'd that here to help me think about it.)

Another option would be to delay in WebHost at the point in StopAsync where Server.StopAsync is called, but then, even if there aren't other IHostedServices that ASP.NET relies on that would be shut down, applications with their own hosted services would need to build the delay in as well.

@amcasey amcasey added area-networking Includes servers, yarp, json patch, bedrock, websockets, http client factory, and http abstractions and removed area-runtime labels Aug 24, 2023
@davidfowl
Member

@hwoodiwiss I love it.

@karolz-ms You wrote a doc about this recently right?

@karolz-ms
Member

Yes, https://github.com/dotnet/dotnet-docker/blob/main/samples/kubernetes/graceful-shutdown/graceful-shutdown.md is the guidance that @richlander and myself came up with.

@hwoodiwiss
Contributor

@karolz-ms My PoC implementation and the one we've used internally is heavily inspired by this guidance, thank you!

I'm happy to raise a PR based on my PoC, at least as a talking point, and I can flesh it out with tests and an implementation in ConsoleLifetime.notnetcoreapp.cs. This would be great to have OOTB for .NET 9.

@davidfowl
Member

I'm happy to raise a PR based on my PoC, at least as a talking point, and I can flesh it out with tests and an implementation in ConsoleLifetime.notnetcoreapp.cs. This would be great to have OOTB for .NET 9.

Let's get it into .NET 9!

@jorgepsmatos

jorgepsmatos commented Jan 16, 2024

I believe we are seeing this issue on our services, but would like a way to reproduce it consistently. Any suggestions on how to do it?

EDIT: Found this blog post that helps with reproducing this behaviour https://blog.markvincze.com/graceful-termination-in-kubernetes-with-asp-net-core/

@ArminShoeibi

Yes, https://github.com/dotnet/dotnet-docker/blob/main/samples/kubernetes/graceful-shutdown/graceful-shutdown.md is the guidance that @richlander and myself came up with.

Hi there @karolz-ms, does that method work better than this approach?

public class ApplicationLifetimeHostedService(
    IHostApplicationLifetime hostApplicationLifetime,
    ILogger<ApplicationLifetimeHostedService> logger) : BackgroundService
{
    protected override Task ExecuteAsync(CancellationToken stoppingToken)
    {
        // Block the ApplicationStopping callback so the rest of the
        // shutdown sequence is postponed by 90 seconds.
        hostApplicationLifetime.ApplicationStopping.Register(() =>
        {
            logger.LogInformation("SIGTERM received, waiting for 90 seconds");
            Thread.Sleep(TimeSpan.FromSeconds(90));
            logger.LogInformation("Termination delay complete, continuing stopping process");
        });
        return Task.CompletedTask;
    }
}

@karolz-ms
Member

@ArminShoeibi setting the ShutdownTimeout or having the ApplicationStopping handler block for a certain amount of time are essentially equivalent. Both will give in-flight requests extra time to finish. But they do not help with new requests that arrive AFTER Kubernetes sent the replica a SIGTERM, but BEFORE the ingress has updated its routing tables. To allow these “late” requests to be processed and avoid 502 errors, you need to apply a custom IHostLifetime implementation. This is all described in more detail in the guide you referenced, specifically in the https://github.com/dotnet/dotnet-docker/blob/main/samples/kubernetes/graceful-shutdown/graceful-shutdown.md#adding-a-shutdown-delay section.
