Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[API Proposal]: When stopping an IHost instance, IHostedService instances StopAsync method is called sequentially instead of asynchronously #68036

Closed
Tracked by #77390
jerryk414 opened this issue Apr 14, 2022 · 25 comments · Fixed by #84048
Assignees
Labels
api-approved API was approved in API review, it can be implemented area-Extensions-Hosting tenet-performance Performance related issue
Milestone

Comments

@jerryk414
Copy link
Contributor

jerryk414 commented Apr 14, 2022

Background and motivation

As it is right now, when an IHost is asked to StopAsync for graceful shutdown, each IHostedService is requested to StopAsync in sequential order. There's no reason why each IHostedService.StopAsync task can't be executed asynchronously and awaited altogether.

Why is this a problem? Well technically everything is functioning to spec, but in practice, it's problematic.

So normally the idea of being able to set the ShutdownTimeout is great.. we can set it to say, 30 seconds, and say we expect all IHostedService instances to take no longer than 30 seconds to shutdown gracefully. The problem is that, this is an additive timeout. That is to say, if you have 10 IHostedService instances, and each one of them takes 20 seconds to shutdown, your graceful shutdown now takes 200 seconds. Why do that when you can just have everything shutdown asynchronously in 20 seconds?

One example of the problem with this is that, when deploying apps to ECS in AWS, when ECS triggers an instance of an app to shutdown (which occurs under normal circumstances such as when scaling down), the maximum amount of time it gives for the app to gracefully shutdown is 120 seconds (see stopTimeout), after that it just kills the process.

There's no reason that i can see that the IHostedService instances should be shutdown sequentially, they are async for a reason, and if shutdown order is important, than it should be managed by the consumer themselves, or at the very least, there should be an option for asynchronous IHostedService shutdown.

Regression?

This is not a regression

Analysis

Link to code causing the issue:
https://github.com/dotnet/runtime/blob/main/src/libraries/Microsoft.Extensions.Hosting/src/Internal/Host.cs#L126

This code could easily be modified to something similar to:

HashSet<Task> tasks = new HashSet<Task>();

foreach (IHostedService hostedService in _hostedServices.Reverse()) /* Why does this need to be reversed anyways? */
{
   tasks.Add(hostedService.StopAsync(token));
}

// WhenAll would be great here, but it will swallow up exceptions.. so this is the next best thing
foreach (Task task in tasks)
{
   try
   {
      await task.ConfigureAwait(false);
   }
   catch (Exception ex)
   {
      exceptions.Add(ex);
   }
}

API Proposal

    public class HostOptions
    {
+        public HostedServiceStopBehavior HostedServiceStopBehavior { get; set; } =
+            HostedServiceStopBehavior.Asynchronous;

        public BackgroundServiceExceptionBehavior BackgroundServiceExceptionBehavior { get; set; } =
            BackgroundServiceExceptionBehavior.StopHost;
    }
    
+    /// <summary>
+    /// Specifies a behavior that the <see cref="IHost"/> will honor when stopping <see cref="IHostedService"/> instances.
+    /// </summary>
+    public enum HostedServiceStopBehavior
+    {
+        /// <summary>
+        /// Indicates that the host will stop the <see cref="IHostedService"/> instances asynchronously.
+        /// </summary>
+        Asynchronous = 0,
+
+        /// <summary>
+        /// Indicates that the host will wait for the previous <see cref="IHostedService"/> instance to stop before stopping the next instance.
+        /// </summary>
+        Sequential = 1
+    }

API Usage

   public static IHostBuilder CreateWebHostBuilder(string[] args)
   {
      return Host.CreateDefaultBuilder(args)
         .ConfigureServices((context, services) =>
         {
            services.Configure<HostOptions>(options => 
            {
               options.BackgroundServiceStopBehavior = BackgroundServiceStopBehavior.Sequential; // Defaults to asynchronous
            }
         });
   }

Risks

The default behavior will technically be changing, which would be a breaking change. However, it may be a "safe" breaking change in the sense that the likelihood of someone building an app which depends on this previously internal behavior is highly unlikely and IF they did happen to, there is a way to revert back to the old behavior.

@jerryk414 jerryk414 added the tenet-performance Performance related issue label Apr 14, 2022
@dotnet-issue-labeler dotnet-issue-labeler bot added area-Extensions-Hosting untriaged New issue has not been triaged by the area owner labels Apr 14, 2022
@ghost
Copy link

ghost commented Apr 14, 2022

Tagging subscribers to this area: @dotnet/area-extensions-hosting
See info in area-owners.md if you want to be subscribed.

Issue Details

Description

As it is right now, when an IHost is asked to StopAsync for graceful shutdown, each IHostedService is requested to StopAsync in sequential order. There's no reason why each IHostedService.StopAsync task can't be executed asynchronously and awaited altogether.

Why is this a problem? Well technically everything is functioning to spec, but in practice, it's problematic.

So normally the idea of being able to set the ShutdownTimeout is great.. we can set it to say, 30 seconds, and say we expect all IHostedService instances to take no longer than 30 seconds to shutdown gracefully. The problem is that, this is an additive timeout. That is to say, if you have 10 IHostedService instances, and each one of them takes 20 seconds to shutdown, your graceful shutdown now takes 200 seconds. Why do that when you can just have everything shutdown asynchronously in 20 seconds?

The real problem with this is that, when deploying apps to ECS in AWS, for example, when ECS triggers an instance of an app to shutdown, the maximum amount of time it gives for the app to gracefully shutdown is 120 seconds (see stopTimeout)e, after that it just kills the process.

There's no reason that i can see that the IHostedService instances should be shutdown sequentially, they are async for a reason, and if shutdown order is important, than it should be managed by the consumer themselves, or at the very least, there should be an option for asynchronous IHostedService shutdown.

Configuration

Regression?

I don't think it is

Data

Analysis

Link to code causing the issue:
https://github.com/dotnet/runtime/blob/main/src/libraries/Microsoft.Extensions.Hosting/src/Internal/Host.cs#L126

HashSet<Task> tasks = new HashSet<Task>();

foreach (IHostedService hostedService in _hostedServices.Reverse()) /* Why does this need to be reversed anyways? */
{
   tasks.Add(hostedService.StopAsync(token));
}

// WhenAll would be great here, but it will swallow up exceptions.. so this is the next best thing
foreach (Task task in tasks)
{
   try
   {
      await task.ConfigureAwait(false);
   }
   catch (Exception ex)
   {
      exceptions.Add(ex);
   }
}
Author: jerryk414
Assignees: -
Labels:

tenet-performance, untriaged, area-Extensions-Hosting

Milestone: -

@jerryk414 jerryk414 changed the title When stopping an IHost instance, IHostedService instances shutdown sequentially instead of asynchronously When stopping an IHost instance, IHostedService instances StopAsync method is called sequentially instead of asynchronously Apr 14, 2022
@eerhardt
Copy link
Member

cc @davidfowl - we were just discussing this a couple weeks ago.

@davidfowl
Copy link
Member

Agree we should do this and list it as a breaking change (as stop async will run concurrently now). I don't know if we need a flag to opt out of this behavior though.

@davidfowl
Copy link
Member

@jerryk414 would you be willing to send a PR for this change?

@quixoticaxis
Copy link

Agree we should do this and list it as a breaking change (as stop async will run concurrently now). I don't know if we need a flag to opt out of this behavior though.

Any knob to keep the old "started first — stopped last" would be appreciated. We use it to stop stateful in-process services which have cross dependencies.

@jerryk414
Copy link
Contributor Author

I can do that. I'll get an initial implementation done this week to see what thoughts are once the changes are visible.

@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Apr 19, 2022
@jerryk414
Copy link
Contributor Author

@davidfowl @quixoticaxis
I've added an initial PR for this to make the default behavior asynchronous, with an option for specifying the behavior to be sequential. It would still be a breaking change, but would add the ability to change the behavior back to the way it was.

There are no tests yet - I want to get input if this is the route you would prefer prior to spending time on that.

@jerryk414 jerryk414 changed the title When stopping an IHost instance, IHostedService instances StopAsync method is called sequentially instead of asynchronously [API Proposal] When stopping an IHost instance, IHostedService instances StopAsync method is called sequentially instead of asynchronously Apr 19, 2022
@jerryk414 jerryk414 changed the title [API Proposal] When stopping an IHost instance, IHostedService instances StopAsync method is called sequentially instead of asynchronously [API Proposal]: When stopping an IHost instance, IHostedService instances StopAsync method is called sequentially instead of asynchronously Apr 19, 2022
@quixoticaxis
Copy link

@davidfowl @quixoticaxis
I've added an initial PR for this to make the default behavior asynchronous, with an option for specifying the behavior to be sequential. It would still be a breaking change, but would add the ability to change the behavior back to the way it was.

There are no tests yet - I want to get input if this is the route you would prefer prior to spending time on that.

For our product it would be easy to switch back to, looks great.

@jerryk414
Copy link
Contributor Author

@davidfowl @quixoticaxis I've added an initial PR for this to make the default behavior asynchronous, with an option for specifying the behavior to be sequential. It would still be a breaking change, but would add the ability to change the behavior back to the way it was.

There are no tests yet - I want to get input if this is the route you would prefer prior to spending time on that.

@davidfowl @quixoticaxis
The PR for this is ready for review.

@eerhardt eerhardt added the api-ready-for-review API is ready for review, it is NOT ready for implementation label May 19, 2022
@eerhardt eerhardt added this to the 7.0.0 milestone May 19, 2022
@ghost ghost removed the untriaged New issue has not been triaged by the area owner label May 19, 2022
@eerhardt
Copy link
Member

I've updated the API proposal above and marked this proposal as ready for review. We will need to get the API proposal approved before we can make this change.

@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label May 19, 2022
@jerryk414
Copy link
Contributor Author

I've updated the API proposal above and marked this proposal as ready for review. We will need to get the API proposal approved before we can make this change.

Thank you @eerhardt!

@terrajobst
Copy link
Member

terrajobst commented Jun 23, 2022

Video

  • We'd prefer Concurrent over Asynchronous.
  • It seems start is currently also sequential
    • Would we want to configure this as well?
    • Does this imply that this can breaking because people have interdependencies between services?
namespace Microsoft.Extensions.Hosting;

public partial class HostOptions
{
    public bool ServicesStartConcurrently { get; set; } // = false; Maybe change in 8?
    public bool ServicesStopConcurrently { get; set; } // = false; Maybe change in 8?
}

@terrajobst terrajobst added api-approved API was approved in API review, it can be implemented and removed api-ready-for-review API is ready for review, it is NOT ready for implementation labels Jun 23, 2022
@quixoticaxis
Copy link

- Does this imply that this can breaking because people have interdependencies between services?

It would probably not only break dependencies, but also mess up the exception handling around StartAsync.

@jerryk414
Copy link
Contributor Author

jerryk414 commented Jun 24, 2022

@terrajobst
I'm presuming that based on the comment, you would want the default to be false for the time being.

I like the idea of just making it a bool since there really should only ever be two behaviors. I can work on getting this change together and updating the initial PR changes to account for this.

Just to answer this question more specifically... when hosting apps in AWS ECS, you have an absolute maximum shutdown time of 120 seconds - even if you set that higher in the host itself via ShutdownTimeout . We were running into it in a real-life production scenario where shutdown was taking longer than that. We fixed it by basically just announcing to all services that a shutdown is incoming and have them all start prepping to shutdown concurrently (if one starts shutting down, we can presume they all will be soon).

From the description in the issue:
One example of the problem with this is that, when deploying apps to ECS in AWS, when ECS triggers an instance of an app to shutdown (which occurs under normal circumstances such as when scaling down), the maximum amount of time it gives for the app to gracefully shutdown is 120 seconds (see [stopTimeout](https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task_definition_parameters.html)), after that it just kills the process.

@quixoticaxis
I do see what you're saying - as it is right now StartAsync doesn't do anything to manage multiple exceptions - it will just stop on the first failure and return that.

I can update it to behave the same as StopAsync though where it will aggregate all exceptions and just throw an AggregateException when there is more than one failure with some generic message One or more hosted services failed to start.

Basically copy what is being done here:
https://github.com/dotnet/runtime/blob/main/src/libraries/Microsoft.Extensions.Hosting/src/Internal/Host.cs#L156

@quixoticaxis
Copy link

quixoticaxis commented Jun 25, 2022

@jerryk414 I personally am not certain AggregateException would be a desired scenario, because in case some services failing and others starting you'd have to wait for all of the StartAsync either completing or faulting. Maybe throwing on the first exception and cancelling all other services (either currently starting or already started) is a better way.

Starting seems to be fundamentally different from stopping, IMHO.

@jerryk414
Copy link
Contributor Author

@quixoticaxis

I would disagree with that for two reasons.

  1. In order to guarantee stopping on the first exception, you need to run things synchronously, which defeats the purpose. In addition, trying to cancel tasks or ignoring them and immediately returning on an exception sounds very misleading.
  2. The aggregate exception would only occur on new behavior - you'd have to opt into it. I would also feel secure saying throwing an aggregate exception in a process that spawns many other tasks is not necessarily unexpected behavior.

Another fun point, assuming StartAsync is awaited using await, a bit of fancy magic happens that will cause the first exception in the AggregateException to be what is actually thrown when the task returns. So you as a consumer, you will effectively get the first exception... Unless you do something like a ContinueWith and inspect the actual exception.

@quixoticaxis
Copy link

quixoticaxis commented Jun 25, 2022

@jerryk414

  1. In order to guarantee stopping on the first exception, you need to run things synchronously, which defeats the purpose. In addition, trying to cancel tasks or ignoring them and immediately returning on an exception sounds very misleading.

No, I believe I don't, but, sorry, I'll post the code snippet tomorrow.

Your seconds point seems valid to me though.

@jerryk414
Copy link
Contributor Author

@jerryk414

  1. In order to guarantee stopping on the first exception, you need to run things synchronously, which defeats the purpose. In addition, trying to cancel tasks or ignoring them and immediately returning on an exception sounds very misleading.

No, I believe I don't, but, sorry, I'll post the code snippet tomorrow.

Your seconds point seems valid to me though.

Yeah maybe I'm just unaware, but I'm not aware of any way to guarantee stopping on a first error without running each startup one by one. I don't have anything in front of me right now, but is there even a cancelation token that's passed into each start async? Even if there is, I doubt many people check it.

And assuming they don't check it, you could end up in a situation where either you're killing threads with these tasks, which could lead to corrupted states, or you could end up just swallowing other real exceptions, which is also bad.

@quixoticaxis
Copy link

@jerryk414 , if someone starts the services concurrently, I suppose it's only fair to assume that their starting logic may be running for prolonged period of time, so there's a fair chance that they respect cancellation requests (at the very least, we do in our stateful services that pre-load data in StartAsync). If starting phase is fast, than what's the point of doing it in parallel?

Considering your point about AggregateException being an expected behavior I suppose the start may look like something along the lines (throws AggregateException and handles cancellation in case of long-running starting logic):

using var stopper = CancellationTokenSource.CreateLinkedTokenSource(originalToken);

await Task.WhenAll(
    services
        .Select(
            service => service.StartAsync(stopper.Token))
        .Select(
            task => task.ContinueWith(original =>
            {
                if (original.IsFaulted)
                {
                    try
                    {
                        stopper.Cancel();
                    }
                    catch (AggregateException)
                    {
                        // probably report cancellation callback exceptions
                    }

                    var dispatcher = ExceptionDispatchInfo.Capture(
                        original.Exception!);
                    dispatcher.Throw();
                }
            })));

Anyway, the longer I think about it, the more I'm inclined to assume that concurrent start is much more tricky than concurrent stop.

@jerryk414
Copy link
Contributor Author

@jerryk414 , if someone starts the services concurrently, I suppose it's only fair to assume that their starting logic may be running for prolonged period of time, so there's a fair chance that they respect cancellation requests (at the very least, we do in our stateful services that pre-load data in StartAsync). If starting phase is fast, than what's the point of doing it in parallel?

Considering your point about AggregateException being an expected behavior I suppose the start may look like something along the lines (throws AggregateException and handles cancellation in case of long-running starting logic):

using var stopper = CancellationTokenSource.CreateLinkedTokenSource(originalToken);

await Task.WhenAll(
    services
        .Select(
            service => service.StartAsync(stopper.Token))
        .Select(
            task => task.ContinueWith(original =>
            {
                if (original.IsFaulted)
                {
                    try
                    {
                        stopper.Cancel();
                    }
                    catch (AggregateException)
                    {
                        // probably report cancellation callback exceptions
                    }

                    var dispatcher = ExceptionDispatchInfo.Capture(
                        original.Exception!);
                    dispatcher.Throw();
                }
            })));

Anyway, the longer I think about it, the more I'm inclined to assume that concurrent start is much more tricky than concurrent stop.

Yeah, I'm just thinking it doesn't have to be that complicated. With it being a new feature, it could just be accepted as being the way it is. If you have a problem with it and have interdependencies that could cause you to need to stop a start async method if one fails first or require order, you could revert back to the old way, or add your own inter-service communication.

I personally would argue the ROI isn't there for the extra complexity of ensuring an immediate return on first exceptions.

@quixoticaxis
Copy link

@jerryk414 my point was that I doubt that concurrent start should be implemented in the first place: too many different usage scenarios.

@buyaa-n
Copy link
Member

buyaa-n commented Dec 7, 2022

@jerryk414 are you still planning to work on it?

@jerryk414
Copy link
Contributor Author

jerryk414 commented Dec 30, 2022

@jerryk414 are you still planning to work on it?

I am - I had a PR up but honestly, it's funny. I'm doing this on my own time - i've got kids, family, and a fulltime job, and it's the holidays, so i just didn't get around to addressing all comments in the PR, and my personal computer took a crap on me, so i've got to get this all setup on another machine. It certainly would be nice if someone else who is paid to do this took ownership and addressed the very minor issues with the PR below.

#75894

If nobody does, I do have plans to get back around to it, but not until the new year.

@buyaa-n
Copy link
Member

buyaa-n commented Dec 30, 2022

I am - I had a PR up but honestly, it's funny. I'm doing this on my own time - i've got kids, family, and a fulltime job, and it's the holidays, so i just didn't get around to addressing all comments in the PR, and my personal computer took a crap on me, so i've got to get this all setup on another machine. It certainly would be nice if someone else who is paid to do this took ownership and addressed the very minor issues with the PR below.

Sure understand, no rush or any pressure, this is open source and it's up to you if you work on it or when you work. Though we would like to assign the issue if somebody already started working on it, so that other contributes would not start working on it simultaneously, if nobody working on it, we would add help wanted label so that other contributors could find it easily. I just saw you put up a PR and checking if you still planning to finish that PR.

If nobody does, I do have plans to get back around to it, but not until the new year.

Thank you, for now, nobody has planned to work on it, sounds like we can add help wanted label

@buyaa-n buyaa-n added the help wanted [up-for-grabs] Good issue for external contributors label Dec 30, 2022
@pedrobsaila
Copy link
Contributor

pedrobsaila commented Mar 27, 2023

Hey @buyaa-n I want to give it a try, you can assign it to me

@buyaa-n buyaa-n removed the help wanted [up-for-grabs] Good issue for external contributors label Mar 28, 2023
@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Mar 28, 2023
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Apr 11, 2023
@ghost ghost locked as resolved and limited conversation to collaborators May 12, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
api-approved API was approved in API review, it can be implemented area-Extensions-Hosting tenet-performance Performance related issue
Projects
None yet
8 participants