Hedged Request Fails to Return or Cancel Within Configured Timeout #5699

Open
renatogbp opened this issue Nov 26, 2024 · 0 comments
Labels: area-resilience, bug


Description

We are using IHttpClientBuilder.AddStandardHedgingHandler() from Microsoft.Extensions.Http.Resilience, and we've noticed that sometimes a request is not cancelled, i.e., it does not respect HttpStandardHedgingResilienceOptions.TotalRequestTimeout.Timeout. Even in situations where a hedged request returns a successful response, the logical operation keeps waiting on the primary request and never returns.
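
At the call site the hang looks like an await that never completes, even though a hedged attempt has already returned 200. A minimal sketch of the call shape (the client name, request, and token are illustrative, not our exact code):

var client = httpClientFactory.CreateClient("ServiceClient");

// With TotalRequestTimeout.Timeout = 4s we expect this await to complete or
// fault within roughly 4 seconds. Instead it sometimes never returns, even
// after one of the hedged attempts has already succeeded.
using var response = await client.SendAsync(request, cancellationToken);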

Here is a log where a hedged request returned 200, but the primary request (attempt 0) was never cancelled:

[screenshot: Polly telemetry showing a hedged attempt completing with 200 while attempt 0 keeps running]

Here's a log where attempt 0 and attempt 2 were cancelled, but attempt 1 was never cancelled:

>PollyTelemetry:Polly|ExecutionAttempt|Execution attempt. Source: 'HttpClient.ServiceClient-standard-hedging//StandardHedging-Hedging', Operation Key: '', Result: 'The operation was canceled.', Handled: 'False', Attempt: '0', Execution Time: '3498.7123'|System.Threading.Tasks.TaskCanceledException: The operation was canceled.
 
>PollyTelemetry:Polly|ExecutionAttempt|Execution attempt. Source: 'HttpClient.ServiceClient-standard-hedging//StandardHedging-Hedging', Operation Key: '', Result: 'The operation was canceled.', Handled: 'False', Attempt: '2', Execution Time: '1484.4099'|System.Threading.Tasks.TaskCanceledException: The operation was canceled.

Looking at our telemetry, we can see that the 99.99th-percentile latency sometimes exceeds the configured total timeout, which should not be possible. For example, the HttpClient is configured with both the attempt timeout and the total timeout at 4s, yet we see latency reported above 4s:

[screenshot: latency percentile chart showing 99.99th-percentile latency above the 4s total timeout]
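
To rule out a telemetry artifact, we can cross-check with a plain wall-clock measurement around the call (a diagnostic sketch; httpClientFactory, uri, and logger are assumptions):

var client = httpClientFactory.CreateClient("ServiceClient");
var stopwatch = Stopwatch.StartNew();
try
{
    using var response = await client.GetAsync(uri, cancellationToken);
}
finally
{
    stopwatch.Stop();

    // Flag any call that outlives the 4s total timeout configured on the pipeline.
    if (stopwatch.Elapsed > TimeSpan.FromSeconds(4))
    {
        logger.LogWarning("Call exceeded the configured total timeout: {Elapsed}", stopwatch.Elapsed);
    }
}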

Please advise how we can debug this further.

Reproduction Steps

Unfortunately I am not able to reproduce this with unit tests, but this is how I am configuring the HttpClient:

public static void ConfigureStandardHedgingHttpClient(
    this IServiceCollection serviceCollection,
    StandardHedgingHttpClientConfiguration configuration)
{
    var httpClientBuilder = serviceCollection.AddHttpClient(configuration.ClientId);

    configuration.Configurator.ConfigureHeadHandlers(httpClientBuilder);
    configuration.Configurator.ConfigurePrimaryHttpMessageHandler(httpClientBuilder);

    var hedgingHandler = httpClientBuilder
        .AddStandardHedgingHandler();

    hedgingHandler = hedgingHandler
        .SelectPipelineByAuthority()
        .Configure(options =>
        {
            options.Hedging.Delay = configuration.HedgingDelay;
            options.Hedging.MaxHedgedAttempts = Math.Max(1, configuration.MaxHedgedAttempts - 1); // subtract 1 because configuration.MaxHedgedAttempts counts the original request, while options.Hedging.MaxHedgedAttempts counts only the additional hedged attempts

            // if set, it overrides the default ShouldHandle
            if (configuration.ShouldHandleHedgingProvider != null)
            {
                options.Hedging.ShouldHandle = configuration.ShouldHandleHedgingProvider;
            }
            else
            {
                options.Hedging.ShouldHandle = args => HandleConnectTimeoutsAndTransientFailures(args);
            }

            options.Endpoint.Timeout.Timeout = configuration.AttemptTimeout;
            options.Endpoint.Timeout.TimeoutGenerator = configuration.AttemptTimeoutGenerator;

            // if set, it takes precedence over HedgingDelay
            options.Hedging.DelayGenerator = configuration.HedgingDelayGenerator;

            // circuit breaker configuration
            options.Endpoint.CircuitBreaker.SamplingDuration = configuration.SamplingDuration;
            options.Endpoint.CircuitBreaker.BreakDuration = configuration.BreakDuration;
            options.Endpoint.CircuitBreaker.FailureRatio = configuration.FailureThreshold;
            options.Endpoint.CircuitBreaker.MinimumThroughput = configuration.MinimumThroughput;

            if (configuration.ShouldHandleCircuitBreakerProvider != null)
            {
                options.Endpoint.CircuitBreaker.ShouldHandle = configuration.ShouldHandleCircuitBreakerProvider;
            }

            if (!configuration.IsCircuitBreakerEnabled)
            {
                options.Endpoint.CircuitBreaker.ShouldHandle = _ => new ValueTask<bool>(false); // never trip: effectively disables the circuit breaker
            }

            options.TotalRequestTimeout.Timeout = configuration.TotalRequestTimeout;
            options.TotalRequestTimeout.TimeoutGenerator = configuration.TotalTimeoutGenerator;
        });

    configuration.Configurator.ConfigureOrderedRoutingStrategy(hedgingHandler.RoutingStrategyBuilder);
    configuration.Configurator.ConfigureTailHandlers(httpClientBuilder);

    AddChaosInjectors(httpClientBuilder, configuration.ChaosSettings);

#pragma warning disable CS0618 // Type or member is obsolete
    RedactionExtensions.AddRedaction(httpClientBuilder
        .Services
        .AddResilienceEnricher());
#pragma warning restore CS0618 // Type or member is obsolete
}
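
One way we could narrow this down is a harness that keeps only the library pieces, with the custom head/tail handlers, routing strategy, and chaos injectors removed (a sketch; the client name and values are placeholders):

var services = new ServiceCollection();

services.AddHttpClient("ServiceClient")
    .AddStandardHedgingHandler()
    .SelectPipelineByAuthority()
    .Configure(options =>
    {
        options.Hedging.Delay = TimeSpan.FromSeconds(1);
        options.Hedging.MaxHedgedAttempts = 2;
        options.Endpoint.Timeout.Timeout = TimeSpan.FromSeconds(4);
        options.TotalRequestTimeout.Timeout = TimeSpan.FromSeconds(4);
    });

If this stripped-down pipeline still misbehaves under load, the custom pieces can be ruled out.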

Expected behavior

  1. All requests should be cancelled once the configured TotalRequestTimeout elapses; an external guard like the sketch below should never be the one to cancel.
  2. HttpClient should cancel the remaining pending requests as soon as any attempt returns successfully.
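
To verify expectation 1 from the outside, a caller-side guard can be set slightly above the configured total timeout; if the handler honoured TotalRequestTimeout, this token should never fire (a diagnostic sketch, assuming client, uri, and logger):

// External guard at 5s: the 4s TotalRequestTimeout plus a safety margin.
using var guard = new CancellationTokenSource(TimeSpan.FromSeconds(5));

try
{
    using var response = await client.GetAsync(uri, guard.Token);
}
catch (OperationCanceledException) when (guard.IsCancellationRequested)
{
    // Reaching this catch means the call outlived the configured total timeout.
    logger.LogWarning("Guard cancelled a call that outlived TotalRequestTimeout.");
}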

Actual behavior

  1. Some requests are cancelled later than the configured timeout allows.
  2. Some requests are never cancelled even after the timeout is reached.
  3. Pending requests are never cancelled when one of them returns successfully.

Regression?

No. This is the only version we have used, so it has probably always behaved this way.

Known Workarounds

No known workarounds

Configuration

  • Microsoft.Extensions version is 8.0.0
  • Server is running in Windows Server 2019 containers
  • .NET version is 8.0

Other information

No response

renatogbp added the bug and untriaged labels on Nov 26, 2024