[BUG] EventHubListener causes message lost in shutdown #41784

yfujiwara-sansan · 2024-02-05T01:01:37Z

Library name and version

Microsoft.Azure.WebJobs.Extensions.EventHubs 6.0.2 and 5.5.0

Describe the bug

EventHubListener.PartitionProcessor advances the checkpoint even when the application is shutting down (for example, because of a configuration change or scaling). This causes message loss.
Note that the issue occurs only occasionally.

My colleague and I believe that this issue was fixed in #36432 but reintroduced with #38067, based on the investigation results below.

Our investigation results

In this situation, _functionExecutionToken.IsCancellationRequested had become true while linkedCts.IsCancellationRequested had not yet become true. As a result, the checkpoint was advanced even though the application execution had been cancelled.

From the source code of the linked CancellationTokenSource, we found the following:

The linked CancellationTokenSource (linkedCts) is cancelled via a callback registered on the "linked" cancellation token (the source of _functionExecutionToken).
The callback chain is invoked in LIFO order.
Such callbacks are also used in many other places, such as the implementation of Task.Delay(CancellationToken).

By watching the call stack at the checkpointing step, we observed the following sequence:

WebHost calls the listener's StopAsync().
The listener calls CancellationTokenSource.Cancel() on the source of _functionExecutionToken.
The cancellation reaches the application as an async/await continuation, so the awaits in the function runtime complete as part of executing the registered callbacks. As described above, these callbacks run before linkedCts.IsCancellationRequested is set to true. The checkpoint is therefore advanced because linkedCts.IsCancellationRequested is not yet true.
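
The ordering described above can be observed with a small console sketch. It is not code from the extension: hostCts, partitionCts, and the LinkedTokenOrdering class are hypothetical stand-ins for the listener's token sources. The callback order is an implementation detail of the runtime rather than a documented guarantee, so the sketch prints what it observes instead of assuming a result.

```csharp
using System;
using System.Threading;

public static class LinkedTokenOrdering
{
    // Returns the state of the linked source as seen (a) inside a callback
    // registered on the host token and (b) after Cancel() has returned.
    public static (bool InsideCallback, bool AfterCancel) Observe()
    {
        // Hypothetical stand-ins for the listener's token sources.
        using var hostCts = new CancellationTokenSource();
        using var partitionCts = new CancellationTokenSource();
        using var linkedCts = CancellationTokenSource.CreateLinkedTokenSource(
            hostCts.Token, partitionCts.Token);

        bool insideCallback = false;

        // Registered after the linked source was created, in the same way a
        // Task.Delay started later by the function would register its callback.
        using CancellationTokenRegistration registration = hostCts.Token.Register(
            () => insideCallback = linkedCts.IsCancellationRequested);

        // Runs the registered callbacks synchronously on this thread.
        hostCts.Cancel();

        return (insideCallback, linkedCts.IsCancellationRequested);
    }

    public static void Main()
    {
        var (inside, after) = Observe();
        Console.WriteLine($"linked canceled inside callback: {inside}");
        Console.WriteLine($"linked canceled after Cancel():  {after}");
    }
}
```

If the first value printed is False, a continuation running inside that callback would see a linked token that has not been cancelled yet, which is the window in which the checkpoint can be advanced.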

Expected behavior

The checkpoint is never advanced while the application process is shutting down (that is, once _functionExecutionToken is cancelled).

Actual behavior

The checkpoint is occasionally advanced during shutdown.

Reproduction Steps

Run an Event Hubs trigger locally with the function below.
After the Receive method has started, press Ctrl + C to shut down the process.
The checkpoint is then advanced (occasionally). You can inspect the states of linkedCts and _functionExecutionToken by setting a breakpoint at the checkpointing step.

[FunctionName("Receive")]
[ExponentialBackoffRetry(5, "00:00:10", "00:10:00")]
public async Task Receive(
    [EventHubTrigger("%EventHubName%", Connection = "EventHubConnectionString")] EventData[] events,
    CancellationToken cancellationToken)
{
    await Task.Delay(TimeSpan.FromMinutes(3), cancellationToken);
}

Environment

Platform: Windows (Functions runtime v3 and v4, Azure App Service)
- We reproduced the issue in a local environment, and message loss occurred multiple times in our production Azure environment.

jsquire · 2024-02-05T14:52:37Z

@JoshLove-msft: Would you please advise on whether this is related to the ask on the Functions team to expose whether an execution would be retried if the host is not shutting down? If so, please transfer to the Functions host repo.

jsquire · 2024-02-05T14:52:42Z

Thank you for your feedback. Tagging and routing to the team member best able to assist.

JoshLove-msft · 2024-02-10T19:33:45Z

@yfujiwara-sansan, I couldn't reproduce having _functionExecutionToken be canceled with linkedCts still not being cancelled. That said, it looks like you are correct that cancellation is not atomic when it comes to linked token sources, so it is possible that one of the sources can be canceled while the linked source is still not canceled. I will make a fix to explicitly check each of the token sources when making the checkpointing decision.
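
A minimal sketch of the idea of checking each token explicitly, rather than relying on the linked source alone, might look like the following. It is an illustration only and not the change merged for this issue: CheckpointGuard, ShouldCheckpoint, hostCts, and partitionCts are hypothetical names.

```csharp
using System;
using System.Threading;

public static class CheckpointGuard
{
    // Hypothetical helper: allow checkpointing only if none of the
    // individual tokens has been signaled.
    public static bool ShouldCheckpoint(params CancellationToken[] tokens)
    {
        foreach (CancellationToken token in tokens)
        {
            if (token.IsCancellationRequested)
            {
                return false;
            }
        }

        return true;
    }

    public static void Main()
    {
        // Hypothetical stand-ins for the host and partition token sources.
        using var hostCts = new CancellationTokenSource();
        using var partitionCts = new CancellationTokenSource();

        // Nothing is canceled yet, so checkpointing is allowed.
        Console.WriteLine(ShouldCheckpoint(hostCts.Token, partitionCts.Token));

        // Once the host token is canceled, the check fails immediately,
        // whether or not a linked source has caught up.
        hostCts.Cancel();
        Console.WriteLine(ShouldCheckpoint(hostCts.Token, partitionCts.Token));
    }
}
```

Because each source token reports its own state as soon as Cancel() is called on it, this check does not depend on the order in which cancellation callbacks run.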

yfujiwara-sansan · 2024-02-11T04:30:08Z

@JoshLove-msft Thank you for your work! I guess that the repro code sometimes fails to repro due to scheduling of continuations.

By the way, having read #38067 again, it looks like linkedCts can now be passed to TryExecute because it is not linked to the token that is cancelled in drain mode. I think there is a better option to pass linkedCts to TryExecute, because the app should be cancelled when the load balancer detects ownership lost.

JoshLove-msft · 2024-02-11T16:17:32Z

> I think there is a better option to pass linkedCts to TryExecute, because the app should be cancelled when the load balancer detects ownership lost.

The semantics of the token passed to the function are that it is signaled only when shutting down in a way that is not guaranteed to allow the function to complete execution. In terms of whether it makes sense to also cancel this token when partition ownership is lost, I would have to defer to @jsquire on that.

yfujiwara-sansan · 2024-02-12T04:34:56Z

I understand the semantics of the token passed to the app. The cancellation behavior on ownership loss should be discussed as a separate issue because the current behavior is intentional.

Thank you for your answer! My colleagues and I look forward to the fix for the race condition.

jsquire · 2024-02-12T13:22:53Z

> > I think there is a better option to pass linkedCts to TryExecute, because the app should be cancelled when the load balancer detects ownership lost.
>
> The semantics of the token passed to the function are that it is signaled only when shutting down in a way that is not guaranteed to allow the function to complete execution. In terms of whether it makes sense to also cancel this token when partition ownership is lost, I would have to defer to @jsquire on that.

@JoshLove-msft : The token that the processor passes to OnProcessingEventBatch will get canceled when partition ownership is lost. I would assume that we're flowing that token into the Executor so that the Function is notified. Looking over the implementation, it seems that we're not passing that along, for some reason. Any idea why?

JoshLove-msft · 2024-02-12T16:30:24Z

I don't think there was a good reason - it may have just been an oversight. Updated the PR to pass linkedCts token to TryExecute.

yfujiwara-sansan · 2024-02-13T02:14:51Z

@JoshLove-msft

Thank you for the fix, but let me confirm one thing just in case.

Is it intentional that only the single-dispatch path was fixed? It looks like the following arguments should be fixed, too:

JoshLove-msft · 2024-02-13T03:23:23Z

You are right again. Apologies for the oversight. I will add these updates in.

JoshLove-msft · 2024-02-13T03:27:58Z

I think this instance also needs updating - https://github.com/JoshLove-msft/azure-sdk-for-net/blob/4a51f3567f5da030a964494807aa25e1aba888aa/sdk/eventhub/Microsoft.Azure.WebJobs.Extensions.EventHubs/src/Listeners/EventHubListener.PartitionProcessor.cs#L289

jsquire assigned JoshLove-msft Feb 5, 2024

jsquire added the needs-team-attention label and removed the needs-team-triage label Feb 5, 2024

JoshLove-msft mentioned this issue Feb 10, 2024

Avoid race condition when making checkpointing decision #41891

Merged

JoshLove-msft closed this as completed in #41891 Feb 12, 2024

simonvandermeer mentioned this issue Mar 1, 2024

Message loss when using Microsoft.Azure.Functions.Worker.Extensions.EventHubs Azure/azure-functions-dotnet-worker#2311

Closed

github-actions bot locked and limited conversation to collaborators May 13, 2024
