
"499 client closed connection" on PubSub consumer after upgrading to .NET 8 #1237

Closed
KrylixZA opened this issue Feb 15, 2024 · 14 comments
Labels
kind/bug Something isn't working

Comments

@KrylixZA

KrylixZA commented Feb 15, 2024

Expected Behavior

Dapr not to close connections when event processing is successful.

Actual Behavior

Dapr appears to be passing a cancellation into my API controllers, which triggers a 499 Client Closed Connection exception even after processing within our application completes successfully.

What I am observing from my OTEL traces: an event is picked up by the Dapr sidecar off a topic and passed into my ASP.NET controller, where it is successfully processed. Our code uses Dapr to publish an event on to the next topic. It all then wraps up and returns a 200 OK to the sidecar. Dapr should then ack the message off the queue, but often it doesn't. It seems to throw some kind of processing error: the sidecar span in the trace shows a response code of 2, which then appears to be passed into the controller as the client closing the connection, resulting in a flurry of 499 response codes.

A bit more context is needed here. Our API controllers are all built to take a CancellationToken cancellationToken = default as the final parameter of every action. That means the calling context which Dapr provides is wrapped up and, through ASP.NET middleware, makes its way into our code. At every layer, we check the status of the cancellation token in an effort to reduce the amount of work the system is doing at any time.
From what I can tell, the failure in the Dapr sidecar is propagating a cancelled context into our middleware, and that results in the trace being sent up to Application Insights. As best we can see, the event does get published on to the next service via PubSub, so functionally nothing is necessarily wrong. But the behaviour we are observing is extremely erratic, and it floods our telemetry with useless errors and hides real errors because of the sampling rates we have picked for cost reasons.
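The pattern described above can be sketched roughly as follows; this is a hypothetical reconstruction, with `Event`, `IEventManager`, and the route invented for illustration:

```csharp
// Hypothetical controller matching the described setup: the trailing
// CancellationToken is bound by ASP.NET Core from HttpContext.RequestAborted,
// so a sidecar disconnect cancels it mid-request.
[ApiController]
[Route("[controller]")]
public class EventsController : ControllerBase
{
    private readonly IEventManager _manager; // hypothetical service layer

    public EventsController(IEventManager manager) => _manager = manager;

    [HttpPost]
    public async Task<IActionResult> ProcessEvent(
        [FromBody] Event incoming,
        CancellationToken cancellationToken = default)
    {
        // Checked at every layer; throws OperationCanceledException if the
        // caller (the Dapr sidecar) has aborted the request.
        cancellationToken.ThrowIfCancellationRequested();

        await _manager.HandleAsync(incoming, cancellationToken);
        return Ok(); // 200 with empty body on success
    }
}
```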

In a given 24-hour period, we're seeing on average 40k of these 499 errors. We also see around 5,000-10,000 500 errors for cancellations being triggered. My gut feel says the two are related.

When I enabled debug logs in my sidecar, I am seeing a lot of messages saying the following:

skipping status check due to error parsing result from pub/sub event : EOF

I think this is where the error starts.

More context:

  • .NET 8 / C# 12
  • AKS
  • Using Kafka for PubSub
  • Running Dapr 1.12.5 with 1.12.0 NuGet packages

Steps to Reproduce the Problem

I am only able to reproduce this under heavy load - during peak periods where we are doing around 170k events per minute. I have never been able to reproduce it locally or in my dev/test environment.

Release Note

RELEASE NOTE:

@KrylixZA
Author

Not confirmed but I do believe there is a connection to dapr/dapr#7110. Just need to prove/disprove it. My gut feel says it has something to do with cancellations and context between Dapr sidecars and the .NET APIs.

@ItalyPaleAle
Contributor

@KrylixZA could you please share some logs from the Dapr sidecar?

@philliphoff
Collaborator

@KrylixZA A couple more questions:

  • Do you have any metrics related to typical time taken in the pub-sub handler?
  • How is the application subscribing to and handling the messages (i.e through "raw" handlers or via SDK-provided mechanisms)?

@berndverst
Member

berndverst commented Feb 16, 2024

I believe I had explained this elsewhere: from the error I am getting the impression your PubSub topic contains some data which is not in cloud event format, but the Dapr subscription isn't in raw format and therefore expects a cloud event (which is JSON).

Can you double-check that you are only publishing to the respective topic using Dapr? Or, if using a third-party system to publish, that all events are stored in cloud event format exclusively?

If other event types are a requirement the Dapr subscription needs to be updated to instead subscribe in raw mode.

Ignore the comment above; the error comes from here:
https://github.com/dapr/dapr/blob/518753d502cbb1b57b244031b574862944ad86ce/pkg/runtime/processor/pubsub/publish.go#L124

Which indicates that the Dapr sidecar is publishing to your application via HTTP, but your application returns errors, returns an invalid response, or the publish request times out.

So as @philliphoff said, please elaborate how you are subscribing to the pubsub events. I recommend instrumenting additional logging in your application.

From what I can tell this is entirely an application error in your pubsub receiver application. Even if you are using the Dapr SDK for receiving, this is almost certainly not an issue in the SDK but rather in your own application code.

The Dapr runtime expects a response from your app each time the runtime makes the HTTP request to deliver the event to your app. But instead it is getting EOF while attempting to read the response.

Could it be that your application handler is incredibly slow at times and hence the request from the Dapr sidecar times out? (Your application would probably continue processing the request, but Dapr would see it as an error at that point.)

Another option is that the response body is not valid JSON, so Dapr reaches the end of the body without being able to parse the document.

The response to a publish request must be valid JSON if you include an HTTP body:
https://docs.dapr.io/reference/api/pubsub_api/#expected-http-response

The JSON must be of the form:

{
  "status": "<status>"
}

Otherwise, you can choose to have an empty response body and only send an HTTP status code.

The code makes it clear we are getting a 200 status code from your app, but the response is invalid JSON.

https://github.com/dapr/dapr/blob/518753d502cbb1b57b244031b574862944ad86ce/pkg/runtime/processor/pubsub/publish.go#L142
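For reference, the two accepted response shapes described in the linked docs could look like this from an ASP.NET Core controller; the topic and type names here are purely illustrative:

```csharp
// 1. Empty body: a bare 2xx is treated as success by the runtime.
[Topic("pubsub", "orders")]
[HttpPost("orders")]
public IActionResult HandleOrder(Order order) => Ok();

// 2. Explicit JSON body: must contain a "status" field, one of
//    SUCCESS, RETRY or DROP per the pub/sub API docs.
[Topic("pubsub", "payments")]
[HttpPost("payments")]
public IActionResult HandlePayment(Payment payment)
    => Ok(new { status = "SUCCESS" });
```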

@KrylixZA
Author

KrylixZA commented Feb 17, 2024

Hey all. Gonna do my best to give as many details as I can without giving away IP here.

Logs

First off, here's a snippet of the last 25 lines of my sidecar logs when running with debug logging:

daprd logs
time="2024-02-15T09:44:07.475815156Z" level=debug msg="Processing Kafka message: KAFKA_TOPIC/21/19313724 [key=some key]" app_id=my-santisied-dapr-app component="pubsub (pubsub.kafka/v1)" instance=my-santisied-dapr-app-844fd5d57f-67dvl scope=dapr.contrib type=log ver=1.12.5
time="2024-02-15T09:44:07.499852524Z" level=debug msg="skipping status check due to error parsing result from pub/sub event <nil>: EOF" app_id=my-santisied-dapr-app instance=my-santisied-dapr-app-844fd5d57f-67dvl scope=dapr.runtime.processor.pubsub type=log ver=1.12.5
time="2024-02-15T09:44:07.499914688Z" level=debug msg="Processing Kafka message: KAFKA_TOPIC/21/19313725 [key=some key]" app_id=my-santisied-dapr-app component="pubsub (pubsub.kafka/v1)" instance=my-santisied-dapr-app-844fd5d57f-67dvl scope=dapr.contrib type=log ver=1.12.5
time="2024-02-15T09:44:07.509025356Z" level=debug msg="skipping status check due to error parsing result from pub/sub event <nil>: EOF" app_id=my-santisied-dapr-app instance=my-santisied-dapr-app-844fd5d57f-67dvl scope=dapr.runtime.processor.pubsub type=log ver=1.12.5
time="2024-02-15T09:44:07.50906464Z" level=debug msg="Processing Kafka message: KAFKA_TOPIC/22/18845218 [key=some key]" app_id=my-santisied-dapr-app component="pubsub (pubsub.kafka/v1)" instance=my-santisied-dapr-app-844fd5d57f-67dvl scope=dapr.contrib type=log ver=1.12.5
time="2024-02-15T09:44:07.514974525Z" level=debug msg="skipping status check due to error parsing result from pub/sub event <nil>: EOF" app_id=my-santisied-dapr-app instance=my-santisied-dapr-app-844fd5d57f-67dvl scope=dapr.runtime.processor.pubsub type=log ver=1.12.5
time="2024-02-15T09:44:07.51500328Z" level=debug msg="Processing Kafka message: KAFKA_TOPIC/23/19981887 [key=some key]" app_id=my-santisied-dapr-app component="pubsub (pubsub.kafka/v1)" instance=my-santisied-dapr-app-844fd5d57f-67dvl scope=dapr.contrib type=log ver=1.12.5
time="2024-02-15T09:44:07.52462686Z" level=debug msg="skipping status check due to error parsing result from pub/sub event <nil>: EOF" app_id=my-santisied-dapr-app instance=my-santisied-dapr-app-844fd5d57f-67dvl scope=dapr.runtime.processor.pubsub type=log ver=1.12.5
time="2024-02-15T09:44:07.539242941Z" level=debug msg="skipping status check due to error parsing result from pub/sub event <nil>: EOF" app_id=my-santisied-dapr-app instance=my-santisied-dapr-app-844fd5d57f-67dvl scope=dapr.runtime.processor.pubsub type=log ver=1.12.5
time="2024-02-15T09:44:07.546718585Z" level=debug msg="#operations=1,partitionkey=my-santisied-dapr-app||MyDaprActor||ActorUniquekey" app_id=my-santisied-dapr-app component="actorstate (state.azure.cosmosdb/v1)" instance=my-santisied-dapr-app-844fd5d57f-67dvl scope=dapr.contrib type=log ver=1.12.5
time="2024-02-15T09:44:07.547964071Z" level=debug msg="skipping status check due to error parsing result from pub/sub event <nil>: EOF" app_id=my-santisied-dapr-app instance=my-santisied-dapr-app-844fd5d57f-67dvl scope=dapr.runtime.processor.pubsub type=log ver=1.12.5
time="2024-02-15T09:44:07.547993258Z" level=debug msg="Processing Kafka message: KAFKA_TOPIC/23/19981888 [key=some key]" app_id=my-santisied-dapr-app component="pubsub (pubsub.kafka/v1)" instance=my-santisied-dapr-app-844fd5d57f-67dvl scope=dapr.contrib type=log ver=1.12.5
time="2024-02-15T09:44:07.56450723Z" level=debug msg="Operation 0 completed with status code 200" app_id=my-santisied-dapr-app component="actorstate (state.azure.cosmosdb/v1)" instance=my-santisied-dapr-app-844fd5d57f-67dvl scope=dapr.contrib type=log ver=1.12.5
time="2024-02-15T09:44:07.566461143Z" level=debug msg="skipping status check due to error parsing result from pub/sub event <nil>: EOF" app_id=my-santisied-dapr-app instance=my-santisied-dapr-app-844fd5d57f-67dvl scope=dapr.runtime.processor.pubsub type=log ver=1.12.5
time="2024-02-15T09:44:07.600133971Z" level=debug msg="#operations=1,partitionkey=my-santisied-dapr-app||MyDaprActor||ActorUniquekey" app_id=my-santisied-dapr-app component="actorstate (state.azure.cosmosdb/v1)" instance=my-santisied-dapr-app-844fd5d57f-67dvl scope=dapr.contrib type=log ver=1.12.5
time="2024-02-15T09:44:07.607434487Z" level=debug msg="#operations=1,partitionkey=my-santisied-dapr-app||MyDaprActor||ActorUniquekey" app_id=my-santisied-dapr-app component="actorstate (state.azure.cosmosdb/v1)" instance=my-santisied-dapr-app-844fd5d57f-67dvl scope=dapr.contrib type=log ver=1.12.5
time="2024-02-15T09:44:07.613075762Z" level=debug msg="skipping status check due to error parsing result from pub/sub event <nil>: EOF" app_id=my-santisied-dapr-app instance=my-santisied-dapr-app-844fd5d57f-67dvl scope=dapr.runtime.processor.pubsub type=log ver=1.12.5
time="2024-02-15T09:44:07.613108497Z" level=debug msg="Processing Kafka message: KAFKA_TOPIC/23/19981889 key=some key]" app_id=my-santisied-dapr-app component="pubsub (pubsub.kafka/v1)" instance=my-santisied-dapr-app-844fd5d57f-67dvl scope=dapr.contrib type=log ver=1.12.5
time="2024-02-15T09:44:07.618679813Z" level=debug msg="Processing Kafka message: KAFKA_TOPIC/21/19313726 [key=some key]" app_id=my-santisied-dapr-app component="pubsub (pubsub.kafka/v1)" instance=my-santisied-dapr-app-844fd5d57f-67dvl scope=dapr.contrib type=log ver=1.12.5
time="2024-02-15T09:44:07.619346936Z" level=debug msg="Operation 0 completed with status code 200" app_id=my-santisied-dapr-app component="actorstate (state.azure.cosmosdb/v1)" instance=my-santisied-dapr-app-844fd5d57f-67dvl scope=dapr.contrib type=log ver=1.12.5
time="2024-02-15T09:44:07.625879327Z" level=debug msg="Operation 0 completed with status code 200" app_id=my-santisied-dapr-app component="actorstate (state.azure.cosmosdb/v1)" instance=my-santisied-dapr-app-844fd5d57f-67dvl scope=dapr.contrib type=log ver=1.12.5
time="2024-02-15T09:44:07.644219792Z" level=debug msg="#operations=1,partitionkey=my-santisied-dapr-app||MyDaprActor||ActorUniquekey" app_id=my-santisied-dapr-app component="actorstate (state.azure.cosmosdb/v1)" instance=my-santisied-dapr-app-844fd5d57f-67dvl scope=dapr.contrib type=log ver=1.12.5
time="2024-02-15T09:44:07.646260711Z" level=debug msg="skipping status check due to error parsing result from pub/sub event <nil>: EOF" app_id=my-santisied-dapr-app instance=my-santisied-dapr-app-844fd5d57f-67dvl scope=dapr.runtime.processor.pubsub type=log ver=1.12.5
time="2024-02-15T09:44:07.655700049Z" level=debug msg="#operations=1,partitionkey=my-santisied-dapr-app||MyDaprActor||ActorUniquekey" app_id=my-santisied-dapr-app component="actorstate (state.azure.cosmosdb/v1)" instance=my-santisied-dapr-app-844fd5d57f-67dvl scope=dapr.contrib type=log ver=1.12.5
time="2024-02-15T09:44:07.660807136Z" level=debug msg="skipping status check due to error parsing result from pub/sub event <nil>: EOF" app_id=my-santisied-dapr-app instance=my-santisied-dapr-app-844fd5d57f-67dvl scope=dapr.runtime.processor.pubsub type=log ver=1.12.5

Average processing time

Secondly, our average processing time is around 25ms. There are, of course, some operations that run slowly and genuinely do get terminated because they are long-running. However, that's not the case for all of these 499 errors: looking at it now, some of these 499s are triggered on operations that return to the sidecar in as little as 6.8ms.

Raw vs CloudEvent

We are seeing this on services consuming raw events and services consuming cloud events. For context, we have a service which we consider the entry point to our system. This service is using Dapr to consume from multiple Kafka topics which are being populated by on-premise systems that are not using Dapr. We would consider these events raw events. This service publishes a CloudEvent to a topic we use internally to route events that are relevant to us to the next service in the processing flow. That service is, of course, consuming cloud events.
To be clear, we are using PubSub components, not subscription YAML, to bind to the topics. We are programmatically binding the topics in Dapr through:

// in Program.cs
app.MapSubscribeHandler();

// on our controller actions
[Topic("PubSub", "KAFKA_TOPIC")]

Our APIs do not return a response message when events are processed successfully; we simply return Ok with no response body. For errors, we return an HTTP response of either 400 or 500 (our global exception handler makes that choice) along with an internal generic error response that, as JSON, looks like:

{
  "application": "App Name",
  "code": <some integer relative to the app>,
  "details": "Coding details",
  "message": "Error friendly message"
}

Application vs Dapr error

I am not convinced the application is the root cause of the error. The errors appear to originate when calling into/returning to the sidecar, not during actual processing. We picked it up in our global exception handler middleware. Below is an example stack trace I found in my application logs.

.NET stack trace
2024-02-17 04:57:56.7406|ERROR|Internal.GlobalExceptionMiddleware|The operation was canceled.|Exception occurred:System.OperationCanceledException: The operation was canceled.
 ---> System.Threading.Tasks.TaskCanceledException: The operation was canceled.
   at System.Net.Http.Http2Connection.Http2Stream.<>c.<WaitForDataAsync>b__88_0(Object s, CancellationToken cancellationToken)
   at System.Threading.CancellationTokenSource.CallbackNode.ExecuteCallback()
   at System.Threading.CancellationTokenSource.ExecuteCallbackHandlers(Boolean throwOnFirstException)
   at Grpc.Net.Client.Internal.GrpcCall`2.CancelCall(Status status)
   at Grpc.Net.Client.Internal.GrpcCall`2.CancelCallFromCancellationToken(CancellationToken cancellationToken)
   at Grpc.Net.Client.Internal.GrpcCall`2.<>c.<RegisterCancellation>b__70_0(Object state, CancellationToken ct)
   at System.Threading.CancellationTokenSource.Invoke(Delegate d, Object state, CancellationTokenSource source)
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
   at System.Threading.CancellationTokenSource.CallbackNode.ExecuteCallback()
   at System.Threading.CancellationTokenSource.ExecuteCallbackHandlers(Boolean throwOnFirstException)
   at Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Http.HttpProtocol.CancelRequestAbortedTokenCallback()
   at System.Threading.ThreadPoolWorkQueue.Dispatch()
   at System.Threading.PortableThreadPool.WorkerThread.WorkerThreadStart()
--- End of stack trace from previous location ---
   at System.Net.Http.Http2Connection.Http2Stream.System.Threading.Tasks.Sources.IValueTaskSource.GetResult(Int16 token)
   at System.Net.Http.Http2Connection.Http2Stream.ReadResponseHeadersAsync(CancellationToken cancellationToken)
   at System.Net.Http.Http2Connection.SendAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.SendWithVersionDetectionAndRetryAsync(HttpRequestMessage request, Boolean async, Boolean doRequestAuth, CancellationToken cancellationToken)
   at System.Net.Http.DiagnosticsHandler.SendAsyncCore(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
   at System.Net.Http.RedirectHandler.SendAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
   at Grpc.Net.Client.Balancer.Internal.BalancerHttpHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
   at Grpc.Net.Client.Internal.GrpcCall`2.RunCall(HttpRequestMessage request, Nullable`1 timeout)
--- End of inner exception stack trace ---
   at Grpc.Net.Client.Internal.GrpcCall`2.ResolveException(String summary, Exception ex, Nullable`1& status, Exception& resolvedException)
   at Grpc.Net.Client.Internal.GrpcCall`2.RunCall(HttpRequestMessage request, Nullable`1 timeout)
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.AsyncStateMachineBox`1.MoveNext(Thread threadPoolThread)
   at System.Threading.Tasks.AwaitTaskContinuation.RunOrScheduleAction(IAsyncStateMachineBox box, Boolean allowInlining)
   at System.Threading.Tasks.Task.RunContinuations(Object continuationObject)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.SetException(Exception exception, Task`1& taskField)
   at Grpc.Net.Client.Balancer.Internal.BalancerHttpHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.AsyncStateMachineBox`1.MoveNext(Thread threadPoolThread)
   at System.Threading.Tasks.AwaitTaskContinuation.RunOrScheduleAction(IAsyncStateMachineBox box, Boolean allowInlining)
   at System.Threading.Tasks.Task.RunContinuations(Object continuationObject)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.SetException(Exception exception, Task`1& taskField)
   at System.Net.Http.RedirectHandler.SendAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.AsyncStateMachineBox`1.MoveNext(Thread threadPoolThread)
   at System.Threading.Tasks.AwaitTaskContinuation.RunOrScheduleAction(IAsyncStateMachineBox box, Boolean allowInlining)
   at System.Threading.Tasks.Task.RunContinuations(Object continuationObject)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.SetException(Exception exception, Task`1& taskField)
   at System.Net.Http.HttpConnectionPool.SendWithVersionDetectionAndRetryAsync(HttpRequestMessage request, Boolean async, Boolean doRequestAuth, CancellationToken cancellationToken)
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.AsyncStateMachineBox`1.MoveNext(Thread threadPoolThread)
   at System.Threading.Tasks.AwaitTaskContinuation.RunOrScheduleAction(IAsyncStateMachineBox box, Boolean allowInlining)
   at System.Threading.Tasks.Task.RunContinuations(Object continuationObject)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.SetException(Exception exception, Task`1& taskField)
   at System.Net.Http.Http2Connection.SendAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
   at System.Threading.Tasks.AwaitTaskContinuation.RunOrScheduleAction(IAsyncStateMachineBox box, Boolean allowInlining)
   at System.Threading.Tasks.Task.RunContinuations(Object continuationObject)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.SetException(Exception exception, Task`1& taskField)
   at System.Net.Http.Http2Connection.Http2Stream.ReadResponseHeadersAsync(CancellationToken cancellationToken)
   at System.Threading.ExecutionContext.RunFromThreadPoolDispatchLoop(Thread threadPoolThread, ExecutionContext executionContext, ContextCallback callback, Object state)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.AsyncStateMachineBox`1.MoveNext(Thread threadPoolThread)
   at System.Threading.ThreadPoolWorkQueue.Dispatch()
   at System.Threading.PortableThreadPool.WorkerThread.WorkerThreadStart()
--- End of stack trace from previous location ---
   at Dapr.Client.DaprClientGrpc.MakePublishRequest(String pubsubName, String topicName, ByteString content, Dictionary`2 metadata, String dataContentType, CancellationToken cancellationToken)
   at Infrastructure.PubSub.EventPublisher.PublishEventAsync[T](T eventData, String topicName, Guid correlationId, Dictionary`2 daprMetadata, Dictionary`2 pubSubMetadata, CancellationToken cancellationToken) in /app/src/Infrastructure/PubSub/EventPublisher.cs:line 74
   at EventService.Managers.EventManager.SubmitAndPublishPlayerEventAsync[T](EventType eventType, T playerEvent, Guid correlationId, CancellationToken cancellationToken) in /app/src/EventService/Managers/EventManager.cs:line 58
   at EventService.Controllers.EventsController.ProcessEvent(Event event, CancellationToken cancellationToken) in /app/src/EventService/Controllers/EventsController.cs:line 161
   at Microsoft.AspNetCore.Mvc.Infrastructure.ActionMethodExecutor.TaskOfIActionResultExecutor.Execute(ActionContext actionContext, IActionResultTypeMapper mapper, ObjectMethodExecutor executor, Object controller, Object[] arguments)
   at Microsoft.AspNetCore.Mvc.Infrastructure.ControllerActionInvoker.<InvokeActionMethodAsync>g__Logged|12_1(ControllerActionInvoker invoker)
   at Microsoft.AspNetCore.Mvc.Infrastructure.ControllerActionInvoker.<InvokeNextActionFilterAsync>g__Awaited|10_0(ControllerActionInvoker invoker, Task lastTask, State next, Scope scope, Object state, Boolean isCompleted)
   at Microsoft.AspNetCore.Mvc.Infrastructure.ControllerActionInvoker.Rethrow(ActionExecutedContextSealed context)
   at Microsoft.AspNetCore.Mvc.Infrastructure.ControllerActionInvoker.Next(State& next, Scope& scope, Object& state, Boolean& isCompleted)
   at Microsoft.AspNetCore.Mvc.Infrastructure.ControllerActionInvoker.<InvokeInnerFilterAsync>g__Awaited|13_0(ControllerActionInvoker invoker, Task lastTask, State next, Scope scope, Object state, Boolean isCompleted)
   at Microsoft.AspNetCore.Mvc.Infrastructure.ResourceInvoker.<InvokeNextResourceFilter>g__Awaited|25_0(ResourceInvoker invoker, Task lastTask, State next, Scope scope, Object state, Boolean isCompleted)
   at Microsoft.AspNetCore.Mvc.Infrastructure.ResourceInvoker.Rethrow(ResourceExecutedContextSealed context)
   at Microsoft.AspNetCore.Mvc.Infrastructure.ResourceInvoker.Next(State& next, Scope& scope, Object& state, Boolean& isCompleted)
   at Microsoft.AspNetCore.Mvc.Infrastructure.ResourceInvoker.<InvokeFilterPipelineAsync>g__Awaited|20_0(ResourceInvoker invoker, Task lastTask, State next, Scope scope, Object state, Boolean isCompleted)
   at Microsoft.AspNetCore.Mvc.Infrastructure.ResourceInvoker.<InvokeAsync>g__Logged|17_1(ResourceInvoker invoker)
   at Microsoft.AspNetCore.Mvc.Infrastructure.ResourceInvoker.<InvokeAsync>g__Logged|17_1(ResourceInvoker invoker)
   at Internal.GlobalExceptionMiddleware.InvokeAsync(HttpContext context)

Every controller action on all services (PubSub consumer or not) expects a CancellationToken cancellationToken = default as a parameter. We did this under the presumption that cancellations would seldom, if ever, be triggered while consuming events off a topic, and that for actual APIs we should honour cancellations. In every method at every layer, we have a one-liner:

cancellationToken.ThrowIfCancellationRequested();

(Yes, we have catered for this in our global exception handler 😃 )

What convinces me even further that this is not an application error is that when we remove the cancellation token parameter from our controller actions, the errors vanish entirely. All messages are processed successfully, raw or CloudEvent. This does feel like we're treating the symptoms rather than the problem, though 😞.

Hmmmmm

Considering what @berndverst mentioned about the response JSON payload, the fact that removing cancellations seems to "work", and that our internal generic error response is not what Dapr expects, I wonder if these are all somehow connected.

@KrylixZA
Author

KrylixZA commented Feb 17, 2024

The response to a publish request must be valid JSON if you include a HTTP body
https://docs.dapr.io/reference/api/pubsub_api/#expected-http-response
The JSON must be of the form:
{
  "status": "<status>"
}
Otherwise, you can choose to have an empty response body and only send a HTTP status code.

I don't know how this is the first time I am seeing this page 👀 Will need to read through it more carefully. We're almost guaranteed to be doing the wrong thing.

Otherwise, you can choose to have an empty response body and only send a HTTP status code.

As with all things enterprise, it's not quite that simple. We have a company-wide standard global exception handler and error response payload. We will either need to find a way not to use that global exception handler and write our own, or implement our own try/catch in all our controllers, which I am not a fan of.
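One possible middle ground, sketched here purely as an assumption about how such a handler might be structured, is a thin middleware placed before the company-wide handler that special-cases cancellations so they never reach it:

```csharp
// Hypothetical middleware: if the client (here, the Dapr sidecar) has already
// disconnected, there is nobody left to read an error payload, so we
// short-circuit with the conventional 499 instead of producing the shared
// JSON error envelope.
public class CancellationAwareMiddleware
{
    private readonly RequestDelegate _next;

    public CancellationAwareMiddleware(RequestDelegate next) => _next = next;

    public async Task InvokeAsync(HttpContext context)
    {
        try
        {
            await _next(context);
        }
        catch (OperationCanceledException)
            when (context.RequestAborted.IsCancellationRequested)
        {
            context.Response.StatusCode = 499; // client closed request
        }
    }
}
```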

The code makes it clear we are getting a 200 status code from your app, but the response is invalid JSON.
https://github.com/dapr/dapr/blob/518753d502cbb1b57b244031b574862944ad86ce/pkg/runtime/processor/pubsub/publish.go#L142

Pretty much sums it up then. I wonder why 499s suddenly started appearing after upgrading to .NET 8. In .NET 6 we had a bunch of 500 errors, but not as many as we are seeing now. There must be something in the bowels of .NET 7/8 that has made this issue more prevalent.

@KrylixZA
Author

Howdy. Just an update from our side: we tried removing cancellation tokens from all of our consuming services and unfortunately are still seeing these errors. I have updated my post. It seems this may be beyond our application's control. We're going to continue the investigation and will report back if we find anything.

@KrylixZA
Author

Hey folks. Further update: we disabled our application telemetry entirely for a couple of hours but left the sidecar telemetry flowing through our OTEL collector, and we are still observing the 499 errors.

Not sure if this helps much, but at least we can confirm its origin in terms of telemetry signals.

Will continue investigating further.

@KrylixZA
Author

Further observations: this particular issue only appears to be prevalent on endpoints that are interacting with a Kafka PubSub component. The other services seem okay.

We now have 3 systems running .NET 8 and Dapr, all consuming from Kafka and all exhibiting the same behaviour. The issues are rooted in the Dapr OTEL traces.

@philliphoff
Collaborator

philliphoff commented Feb 26, 2024

The skipping status check due to error parsing result from pub/sub event <id>: EOF message is likely a red herring, as an EOF is expected when no body is returned. (A basic .NET pub/sub app following the docs, where no value is explicitly returned from the event handler, will produce that log message.) I've filed an issue to update that logging to avoid confusion in the future (see dapr/dapr#7561).

@philliphoff
Collaborator

@KrylixZA Just to be clear: when you say a "499 client closed connection" is triggered, do you mean that you're seeing cancellation requested on the incoming CancellationToken for the message handler endpoint, or literally that a 499 response is being generated somewhere? If the latter, where is this occurring?

From the call stacks, it looks like the TaskCanceledException is being thrown while attempting to publish the resulting message. How is the original CancellationToken for the message handler endpoint flowed through to subsequent calls; is it given to, or aggregated with, a token given to DaprClient.PublishEventAsync()?

What does your global exception handling middleware do with the exception? You mention that it bubbles back up to the message handler and then returns a 400/500; if that's the case, I'd expect to see one of the two following warning/error messages in the Dapr logs:

  • In the case of a 404: non-retriable error returned from app while processing pub/sub event
  • Otherwise: retriable error returned from app while processing pub/sub event

Do you see either of these in your telemetry?

@KrylixZA
Author

Hey @philliphoff

The skipping status check due to error parsing result from pub/sub event <id>: EOF is likely a red herring, as an EOF is an expected case when no body is returned. (A basic pub/sub .NET following the docs where no value is explicitly returned from the event handler will result in that log message.) I've filed an issue to update that logging to avoid confusion in the future (see dapr/dapr#7561).

Thanks and agreed. The wording could definitely better indicate that nothing is necessarily wrong; there's just nothing to look at.

do you mean that you're seeing cancellation requested of the incoming CancellationToken for the message handler endpoint, or literally that a 499 response being generated somewhere. If the latter, where is this occurring?

The latter. It's being generated during communication from our application to the Dapr client, typically when we are either trying to publish an event to a Kafka topic or using an actor proxy to communicate with an actor.

How is the original CancellationToken for the message handler endpoint flowed through to subsequent calls; is it given to or aggregated with a token given to DaprClient.PublishEventAsync()?

Yes, it is being given to the Dapr client. We pass the cancellation token through the entire call stack. The cancellation token originates (for us, at least) in the controller actions but obviously comes from the ASP.NET middleware. In a perfect world, the context being passed to us from the Dapr sidecar is the same context being passed to the Dapr client when publishing an event or communicating with an actor. The scope of the cancellation only lives from Dapr -> our app -> Dapr and terminates there. All communication between services is asynchronous through Kafka topics.
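One mitigation worth considering (a sketch under our own assumptions, not a confirmed fix) is to stop flowing the request-aborted token into the outbound publish once the inbound work has been accepted, bounding the publish with its own timeout instead:

```csharp
// Sketch: decouple the outbound publish from the inbound request token so a
// sidecar disconnect cannot cancel an already-committed publish. The 10s
// timeout is an arbitrary illustrative bound.
public async Task PublishDecoupledAsync<T>(
    DaprClient dapr, string pubsubName, string topic, T payload)
{
    using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(10));
    await dapr.PublishEventAsync(pubsubName, topic, payload, cts.Token);
}
```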

What does your global exception handling middleware do with the exception?

At the moment we have no specific handling for it. The global exception handler produces a generic response in this situation: it translates the exception into a 500 response with a general, non-specific error message in the JSON body. The response is of the form:

{
  "application": "App Name",
  "code": 12345,
  "details": "Coding details",
  "message": "A general, non-specific error has occurred."
}
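For reference, a global exception handler that produces a body of that shape might look something like the following sketch (the middleware class, field values, and wiring are illustrative, not our actual code):

```csharp
using Microsoft.AspNetCore.Http;

// Registered early in the pipeline: app.UseMiddleware<GlobalExceptionMiddleware>();
public class GlobalExceptionMiddleware
{
    private readonly RequestDelegate _next;

    public GlobalExceptionMiddleware(RequestDelegate next) => _next = next;

    public async Task InvokeAsync(HttpContext context)
    {
        try
        {
            await _next(context);
        }
        catch (Exception)
        {
            // Note: OperationCanceledException lands here too, so a cancelled
            // request (the 499 case) gets translated into this generic 500.
            context.Response.StatusCode = StatusCodes.Status500InternalServerError;
            await context.Response.WriteAsJsonAsync(new
            {
                application = "App Name",
                code = 12345,
                details = "Coding details",
                message = "A general, non-specific error has occurred."
            });
        }
    }
}
```

That catch-all behaviour is relevant here: because cancellations are not handled separately, the handler converts them into 500s, which the sidecar then treats as a retriable error.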

I'd expect to see one of the two following warning/error messages in the Dapr logs:

  • In the case of a 404: non-retriable error returned from app while processing pub/sub event
  • Otherwise: retriable error returned from app while processing pub/sub event

We see the latter. A lot of it: tens of thousands per day, probably even hundreds of thousands. So much so that we've just dismissed this warning as a "known error" and never bothered looking into it. Maybe skipping over them from the beginning wasn't the best plan 🤔

Where we're at right now

I can also give some more context here. We've been working with @yaron2 on this internally. Due to company policies I am not allowed to upload screenshots of our App Insights spans to GitHub. What we've picked up is that whenever we see these errors, looking at the time spans, there are long periods where nothing is happening. I was initially misled due to issues with the quality of our telemetry, but after making some tweaks to our OTEL config we've been able to uncover a lot more detail, and, as mentioned, there are long periods where nothing happens. It's either stuck waiting on the actor to respond or stuck waiting to publish an event.

You might ask, specifically with publishing, why is it waiting? Especially given the Dapr client is transiently injected into applications. The problem we had there was that Kafka was heavily rate limiting our publishing. So we created a wrapper around the Dapr client specifically for publishing events to a topic. We then registered that implementation in our DI container as a singleton and inject that wrapper into any code that needs to publish a message to a topic.
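In essence, that wrapper looks something like the following sketch (the interface name, the throttling mechanism, and the pub/sub component name are illustrative; the real implementation also deals with Kafka rate limiting):

```csharp
using Dapr.Client;

public interface IEventPublisher
{
    Task PublishAsync<T>(string topic, T payload, CancellationToken cancellationToken = default);
}

// Registered as a singleton: services.AddSingleton<IEventPublisher, DaprEventPublisher>();
public class DaprEventPublisher : IEventPublisher
{
    private const int MaxConcurrentPublishes = 8; // illustrative value

    private readonly DaprClient _daprClient;
    // Limits concurrent publishes to avoid hammering a rate-limited broker.
    private readonly SemaphoreSlim _throttle = new(MaxConcurrentPublishes);

    public DaprEventPublisher(DaprClient daprClient) => _daprClient = daprClient;

    public async Task PublishAsync<T>(string topic, T payload, CancellationToken cancellationToken = default)
    {
        await _throttle.WaitAsync(cancellationToken);
        try
        {
            await _daprClient.PublishEventAsync("kafka-pubsub", topic, payload, cancellationToken);
        }
        finally
        {
            _throttle.Release();
        }
    }
}
```

A singleton with internal throttling like this is also exactly where publishes can queue up and sit idle, which matches the long "nothing happening" gaps in our spans.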

Initially I was convinced it couldn't be a response time issue in our applications, because we saw the errors even when the response time was less than 10ms. Now, however, with clearer detail, we can see scenarios where the error is thrown very soon after we begin processing; but in the context of the greater span, where I can see the Dapr sidecar communicating with my app and my app being essentially paused (sometimes for longer than 10s), the error can surface almost immediately after that pause. With all kinds of concurrency in the mix and no limit currently set on dapr.io/app-max-concurrency, it is possible that a single pod is indeed unresponsive.
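For anyone following along, dapr.io/app-max-concurrency is set as a pod annotation on the deployment; a minimal fragment (values illustrative) would be:

```yaml
# Deployment pod template annotations (app-id and limit are illustrative)
annotations:
  dapr.io/enabled: "true"
  dapr.io/app-id: "myapp"
  dapr.io/app-max-concurrency: "10"
```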

It seems, on the surface at least, that something new in .NET (either .NET 7 or 8) means it's no longer willing to just wait around for our application to return slowly, as it used to do with .NET 6. It now terminates early and reports these 499 errors. That exception being thrown appears to be what is triggering the cancellation.

We've not completely confirmed our suspicions yet, but we're continuing on the journey. What we can definitely say is this is not a Dapr issue but a .NET issue. There may be some things we can do in the Dapr .NET SDK around the default timeouts in the HTTP client but I doubt that is actually the cause of this problem.

We're going to investigate concurrency as an option next and see if that brings us any joy and also look if there are ways for us to change how our code interacts with the Dapr client as well.


I'd like to keep this issue open for the time being until we've confirmed our suspicions and reported back a solution if that's okay with the OSS Community?

@KrylixZA
Author

I see this issue listed on the components repo appears to be reporting the same problem: dapr/components-contrib#3327

As I have mentioned, and as @berndverst mentioned in his comment dapr/components-contrib#3327 (comment), this is likely an issue that needs to be handled in the application code.

Just wanting to loop the issues together as they're all related.

@KrylixZA
Author

KrylixZA commented Apr 4, 2024

Hi all. Resolving this thread. 499s and 500s have happened interchangeably since .NET 7, but they were ultimately caused by cancellations being thrown for in-flight messages on PubSub during app shutdowns. This has been resolved in Dapr 1.13.2.

@KrylixZA KrylixZA closed this as completed Apr 4, 2024