"499 client closed connection" on PubSub consumer after upgrading to .NET 8 #1237
Comments
Not confirmed, but I do believe there is a connection to dapr/dapr#7110. Just need to prove/disprove it. My gut feel says it has something to do with cancellations and context between Dapr sidecars and the .NET APIs.
@KrylixZA could you please share some logs from the Dapr sidecar?
@KrylixZA A couple more questions:
Ignore the comment, the error comes from here: which indicates that the Dapr sidecar is publishing to your application via HTTP, but your application returns errors (an invalid response). So, as @philliphoff said, please elaborate on how you are subscribing to the pubsub events. I recommend instrumenting additional logging in your application. From what I can tell, this is entirely an application error in your pubsub receiver application. Even if you are using the Dapr SDK for receiving, this is almost certainly not an issue in the SDK but rather in your own application code. The Dapr runtime expects a response from your app each time the runtime makes the HTTP request to deliver the event to your app, but instead it is getting EOF while attempting to read the response.
Another option is that the response body is not valid JSON. As a result, Dapr is reaching the end of the body without being able to parse the document as valid JSON. The response to a publish request must be valid JSON if you include an HTTP body. The JSON must be of the form:

```json
{
  "status": "<status>"
}
```

Otherwise, you can choose to have an empty response body and only send an HTTP status code. The code makes it clear we are getting a 200 status code from your app, but the response is invalid JSON.
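For illustration only (this is not code from the issue), a minimal ASP.NET Core endpoint that satisfies that contract could look like the sketch below; the route name is a placeholder:

```csharp
// Minimal sketch of a subscriber endpoint whose response the Dapr runtime can parse.
// Either return an empty 200, or a JSON body of the form { "status": "..." }.
// Assumes a .NET 6+ minimal API project (Web SDK implicit usings).
var builder = WebApplication.CreateBuilder(args);
var app = builder.Build();

app.MapPost("/my-topic-handler", () =>
{
    // ... handle the delivered event here ...

    // Option 1: status code only, no body.
    // return Results.Ok();

    // Option 2: an explicit subscription status as valid JSON.
    return Results.Ok(new { status = "SUCCESS" });
});

app.Run();
```

Either option avoids handing the runtime a body it cannot parse.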
Hey all. Gonna do my best to give as many details as I can without giving away IP here.

Logs

First off, here's a snippet of the last 25 lines of my sidecar logs when running with debug logging: daprd logs
Average processing time

Secondly, we have average processing times around 25ms. There are, of course, some operations that run slowly and genuinely do terminate because they are a long-running process. However, that's not the case for all of these 499 errors. I am busy taking a look now and we have some of these 499s being triggered on operations returning to the sidecar in as low as 6.8ms.

Raw vs CloudEvent

We are seeing this on services consuming raw events and services consuming cloud events. For context, we have a service which we consider the entry point to our system. This service is using Dapr to consume from multiple Kafka topics which are being populated by on-premise systems that are not using Dapr. We would consider these events raw events. This service publishes a CloudEvent to a topic we use internally to route events that are relevant to us to the next service in the processing flow. That service is, of course, consuming cloud events.

```csharp
// in Program.cs
app.MapSubscriberHandlers()

// in our controller
[Topic("PubSub", "KAFKA_TOPIC")]
```

Our APIs do not return a response message when events are processed successfully. We simply return:

```json
{
"application": "App Name",
"code": <some integer relative to the app>,
"details": "Coding details",
"message": "Error friendly message"
}
```

Application vs Dapr error

I am not convinced the application is the root cause of the error. The errors appear to originate when calling into/returning to the sidecar and not during actual processing. We picked it up in our Global Exception Handler middleware. Below is an example stack trace that I found in my application logs: .NET stack trace

Every controller action on all services (PubSub consumer or not) is expecting a CancellationToken, and at every layer we call cancellationToken.ThrowIfCancellationRequested(); (yes, we have catered for this in our global exception handler 😃).
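To make that pattern concrete, here is a sketch of what such a controller action might look like; the event type, route and processing method are hypothetical, and the pubsub/topic names are just the placeholders quoted above:

```csharp
using System.Threading;
using System.Threading.Tasks;
using Dapr;
using Microsoft.AspNetCore.Mvc;

[ApiController]
public class ConsumerController : ControllerBase
{
    // Placeholder pubsub component and topic names, as in the fragment above.
    [Topic("PubSub", "KAFKA_TOPIC")]
    [HttpPost("/consume")]
    public async Task<IActionResult> Consume(
        [FromBody] MyEvent payload,
        CancellationToken cancellationToken = default)
    {
        // Bail out early if the sidecar has already abandoned the delivery.
        cancellationToken.ThrowIfCancellationRequested();

        await ProcessAsync(payload, cancellationToken);
        return Ok();
    }

    // Stand-in for the real domain processing, which also observes the token.
    private static Task ProcessAsync(MyEvent payload, CancellationToken ct) => Task.CompletedTask;
}

// Hypothetical event payload type.
public record MyEvent(string Id);
```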
Hmmmm

Considering what @berndverst mentioned about the response JSON payload, the fact that removing cancellations seems to "work", and that our internal generic error response message is not what Dapr is expecting, I wonder if they are all somehow connected?
I don't know how this is the first time I am seeing this page 👀 Will need to read through it more carefully. We're almost guaranteed to be doing the wrong thing.
As with all things enterprise, it's not quite that simple. We have a company-wide standard global exception handler and error response payload. We will need to either find a way not to use that global exception handler and write our own, or implement our own
Pretty much sums it up then. I wonder why
Howdy. Just an update from our side. We tried removing cancellation tokens from all of our consuming services and unfortunately are still seeing these errors. I have updated my post. Seems like it may be beyond the scope of our application to control. We're going to continue the investigation and will feed back if we find anything.
Hey folks. Further update: we disabled our application telemetry entirely for a couple of hours but left the sidecar telemetry flowing through our OTEL collector, and we are still observing the 499 error. Not sure if this helps much, but at least we can confirm its origin in terms of telemetry signals. Will continue investigating further.
Further observations: this particular issue only appears to be prevalent on endpoints that are interacting with a Kafka PubSub component. The other services seem okay. We now have 3 systems running .NET 8 & Dapr consuming from Kafka, all exhibiting the same behaviour. Issues are rooted in the Dapr OTEL.
@KrylixZA Just to be clear, when you say a "499 client closed connection" is triggered, do you mean that you're seeing cancellation requested of the incoming

From that call stack, it looks like the

What does your global exception handling middleware do with the exception? You mention that it is bubbling that back up to the message handler and then returning a 400/500, but if that's the case, I'd expect to see one of the two following warning/error messages in the Dapr logs:

Do you see either of these in your telemetry?
Hey @philliphoff
Thanks and agreed. Definitely could use better wording to indicate there wasn't necessarily anything wrong, there's just nothing to look at.
The latter. It's being generated during communication from our application to the Dapr client. Typically we see it when either trying to publish an event to a Kafka topic or when we are using an actor proxy to communicate with an actor.
Yes it is being given to the Dapr client. We pass the cancellation token through the entire call stack. The cancellation token originates (for us at least) in the controller actions but is obviously coming from ASP.NET middleware. In a perfect world, the context being passed to us from the Dapr sidecar is the same context being passed to the Dapr client when publishing an event or communicating with an actor. The scope of the cancellation only lives from Dapr -> Our app -> Dapr and terminates there. All communication between services is asynchronous through Kafka topics.
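To sketch that flow (the class and method names here are hypothetical; the real code was not shared), a downstream handler receiving the same token and using it on the publish call might look like this:

```csharp
using System.Threading;
using System.Threading.Tasks;
using Dapr.Client;

// Hypothetical downstream handler: the token that originated with the sidecar's
// HTTP request is passed all the way to the outbound publish, so an abandoned
// delivery also cancels the publish to the next topic.
public sealed class NextTopicForwarder
{
    private readonly DaprClient _daprClient;

    public NextTopicForwarder(DaprClient daprClient) => _daprClient = daprClient;

    public Task ForwardAsync<T>(T payload, CancellationToken cancellationToken)
        // Placeholder pubsub component and topic names.
        => _daprClient.PublishEventAsync("PubSub", "NEXT_TOPIC", payload, cancellationToken);
}
```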
At the moment we have no specific coding for it. The global exception handler has a generic response it produces in this situation. Therefore, the handler translates the exception into:

```json
{
  "application": "App Name",
  "code": 12345,
  "details": "Coding details",
  "message": "A general, non-specific error has occurred."
}
```
We see the latter. A lot of it. Like 10s of thousands of them per day, probably even 100s of thousands per day. So much so that we've just dismissed this warning as a "known error" and not bothered looking at it right from the start. Maybe skipping over them in the beginning wasn't the best plan 🤔

Where we're at right now

I can also give some more context here. We've been working with @yaron2 on this internally. Due to company policies I am not allowed to upload screenshots of our App Insights spans to GitHub. What we've picked up is that whenever we see these errors, looking at the time spans, there are long periods where nothing is happening. I was initially misled due to issues with the quality of our telemetry, but after making some tweaks to our OTEL config we've been able to uncover a lot more detail, and as mentioned, there are long periods where nothing happens. It's either stuck waiting on the actor to respond or stuck waiting to publish an event.

You might ask, specifically with publishing, why is it waiting? Especially given the Dapr client is transiently injected into applications. The problem we had there was that Kafka was heavily rate limiting our publishing. So we created a wrapper of the Dapr client specifically for publishing events to a topic. We then registered that implementation in our DI container as a singleton and inject that wrapper into code that needs to publish a message to a topic.

Initially I was convinced it couldn't be a response time thing on our applications because we saw the errors even when the response time was less than 10ms. Now, however, with clearer detail, we can see that there are some scenarios where the error is thrown very soon after we begin processing, but in the context of the greater span, where I can see the Dapr sidecar communicating with my app and my app being essentially paused (sometimes for longer than 10s), it can error almost immediately. With all kinds of concurrency in the mix and currently no limit being set to the

It seems, on the surface at least, that there is something new in .NET (either .NET 7 or 8) that means it's no longer happy to just wait around for our application to return slowly as it used to do with .NET 6. It now terminates early and reports these 499 errors. That exception being thrown appears to be what is triggering the cancellation. We've not completely confirmed our suspicions yet, but we're continuing on the journey. What we can definitely say is this is not a Dapr issue but a .NET issue. There may be some things we can do in the Dapr .NET SDK around the default timeouts in the HTTP client, but I doubt that is actually the cause of this problem.

We're going to investigate concurrency as an option next and see if that brings us any joy, and also look at whether there are ways for us to change how our code interacts with the Dapr client. I'd like to keep this issue open for the time being until we've confirmed our suspicions and reported back a solution, if that's okay with the OSS community?
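For reference, the kind of publish wrapper and singleton registration described above might look roughly like this sketch (the interface and class names are hypothetical; the actual implementation was not shared):

```csharp
using System.Threading;
using System.Threading.Tasks;
using Dapr.Client;

// Hypothetical publish-only wrapper registered once for the whole app,
// instead of resolving a transient DaprClient wherever publishing happens.
public interface ITopicPublisher
{
    Task PublishAsync<T>(string topic, T payload, CancellationToken cancellationToken = default);
}

public sealed class DaprTopicPublisher : ITopicPublisher
{
    private const string PubSubName = "PubSub"; // placeholder component name
    private readonly DaprClient _daprClient;

    public DaprTopicPublisher(DaprClient daprClient) => _daprClient = daprClient;

    public Task PublishAsync<T>(string topic, T payload, CancellationToken cancellationToken = default)
        => _daprClient.PublishEventAsync(PubSubName, topic, payload, cancellationToken);
}

// Registration, e.g. in Program.cs:
// builder.Services.AddDaprClient();
// builder.Services.AddSingleton<ITopicPublisher, DaprTopicPublisher>();
```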
I see this issue listed on the components repo, which appears to be reporting the same problem: dapr/components-contrib#3327. As I have mentioned, and as @berndverst mentioned in his comment dapr/components-contrib#3327 (comment), this is likely an issue that needs to be handled in the application code. Just wanting to loop the issues together as they're all related.
Hi all. Resolving this thread. 499s and 500s happen interchangeably since .NET 7, but they were ultimately caused by cancellations being thrown for in-flight messages on PubSub during app shutdowns. This has been resolved in Dapr 1.13.2.
Expected Behavior
Dapr not to close connections when event processing is successful.
Actual Behavior
Dapr appears to be passing a cancellation to my API controllers which is triggering a 499 - Client Closed Connection exception to be thrown even after processing within our application completes successfully.

What I am observing from my OTEL traces is that an event will be picked up by the Dapr sidecar off a topic and passed into my ASP.NET controller, where it is successfully processed. Our code uses Dapr to publish an event on to the next topic. It all then wraps up and returns a 200 - OK to the sidecar. Dapr should then ack the message off the queue, but there are many times where it doesn't. It seems to throw some kind of processing error. The status code I see on the sidecar in the trace is a response code of 2, which then appears to be passed into the controller as a client closing the connection, which results in a flurry of 499 response codes.

A bit more context is needed here. Our API controllers are all built to take in a CancellationToken cancellationToken = default as the final parameter of every controller. Obviously that means that the calling context which Dapr provides is wrapped up and, through magical ASP.NET middleware, makes its way into our code. At every layer in our code, we check the status of the cancellation token in an effort to reduce the amount of work the system is doing at all times.

From what I can tell, the failure in the Dapr sidecar is propagating a cancelled context into our middleware and that is resulting in the trace being brought up to Application Insights. As best we can see, the event does indeed get published on to the next service via PubSub, so in terms of functionality nothing is necessarily wrong. But the behaviour we are observing is extremely erratic, and it floods our telemetry with useless errors and hides real errors due to the sampling rates we have picked for cost reasons.
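As an aside on the controller pattern described above: in ASP.NET Core, the CancellationToken parameter bound for a controller action is HttpContext.RequestAborted, which is cancelled when the caller (here the Dapr sidecar) drops the connection. An illustrative (not actual) controller:

```csharp
using System.Threading;
using Microsoft.AspNetCore.Mvc;

// Illustrative controller only: shows that the bound CancellationToken parameter
// is the request-abort token, i.e. it is cancelled when the HTTP caller
// (the Dapr sidecar in this scenario) closes the connection.
[ApiController]
public class IllustrativeController : ControllerBase
{
    [HttpPost("/illustrative")]
    public IActionResult Handle(CancellationToken cancellationToken = default)
    {
        // Both sides of this comparison are the same token.
        return Ok(new { sameToken = cancellationToken == HttpContext.RequestAborted });
    }
}
```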
In a given 24-hour period, we're seeing on average 40k of these 499 errors. We also see around 5,000 - 10,000 500-status errors for cancellations being triggered. My gut feel says both are similarly related.

When I enabled debug logs in my sidecar, I am seeing a lot of messages saying the following:
I think this is where the error starts.
More context:
Steps to Reproduce the Problem
I am only able to reproduce this under heavy load - during peak periods where we are doing around 170k events per minute. I have never been able to reproduce it locally or in my dev/test environment.
Release Note
RELEASE NOTE: