SocketAsyncEngine not signaling ManualResetEvent #53921
Comments
Tagging subscribers to this area: @dotnet/ncl

Issue Details: We're trying to diagnose a recurring production problem in a component that processes messages from Amazon SQS, whereby it will stall (fail to process messages on the queue). Examination of a memory dump in the stalled state seems to indicate either a deadlock or a missed notification.

This problem has been going on for a while now across several of our components. Possibly related:

netcoreapp3.1 on Ubuntu 18 (Docker container). See dump analysis:
@flakey-bit this is very likely a duplicate of #46505 and #31570, which are reported to be fixed in .NET 5.0. Any chance you can upgrade?
@antonfirsov I'd seen those two issues already, but thanks for highlighting them.
3.1 is under LTS, right?
After a second look I realized this is indeed a different issue (#46505 involves cancellation); my apologies. From what I see here, this might be an issue with calling a synchronous Receive on an already asynchronous (non-blocking) socket. Any chance you can isolate a repro? It is really hard (if not impossible) to make this actionable without one.
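As an aside, the hazard described above can be illustrated outside of .NET. On Unix, once a socket has been switched to non-blocking mode (as an async I/O engine does), a blocking-style receive no longer waits for data. The following is a minimal Python sketch of that underlying POSIX behavior, not the .NET SocketAsyncEngine code path itself:

```python
import socket

def try_sync_recv_on_nonblocking():
    """Show that a blocking-style recv on a non-blocking socket fails fast."""
    a, b = socket.socketpair()
    a.setblocking(False)  # what an async I/O engine typically leaves behind
    try:
        # With no data queued, this does not block waiting for the peer;
        # it raises immediately because the socket is non-blocking.
        a.recv(1024)
        return "returned"
    except BlockingIOError:
        return "would-block"
    finally:
        a.close()
        b.close()
```

In .NET the runtime has to emulate blocking semantics on top of such a socket, which is where a missed wakeup (the ManualResetEvent never being signaled) could manifest as a stall.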
Any chance you have a network trace to go along with this? The stacks suggest that the response had a Content-Length header and the code is waiting for as much data to arrive as was promised. It could, for example, be an issue of the server simply not sending as much as was promised, or something along those lines.
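To illustrate the scenario described above: a reader that trusts Content-Length will keep waiting for the promised byte count, and if the server under-delivers while keeping the connection open, the read never completes. A small Python sketch (with an artificial timeout so it terminates; names are illustrative, not from the .NET source):

```python
import socket

def read_exact(sock, n, timeout=0.5):
    """Read exactly n bytes, or return what arrived before the timeout."""
    sock.settimeout(timeout)
    buf = bytearray()
    while len(buf) < n:
        try:
            chunk = sock.recv(n - len(buf))
        except socket.timeout:
            break  # server stopped sending before delivering n bytes
        if not chunk:
            break  # peer closed the connection
        buf += chunk
    return bytes(buf)

a, b = socket.socketpair()
b.sendall(b"hello")       # "server" promised 10 bytes but sends only 5
body = read_exact(a, 10)  # returns the 5 bytes that actually arrived
a.close(); b.close()
```

Without the timeout, the loop would block indefinitely, which is consistent with a stall that looks like a deadlock in a memory dump.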
Thanks for taking a look.

@antonfirsov, we don't have an isolated repro but will try to work on getting one. It will be difficult, because we don't really have any idea what triggers it. It could, as Stephen says, simply be that the server didn't send us data. It could also be environmental, or a temporary loss of network connectivity.

@stephentoub, we don't have a network trace, unfortunately. Things can run fine for weeks before the stalling occurs. I think we'll look at capturing network traces on an ongoing (circular) basis so that we have them available for the next time the issue occurs.

I've dumped the corresponding

Other (potentially) relevant information:
Not actionable at the moment. Closing. Feel free to reopen if/when more info is available. |
EDIT: Dumped the BufferMemoryReceiveOperation, showing 0 bytes transferred so far.