akka.remote.ResendUnfulfillableException: Unable to fulfill resend request since negatively acknowledged payload is no longer in buffer. The resend states between two systems are compromised and cannot be recovered. #23010
Comments
The error message means that the sending side of a system message first receives an acknowledgement for a message and then later receives a negative acknowledgement for that same message. How could that happen? IIUC these buffers and management data structures are kept even if the physical connection had to be reestablished. E.g. in the case above
Message 0 was likely confirmed earlier. The receiver then seems to have "forgotten" about that message, and when it receives other messages later on it sends a negative acknowledgement for that message. The question is why this information gets lost at some point.
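To make the "ack first, nack later" sequence concrete, here is a minimal sketch of a sender-side buffer. It is a toy model written for this thread, not Akka's implementation: the names `SendBuffer`, `ResendUnfulfillable`, `buffer` and `acknowledge` are hypothetical, and the only behaviour assumed is what the comment above describes (acknowledged messages are dropped, and a nack for a dropped message cannot be served).

```python
class ResendUnfulfillable(Exception):
    """Raised when a nacked sequence number is no longer buffered."""


class SendBuffer:
    """Toy sender-side buffer (hypothetical, not Akka's class)."""

    def __init__(self):
        self.pending = {}  # seq -> payload still awaiting acknowledgement

    def buffer(self, seq, payload):
        self.pending[seq] = payload

    def acknowledge(self, cumulative_ack, nacks=()):
        """Process ACK[cumulative_ack, {nacks}] and return payloads to resend."""
        missing = [seq for seq in nacks if seq not in self.pending]
        if missing:
            raise ResendUnfulfillable(
                f"nacked payload no longer in buffer: {sorted(missing)}")
        to_resend = [self.pending[seq] for seq in sorted(nacks)]
        # Drop everything covered by the cumulative ack, except what was nacked.
        self.pending = {s: p for s, p in self.pending.items()
                        if s > cumulative_ack or s in nacks}
        return to_resend


buf = SendBuffer()
buf.buffer(0, "watch-0")
buf.acknowledge(0)              # message 0 is confirmed and dropped
buf.buffer(1, "watch-1")
try:
    buf.acknowledge(1, nacks={0})   # receiver "forgot" message 0
    outcome = "resent"
except ResendUnfulfillable as e:
    outcome = str(e)
print(outcome)
```

In this model the second acknowledgement fails because message 0 was already removed after the first one, which is the situation the exception message describes.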
…23010

Reproducer (TransportFailSpec):
* watch from first to second node, i.e. sys msg with seq number 1
* trigger transport failure detection to tear down the connection
* the bug was that on the second node the ReliableDeliverySupervisor was stopped because the send buffer had not been used on that side, but that removed the receive buffer entry
* later, after gating elapsed, another watch from first to second node, i.e. sys msg with seq number 2
* when that watch msg was received on the second node the receive buffer had been cleared and therefore it thought that seq number 1 was missing, and therefore sent nack to the first node
* when the first node received the nack it threw:
  IllegalStateException: Error encountered while processing system message acknowledgement buffer: [2 {2}] ack: ACK[2, {1, 0}]
  caused by: ResendUnfulfillableException: Unable to fulfill resend request since negatively acknowledged payload is no longer in buffer

This was fixed by not stopping the ReliableDeliverySupervisor so that the receive buffer was preserved.

Not necessary for fixing the issue, but the following config settings were adjusted:
* increased transport-failure-detector timeout to avoid tearing down the connection too early
* reduced the quarantine-after-silence to clean up ReliableDeliverySupervisor actors earlier
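The receiver side of the reproducer can be sketched the same way. This is a toy model under stated assumptions, not Akka code: `ReceiveState` and `run` are hypothetical names, sequence numbers start at 0 in the model, and "stopping the ReliableDeliverySupervisor" is represented simply as replacing the receive state with a fresh one.

```python
class ReceiveState:
    """Toy receiver-side state for one remote sender (hypothetical)."""

    def __init__(self):
        self.seen = set()  # sequence numbers received so far

    def receive(self, seq):
        """Record seq and return (cumulative_ack, nacks) to send back."""
        self.seen.add(seq)
        highest = max(self.seen)
        # Everything below the highest seen number that never arrived is nacked.
        nacks = {s for s in range(highest) if s not in self.seen}
        return highest, nacks


def run(preserve_state_on_teardown):
    state = ReceiveState()
    replies = [state.receive(0), state.receive(1)]
    # The transport failure detector tears down the connection here.
    if not preserve_state_on_teardown:
        state = ReceiveState()  # the bug: receive state is lost
    replies.append(state.receive(2))  # another watch after gating elapsed
    return replies


print(run(preserve_state_on_teardown=False)[-1])
print(run(preserve_state_on_teardown=True)[-1])
```

With the state dropped, the last reply nacks sequence numbers the sender has already had acknowledged; with the state preserved, which is the model's stand-in for the fix described above, nothing is nacked.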
…23010 (same commit message as above; cherry picked from commit 32f0936)
hi hAkkers
We will release on Monday
…ption-patriknw Fix ResendUnfulfillableException after transport failure detection, #23010
…ption-2.4-patriknw Fix ResendUnfulfillableException after transport failure detection, #23010 (for validation)
* follow up on #23010, ActorsLeakSpec sometimes fails because the reliableEndpointWriter is not stopped as early as before
increase timeout in ActorsLeakSpec, #23010
…iknw increase timeout in ActorsLeakSpec, #23010
@patriknw thank you, I see the release for 2.4. Is there a chance to see 2.5 with this fix soon?
Yes, we will release 2.5.3 in a few days.
We have seen a few (but infrequent) reports of quarantining happening with this error message.
The error messages usually look like this:
The particular acknowledgement setup can be different.
We have seen reports for this occurring on Akka 2.4.7, 2.4.11, and 2.4.17.
It seems to happen under different kinds of circumstances:
So far, we haven't been able to reproduce the issue, nor have we been able to get hold of a complete set of logs that would allow us to reproduce it.
Maybe related: #16623 and #19780