
akka.remote.ResendUnfulfillableException: Unable to fulfill resend request since negatively acknowledged payload is no longer in buffer. The resend states between two systems are compromised and cannot be recovered. #23010

Closed
jrudolph opened this issue May 23, 2017 · 5 comments
@jrudolph (Member)

We have seen a few (infrequent) reports of nodes being quarantined with this error message.

The error messages usually look like this:

Association to [...] with UID [...] irrecoverably failed. Quarantining address. 
java.lang.IllegalStateException: Error encountered while processing system message acknowledgement buffer: [7 {1, 2, 3, 4, 5, 6, 7}] ack: ACK[7, {0}] 
at akka.remote.ReliableDeliverySupervisor$$anonfun$receive$1.applyOrElse(Endpoint.scala:301) 
at akka.actor.Actor$class.aroundReceive(Actor.scala:484) 
at akka.remote.ReliableDeliverySupervisor.aroundReceive(Endpoint.scala:200) 
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526) 
at akka.actor.ActorCell.invoke(ActorCell.scala:495) 
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257) 
at akka.dispatch.Mailbox.run(Mailbox.scala:224) 
at akka.dispatch.Mailbox.exec(Mailbox.scala:234) 
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) 
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) 
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) 
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) 
Caused by: akka.remote.ResendUnfulfillableException: Unable to fulfill resend request since negatively acknowledged payload is no longer in buffer. The resend states between two systems are compromised and cannot be recovered. 
at akka.remote.AckedSendBuffer.acknowledge(AckedDelivery.scala:107) 
at akka.remote.ReliableDeliverySupervisor$$anonfun$receive$1.applyOrElse(Endpoint.scala:297) 
... 11 common frames omitted 

The particular acknowledgement state in the message varies between reports.

We have seen reports for this occurring on Akka 2.4.7, 2.4.11, and 2.4.17.

It seems to happen under different kinds of circumstances:

  • node is restarted
  • temporary unreachability of a node due to GC
  • underlying connection breakdown due to external causes

So far, we haven't been able to reproduce the issue or to get hold of a complete set of logs that would allow us to do so.

Maybe related: #16623 and #19780

@jrudolph added labels 1 - triaged, bug, t:remoting on May 23, 2017
@jrudolph (Member, Author)

The error message means that the sending side of a system message first receives acknowledgement for a message and then later receives a negative acknowledgement for that same message.

How could that happen? IIUC these buffers and management data structures are kept even if the physical connection had to be reestablished.

E.g. in the case above

 acknowledgement buffer: [7 {1, 2, 3, 4, 5, 6, 7}] ack: ACK[7, {0}] 

Message 0 was likely confirmed earlier. The receiver then seems to "have forgotten" about that message and when it receives other messages later on it sends a negative acknowledgement for that message. The question is why this information gets lost at some point.
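This invariant can be illustrated with a simplified model of an acknowledged send buffer (plain Python for illustration, not Akka's actual `AckedSendBuffer` implementation; names and semantics are simplified). A cumulative ACK lets the sender drop buffered messages; a NACK requests a resend. If a NACK arrives for a sequence number that an earlier ACK already caused the sender to drop, the two sides' states can no longer be reconciled:

```python
class ResendUnfulfillableError(Exception):
    """Raised when a NACKed sequence number is no longer buffered."""

class SendBuffer:
    """Simplified model: messages stay buffered until a cumulative ACK
    confirms them; a NACK asks for a resend of specific sequence numbers."""

    def __init__(self):
        self.non_acked = {}  # seq -> payload, awaiting acknowledgement

    def buffer(self, seq, payload):
        self.non_acked[seq] = payload

    def acknowledge(self, cumulative_ack, nacks):
        # Drop everything the cumulative ack confirms, except seqs that
        # were explicitly negatively acknowledged.
        self.non_acked = {s: p for s, p in self.non_acked.items()
                          if s > cumulative_ack or s in nacks}
        # Collect what the receiver asked to have resent.
        resend = []
        for seq in sorted(nacks):
            if seq not in self.non_acked:
                # The receiver NACKed a message we already dropped after an
                # earlier ACK: the two sides' states are irreconcilable.
                raise ResendUnfulfillableError(seq)
            resend.append(self.non_acked[seq])
        return resend

buf = SendBuffer()
for seq in range(1, 8):
    buf.buffer(seq, f"msg-{seq}")
# Seq 0 was confirmed long ago and is no longer buffered; a late NACK
# for it (as in ACK[7, {0}] above) cannot be fulfilled.
try:
    buf.acknowledge(cumulative_ack=7, nacks={0})
except ResendUnfulfillableError as e:
    print("unfulfillable resend for seq", e.args[0])
```

This mirrors the quoted log: the buffer holds seqs 1 through 7, while the ack claims everything up to 7 was delivered except seq 0, which was dropped long ago.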

patriknw added a commit that referenced this issue Jun 9, 2017
…23010

Reproducer (TransportFailSpec):

* watch from first to second node, i.e. sys msg with seq number 1
* trigger transport failure detection to tear down the connection
* the bug was that on the second node the ReliableDeliverySupervisor
  was stopped because the send buffer had not been used on that side,
  but that removed the receive buffer entry
* later, after gating elapsed another watch from first to second node,
  i.e. sys msg with seq number 2
* when that watch msg was received on the second node, the receive buffer
  had been cleared, so it thought that seq number 1 was missing and
  sent a nack to the first node
* when the first node received the nack it threw
  IllegalStateException: Error encountered while processing system message
    acknowledgement buffer: [2 {2}] ack: ACK[2, {1, 0}]
  caused by: ResendUnfulfillableException: Unable to fulfill resend request since
    negatively acknowledged payload is no longer in buffer

This was fixed by not stopping the ReliableDeliverySupervisor so that the
receive buffer was preserved.

Not necessary for fixing the issue, but the following config settings were adjusted:
* increased the transport-failure-detector timeout to avoid tearing down the
  connection too early
* reduced the quarantine-after-silence to clean up ReliableDeliverySupervisor
  actors earlier
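The receive-side state loss in the reproducer above can be sketched with a minimal model (illustrative Python, not Akka's actual `AckedReceiveBuffer`). The receiver tracks the highest contiguous sequence number it has delivered; if that state is reset mid-session, the next message looks like it has a gap in front of it, and the receiver sends a spurious NACK:

```python
class ReceiveBuffer:
    """Simplified receive side: tracks the last contiguous sequence number
    delivered and NACKs any gap. Resetting this state mid-session is the
    bug's trigger."""

    def __init__(self):
        self.last_delivered = 0  # highest contiguous seq number received

    def receive(self, seq):
        """Return (cumulative_ack, nacks) to send back to the sender."""
        if seq == self.last_delivered + 1:
            self.last_delivered = seq
            return self.last_delivered, set()
        # Gap detected: negatively acknowledge the missing numbers.
        missing = set(range(self.last_delivered + 1, seq))
        return seq, missing

rb = ReceiveBuffer()
print(rb.receive(1))   # first watch, seq 1: delivered in order, no nacks
rb = ReceiveBuffer()   # supervisor stopped: receive state is lost
print(rb.receive(2))   # second watch, seq 2: spurious nack for seq 1
```

The second `receive(2)` produces a NACK for seq 1 even though the first node's earlier watch was delivered successfully, which is exactly the contradictory ack/nack combination that makes the sender throw.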
patriknw added a commit that referenced this issue Jun 9, 2017
patriknw added a commit that referenced this issue Jun 9, 2017
patriknw added a commit that referenced this issue Jun 9, 2017
(cherry picked from commit 32f0936)
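The two tuning knobs mentioned in the commit message live under `akka.remote` in classic remoting. A hedged sketch of how such an adjustment might look in `application.conf` (the values here are hypothetical; the actual values used in the fix are not shown in this thread):

```hocon
akka.remote {
  # Allow longer heartbeat pauses before the transport failure detector
  # tears down the connection (hypothetical value).
  transport-failure-detector.acceptable-heartbeat-pause = 30 s

  # Clean up idle ReliableDeliverySupervisor actors sooner by quarantining
  # after a shorter period of silence (hypothetical value).
  quarantine-after-silence = 1 h
}
```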
@patriknw added this to the 2.4.19 milestone on Jun 9, 2017
@patriknw added the 3 - in progress label and removed the 1 - triaged label on Jun 9, 2017
@patriknw self-assigned this on Jun 9, 2017
@412b (Contributor) commented Jun 10, 2017

hi hAkkers,
we've just recently hit this particular bug in production, so could this be pushed a bit and released faster? :)
I see it as a hidden timebomb in our cluster, and a fix for it is critical for us :)

@patriknw (Member)

We will release on Monday.

patriknw added a commit that referenced this issue Jun 11, 2017
…ption-patriknw

Fix ResendUnfulfillableException after transport failure detection, #23010
patriknw added a commit that referenced this issue Jun 11, 2017
…ption-2.4-patriknw

Fix ResendUnfulfillableException after transport failure detection, #23010 (for validation)
@patriknw removed the 3 - in progress label on Jun 11, 2017
patriknw added a commit that referenced this issue Jun 11, 2017
* follow up on #23010, ActorsLeakSpec sometimes fails
  because the reliableEndpointWriter is not stopped as
  early as before
patriknw added a commit that referenced this issue Jun 11, 2017
(cherry picked from commit 4f9921b)
patriknw added a commit that referenced this issue Jun 11, 2017
patriknw added a commit that referenced this issue Jun 11, 2017
patriknw added a commit that referenced this issue Jun 12, 2017
patriknw added a commit that referenced this issue Jun 12, 2017
@412b (Contributor) commented Jun 13, 2017

@patriknw thank you, I see the release for 2.4. Is there a chance to see 2.5 with this fix soon?

@patriknw (Member)

Yes, we will release 2.5.3 in a few days.
