
akka.remote.ResendUnfulfillableException: Unable to fulfill resend request since negatively acknowledged payload is no longer in buffer. The resend states between two systems are compromised and cannot be recovered. #23010

Closed
jrudolph opened this issue May 23, 2017 · 5 comments
@jrudolph (Member)

We have seen a few (infrequent) reports of nodes being quarantined with this error message.

The error messages usually look like this:

Association to [...] with UID [...] irrecoverably failed. Quarantining address. 
java.lang.IllegalStateException: Error encountered while processing system message acknowledgement buffer: [7 {1, 2, 3, 4, 5, 6, 7}] ack: ACK[7, {0}] 
at akka.remote.ReliableDeliverySupervisor$$anonfun$receive$1.applyOrElse(Endpoint.scala:301) 
at akka.actor.Actor$class.aroundReceive(Actor.scala:484) 
at akka.remote.ReliableDeliverySupervisor.aroundReceive(Endpoint.scala:200) 
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526) 
at akka.actor.ActorCell.invoke(ActorCell.scala:495) 
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257) 
at akka.dispatch.Mailbox.run(Mailbox.scala:224) 
at akka.dispatch.Mailbox.exec(Mailbox.scala:234) 
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) 
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) 
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) 
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) 
Caused by: akka.remote.ResendUnfulfillableException: Unable to fulfill resend request since negatively acknowledged payload is no longer in buffer. The resend states between two systems are compromised and cannot be recovered. 
at akka.remote.AckedSendBuffer.acknowledge(AckedDelivery.scala:107) 
at akka.remote.ReliableDeliverySupervisor$$anonfun$receive$1.applyOrElse(Endpoint.scala:297) 
... 11 common frames omitted 

The particular acknowledgement state in the message varies between reports.

We have seen reports for this occurring on Akka 2.4.7, 2.4.11, and 2.4.17.

It seems to happen under different kinds of circumstances:

  • node is restarted
  • temporary unreachability of a node due to GC
  • underlying connection breakdown due to external causes

So far, we haven't been able to reproduce the issue or to get hold of a complete set of logs that would allow us to do so.

Maybe related: #16623 and #19780

@jrudolph added labels 1 - triaged, bug, t:remoting on May 23, 2017
@jrudolph (Member, Author)

The error message means that the sending side of a system message first receives acknowledgement for a message and then later receives a negative acknowledgement for that same message.

How could that happen? IIUC these buffers and management data structures are kept even if the physical connection had to be reestablished.

E.g. in the case above

 acknowledgement buffer: [7 {1, 2, 3, 4, 5, 6, 7}] ack: ACK[7, {0}] 

Message 0 was likely confirmed earlier. The receiver then seems to "have forgotten" about that message and when it receives other messages later on it sends a negative acknowledgement for that message. The question is why this information gets lost at some point.
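This invariant can be illustrated with a simplified model of an acknowledged send buffer (plain Python for illustration, not Akka's actual `AckedSendBuffer` implementation; names and semantics are simplified). A cumulative ACK lets the sender drop buffered messages; a NACK requests a resend. If a NACK arrives for a sequence number that an earlier ACK already caused the sender to drop, the two sides' states can no longer be reconciled:

```python
class ResendUnfulfillableError(Exception):
    """Raised when a NACKed sequence number is no longer buffered."""

class SendBuffer:
    """Simplified model: messages stay buffered until a cumulative ACK
    confirms them; a NACK asks for a resend of specific sequence numbers."""

    def __init__(self):
        self.non_acked = {}  # seq -> payload, awaiting acknowledgement

    def buffer(self, seq, payload):
        self.non_acked[seq] = payload

    def acknowledge(self, cumulative_ack, nacks):
        # Drop everything the cumulative ack confirms, except seqs that
        # were explicitly negatively acknowledged.
        self.non_acked = {s: p for s, p in self.non_acked.items()
                          if s > cumulative_ack or s in nacks}
        # Collect what the receiver asked to have resent.
        resend = []
        for seq in sorted(nacks):
            if seq not in self.non_acked:
                # The receiver NACKed a message we already dropped after an
                # earlier ACK: the two sides' states are irreconcilable.
                raise ResendUnfulfillableError(seq)
            resend.append(self.non_acked[seq])
        return resend

buf = SendBuffer()
for seq in range(1, 8):
    buf.buffer(seq, f"msg-{seq}")
# Seq 0 was confirmed long ago and is no longer buffered; a late NACK
# for it (as in ACK[7, {0}] above) cannot be fulfilled.
try:
    buf.acknowledge(cumulative_ack=7, nacks={0})
except ResendUnfulfillableError as e:
    print("unfulfillable resend for seq", e.args[0])
```

This mirrors the quoted log: the buffer holds seqs 1 through 7, while the ack claims everything up to 7 was delivered except seq 0, which was dropped long ago.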

patriknw added a commit that referenced this issue Jun 9, 2017
…23010

Reproducer (TransportFailSpec):

* watch from first to second node, i.e. sys msg with seq number 1
* trigger transport failure detection to tear down the connection
* the bug was that on the second node the ReliableDeliverySupervisor
  was stopped because the send buffer had not been used on that side,
  but that removed the receive buffer entry
* later, after gating elapsed another watch from first to second node,
  i.e. sys msg with seq number 2
* when that watch msg was received on the second node, the receive buffer
  had been cleared, so it thought that seq number 1 was missing and
  sent a nack to the first node
* when the first node received the nack it threw
  IllegalStateException: Error encountered while processing system message
    acknowledgement buffer: [2 {2}] ack: ACK[2, {1, 0}]
  caused by: ResendUnfulfillableException: Unable to fulfill resend request since
    negatively acknowledged payload is no longer in buffer

This was fixed by not stopping the ReliableDeliverySupervisor so that the
receive buffer was preserved.

Not necessary for fixing the issue, but the following config settings were adjusted:
* increased the transport-failure-detector timeout to avoid tearing down the
  connection too early
* reduced the quarantine-after-silence to clean up ReliableDeliverySupervisor
  actors earlier
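The receive-side state loss in the reproducer above can be sketched with a minimal model (illustrative Python, not Akka's actual `AckedReceiveBuffer`). The receiver tracks the highest contiguous sequence number it has delivered; if that state is reset mid-session, the next message looks like it has a gap in front of it, and the receiver sends a spurious NACK:

```python
class ReceiveBuffer:
    """Simplified receive side: tracks the last contiguous sequence number
    delivered and NACKs any gap. Resetting this state mid-session is the
    bug's trigger."""

    def __init__(self):
        self.last_delivered = 0  # highest contiguous seq number received

    def receive(self, seq):
        """Return (cumulative_ack, nacks) to send back to the sender."""
        if seq == self.last_delivered + 1:
            self.last_delivered = seq
            return self.last_delivered, set()
        # Gap detected: negatively acknowledge the missing numbers.
        missing = set(range(self.last_delivered + 1, seq))
        return seq, missing

rb = ReceiveBuffer()
print(rb.receive(1))   # first watch, seq 1: delivered in order, no nacks
rb = ReceiveBuffer()   # supervisor stopped: receive state is lost
print(rb.receive(2))   # second watch, seq 2: spurious nack for seq 1
```

The second `receive(2)` produces a NACK for seq 1 even though the first node's earlier watch was delivered successfully, which is exactly the contradictory ack/nack combination that makes the sender throw.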
patriknw added a commit that referenced this issue Jun 9, 2017
patriknw added a commit that referenced this issue Jun 9, 2017
patriknw added a commit that referenced this issue Jun 9, 2017
(cherry picked from commit 32f0936)
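The two tuning knobs mentioned in the commit message live under `akka.remote` in classic remoting. A hedged sketch of how such an adjustment might look in `application.conf` (the values here are hypothetical; the actual values used in the fix are not shown in this thread):

```hocon
akka.remote {
  # Allow longer heartbeat pauses before the transport failure detector
  # tears down the connection (hypothetical value).
  transport-failure-detector.acceptable-heartbeat-pause = 30 s

  # Clean up idle ReliableDeliverySupervisor actors sooner by quarantining
  # after a shorter period of silence (hypothetical value).
  quarantine-after-silence = 1 h
}
```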
@patriknw added this to the 2.4.19 milestone on Jun 9, 2017
@patriknw added the 3 - in progress label and removed the 1 - triaged label on Jun 9, 2017
@patriknw self-assigned this on Jun 9, 2017
@412b (Contributor) commented Jun 10, 2017

hi hAkkers,
we've just recently hit this particular bug in production, so could this be pushed a bit and released faster? :)
I see it as a hidden timebomb in our cluster, and a fix for it is critical for us :)

@patriknw (Member)

We will release on Monday.

patriknw added a commit that referenced this issue Jun 11, 2017
…ption-patriknw

Fix ResendUnfulfillableException after transport failure detection, #23010
patriknw added a commit that referenced this issue Jun 11, 2017
…ption-2.4-patriknw

Fix ResendUnfulfillableException after transport failure detection, #23010 (for validation)
@patriknw removed the 3 - in progress label on Jun 11, 2017
patriknw added a commit that referenced this issue Jun 11, 2017
* follow up on #23010, ActorsLeakSpec sometimes fails
  because the reliableEndpointWriter is not stopped as
  early as before
patriknw added a commit that referenced this issue Jun 11, 2017
(cherry picked from commit 4f9921b)
patriknw added a commit that referenced this issue Jun 11, 2017
patriknw added a commit that referenced this issue Jun 11, 2017
patriknw added a commit that referenced this issue Jun 12, 2017
patriknw added a commit that referenced this issue Jun 12, 2017
@412b (Contributor) commented Jun 13, 2017

@patriknw thank you, I see the release for 2.4. Is there a chance to see 2.5 with this fix soon?

@patriknw (Member)

Yes, we will release 2.5.3 in a few days.
