akka.remote.ResendUnfulfillableException: Unable to fulfill resend request since negatively acknowledged payload is no longer in buffer. The resend states between two systems are compromised and cannot be recovered. #23010

Closed
jrudolph opened this Issue May 23, 2017 · 5 comments

Comments

Projects
None yet
3 participants
Member

jrudolph commented May 23, 2017

We have seen a few (but infrequent) reports of quarantining happening with this error message.

The error messages usually look like this:

Association to [...] with UID [...] irrecoverably failed. Quarantining address. 
java.lang.IllegalStateException: Error encountered while processing system message acknowledgement buffer: [7 {1, 2, 3, 4, 5, 6, 7}] ack: ACK[7, {0}] 
at akka.remote.ReliableDeliverySupervisor$$anonfun$receive$1.applyOrElse(Endpoint.scala:301) 
at akka.actor.Actor$class.aroundReceive(Actor.scala:484) 
at akka.remote.ReliableDeliverySupervisor.aroundReceive(Endpoint.scala:200) 
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526) 
at akka.actor.ActorCell.invoke(ActorCell.scala:495) 
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257) 
at akka.dispatch.Mailbox.run(Mailbox.scala:224) 
at akka.dispatch.Mailbox.exec(Mailbox.scala:234) 
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) 
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) 
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) 
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) 
Caused by: akka.remote.ResendUnfulfillableException: Unable to fulfill resend request since negatively acknowledged payload is no longer in buffer. The resend states between two systems are compromised and cannot be recovered. 
at akka.remote.AckedSendBuffer.acknowledge(AckedDelivery.scala:107) 
at akka.remote.ReliableDeliverySupervisor$$anonfun$receive$1.applyOrElse(Endpoint.scala:297) 
... 11 common frames omitted 

The particular acknowledgement setup can be different.

We have seen reports for this occurring on Akka 2.4.7, 2.4.11, and 2.4.17.

It seems to happen under different kind of circumstances:

  • node is restarted
  • temporary unreachability of a node due to GC
  • underlying connection breakdown due to external causes

So far, we haven't been able to reproduce the issue or having been able to get hold of a complete set of logs that would allow us to reproduce the issue.

Maybe related: #16623 and #19780

Member

jrudolph commented May 23, 2017

The error message means that the sending side of a system message first receives acknowledgement for a message and then later receives a negative acknowledgement for that same message.

How could that happen? IIUC these buffers and management data structures are kept even if the physical connection had to be reestablished.

E.g. in the case above

 acknowledgement buffer: [7 {1, 2, 3, 4, 5, 6, 7}] ack: ACK[7, {0}] 

Message 0 was likely confirmed earlier. The receiver then seems to "have forgotten" about that message and when it receives other messages later on it sends a negative acknowledgement for that message. The question is why this information gets lost at some point.

@patriknw patriknw added a commit that referenced this issue Jun 9, 2017

@patriknw patriknw Fix ResendUnfulfillableException after transport failure detection, #…
…23010

Reproducer (TransportFailSpec):

* watch from first to second node, i.e. sys msg with seq number 1
* trigger transport failure detection to tear down the connection
* the bug was that on the second node the ReliableDeliverySupervisor
  was stopped because the send buffer had not been used on that side,
  but that removed the receive buffer entry
* later, after gating elapsed another watch from first to second node,
  i.e. sys msg with seq number 2
* when that watch msg was received on the second node the receive buffer
  had been cleared and therefore it thought that seq number 1 was missing,
  and therefore sent nack to the first node
* when first node received the nack it thrown
  IllegalStateException: Error encountered while processing system message
    acknowledgement buffer: [2 {2}] ack: ACK[2, {1, 0}]
  caused by: ResendUnfulfillableException: Unable to fulfill resend request since
    negatively acknowledged payload is no longer in buffer

This was fixed by not stopping the ReliableDeliverySupervisor so that the
receive buffer was preserved.

Not necessary for fixing the issue, but the following config was adjusted
* increased transport-failure-detector timeout to avoid tearing down the
  connection too early
* reduce the quarantine-after-silence to cleanup ReliableDeliverySupervisor
  actors earlier
4e91f8d

@patriknw patriknw added a commit that referenced this issue Jun 9, 2017

@patriknw patriknw Fix ResendUnfulfillableException after transport failure detection, #…
…23010

Reproducer (TransportFailSpec):

* watch from first to second node, i.e. sys msg with seq number 1
* trigger transport failure detection to tear down the connection
* the bug was that on the second node the ReliableDeliverySupervisor
  was stopped because the send buffer had not been used on that side,
  but that removed the receive buffer entry
* later, after gating elapsed another watch from first to second node,
  i.e. sys msg with seq number 2
* when that watch msg was received on the second node the receive buffer
  had been cleared and therefore it thought that seq number 1 was missing,
  and therefore sent nack to the first node
* when first node received the nack it thrown
  IllegalStateException: Error encountered while processing system message
    acknowledgement buffer: [2 {2}] ack: ACK[2, {1, 0}]
  caused by: ResendUnfulfillableException: Unable to fulfill resend request since
    negatively acknowledged payload is no longer in buffer

This was fixed by not stopping the ReliableDeliverySupervisor so that the
receive buffer was preserved.

Not necessary for fixing the issue, but the following config settings were adjusted:
* increased transport-failure-detector timeout to avoid tearing down the
  connection too early
* reduce the quarantine-after-silence to cleanup ReliableDeliverySupervisor
  actors earlier
c7cab39

@patriknw patriknw added a commit that referenced this issue Jun 9, 2017

@patriknw patriknw Fix ResendUnfulfillableException after transport failure detection, #…
…23010

Reproducer (TransportFailSpec):

* watch from first to second node, i.e. sys msg with seq number 1
* trigger transport failure detection to tear down the connection
* the bug was that on the second node the ReliableDeliverySupervisor
  was stopped because the send buffer had not been used on that side,
  but that removed the receive buffer entry
* later, after gating elapsed another watch from first to second node,
  i.e. sys msg with seq number 2
* when that watch msg was received on the second node the receive buffer
  had been cleared and therefore it thought that seq number 1 was missing,
  and therefore sent nack to the first node
* when first node received the nack it thrown
  IllegalStateException: Error encountered while processing system message
    acknowledgement buffer: [2 {2}] ack: ACK[2, {1, 0}]
  caused by: ResendUnfulfillableException: Unable to fulfill resend request since
    negatively acknowledged payload is no longer in buffer

This was fixed by not stopping the ReliableDeliverySupervisor so that the
receive buffer was preserved.

Not necessary for fixing the issue, but the following config settings were adjusted:
* increased transport-failure-detector timeout to avoid tearing down the
  connection too early
* reduce the quarantine-after-silence to cleanup ReliableDeliverySupervisor
  actors earlier
32f0936

@patriknw patriknw added a commit that referenced this issue Jun 9, 2017

@patriknw patriknw Fix ResendUnfulfillableException after transport failure detection, #…
…23010

Reproducer (TransportFailSpec):

* watch from first to second node, i.e. sys msg with seq number 1
* trigger transport failure detection to tear down the connection
* the bug was that on the second node the ReliableDeliverySupervisor
  was stopped because the send buffer had not been used on that side,
  but that removed the receive buffer entry
* later, after gating elapsed another watch from first to second node,
  i.e. sys msg with seq number 2
* when that watch msg was received on the second node the receive buffer
  had been cleared and therefore it thought that seq number 1 was missing,
  and therefore sent nack to the first node
* when first node received the nack it thrown
  IllegalStateException: Error encountered while processing system message
    acknowledgement buffer: [2 {2}] ack: ACK[2, {1, 0}]
  caused by: ResendUnfulfillableException: Unable to fulfill resend request since
    negatively acknowledged payload is no longer in buffer

This was fixed by not stopping the ReliableDeliverySupervisor so that the
receive buffer was preserved.

Not necessary for fixing the issue, but the following config settings were adjusted:
* increased transport-failure-detector timeout to avoid tearing down the
  connection too early
* reduce the quarantine-after-silence to cleanup ReliableDeliverySupervisor
  actors earlier

(cherry picked from commit 32f0936)
63676ee

patriknw added this to the 2.4.19 milestone Jun 9, 2017

@patriknw patriknw added 3 - in progress and removed 1 - triaged labels Jun 9, 2017

patriknw self-assigned this Jun 9, 2017

Contributor

412b commented Jun 10, 2017 edited

hi hAkkers
we've just recently hit that particular bug in production, so can this be pushed a bit and released faster? :)
I see it being a hidden timebomb in our cluster and fix for it is critical for us :)

Owner

patriknw commented Jun 11, 2017

We will release on Monday

@patriknw patriknw added a commit that referenced this issue Jun 11, 2017

@patriknw patriknw Merge pull request #23129 from akka/wip-23010-ResendUnfulfillableExce…
…ption-patriknw

Fix ResendUnfulfillableException after transport failure detection, #23010
14617d3

@patriknw patriknw added a commit that referenced this issue Jun 11, 2017

@patriknw patriknw Merge pull request #23130 from akka/wip-23010-ResendUnfulfillableExce…
…ption-2.4-patriknw

Fix ResendUnfulfillableException after transport failure detection, #23010 (for validation)
957bdbd

patriknw closed this Jun 11, 2017

@patriknw patriknw added a commit that referenced this issue Jun 11, 2017

@patriknw patriknw increase timeout in ActorsLeakSpec, #23010
* follow up on #23010, ActorsLeakSpec sometimes fails
  because the reliableEndpointWriter is not stopped as
  early as before
4f9921b

@patriknw patriknw added a commit that referenced this issue Jun 11, 2017

@patriknw patriknw increase timeout in ActorsLeakSpec, #23010
* follow up on #23010, ActorsLeakSpec sometimes fails
  because the reliableEndpointWriter is not stopped as
  early as before

(cherry picked from commit 4f9921b)
34ac46f

@patriknw patriknw added a commit that referenced this issue Jun 11, 2017

@patriknw patriknw Merge pull request #23137 from akka/wip-23010-ActorsLeakSpec-patriknw
increase timeout in ActorsLeakSpec, #23010
c761772

@patriknw patriknw added a commit that referenced this issue Jun 11, 2017

@patriknw patriknw Merge pull request #23138 from akka/wip-23010-ActorsLeakSpec-2.4-patr…
…iknw

increase timeout in ActorsLeakSpec, #23010
e983068

@patriknw patriknw added a commit that referenced this issue Jun 12, 2017

@patriknw patriknw Fix ResendUnfulfillableException after transport failure detection, #…
…23010

Reproducer (TransportFailSpec):

* watch from first to second node, i.e. sys msg with seq number 1
* trigger transport failure detection to tear down the connection
* the bug was that on the second node the ReliableDeliverySupervisor
  was stopped because the send buffer had not been used on that side,
  but that removed the receive buffer entry
* later, after gating elapsed another watch from first to second node,
  i.e. sys msg with seq number 2
* when that watch msg was received on the second node the receive buffer
  had been cleared and therefore it thought that seq number 1 was missing,
  and therefore sent nack to the first node
* when first node received the nack it thrown
  IllegalStateException: Error encountered while processing system message
    acknowledgement buffer: [2 {2}] ack: ACK[2, {1, 0}]
  caused by: ResendUnfulfillableException: Unable to fulfill resend request since
    negatively acknowledged payload is no longer in buffer

This was fixed by not stopping the ReliableDeliverySupervisor so that the
receive buffer was preserved.

Not necessary for fixing the issue, but the following config settings were adjusted:
* increased transport-failure-detector timeout to avoid tearing down the
  connection too early
* reduce the quarantine-after-silence to cleanup ReliableDeliverySupervisor
  actors earlier
a4ab142

@patriknw patriknw added a commit that referenced this issue Jun 12, 2017

@patriknw patriknw increase timeout in ActorsLeakSpec, #23010
* follow up on #23010, ActorsLeakSpec sometimes fails
  because the reliableEndpointWriter is not stopped as
  early as before
6685bb8
Contributor

412b commented Jun 13, 2017

@patriknw thank you, I see release for 2.4. is there a chance to see 2.5 with this fix soon?

Owner

patriknw commented Jun 13, 2017

yes, we will release 2.5.3 in a few days

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment