Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error while decoding incoming Akka PDU Exception when communicating between Remote Actors with a large number of messages on Linux #3370

Closed
IanPattersonMuo opened this issue Mar 22, 2018 · 8 comments · Fixed by #3395
Labels
Milestone

Comments

@IanPattersonMuo
Copy link

Akka.Net 1.3.5

We get an "Error while decoding incoming Akka PDU" exception when sending a large number of messages from a Shard Entity to another actor in a different process in the same cluster. It manifests when deployed on multiple Linux machines or whilst running in Docker (Linux). The error is seen on the client side of the communication and forces the process to Disassociate from the cluster. The full stack trace from the client process is as follows:

[ERROR][03/22/2018 15:36:58][Thread 0012][[akka://testsystem/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2Ftestsystem%40172.18.0.2%3A4053-1/endpointWriter#790743899]] AssociationError [akka.tcp://testsystem@172.18.0.3:42941] -> akka.tcp://testsystem@172.18.0.2:4053: Error [Error while decoding incoming Akka PDU] [   at Akka.Remote.EndpointReader.TryDecodeMessageAndAck(ByteString pdu)
   at Akka.Remote.EndpointReader.<Reading>b__11_1(InboundPayload inbound)
   at lambda_method(Closure , Object , Action`1 , Action`1 , Action`1 )
   at Akka.Actor.ReceiveActor.ExecutePartialMessageHandler(Object message, PartialAction`1 partialAction)
   at Akka.Actor.UntypedActor.Receive(Object message)
   at Akka.Actor.ActorBase.AroundReceive(Receive receive, Object message)
   at Akka.Actor.ActorCell.ReceiveMessage(Object message)
   at Akka.Actor.ActorCell.Invoke(Envelope envelope)]
Cause: Unknown
[WARNING][03/22/2018 15:36:58][Thread 0012][[akka://testsystem/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2Ftestsystem%40172.18.0.2%3A4053-1#1816327727]] Association with remote system akka.tcp://testsystem@172.18.0.2:4053 has failed; address is now gated for 5000 ms. Reason is: [Akka.Remote.EndpointException: Error while decoding incoming Akka PDU ---> Google.Protobuf.InvalidProtocolBufferException: While parsing a protocol message, the input ended unexpectedly in the middle of a field.  This could mean either that the input has been truncated or that an embedded message misreported its own length.
   at Google.Protobuf.CodedInputStream.RefillBuffer(Boolean mustSucceed)
   at Google.Protobuf.CodedInputStream.ReadRawByte()
   at Google.Protobuf.CodedInputStream.ReadRawLittleEndian64()
   at Google.Protobuf.CodedInputStream.SkipLastField()
   at Akka.Remote.Serialization.Proto.Msg.Payload.MergeFrom(CodedInputStream input)
   at Google.Protobuf.CodedInputStream.ReadMessage(IMessage builder)
   at Akka.Remote.Serialization.Proto.Msg.RemoteEnvelope.MergeFrom(CodedInputStream input)
   at Google.Protobuf.CodedInputStream.ReadMessage(IMessage builder)
   at Akka.Remote.Serialization.Proto.Msg.AckAndEnvelopeContainer.MergeFrom(CodedInputStream input)
   at Google.Protobuf.MessageExtensions.MergeFrom(IMessage message, ByteString data)
   at Google.Protobuf.MessageParser`1.ParseFrom(ByteString data)
   at Akka.Remote.Transport.AkkaPduProtobuffCodec.DecodeMessage(ByteString raw, IRemoteActorRefProvider provider, Address localAddress)
   at Akka.Remote.EndpointReader.TryDecodeMessageAndAck(ByteString pdu)
   --- End of inner exception stack trace ---
   at Akka.Remote.EndpointReader.TryDecodeMessageAndAck(ByteString pdu)
   at Akka.Remote.EndpointReader.<Reading>b__11_1(InboundPayload inbound)
   at lambda_method(Closure , Object , Action`1 , Action`1 , Action`1 )
   at Akka.Actor.ReceiveActor.ExecutePartialMessageHandler(Object message, PartialAction`1 partialAction)
   at Akka.Actor.UntypedActor.Receive(Object message)
   at Akka.Actor.ActorBase.AroundReceive(Receive receive, Object message)
   at Akka.Actor.ActorCell.ReceiveMessage(Object message)
   at Akka.Actor.ActorCell.Invoke(Envelope envelope)
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at Akka.Actor.ActorCell.HandleFailed(Failed f)
   at Akka.Actor.ActorCell.SysMsgInvokeAll(EarliestFirstSystemMessageList messages, Int32 currentState)]
[ERROR][03/22/2018 15:36:58][Thread 0012][akka://testsystem/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2Ftestsystem%40172.18.0.2%3A4053-1/endpointWriter] Error while decoding incoming Akka PDU
Cause: Akka.Remote.EndpointException: Error while decoding incoming Akka PDU ---> Google.Protobuf.InvalidProtocolBufferException: While parsing a protocol message, the input ended unexpectedly in the middle of a field.  This could mean either that the input has been truncated or that an embedded message misreported its own length.
   at Google.Protobuf.CodedInputStream.RefillBuffer(Boolean mustSucceed)
   at Google.Protobuf.CodedInputStream.ReadRawByte()
   at Google.Protobuf.CodedInputStream.ReadRawLittleEndian64()
   at Google.Protobuf.CodedInputStream.SkipLastField()
   at Akka.Remote.Serialization.Proto.Msg.Payload.MergeFrom(CodedInputStream input)
   at Google.Protobuf.CodedInputStream.ReadMessage(IMessage builder)
   at Akka.Remote.Serialization.Proto.Msg.RemoteEnvelope.MergeFrom(CodedInputStream input)
   at Google.Protobuf.CodedInputStream.ReadMessage(IMessage builder)
   at Akka.Remote.Serialization.Proto.Msg.AckAndEnvelopeContainer.MergeFrom(CodedInputStream input)
   at Google.Protobuf.MessageExtensions.MergeFrom(IMessage message, ByteString data)
   at Google.Protobuf.MessageParser`1.ParseFrom(ByteString data)
   at Akka.Remote.Transport.AkkaPduProtobuffCodec.DecodeMessage(ByteString raw, IRemoteActorRefProvider provider, Address localAddress)
   at Akka.Remote.EndpointReader.TryDecodeMessageAndAck(ByteString pdu)
   --- End of inner exception stack trace ---
   at Akka.Remote.EndpointReader.TryDecodeMessageAndAck(ByteString pdu)
   at Akka.Remote.EndpointReader.<Reading>b__11_1(InboundPayload inbound)
   at lambda_method(Closure , Object , Action`1 , Action`1 , Action`1 )
   at Akka.Actor.ReceiveActor.ExecutePartialMessageHandler(Object message, PartialAction`1 partialAction)
   at Akka.Actor.UntypedActor.Receive(Object message)
   at Akka.Actor.ActorBase.AroundReceive(Receive receive, Object message)
   at Akka.Actor.ActorCell.ReceiveMessage(Object message)
   at Akka.Actor.ActorCell.Invoke(Envelope envelope)
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at Akka.Actor.ActorCell.HandleFailed(Failed f)
   at Akka.Actor.ActorCell.SysMsgInvokeAll(EarliestFirstSystemMessageList messages, Int32 currentState)

The server also shows a disassociation error

[WARNING][03/22/2018 16:27:07][Thread 0010][[akka://testsystem/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2Ftestsystem%40172.18.0.3%3A37107-1#1885238336]] Association with remote system akka.tcp://testsystem@172.18.0.3:37107 has failed; address is now gated for 5000 ms. Reason is: [Akka.Remote.EndpointDisassociatedException: Disassociated
   at Akka.Remote.EndpointWriter.PublishAndThrow(Exception reason, LogLevel level, Boolean needToThrow)
   at Akka.Actor.ActorCell.<>c__DisplayClass106_0.<Akka.Actor.IUntypedActorContext.Become>b__0(Object m)
   at Akka.Actor.ActorBase.AroundReceive(Receive receive, Object message)
   at Akka.Actor.ActorCell.ReceiveMessage(Object message)
   at Akka.Actor.ActorCell.AutoReceiveMessage(Envelope envelope)
   at Akka.Actor.ActorCell.Invoke(Envelope envelope)]
[ERROR][03/22/2018 16:27:07][Thread 0010][akka://testsystem/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2Ftestsystem%40172.18.0.3%3A37107-1/endpointWriter] Disassociated
Cause: Akka.Remote.EndpointDisassociatedException: Disassociated
   at Akka.Remote.EndpointWriter.PublishAndThrow(Exception reason, LogLevel level, Boolean needToThrow)
   at Akka.Actor.ActorCell.<>c__DisplayClass106_0.<Akka.Actor.IUntypedActorContext.Become>b__0(Object m)
   at Akka.Actor.ActorBase.AroundReceive(Receive receive, Object message)
   at Akka.Actor.ActorCell.ReceiveMessage(Object message)
   at Akka.Actor.ActorCell.AutoReceiveMessage(Envelope envelope)
   at Akka.Actor.ActorCell.Invoke(Envelope envelope)

I have created a sample application that replicates the issue.

https://github.com/muo-ltd/Akka-NetCore-DockerClusterWithShards

It creates a cluster with two processes, one acts as a client and the other acts as a server. The server has a Sharded Entity which the client sends a message to requesting information. The Sharded entity then streams a large number of messages back to the client to consume. The messages returned are simple, the payload is string with 1000 characters all 1's, and 10000 messages are generated and returned. Using a lower number of messages (e.g. 1000) does not show this error so it does appear to relate to volume. I have tested it using both the default JSON serialiser and Hyperion. The only difference between the two is that the Hyperion error is slightly different. It contains the error

DotNetty.Codecs.TooLongFrameException: Adjusted frame length exceeds 128000: 825307445 - discarded

Full stack trace below

[ERROR][03/22/2018 16:44:48][Thread 0011][[akka://testsystem/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2Ftestsystem%40172.18.0.2%3A4053-1/endpointWriter#1992530996]] AssociationError [akka.tcp://testsystem@172.18.0.3:33235] -> akka.tcp://testsystem@172.18.0.2:4053: Error [Error while decoding incoming Akka PDU] [   at Akka.Remote.EndpointReader.TryDecodeMessageAndAck(ByteString pdu)
   at Akka.Remote.EndpointReader.<Reading>b__11_1(InboundPayload inbound)
   at lambda_method(Closure , Object , Action`1 , Action`1 , Action`1 )
   at Akka.Actor.ReceiveActor.ExecutePartialMessageHandler(Object message, PartialAction`1 partialAction)
   at Akka.Actor.UntypedActor.Receive(Object message)
   at Akka.Actor.ActorBase.AroundReceive(Receive receive, Object message)
   at Akka.Actor.ActorCell.ReceiveMessage(Object message)
   at Akka.Actor.ActorCell.Invoke(Envelope envelope)]
Cause: Unknown
[ERROR][03/22/2018 16:44:48][Thread 0017][Akka.Remote.Transport.DotNetty.TcpClientHandler] Error caught channel [[::ffff:172.18.0.3]:43925->[::ffff:172.18.0.2]:4053](Id=64d2fccc)
Cause: DotNetty.Codecs.TooLongFrameException: Adjusted frame length exceeds 128000: 825307445 - discarded
   at DotNetty.Codecs.LengthFieldBasedFrameDecoder.Fail(Int64 frameLength)
   at DotNetty.Codecs.LengthFieldBasedFrameDecoder.Decode(IChannelHandlerContext context, IByteBuffer input)
   at DotNetty.Codecs.LengthFieldBasedFrameDecoder.Decode(IChannelHandlerContext context, IByteBuffer input, List`1 output)
   at DotNetty.Codecs.ByteToMessageDecoder.CallDecode(IChannelHandlerContext context, IByteBuffer input, List`1 output)
   at DotNetty.Codecs.ByteToMessageDecoder.ChannelRead(IChannelHandlerContext context, Object message)
   at DotNetty.Transport.Channels.AbstractChannelHandlerContext.InvokeChannelRead(Object msg)
[WARNING][03/22/2018 16:44:48][Thread 0011][[akka://testsystem/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2Ftestsystem%40172.18.0.2%3A4053-1#870631177]] Association with remote system akka.tcp://testsystem@172.18.0.2:4053 has failed; address is now gated for 5000 ms. Reason is: [Akka.Remote.EndpointException: Error while decoding incoming Akka PDU ---> Google.Protobuf.InvalidProtocolBufferException: While parsing a protocol message, the input ended unexpectedly in the middle of a field.  This could mean either that the input has been truncated or that an embedded message misreported its own length.
   at Google.Protobuf.CodedInputStream.RefillBuffer(Boolean mustSucceed)
   at Google.Protobuf.CodedInputStream.ReadRawByte()
   at Google.Protobuf.CodedInputStream.ReadRawLittleEndian64()
   at Google.Protobuf.CodedInputStream.SkipLastField()
   at Akka.Remote.Serialization.Proto.Msg.Payload.MergeFrom(CodedInputStream input)
   at Google.Protobuf.CodedInputStream.ReadMessage(IMessage builder)
   at Akka.Remote.Serialization.Proto.Msg.RemoteEnvelope.MergeFrom(CodedInputStream input)
   at Google.Protobuf.CodedInputStream.ReadMessage(IMessage builder)
   at Akka.Remote.Serialization.Proto.Msg.AckAndEnvelopeContainer.MergeFrom(CodedInputStream input)
   at Google.Protobuf.MessageExtensions.MergeFrom(IMessage message, ByteString data)
   at Google.Protobuf.MessageParser`1.ParseFrom(ByteString data)
   at Akka.Remote.Transport.AkkaPduProtobuffCodec.DecodeMessage(ByteString raw, IRemoteActorRefProvider provider, Address localAddress)
   at Akka.Remote.EndpointReader.TryDecodeMessageAndAck(ByteString pdu)
   --- End of inner exception stack trace ---
   at Akka.Remote.EndpointReader.TryDecodeMessageAndAck(ByteString pdu)
   at Akka.Remote.EndpointReader.<Reading>b__11_1(InboundPayload inbound)
   at lambda_method(Closure , Object , Action`1 , Action`1 , Action`1 )
   at Akka.Actor.ReceiveActor.ExecutePartialMessageHandler(Object message, PartialAction`1 partialAction)
   at Akka.Actor.UntypedActor.Receive(Object message)
   at Akka.Actor.ActorBase.AroundReceive(Receive receive, Object message)
   at Akka.Actor.ActorCell.ReceiveMessage(Object message)
   at Akka.Actor.ActorCell.Invoke(Envelope envelope)
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at Akka.Actor.ActorCell.HandleFailed(Failed f)
   at Akka.Actor.ActorCell.SysMsgInvokeAll(EarliestFirstSystemMessageList messages, Int32 currentState)]
[ERROR][03/22/2018 16:44:48][Thread 0011][akka://testsystem/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2Ftestsystem%40172.18.0.2%3A4053-1/endpointWriter] Error while decoding incoming Akka PDU
Cause: Akka.Remote.EndpointException: Error while decoding incoming Akka PDU ---> Google.Protobuf.InvalidProtocolBufferException: While parsing a protocol message, the input ended unexpectedly in the middle of a field.  This could mean either that the input has been truncated or that an embedded message misreported its own length.
   at Google.Protobuf.CodedInputStream.RefillBuffer(Boolean mustSucceed)
   at Google.Protobuf.CodedInputStream.ReadRawByte()
   at Google.Protobuf.CodedInputStream.ReadRawLittleEndian64()
   at Google.Protobuf.CodedInputStream.SkipLastField()
   at Akka.Remote.Serialization.Proto.Msg.Payload.MergeFrom(CodedInputStream input)
   at Google.Protobuf.CodedInputStream.ReadMessage(IMessage builder)
   at Akka.Remote.Serialization.Proto.Msg.RemoteEnvelope.MergeFrom(CodedInputStream input)
   at Google.Protobuf.CodedInputStream.ReadMessage(IMessage builder)
   at Akka.Remote.Serialization.Proto.Msg.AckAndEnvelopeContainer.MergeFrom(CodedInputStream input)
   at Google.Protobuf.MessageExtensions.MergeFrom(IMessage message, ByteString data)
   at Google.Protobuf.MessageParser`1.ParseFrom(ByteString data)
   at Akka.Remote.Transport.AkkaPduProtobuffCodec.DecodeMessage(ByteString raw, IRemoteActorRefProvider provider, Address localAddress)
   at Akka.Remote.EndpointReader.TryDecodeMessageAndAck(ByteString pdu)
   --- End of inner exception stack trace ---
   at Akka.Remote.EndpointReader.TryDecodeMessageAndAck(ByteString pdu)
   at Akka.Remote.EndpointReader.<Reading>b__11_1(InboundPayload inbound)
   at lambda_method(Closure , Object , Action`1 , Action`1 , Action`1 )
   at Akka.Actor.ReceiveActor.ExecutePartialMessageHandler(Object message, PartialAction`1 partialAction)
   at Akka.Actor.UntypedActor.Receive(Object message)
   at Akka.Actor.ActorBase.AroundReceive(Receive receive, Object message)
   at Akka.Actor.ActorCell.ReceiveMessage(Object message)
   at Akka.Actor.ActorCell.Invoke(Envelope envelope)
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at Akka.Actor.ActorCell.HandleFailed(Failed f)
   at Akka.Actor.ActorCell.SysMsgInvokeAll(EarliestFirstSystemMessageList messages, Int32 currentState)

While running this locally on Mac, Windows or Linux it appears to work correctly. If it is deployed to Docker it will fail and if deployed across multiple Linux hosts it will also fail. Deploying to multiple Windows hosts appears to work correctly.

@IanPattersonMuo
Copy link
Author

IanPattersonMuo commented Mar 23, 2018

I have simplified this to calling a remote actor and the problem persists, which means that it is probably not an issue with either Clustering or Sharding, and is a more general problem of networking on Linux.

@Aaronontheweb
Copy link
Member

I've been able to reproduce this using the sample above, so the original sample fails on my machine too.

However, when I ported WebCrawler and ran that in a pure Linux environment this week - had no issues at all. Makes me wonder if the problem might actually be related to the combination of the .NET Core version + Linux.

@Aaronontheweb
Copy link
Member

@IanPattersonMuo so if I look at the Dockerfiles I'm using for WebCrawler, I'm using an older version of the .NET Core 2.0 runtime: https://github.com/petabridge/akkadotnet-code-samples/blob/master/Cluster.WebCrawler/src/WebCrawler.Web/Dockerfile

2.0 vs 2.0.5.

But I don't think that's the issue. I'm wondering though if the issue might stem from the way the Docker images are built in your solution. Judging from your build.sh, looks like the images are dotnet publish-ed locally and then subsequently copied into an image. The way mine work is through the use of the Docker build pipeline - my project is actually restored and compiled inside the same runtime that the Docker image itself will be executing at runtime. I wonder if there's an environmental difference between your build-time enviroment and your run-time environment that could cause this - such as one of the .NET Standard libraries having a slightly different socket IO implementation depending on the host OS.

What do you think - is that worth trying to rule out as a possibility?

@IanPattersonMuo
Copy link
Author

I have tested it natively on multiple Linux hosts where it is build and run on each host. This failed as well so I don't think it is to do with the way it is built in docker. I'll look to give it a go with earlier version of the runtime and see if that makes a difference.

@Aaronontheweb
Copy link
Member

@IanPattersonMuo I take that back - the versioning is a bit misleading around these images.

microsoft/aspnetcore:2.0 - this actually is an image that is updated for each minor revision. Meaning that the version itself is mutable. It's not 2.0.0, but rather 2.0.8 or whatever the latest of the 2.0.* branch is. So I'm actually using a newer version of the image.

@IanPattersonMuo
Copy link
Author

IanPattersonMuo commented Mar 26, 2018

I have checked it with the 2.1.0 preview images as well with the same error. I created a branch in the sample code which just uses akka.remoting rather than clustering and sharding and it fails with the same error. I also created a branch that uses Windows containers and it works correctly and it works as expected.

@IanPattersonMuo
Copy link
Author

@Aaronontheweb I checked the issue building the code within the container itself and it fails with the same error.

@IanPattersonMuo IanPattersonMuo changed the title Error while decoding incoming Akka PDU Exception when communicating between Shard Entity and Actor with a large number of messages Error while decoding incoming Akka PDU Exception when communicating between Remote Actors with a large number of messages on Linux Mar 28, 2018
@Aaronontheweb
Copy link
Member

@IanPattersonMuo I'm going to bring this up with the DotNetty folks - sure looks like a message framing inconsistency between platforms. Weird part is that I can't recreate this error using WebCrawler on Akka.Remote / Akka.Cluster in .NET Core on Linux, but I can with your application...

Aaronontheweb added a commit to Aaronontheweb/akka.net that referenced this issue Apr 17, 2018
…e Linux by disabling buffer pooling via HOCON
marcpiechura pushed a commit that referenced this issue Apr 17, 2018
@Aaronontheweb Aaronontheweb added this to the 1.3.6 milestone Apr 17, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
2 participants