Akka.Remote is really slow #2378

Closed
florisch opened this issue Nov 15, 2016 · 18 comments
@florisch

florisch commented Nov 15, 2016

Hi,

The company I work for is building two new projects involving Akka.NET. We designed our applications around Akka's clustering capabilities, and now we are seeing big performance issues with the Akka cluster/remote layer.

In our applications we have many nodes (more than 20), each sending more than 30K msg/sec to a central server. We have a simulation mode where we generate all the messages from the same server/process, in which case everything works fine. As soon as we generate the messages using Akka.Remote, our performance issues appear (and after some time we even hit a "BufferIsEmpty" exception inside Sockets/Helios on the Linux ARM Mono nodes, which just kills the process).

In order to understand the performance limitations of Akka, we ran a simple ping-pong benchmark on a modern i7 computer in two scenarios:

  • With a single application running both receivers and senders, we get a solid 25 million messages per second.

  • With receiver and sender actors running in two processes on the same machine, communicating through Akka.Remote (127.0.0 with the Wire serializer), we get 54K messages per second at 100% CPU usage.

Of course, under these conditions our application will never be able to run with the current Akka.NET remote performance. We understand that remoting involves much more processing than local message passing because of serialization and networking, but after a quick profiling session it appears that most of the time is spent doing internal Akka/remote work rather than in Helios/Wire/protobuf.

Is this level of performance considered normal, or is there any hope of doing better by tweaking options, or any scheduled remoting improvements?

@rogeralsing
Contributor

rogeralsing commented Nov 15, 2016

The new Stream transport planned for 1.5 bumps this up to ~100k messages per second.

That being said, there are some design issues we have inherited from the Scala implementation.
For example, every message has a sender; the sender is an ActorRef, which has to be serialized, deserialized, and then resolved on the remote system.
This is one of the major bottlenecks.

So even if the receiving side doesn't use the Sender, it will still be there, and there is nothing we can do about it except use some form of cache for the sender lookup.

In JVM Akka, they have recently introduced a new transport, "Artery", built on the new Aeron protocol.
In this transport they can now store ActorRefs by ID, thus passing the ActorRef between two nodes as just an integer.
Using protobuf and the new transport, they are able to do ~300k messages per second.

There are no plans right now to port this transport to .NET, but it would be possible, as Aeron now also exists for .NET.

So I don't really have a good answer for you; this is how it is right now.
Project Orleans also has a throughput of ~100k msg/sec, but AFAIK without ordering.

You could probably get some improvement by having some form of batching actors that collect messages and pass them on as a single big message; that would get around some of the per-message resolution that currently needs to be done.
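
For illustration, a minimal sketch of what such a batching actor could look like in Akka.NET. The Batch envelope, flush interval, and batch size are illustrative names and numbers, not anything built into Akka.Remote:

```csharp
using System;
using System.Collections.Generic;
using Akka.Actor;

// Hypothetical envelope for a flushed batch; the receiving side would unpack it.
public sealed class Batch
{
    public Batch(IReadOnlyList<object> items) { Items = items; }
    public IReadOnlyList<object> Items { get; }
}

// Collects messages and forwards them to a remote target as one big message,
// either when the buffer is full or on a periodic flush tick.
public class BatchingForwarder : ReceiveActor
{
    private sealed class Flush
    {
        public static readonly Flush Instance = new Flush();
        private Flush() { }
    }

    private readonly IActorRef _target;
    private readonly int _maxBatchSize;
    private readonly List<object> _buffer = new List<object>();

    public BatchingForwarder(IActorRef target, int maxBatchSize = 100)
    {
        _target = target;
        _maxBatchSize = maxBatchSize;

        Receive<Flush>(_ => SendBatch());
        ReceiveAny(msg =>
        {
            _buffer.Add(msg);
            if (_buffer.Count >= _maxBatchSize) SendBatch();
        });

        // Flush any partially filled batch every 10 ms so latency stays bounded.
        Context.System.Scheduler.ScheduleTellRepeatedly(
            TimeSpan.FromMilliseconds(10), TimeSpan.FromMilliseconds(10),
            Self, Flush.Instance, ActorRefs.NoSender);
    }

    private void SendBatch()
    {
        if (_buffer.Count == 0) return;
        _target.Tell(new Batch(_buffer.ToArray()));
        _buffer.Clear();
    }
}
```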

@florisch
Author

Thanks for your answer. From my perspective this is a major deal-breaker for large-scale distributed applications, which at first really looked like the big selling point of Akka with clustering.

I already implemented a solution batching messages with a retry and flow-control system. Using this I get better throughput, but it's still far from fast enough for my application. From what I read about Akka.NET, it also goes against the recommendation to avoid sending large messages over the network.

As for a workaround, after some quick tests on the same i7 computer with NetMQ push/pull over localhost TCP, I'm able to send about 2.5 million 100-byte msg/sec with only two cores used (50% CPU).

Currently, the best workaround I've found is to keep using Akka.Cluster for low-rate management messages and to use the much faster NetMQ plus an in-house serializer for all our other messages.
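
For reference, a minimal NetMQ push/pull sketch along the lines of the test described above; the endpoint, message count, and 100-byte payload are illustrative only:

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;
using NetMQ;
using NetMQ.Sockets;

class PushPullDemo
{
    const int Count = 1_000_000;

    static void Main()
    {
        // Receiver binds and drains messages on a background task.
        var receiver = Task.Run(() =>
        {
            using (var pull = new PullSocket("@tcp://127.0.0.1:5556")) // '@' = bind
            {
                for (int i = 0; i < Count; i++)
                    pull.ReceiveFrameBytes();
            }
        });

        // Sender connects and pushes fixed-size 100-byte payloads.
        using (var push = new PushSocket(">tcp://127.0.0.1:5556"))     // '>' = connect
        {
            var payload = new byte[100];
            var sw = Stopwatch.StartNew();
            for (int i = 0; i < Count; i++)
                push.SendFrame(payload);
            receiver.Wait();
            Console.WriteLine($"{Count / sw.Elapsed.TotalSeconds:N0} msg/s");
        }
    }
}
```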

@slorion

slorion commented Feb 3, 2017

I am curious to know if anything has happened regarding this issue. We are redesigning a major app serving about 5 million customers and while I would love to architect the new version around Akka.NET, reading the above is bothering me greatly. Thank you in advance for your feedback!

@Aaronontheweb
Member

@slorion we have two different initiatives which should help here.

The first is the switch from JSON.NET to Hyperion as our default serializer; Hyperion is roughly 20-30x faster depending on the types of data involved.
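
For anyone who wants to try it ahead of the switchover, a minimal sketch of opting into Hyperion via HOCON; the serializer type/assembly names should be double-checked against the current Akka.Serialization.Hyperion package:

```csharp
using Akka.Actor;
using Akka.Configuration;

// Bind Hyperion as the default serializer for all messages (System.Object).
var config = ConfigurationFactory.ParseString(@"
    akka.actor {
        serializers {
            hyperion = ""Akka.Serialization.HyperionSerializer, Akka.Serialization.Hyperion""
        }
        serialization-bindings {
            ""System.Object"" = hyperion
        }
    }");

var system = ActorSystem.Create("MySystem", config);
```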

The second is our switch from Helios to DotNetty, which is just about ready to be merged here: #2444 - by the time this is fully performance-tuned, it should result in an improvement in overall network speeds as well.

In addition to those two changes we have some other optimizations coming that should also help improve the performance. So there's still work being done on this issue, but it's happening on a few different fronts.

@slorion

slorion commented Feb 5, 2017

@Aaronontheweb Thank you for the quick follow-up. Are the improvements you are describing what @rogeralsing meant by the "new Stream transport planned for 1.5"? Is 100k msg/s still a valid figure?

Also, we are going to be using .NET Core; will it be fully supported? I took a look at the Hyperion issues and saw that this is not yet the case (akkadotnet/Hyperion#3). Is this something that will be part of the next release?

@Aaronontheweb
Member

@slorion yep, that's correct. The 100k msg/s range is still a valid target. We're going to ship the DotNetty transport first, most likely, followed by the serializer switchover and other improvements in subsequent releases.

We're working on the .NET Core support. Most of the legwork for that is done already across all of the different parts; we're working on getting the build server to integrate it all correctly, since the tooling has been a moving target.

@Neftedollar

@Aaronontheweb Version 1.2 has been released and, as I understand it, it brings DotNetty on board. What about the performance now?

alexvaluyskiy added this to the 1.3.0 milestone Apr 20, 2017
alexvaluyskiy modified the milestones: 1.4.0, 1.3.0 Jun 27, 2017
@aliakhtar

aliakhtar commented Dec 8, 2017

So I'm evaluating Akka for a new project, and wondering where this stands now. What's the current expected throughput? cc @Aaronontheweb

@beachwalker
Contributor

Still interested too.

@Aaronontheweb
Member

Aaronontheweb commented Jun 4, 2018

No idea why I haven't commented on this sooner.

Running the latest dev branch sources on .NET Core 1.1 on my local development machine (quad-core Intel i7; the machine is about 5.5 years old):

ProcessorCount: 8
ClockSpeed: 0 MHZ
Actor Count: 16
Messages sent/received per client: 20000 (2e4)
Is Server GC: True

Num clients, Total [msg], Msgs/sec, Total [ms]
1, 20000, 37524, 533.86
5, 100000, 49020, 2040.46
10, 200000, 53107, 3766.56
15, 300000, 52174, 5750.86
20, 400000, 53023, 7544.77
25, 500000, 52933, 9446.58
30, 600000, 52897, 11343.44

The third column (Msgs/sec) is the number of messages per second. It averages about 52-53k msg/s under this configuration across a single Akka.Remote connection.

@Aaronontheweb
Member

Here are the numbers for the same hardware, running .NET 4.6.1:

ProcessorCount: 8
ClockSpeed: 0 MHZ
Actor Count: 16
Messages sent/received per client: 20000 (2e4)
Is Server GC: True

Num clients, Total [msg], Msgs/sec, Total [ms]
1, 20000, 46083, 434.37
5, 100000, 59917, 1669.46
10, 200000, 63695, 3140.42
15, 300000, 64048, 4684.32
20, 400000, 59032, 6776.21
25, 500000, 58652, 8525.28
30, 600000, 57051, 10517.46

@Aaronontheweb
Member

Just for kicks, here's the .NET Core 2.1 numbers - same hardware as the rest:

ProcessorCount: 8
ClockSpeed: 0 MHZ
Actor Count: 16
Messages sent/received per client: 20000 (2e4)
Is Server GC: True

Num clients, Total [msg], Msgs/sec, Total [ms]
1, 20000, 44248, 452.49
5, 100000, 61163, 1635.64
10, 200000, 62854, 3182.58
15, 300000, 61716, 4861.34
20, 400000, 59711, 6699.13
25, 500000, 58398, 8562.91
30, 600000, 61013, 9834.86

@voltcode
Contributor

voltcode commented Jun 4, 2018

@Aaronontheweb

I assume this was achieved using the Wire serializer and DotNetty? If so, it seems that despite the planned improvements, 100k is still far away. Do you have any config tips, or plans for how to achieve better throughput? What was the CPU utilization in your tests?

@Aaronontheweb
Member

Aaronontheweb commented Jun 4, 2018

@voltcode this is just using the defaults (no Hyperion), but we are using DotNetty. I also think we're using the plain-old string serializer in these benchmarks, so we don't have any JSON.NET overhead. The system spikes at about 90% CPU utilization on this machine when I run these benchmarks. On my custom-built AMD Ryzen machine at home I get about 100-110k msg/s.

But you're unlikely to achieve that Ryzen kind of performance on a cloud machine - hardware on public clouds is at least 2-3 years old on average, plus you have the hypervisor penalty, noisy-neighbor problems, and other unpredictable issues that stem from running in a shared environment.

Serialization can certainly contribute to performance issues, but with the way this benchmark is designed we've basically eliminated "choice of serializer" as a contributor to the figures you can see above. What you're seeing here is a measure of how the remoting system + the default transport performs. We, like you, are of the opinion that it is in need of improvement. It's definitely the case that some of the non-configurable parts of our serialization system are contributors to the performance here (i.e. the way we use serialization inside the built-in remoting actors.)

So that being said, I'll take a moment to expand on the avenues we're considering for improving Akka.Remote's performance - since it's the area that translates to the biggest "bang for the buck" for our users.

Akka.Remote Flow and Staging

Today, Akka.Remote tries to achieve a semblance of batching and being "reactive" to the amount of available bandwidth (processing power, buffer space, etc) in the system using an adaptive backoff system that we ported from the JVM. In part, we also rely on the transport itself to signal to us when it is ready for more data vs. when it is not, and the DotNetty transport can do this.

In my view though, there's a fundamental problem with this adaptive backoff system: it relies on the remoting actors "guessing" when the transport is going to become available again based on the adaptive backoff formula, rather than being explicitly signalled by the underlying transport that it is ready for more data again. This means there are windows of capacity we may not be utilizing at all.

On top of that, I'm not sure DotNetty's system for determining when the channel is writable vs. non-writable (outbound buffers are full) is 100% deterministic. There might be a delay between when the channel is writable and when it's not.

Lots of users have noticed that the performance of Akka.Remote starts off really strongly until the first time they see the adaptive backoff system kick in (there's a big INFO message that shows up when it occurs.) That points to the items I've called out above - that our adaptive backoff system is inefficient.

So, there's a really great technology that can help us solve this problem altogether: Akka.Streams. Rather than force the remoting actors to guess when the underlying transport might be available for more writes again, Akka.Streams can simply be pulled by the lowest part of the remoting stack when it has capacity again. This is the whole premise of Reactive Streams.
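
To make the pull-based premise concrete, here is a tiny, self-contained Akka.Streams sketch (not Akka.Remote internals): the deliberately slow sink only signals demand upstream when it finishes an element, so the fast source can never overrun it.

```csharp
using System;
using System.Linq;
using System.Threading;
using Akka.Actor;
using Akka.Streams;
using Akka.Streams.Dsl;

class BackpressureDemo
{
    static void Main()
    {
        var system = ActorSystem.Create("demo");
        var materializer = ActorMaterializer.Create(system);

        // A fast producer feeding a deliberately slow consumer: the consumer's
        // demand (not a backoff timer) is what throttles the producer.
        var done = Source.From(Enumerable.Range(1, 1000))
            .Select(i => $"frame-{i}")        // stand-in for serialization work
            .RunForeach(frame =>
            {
                Thread.Sleep(1);              // stand-in for a slow socket write
                Console.WriteLine(frame);
            }, materializer);

        done.Wait();
        system.Terminate().Wait();
    }
}
```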

This is why the JVM Akka project has taken steps to rewrite some of the remoting actors as Akka.Streams components: they, like us, suffered from similar issues with the original Akka.Remote design. I expect we will follow suit there and reap some benefits from it.

There are other inefficient things in our flow too - like not serializing messages while we're backing off in order to get them ready for writing, and not trying to push multiple messages onto the socket at once when they can fit within a single frame. Theoretically, the underlying transport should do the latter for us - but we have no way of really verifying that from the outside.

I personally think this is the biggest contributor to performance issues; there's bits and pieces I'm going to address further down, but I think that our workflow around the remoting actors and the remote transports is the area where we can realize the biggest improvements to Akka.Remote performance.

Transports and Event Models

I have a couple of working theories on performance issues related to Akka.Remote and the transport system - the latest one being that DotNetty's eventing model and the STEE (single-threaded event executor) it uses act as a bigger bottleneck than we originally thought. DotNetty has played around with its event model, its buffer system, and a lot of other pieces of infrastructure over the past year or so - to the point where I have a lot of trouble following it these days. This isn't a criticism of their work - just me saying that I don't know what's really going on under the hood there as well as I used to.

DotNetty's topline performance is higher than Akka.Remote's - 250k msg/s last I checked - but those metrics are always captured in isolation from Akka.NET and other parts of a .NET process that could compete for CPU time. When I run detailed performance profiling of this same benchmark today, the DotNetty pollTask operation that occurs inside the STEE is the single biggest area of CPU consumption. There are some other implementation details regarding the way the STEE and EventLoopGroups in DotNetty work that I'm going to skip for the sake of brevity, but let's just say I suspect that Akka.NET and DotNetty are competing for resources in a way that has performance consequences.

So, I've been working on a proof of concept here to put this theory to the test: https://github.com/Aaronontheweb/Akka.Streams.RemoteTransport

This is a pure Akka.Streams transport that I'm still chipping away at, but it'll run on top of the Akka.Remote dispatcher and use the same threads that the remoting actors already do. In essence, we're going to let Akka.NET's dispatcher model do 100% of the work instead of half of it. On top of that, I think this might be a low-impact way to dogfood some of the flow changes described above.

Serialization and Buffer Use

This has been the most intensely debated area for performance improvements within Akka.Remote, probably because it touches on other popular topics in the .NET ecosystem, such as Span<T> and Memory<T>, and because it's the most "visible" part of the remoting system due to its configurability.

Long story short: we know that our current design lends itself to lots of unnecessary buffer copying and garbage collection pressure. A big part of this is unavoidable: each serialization library wants to manage its own buffer allocation and re-use. On top of that, most serializers operate as though they are "stateless" and do a bunch of duplicative work on each successive serialization operation.

Part of the Hyperion design that @Horusiath has been advocating is to introduce the concept of serialization "sessions", which can reduce the amount of work done on each successive per-message serialization - i.e. turn serialization into a "stateful" activity so we can re-use bits of information we've seen on previous serialization attempts. It's not clear to us yet what kind of real-world benefits that will produce, but it's an avenue worth exploring.
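
As a rough illustration of the idea only - this is a hypothetical shape, not an existing Akka.NET or Hyperion API:

```csharp
using System;

// Hypothetical: state that lives for the duration of a connection, so repeated
// type/manifest lookup work isn't redone for every message on that connection.
public interface ISerializationSession : IDisposable
{
    byte[] Serialize(object message);   // may reuse cached type descriptors from earlier calls
    object Deserialize(byte[] bytes);   // may resolve types against the session's cache
}
```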

For my part, since I'm coming at the remoting system from a different angle, I think redesigning the flow of Akka.Remote so that the serialization components can all share a common buffer pool and work in concert, rather than independently, will result in some large improvements. However, we're a bit dependent on third-party serializers like Google.Protobuf adding Span<T> or Memory<T> support in order to pull this off.
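
As a sketch of the shared-buffer idea using the BCL's ArrayPool<byte> (the SendPooled helper and the write callback are hypothetical; real serializers differ in whether they can write into a caller-supplied buffer):

```csharp
using System;
using System.Buffers;

public static class PooledSend
{
    // Rent a reusable buffer for the serialized payload instead of allocating
    // a fresh byte[] per message, hand it to the transport, then return it.
    public static void SendPooled(ReadOnlySpan<byte> payload, Action<byte[], int> write)
    {
        byte[] buffer = ArrayPool<byte>.Shared.Rent(payload.Length);
        try
        {
            payload.CopyTo(buffer);
            write(buffer, payload.Length);   // e.g. enqueue onto the socket/transport
        }
        finally
        {
            ArrayPool<byte>.Shared.Return(buffer);
        }
    }
}
```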

Anyway, that's my long-form braindump of what we're looking at for improving performance. Even more radical ideas, like changing the default transport from TCP to something else, aren't really being considered at this time but could be in the future.

In case it's not clear from what I've written here, we've given this a lot of thought and we're working on trying to fix it. What we're looking at are some bigger structural changes to Akka.Remote itself, and I think that's what will be required to fully address it in the future.

I can't make any promises around timeframes, but given that this is an issue that affects the bottom lines of our users - it's a high priority. We've made other performance fixes in the recent past (caching around Akka.Remote in 1.3.0 being the most recent I can think of) but the large leaps in performance are going to need to come from architectural improvements rather than performance tweaks.

@seungyongshim

seungyongshim commented Sep 10, 2019

I heard that "Artery" is blazingly faster than Akka.Remote.

|                               | Remoting "classic" | Artery Remoting | Artery TCP Remoting |
|-------------------------------|--------------------|-----------------|---------------------|
| Protocol                      | TCP / TLS+TCP      | UDP / Aeron     | TCP / TLS+TCP       |
| Large messages                | Troublesome        | Dedicated lanes | Dedicated connection|
| Heartbeat and System Messages | Prioritised        | Dedicated lanes | Dedicated connection|
| Benchmarked throughput        | 70k msg/s          | 700k+ msg/s     | 600k+ msg/s         |

It is interesting for Akka.NET, isn't it?

@Aaronontheweb
Member

Aaronontheweb commented Sep 10, 2019 via email

@vncoelho

@Aaronontheweb and colleagues,

First of all, thanks for all your efforts and studies. I am not sure if this thread is the best one to ask for that, but here it goes.

Currently, on the NEO blockchain, we are using Akka actors to initialize TCP connections and to manage payloads with a priority mailbox: internal abstract class PriorityMailbox : MailboxType, IProducesMessageQueue<PriorityMessageQueue>.
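
For context, a custom priority mailbox in Akka.NET typically follows the shape of the built-in UnboundedPriorityMailbox; a minimal sketch with illustrative message types (not NEO's actual implementation):

```csharp
using Akka.Actor;
using Akka.Configuration;
using Akka.Dispatch;

// Illustrative high-priority marker; real code would use its own message types.
public sealed class HighPriorityMessage { }

public class DemoPriorityMailbox : UnboundedPriorityMailbox
{
    public DemoPriorityMailbox(Settings settings, Config config) : base(settings, config) { }

    // Lower numbers are dequeued first.
    protected override int PriorityGenerator(object message)
        => message is HighPriorityMessage ? 0 : 1;
}
```

The mailbox is then bound to an actor through a mailbox-type entry in HOCON and Props.WithMailbox(...).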

Recently, some of our colleagues mentioned that Akka P2P was slow, combined with the On.Tell event being single-threaded.

What are your recommendations?
Some members of our team recently started a discussion about initializing several Akka TCP channels instead of using the mailbox for priorities.

Best regards,

@Aaronontheweb
Member

Closing this out - we made major improvements in 1.4.0, but ultimately #4007, which we are starting work on after the 1.4.0 release, is what will get us the increase in wire throughput that we would really like to see.
