Akka.Remote is really slow #2378

Closed
florisch opened this issue Nov 15, 2016 · 18 comments
@florisch

florisch commented Nov 15, 2016

Hi,

The company I work for is building two new projects involving Akka.NET. We designed our applications around Akka's clustering capabilities, and now we are seeing big performance issues with the Akka cluster/remote layer.

In our applications we have many nodes (more than 20), each sending more than 30K msg/sec to a central server. We have a simulation mode where we generate all the messages from the same server/process, in which case everything works fine. As soon as we generate the messages using Akka.Remote, our performance issues appear (and after some time we even hit a "BufferIsEmpty" exception inside Sockets/Helios on the Linux ARM Mono nodes, which just kills the process).

In order to understand the performance limitations of Akka, we ran a simple ping-pong benchmark on a modern i7 computer in two scenarios:

  • With a single application running both receivers and senders, we get a solid 25 million messages per second.

  • With receiver and sender actors running in two processes on the same machine, communicating through Akka.Remote (127.0.0 with the Wire serializer), we get 54K messages per second at 100% CPU usage.

Of course, under these conditions our application will never be able to run with the current Akka.NET remote performance. We understand that remoting involves much more processing than local message passing because of serialization and networking, but after a quick profiling session it appears that most of the time is spent doing internal Akka/remote work rather than in Helios/Wire/protobuf.

Is this level of performance considered normal, or is there any hope of doing better by tweaking options, or any scheduled remoting improvements?

@rogeralsing
Contributor

rogeralsing commented Nov 15, 2016

The new Stream transport planned for 1.5 bumps this up to ~100k messages per second.

That being said, there are some design issues we have inherited from the Scala implementation.
For example, every message has a sender; the sender is an ActorRef, which has to be serialized, deserialized, and then resolved on the remote system.
This is one of the major bottlenecks.

So even if the receiving side doesn't use the Sender, it will still be there, and there is nothing we can do about it except use some form of cache for the sender lookup.

In JVM Akka, they have recently introduced a new transport, "Artery", built on the new Aeron protocol.
In this transport they can now store ActorRefs by ID, thus passing the ActorRef between two nodes as just an integer.
Using protobuf and the new transport, they are able to do ~300k messages per second.

There are no plans right now to port this transport to .NET, but it would be possible, as Aeron now also exists for .NET.

So I don't really have a good answer for you; this is how it is right now.
Project Orleans also has a throughput of ~100k msg/sec, but AFAIK without ordering.

You could probably get some improvement by having some form of batching actors that collect messages and pass them on as a single big message; that would get around some of the per-message resolution that currently needs to be done.
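
For illustration, a minimal sketch of what such a batching actor could look like in Akka.NET. The Batch envelope, flush interval, and batch size are illustrative names and numbers, not anything built into Akka.Remote:

```csharp
using System;
using System.Collections.Generic;
using Akka.Actor;

// Hypothetical envelope for a flushed batch; the receiving side would unpack it.
public sealed class Batch
{
    public Batch(IReadOnlyList<object> items) { Items = items; }
    public IReadOnlyList<object> Items { get; }
}

// Collects messages and forwards them to a remote target as one big message,
// either when the buffer is full or on a periodic flush tick.
public class BatchingForwarder : ReceiveActor
{
    private sealed class Flush
    {
        public static readonly Flush Instance = new Flush();
        private Flush() { }
    }

    private readonly IActorRef _target;
    private readonly int _maxBatchSize;
    private readonly List<object> _buffer = new List<object>();

    public BatchingForwarder(IActorRef target, int maxBatchSize = 100)
    {
        _target = target;
        _maxBatchSize = maxBatchSize;

        Receive<Flush>(_ => SendBatch());
        ReceiveAny(msg =>
        {
            _buffer.Add(msg);
            if (_buffer.Count >= _maxBatchSize) SendBatch();
        });

        // Flush any partially filled batch every 10 ms so latency stays bounded.
        Context.System.Scheduler.ScheduleTellRepeatedly(
            TimeSpan.FromMilliseconds(10), TimeSpan.FromMilliseconds(10),
            Self, Flush.Instance, ActorRefs.NoSender);
    }

    private void SendBatch()
    {
        if (_buffer.Count == 0) return;
        _target.Tell(new Batch(_buffer.ToArray()));
        _buffer.Clear();
    }
}
```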

@florisch
Author

Thanks for your answer. From my perspective this is a major deal-breaker for large-scale distributed applications, which at first really looked like the big selling point of Akka with clustering.

I already implemented a solution batching messages with a retry and flow-control system. Using this I get better throughput, but it's still far from fast enough for my application. From what I read about Akka.NET, it also goes against the recommendation to avoid sending large messages over the network.

As for a workaround, after some quick tests on the same i7 computer with NetMQ push/pull over localhost TCP, I'm able to send about 2.5 million 100-byte msg/sec with only two cores used (50% CPU).

Currently, the best workaround I've found is to keep using Akka.Cluster for low-rate management messages and to use the much faster NetMQ plus an in-house serializer for all our other messages.
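
For reference, a minimal NetMQ push/pull sketch along the lines of the test described above; the endpoint, message count, and 100-byte payload are illustrative only:

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;
using NetMQ;
using NetMQ.Sockets;

class PushPullDemo
{
    const int Count = 1_000_000;

    static void Main()
    {
        // Receiver binds and drains messages on a background task.
        var receiver = Task.Run(() =>
        {
            using (var pull = new PullSocket("@tcp://127.0.0.1:5556")) // '@' = bind
            {
                for (int i = 0; i < Count; i++)
                    pull.ReceiveFrameBytes();
            }
        });

        // Sender connects and pushes fixed-size 100-byte payloads.
        using (var push = new PushSocket(">tcp://127.0.0.1:5556"))     // '>' = connect
        {
            var payload = new byte[100];
            var sw = Stopwatch.StartNew();
            for (int i = 0; i < Count; i++)
                push.SendFrame(payload);
            receiver.Wait();
            Console.WriteLine($"{Count / sw.Elapsed.TotalSeconds:N0} msg/s");
        }
    }
}
```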

@slorion

slorion commented Feb 3, 2017

I am curious to know if anything has happened regarding this issue. We are redesigning a major app serving about 5 million customers and while I would love to architect the new version around Akka.NET, reading the above is bothering me greatly. Thank you in advance for your feedback!

@Aaronontheweb
Member

@slorion we have two different initiatives which should help here.

The first is the switch from JSON.NET to Hyperion as our default serializer; Hyperion is roughly 20-30x faster depending on the types of data involved.
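
For anyone who wants to try it ahead of the switchover, a minimal sketch of opting into Hyperion via HOCON; the serializer type/assembly names should be double-checked against the current Akka.Serialization.Hyperion package:

```csharp
using Akka.Actor;
using Akka.Configuration;

// Bind Hyperion as the default serializer for all messages (System.Object).
var config = ConfigurationFactory.ParseString(@"
    akka.actor {
        serializers {
            hyperion = ""Akka.Serialization.HyperionSerializer, Akka.Serialization.Hyperion""
        }
        serialization-bindings {
            ""System.Object"" = hyperion
        }
    }");

var system = ActorSystem.Create("MySystem", config);
```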

The second is our switch from Helios to DotNetty, which is just about ready to be merged here: #2444 - by the time this is fully performance-tuned, it should result in an improvement in overall network speeds as well.

In addition to those two changes we have some other optimizations coming that should also help improve the performance. So there's still work being done on this issue, but it's happening on a few different fronts.

@slorion

slorion commented Feb 5, 2017

@Aaronontheweb Thank you for the quick follow-up. Are the improvements you are describing what @rogeralsing meant by the "new Stream transport planned for 1.5"? Is 100k msg/s still a valid figure?

Also, we are going to be using .NET Core; will it be fully supported? I took a look at the Hyperion issues and saw that this is not yet the case (akkadotnet/Hyperion#3). Is this something that will be part of the next release?

@Aaronontheweb
Member

@slorion yep, that's correct. The 100k msg/s range is still a valid target. We're going to ship the DotNetty transport first, most likely, followed by the serializer switchover and other improvements in subsequent releases.

We're working on the .NET Core support. Most of the legwork for that is done already across all of the different parts; we're working on getting the build server to integrate it all correctly, since the tooling has been a moving target.

@Neftedollar

@Aaronontheweb Version 1.2 has been released and, as I understand it, it brings DotNetty on board. What about the performance now?

alexvaluyskiy added this to the 1.3.0 milestone Apr 20, 2017
alexvaluyskiy modified the milestones: 1.4.0, 1.3.0 Jun 27, 2017
@aliakhtar

aliakhtar commented Dec 8, 2017

So I'm evaluating Akka for a new project, and wondering where this stands now. What's the current expected throughput? cc @Aaronontheweb

@beachwalker
Contributor

Still interested too.

@Aaronontheweb
Member

Aaronontheweb commented Jun 4, 2018

No idea why I haven't commented on this sooner.

Running the latest dev branch sources on .NET Core 1.1 on my local development machine (quad-core Intel i7; the machine is about 5.5 years old):

ProcessorCount: 8
ClockSpeed: 0 MHZ
Actor Count: 16
Messages sent/received per client: 20000 (2e4)
Is Server GC: True

Num clients, Total [msg], Msgs/sec, Total [ms]
1, 20000, 37524, 533.86
5, 100000, 49020, 2040.46
10, 200000, 53107, 3766.56
15, 300000, 52174, 5750.86
20, 400000, 53023, 7544.77
25, 500000, 52933, 9446.58
30, 600000, 52897, 11343.44

The third column (Msgs/sec) is the number of messages per second. It averages about 52-53k msg/s under this configuration across a single Akka.Remote connection.

@Aaronontheweb
Member

Here are the numbers for the same hardware, running .NET 4.6.1:

ProcessorCount: 8
ClockSpeed: 0 MHZ
Actor Count: 16
Messages sent/received per client: 20000 (2e4)
Is Server GC: True

Num clients, Total [msg], Msgs/sec, Total [ms]
1, 20000, 46083, 434.37
5, 100000, 59917, 1669.46
10, 200000, 63695, 3140.42
15, 300000, 64048, 4684.32
20, 400000, 59032, 6776.21
25, 500000, 58652, 8525.28
30, 600000, 57051, 10517.46

@Aaronontheweb
Member

Just for kicks, here's the .NET Core 2.1 numbers - same hardware as the rest:

ProcessorCount: 8
ClockSpeed: 0 MHZ
Actor Count: 16
Messages sent/received per client: 20000 (2e4)
Is Server GC: True

Num clients, Total [msg], Msgs/sec, Total [ms]
1, 20000, 44248, 452.49
5, 100000, 61163, 1635.64
10, 200000, 62854, 3182.58
15, 300000, 61716, 4861.34
20, 400000, 59711, 6699.13
25, 500000, 58398, 8562.91
30, 600000, 61013, 9834.86

@voltcode
Contributor

voltcode commented Jun 4, 2018

@Aaronontheweb

I assume this was achieved using the Wire serializer and DotNetty? If so, it seems that despite the planned improvements, 100k is still far away. Do you have any config tips, or plans for how to achieve better throughput? What was the CPU utilization in your tests?

@Aaronontheweb
Member

Aaronontheweb commented Jun 4, 2018

@voltcode this is just using the defaults (no Hyperion), but we are using DotNetty. I also think we're using the plain-old string serializer in these benchmarks, so we don't have any JSON.NET overhead. The system spikes at about 90% CPU utilization on this machine when I run these benchmarks. On my custom-built AMD Ryzen machine at home I get about 100-110k msg/s.

But you're unlikely to achieve that Ryzen kind of performance on a cloud machine - hardware on public clouds is at least 2-3 years old on average, plus you have the hypervisor penalty, noisy-neighbor problems, and other unpredictable issues that stem from running in a shared environment.

Serialization can certainly contribute to performance issues, but with the way this benchmark is designed we've basically eliminated "choice of serializer" as a contributor to the figures you can see above. What you're seeing here is a measure of how the remoting system + the default transport performs. We, like you, are of the opinion that it is in need of improvement. It's definitely the case that some of the non-configurable parts of our serialization system are contributors to the performance here (i.e. the way we use serialization inside the built-in remoting actors.)

So that being said, I'll take a moment to expand on the avenues we're considering for improving Akka.Remote's performance - since it's the area that translates to the biggest "bang for the buck" for our users.

Akka.Remote Flow and Staging

Today, Akka.Remote tries to achieve a semblance of batching and being "reactive" to the amount of available bandwidth (processing power, buffer space, etc) in the system using an adaptive backoff system that we ported from the JVM. In part, we also rely on the transport itself to signal to us when it is ready for more data vs. when it is not, and the DotNetty transport can do this.

In my view though, there's a fundamental problem with this adaptive backoff system: it relies on the remoting actors "guessing" when the transport is going to become available again based on the adaptive backoff formula, rather than being explicitly signalled by the underlying transport that it is ready for more data again. This means there are windows of capacity we may not be utilizing at all.

On top of that, I'm not sure DotNetty's system for determining when the channel is writable vs. non-writable (outbound buffers are full) is 100% deterministic. There might be a delay between when the channel is writable and when it's not.

Lots of users have noticed that the performance of Akka.Remote starts off really strongly until the first time they see the adaptive backoff system kick in (there's a big INFO message that shows up when it occurs.) That points to the items I've called out above - that our adaptive backoff system is inefficient.

So, there's a really great technology that can help us solve this problem altogether: Akka.Streams. Rather than force the remoting actors to guess when the underlying transport might be available for more writes again, Akka.Streams can simply be pulled by the lowest part of the remoting stack when it has capacity again. This is the whole premise of Reactive Streams.
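
To make the pull-based premise concrete, here is a tiny, self-contained Akka.Streams sketch (not Akka.Remote internals): the deliberately slow sink only signals demand upstream when it finishes an element, so the fast source can never overrun it.

```csharp
using System;
using System.Linq;
using System.Threading;
using Akka.Actor;
using Akka.Streams;
using Akka.Streams.Dsl;

class BackpressureDemo
{
    static void Main()
    {
        var system = ActorSystem.Create("demo");
        var materializer = ActorMaterializer.Create(system);

        // A fast producer feeding a deliberately slow consumer: the consumer's
        // demand (not a backoff timer) is what throttles the producer.
        var done = Source.From(Enumerable.Range(1, 1000))
            .Select(i => $"frame-{i}")        // stand-in for serialization work
            .RunForeach(frame =>
            {
                Thread.Sleep(1);              // stand-in for a slow socket write
                Console.WriteLine(frame);
            }, materializer);

        done.Wait();
        system.Terminate().Wait();
    }
}
```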

This is why the JVM Akka project has taken steps to rewrite some of the remoting actors as Akka.Streams components: they, like us, suffered from similar issues with the original Akka.Remote design. I expect we will follow suit there and reap some benefits from it.

There are other inefficient things in our flow too - like not serializing messages while we're backing off in order to get them ready for writing, and not trying to push multiple messages onto the socket at once when they can fit within a single frame. Theoretically, the underlying transport should do the latter for us - but we have no way of really verifying that from the outside.

I personally think this is the biggest contributor to performance issues; there's bits and pieces I'm going to address further down, but I think that our workflow around the remoting actors and the remote transports is the area where we can realize the biggest improvements to Akka.Remote performance.

Transports and Event Models

I have a couple of working theories on performance issues related to Akka.Remote and the transport system - the latest one being that DotNetty's eventing model and the STEE (single-threaded event executor) it uses act as a bigger bottleneck than we originally thought. DotNetty has played around with its event model, its buffer system, and a lot of other pieces of infrastructure over the past year or so - to the point where I have a lot of trouble following it these days. This isn't a criticism of their work - just me saying that I don't know what's really going on under the hood there as well as I used to.

DotNetty's topline performance is higher than Akka.Remote's - 250k msg/s last I checked - but those metrics are always captured in isolation from Akka.NET and other parts of a .NET process that could compete for CPU time. When I run detailed performance profiling of this same benchmark today, the DotNetty pollTask operation that occurs inside the STEE is the single biggest area of CPU consumption. There are some other implementation details regarding the way the STEE and EventLoopGroups in DotNetty work that I'm going to skip for the sake of brevity, but let's just say I suspect that Akka.NET and DotNetty are competing for resources in a way that has performance consequences.

So, I've been working on a proof of concept here to put this theory to the test: https://github.com/Aaronontheweb/Akka.Streams.RemoteTransport

This is a pure Akka.Streams transport that I'm still chipping away at, but it'll run on top of the Akka.Remote dispatcher and use the same threads that the remoting actors already do. In essence, we're going to let Akka.NET's dispatcher model do 100% of the work instead of half of it. On top of that, I think this might be a low-impact way to dogfood some of the flow changes described above.

Serialization and Buffer Use

This has been the most intensely debated area for performance improvements within Akka.Remote, probably because it touches on other popular topics in the .NET ecosystem, such as Span<T> and Memory<T>, and because it's the most "visible" part of the remoting system due to its configurability.

Long story short: we know that our current design lends itself to lots of unnecessary buffer copying and garbage collection pressure. A big part of this is unavoidable: each serialization library wants to manage its own buffer allocation and re-use. On top of that, most serializers operate as though they are "stateless" and do a bunch of duplicative work on each successive serialization operation.

Part of the Hyperion design that @Horusiath has been advocating is to introduce the concept of serialization "sessions", which can reduce the amount of work done on each successive per-message serialization - i.e. turn serialization into a "stateful" activity so we can re-use bits of information we've seen on previous serialization attempts. It's not clear to us yet what kind of real-world benefits that will produce, but it's an avenue worth exploring.
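
As a rough illustration of the idea only - this is a hypothetical shape, not an existing Akka.NET or Hyperion API:

```csharp
using System;

// Hypothetical: state that lives for the duration of a connection, so repeated
// type/manifest lookup work isn't redone for every message on that connection.
public interface ISerializationSession : IDisposable
{
    byte[] Serialize(object message);   // may reuse cached type descriptors from earlier calls
    object Deserialize(byte[] bytes);   // may resolve types against the session's cache
}
```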

For my part, since I'm coming at the remoting system from a different angle, I think redesigning the flow of Akka.Remote so that the serialization components can all share a common buffer pool and work in concert, rather than independently, will result in some large improvements. However, we're a bit dependent on third-party serializers like Google.Protobuf adding Span<T> or Memory<T> support in order to pull this off.
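
As a sketch of the shared-buffer idea using the BCL's ArrayPool<byte> (the SendPooled helper and the write callback are hypothetical; real serializers differ in whether they can write into a caller-supplied buffer):

```csharp
using System;
using System.Buffers;

public static class PooledSend
{
    // Rent a reusable buffer for the serialized payload instead of allocating
    // a fresh byte[] per message, hand it to the transport, then return it.
    public static void SendPooled(ReadOnlySpan<byte> payload, Action<byte[], int> write)
    {
        byte[] buffer = ArrayPool<byte>.Shared.Rent(payload.Length);
        try
        {
            payload.CopyTo(buffer);
            write(buffer, payload.Length);   // e.g. enqueue onto the socket/transport
        }
        finally
        {
            ArrayPool<byte>.Shared.Return(buffer);
        }
    }
}
```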

Anyway, that's my long-form braindump of what we're looking at for improving performance. Even more radical ideas, like changing the default transport from TCP to something else, aren't really being considered at this time but could be in the future.

In case it's not clear from what I've written here, we've given this a lot of thought and we're working on trying to fix it. What we're looking at are some bigger structural changes to Akka.Remote itself, and I think that's what will be required to fully address it in the future.

I can't make any promises around timeframes, but given that this is an issue that affects the bottom lines of our users - it's a high priority. We've made other performance fixes in the recent past (caching around Akka.Remote in 1.3.0 being the most recent I can think of) but the large leaps in performance are going to need to come from architectural improvements rather than performance tweaks.

@seungyongshim

seungyongshim commented Sep 10, 2019

I heard that "Artery" is blazingly faster than Akka.Remote.

|                               | Remoting "classic" | Artery Remoting | Artery TCP Remoting |
|-------------------------------|--------------------|-----------------|---------------------|
| Protocol                      | TCP / TLS+TCP      | UDP / Aeron     | TCP / TLS+TCP       |
| Large messages                | Troublesome        | Dedicated lanes | Dedicated connection|
| Heartbeat and System Messages | Prioritised        | Dedicated lanes | Dedicated connection|
| Benchmarked throughput        | 70k msg/s          | 700k+ msg/s     | 600k+ msg/s         |

It is interesting for Akka.NET, isn't it?

@Aaronontheweb
Member

Aaronontheweb commented Sep 10, 2019 via email

@vncoelho

@Aaronontheweb and colleagues,

First of all, thanks for all your efforts and studies. I am not sure if this thread is the best one to ask for that, but here it goes.

Currently, on the NEO blockchain, we are using Akka actors to initialize TCP connections and to manage payloads with a priority mailbox: internal abstract class PriorityMailbox : MailboxType, IProducesMessageQueue<PriorityMessageQueue>.
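
For context, a custom priority mailbox in Akka.NET typically follows the shape of the built-in UnboundedPriorityMailbox; a minimal sketch with illustrative message types (not NEO's actual implementation):

```csharp
using Akka.Actor;
using Akka.Configuration;
using Akka.Dispatch;

// Illustrative high-priority marker; real code would use its own message types.
public sealed class HighPriorityMessage { }

public class DemoPriorityMailbox : UnboundedPriorityMailbox
{
    public DemoPriorityMailbox(Settings settings, Config config) : base(settings, config) { }

    // Lower numbers are dequeued first.
    protected override int PriorityGenerator(object message)
        => message is HighPriorityMessage ? 0 : 1;
}
```

The mailbox is then bound to an actor through a mailbox-type entry in HOCON and Props.WithMailbox(...).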

Recently, some of our colleagues mentioned that Akka P2P was slow, combined with the On.Tell event being single-threaded.

What are your recommendations?
Some members of our team recently started a discussion about initializing several Akka TCP channels instead of using the mailbox for priorities.

Best regards,

@Aaronontheweb
Member

Closing this out - we made major improvements in 1.4.0, but ultimately #4007, which we are starting work on after the 1.4.0 release, is what will get us the increase in wire throughput that we would really like to see.
