
Akka v1.4 Idle CPU usage increased comparing v1.3 #4434

Closed
ok-ul-ch opened this issue May 25, 2020 · 22 comments · Fixed by #4438 or #4476

Comments

@ok-ul-ch

Info:
.NET Core 3.1
Akka 1.3.18 and Akka 1.4.6
Platforms:

  • Docker for Windows (Alpine images). Docker version: 19.03.8
  • AWS EKS

Issue

After migrating from 1.3.18 to 1.4.6 we noticed that Akka clustering added more CPU overhead to our deployments. After a brief investigation and comparison, we found that the newer Akka versions consume significantly more CPU while the system is idle.
Docker For Windows relative usage values:
[screenshot: akka-stats]

AWS EKS absolute values (m5.large instance):
[screenshot: CPU usage on AWS EKS]

As you can see, that's almost a 4x increase in absolute values on EKS.

Repro:
Use the code from the repository, run it via docker-compose, and check docker stats.
The repro contains three seed nodes running the following code:

using System;
using System.Threading.Tasks;
using Akka.Actor;

class Program
{
    // HOCON config (JSON-style): cluster provider, warnings only, batching disabled
    private static readonly string Config = @"
    {
        ""akka"": {
            ""loglevel"": ""Warning"",
            ""actor"": {
                ""provider"": ""cluster""
            },
            ""remote.dot-netty.tcp.batching.enabled"": ""false""
        }
    }";

    static async Task Main(string[] args)
    {
        // DockerBootstrap comes from the repro repository; it layers Docker
        // environment settings (hostname, port, seed nodes) on top of the config.
        var bootstrapFromDocker = DockerBootstrap.BootstrapFromDocker(Config);

        using (var actorSystem = ActorSystem.Create(Environment.GetEnvironmentVariable("ACTOR_SYSTEM"), bootstrapFromDocker))
        {
            // keep the node alive until the actor system terminates
            await actorSystem.WhenTerminated;
        }
    }
}
@Aaronontheweb
Member

@mralexes please see #4417

The issue was with the base Docker image, not with Akka.NET.

@Aaronontheweb
Member

If that other issue turns out not to be the case, please comment here and let us know - we will want to look into it if it's really an Akka.NET issue.

@ok-ul-ch
Author

@Aaronontheweb, please check the repro again. For both 1.3.18 and 1.4.6, I'm using the same image.

@Aaronontheweb reopened this May 26, 2020
@Aaronontheweb
Member

Ok, will do

@Aaronontheweb
Member

Aaronontheweb commented May 26, 2020

Ok, I can reproduce this:

NAME                                  CPU %               MEM USAGE / LIMIT     NET I/O
akkanet-versions-issue_seed_1_3_3_1   1.81%               25.86MiB / 1.945GiB   36.4kB / 34.6kB
akkanet-versions-issue_seed_1_4_2_1   8.05%               26.77MiB / 1.945GiB   35.4kB / 34.4kB
akkanet-versions-issue_seed_1_4_1_1   9.02%               28.77MiB / 1.945GiB   35.5kB / 34.5kB
akkanet-versions-issue_seed_1_3_2_1   1.42%               27.96MiB / 1.945GiB   36.2kB / 34.4kB
akkanet-versions-issue_seed_1_3_1_1   1.51%               26.62MiB / 1.945GiB   36.3kB / 35.4kB
akkanet-versions-issue_seed_1_4_3_1   8.04%               26.42MiB / 1.945GiB   35.5kB / 33.8kB

@Aaronontheweb
Member

Fixed the solution so now clusters actually form:

NAME                                  CPU %               MEM USAGE / LIMIT     NET I/O
akkanet-versions-issue_seed_1_4_1_1   8.81%               29.26MiB / 1.945GiB   111kB / 112kB
akkanet-versions-issue_seed_1_3_2_1   2.35%               27.04MiB / 1.945GiB   116kB / 117kB
akkanet-versions-issue_seed_1_4_3_1   5.50%               28.25MiB / 1.945GiB   113kB / 113kB
akkanet-versions-issue_seed_1_3_3_1   4.05%               25.87MiB / 1.945GiB   116kB / 112kB
akkanet-versions-issue_seed_1_3_1_1   2.15%               27.21MiB / 1.945GiB   118kB / 118kB
akkanet-versions-issue_seed_1_4_2_1   7.74%               27.38MiB / 1.945GiB   115kB / 110kB

@Aaronontheweb
Member

I haven't found anything obvious in the profiler that explains the disparity in these numbers, so we're going to need to go through the commit log line by line and see which changes might be responsible:

1.3.18...1.4.6

I've modified your reproduction so I can copy local Akka.NET builds into the v1.4 branch and run performance comparisons that way: https://github.com/Aaronontheweb/Akka.Net-versions-issue

@Aaronontheweb
Member

I also made https://github.com/Aaronontheweb/AkkaIdleCpuStudy to measure and profile idle CPU utilization between v1.3.18 and v1.4.6 head to head, but I haven't been able to get any definitive results from it yet.

@Zetanova
Contributor

I tested with Akka 1.3 and older versions of .NET Core 3.1.
All had the same issue, so I concluded it was the rebuild of the container/Linux base.
But of course I could be mistaken.

This here accounts for about 1/3 of the CPU:
https://github.com/Azure/DotNetty/blob/47f5ec7303037f1360615d182939e04d8619a2b3/src/DotNetty.Common/Concurrency/SingleThreadEventExecutor.cs#L488-L517

Maybe the deadline is negative or very short, or the emptyEvent is getting signaled very frequently
with a NoOp task. In my test the nodes didn't produce much traffic on the wire.
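
To illustrate the concern, here is a rough sketch of the deadline-based wait pattern being described (illustrative only — the names and structure are assumptions, not DotNetty's actual SingleThreadEventExecutor code):

using System;
using System.Threading;

static class IdleWaitSketch
{
    // If 'deadline' is already in the past (negative or very short), or the event is
    // signaled constantly by no-op tasks, this returns almost immediately on every
    // iteration and the surrounding loop effectively busy-spins on the CPU.
    public static void WaitForWorkOrDeadline(ManualResetEventSlim wakeUp, DateTime deadline)
    {
        var timeout = deadline - DateTime.UtcNow;   // may be <= TimeSpan.Zero
        if (timeout > TimeSpan.Zero)
            wakeUp.Wait(timeout);                   // blocks, or wakes early when signaled
        // a non-positive timeout means no blocking at all, so the caller loops again right away
        wakeUp.Reset();
    }
}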

@Zetanova
Contributor

if (_writer.CanSchedule)
    _context.Executor.Schedule(this, _interval); // reschedule

This is most likely what triggers the SpinWait inside DotNetty.

I think a looped SpinWait + event handle on a 40ms interval
will stay hot all the time. I will run a small test on it.
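
For context, the pattern being discussed looks roughly like the following self-rescheduling flush task (an illustrative sketch with assumed names, not the actual Akka.NET BatchWriter source):

using System;
using DotNetty.Common.Concurrency;
using DotNetty.Transport.Channels;

// Illustrative: a flush task that keeps re-arming itself on the channel's executor.
// Even with no outbound traffic, the executor wakes up every interval (e.g. 40ms),
// which can keep the event loop's SpinWait/event-handle path hot.
sealed class PeriodicFlush : IRunnable
{
    private readonly IChannelHandlerContext _context;
    private readonly TimeSpan _interval;

    public PeriodicFlush(IChannelHandlerContext context, TimeSpan interval)
    {
        _context = context;
        _interval = interval;
    }

    public void Run()
    {
        _context.Flush();                            // push any buffered writes to the socket
        _context.Executor.Schedule(this, _interval); // re-arm, so this runs forever
    }
}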

I already tried to disable the buffer feature,
but maybe I just set it wrong in the config;
I will try again.

My test Config was:

akka {
	remote {
		dot-netty.tcp.batching.enabled = false
	}
}

@Zetanova
Contributor

I have now tried both disabling and enabling batching, without success.

I also enabled log-transport; it started logging after the first node connection,
and CPU usage went much higher with only ~10 log lines per second.

I don't currently have any more clues to test.

akka {
	remote {
		dot-netty.tcp {
			log-transport = true
			batching {
				enabled = true
				flush-interval = 800ms
			}
		}
	}
}

@Aaronontheweb
Member

Aaronontheweb commented May 28, 2020 via email

@Zetanova
Contributor

@Aaronontheweb Is the behavior the same under Windows?

@ok-ul-ch
Author

Additional observation: idle CPU usage increases over time.

[screenshot: CPU usage graph]

The screenshot shows CPU usage.
The left side of the graph is one of our services on Akka 1.4.6; the right side is the same service (same code, same base image) after rolling back to 1.3.18.
We've decided to stay on 1.3.18 because we run our pods under resource limits, and after 1-2 days of uptime CPU usage reaches the limit (you can see that at the very start of the graph, the flat line), causing cluster association problems from time to time.

NOTE: I did not check this on earlier 1.4.x versions.

@Zetanova
Contributor

@mralexes Are you sure that's the same base image?
Did you pull and rebuild both containers from scratch?

I did the same and saw high idle usage with both Akka 1.3 and 1.4.
But maybe I made a mistake in my test last week.

@ok-ul-ch
Author

ok-ul-ch commented Jun 2, 2020

@mralexes yup, I'm sure. Same image. The CI pipeline always pulls layers (no caching enabled), so I'm sure the base image has been pulled fresh every time as well.

@Aaronontheweb
Member

dotnet/coreclr#27990 - might be related

@Arkatufus
Contributor

Arkatufus commented Jun 16, 2020

I might have found the cause; it is related to dotnet/coreclr#26806.
The CPU usage problem was caused by the 3.1.100 SDK used to compile the Akka.NET 1.4.6 NuGet package. It seemed to disappear when I recompiled the 1.4.6 source code under 3.1.105.
I'm making a fix to the build script right now.
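
For reference, pinning the SDK used for the build is one common way to guard against this kind of regression; a minimal global.json sketch, assuming 3.1.105 is the SDK you want (the actual fix here went into the Akka.NET build script):

{
  "sdk": {
    "version": "3.1.105"
  }
}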

@Aaronontheweb
Member

So I can confirm that #4475 didn't resolve the issue - I can still reproduce the problem running against the latest dev branch sources.

CONTAINER ID        NAME                                  CPU %               MEM USAGE / LIMIT     MEM %               NET I/O             BLOCK I/O           PIDS
6dc9ae3c0af6        akkanet-versions-issue_seed_1_3_1_1   1.95%               27.66MiB / 9.737GiB   0.28%               96.3kB / 93.8kB     0B / 0B             24
b34b3036503b        akkanet-versions-issue_seed_1_4_3_1   6.21%               27.57MiB / 9.737GiB   0.28%               89kB / 88.8kB       0B / 0B             21
3fbd0ff570ca        akkanet-versions-issue_seed_1_4_2_1   8.88%               28.02MiB / 9.737GiB   0.28%               92.4kB / 90.6kB     0B / 0B             24
a3e340ca002e        akkanet-versions-issue_seed_1_3_2_1   2.58%               26.23MiB / 9.737GiB   0.26%               96.1kB / 92.9kB     0B / 0B             23
ff3d199efa38        akkanet-versions-issue_seed_1_4_1_1   9.57%               27.68MiB / 9.737GiB   0.28%               91.9kB / 90.5kB     0B / 0B             22
051a267ac331        akkanet-versions-issue_seed_1_3_3_1   1.89%               30.29MiB / 9.737GiB   0.30%               91.1kB / 93kB       0B / 0B             23

@Aaronontheweb
Member

With the changes I added here:

cfd1f26

CPU graph looks pretty normal again:

CONTAINER ID        NAME                                  CPU %               MEM USAGE / LIMIT     MEM %               NET I/O             BLOCK I/O           PIDS
b3c02902a764        akkanet-versions-issue_seed_1_4_2_1   1.78%               24.93MiB / 9.737GiB   0.25%               351kB / 352kB       0B / 0B             22
c9bca09437ba        akkanet-versions-issue_seed_1_3_1_1   2.15%               25.45MiB / 9.737GiB   0.26%               366kB / 367kB       0B / 0B             23
ddc4b1f036ed        akkanet-versions-issue_seed_1_3_2_1   2.02%               25.48MiB / 9.737GiB   0.26%               364kB / 366kB       0B / 0B             23
b54f1603ffb4        akkanet-versions-issue_seed_1_3_3_1   1.90%               24.48MiB / 9.737GiB   0.25%               371kB / 365kB       0B / 0B             24
e7faf28516be        akkanet-versions-issue_seed_1_4_3_1   1.88%               25.48MiB / 9.737GiB   0.26%               355kB / 354kB       0B / 0B             24
270a6ef6cdba        akkanet-versions-issue_seed_1_4_1_1   1.89%               26.04MiB / 9.737GiB   0.26%               363kB / 357kB       0B / 0B             23

Looks like the issue is the way the BatchWriter stage is scheduling itself. DotNetty might be doing something funky under the hood. I'll investigate.

@Aaronontheweb
Member

I suspect this might have something to do with Azure/DotNetty#522

@Aaronontheweb
Member

Well, it's simpler than that - separating flush and write into two separate calls in order to support batching effectively doubles the number of operations in the DotNetty executor pipeline when batching is disabled.

I think I can work around this by changing the way the TcpHandler works when batching is disabled.
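
A rough sketch of that kind of workaround in a DotNetty outbound handler (illustrative only — the handler and the _batchingEnabled flag are assumptions, not the actual Akka.NET TcpHandler/BatchWriter code):

using System.Threading.Tasks;
using DotNetty.Transport.Channels;

// Illustrative: when batching is off, collapse write + flush into a single
// WriteAndFlushAsync call instead of issuing two operations per message,
// halving the work pushed through the executor pipeline.
sealed class WriteOrWriteAndFlushHandler : ChannelHandlerAdapter
{
    private readonly bool _batchingEnabled;

    public WriteOrWriteAndFlushHandler(bool batchingEnabled) => _batchingEnabled = batchingEnabled;

    public override Task WriteAsync(IChannelHandlerContext context, object message)
    {
        if (!_batchingEnabled)
            return context.WriteAndFlushAsync(message); // one pipeline operation

        return context.WriteAsync(message);             // queued; a periodic flush task handles the rest
    }
}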
