
Investigate arm64 robustness performance #17595

Merged
merged 2 commits into etcd-io:main on Mar 24, 2024

Conversation

jmhbnz
Member

@jmhbnz jmhbnz commented Mar 18, 2024

In October last year we switched our self-managed arm64 CI infrastructure on Equinix Metal over to managed arm64 runners provided via the CNCF and actuated.dev.

Since then we have completed some right-sizing of memory requirements for all our arm64 workflows; however, we are still hitting some teething issues with CPU performance, notably for robustness testing.

Performance issues were tracked in:

One mitigation to improve CPU performance was to disable lazyfs for arm64 which was completed in #17323.

Even with lazyfs disabled we are still seeing failures; recent examples are:

1. https://github.com/etcd-io/etcd/actions/runs/8324216324/job/22775349104
   2024-03-18T10:49:54.6347546Z     logger.go:130: 2024-03-18T10:44:54.551Z	INFO	Validating linearizable operations	{"timeout": "5m0s"}
   2024-03-18T10:49:54.6353444Z     logger.go:130: 2024-03-18T10:49:54.605Z	ERROR	Linearization has timed out
2. https://github.com/etcd-io/etcd/actions/runs/8262348764/job/22601494073
   2024-03-13T10:47:07.8149258Z     logger.go:130: 2024-03-13T10:47:07.791Z	INFO	Average traffic	{"qps": 98.21620104737428}
   2024-03-13T10:47:07.8150704Z     traffic.go:105: Requiring minimal 100.000000 qps for test results to be reliable, got 98.216201 qps

The above failures indicate CPU and/or disk IOPS performance bottlenecks, so this pull request increases the robustness workflow CPU cores from 8 to 12 and also enables vmmeter so we can more closely introspect the performance of the arm64 runners versus the standard GitHub amd64 runners.
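For context, the second failure above comes from a minimum-throughput guard in the robustness traffic checks. The sketch below shows roughly the shape of that guard (illustrative only, not the exact etcd code; the type and function names are hypothetical, while the 100 qps floor matches the log output above):

```go
package main

import (
	"fmt"
	"time"
)

// minimalQPS is the floor below which a run is considered too slow for its
// linearizability results to be trusted (value taken from the failure log above).
const minimalQPS = 100.0

// trafficReport is a hypothetical summary of a finished robustness traffic run.
type trafficReport struct {
	successfulRequests int
	duration           time.Duration
}

// checkMinimalQPS mirrors the kind of guard that produced the
// "Requiring minimal 100.000000 qps" failure: if the runner cannot sustain
// the target request rate (for example due to CPU or disk IOPS limits), the
// run is rejected outright instead of reporting a potentially misleading pass.
func checkMinimalQPS(r trafficReport) error {
	qps := float64(r.successfulRequests) / r.duration.Seconds()
	if qps < minimalQPS {
		return fmt.Errorf("requiring minimal %f qps for test results to be reliable, got %f qps", minimalQPS, qps)
	}
	return nil
}

func main() {
	// A run just under the floor (982 requests in 10s, roughly 98.2 qps) fails
	// the guard, which is the failure mode seen in the 2024-03-13 run above.
	r := trafficReport{successfulRequests: 982, duration: 10 * time.Second}
	if err := checkMinimalQPS(r); err != nil {
		fmt.Println(err)
	}
}
```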

cc @serathius, @alexellis

Try to prevent the failures we regularly see due to not meeting the qps requirement.

Signed-off-by: James Blair <mail@jamesblair.net>
Signed-off-by: James Blair <mail@jamesblair.net>
Member

@ahrtr ahrtr left a comment


lgtm

Thanks

@alexellis
Contributor

Keep us informed via Slack please. We did extensive work for you guys in a thread over there and got no response.

One thing in particular was adding extra machines in when you have your PR storm every Monday and Tuesday. It'd be helpful if dependabot jobs were more balanced throughout the week rather than a thundering herd.

That said, are you just two points off the 100 point target at the moment?

@jmhbnz
Member Author

jmhbnz commented Mar 20, 2024

> Keep us informed via Slack please. We did extensive work for you guys in a thread over there and got no response.

Thanks Alex - Thread created here https://self-actuated.slack.com/archives/C043BB2NCUW/p1710927214000339.

> One thing in particular was adding extra machines in when you have your PR storm every Monday and Tuesday. It'd be helpful if dependabot jobs were more balanced throughout the week rather than a thundering herd.

The Monday PR storm will go away once the beta dependabot feature tracked in dependabot/dependabot-core#7547 becomes available, which is expected shortly.

> That said, are you just two points off the 100 point target at the moment?

For that specific robustness run, yes, we were very close to the required performance level.

@ahrtr ahrtr merged commit 671dabc into etcd-io:main Mar 24, 2024
38 checks passed