high system cpu usage on c6gn instances with high throughput loads #195

Closed
romange opened this issue Dec 3, 2021 · 11 comments

@romange commented Dec 3, 2021

When running a memcached load test on c6gn.16xlarge I noticed that softirq processing takes a lot of CPU. As far as I remember it was not like this before, so it looks like a possible regression in the hypervisor, maybe?

I checked with Ubuntu 21.04 and 21.10, with both the native ENA driver that comes with the distribution and with the 2.6 driver. It is always the same.

(screenshot attached)

To reproduce (using 2 c6gn.16xlarge):

  1. /usr/bin/memcached -t 32 -m 640 -p 11211 -u memcache -l 0.0.0.0 -c 10240 on the server side.
  2. memtier_benchmark -s <private_ip> -p 11211 --ratio 0:1 -t 32 -c 50 -n 2000000 -P memcache_text on the load-test instance.
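
To see where the CPU time is going while the benchmark runs, a minimal sketch (assuming sysstat and the linux-tools package for the running kernel are installed on the server):

    # per-CPU utilization once per second; the %soft column is softirq time
    mpstat -P ALL 1
    # live profile of kernel and user symbols; __do_softirq is the symbol discussed below
    sudo perf top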
@romange commented Dec 3, 2021

(screenshot attached)

@akiyano commented Dec 5, 2021

Hi @romange,

Thanks for reporting this.
A few questions:

  1. You show two different results in your two comments - one with 13.96% in __do_softirq, the other with 28.58%. What is the difference between the two setups (instance type, instance size, AMI, OS distribution and version, driver version, preinstalled driver vs. driver taken from GitHub)?

  2. You say "As far as I remember". Is there a chance you are remembering a different instance type (not c6gn, which is a fairly new instance type)?

  3. The preinstalled driver in ubuntu has adaptive interrupt coalescing off by default, and the github 2.6.0g driver has it on. This setting should make at least some difference, as it should change the number of interrupts you are getting (thus change the overhead of handling them). Can you please try turning adaptive interrupt coalescing on and off to see if it makes a difference? Assuming the network device is ens5, the command to see if it is on:
    sudo ethtool -c ens5
    And the command to turn it on/off is:
    sudo ethtool -C ens5 adaptive-rx on/off
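    Spelled out explicitly, assuming the device is ens5 (a sketch; re-check the current value after each change):
    sudo ethtool -c ens5                      # show current coalescing settings
    sudo ethtool -C ens5 adaptive-rx on       # enable adaptive RX coalescing
    sudo ethtool -C ens5 adaptive-rx off      # disable it again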

Thanks,
Arthur

@romange commented Dec 5, 2021

Hi Arthur,

Thanks for responding so quickly. Replying to each question:

  1. I have now done an exact reproduction with Ubuntu 21.04, Linux 5.11.0-1022-aws, on c6gn.8xlarge - no custom driver, no custom ethtool settings. See the htop output from the server side below.
    The client side is another instance with exactly the same configuration, running
    memtier_benchmark -s <private_ip> -p 11211 --ratio 0:1 -t 32 -c 50 -n 2000000 -P memcache_text.

(htop screenshot)

Attaching perf top output from the server:
(perf top screenshot)

> ethtool -c ens5 
Coalesce parameters for ens5:
Adaptive RX: off  TX: n/a
stats-block-usecs: n/a
sample-interval: n/a
pkt-rate-low: n/a
pkt-rate-high: n/a

rx-usecs: 0
rx-frames: n/a
rx-usecs-irq: n/a
rx-frames-irq: n/a

tx-usecs: 0
  2. I feel uncomfortable expanding on this in an open channel since it relates to my work at AWS up until September this year (I was an AWS employee until recently). If you want to hear more details, please send me an email to romange at gmail ... In any case, I am pretty sure the state of things was much better with c6gn instances in August 2021.
  3. Setting adaptive-rx to on does not change a thing, and that is expected with the memcached benchmark. This benchmark puts a lot of stress on system interrupts since each client sends a ping (a short message) and waits for a pong back. The high throughput is created by using 32 threads * 50 connections each. Each ping creates an interrupt, but since it is not followed by any other packet (the client is waiting for a pong), there is nothing to coalesce on the server side. It is different from iperf, which sends large chunks of data in one direction.
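
One way to check whether the coalescing setting changes the interrupt rate at all (a rough sketch, assuming the device is ens5):

    # print the ENA queue interrupt counters every second while the benchmark runs;
    # if coalescing had an effect, the per-second growth of these counters would drop
    watch -n 1 'grep ens5 /proc/interrupts'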

@talawahtech commented Dec 14, 2021

@romange it is possible that you are running into this issue: #159. One way to confirm would be to run the same test on a c6gn vs a c6g (or a c5n) and compare the output of dstat --cpu -y -i -I 27,28,29,30 --net-packets to see if interrupt coalescing is active.

With high-throughput benchmarks, interrupt moderation can have a big impact, even for request/response workloads, because the coalescing happens across multiple connections. It prevents all incoming packets from triggering an interrupt for x microseconds and then handles a group of them all at once. It led to a 14% performance improvement in my high-throughput HTTP benchmark: https://talawah.io/blog/extreme-http-performance-tuning-one-point-two-million/#interrupt-moderation.
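
For reference, the static interrupt moderation described in that post can be approximated with ethtool (a sketch, assuming the device is ens5; the 300 µs value is illustrative and needs tuning per workload):

    # turn off adaptive coalescing and hold RX interrupts for up to 300 microseconds,
    # so packets from many connections are handled in one softirq pass
    sudo ethtool -C ens5 adaptive-rx off rx-usecs 300
    # verify the new settings
    sudo ethtool -c ens5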

@romange commented Dec 15, 2021

I will check. Thanks Mark!

@romange commented Dec 30, 2021

Following @talawahtech's suggestion, I ran the test again today in us-east-2.
To my surprise, the CPU usage looked excellent, and perf top looked healthy.

(screenshot from 2021-12-30 12-42-46)

@akiyano did you guys fix something, or is it a random thing?
@talawahtech I am attaching the dstat output for the record, but I am not familiar with dstat, unfortunately.

(screenshot from 2021-12-30 12-47-15)

@akiyano commented Dec 30, 2021

@romange
No deliberate change was made yet to fix this issue.
Can you please explain what you did differently this time compared to the original run, which suddenly gave you good CPU usage?

@romange commented Dec 30, 2021

Nothing. Same image, same VM. The only thing I did differently was choosing us-east-2.
I do not remember where I ran it before, but not in Ohio; it could have been us-east-1 or Oregon. I suggest you try reproducing the results in different US regions. A run takes about 10 minutes: you just need two c6gn.8xlarge instances in the same zone and the commands above. Unfortunately, memtier_benchmark cannot be installed from the Ubuntu repositories, so you need to build it from source once and then copy the binary each time you start a VM.
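
For reference, a rough sketch of building memtier_benchmark from source on Ubuntu, following the autotools flow from its README (package names may vary slightly between releases):

    sudo apt-get install -y build-essential autoconf automake libpcre3-dev libevent-dev pkg-config zlib1g-dev libssl-dev
    git clone https://github.com/RedisLabs/memtier_benchmark.git
    cd memtier_benchmark
    autoreconf -ivf && ./configure && make
    # copy the resulting ./memtier_benchmark binary to each fresh load-test instance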

@talawahtech

@romange based on that image it looks like you are only doing around 1800 requests/packets per second. At that request rate you won't see much softirq activity.

Also, the IRQ numbers in the dstat command that I gave you are wrong for the c6gn. I believe the IRQ numbers for the individual network queues are in the 40+ range on the c6gn vs. the 20+ range on the c5n. You can run cat /proc/interrupts | grep eth0 to confirm, and then use the numbers from that output in the dstat command to see the per-queue interrupt data.
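
A rough sketch of putting that together (assuming the interface shows up as ens5 on Ubuntu; on Amazon Linux it is typically eth0):

    # list the IRQ numbers assigned to the ENA queues
    grep ens5 /proc/interrupts | awk -F: '{print $1}' | xargs
    # then plug those numbers into dstat, e.g. if they turn out to be 40-43:
    dstat --cpu -y -i -I 40,41,42,43 --net-packets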

@romange commented Dec 30, 2021 via email

@romange commented May 26, 2022

It seems that the problem has been fixed.

romange closed this as completed on May 26, 2022