Network performance declines when I run a high req/s HTTP benchmark between two instances on the same dedicated host #148

Closed
talawahtech opened this issue Nov 15, 2020 · 9 comments

Comments

@talawahtech

talawahtech commented Nov 15, 2020

Background

I am testing a high performance, low latency benchmark between a c5n.4xlarge client and a c5n.xlarge server. For most of my testing I just put them both in a cluster placement group, and that was good enough, but for the final round of tests I wanted to guarantee that the latency between hosts doesn't change much (even across stop/starts), and I wanted to avoid any potential noisy neighbor issues. To achieve this I decided to launch both instances on the same dedicated host.

For some reason the console prevents me from launching a c5n.4xlarge and a c5n.xlarge on the same dedicated host. To work around this I decided to launch a c5n.9xlarge for the server, but use the EC2 CPU options configuration to restrict it to only 4 vCPUs (2 cores, 2 threads per core). This seemed to work fine at first glance: the OS reported only 4 vCPUs, ethtool reported only 4 network queues, and stress-ng tests showed performance consistent with a standalone c5n.xlarge (or just a little faster).
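For reference, the equivalent launch with the AWS CLI would look roughly like this (the AMI, subnet and host IDs are placeholders, not the actual values used):

# Launch a c5n.9xlarge limited to 2 cores x 2 threads (4 vCPUs) on a specific dedicated host
aws ec2 run-instances \
  --image-id ami-xxxxxxxx \
  --instance-type c5n.9xlarge \
  --cpu-options CoreCount=2,ThreadsPerCore=2 \
  --placement Tenancy=host,HostId=h-xxxxxxxx \
  --subnet-id subnet-xxxxxxxx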

Issue

When I run my benchmark between the two instances on the same dedicated host, overall performance is 20% lower than expected, and p99 latency is almost twice as much as it should be.

On further investigation I realized that in this configuration (a 9xlarge pretending to be an xlarge), hardware interrupts were no longer being distributed evenly across the 4 CPUs, even though the OS reported 4 vCPUs and 4 network queues. Additionally, the distribution of the interrupts changed while the benchmark was running, even though a consistent set of connections is used for the entire benchmark.

At first I thought the issue was related to the fact that I was using EC2 CPU Options to restrict the number of CPUs, but further investigation confirmed this is not the case. Instead, running the benchmark between two instances on the same dedicated host appears to be the source of the decline in performance.
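For completeness, the RSS indirection table and hash key can also be dumped and compared between the two setups (on kernel/driver combinations that support it):

Command: sudo ethtool -x eth0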

@akiyano
Contributor

akiyano commented Nov 15, 2020

Hi @talawahtech,

We're looking into your issue and will update you in the next few days.

Thanks,
Arthur

@akiyano
Contributor

akiyano commented Nov 15, 2020

In the meantime, please contact me at akiyano@amazon.com so we can better understand what you are seeing.
Thanks,
Arthur

@talawahtech
Author

talawahtech commented Nov 15, 2020

Hi @akiyano,

Thanks for the quick response; here are some additional details that should help. I forgot to mention that at boot time I disable irqbalance and configure a 1-to-1 mapping between queues/IRQs/CPUs by updating /proc/irq/IRQ#/smp_affinity_list.

So queue 0 -> IRQ 27 -> CPU 0 and queue 1 -> IRQ 28 -> CPU 1 etc.
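In shell terms the boot-time setup is roughly equivalent to this (minimal sketch; IRQ numbers assumed to start at 27 as in the mapping above):

# Stop irqbalance so it doesn't rewrite the affinities
sudo systemctl stop irqbalance
sudo systemctl disable irqbalance

# Pin each eth0 Tx-Rx queue IRQ to its matching CPU (queue 0 -> IRQ 27 -> CPU 0, etc.)
for cpu in 0 1 2 3; do
  irq=$((27 + cpu))
  echo $cpu | sudo tee /proc/irq/$irq/smp_affinity_list
done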

Here are some stats taken when running the benchmark on a standard c5n.xlarge vs. a c5n.9xlarge modified to have 4 vCPUs using CPU options; they should help illustrate the issue. In both cases the same benchmark is run after a reboot, and the specified commands are run to collect stats.

As you can see, the baseline case shows an even distribution of packets/interrupts across queues/IRQs/CPUs, while the second set of results shows much greater inconsistency and a much higher number of context switches. Let me know if there is any other information I can provide.
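(The stats below were gathered with the individual commands shown, but they can also be captured in one pass with something like this, assuming eth0 and IRQs 27-30 as above:)

# Per-queue packet counts, interrupt distribution, and a few seconds of interrupt/context-switch rates
sudo ethtool -S eth0 | grep rx_cnt
grep eth0 /proc/interrupts
dstat -i -I 27,28,29,30 -y 1 10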

c5n.xlarge (baseline)

Command: sudo ethtool -S eth0 | grep rx_cnt

queue_0_rx_cnt: 2984468
queue_1_rx_cnt: 2997752
queue_2_rx_cnt: 2991696
queue_3_rx_cnt: 2989499

Command: dstat -i -I 27,28,29,30 -y

-------interrupts------ ---system--
  27    28    29    30 | int   csw
3805  3836  3827  3824 |  16k 1274
3809  3832  3832  3826 |  16k 1047
3813  3833  3829  3821 |  17k 1121
3809  3838  3829  3823 |  17k 1172
3804  3835  3829  3822 |  16k 1132
3804  3832  3827  3824 |  16k 1132
3811  3838  3826  3825 |  16k 1229

Command: cat /proc/interrupts

           CPU0       CPU1       CPU2       CPU3
 27:      40459          0          0          0   PCI-MSI 81921-edge      eth0-Tx-Rx-0
 28:        100      39293          0          0   PCI-MSI 81922-edge      eth0-Tx-Rx-1
 29:         54          0      39657          0   PCI-MSI 81923-edge      eth0-Tx-Rx-2
 30:        115          0          0      40876   PCI-MSI 81924-edge      eth0-Tx-Rx-3

c5n.9xlarge (CPU options: 2 cores, 2 threads per core)

Command: sudo ethtool -S eth0 | grep rx_cnt

queue_0_rx_cnt: 2083909
queue_1_rx_cnt: 2548775
queue_2_rx_cnt: 2801675
queue_3_rx_cnt: 2424103

Command: dstat -i -I 27,28,29,30 -y

-------interrupts------ ---system--
  27    28    29    30 | int   csw
4002  3619  3603  3619 |  16k   11k
3865  3638  3599  3618 |  16k   12k
4569  3636  3614  3616 |  17k   11k
4461  3639  3596  3626 |  16k   11k
3612  3629  3600  3746 |  16k   12k
4482  3621  3595  6465 |  19k   11k
3630  3623  3593  5874 |  18k   11k

Command: cat /proc/interrupts

           CPU0       CPU1       CPU2       CPU3
 27:      43028          0          0          0   PCI-MSI 81921-edge      eth0-Tx-Rx-0
 28:         67      37577          0          0   PCI-MSI 81922-edge      eth0-Tx-Rx-1
 29:         54          0      36974          0   PCI-MSI 81923-edge      eth0-Tx-Rx-2
 30:         75          0          0      43426   PCI-MSI 81924-edge      eth0-Tx-Rx-3

@talawahtech talawahtech changed the title RSS not distributing hardware interrupts evenly when using CPU options to reduce core count Network performance declines when I run a high req/s HTTP benchmark between two instances on the same dedicated host Nov 18, 2020
@talawahtech
Author

talawahtech commented Nov 18, 2020

Adding some more details. The benchmark in question is a hyper-optimized version of the Techempower JSON benchmark using wrk on the client, and libreactor on the server. The app, OS and networking configuration have been carefully tuned (to an extreme degree) to achieve maximum performance. Here are the ping -U times showing the latency between the instances, and the benchmark results as reported by wrk for comparison:

c5n.4xlarge client and c5n.xlarge server in a cluster placement group

Command: sudo ping -i 0 -w 10 -s 25 -q -U server.tfb

235130 packets transmitted, 235129 received, 0% packet loss, time 9999ms
rtt min/avg/max/mdev = 0.037/0.042/0.264/0.006 ms, ipg/ewma 0.042/0.042 ms

Command: wrk --latency "http://server.tfb:8080/json" -d 10 -c 256 -t 16

Running 10s test @ http://server.tfb:8080/json
  16 threads and 256 connections
  Thread Stats   Avg     Stdev       Max       Min   +/- Stdev
    Latency   206.30us   25.03us  732.00us   71.00us   68.53%
    Req/Sec    75.34k     0.87k    78.19k    73.33k    65.97%
  Latency Distribution
     50%  205.00us
     75%  222.00us
     90%  239.00us
     99%  270.00us
  11994466 requests in 10.00s, 1.73GB read
Requests/sec: 1199438.20
Transfer/sec:    177.30MB

c5n.9xlarge client (with 16 vCPUs) and c5n.9xlarge server (with 4 vCPUs) on the same Dedicated Host

Command: sudo ping -i 0 -w 10 -s 25 -q -U server.tfb

287867 packets transmitted, 287866 received, 0% packet loss, time 9999ms
rtt min/avg/max/mdev = 0.029/0.034/0.136/0.006 ms, ipg/ewma 0.034/0.035 ms

Command: wrk --latency "http://server.tfb:8080/json" -d 10 -c 256 -t 16

Running 10s test @ http://server.tfb-ue2:8080/json
  16 threads and 256 connections
  Thread Stats   Avg     Stdev       Max       Min   +/- Stdev
    Latency   249.81us   85.23us    3.23ms   64.00us   78.25%
    Req/Sec    63.13k     6.00k    78.28k    51.06k    60.61%
  Latency Distribution
     50%  234.00us
     75%  283.00us
     90%  357.00us
     99%  498.00us
  10053017 requests in 10.00s, 1.45GB read
Requests/sec: 1005295.47
Transfer/sec:    148.60MB

I am working on a blog post and CloudFormation template for the benchmark, which should make it possible for you to reproduce the benchmark exactly if necessary.

@talawahtech
Author

talawahtech commented Nov 18, 2020

I would also like to point out that tests where the server is on the dedicated host, but the client isn't (or vice versa), perform as expected. Average req/s is usually in the 1.1M - 1.15M range, which is normal since the instances are not in the same cluster placement group in that scenario, and therefore the latency between them is higher.

c5n.4xlarge client not on the dedicated host and c5n.9xlarge server (with 4 vCPUs) on the dedicated host

Command: sudo ping -i 0 -w 10 -s 25 -q -U server.tfb

125445 packets transmitted, 125445 received, 0% packet loss, time 9999ms
rtt min/avg/max/mdev = 0.063/0.079/0.733/0.021 ms, ipg/ewma 0.079/0.079 ms

Command: wrk --latency "http://server.tfb:8080/json" -d 10 -c 256 -t 16

Running 10s test @ http://10.112.181.154:8080/json
  16 threads and 256 connections
  Thread Stats   Avg     Stdev       Max       Min   +/- Stdev
    Latency   284.36us    2.82ms  209.84ms   89.00us   99.92%
    Req/Sec    69.56k     0.87k    72.09k    64.45k    73.67%
  Latency Distribution
     50%  219.00us
     75%  237.00us
     90%  258.00us
     99%  376.00us
  11075403 requests in 10.00s, 1.60GB read
Requests/sec: 1107533.10
Transfer/sec:    163.71MB

So it really seems that it is only when both instances are on the same host that things don't work as expected.

@akiyano
Contributor

akiyano commented Nov 19, 2020

Hi @talawahtech,

Thanks for the extra information.
We are trying to reproduce the issue ourselves.
Please update us once you have the blog post + CloudFormation template, which will help make sure we are hitting your exact issue.

Thanks,
Arthur

@talawahtech
Author

Hey @akiyano,

I know I had sent you a preliminary template by email, but it looks like I forgot to send the link to the final post. I just remembered this issue while updating another one. Here's the link in case it is of any further use: https://talawah.io/blog/extreme-http-performance-tuning-one-point-two-million/

@akiyano
Contributor

akiyano commented Jul 29, 2021

Hi @talawahtech,

That is some awesome post!
Will look into it once we get to handling this one.

Thanks for the update,
Arthur

@talawahtech
Author

> That is some awesome post!

Thanks!

> Will look into it once we get to handling this one.

Ok cool. Let me know if you have any feedback, or if there is any other information I can provide when you do.

@davidarinzon closed this as not planned on Apr 24, 2023