Network performance declines when I run a high req/s HTTP benchmark between two instances on the same dedicated host #148
Comments
Hi @talawahtech, We're looking into your issue and will update you in the next few days. Thanks,
Meanwhile please contact me via akiyano@amazon.com so we can better understand what you are seeing.
Hi @akiyano, Thanks for the quick response, here are some additional details that should help. I forgot to mention that at boot time I disable irqbalance and configure a 1-to-1 mapping between queues/IRQs/CPUs by updating the IRQ affinity settings, so queue 0 -> IRQ 27 -> CPU 0, queue 1 -> IRQ 28 -> CPU 1, etc. (a sketch of that pinning is included after the stats below). Here are some stats taken when running the benchmark on a standard c5n.xlarge vs a c5n.9xlarge modified to have 4 vCPUs using CPU options; they should help illustrate the issue. In both cases the same benchmark is run after a reboot and the specified commands are run to collect stats. As you can see, the baseline case shows an even distribution of packets/interrupts across queues/IRQs/CPUs, while the second set of results shows much greater inconsistency and a much higher number of context switches. Let me know if there is any other information I can provide.
c5n.xlarge (baseline)
Command:
Command:
Command:
c5n.9xlarge (CPU options: 2 cores, 2 threads per core)
Command:
Command:
Command:
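For reference, the boot-time pinning mentioned above looks roughly like the sketch below. The interface name (eth0), the IRQ-name pattern and the use of smp_affinity_list are assumptions about a typical ENA setup, not a copy of the actual script.

```
#!/bin/bash
# Sketch of the boot-time setup described above (assumed details, not the exact script).

# Stop irqbalance so it cannot overwrite the affinities set below.
systemctl stop irqbalance

# Pin each network queue's IRQ to its matching CPU: queue 0 -> CPU 0, queue 1 -> CPU 1, ...
# The interface name and IRQ-name pattern ("eth0-Tx-Rx") are assumptions; adjust as needed.
cpu=0
for irq in $(awk -F: '/eth0-Tx-Rx/ {gsub(/ /, "", $1); print $1}' /proc/interrupts); do
  echo "Pinning IRQ ${irq} to CPU ${cpu}"
  echo "${cpu}" > "/proc/irq/${irq}/smp_affinity_list"
  cpu=$((cpu + 1))
done
```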
Adding some more details. The benchmark in question is a hyper-optimized version of the Techempower JSON benchmark using wrk on the client and libreactor on the server (a representative wrk invocation is sketched after these results). The app, OS and networking configuration have been carefully tuned (to an extreme degree) to achieve maximum performance. Here are the results for both configurations:
c5.4xlarge client and c5.xlarge server in cluster placement group
Command:
Command:
c5.9xlarge client (with 16 vCPUs) and c5.9xlarge server (with 4 vCPUs) on the same Dedicated Host
Command:
Command:
I am working on a blog post and CloudFormation template for the benchmark which should make it possible for you to reproduce it exactly if necessary.
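For context, the load is generated with wrk against the libreactor JSON endpoint. A representative invocation is sketched below; the thread count, connection count, duration and server address are placeholders rather than the exact parameters used in these runs.

```
# Hypothetical wrk invocation; thread/connection counts, duration and address are placeholders.
wrk --threads 16 --connections 256 --duration 30s --latency http://10.0.0.10:8080/json
```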
I would also like to point out that tests where the server is on the dedicated host, but the client isn't (or vice versa), perform as expected. Average req/s is usually in the 1.1M - 1.15M range, which is normal since the instances are not in the same cluster placement group in that scenario, and therefore the latency between them is higher.
c5.4xlarge client not on dedicated host and c5.9xlarge server (with 4 vCPUs) on dedicated host
Command:
Command:
So it really seems that it is only when both instances are on the same host that things don't work as expected.
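For anyone trying to reproduce this, one way to confirm whether two instances actually share the same dedicated host is to compare their placement host IDs (a sketch using the AWS CLI; the instance IDs below are placeholders):

```
# Identical Placement.HostId values mean the two instances share the same dedicated host.
# The instance IDs are placeholders.
aws ec2 describe-instances \
  --instance-ids i-0123456789abcdef0 i-0abcdef1234567890 \
  --query 'Reservations[].Instances[].[InstanceId,Placement.HostId]' \
  --output table
```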
Hi @talawahtech, Thanks for the extra information.
Hey @akiyano, I know I had sent you a preliminary template by email, but it looks like I forgot to send the link to the final post. Just remembered this issue while updating another one. Here's the link in case it is of any further use: https://talawah.io/blog/extreme-http-performance-tuning-one-point-two-million/
Hi @talawahtech, That is some awesome post! Thanks for the update,
Thanks!
Ok cool. Let me know if you have any feedback, or if there is any other information that I can provide when you do.
Background
I am testing a high performance, low latency benchmark between a c5n.4xlarge client and a c5n.xlarge server. For most of my testing I just put them both in a cluster placement group, and that was good enough, but for the final round of tests I wanted to guarantee that the latency between hosts doesn't change much (even across stop/starts), and I wanted to avoid any potential noisy neighbor issues. To achieve this I decided to launch both instances on the same dedicated host.
For some reason the console prevents me from launching a c5n.4xlarge and a c5n.xlarge on the same dedicated host. To work around this I decided to just go ahead and launch a c5n.9xlarge for the server, but then use the EC2 CPU options configuration to restrict it to only 4 vCPUs (2 cores, 2 threads per core). This seemed to work fine at first glance: only 4 vCPUs were reported by the OS, only 4 network queues were reported by ethtool, and stress-ng tests showed performance to be consistent with a standalone c5n.xlarge (or just a little faster).
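An AWS CLI equivalent of that workaround would look roughly like the sketch below (I set this up through the console; the AMI, subnet, key and host IDs here are placeholders):

```
# Allocate a dedicated host capable of holding a c5n.9xlarge, then launch the server
# instance onto it restricted to 2 cores x 2 threads (4 vCPUs). All IDs are placeholders.
aws ec2 allocate-hosts \
  --instance-type c5n.9xlarge \
  --availability-zone us-east-1a \
  --quantity 1

aws ec2 run-instances \
  --image-id ami-0123456789abcdef0 \
  --instance-type c5n.9xlarge \
  --cpu-options "CoreCount=2,ThreadsPerCore=2" \
  --placement "Tenancy=host,HostId=h-0123456789abcdef0" \
  --subnet-id subnet-0123456789abcdef0 \
  --key-name my-key-pair
```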
Issue
When I run my benchmark between the two instances on the same dedicated host, overall performance is 20% lower than expected, and p99 latency is almost twice as much as it should be.
On further investigation I realized that in this configuration (9xlarge pretending to be xlarge), hardware interrupts were no longer being distributed evenly across the 4 CPUs. Even though the OS is reporting 4 vCPUs and 4 network queues, the actual interrupts were not being distributed evenly. Additionally, the distribution of the interrupts was changing while the benchmark was running, even though a consistent set of connections is used for the entire benchmark.

At first I thought the issue was related to the fact that I was using EC2 CPU Options to restrict the number of CPUs, but further investigation confirms this is not the case. Instead, the fact that I am running the benchmark between two instances on the same dedicated host seems to be the source of the decline in performance.
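A sketch of how the uneven distribution can be observed while the benchmark runs (the interface name and counter names are assumptions and may differ by driver version):

```
# Per-CPU interrupt counts for each queue (columns are CPUs); sample this repeatedly
# during the run to see the distribution shift. "eth0" is an assumption.
grep eth0 /proc/interrupts

# Per-queue packet counters from the driver; exact counter names vary by driver version.
ethtool -S eth0 | grep -i 'queue.*cnt'

# Context switches per second (cswch/s), which were noticeably higher in the degraded case.
sar -w 1 5
```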