
Use send/recv instead of write/read when the reactor_stream is a socket #5

Closed · talawahtech opened this issue May 23, 2020 · 10 comments

@talawahtech (Contributor)

Hi,

I have been benchmarking and profiling libreactor to try and find any potential optimizations for what is already a very, very fast framework.

In my tests using the Techempower JSON benchmark running on an AWS EC2 instance, I was able to achieve a performance improvement of a little over 10% by using the send/recv functions (with the flags param set to 0) in place of write/read.

From the attached flamegraphs (see below) of the syscalls made during the test, you can see that sys_read/sys_write call several intermediate functions before finally calling inet_recvmsg and sock_sendmsg. So even though the behavior is functionally identical, there is a performance gain to be had that shows up in tests like this.
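For illustration, the change amounts to something like this (a minimal sketch with placeholder wrapper functions, not libreactor code):

```c
#include <sys/socket.h>
#include <unistd.h>

/* On a connected socket, send()/recv() with flags set to 0 are functionally
 * identical to write()/read(), but enter the socket layer directly instead
 * of going through the generic file read/write path. */
static ssize_t socket_write(int fd, const void *buf, size_t len)
{
  return send(fd, buf, len, 0); /* previously: write(fd, buf, len) */
}

static ssize_t socket_read(int fd, void *buf, size_t len)
{
  return recv(fd, buf, len, 0); /* previously: read(fd, buf, len) */
}
```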

I would like to create a PR for libreactor where reactor_stream_flush and reactor_stream_input conditionally use send and recv if the stream in question is known to be a socket. However, I wanted to find out what your preferred approach would be.

One approach could be to add a new member to the reactor_stream struct called is_socket. That value could be initialized when reactor_stream_open is called, either by passing an additional parameter or by checking it automatically using fstat/S_ISSOCK. I just wanted to get your feedback before I created a PR.
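A rough sketch of the fstat/S_ISSOCK variant, with illustrative names rather than the actual reactor_stream layout:

```c
#include <sys/socket.h>
#include <sys/stat.h>
#include <unistd.h>

/* Illustrative stand-in for the real reactor_stream struct. */
struct stream
{
  int fd;
  int is_socket;
};

/* Called once from the open path: cache whether fd is a socket. */
static void stream_open(struct stream *s, int fd)
{
  struct stat st;

  s->fd = fd;
  s->is_socket = fstat(fd, &st) == 0 && S_ISSOCK(st.st_mode);
}

/* Hot write path: branch once on the cached flag. */
static ssize_t stream_write(struct stream *s, const void *data, size_t size)
{
  return s->is_socket ? send(s->fd, data, size, 0) : write(s->fd, data, size);
}
```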

FYI, I also tried using sendto and sendmsg. The performance of sendto was the same, and sendmsg was actually worse.

write/read: [flamegraph]

send/recv: [flamegraph]

@fredrikwidlund (Owner)

Very interesting, and slightly surprising. It has been some time since I tweaked this framework, especially for the Techempower benchmark, which is too flawed to work with, but I will try to reproduce this improvement. Appreciate the feedback and contribution!

@talawahtech (Contributor, Author)

Yea, I was surprised as well. I only decided to try it after profiling another framework and noticing the difference in the syscalls used.

The Techempower test results are definitely skewed by the fact that wrk is the bottleneck in their setup. When running my tests, I make sure that the client is more than twice the size of the server. I am working on some blog posts about running high-performance HTTP servers on AWS infrastructure.

Thanks for the quick response! Please let me know if I can provide any other info to help reproduce the results. I have a couple other performance optimizations that I am testing so I may open a few more issues this weekend.

@fredrikwidlund (Owner)

You can try this out: https://github.com/fredrikwidlund/libreactor_httpd

@talawahtech (Contributor, Author)

@fredrikwidlund ok cool, I started testing it today; I should have some feedback by tomorrow.

@talawahtech (Contributor, Author) commented Jun 11, 2020

@fredrikwidlund Looks good 👍 I tested it out, and once I made a few modifications (ubuntu/glibc, fork/sched_setaffinity, bpf), the performance was in line with my test version using send/recv and no pthread usage.
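The fork/sched_setaffinity change is roughly along these lines (a sketch under my assumptions, not the exact patch; run_worker is a hypothetical stand-in for the per-process event loop):

```c
#define _GNU_SOURCE
#include <sched.h>
#include <unistd.h>

/* Fork one worker process per CPU and pin each child to its CPU,
 * avoiding pthreads entirely. */
static void spawn_pinned_workers(int ncpu, void (*run_worker)(void))
{
  cpu_set_t set;
  int cpu;

  for (cpu = 0; cpu < ncpu; cpu++)
    if (fork() == 0)
      {
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        (void) sched_setaffinity(0, sizeof set, &set); /* pin this child */
        run_worker();
        _exit(0);
      }
}
```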

A few things I noticed:

  1. Building using ubuntu/glibc instead of alpine/musl yields a runtime performance increase of around 5% in my tests on AWS.
  2. Turning on BPF using the code from your branch improves performance by another 2% or so (a sketch of the technique follows this list). Any plans to merge it? I would definitely leave it off by default as it can degrade performance if your network driver's RSS algorithm isn't evenly distributing packets. But it is definitely something I would like to use.
  3. -fomit-frame-pointer and -static (edit) DON'T have an impact on performance in my tests.
  4. -march=native impacts performance positively when it is used on all the libraries and the user code, but negatively when it is only used on the libraries and not the user code. I'm wondering if maybe that should be something that power users turn on using CFLAGS at build time, rather than a default.
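For context, the SO_REUSEPORT BPF trick from item 2 generally looks like the sketch below: a two-instruction classic BPF program that steers each new connection to the listener whose index matches the CPU handling the packet. This is my assumption of the approach, not the branch's actual code:

```c
#include <linux/filter.h>
#include <sys/socket.h>

/* Attach a classic BPF program to a SO_REUSEPORT group so each incoming
 * connection is handed to the listener whose index equals the CPU that
 * processed the packet, keeping RSS queue, softirq CPU, and worker aligned. */
static int attach_reuseport_cbpf(int listen_fd)
{
  struct sock_filter code[] = {
    { BPF_LD  | BPF_W | BPF_ABS, 0, 0, SKF_AD_OFF + SKF_AD_CPU }, /* A = current CPU */
    { BPF_RET | BPF_A, 0, 0, 0 },                                 /* return A      */
  };
  struct sock_fprog prog = { .len = 2, .filter = code };

  return setsockopt(listen_fd, SOL_SOCKET, SO_ATTACH_REUSEPORT_CBPF,
                    &prog, sizeof(prog));
}
```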

Overall I like that more low-level functions are accessible in 2.0. Even though I expect more of the higher-level constructs for http and server to be built, it is nice to be able to access http_date and http_request_read if you want to do something more low-level.

Speaking of http_date, I like the new function signature that allows it to be updated independently. I noticed it wasn't being updated in your example, but I assume that is just because you haven't added timers to version 2 yet.

Hope this feedback is helpful.

@fredrikwidlund (Owner) commented Jun 11, 2020

Much appreciated! It's challenging to respond to specific benchmark results without going into detail about the specific setup. What wrk parameters do you use? What kind of variance do you see? Have you tried using multiple wrk instances on multiple individual nodes, since wrk is a bottleneck? That typically adds an interesting perspective.

> Building using ubuntu/glibc instead of alpine/musl yields a runtime performance increase of around 5% in my tests on AWS.

Makes sense; Alpine is more likely to be optimized for size than speed. Compiler version is also a variable, and Alpine is still on gcc v9 I believe, which could factor in.

> Turning on BPF using the code from your branch improves performance by another 2% or so. Any plans to merge it? I would definitely leave it off by default as it can degrade performance if your network driver's RSS algorithm isn't evenly distributing packets. But it is definitely something I would like to use.

Different perspectives on locality are definitely a huge factor, and used in the right context BPF could potentially have a major impact on performance. It is a larger puzzle to solve though, and BPF is only one piece of it. In the Techempower benchmark, tuning the candidates for the specific hardware/virtualization/os/load configuration is actually what in the end makes up most of the final performance gains among, say, the top 10 candidates, but this tuning is completely meaningless in a general context outside the benchmark. This could be discussed in depth, but as the benchmark is fundamentally flawed in many respects, I personally feel this is a waste of time and energy.

> -fomit-frame-pointer and -static have an impact on performance in my tests.

I assume you mean a negative impact? Theoretically it would be interesting to understand how omitting the frame pointer could have a negative impact. Again, this would depend on the test, but it is still interesting.

> -march=native impacts performance positively when it is used on all the libraries and the user code, but negatively when it is only used on the libraries and not the user code. I'm wondering if maybe that should be something that power users turn on using CFLAGS at build time, rather than a default.

So enabling hardware optimizations actually has a huge impact on HTTP parsing, improving it by a factor of perhaps 4-5x. HTTP parsing is also what typically consumes the most CPU cycles in a request. It is interesting, though, that it can have a negative impact in your tests. I'd be happy to look into this as well if you share more about how you came to this conclusion.

Regarding v2 in general, it is another work in progress, and as often happens, with me at least, design decisions change with time. Right now I'm moving core logic from libreactor to libdynamic, for various reasons, making the reactor layer thinner. We'll see exactly where I end up with this.

@fredrikwidlund (Owner)

Perhaps we can move this issue to https://github.com/fredrikwidlund/libreactor_httpd since it is more about a specific benchmark context?

@talawahtech (Contributor, Author)

> Much appreciated! It's challenging to respond to specific benchmark results without going into detail about the specific setup. What wrk parameters do you use? What kind of variance do you see? Have you tried using multiple wrk instances on multiple individual nodes, since wrk is a bottleneck? That typically adds an interesting perspective.

In my setup, neither wrk nor the network is a bottleneck; I use a single large node rather than multiple. The wrk instance is 4x the size of the server instance, and they are running in a cluster placement group with 0.04 ms round-trip latency.

Server: m5.large with 2 vCPUs
Client: c5.2xlarge with 8 vCPUs

wrk -H 'Host: server.tfb' -H 'Accept: application/json,text/html;q=0.9,application/xhtml+xml;q=0.9,application/xml;q=0.8,*/*;q=0.7' -H 'Connection: keep-alive' --latency -d 15 -c 128 -t 8 "http://server.tfb:8080/json"

> Makes sense; Alpine is more likely to be optimized for size than speed. Compiler version is also a variable, and Alpine is still on gcc v9 I believe, which could factor in.

I was using gcc v9 on ubuntu as well.

> Different perspectives on locality are definitely a huge factor, and used in the right context BPF could potentially have a major impact on performance. It is a larger puzzle to solve though, and BPF is only one piece of it. In the Techempower benchmark, tuning the candidates for the specific hardware/virtualization/os/load configuration is actually what in the end makes up most of the final performance gains among, say, the top 10 candidates, but this tuning is completely meaningless in a general context outside the benchmark. This could be discussed in depth, but as the benchmark is fundamentally flawed in many respects, I personally feel this is a waste of time and energy.

I am only using the Techempower test as an established reference point that anybody can try themselves. The point of the research I am doing is to demonstrate the levels of performance that can be achieved on AWS and share some of the details of the performance tuning done at the os/network/software level. Even then, it is still a very specific micro benchmark, but I think some people may find the information interesting/useful, and I enjoy testing the limits of what is possible.

I do hope Techempower resolves some of the flaws in their tests before the next round, but that isn't crucial to what I am doing. Either way, I would love to see BPF included in libreactor 2.0, but off by default.

> -fomit-frame-pointer and -static have an impact on performance in my tests.

> I assume you mean a negative impact? Theoretically it would be interesting to understand how omitting the frame pointer could have a negative impact. Again, this would depend on the test, but it is still interesting.

Sorry, I meant to say that those options did not have any impact, neither positive nor negative.

> -march=native impacts performance positively when it is used on all the libraries and the user code, but negatively when it is only used on the libraries and not the user code. I'm wondering if maybe that should be something that power users turn on using CFLAGS at build time, rather than a default.

> So enabling hardware optimizations actually has a huge impact on HTTP parsing, improving it by a factor of perhaps 4-5x. HTTP parsing is also what typically consumes the most CPU cycles in a request. It is interesting, though, that it can have a negative impact in your tests. I'd be happy to look into this as well if you share more about how you came to this conclusion.

My concern with having -march=native on by default for libdynamic and libreactor is that:

  1. If a user is building a container using a CI service like CircleCI or GitHub Actions the "native" build platform may not match the deployment platform.
  2. In my tests, if -march=native is on by default for libdynamic 2.0 and libreactor 2.0 but I then forget to use that flag when I build my own server code, I notice a drop in performance that is worse than when -march=native is removed everywhere. I tested this by removing it from this line and comparing the performance.

So I was wondering if maybe it should be a documented option for speeding up performance in libdynamic 2.0 and libreactor 2.0, rather than the default in their makefiles.
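For what it's worth, the opt-in could be as simple as a conditional default in the Makefile (a hypothetical fragment; the actual build files may be structured differently):

```make
# Portable baseline by default; power users opt in at build time with:
#   make CFLAGS="-O3 -march=native"
CFLAGS ?= -O3
```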

@talawahtech (Contributor, Author)

> Perhaps we can move this issue to https://github.com/fredrikwidlund/libreactor_httpd since it is more about a specific benchmark context?

Well, I think this issue has been addressed by the inclusion of send/recv in the libreactor 2.0 branch, unless you were also planning on adding it to 1.x.

@fredrikwidlund (Owner)

Indeed, I rather meant the general discussion.
