Use send/recv instead of write/read when the reactor_stream is a socket #5
I have been benchmarking and profiling libreactor to try and find any potential optimizations for what is already a very, very fast framework.
In my tests using the Techempower JSON benchmark running on an AWS EC2 instance, I was able to achieve a performance improvement of a little over 10% by using the send/recv syscalls in place of write/read when the stream is a socket.
From the attached flamegraphs (see below) of the syscalls made during the test, you can see that sys_read/sys_write pass through several intermediate VFS functions before finally reaching the socket layer, while send/recv enter it far more directly.
I would like to create a PR for libreactor where send/recv are used instead of write/read whenever the reactor_stream descriptor is a socket.
One approach could be to add a new member to the reactor_stream struct indicating whether the underlying descriptor is a socket.
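To make the idea concrete, here is a minimal standalone sketch; the `is_socket` flag and the `stream_write`/`stream_read` helper names are mine for illustration, not libreactor API. On a socket, `send(fd, buf, len, 0)` is semantically equivalent to `write(fd, buf, len)`, but it enters the kernel through the socket-specific path rather than the generic VFS read/write path:

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helpers illustrating the substitution: when the fd is known
 * to be a socket, call send()/recv() directly instead of write()/read().
 * With flags == 0 the semantics are identical, only the kernel entry path
 * differs. */
static ssize_t stream_write(int fd, const void *buf, size_t len, int is_socket)
{
  return is_socket ? send(fd, buf, len, 0) : write(fd, buf, len);
}

static ssize_t stream_read(int fd, void *buf, size_t len, int is_socket)
{
  return is_socket ? recv(fd, buf, len, 0) : read(fd, buf, len);
}
```

A socketpair makes it easy to sanity-check that the send/recv path behaves identically to write/read for stream data.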
FYI I also tried using
Very interesting, and slightly surprising. It has been some time since I tweaked this framework, especially for the Techempower benchmark, which is too flawed to work with, but I will try to reproduce this improvement. Appreciate the feedback and contribution!
Yea, I was surprised as well. I only decided to try it after profiling another framework and noticing the difference in the syscalls used.
The Techempower test results are definitely skewed by the fact that wrk is the bottleneck in their setup. When running my tests, I make sure that the client is more than twice the size of the server. I am working on some blog posts about running high-performance HTTP servers on AWS infrastructure.
Thanks for the quick response! Please let me know if I can provide any other info to help reproduce the results. I have a couple other performance optimizations that I am testing so I may open a few more issues this weekend.
@fredrikwidlund Looks good 👍 I tested it out and once I made a few modifications (ubuntu/glibc, fork/sched_setaffinity, bpf) the performance was in line with my test version with send/recv and no pthread usage.
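For context, the fork/sched_setaffinity/bpf part of my setup is along these lines; this is a sketch, not libreactor code, and `listener_for_cpu` is a hypothetical helper. Each worker is pinned to a core, all workers bind the same port with `SO_REUSEPORT`, and a two-instruction classic BPF program (returning `SKF_AD_CPU`) attached via `SO_ATTACH_REUSEPORT_CBPF` steers each incoming connection to the listener owned by the CPU that received it:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/filter.h>
#include <netinet/in.h>
#include <sched.h>
#include <stdint.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical per-worker setup: bind a SO_REUSEPORT listener, attach a
 * cBPF program that steers connections by CPU, and pin the worker to its
 * core with sched_setaffinity(). */
static int listener_for_cpu(int cpu, uint16_t port)
{
  int fd = socket(AF_INET, SOCK_STREAM, 0), one = 1;
  struct sockaddr_in addr = { .sin_family = AF_INET,
                              .sin_port = htons(port),
                              .sin_addr.s_addr = htonl(INADDR_LOOPBACK) };

  setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof one);
  if (bind(fd, (struct sockaddr *) &addr, sizeof addr) < 0 || listen(fd, 1024) < 0)
    {
      close(fd);
      return -1;
    }

  /* cBPF: "return current CPU number"; the kernel uses the return value as
   * an index into the reuseport socket group. */
  struct sock_filter code[] = {
    { BPF_LD | BPF_W | BPF_ABS, 0, 0, SKF_AD_OFF + SKF_AD_CPU },
    { BPF_RET | BPF_A, 0, 0, 0 },
  };
  struct sock_fprog prog = { .len = 2, .filter = code };
  setsockopt(fd, SOL_SOCKET, SO_ATTACH_REUSEPORT_CBPF, &prog, sizeof prog);

  /* pin this worker process to its CPU */
  cpu_set_t set;
  CPU_ZERO(&set);
  CPU_SET(cpu, &set);
  sched_setaffinity(0, sizeof set, &set);
  return fd;
}
```

In a real server you would fork one worker per core and call something like this in each child, so accepted connections stay local to the CPU that handled the handshake.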
A few things I noticed:
Overall I like that more low level functions are accessible in 2.0. Even though I expect more of the higher level constructs for
Hope this feedback is helpful.
Much appreciated! It's challenging to respond to specific benchmark results without going into detail about the specific setup. What parameters do you pass to wrk? What kind of variance do you see? Have you tried using multiple wrk instances on multiple individual nodes, since wrk is a bottleneck? That typically adds an interesting perspective.
Makes sense, Alpine is more likely to be optimized for size than speed. Compiler version is also a variable, and Alpine is still on v9 of gcc I believe, which could factor in.
Different perspectives on locality are definitely a huge factor, and used in the right context BPF could potentially have a major impact on performance. It is a larger puzzle to lay, though, and BPF is only one piece of it. In the Techempower benchmark, tuning the candidates for the specific hardware/virtualization/OS/load configuration is actually what in the end accounts for most of the final performance gains among, say, the top 10 candidates, but this tuning is completely meaningless in a general context outside the benchmark. This could be discussed in depth, but as the benchmark is fundamentally flawed in many aspects, I personally feel that is a waste of time and energy.
I assume you mean a negative impact? Theoretically it would be interesting to understand how omitting the frame pointer could have a negative impact. Again this would depend on the test but still interesting.
So enabling hardware optimizations actually has a huge impact on HTTP parsing, improving it by a factor of perhaps 4-5x. HTTP parsing is also what typically consumes the most CPU cycles in a request. It is interesting, then, that it could have a negative impact in your tests. I'd be happy to look into this as well if you share more about how you came to that conclusion.
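Concretely, I mean build flags along these lines (assuming gcc; the exact flags depend on the build setup):

```shell
# Let the compiler emit host-specific SIMD; HTTP parsers such as
# picohttpparser use SSE4.2 string instructions (PCMPESTRI) for byte
# scanning when __SSE4_2__ is defined.
CFLAGS="-O3 -march=native"

# Or, enabling only the relevant extension:
CFLAGS="-O3 -msse4.2"
```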
Regarding v2 in general, it is another work in progress and, as often happens with me at least, design decisions change with time. Right now I'm moving core logic from libreactor to libdynamic, for various reasons, making the reactor layer thinner. I'll see exactly where I end up with this.
In my setup, neither wrk nor the network is a bottleneck, I use a single large node rather than multiple. The wrk instance is 4x the size of the server instance and they are running in a Cluster placement group with 0.04 ms round trip latency.
Server: m5.large with 2 vCPUs
I was using gcc v9 on ubuntu as well.
I am only using the Techempower test as an established reference point that anybody can try themselves. The point of the research I am doing is to demonstrate the levels of performance that can be achieved on AWS and share some of the details of the performance tuning done at the os/network/software level. Even then, it is still a very specific micro benchmark, but I think some people may find the information interesting/useful, and I enjoy testing the limits of what is possible.
I do hope Techempower resolves some of the flaws in their tests before the next round, but that isn't crucial to what I am doing. Either way, I would love to see BPF included in libreactor 2.0, but off by default.
Sorry, I meant to say that those options did not have any impact, neither positive nor negative.
My concern with having
So I was wondering if maybe it should be a documented option for speeding up performance in libdynamic 2.0 and libreactor 2.0 rather than the default in their make files.
Well, I think this issue has been addressed by the inclusion of send/recv in the libreactor 2.0 branch, unless you were also planning on adding it to 1.x?