Performance Issues with Envoy #5536
Comments
I just ran a quick test and got similar results. I ran a CPU profiler on Envoy during the test, and it highlighted part of the reason for the performance difference: wrk uses HTTP/1.1, and Envoy uses a relatively CPU-intensive HTTP/1 parser. HTTP/1 parser performance is a longstanding issue; #2880 has some context. |
What haproxy version? |
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions. |
@mattklein123 what do you think about tagging this one as “help wanted”? There definitely is performance work that would be worthwhile to do. I probably won’t have much time to contribute PRs over the next couple of months, but I’m happy to share profiling data and optimization ideas with anybody who’s interested in working on it. |
@brian-pane sure will do. |
Why not use picohttpparser? That is the fastest HTTP parser so far; it can parse about 5 million requests per second. |
For anybody who's interested in doing some optimization work, here's a CPU profile of the latest Envoy trunk proxying small HTTP/1.1 requests and responses on Linux 4.15:
[CPU profile attachment not preserved] |
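For anyone who wants to capture a comparable profile, a rough recipe (assuming Linux `perf` and Brendan Gregg's FlameGraph scripts are available; the PID lookup and sampling duration are placeholders):

```sh
# Sample the running Envoy process (call graphs, 99 Hz) for 30 seconds while the load test runs.
sudo perf record -g -F 99 -p "$(pgrep -x envoy)" -- sleep 30

# Interactive breakdown of where CPU time went.
sudo perf report

# Optional flame graph (needs https://github.com/brendangregg/FlameGraph on the PATH).
sudo perf script | stackcollapse-perf.pl | flamegraph.pl > envoy-cpu.svg
```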
I ran into some similar performance issues with Envoy recently using fortio. First, I tested Envoy using only 1 client at 1000 requests per second:
which is pretty comparable with native performance. But then I increased the number of clients to 100 while keeping the total requests per second the same, i.e., each of the 100 clients would be doing 10 qps, and here is the result:
It's a pretty significant difference in terms of average latency and long-tail latency. @mattklein123 @brian-pane any inkling where I should be poking around here? |
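The exact fortio invocation isn't preserved above; the two scenarios can be reproduced with fortio's load subcommand roughly like this (a sketch; the target address, port, and duration are placeholders):

```sh
# Placeholder target; substitute the address/port Envoy is listening on.
TARGET="http://10.0.0.1:8090/"

# Scenario 1: one connection driving the full 1000 qps.
fortio load -qps 1000 -c 1 -t 60s "$TARGET"

# Scenario 2: the same aggregate 1000 qps spread over 100 connections (~10 qps each).
fortio load -qps 1000 -c 100 -t 60s "$TARGET"
```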
Sorry, nothing immediately comes to mind. One of the first things I would probably look at is whether Fortio is itself introducing this latency. If that's not the case, it would need some more detailed investigation. |
So without envoy in the middle, fortio running as regular linux processes between 2 of my VMs looks like:
With a freshly compiled envoy sitting in front of the fortio server, the same experiment looks like this:
There's a pretty significant increase in both average and long-tail latency after putting Envoy in there. I started Envoy by
where envoy.json looks like:
This is almost like a hello-world example of using envoy, so I'm a bit surprised that the latency number went up by 3x. I just started looking at envoy, so if there's anything obviously wrong with my setup, please let me know. |
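The envoy.json contents and the exact startup command did not survive in the quoted text above. Purely as an illustration of the kind of hello-world setup being described, a v2-era static config that routes everything to a single upstream might look roughly like the sketch below (addresses, ports, and the cluster name are placeholders; newer Envoy versions would use typed_config and load_assignment instead):

```sh
# Illustrative only -- not the poster's actual config.
cat > envoy.json <<'EOF'
{
  "static_resources": {
    "listeners": [{
      "address": { "socket_address": { "address": "0.0.0.0", "port_value": 8090 } },
      "filter_chains": [{
        "filters": [{
          "name": "envoy.http_connection_manager",
          "config": {
            "stat_prefix": "ingress_http",
            "codec_type": "AUTO",
            "route_config": {
              "virtual_hosts": [{
                "name": "backend",
                "domains": ["*"],
                "routes": [{ "match": { "prefix": "/" }, "route": { "cluster": "fortio_server" } }]
              }]
            },
            "http_filters": [{ "name": "envoy.router" }]
          }
        }]
      }]
    }],
    "clusters": [{
      "name": "fortio_server",
      "connect_timeout": "0.25s",
      "type": "STRICT_DNS",
      "lb_policy": "ROUND_ROBIN",
      "hosts": [{ "socket_address": { "address": "10.0.0.2", "port_value": 8080 } }]
    }]
  }
}
EOF

# A single worker thread keeps the comparison simple; bump --concurrency to add workers.
./envoy -c envoy.json --concurrency 1
```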
@oschaaf any thoughts on this thread? Could we use Nighthawk? |
FWIW, here is the result for the same setup but switching out envoy for haproxy (using the haproxy config in the original post):
[fortio results not preserved] |
@htuch interesting, thanks for looping me in. I'll try to set up a repro and analyse. It will be a nice chance to give Nighthawk a spin. |
I ran an experiment on a low-latency tuned system to compare average latencies across the different setups [1]. [1] https://gist.github.com/oschaaf/a04a1d4a9fa3b571ad80ba2449761616#file-mean-summary-md |
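(For anyone repeating the comparison, the client side of such a Nighthawk run would look roughly like the following. The binary path, pinned core, target, and flag spellings are assumptions from memory of Nighthawk's CLI; check nighthawk_client --help.)

```sh
# Pin the load generator to its own core and drive a paced 1000 rps for 30 seconds
# against the proxy under test (placeholder address/port).
taskset -c 2 ./nighthawk_client --rps 1000 --duration 30 --connections 1 http://127.0.0.1:8090/
```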
Can you check how many TCP connections are being opened from the proxy to the upstream in all 3 cases? |
@oschaaf Looks like your fortio results using envoy and haproxy are pretty consistent with the ones I've posted earlier. So it's great this problem can be reproduced. Looking forward to seeing something that can be fixed to bring envoy's latency number down. |
So after thinking about it, my 2p is that sanity-checking the numbers is warranted. In the tests I ran, the mean latencies are reported by the tools as follows:
Wildly different numbers are reported. Fortio and wrk2 report numbers that seem pretty far away from the truth in terms of absolute latency [1]. Why could that be? We test @ 1000 rps. Given ideal request-release timings, there ought to be no overlap between request/response pairs. That's nice, because queuing/batching effects could complicate our reasoning and skew our measurements. The problem is that both fortio and wrk2 rely on sleep for release timings, which isn't going to be super accurate. It is worth noting that wrk2 will sleep at a 1 ms resolution, exactly on par with our test frequency. Also, the CPU hosting the client is probably far from saturated. This allows power-state management to lower the clock frequency, and other processes/threads may get scheduled onto the same CPU. All these things can add significant noise to the experiment (except for Nighthawk, which performs a spin/yield loop when idle to avoid CPU frequency scaling).

So given even and precise pacing of request-release timings, it looks like HAProxy is able to post a somewhat lower mean. But no, I don't feel we can confidently conclude Envoy is adding multiple ms to the mean in that scenario. Furthermore, I do think it is worth adding an option to Nighthawk to intentionally release batched sets of requests, so we can make a differential diagnosis between:
[1] Lastly, I think it's worth sharing the earlier analysis of wrk and Fortio we did. |
This is super interesting. Thanks for the detailed analysis comparing different latency measurement tools. It seems wrk and fortio generate different absolute numbers, but their results are on the same order of magnitude, i.e. a few ms. Nighthawk, on the other hand, produces numbers that are an order of magnitude smaller. In my environment, between the 2 VMs, ping shows about 0.5ms round-trip latency, so when I got a fortio result of 2ms average latency, it made some sense to me. And then, putting haproxy and envoy between the 2 VMs, the round-trip number further increased to 4 to 7ms. It's not ideal, but it still made sense as we are adding software on the critical path. What is the ping time in your testbed setup? Nighthawk showing 0.09ms average latency is kind of bizarrely small. On my relatively idle VM, ping on the loopback device shows a round-trip time of 0.05ms. Are your VMs provisioned on the same host machine so that the network packets don't leave the host NIC? I might have missed your testbed setup if this was mentioned somewhere. |
OK, never mind the question about the testbed setup. You're indeed using the loopback device in your experiment (https://gist.github.com/oschaaf/a04a1d4a9fa3b571ad80ba2449761616#file-mean-summary-md) |
I ran the test on a z620 workstation, directly on the hardware, with the client and server running on two physically different hardware CPUs, using taskset to pin each to a core. Ping times:
[ping output not preserved] |
@oschaaf In fortio, when |
@huang195 yeah, above I ended with
As fortio will have to rely on the Go scheduler for fairness, and because of garbage collection stalls, I think it's going to be (very) hard to do reliable and precise request-release timing. |
@oschaaf Sorry, I must have missed that comment. I ended up doing more experiments using fortio by adding some jitter between requests, so we don't get these mini-bursts of requests. The latency results do look much better, while keeping everything else the same.

Related to this topic: in the context of Istio, additional filters are added to Envoy, e.g., to report telemetry and tracing data. Most of these additional filters are executed off the critical path, so in theory they shouldn't have a big impact on end-to-end latency. But they do. The reason is that every Envoy worker thread is single-threaded and runs an event loop. While the thread is in the middle of preparing a batch report, serializing and compressing it before putting it onto the wire, one or more requests could have arrived. The low-priority work of preparing the batch report does not get interrupted by the higher-priority work of handling requests, and thus can increase latency. First of all, is this a correct understanding of how things work in an Envoy worker thread? If so, is there any work in the pipeline that would address this issue? |
@huang195 meantime, I've drafted an approach for |
@huang195 yeah, basically, that's the right model. It depends on whether the work is being done on the main thread or a worker thread. But doing a dispatcher post on worker thread A back to A's dispatcher for later processing is just kicking the can down the road and will clearly add latency to any performance test. I know @mandarjog is an expert on Istio performance issues; he can probably chime in as well. |
For those who are interested in the performance of Envoy and Istio's Envoy, I've made a flame chart: https://github.com/iandyh/istio-enovy-perf/blob/master/1.1.1.svg |
@iandyh that's really useful. Looks like ~30% of CPU cycles are being burnt on tracing/Mixer filter. This seems to match up roughly to the difference between Envoy and HAProxy based on the initial Nighthawk latency measurement. |
@iandyh thanks for sharing that. I saw something very similar to that as well: a significant amount of CPU being spent in telemetry and tracing, and HTTP parsing. There are already a few issues open about the performance problems of the Mixer filter. To address latency, the current workaround is to add more Envoy worker threads to reduce the effect of queuing and user-perceived latency, but more worker threads also mean more resource usage. I'm wondering if there's a middle ground that could reduce latency by processing queued requests as quickly as possible while pushing telemetry/tracing and other non-essential work to after all the critical-path work is done. |
To set expectations, any perf gains there are going to be minimal. There is no low hanging fruit. Also, HTTP/1 is becoming increasingly irrelevant so I'm not sure how much it matters to most real world workloads.
Istio has well known performance problems that they are working to address with Mixer v2. I would recommend following up with them as this is really outside the scope of the core Envoy project.
This is possible, but as @htuch already mentioned, it's unclear to me that the extra complexity is going to yield substantial benefit, since this is really just "kicking the can down the road." Fundamentally, the features in use just need to consume fewer CPU cycles. If you feel like prototyping some changes please feel free, but I think the performance gains would have to be substantial to offset the increased complexity. |
@mattklein123 thanks for your response. Regarding filters: to someone who's new to Envoy, it might not be immediately obvious that a piece of filter code that runs off the critical path could have a significant impact on requests on the critical path under certain conditions. |
We could certainly consider threads or events with priorities, but none of that would help if you're measuring performance at close to serving capacity, since there just won't be enough CPU to go around and you either indefinitely delay the stats or drop them or pay the latency cost somewhere. When operating with plenty of spare CPU capacity, I agree, this approach would yield some benefit for filters that do low priority deferred actions, but I think Mixer is one of the few we've seen in the wild that do this. |
@htuch yep, agree. What I've seen sometimes is benchmarks themselves being the cause of excessive queuing delays. This was described earlier with fortio. If the Mixer filter could be optimized, that would alleviate the problem. But the fact that the core eventing system allows lower-priority work to continue while there is higher-priority work to do is an area that could be improved. Of course, this only makes sense when there is still spare CPU capacity. |
My investigation has shown that the performance further degrades if access_log is turned on. I see about a 40% degradation with a wrk benchmark. Is anyone else noticing significant degradation with access_log turned on? |
@codervinod is this gRPC access log or filesystem? I would expect any remote access logging where you are sending a RPC 1:1 to have this kind of impact for small requests. |
@huang195 regarding kicking the can down the road: if there is always some work waiting to be picked up, moving path length to the log filter still adds to latency. Have you done any experiments with the gRPC access log service? The Mixer client is an access log adapter that is batched but sends absolutely everything with every request. I think ultimately some balance of compression, batching, and selectiveness will win out. |
@mandarjog I haven't done any experiments with the gRPC access log service yet. From what I can gather, there are some implicit constraints on running an Envoy filter: i) the number of worker threads is fixed, ii) once a worker thread "accepts" a new connection, it cannot hand it off to another thread even if it is overloaded and the other is idling, and iii) each worker is single-threaded. So to minimize performance variance, an Envoy filter should not have vastly different code execution paths. In the Mixer client today, the default behavior is that 99 out of 100 requests go through the same code execution path, but the 100th request takes a much longer path, i.e., protobuf serialization, compression, and network I/O. This is one cause of the long-tail latency problem. But variance can be caused by other things as well, such as load, multitenancy, randomness in CPU scheduling, etc.

The mental diagram I have of an Envoy filter is that it has a critical path and a non-critical path. Policy/quota checking and request processing/forwarding are on the critical path. Telemetry, tracing, and other work that happens afterwards are on the non-critical path. Due to Envoy's single-threadedness, the non-critical path can sometimes impact the critical path. Code on the critical path needs to run regardless. Code on the non-critical path, on the other hand, should have control points that, in the most extreme case, allow it to be short-circuited completely, freeing the worker thread to work on something more pressing. Envoy internally keeps a lot of runtime stats, which could be used as triggers for these control points. |
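(On that last point: those runtime stats are exposed on Envoy's admin endpoint, so a benchmark run can observe request-time histograms and connection gauges directly. A sketch, assuming an admin listener on port 9901 and an HTTP connection manager configured with stat_prefix "ingress_http":)

```sh
# Dump the request-time histogram and active downstream connection count from the admin API.
curl -s http://127.0.0.1:9901/stats | grep -E 'ingress_http\.downstream_rq_time|downstream_cx_active'
```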
In the scenario where latency is due to 1:1 RPC hogging CPU time, and assuming that the RPC server is on the same machine, this might be worth considering: instead of using protobuf/gRPC, use a zero-copy, (de)serialization-free message format (like Cap'n Proto) and pass messages via file descriptor over Unix domain socket (the memory for the message(s) would be allocated via |
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions. |
This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted". Thank you for your contributions. |
Same problem: up to 60% CPU in a stress test. Is there any new solution? |
Benchmarking Envoy and comparing performance against HAProxy
Setup:
LB : Envoy/HAProxy
Backend : Nginx
Benchmarking Tool : wrk (https://github.com/wg/wrk)
Envoy Config:
Concurrency = 4
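(The config block itself was not captured in this text; with the concurrency setting above, the proxy would have been started along these lines, --concurrency being Envoy's standard flag for the number of worker threads.)

```sh
# Sketch of the launch command; envoy.json stands in for the (not captured) config above.
./envoy -c envoy.json --concurrency 4
```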
HAProxy Config:
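(The HAProxy config was also not captured; a typical minimal HTTP proxy setup for this kind of benchmark, with placeholder addresses and ports, might look like the following sketch.)

```sh
cat > haproxy.cfg <<'EOF'
# Illustrative only -- not the original config; addresses and ports are placeholders.
global
    maxconn 50000

defaults
    mode http
    timeout connect 5s
    timeout client  30s
    timeout server  30s

frontend fe_http
    bind *:8080
    default_backend be_nginx

backend be_nginx
    server nginx1 172.16.x.y:80 check
EOF

haproxy -f haproxy.cfg
```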
Nginx Config:
Benchmark Results
Envoy
$ wrk -c100 -d60s -t10 "http://172.16.x.x:8090/" --latency
HAProxy
$ wrk -c100 -d60s -t10 "http://172.16.x.x:8080/" --latency
Note:
Could you please point out where I am going wrong? (According to various online blogs, Envoy seems to provide performance equivalent to HAProxy.)