
Full scan least request LB alg: would it provide better performance in some cases? #4481

Closed
markvincze opened this issue Sep 20, 2018 · 11 comments
@markvincze commented Sep 20, 2018

Full scan least request LB algorithm

I've been using Envoy as a proxy/load balancer in front of an API with the following characteristics:

  • The requests the API serves are CPU-intensive; one request more or less fully utilizes one CPU core.
  • The duration of the requests varies between 100 ms and 5 seconds (or even a bit more in some edge cases).

Because the requests are CPU-intensive, it's important to distribute the load evenly across the servers: if one machine processes many requests in parallel, its performance degrades, with the CPU being the bottleneck.

Earlier I was using a simple Kubernetes load balancer (with an underlying Google Cloud load balancer), which only provided a round-robin algorithm, so I switched to Envoy so that I could use Least Request to distribute the load across the servers more evenly.

Introducing Envoy and switching from ROUND_ROBIN to LEAST_REQUEST indeed improved performance (it decreased the response times and made the CPU load smoother across the machines).
However, I still saw that the load was not close to even across the instances; often one machine was processing 0 requests while another was doing 5.
Envoy's Least Request policy uses a P2C (power of two choices) algorithm, and based on what I observed I started to doubt that P2C achieves the same uniform distribution as a full scan would.
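To make the difference between the two strategies concrete, here is a minimal sketch in Python (my own illustration, not Envoy's actual C++ implementation; `active_requests` is a hypothetical list of per-host in-flight counters):

```python
import random

def pick_p2c(active_requests):
    # Power-of-two-choices: sample two distinct hosts at random
    # and pick the one with fewer in-flight requests.
    a, b = random.sample(range(len(active_requests)), 2)
    return a if active_requests[a] <= active_requests[b] else b

def pick_full_scan(active_requests):
    # Full scan: examine every host and pick the global minimum.
    return min(range(len(active_requests)), key=active_requests.__getitem__)
```

P2C does constant work per pick, while the full scan is O(n) in the number of hosts but always finds the truly least-loaded one.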

So I decided to test this. I implemented a dummy API (I called it PrimeCalcApi) with the characteristics described above: one request fully uses one CPU core, and its response times vary approximately between 100 ms and 5 seconds. (It does some fake CPU-intensive work, calculating primes. The source is here: https://github.com/markvincze/PrimeCalcApi/tree/master/src/PrimeCalcApi.WebService)
I deployed this to our Kubernetes cluster, with 20 instances.

Then I implemented a very crude and naive Full Scan Least Request LB algorithm in Envoy (here is the PR in my fork: markvincze#1).
I deployed my fork of Envoy in front of this API cluster as a single instance. (I was using STRICT_DNS service discovery with a Kubernetes headless service.)
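For reference, a minimal cluster definition along these lines could look like the sketch below (my own approximation using the Envoy v2 API; the cluster name, DNS name, and port are placeholders for the headless service, not values from the actual deployment):

```yaml
clusters:
- name: primecalcapi
  type: STRICT_DNS              # resolve the headless service to all pod IPs
  connect_timeout: 1s
  lb_policy: LEAST_REQUEST      # Envoy's built-in P2C-based least request
  load_assignment:
    cluster_name: primecalcapi
    endpoints:
    - lb_endpoints:
      - endpoint:
          address:
            socket_address:
              address: primecalcapi.default.svc.cluster.local
              port_value: 8080
```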

This is the full architecture.

                                                                                              +--------------+
                                                                                              |              |
                                                                                     +------> | PrimeCalcApi |
                                                                                     |        |              |
                                                                                     |        +--------------+
                                                                                     |
+-----------------------+     +------------------+     +-----------------------+     |        +--------------+
|                       |     |                  |     |                       |     |        |              |
| My machine            |     | Google Cloud LB  |     | Envoy                 |     |        | PrimeCalcApi |
| running the benchmark +---> | Public IP        +---> | (on a Kubernetes pod) +------------> |              |
|                       |     |                  |     |                       |     |        +--------------+
+-----------------------+     |                  |     |                       |     |
                              +------------------+     +-----------------------+     |              ...
                                                                                     |
                                                           1 single instance         |        +--------------+
                                                                                     |        |              |
                                                                                     +------> | PrimeCalcApi |
                                                                                              |              |
                                                                                              +--------------+

                                                                                               20 instances

Then I performed a load test, continuously sending requests to this cluster through the Envoy LB. Initially 2 requests per second were sent, and I increased the rate by 1 request per second every 5 minutes. The benchmark ran for 40 minutes, so at the end it was sending 8 requests per second. (The code of the benchmark is here: https://github.com/markvincze/PrimeCalcApi/tree/master/src/PrimeCalcApi.LoadTest)

(The API instances all have 1 CPU core, so as the request rate increases, once they start processing more than 1 request in parallel, their response time degrades as expected.)

I repeated this benchmark with ROUND_ROBIN, then LEAST_REQUEST, then (this is implemented in my fork) LEAST_REQUEST_FULL.

Throughout the benchmark I collected the following metrics:

  • Number of requests being executed ("requests in flight"), broken down by servers
  • Number of requests being executed, average per server
  • Incoming request rate (this was increasing every 5 minutes)
  • Error rate (normally there weren't any, but when things started to slow down, this started to show some timeouts)
  • Response time percentiles (0.50, 0.90 and 0.99) recorded on the servers

These were the results for ROUND_ROBIN:

[chart]

Results for LEAST_REQUEST:

[chart]

Results for LEAST_REQUEST_FULL:

[chart]

You can see that the LEAST_REQUEST was significantly better than ROUND_ROBIN, but the LEAST_REQUEST_FULL brought a very large improvement on top of that in terms of the even distribution of the load, and thus the response times.

And here are two more images comparing, for the three algorithms, the distribution of in-flight requests and the response time graphs separately.

[chart: in-flight requests per server]

[chart: response times]

My questions are:

  • Are these results correct, and does a full scan least request algorithm give this much better performance in situations like this, or did I make some mistake in my benchmark?
  • If it's really this good, can we implement it in Envoy?
@htuch (Member) commented Sep 20, 2018

@markvincze thanks for the thorough analysis here. Sorry if I missed this, but can you comment on how many servers you have as upstreams? I'd think that P2C, like any statistical algorithm, would be better at large numbers and that a full scan would be better with a small number of upstreams.

Do you see a discrepancy if you run the above experiments with a large number of upstreams (e.g. 50)?

@htuch htuch added the question label Sep 20, 2018
@htuch htuch self-assigned this Sep 20, 2018
@markvincze (Author) commented Sep 20, 2018

Hi @htuch,

This benchmark was done with 20 instances.
I'll try to repeat it with more instances, let's say 60, and will post the results.

@mattklein123 (Member) commented Sep 20, 2018

FWIW, I think having a full scan algorithm in certain cases like yours (long-lived/expensive requests) makes sense. These are cases where the probabilistic P2C algorithm will not do well. I would point out that there is also the option to switch P2C to PNC (i.e., allow the user to specify N for more than 2 choices). This might provide most of the benefit without a full scan.
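The PNC generalization can be sketched as follows (my own illustration of the idea, not code from Envoy; `active_requests` is a hypothetical list of per-host in-flight counters):

```python
import random

def pick_pnc(active_requests, n=2):
    # Power-of-N-choices: sample n distinct hosts and pick the least
    # loaded among them. n == 2 is P2C; n == len(active_requests)
    # degenerates to a full scan.
    candidates = random.sample(range(len(active_requests)), n)
    return min(candidates, key=active_requests.__getitem__)
```

Raising `n` trades a little extra work per pick for a better chance of finding a lightly loaded host, interpolating between P2C and a full scan.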

@markvincze (Author) commented Sep 21, 2018

@htuch,

I did the same benchmark with 60 API pods instead of 20. (I also increased the benchmark request rate accordingly.)
I see the same difference between the 3 algorithms:

ROUND_ROBIN:

[chart]

LEAST_REQUEST:

[chart]

LEAST_REQUEST_FULL:

[chart]

@htuch (Member) commented Sep 21, 2018

@markvincze yeah, I think either adding a PNC or a full-scan variant of P2C seems the way to go. The fact that the problem doesn't get better as you scale up the hosts surprises me; presumably at some sufficiently high number of hosts the two would converge, or we have some bug?

@ggreenway (Contributor) commented Sep 21, 2018

I think you need a large number of requests per host for the distribution to even out, so a smaller number of hosts would probably show more even results.

@markvincze (Author) commented Sep 24, 2018

@htuch,

I'd be surprised if this were a bug. My intuition was to expect results like this, although this is a pretty specific scenario, because

  • Requests are 100% CPU-intensive
  • Long request duration, with high variance

What could also make my benchmark results this extreme is that the 60 pods I'm testing with are all 1-core instances. So as soon as a node is processing not one but two requests in parallel, its response time degrades almost twofold, which I think can quickly start a "chain reaction" and slow the whole cluster down.
And the full scan algorithm is the only one which keeps, for as long as possible, every node processing at most 1 request at a time.
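A toy simulation (my own sketch, not the benchmark code; the host count, arrival count, and trial count are arbitrary assumptions) illustrates this last point: when fewer simultaneous requests arrive than there are hosts, a full scan never stacks a second request on a busy host, while P2C occasionally does.

```python
import random

def p2c(load):
    # Pick the less-loaded of two randomly sampled hosts.
    a, b = random.sample(range(len(load)), 2)
    return a if load[a] <= load[b] else b

def full_scan(load):
    # Pick the globally least-loaded host.
    return min(range(len(load)), key=load.__getitem__)

def worst_inflight(picker, hosts=20, arrivals=15, trials=2000, seed=1):
    # Assign `arrivals` simultaneous requests to `hosts` initially idle
    # servers using `picker`; return the highest per-host in-flight
    # count observed across all trials.
    random.seed(seed)
    worst = 0
    for _ in range(trials):
        load = [0] * hosts
        for _ in range(arrivals):
            load[picker(load)] += 1
        worst = max(worst, max(load))
    return worst
```

With 15 arrivals on 20 idle hosts, `worst_inflight(full_scan)` stays at 1 by construction, whereas `worst_inflight(p2c)` will show hosts handling 2 or more requests in parallel, which is exactly the situation where a 1-core backend starts degrading.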

I don't think that with a real-life API (where we rarely have 100% pure CPU-intensive requests like this) we would get this extreme an improvement with full scan.
I might try to repeat the benchmark one more time on a cluster of machines with more than 1 core to see if that decreases the difference between the LB algorithms.

@htuch (Member) commented Sep 24, 2018

Yes, the fact that your backends are basically capable of effectively handling only a single request at full capacity, and are CPU-bound, makes this scenario more problematic. It seems totally reasonable to add full scan as an alternative least-loaded LB policy.

@tonya11en (Member) commented Oct 2, 2018

I'm a bit late here, but it makes total sense: at 8 requests per second at the experiment's peak, the overhead of a full scan over 60 instances would be trivial, given the timescale on which each request runs.

Have you tested the results of Matt's suggestion of using more than 2 choices on your benchmark?

stale bot commented Nov 1, 2018

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Nov 1, 2018
stale bot commented Nov 8, 2018

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted". Thank you for your contributions.
