Full scan least request LB alg: would it provide better performance in some cases? #4481
I've been using Envoy as a proxy/load balancer in front of an API with the following characteristics:

- the requests are CPU intensive: processing a single request fully utilizes one CPU core,
- the response times vary widely, roughly between 100ms and 5 seconds.

Because the requests are CPU intensive, it's important to distribute the load evenly across the servers: if one machine processes many requests in parallel, its performance degrades, with the CPU being the bottleneck.
Earlier I was using a simple Kubernetes load balancer (with an underlying Google Cloud load balancer), which only provided a round robin algorithm, so I switched to Envoy to be able to use Least Request, which distributes the load across the servers more evenly.
Introducing Envoy and switching from round robin to Least Request already improved the load distribution, but I was wondering whether scanning the full host list could distribute the load even better than picking the better of two random choices.
So I decided to test this. I implemented a dummy API (I called it PrimeCalcApi) with characteristics similar to what I described above. One request fully utilizes one CPU core, and its response times vary approximately between 100ms and 5 seconds. (It's doing some fake CPU-intensive work, calculating primes. The source is here: https://github.com/markvincze/PrimeCalcApi/tree/master/src/PrimeCalcApi.WebService)
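To give an idea of the workload, here is a minimal C++ analogue of that fake work. This is an illustrative sketch only, not the actual PrimeCalcApi code (which is a C# web service); `count_primes` and the limit value are made up for the example:

```cpp
#include <cstdint>
#include <iostream>

// Naive trial-division primality test: deliberately simple, so that a
// single request saturates one CPU core for a noticeable amount of time.
static bool is_prime(uint64_t n) {
  if (n < 2) return false;
  for (uint64_t d = 2; d * d <= n; ++d) {
    if (n % d == 0) return false;
  }
  return true;
}

// Counts primes up to `limit`. Varying `limit` per request would roughly
// reproduce the 100ms-5s spread in response times described above.
static uint64_t count_primes(uint64_t limit) {
  uint64_t count = 0;
  for (uint64_t n = 2; n <= limit; ++n) {
    if (is_prime(n)) ++count;
  }
  return count;
}

int main() {
  std::cout << count_primes(2'000'000) << "\n"; // pure CPU-bound work
  return 0;
}
```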
Then I implemented a very crude and naive Full Scan Least Request LB algorithm in Envoy (here is the PR in my fork: markvincze#1).
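Conceptually, the algorithm just scans every host and picks the one with the fewest in-flight requests. A minimal standalone sketch of the idea (not the actual code from the PR; the `Host` struct here is a stand-in for Envoy's host abstraction):

```cpp
#include <atomic>
#include <cstdint>
#include <limits>
#include <memory>
#include <vector>

// Stand-in for a host entry (not Envoy's actual Host interface): the only
// state this algorithm needs is the number of in-flight requests.
struct Host {
  std::atomic<uint64_t> active_requests{0};
};

// Full scan least request: an O(n) scan over the whole host list, returning
// the host with the fewest in-flight requests. Ties go to the earlier host.
Host* choose_host_full_scan(const std::vector<std::unique_ptr<Host>>& hosts) {
  Host* best = nullptr;
  uint64_t best_active = std::numeric_limits<uint64_t>::max();
  for (const auto& host : hosts) {
    const uint64_t active =
        host->active_requests.load(std::memory_order_relaxed);
    if (active < best_active) {
      best = host.get();
      best_active = active;
    }
  }
  return best;
}
```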
This is the full architecture.
Then I performed a load test in which I was continuously sending requests to this cluster through the Envoy LB. Initially 2 requests per second were sent, and I increased the rate by 1 request per second every 5 minutes. I ran the benchmark for 40 minutes, so at the end it was sending 8 requests per second. (The code of the benchmark is here: https://github.com/markvincze/PrimeCalcApi/tree/master/src/PrimeCalcApi.LoadTest)
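The ramp-up logic is simple enough to sketch. The following is an illustrative C++ loop, not the actual PrimeCalcApi.LoadTest code; `send_request` is a made-up stub, and the real test presumably fires requests asynchronously rather than blocking:

```cpp
#include <chrono>
#include <cstdio>
#include <thread>

// Stand-in for firing one HTTP request at the Envoy listener.
static void send_request() { std::puts("request sent"); }

int main() {
  using namespace std::chrono;
  const int initial_rps = 2;      // start at 2 requests per second
  const auto step = minutes(5);   // bump the rate by 1 rps every 5 minutes
  const auto total = minutes(40); // overall benchmark duration

  const auto start = steady_clock::now();
  while (steady_clock::now() - start < total) {
    const auto elapsed = steady_clock::now() - start;
    // Current target rate: initial rate plus one per elapsed 5-minute step.
    const int rps = initial_rps + static_cast<int>(elapsed / step);
    send_request();
    std::this_thread::sleep_for(nanoseconds(seconds(1)) / rps);
  }
}
```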
(The API instances each have 1 CPU core, so as the request rate increases and they start processing more than one request in parallel, their response times degrade, as expected.)
I repeated this benchmark with 3 different load balancing algorithms: Round Robin, the default Least Request (P2C), and my Full Scan Least Request implementation.
Throughout the benchmark I collected the following metrics:

- the number of requests being processed in parallel on each API instance,
- the response times of the requests.
These were the results for the 3 algorithms (result graphs omitted). You can see that the Full Scan Least Request algorithm distributed the load across the instances much more evenly, and kept the response times significantly lower, than the other two algorithms.
And here are two more images showing, for the 3 different algorithms, the distribution of the number of requests being executed and the response time graphs separately (images omitted).
My question is: would it make sense to add such a Full Scan Least Request algorithm to Envoy as an option? Or does it have drawbacks that make it a bad idea in practice?
@markvincze thanks for the thorough analysis here. Sorry if I missed this, but can you comment on how many servers you have as upstreams? I'd think that P2C, like any statistical algorithm, would be better at large numbers and that a full scan would be better with a small number of upstreams.
Do you see a discrepancy if you run the above experiments with a large number of upstreams (e.g. 50)?
FWIW, I think having a full scan algorithm makes sense in certain cases like yours (long lived/expensive requests). These are cases where the probabilistic P2C algorithm will not do well. I would point out that there is also the option of generalizing P2C to PNC (meaning allow the user to specify N for > 2 choices). This might provide most of the benefit without a full scan.
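For context, the P2C-to-PNC generalization amounts to sampling N hosts at random and keeping the least loaded of the sample. A sketch reusing the stand-in `Host` type from above (again, not Envoy's actual implementation):

```cpp
#include <atomic>
#include <cstdint>
#include <limits>
#include <memory>
#include <random>
#include <vector>

// Same stand-in Host type as in the full scan sketch above.
struct Host {
  std::atomic<uint64_t> active_requests{0};
};

// Power of N choices: sample `choices` hosts uniformly at random (with
// replacement) and return the least loaded of the sample. choices == 2 is
// classic P2C; raising `choices` trades more work per pick for better picks.
Host* choose_host_pnc(const std::vector<std::unique_ptr<Host>>& hosts,
                      size_t choices, std::mt19937_64& rng) {
  std::uniform_int_distribution<size_t> pick(0, hosts.size() - 1);
  Host* best = nullptr;
  uint64_t best_active = std::numeric_limits<uint64_t>::max();
  for (size_t i = 0; i < choices; ++i) {
    Host* candidate = hosts[pick(rng)].get();
    const uint64_t active =
        candidate->active_requests.load(std::memory_order_relaxed);
    if (active < best_active) {
      best = candidate;
      best_active = active;
    }
  }
  return best;
}
```

As `choices` approaches the host count, this approaches the behavior of a full scan (modulo sampling with replacement), which is why it might capture most of the benefit.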
@markvincze yeah, I think either adding a PNC or full scan variant of P2C seems the way to go. The fact that the problem doesn't become better as you scale up the hosts seems surprising to me; presumably at some sufficiently high number of hosts the two would converge, or we have some bug?
I'd be surprised if this was a bug. My intuition was to expect results like this, although this is a pretty specific scenario, because the requests are purely CPU-bound and long-running, so a single badly placed request immediately degrades the response times of the host it lands on.
What could also make my benchmark results this extreme is that the 60 pods I'm testing with are all 1-core instances. So as soon as a node is processing not 1 but 2 requests in parallel, its response time degrades almost twofold, which I think can quickly start a "chain reaction" and slow the whole cluster down.
I don't think that in a real-life API (where we rarely have 100% pure CPU-intensive requests like this) we would get this extreme an improvement with full scan.
I'm a bit late here, but it makes total sense that at 8 requests per second at the experiment's peak, the overhead of a full scan over 60 instances would be trivial, given the timescale on which each request runs.
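For concreteness, using the numbers from the experiment: 8 host picks per second times 60 hosts is only 8 × 60 = 480 counter reads per second, each on the order of nanoseconds, while every request occupies an entire core for 100 ms to 5 s.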
Have you tried Matt's suggestion of using more than 2 choices in your benchmark?