
Unlucky requests can get needlessly dropped #35

Open
grzkv opened this issue Jan 16, 2019 · 11 comments

@grzkv (Member) commented Jan 16, 2019

Our request limiter currently works like a semaphore. Requests are not processed in FIFO order but are picked at random from an unordered pool, and each request has a timeout. This means that an unlucky request can wait longer than necessary and get timed out.

Example

Say we have 10 requests waiting to be processed, each takes 1 second, and a new request arrives every second, so the number of waiting requests stays constant at 10. Requests are processed one at a time.

Intuitively, the waiting time for a request should be 9 seconds.

What we have now

Requests are picked at random. With 10 requests in the pool, the probability that a given request is not picked in a processing cycle is 0.9. The chance that a request is still waiting after 29 cycles, i.e. for roughly 30 seconds, is 0.9^29 ≈ 5%. This is much longer than needed, and such a request will likely be timed out and dropped.
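
For illustration, a minimal sketch of the semaphore pattern described above (names are made up, this is not the actual zipper code): each request runs in its own goroutine and blocks until it either claims a slot or its timeout fires, and nothing orders the waiting goroutines.

```go
// Minimal sketch of a semaphore-style limiter. Which of the blocked
// goroutines gets the next free slot is not something we control, so an
// unlucky request can wait until its context expires.
package main

import (
	"context"
	"fmt"
	"time"
)

func handle(ctx context.Context, slots chan struct{}, id int) {
	select {
	case slots <- struct{}{}: // claimed a processing slot
		defer func() { <-slots }()
		time.Sleep(time.Second) // pretend the request takes 1 second
		fmt.Println("served", id)
	case <-ctx.Done(): // timed out while still waiting
		fmt.Println("dropped", id)
	}
}

func main() {
	slots := make(chan struct{}, 1) // one request at a time, as in the example
	for i := 0; i < 10; i++ {
		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		go func(id int) {
			defer cancel()
			handle(ctx, slots, id)
		}(i)
	}
	time.Sleep(12 * time.Second)
}
```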

What would be nice to have

Requests go into a FIFO queue. This way, each request waits for 9 seconds, then it is processed. The waiting time is more predictable and fair.
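
As a rough sketch of that idea, a buffered channel could serve as both the FIFO queue and the bound (hypothetical names, not a concrete proposal for the zipper code):

```go
// Sketch of a bounded FIFO request queue: a buffered channel preserves
// arrival order, and a full channel lets us reject new work up front
// instead of letting it pile up.
package main

import (
	"errors"
	"fmt"
	"time"
)

var errQueueFull = errors.New("queue full, request rejected")

type fifoQueue struct{ ch chan int } // queue of request IDs

func newFIFO(size int) *fifoQueue { return &fifoQueue{ch: make(chan int, size)} }

func (q *fifoQueue) enqueue(id int) error {
	select {
	case q.ch <- id:
		return nil
	default:
		return errQueueFull
	}
}

func (q *fifoQueue) worker() {
	for id := range q.ch {
		time.Sleep(time.Second) // pretend each request takes 1 second
		fmt.Println("served", id)
	}
}

func main() {
	q := newFIFO(10)
	go q.worker()
	for i := 0; i < 12; i++ {
		if err := q.enqueue(i); err != nil {
			fmt.Println("request", i, "->", err)
		}
	}
	time.Sleep(13 * time.Second)
}
```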

grzkv self-assigned this Jan 16, 2019
@grzkv (Member, Author) commented Jan 17, 2019

Here's an illustration of what happens https://play.golang.org/p/V4riLAtcUAq

grzkv added the api and zipper labels Jan 17, 2019
@ksurent (Contributor) commented Mar 2, 2019

If you're looking to optimise throughput in scenarios with many concurrent requests, there are systems out there that have adopted LIFO queues. It could be interesting to see whether that strategy would be useful here.

@grzkv (Member, Author) commented Mar 4, 2019

@ksurent Thanks for the suggestions. We will be able to compare after we do the refactoring.

I think in this case we need something simple: just a classic bounded request queue instead of a semaphore, or something along those lines.

@ksurent (Contributor) commented Mar 4, 2019

Sure, I was just suggesting a different queueing discipline (LIFO vs FIFO).

@gunnihinn (Contributor) commented Mar 5, 2019

You actually don't want a FIFO queue.

Under normal, happy conditions it doesn't matter what type of queue you have because almost everything gets served.

When you're under heavy load and need to start shedding some requests to survive, a FIFO queue can cause you to time out almost all the requests sitting in the queue if you pick a few heavy requests from it in a row. If you instead pick requests randomly, or use a LIFO queue, you increase the expected number of requests you can serve even under that load.

IIRC, the first Google SRE book has some discussion of this in one of the chapters on load shedding or cascading failures. Andre Vereira at Booking also knows some stuff about this if you want a second face-to-face opinion.
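
(For concreteness, a minimal sketch of a LIFO wait list, with made-up names. The shedding behaviour follows from the discipline alone: whatever sits at the bottom goes stale and times out, while recently arrived requests still get served in time.)

```go
// Minimal LIFO ("newest first") wait list sketch, illustrative only.
package lifo

import "sync"

type Queue struct {
	mu    sync.Mutex
	items []int // request IDs; real code would hold request objects
}

// Push adds a request to the top of the stack.
func (q *Queue) Push(id int) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.items = append(q.items, id)
}

// Pop returns the most recently queued request, or false if the queue is empty.
func (q *Queue) Pop() (int, bool) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if len(q.items) == 0 {
		return 0, false
	}
	id := q.items[len(q.items)-1]
	q.items = q.items[:len(q.items)-1]
	return id, true
}
```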

@gunnihinn (Contributor) commented Mar 5, 2019

I can prove this with an Etch-A-Sketch (so to speak): https://github.com/gunnihinn/queuesim

@grzkv (Member, Author) commented Mar 8, 2019

@gunnihinn @ksurent

Thanks for all the insight you shared. This is a much more interesting problem than it seems at first glance. The suggestion to go with FIFO looked like the simplest approach but is probably not the best one.

I will invest time to study the topic in depth and go through all the references (and the model) you gave.

@ericherman (Contributor) commented Mar 22, 2019

@gunnihinn I would be interested in seeing a ring-buffer added to the model.

Ring buffers are sometimes found in soft-realtime applications like audio streaming because, during periods of overload, the dropped data is experienced as a "hiccup" in the stream while the content remains largely sensible. By contrast, dropping data non-contiguously is more likely to produce nonsense and noise. Perhaps a ring buffer would be a sensible approach to shedding in this case as well?

(Another potential advantage is that the buffer is allocated once at startup and stays a constant size for the duration of execution, though I guess that is less important here.)
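
(A rough sketch of what that could look like, assuming a fixed-size ring where a new arrival displaces the oldest waiting request; names are illustrative only, not the project's code.)

```go
// Sketch of a fixed-size ring buffer used as a request queue: when the ring
// is full, a new arrival overwrites (drops) the oldest waiting request, so
// shedding always removes a contiguous, oldest-first chunk of the backlog.
package main

import "fmt"

type ring struct {
	buf        []int // request IDs
	head, size int
}

func newRing(n int) *ring { return &ring{buf: make([]int, n)} }

// put appends a request; if the ring is full, the oldest entry is dropped.
func (r *ring) put(id int) (dropped int, didDrop bool) {
	pos := (r.head + r.size) % len(r.buf)
	if r.size == len(r.buf) {
		dropped, didDrop = r.buf[r.head], true
		r.head = (r.head + 1) % len(r.buf)
	} else {
		r.size++
	}
	r.buf[pos] = id
	return dropped, didDrop
}

// get removes and returns the oldest request still in the ring.
func (r *ring) get() (int, bool) {
	if r.size == 0 {
		return 0, false
	}
	id := r.buf[r.head]
	r.head = (r.head + 1) % len(r.buf)
	r.size--
	return id, true
}

func main() {
	r := newRing(3)
	for i := 1; i <= 5; i++ { // overload: 5 arrivals, room for 3
		if old, drop := r.put(i); drop {
			fmt.Println("dropped", old)
		}
	}
	for id, ok := r.get(); ok; id, ok = r.get() {
		fmt.Println("serving", id)
	}
}
```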

@grzkv (Member, Author) commented Mar 29, 2019

@gunnihinn

I can prove this with an Etch-A-Sketch (so to speak): gunnihinn/queuesim

This model does not include the case described in the first post. Also, it operates only on bounded queues, while our current "queue" is unbounded (not that I am claiming that is a good thing). The queue is effectively limited by applying timeouts, which makes it behave like a CoDel queue.

Correct me if I am getting something wrong.

Having gone through various materials, I agree that FIFO is most likely not the best solution. Currently, @ericherman's ring-buffer proposal looks best to me.

@gunnihinn (Contributor) commented Apr 1, 2019

This model does not include the case described in the first post.

Indeed; I only wanted to point out that FIFO queues are maybe not the best option, and that analyzing the different options is a somewhat nontrivial task. It's your show; go with whatever performs best in whatever tests you set up.

Currently, @ericherman's ring-buffer proposal looks best to me.

Well... is it better than the one you already have?

@grzkv (Member, Author) commented Apr 1, 2019

@gunnihinn

Well... is it better than the one you already have?

Currently we create a goroutine for each request, and these goroutines "queue up" waiting for the semaphore. The problem is that there can be a very large number of goroutines waiting (up to 300k in one of our setups). This takes a lot of memory (8 kB × 300k = 2.4 GB, 8 kB being the initial goroutine stack size) and pushes GC pauses up to 1 second. It also creates contention between goroutines that slows everything down.

We need to move the queuing from goroutines into some data structure; the discussion here is about which one. We also need to make it bounded. Currently it is not, and because of this we sometimes get explosions in the number of goroutines.

It is clear that FIFO is not a good bounded structure (thanks for the model!). LIFO is not inherently bounded, and neither is a pool with random selection; we would need to add some bounding strategy on top (as described in the resources shared above). Compared to these, a ring buffer seems to be the simplest solution, and it is also naturally resilient.

But this is just a guess; we will see when we get to test these. I'm just explaining my thinking.
