Prevent threads from being stuck in DynamicBatcher #1915
Conversation
- Ensure input batch can be safely fed with new samples at any time
- Remove the waiter mechanism
- Use a safer way to generate a thread ID
Hi @cbensimon,
Hi @miguelvr,
I fixed the max batch size to 32. Then, the lower the batch interval is set, the quicker the threads get stuck forever waiting for their respective predictions. Setting the batch interval to around 500ms makes the bug appear after a few minutes of API stress loading (64 concurrent workers). To reproduce the bug quickly, setting the batch interval to a lower value like 1ms is the best option. I inspected the threads using pystuck and saw that the threads were stuck waiting for their respective predictions.
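For anyone trying to reproduce this, here is a rough stress-loading sketch along those lines; the endpoint URL, payload, and request count are assumptions for illustration, not the actual script used in this report:

```python
# Rough stress-loading sketch: 64 concurrent workers hammering the API.
# With a small batch interval (e.g. 1ms), responses eventually stop
# arriving as the server-side threads get stuck, and requests time out.
import concurrent.futures

import requests

API_URL = "http://localhost:8888/predict"  # hypothetical endpoint
NUM_WORKERS = 64  # matches the 64 concurrent workers mentioned above


def hammer(worker_id: int) -> None:
    # Each worker fires requests in a loop; a timeout surfaces the hang.
    for _ in range(10_000):
        response = requests.post(API_URL, json={"sample": worker_id}, timeout=30)
        response.raise_for_status()


with concurrent.futures.ThreadPoolExecutor(max_workers=NUM_WORKERS) as pool:
    list(pool.map(hammer, range(NUM_WORKERS)))
```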
@cbensimon thanks for your response. We will try to reproduce it and we'll get back to you!
@cbensimon this is really awesome!! Thanks for catching this bug!! So, to re-summarize the problem at hand, we can start by looking at the following code block:

```python
def _enqueue_request(self, **kwargs):
    """
    Enqueue sample for batch inference. This is a blocking method.
    """
    thread_id = td.get_ident()  # td is the threading module
    self.waiter.wait()  # meant to block enqueuing while a batch is being predicted
    self.samples[thread_id] = kwargs
    try:
        self.barrier.wait()  # block until the full batch is assembled
    except td.BrokenBarrierError:
        pass
```

Let's assume the waiter is set and prediction requests are coming in:
The above situation can lead to samples being enqueued while a batch prediction is in progress, so they get wiped at the end of the batch and their threads wait forever. With this change in the code, a new problem arises - the max batch size is no longer enforced. To address that, I tweaked the code a bit to pick the samples with the smallest thread IDs in case there are more samples than the max batch size (see the sketch below). And since the number of threads is known, we can be sure that the queue doesn't grow to enormous sizes. Let me know what you think about this! I also added a test script (the one I used for testing) - it isn't the final one, just temporary. We shall convert that to a unit test in our test suite.
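For illustration, here is a minimal sketch of that selection logic; the `Batcher` class and its method names are placeholders of mine, not the actual DynamicBatcher internals:

```python
# Minimal sketch of the two fixes discussed above: enforce max_batch_size
# by taking the samples with the smallest IDs, and delete only the samples
# that were actually predicted (instead of wiping the whole queue).
import itertools
import threading


class Batcher:
    def __init__(self, max_batch_size: int):
        self.max_batch_size = max_batch_size
        self.samples = {}  # sample_id -> kwargs
        self.predictions = {}  # sample_id -> prediction
        self.lock = threading.Lock()
        self.sample_id_generator = itertools.count()

    def _make_batch(self) -> dict:
        with self.lock:
            # Pick at most max_batch_size samples, smallest IDs first,
            # so older requests are served before newer ones.
            batch_ids = sorted(self.samples)[: self.max_batch_size]
            return {i: self.samples[i] for i in batch_ids}

    def _store_predictions(self, batch_ids: list, predictions: list) -> None:
        with self.lock:
            for sample_id, prediction in zip(batch_ids, predictions):
                self.predictions[sample_id] = prediction
                # Delete only the predicted samples; samples enqueued
                # while the batch was running stay in the queue.
                del self.samples[sample_id]
```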
Hi @RobertLucian, the library is great; it's a pleasure to contribute! Yes, I think your view on the problem is right. I think the max batch size is already enforced as long as it equals the number of threads (which is currently mandatory): each thread can add a sample to the queue, one at a time, so the sample queue cannot exceed the number of threads (for each thread the cycle is: add a sample, wait for the prediction, the prediction is ready so the sample is deleted, add another sample, etc.). But it is totally fine to manually enforce this anyway. Additional note: with the new way of generating "thread" IDs (itertools.count), I realize it actually generates a "sample" ID, so it might be more explicit to rename it.
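To make that remark concrete, here is a tiny sketch contrasting the two ID strategies (the function names are illustrative):

```python
# Why a shared counter is safer here: threading.get_ident() values can be
# recycled after a thread exits, while itertools.count() yields a fresh
# ID for every enqueued sample.
import itertools
import threading

sample_id_generator = itertools.count()


def next_sample_id() -> int:
    # In CPython, count.__next__ is implemented in C and runs atomically
    # under the GIL, so concurrent threads never receive the same ID.
    return next(sample_id_generator)


def recyclable_id() -> int:
    # By contrast, this value only identifies a *currently alive* thread;
    # once a thread dies, a new thread may be handed the same ident.
    return threading.get_ident()
```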
@cbensimon let me know if the unit tests look good to you too. Thanks again for your effort!!
When server-side batching is enabled, threads get stuck waiting for the predictions:
- While the batch prediction is in progress, threads add samples to the queue (despite the waiter mechanism).
- At the end of the batch prediction, the samples are wiped out entirely, so the respective threads are stuck forever waiting for their prediction.

- This fix prevents the samples from being deleted wholesale (it only deletes the predicted samples).
- As the waiter mechanism doesn't seem to be working, this fix also removes it (@miguelvr).
- This fix also provides a safer and more explicit way to assign an ID to a thread (preventing thread ID recycling).
This bug has a huge impact on performance when server-side batching is enabled, rapidly dropping the number of available threads from max_batch_size to something close to zero.

checklist:
- run make test and make lint
- test manually (build/push all python-predictor-cpu and python-predictor-gpu images, restart operator, and re-deploy APIs)