
Prevent threads from being stuck in DynamicBatcher #1915

Merged: 11 commits into cortexlabs:master on Mar 5, 2021

Conversation

cbensimon (Contributor) commented Feb 28, 2021

When server-side batching is enabled, threads get stuck waiting for their predictions:
While a batch prediction is in progress, threads keep adding samples to the queue (despite the waiter mechanism).
At the end of the batch prediction, all samples are wiped out, so the threads whose samples were never processed wait forever for predictions that will never arrive.

This fix prevents the samples from being deleted wholesale (only the predicted samples are removed).
Since the waiter mechanism doesn't appear to be working, this fix also removes it (@miguelvr).
This fix also provides a safer and more explicit way to assign an ID to a thread (preventing thread ID recycling).

This bug has a huge impact on performance when server-side batching is enabled, rapidly dropping the number of available threads from max_batch_size to something close to zero.
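
For illustration, here is a minimal sketch of the idea behind the fix (a hypothetical helper, not the actual DynamicBatcher code): after a batch prediction, only the samples that were part of the batch are removed from the queue, so samples enqueued while the prediction was running survive until the next batch.

def remove_predicted_samples(samples: dict, predicted_ids: list) -> None:
    """Delete only the entries whose IDs were included in the last batch."""
    for sample_id in predicted_ids:
        samples.pop(sample_id, None)  # samples enqueued mid-prediction are left untouched


# usage: sample 3 arrived while the batch [1, 2] was being predicted
samples = {1: "a", 2: "b", 3: "c"}
remove_predicted_samples(samples, predicted_ids=[1, 2])
assert samples == {3: "c"}  # sample 3 will be picked up by the next batch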


checklist:

  • run make test and make lint
  • test manually (i.e. build/push all python-predictor-cpu and python-predictor-gpu images, restart operator, and re-deploy APIs)

Ensure input batch can be safely fed with new samples at any time
Remove the waiter mechanism
Use a safer way to generate a thread ID

CLAassistant commented Feb 28, 2021

CLA assistant check
All committers have signed the CLA.

miguelvr (Collaborator) commented Mar 1, 2021

Hi @cbensimon,
Can you elaborate on when the dynamic batcher hangs? We tested it and haven't observed any hanging behavior. Does it happen in an edge case or all the time?

cbensimon (Contributor, Author) commented:

Hi @miguelvr,
This happens after some amount of time.
I tested it with a dummy predictor function:

def predict(self, payload):
    time.sleep(0.2)  # simulate a 200ms prediction (requires `import time`)
    return payload

I set the max batch size to 32.

The lower the batch interval is set, the sooner the threads end up stuck forever waiting for their respective predictions.

Setting the batch interval to around 500ms makes the bug appear after a few minutes of stress-loading the API (64 concurrent workers).

To reproduce the bug quickly, the best option is to set the batch interval to a lower value like 1ms.

I inspected the threads using pystuck and saw that they were stuck waiting for their respective predictions.
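
For reference, a minimal stress-load sketch along the lines of what is described above (hypothetical: the endpoint URL and payload are placeholders, and this is not the script used in the report): 64 concurrent workers repeatedly posting to the API.

import concurrent.futures

import requests

API_URL = "http://localhost:8888/predict"  # placeholder endpoint


def worker(n_requests: int = 100) -> int:
    """Send n_requests sequential prediction requests and count the successes."""
    ok = 0
    for _ in range(n_requests):
        response = requests.post(API_URL, json={"value": 1}, timeout=30)
        ok += response.status_code == 200
    return ok


# 64 concurrent workers hammering the endpoint
with concurrent.futures.ThreadPoolExecutor(max_workers=64) as pool:
    results = list(pool.map(lambda _: worker(), range(64)))

print(f"successful responses: {sum(results)}")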

miguelvr (Collaborator) commented Mar 1, 2021

@cbensimon thanks for your response. We will try to reproduce it and we'll get back to you!

RobertLucian (Member) commented Mar 3, 2021

@cbensimon this is really awesome!! Thanks for catching this bug!!

So, to re-summarize the problem at hand, we can start by looking at the following code block:

def _enqueue_request(self, **kwargs):
    """
    Enqueue sample for batch inference. This is a blocking method.
    """
    thread_id = td.get_ident()

    self.waiter.wait()
    self.samples[thread_id] = kwargs
    try:
        self.barrier.wait()
    except td.BrokenBarrierError:
        pass

Let's assume the waiter is set and prediction requests are coming in:

  1. A bunch of requests get added to self.samples until the barrier has been waited on n times.
  2. There's a chance that, while the last barrier-breaking wait is still in progress, some threads have already made it past the self.waiter.wait() instruction and added their samples to self.samples, even though self.samples has already been handed to self._make_batch. If that happens, then at the end of the batch the samples that were never passed to self._make_batch get removed without ever being processed.

The above situation causes _get_prediction to hang indefinitely on the affected threads. Can you confirm this?


With this change in the code, a new problem arises: the max batch size is no longer enforced. To address that, I tweaked the code a bit to pick the samples with the smallest thread IDs whenever there are more samples than the max batch size. And since the number of threads is a known quantity, we can be sure the queue doesn't grow to an enormous size.
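
As a rough illustration of that tweak (a sketch only, with hypothetical names, not the exact code that was merged): when more samples are queued than max_batch_size, only the entries with the smallest IDs are selected for the current batch, and the rest stay queued for the next one.

def select_batch(samples: dict, max_batch_size: int) -> dict:
    """Pick at most max_batch_size samples, preferring the smallest (oldest) IDs."""
    selected_ids = sorted(samples)[:max_batch_size]
    return {sample_id: samples[sample_id] for sample_id in selected_ids}


# usage example: 3 queued samples, batch size limited to 2
queued = {3: "c", 1: "a", 2: "b"}
batch = select_batch(queued, max_batch_size=2)
assert batch == {1: "a", 2: "b"}  # sample 3 waits for the next batch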

Let me know what you think about this!


I also added a test script (the one I used for testing). It isn't the final version, just a temporary one; we'll convert it to a unit test in our test suite.

@RobertLucian RobertLucian self-requested a review March 3, 2021 20:52
@RobertLucian RobertLucian added the bug (Something isn't working) label on Mar 3, 2021
cbensimon (Contributor, Author) commented Mar 4, 2021

Hi @RobertLucian, the library is great; it's a pleasure to contribute!

Yes, I think your view of the problem is right.


I think the max batch size is already enforced as long as it equals the number of threads (which is currently mandatory): each thread can add one sample to the queue at a time, so the sample queue cannot exceed the number of threads. For each thread the cycle is: add a sample, wait for the prediction, delete the sample once the prediction is ready, add another sample, and so on.

But it is totally fine to enforce this explicitly anyway.


Additional note: with the new way of generating the "thread" ID (itertools.count), I realize it actually generates a "sample" ID, so it might be more explicit to rename thread_id* to sample_id*.
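
For illustration, here is a minimal sketch of that idea (hypothetical code, not the implementation that was merged): a shared itertools.count, guarded by a lock, hands out a fresh, monotonically increasing sample ID for every enqueued sample, whereas threading.get_ident() values can be recycled once a thread dies.

import itertools
import threading

_sample_id_counter = itertools.count()
_sample_id_lock = threading.Lock()


def next_sample_id() -> int:
    """Return a unique, never-recycled sample ID."""
    with _sample_id_lock:
        return next(_sample_id_counter)


# usage example
ids = [next_sample_id() for _ in range(3)]
assert ids == [0, 1, 2]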

RobertLucian (Member) commented:

  • Converted the test script to a unit test.
  • Refactored the thread_id* symbols to sample_id*.

@cbensimon let me know if the unit tests look good to you too. Thanks again for your effort!!

@RobertLucian RobertLucian merged commit 0b1b649 into cortexlabs:master Mar 5, 2021
@vishalbollu vishalbollu added this to the 0.31 milestone Mar 16, 2021