
Prevent threads from being stuck in DynamicBatcher #1915

Merged: 11 commits into cortexlabs:master on Mar 5, 2021

Conversation

cbensimon (Contributor) commented Feb 28, 2021

When server-side batching is enabled, threads get stuck waiting for their predictions:
While a batch prediction is in progress, threads keep adding samples to the queue (despite the waiter mechanism).
At the end of the batch prediction, all samples are wiped out, so the threads whose samples were never processed wait forever for predictions that will never arrive.

This fix prevents the samples from being deleted wholesale (only the predicted samples are removed).
Since the waiter mechanism doesn't appear to be working, this fix also removes it (@miguelvr).
This fix also provides a safer and more explicit way to assign an ID to a thread (preventing thread ID recycling).

This bug has a huge impact on performance when server-side batching is enabled, rapidly dropping the number of available threads from max_batch_size to something close to zero.
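
For illustration, here is a minimal sketch of the idea behind the fix (a hypothetical helper, not the actual DynamicBatcher code): after a batch prediction, only the samples that were part of the batch are removed from the queue, so samples enqueued while the prediction was running survive until the next batch.

def remove_predicted_samples(samples: dict, predicted_ids: list) -> None:
    """Delete only the entries whose IDs were included in the last batch."""
    for sample_id in predicted_ids:
        samples.pop(sample_id, None)  # samples enqueued mid-prediction are left untouched


# usage: sample 3 arrived while the batch [1, 2] was being predicted
samples = {1: "a", 2: "b", 3: "c"}
remove_predicted_samples(samples, predicted_ids=[1, 2])
assert samples == {3: "c"}  # sample 3 will be picked up by the next batch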


checklist:

  • run make test and make lint
  • test manually (i.e. build/push all python-predictor-cpu and python-predictor-gpu images, restart operator, and re-deploy APIs)

Ensure input batch can be safely fed with new samples at any time
Remove the waiter mechanism
Use a safer way to generate a thread ID

CLAassistant commented Feb 28, 2021

CLA assistant check
All committers have signed the CLA.

miguelvr (Collaborator) commented Mar 1, 2021

Hi @cbensimon,
Can you elaborate on when the dynamic batcher hangs? We tested it and haven't observed any hanging behavior. Does it happen in an edge case or all the time?

cbensimon (Contributor, Author) commented:

Hi @miguelvr,
This happens after some amount of time.
I tested it with a dummy predictor function:

def predict(self, payload):
    time.sleep(0.2)  # simulate a 200ms prediction (requires `import time`)
    return payload

I set the max batch size to 32.

The lower the batch interval is set, the sooner the threads end up stuck forever waiting for their respective predictions.

Setting the batch interval to around 500ms makes the bug appear after a few minutes of stress-loading the API (64 concurrent workers).

To reproduce the bug quickly, the best option is to set the batch interval to a lower value like 1ms.

I inspected the threads using pystuck and saw that they were stuck waiting for their respective predictions.
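
For reference, a minimal stress-load sketch along the lines of what is described above (hypothetical: the endpoint URL and payload are placeholders, and this is not the script used in the report): 64 concurrent workers repeatedly posting to the API.

import concurrent.futures

import requests

API_URL = "http://localhost:8888/predict"  # placeholder endpoint


def worker(n_requests: int = 100) -> int:
    """Send n_requests sequential prediction requests and count the successes."""
    ok = 0
    for _ in range(n_requests):
        response = requests.post(API_URL, json={"value": 1}, timeout=30)
        ok += response.status_code == 200
    return ok


# 64 concurrent workers hammering the endpoint
with concurrent.futures.ThreadPoolExecutor(max_workers=64) as pool:
    results = list(pool.map(lambda _: worker(), range(64)))

print(f"successful responses: {sum(results)}")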

miguelvr (Collaborator) commented Mar 1, 2021

@cbensimon thanks for your response. We will try to reproduce it and we'll get back to you!

RobertLucian (Member) commented Mar 3, 2021

@cbensimon this is really awesome!! Thanks for catching this bug!!

So, to re-summarize the problem at hand, we can start by looking at the following code block:

def _enqueue_request(self, **kwargs):
    """
    Enqueue sample for batch inference. This is a blocking method.
    """
    thread_id = td.get_ident()

    self.waiter.wait()
    self.samples[thread_id] = kwargs
    try:
        self.barrier.wait()
    except td.BrokenBarrierError:
        pass

Let's assume the waiter is set and prediction requests are coming in:

  1. A bunch of requests get added to self.samples until the barrier has been waited on n times.
  2. There's a chance that, while the last barrier-breaking wait is still in progress, some threads have already made it past the self.waiter.wait() instruction and added their samples to self.samples, even though self.samples has already been handed to self._make_batch. If that happens, then at the end of the batch the samples that were never passed to self._make_batch get removed without ever being processed.

The above situation causes _get_prediction to hang indefinitely on the affected threads. Can you confirm this?


With this change in the code, a new problem arises: the max batch size is no longer enforced. To address that, I tweaked the code a bit to pick the samples with the smallest thread IDs whenever there are more samples than the max batch size. And since the number of threads is a known quantity, we can be sure the queue doesn't grow to an enormous size.
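
As a rough illustration of that tweak (a sketch only, with hypothetical names, not the exact code that was merged): when more samples are queued than max_batch_size, only the entries with the smallest IDs are selected for the current batch, and the rest stay queued for the next one.

def select_batch(samples: dict, max_batch_size: int) -> dict:
    """Pick at most max_batch_size samples, preferring the smallest (oldest) IDs."""
    selected_ids = sorted(samples)[:max_batch_size]
    return {sample_id: samples[sample_id] for sample_id in selected_ids}


# usage example: 3 queued samples, batch size limited to 2
queued = {3: "c", 1: "a", 2: "b"}
batch = select_batch(queued, max_batch_size=2)
assert batch == {1: "a", 2: "b"}  # sample 3 waits for the next batch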

Let me know what you think about this!


I also added a test script (the one I used for testing). It isn't the final version, just a temporary one; we'll convert it to a unit test in our test suite.

@RobertLucian RobertLucian self-requested a review March 3, 2021 20:52
@RobertLucian RobertLucian added the bug (Something isn't working) label on Mar 3, 2021
cbensimon (Contributor, Author) commented Mar 4, 2021

Hi @RobertLucian, the library is great; it's a pleasure to contribute!

Yes, I think your view of the problem is right.


I think the max batch size is already enforced as long as it equals the number of threads (which is currently mandatory): each thread can add one sample to the queue at a time, so the sample queue cannot exceed the number of threads. For each thread the cycle is: add a sample, wait for the prediction, delete the sample once the prediction is ready, add another sample, and so on.

But it is totally fine to enforce this explicitly anyway.


Additional note: with the new way of generating the "thread" ID (itertools.count), I realize it actually generates a "sample" ID, so it might be more explicit to rename thread_id* to sample_id*.
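
For illustration, here is a minimal sketch of that idea (hypothetical code, not the implementation that was merged): a shared itertools.count, guarded by a lock, hands out a fresh, monotonically increasing sample ID for every enqueued sample, whereas threading.get_ident() values can be recycled once a thread dies.

import itertools
import threading

_sample_id_counter = itertools.count()
_sample_id_lock = threading.Lock()


def next_sample_id() -> int:
    """Return a unique, never-recycled sample ID."""
    with _sample_id_lock:
        return next(_sample_id_counter)


# usage example
ids = [next_sample_id() for _ in range(3)]
assert ids == [0, 1, 2]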

RobertLucian (Member) commented:

  • Converted the test script to a unit test.
  • Refactored the thread_id* symbols to sample_id*.

@cbensimon let me know if the unit tests look good to you too. Thanks again for your effort!!

@RobertLucian RobertLucian merged commit 0b1b649 into cortexlabs:master Mar 5, 2021
@vishalbollu vishalbollu added this to the 0.31 milestone Mar 16, 2021