Gunicorn preload flag not working with PyTorch library #2478

Closed
hsilveiro opened this issue Dec 17, 2020 · 18 comments

@hsilveiro

hsilveiro commented Dec 17, 2020

Hello,
We have been developing a FastAPI application that uses external libraries for NLP tasks such as tokenization. On top of this, we launch the service with Gunicorn so that we can handle requests in parallel.
However, we are having difficulties using Stanza with Gunicorn's preload flag active.
This flag is a requirement for us: since Stanza models can be large, we want them loaded only once, in the Gunicorn master process, so that the workers can share the models loaded there instead of each loading its own copy.

The difficulty we are facing is that Gunicorn workers hang when trying to run inference with a model that was loaded by the master process.

We've done some research and debugging but weren't able to find a solution. We did notice that the worker hangs when the code reaches the prediction step in PyTorch.
Although we are talking about Stanza here, the same problem also occurred with the Sentence Transformers library, and both of them use PyTorch.

Following, I’ll present more details:

Environment:

FastApi version: 0.54.2
Gunicorn version: 20.0.4
Uvicorn version: 0.12.3
Python version: 3.7
Stanza version: 1.1.1
OS: macOS Catalina 10.15.6

Steps executed:

  • Gunicorn command:
gunicorn --workers 1 --worker-class uvicorn.workers.UvicornWorker --max-requests=0 --max-requests-jitter=0 --timeout=120 --keep-alive=2 \
     --log-level=info --access-logfile - --preload -b 0.0.0.0:8010 my_app:app
  • The code that runs before the workers are launched:
import stanza

def initialize_application() -> None:
    ...
    # Load the Stanza pipeline once, in the Gunicorn master process.
    model = stanza.Pipeline(
        lang=cls._TOKENIZER_MODEL_LANGUAGES[language],
        package=cls._MODEL_TYPE[language],
        processors=cls._TOKENIZER_MODEL,
        tokenize_no_ssplit=True,
    )

This way, the model is loaded only once, in the master process.
Once the workers are launched, they should have access to that model without having to load it themselves (saving computational resources).

The problem appears when we receive a request that uses the preloaded model. The worker responsible for handling the request is unable to run inference with the model, and it hangs until the timeout occurs.

After analyzing and debugging the code, we traced the following path to the point where it stops:

  1. Our code calls the process() method of the Pipeline class, in Stanza's core.py file.
  2. That call reaches the processor-specific process() method, in this case in tokenize_processor.py, class TokenizeProcessor.
  3. That method calls the output_predictions() method in utils.py.
  4. After a few more steps, it reaches model.py, class Tokenizer(nn.Module), forward(self, x, feats) method, at the following line: nontok = F.logsigmoid(-tok0). This line ends up in PyTorch's C++ code, which we didn't investigate any further.
Of course, if we remove the --preload flag, everything runs smoothly. We want to avoid removing it, though, because of the extra computational resources it would require (the models would be duplicated in every worker).

We looked through several other issues that could be related to this one, such as:
#2157
tiangolo/fastapi#2425
tiangolo/fastapi#596
#2124
and others...

After trying multiple approaches, we weren't able to solve the issue. Do you have any suggestions for handling this, or other tests I can run to give you more information?

Thanks in advance.

P.S.: I also opened issues on the Stanza and PyTorch GitHub pages:

@jamadden
Collaborator

Is the underlying C library known to be fork-safe? Not all libraries can survive a fork. For example, if they hold a lock at the time of forking, it will never be unlocked in the child processes so they will simply stop, unable to acquire the lock.

On macOS, many system libraries are not fork-safe. There's at least one environment variable you can set (export OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES) that helps with some of that, but IIRC what it does is turn an outright crash into a maybe-it-works-maybe-it-doesn't situation, depending on the state of the process. (Google that variable for more.)

Getting a native stack trace of a hung worker could provide insight. On most platforms, I would suggest using py-spy, but py-spy can't get native stack traces on macOS. You could try using Activity Monitor to "Sample Process" the hung worker; that might reveal something.

@jamadden
Collaborator

I'll add that some libraries offer APIs to call after a fork to fix up the process state (e.g., gevent has some after_fork/reinit functions; it mostly arranges to call those automatically, but sometimes they have to be called manually). You might look for one of those and, if you find it, call it in a Gunicorn hook.
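
For reference, a minimal sketch of wiring such an after-fork call into a Gunicorn config file; some_library.reinit_after_fork is a hypothetical placeholder for whatever API a given library actually provides, not a real function:

# gunicorn.conf.py (sketch only)
import some_library  # hypothetical library exposing a fork-repair API

def post_fork(server, worker):
    # Gunicorn runs this hook in each worker process right after the fork,
    # before the worker starts serving requests, which makes it a good place
    # to repair per-process state.
    some_library.reinit_after_fork()  # hypothetical API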

@hsilveiro
Author

hsilveiro commented Dec 22, 2020

Hi again,
I tested the following things based on what you said:

  • Updated the environment variable OBJC_DISABLE_INITIALIZE_FORK_SAFETY to YES -> The problem remained.
  • Used Activity Monitor to "Sample Process" the hung worker -> Looked at where the code stops in the C code, but couldn't figure out why.

I had already tried using gevent and the problem remained. However, I'll look into the functions you mentioned (after_fork / reinit) and see if I get better results.

If you have any more suggestions, let me know.

Thanks!

@jamadden
Collaborator

Posting the stack sample you observed would help others take a look.

@hsilveiro
Author

Sure, here is the exported file from the "Sample process" on the hung worker:
sample_worker_process_preload.txt

@jamadden
Collaborator

The last lines of the stack trace are very enlightening:

1 _PyMethodDef_RawFastCallKeywords  (in Python) + 685  [0x1033275ed]
2 torch::autograd::THPVariable_log_sigmoid(_object*, _object*, _object*)  (in libtorch_python.dylib) + 299  [0x114a3ec1b]
...
3 at::native::(anonymous namespace)::log_sigmoid_cpu_kernel(at::Tensor&, at::Tensor&, at::Tensor const&)  (in libtorch_cpu.dylib) + 994  [0x118f31f52]
4 at::internal::_parallel_run(long long, long long, long long, std::__1::function<void (long long, long long, unsigned long)> const&)  (in libtorch_cpu.dylib) + 1160  [0x11586b358]
5 std::__1::condition_variable::wait(std::__1::unique_lock<std::__1::mutex>&)  (in libc++.1.dylib) + 18  [0x7fff6a323592]
6 _pthread_cond_wait  (in libsystem_pthread.dylib) + 698 [0x7fff6d255425]
7 __psynch_cvwait  (in libsystem_kernel.dylib) + 10  [0x7fff6d194882]

The first two lines tell us that the Python code has called THPVariable_log_sigmoid(). The internal details of that function wind up at line 4, parallel_run; the name is highly suggestive that this will want to do something with threads.

Sure enough, lines 5 through 7 show this thread trying to wait for a low-level threading primitive to become available. If that never happens, this thread never proceeds. The process hangs.

This goes back to what I suggested initially:

Is the underlying C library known to be fork-safe? Not all libraries can survive a fork. For example, if they hold a lock at the time of forking, it will never be unlocked in the child processes so they will simply stop, unable to acquire the lock.

It looks like that's exactly what has happened. The master gunicorn process used some API in libtorch that acquired a lock; when the process forked, that lock was still held in the child, and there is no way to unlock it.

If libtorch is meant to be fork-safe, there should be some way either to avoid taking that lock in the master process or to reset the state in the child process. You can look for APIs that do that.

Otherwise, you may have to experiment to find out exactly how much it is safe to do in the master process while still avoiding this problem. I recommend making sure that there are no extra threads running at the time of the fork; be sure to shut down/clean up all uses of this library before the fork.
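
One concrete experiment in that direction (an assumption on my part, not something verified in this thread): since the hang is inside at::internal::_parallel_run, you could try keeping libtorch's intra-op thread pool out of the master process entirely, for example from a Gunicorn config file:

# gunicorn.conf.py -- sketch of the "no extra threads in the master" idea;
# whether this is enough to make --preload safe here is untested.
import torch

def on_starting(server):
    # Runs once in the Gunicorn master, before any worker is forked.
    # Limiting torch to a single intra-op thread keeps the master from
    # spinning up the thread pool (and its locks) that a forked child
    # would otherwise inherit in a locked state.
    torch.set_num_threads(1)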

@tilgovi
Collaborator

tilgovi commented Mar 8, 2021

Thanks for opening this issue. It should help others who have similar problems.

Thank you, Jason, for the clear diagnosis and helpful links.

At this time, I think there's nothing for Gunicorn to do here, and I will close this issue. Please let us know if that is a mistake.

@reuben

reuben commented Jan 4, 2022

FWIW there's a workaround for this if your goal is to prevent worst case latency on the first request to a worker: "preload" manually in your application factory. Do a forward pass yourself with some example data after you create the app instance. You won't be able to re-use pages across processes but it's an effective way to warm up the worker pool. Note that worker boot is subject to the same timeout as request handling, so you might need to bump --timeout.
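
A rough sketch of that warm-up idea, assuming a FastAPI application factory and a Stanza tokenizer like the one in the original report (parameter values are illustrative, not the poster's actual code):

# app_factory.py (illustrative sketch, run without --preload)
import stanza
from fastapi import FastAPI

def create_app() -> FastAPI:
    app = FastAPI()
    # Each worker loads its own copy of the model (no sharing), then runs one
    # throwaway inference so the first real request doesn't pay the warm-up cost.
    pipeline = stanza.Pipeline(lang="en", processors="tokenize",
                               tokenize_no_ssplit=True)
    pipeline("This is a warm-up sentence.")
    app.state.pipeline = pipeline
    return app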

@ibraheem-tuffaha

I had a similar problem with PyTorch when running more than 1 worker.
A simple workaround was to increase the number of threads to more than 1.
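
It isn't stated whether this means Gunicorn's thread count or torch's, so the following is only one possible reading, sketched as a Gunicorn config file (note that the threads setting only takes effect with the gthread worker class, not with UvicornWorker):

# gunicorn.conf.py -- one possible reading of "more than 1 thread"
workers = 2
threads = 4
worker_class = "gthread"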

@lsmith77

it seems like PyTorch shouldn't be run this way:
#2608 (comment)

@mmathys

mmathys commented Oct 4, 2023

Thanks for the analysis, had the same issue.

@ciliamadani

@hsilveiro I'm in the same situation; how did you solve this?

@mmathys

mmathys commented Dec 29, 2023

@ciliamadani Maybe I can comment as well. I ended up initializing PyTorch after the workers are forked, using the post_fork hook.

@lsmith77

So, in other words, you bypassed the preload behavior for your PyTorch models.

@loretoparisi

@mmathys How did you achieve that?

I have tried

def post_fork(server, worker):
    # Attach the backend to the app object once per worker; Backend and
    # config are defined elsewhere in my code.
    if not hasattr(worker.app, 'backend'):
        my_shared_backend = Backend(config)
        worker.app.backend = my_shared_backend
I can see that worker.app is a shared <gunicorn.app.wsgiapp.WSGIApplication object at 0x7f84f00ca460> instance, but the hasattr check always fails.

@mmathys

mmathys commented Mar 7, 2024

Actually, it ended up not working, @loretoparisi. I don't remember why.

@loretoparisi

Yes, thanks. In my understanding there's no other (or better) way than the preload flag with Gunicorn. Using Tornado + asyncio instead, it works fine.
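
For context, a minimal single-process sketch of that pattern (illustrative only, not the poster's actual code): the model is loaded once at startup and served from a single Tornado/asyncio process, so there is no fork and therefore no preload issue:

# tornado_app.py (illustrative sketch)
import asyncio
import stanza
import tornado.web

# Loaded once at process start; this single process serves all requests.
pipeline = stanza.Pipeline(lang="en", processors="tokenize")

class TokenizeHandler(tornado.web.RequestHandler):
    def post(self):
        doc = pipeline(self.request.body.decode("utf-8"))
        self.write({"sentences": [[token.text for token in sentence.tokens]
                                  for sentence in doc.sentences]})

async def main():
    app = tornado.web.Application([(r"/tokenize", TokenizeHandler)])
    app.listen(8010)
    await asyncio.Event().wait()

if __name__ == "__main__":
    asyncio.run(main())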

@mmathys

mmathys commented Mar 7, 2024

Thanks @loretoparisi for the hint! We still have this issue, will try out Tornado (or another alternative framework)
