
gunicorn, torch, and cuda #243

@keighrim

Bug Description

As found in clamsproject/app-whisper-wrapper#24 (comment), when a CLAMS app runs in HTTP + production mode (app.py --production) with CUDA device support, it runs on a gunicorn WSGI server with multiple workers. Under this scenario, some torch-based CLAMS apps appear to spawn multiple Python processes and load multiple copies of the torch model into GPU memory, eventually resulting in OOM errors.
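For illustration, here is a minimal sketch of the suspected pattern (the module and variable names are assumptions, not the wrapper's actual code): each gunicorn worker is a separate process that imports the app module, so a module-level model load is repeated once per worker.

```python
# app.py -- hypothetical sketch, executed once per gunicorn worker process
import whisper  # openai-whisper; the wrapper's actual import may differ

# Module-level load: with e.g. `gunicorn --workers 4 app:app`, this line runs
# in each of the 4 worker processes, so 4 copies of the checkpoint end up in VRAM.
model = whisper.load_model("large", device="cuda")
```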

Reproduction steps

  1. Pick a computer with a CUDA device (NVIDIA GPU).
  2. Run whisper wrapper v10 (https://apps.clams.ai/whisper-wrapper/v10/) in production mode.
  3. Send multiple POST (annotate) requests simultaneously, or with short gaps between them (see the sketch after this list).
  4. Watch VRAM saturation via e.g. nvidia-smi or a similar monitoring utility.
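A hypothetical reproduction script (the port and input file name are assumptions; adjust to your local setup) that fires several annotate requests at the app concurrently:

```python
from concurrent.futures import ThreadPoolExecutor
import requests

URL = "http://localhost:5000"  # assumed port of the locally running app
with open("input.mmif", "rb") as f:  # some input MMIF document
    mmif = f.read()

def annotate(_):
    # POST the MMIF body to the app and return the HTTP status code
    return requests.post(URL, data=mmif, timeout=600).status_code

with ThreadPoolExecutor(max_workers=4) as pool:
    print(list(pool.map(annotate, range(4))))
```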

Expected behavior

The app should reuse the already-loaded checkpoint/model in memory. Instead, the app loads a new copy of the model for each request and does not release it after the request is processed.
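One possible mitigation (a sketch only, not the app's actual code; the helper name is hypothetical) is to load the model lazily and cache it per worker process, so repeated requests within a worker reuse a single copy:

```python
import threading
import whisper  # openai-whisper, used here only as an example backend

_model = None
_lock = threading.Lock()

def get_model(size: str = "large"):
    """Load the model once per worker process and reuse it afterwards."""
    global _model
    with _lock:
        if _model is None:
            _model = whisper.load_model(size, device="cuda")
        return _model
```

Note that this only deduplicates loads within a single worker; with multiple gunicorn workers, each process still holds its own copy, so capping the worker count (gunicorn's --workers setting) may also be needed to keep VRAM usage bounded.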

Log output

No response

Screenshots

No response

Additional context

Also, it's very likely that this issue shares the same root cause as clamsproject/app-doctr-wrapper#6.
