Bug Description
As found in clamsproject/app-whisper-wrapper#24 (comment), when a CLAMS app runs in HTTP + production mode (app.py --production) with CUDA device support, it runs over the gunicorn WSGI server with multiple workers. In this scenario, some torch-based CLAMS apps spawn multiple Python processes and load multiple copies of the torch model into GPU memory, eventually resulting in OOM errors.
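A minimal sketch of the failure mode, assuming (hypothetically) that the wrapper loads its Whisper checkpoint inside the request handler; none of the names below come from the actual app code:
```python
# sketch_app.py -- hypothetical illustration, not the whisper-wrapper source.
# The model is loaded inside the request handler, so every request (and
# every gunicorn worker process serving requests in parallel) ends up
# with its own copy of the weights in VRAM.
from flask import Flask, request
import whisper  # openai-whisper

app = Flask(__name__)

@app.route("/", methods=["POST"])
def annotate():
    # loaded per request; with N gunicorn workers handling requests
    # concurrently, multiple copies of the checkpoint sit on the GPU
    model = whisper.load_model("large", device="cuda")
    result = model.transcribe(request.get_json()["audio_path"])
    return {"text": result["text"]}
```
Running something like this under gunicorn -w 4 starts four independent Python processes, so even a handful of overlapping requests can saturate VRAM.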
Reproduction steps
- pick a computer with a CUDA device (nvidia gpu).
- run whisper wrapper v10 (https://apps.clams.ai/whisper-wrapper/v10/) in the production mode.
- send multiple POST (annotate) requests simultaneously or with short gaps between them (see the request sketch after this list).
- watch VRAM saturation via e.g. nvidia-smi or a similar monitoring util.
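A small script for firing the concurrent requests in step 3; the URL, port, and input.mmif file are assumptions for illustration, not taken from the app's documentation:
```python
# send_concurrent.py -- sketch for reproducing the issue, assuming the
# wrapper listens on localhost:5000 and accepts a MMIF payload via POST
# (both assumptions; adjust to the actual deployment).
from concurrent.futures import ThreadPoolExecutor
import requests

URL = "http://localhost:5000/"
PAYLOAD = open("input.mmif").read()

def post_once(i):
    resp = requests.post(URL, data=PAYLOAD)
    return i, resp.status_code

# fire several annotate requests at (nearly) the same time; each may land
# on a different gunicorn worker, and each worker loads its own model copy
with ThreadPoolExecutor(max_workers=8) as pool:
    for i, status in pool.map(post_once, range(8)):
        print(f"request {i}: HTTP {status}")
```
While this runs, something like watch -n1 nvidia-smi shows GPU memory usage climbing with each additional in-flight request.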
Expected behavior
The app should reuse the already-loaded checkpoint/model in memory. Instead, the app loads the model anew for each request and then does not release it after the request is completed.
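A minimal sketch of the expected behavior, not a proposed patch; the names are hypothetical and the caching is per process only:
```python
# hypothetical sketch: load the checkpoint once per worker process and
# reuse it across requests, instead of re-loading inside the handler
import functools
import whisper  # openai-whisper

@functools.lru_cache(maxsize=None)
def get_model(size: str = "large"):
    # first call loads the checkpoint onto the GPU; later calls in the
    # same process return the cached instance
    return whisper.load_model(size, device="cuda")

def annotate(audio_path: str) -> str:
    return get_model().transcribe(audio_path)["text"]
```
Note that this only de-duplicates within a single worker; with several gunicorn workers, each process would still hold its own copy, so the worker count likely needs to be capped as well.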
Log output
No response
Screenshots
No response
Additional context
Also, it's very likely that this issue shares the same root cause as clamsproject/app-doctr-wrapper#6.