
gunicorn, torch, and cuda #243

@keighrim

Bug Description

As found in clamsproject/app-whisper-wrapper#24 (comment), when a CLAMS app runs in HTTP + production mode (app.py --production) with CUDA device support, it runs on a gunicorn WSGI server with multiple workers. Under this scenario, some torch-based CLAMS apps appear to spawn multiple Python processes and load multiple copies of the torch model into GPU memory, eventually resulting in OOM errors.
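For illustration, here is a minimal sketch of the suspected pattern (the module and variable names are assumptions, not the wrapper's actual code): each gunicorn worker is a separate process that imports the app module, so a module-level model load is repeated once per worker.

```python
# app.py -- hypothetical sketch, executed once per gunicorn worker process
import whisper  # openai-whisper; the wrapper's actual import may differ

# Module-level load: with e.g. `gunicorn --workers 4 app:app`, this line runs
# in each of the 4 worker processes, so 4 copies of the checkpoint end up in VRAM.
model = whisper.load_model("large", device="cuda")
```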

Reproduction steps

  1. Pick a computer with a CUDA device (NVIDIA GPU).
  2. Run whisper wrapper v10 (https://apps.clams.ai/whisper-wrapper/v10/) in production mode.
  3. Send multiple POST (annotate) requests simultaneously, or with short gaps between them (see the sketch after this list).
  4. Watch VRAM saturation via e.g. nvidia-smi or a similar monitoring utility.
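A hypothetical reproduction script (the port and input file name are assumptions; adjust to your local setup) that fires several annotate requests at the app concurrently:

```python
from concurrent.futures import ThreadPoolExecutor
import requests

URL = "http://localhost:5000"  # assumed port of the locally running app
with open("input.mmif", "rb") as f:  # some input MMIF document
    mmif = f.read()

def annotate(_):
    # POST the MMIF body to the app and return the HTTP status code
    return requests.post(URL, data=mmif, timeout=600).status_code

with ThreadPoolExecutor(max_workers=4) as pool:
    print(list(pool.map(annotate, range(4))))
```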

Expected behavior

The app should reuse the already-loaded checkpoint/model in memory. Instead, the app loads a new copy of the model for each request and does not release it after the request is processed.
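One possible mitigation (a sketch only, not the app's actual code; the helper name is hypothetical) is to load the model lazily and cache it per worker process, so repeated requests within a worker reuse a single copy:

```python
import threading
import whisper  # openai-whisper, used here only as an example backend

_model = None
_lock = threading.Lock()

def get_model(size: str = "large"):
    """Load the model once per worker process and reuse it afterwards."""
    global _model
    with _lock:
        if _model is None:
            _model = whisper.load_model(size, device="cuda")
        return _model
```

Note that this only deduplicates loads within a single worker; with multiple gunicorn workers, each process still holds its own copy, so capping the worker count (gunicorn's --workers setting) may also be needed to keep VRAM usage bounded.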

Log output

No response

Screenshots

No response

Additional context

Also, it's very likely that this issue shares the same root cause as clamsproject/app-doctr-wrapper#6.
