
Possibility to unload/reload model from VRAM/RAM after IDLE timeout #196

v3DJG6GL opened this issue Feb 15, 2024 · 5 comments

@v3DJG6GL

First of all, thanks for this great project!

Description

I would like an option to set an idle timeout after which the model is unloaded from RAM/VRAM.

Background:

I have several applications that use my GPU's VRAM; one of them is LocalAI.
Since I don't have unlimited VRAM, these applications have to share the available memory among themselves.
Fortunately, LocalAI has for some time now had watchdog functionality that can unload the model after a specified idle timeout (mudler/LocalAI#1341). I'd love to see similar functionality in whisper-asr-webservice.
At the moment, whisper-asr-webservice occupies a third of my VRAM even though it is only used from time to time.
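For illustration, here is a minimal sketch of what such an idle watchdog could look like (all names here, such as `IDLE_TIMEOUT` and `get_model`, are hypothetical and not part of whisper-asr-webservice):

```python
import gc
import threading
import time

from faster_whisper import WhisperModel

IDLE_TIMEOUT = 300  # seconds of inactivity before the model is released (hypothetical setting)

model = None
last_used = time.monotonic()
lock = threading.Lock()

def get_model():
    """Load the model on demand and remember when it was last used."""
    global model, last_used
    with lock:
        if model is None:
            model = WhisperModel("large-v3", device="cuda")
        last_used = time.monotonic()
        return model

def watchdog():
    """Background thread: drop the model once it has been idle long enough."""
    global model
    while True:
        time.sleep(10)
        with lock:
            if model is not None and time.monotonic() - last_used > IDLE_TIMEOUT:
                model = None  # drop the only reference so the backend frees VRAM...
                gc.collect()  # ...once the object is garbage-collected

threading.Thread(target=watchdog, daemon=True).start()
```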

@LuisMalhadas

I'd like to point out that this would bring energy savings as well.

@thfrei

thfrei commented Apr 14, 2024

Wouldn't it be this feature?
mudler/LocalAI#1341

@v3DJG6GL (Author)

> Wouldn't it be this feature? mudler/LocalAI#1341

Yes, that's the PR I also linked up there.

@TigerWolf

I have this same problem and would really like this implemented. Can I help at all?

@Deathproof76

Deathproof76 commented Jul 13, 2024

I've found a slimmed-down version of subgen (a tool specifically for generating subtitles for Plex, or via Bazarr by connecting to them directly) called slim-bazarr-subgen, which pretty much does this. It only connects to Bazarr, uses the latest faster-whisper, and takes about 20 seconds for a 22-minute audio file on an RTX 3090 with distil-large-v3 at int8_bfloat16.

Disclaimer: I'm not a coder, so I'm just guessing and interpreting from limited knowledge.

This slim version uses a task-queue approach that more or less "deletes" the model (purges it from VRAM) once it's done with its tasks, and then reloads the model into VRAM when a new task is queued. The reload takes less than a few seconds on my system (most likely depending on whether the model sits on an SSD or in /dev/shm, for example). While the model is unloaded, the main process only takes up about ~200 MB of VRAM.

Maybe someone more knowledgeable could take a look at the main script; it doesn't seem overly complicated to implement for someone with more experience. In comparison, it would take me more than a week of fumbling about, and I sadly don't have the resources to take on that responsibility right now, so I'm counting on you kind strangers out there! 🙏 It would be fantastic to have this implemented in whisper-asr!

Some excerpts from the main script:

```python
def start_model():
    global model
    if model is None:
        logging.debug("Model was purged, need to re-create")
        model = stable_whisper.load_faster_whisper(
            whisper_model,
            download_root=model_location,
            device=transcribe_device,
            cpu_threads=whisper_threads,
            num_workers=concurrent_transcriptions,
            compute_type=compute_type,
        )

# ...

def delete_model():
    # only purge the model when no work is pending
    if task_queue.qsize() == 0:
        global model
        logging.debug("Queue is empty, clearing/releasing VRAM")
        model = None
        gc.collect()

# ...

    finally:
        task_queue.task_done()
        delete_model()
```
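One caveat worth noting: with faster-whisper, VRAM is freed when the CTranslate2 model object is garbage-collected, so dropping the last reference plus `gc.collect()` is usually enough; frameworks that keep CUDA caches around (PyTorch, for instance) may additionally need something like `torch.cuda.empty_cache()`. As a rough sketch of how the same purge-when-idle pattern could be wrapped around a FastAPI endpoint like the one whisper-asr-webservice exposes (the endpoint name, helpers, and model settings below are illustrative, not the project's actual code):

```python
import gc
import threading

from fastapi import FastAPI, UploadFile
from faster_whisper import WhisperModel

app = FastAPI()
model = None
in_flight = 0  # number of requests currently using the model
state_lock = threading.Lock()

def acquire_model():
    """Recreate the model if it was purged, and count this request as active."""
    global model, in_flight
    with state_lock:
        if model is None:
            model = WhisperModel("large-v3", device="cuda")
        in_flight += 1
        return model

def release_model():
    """When the last active request finishes, drop the model to release VRAM."""
    global model, in_flight
    with state_lock:
        in_flight -= 1
        if in_flight == 0:
            model = None
            gc.collect()

@app.post("/asr")
def asr(audio_file: UploadFile):
    m = acquire_model()
    try:
        segments, _info = m.transcribe(audio_file.file)
        return {"text": " ".join(s.text for s in segments)}
    finally:
        release_model()
```

This mirrors the queue-empty check in `delete_model()` above: the model is purged as soon as no request is using it and recreated on the next call. Combining it with an idle timer instead would avoid reloading the model on every burst of requests.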

Btw: if you're interested in running slim-bazarr-subgen yourself but are still on Ubuntu 22.04 (I was on 23.10, but the same might apply), here's a modified Dockerfile with an older CUDA version; otherwise you might run into problems due to the newer libs/drivers not being available:

```dockerfile
FROM nvidia/cuda:12.3.2-cudnn9-devel-ubuntu22.04
ENV DEBIAN_FRONTEND=noninteractive

RUN apt-get update && apt-get -y upgrade
RUN apt-get install -y python3-pip libcudnn8
RUN apt-get clean

# remove CUDA dev packages that are not needed at runtime to slim the image down
RUN apt remove -y --allow-remove-essential cuda-compat-12-3 cuda-cudart-12-3 cuda-cudart-dev-12-3 cuda-keyring cuda-libraries-12-3 cuda-libraries-dev-12-3 cuda-nsight-compute-12-3 cuda-nvml-dev-12-3 cuda-nvprof-12-3 cuda-nvtx-12-3 ncurses-base ncurses-bin e2fsprogs
RUN apt autoremove -y

COPY requirements.txt /requirements.txt
COPY entrypoint.sh /entrypoint.sh
RUN chmod +x /entrypoint.sh

ENTRYPOINT [ "/entrypoint.sh" ]
```
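If it helps, building and running the image could look roughly like this (the image tag is just a placeholder; check slim-bazarr-subgen's README for the actual port and environment settings):

```bash
docker build -t slim-bazarr-subgen .
docker run --gpus all -d slim-bazarr-subgen
```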
