difficulties involved with inference on a consumer GPU #4
Note: I have discovered that it is still possible to OOM on a 3090 after repeated inference. We may be hitting a memory leak somewhere.
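As a side note, here is a minimal sketch of a common mitigation for OOM after repeated inference, assuming a HuggingFace-style `generate()` API; `model` and `inputs` are placeholders, not this repository's variables. The idea is simply to avoid keeping per-request tensors alive and to release PyTorch's cached blocks between requests.

```python
# Minimal sketch, not the repository's code: release per-request GPU memory between runs.
import gc
import torch

def generate_once(model, inputs):
    """Run one generation without keeping the autograd graph or GPU outputs alive."""
    with torch.no_grad():                  # no graph retention during inference
        output = model.generate(**inputs)  # assumes a HuggingFace-style generate()
    return output.cpu()                    # move the result off the GPU

# After each request, drop Python references and return cached blocks to the allocator:
gc.collect()
torch.cuda.empty_cache()
```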
Thanks for your help on this! We have now included your updated code. For running on a consumer GPU, we plan to train a Vicuna 7B version of our method within two days. I think this should reduce GPU memory pressure dramatically.
@Snowad14 Complete guess -- you might've loaded the wrong LLaMA weights. Check that you are using Vicuna-13B-v0?
I was too lazy to download the LLaMA weights and run the conversion myself, so I downloaded what seems to be an already-converted version of the old weights (https://huggingface.co/Ejafa/vicuna_13B_vanilla), but those may be the wrong weights. I'll try converting them myself.
So is there a way to load Vicuna in 8-bit? I mean, will it work on a 3090 under Windows? :D
@TsuTikgiau It would be even better to use llama.cpp with the ggml format, which compresses the 7B model to roughly 4 GB and the 13B model to roughly 7 GB. Plus, all of them run on the CPU with vectorized instructions, with performance comparable to running on a 3090 (given the 3090's nerfed AI inference capacity). https://huggingface.co/eachadea/legacy-ggml-vicuna-7b-4bit There is already a popular project built on CPU inference that works like a charm: https://github.com/nomic-ai/gpt4all-ui
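For reference, a hedged sketch of CPU inference with a 4-bit ggml checkpoint through the llama-cpp-python bindings; the model path and prompt below are placeholders, not files shipped with this repository.

```python
# Sketch: 4-bit ggml Vicuna on CPU via llama-cpp-python (pip install llama-cpp-python).
from llama_cpp import Llama

llm = Llama(model_path="./ggml-vicuna-7b-4bit.bin", n_threads=8)  # weights from the link above
result = llm("### Human: Summarize what this model can do.\n### Assistant:", max_tokens=128)
print(result["choices"][0]["text"])
```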
@cibernicola The default setting of the demo now loads Vicuna in 8-bit and uses less than 24 GB of GPU memory if you keep the beam width at 1.
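For anyone reproducing this outside the demo, here is a rough sketch of 8-bit loading with transformers + bitsandbytes; the model path is a placeholder for your converted Vicuna weights, and this is not MiniGPT-4's actual loader.

```python
# Sketch: load a 13B model in 8-bit so it fits under 24 GB of VRAM.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "path/to/vicuna-13b"   # placeholder: your converted Vicuna weights
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_8bit=True,   # int8 weights via bitsandbytes
    device_map="auto",   # let accelerate place layers on the available GPU(s)
)

inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, num_beams=1, max_new_tokens=32)[0]))
```

Keeping `num_beams=1` matters here because beam search multiplies the KV cache by the beam width.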
@yhyu13 Thank you for referring us to this! We will consider it when we have time these days :D
How do I put ViT-L back on the GPU? The model is running extremely slowly now, while VRAM usage is only 17 GB out of the 24 GB on my 3090.
That cannot be the problem. ViT-L is supposed to take only a single forward pass; 99% of the work should happen in Vicuna alone, inside the autoregressive loop.
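For intuition, here is a toy, self-contained timing sketch in plain PyTorch (not MiniGPT-4's actual modules; the layer sizes and the 256-step count are made up) contrasting a one-shot encoder pass with per-token decoding:

```python
# Toy sketch: an encoder runs once per image, the decoder runs once per generated token.
import time
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(1024, 1024), nn.GELU(), nn.Linear(1024, 1024))  # stand-in for ViT-L
decoder = nn.Sequential(nn.Linear(1024, 1024), nn.GELU(), nn.Linear(1024, 1024))  # stand-in for one LLM step

image = torch.randn(1, 1024)

with torch.no_grad():
    t0 = time.time()
    image_embeds = encoder(image)   # single pass over the image
    encoder_time = time.time() - t0

    t0 = time.time()
    hidden = image_embeds
    for _ in range(256):            # one pass per generated token
        hidden = decoder(hidden)
    decoder_time = time.time() - t0

print(f"encoder (1 pass): {encoder_time:.4f}s | decoder (256 passes): {decoder_time:.4f}s")
```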
Merge pull request #376 from TsuTikgiau/main
Here are the problems I've found today:
To address these problems, I have created a fork where: