
difficulties involved with inference on a consumer GPU #4

Open
152334H opened this issue Apr 17, 2023 · 12 comments

Comments

@152334H
Contributor

152334H commented Apr 17, 2023

Here are the problems I've found today:

  1. Vicuna is loaded as fp16. This is a problem for obvious reasons (13B parameters × 2 bytes ≈ 26 GB, more VRAM than any consumer GPU has).
  2. The beam-search default of 5 beams consumes a lot of VRAM during generation.

To address these problems, I have created a fork where:

  • Vicuna is loaded in 8-bit (roughly as sketched after this list)
  • num_beams is set to 1 by default
  • I also put ViT-L on the CPU (as fp32), because the encoder only needs a single pass
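
Not the actual fork code, just a rough sketch of what 8-bit loading with a single beam looks like with Hugging Face transformers (requires bitsandbytes; the model path is a placeholder for your converted Vicuna weights):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "path/to/vicuna-13b-v0"  # placeholder: your converted HF-format weights

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    load_in_8bit=True,   # quantize linear layers to int8 via bitsandbytes
    device_map="auto",   # place weights on the available GPU(s)
)

inputs = tokenizer("Describe this image:", return_tensors="pt").to(model.device)
output = model.generate(**inputs, num_beams=1, max_new_tokens=64)  # single beam keeps VRAM low
print(tokenizer.decode(output[0], skip_special_tokens=True))
```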
@152334H
Contributor Author

152334H commented Apr 17, 2023

Note: I have discovered that it is still possible to OOM on a 3090 after repeated inference; I may be hitting a memory leak somewhere.
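
Not from the MiniGPT-4 code, just a generic PyTorch pattern for chasing OOMs that only show up after repeated inference; `generate_fn` is a hypothetical stand-in for whatever call runs generation:

```python
import gc
import torch

def run_once(generate_fn, *args, **kwargs):
    with torch.inference_mode():   # no autograd graph is retained across requests
        out = generate_fn(*args, **kwargs)
    gc.collect()                   # drop stray Python references
    torch.cuda.empty_cache()       # return cached blocks so leaks show up in nvidia-smi
    print(f"allocated: {torch.cuda.memory_allocated() / 2**30:.2f} GiB")
    return out
```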

@TsuTikgiau
Collaborator

Thanks for your help on this! We have included your updated code now. For running on a consumer GPU, we plan to train a Vicuna 7B version of our method within two days. I think this should reduce GPU memory pressure dramatically.

@Snowad14

I haven't been able to try it on the demo site or without the optimizations, but I'm getting very... peculiar results.

[Screenshots: Capture d’écran 2023-04-17 180810, Capture d’écran 2023-04-17 181121]

@152334H
Contributor Author

152334H commented Apr 17, 2023

@Snowad14 Complete guess -- you might've loaded the wrong LLaMA weights. Check that you are using Vicuna-13B-v0.

@Snowad14

I was too lazy to download the LLaMA weights and do the conversion myself, so I downloaded what seems to be a directly converted version of the old weights (https://huggingface.co/Ejafa/vicuna_13B_vanilla), but those may be the wrong weights. I'll try converting them myself.

@cibernicola

cibernicola commented Apr 18, 2023

So is there a way to load Vicuna in 8-bit?

I mean, will it work on a 3090 under Windows? :D

@yhyu13

yhyu13 commented Apr 18, 2023

@TsuTikgiau It would be even better to use llama.cpp with the ggml format, which compresses the 7B model down to roughly 4 GB and the 13B model down to roughly 7 GB. Plus, all of them run on the CPU with vectorized instructions, with performance comparable to running on a 3090 (given its nerfed AI inference compute).

https://huggingface.co/eachadea/legacy-ggml-vicuna-7b-4bit

There is already a popular project built on CPU inference that works like a charm: https://github.com/nomic-ai/gpt4all-ui
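
For reference, a minimal sketch (not part of MiniGPT-4) of running a 4-bit ggml Vicuna on the CPU via the llama-cpp-python bindings; the model path is an assumption pointing at the quantized file from the Hugging Face link above:

```python
from llama_cpp import Llama

# CPU-only inference on the quantized ggml weights
llm = Llama(model_path="./ggml-vicuna-7b-4bit.bin", n_threads=8)
result = llm("### Human: Describe a sunset.\n### Assistant:", max_tokens=128)
print(result["choices"][0]["text"])
```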

@TsuTikgiau
Collaborator

@cibernicola the default setting of the demo now loads Vicuna in 8-bit and uses less than 24 GB of GPU memory if you keep the beam width at 1.

@TsuTikgiau
Collaborator

TsuTikgiau commented Apr 18, 2023

@yhyu13 Thank you for referring this to us! We will look into it when we have time in the coming days :D

@zxcvbn114514

How do I put ViT-L back on the GPU? The model is running extremely slowly now, while VRAM usage is only 17 GB of the 3090's 24 GB.
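
For what it's worth, moving a module back to the GPU is usually just a device/dtype cast; the attribute name below is a guess at where the fork keeps the ViT, not the repo's confirmed API:

```python
# Hypothetical: `model.visual_encoder` stands in for wherever the ViT-L module lives.
model.visual_encoder = model.visual_encoder.half().to("cuda")
```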

@152334H
Contributor Author

152334H commented Apr 22, 2023

That cannot be the problem. ViT-L only needs a single pass; 99% of the work should be done in Vicuna alone, in the autoregressive loop.
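
A rough way to check where the time actually goes; `encode_image` and `generate_answer` are hypothetical stand-ins for the ViT-L forward pass and the Vicuna generation call:

```python
import time

def timed(label, fn, *args, **kwargs):
    t0 = time.perf_counter()
    result = fn(*args, **kwargs)
    print(f"{label}: {time.perf_counter() - t0:.2f}s")
    return result

# feats = timed("ViT-L encode (one pass)", encode_image, image_tensor)
# answer = timed("Vicuna autoregressive generation", generate_answer, prompt, feats)
```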

@zxcvbn114514

That cannot be the problem. ViT-L only needs a single pass; 99% of the work should be done in Vicuna alone, in the autoregressive loop.
[screenshot]
So is it normal for the 3090 to run at only about 80 watts while the model is running?

TsuTikgiau added a commit that referenced this issue Oct 16, 2023
Merge pull request #376 from TsuTikgiau/main