Inference on GPU #4
Comments
According to my napkin math, even the smallest model with 7B parameters will probably take close to 30GB of space, so 8GB is unlikely to suffice. But I have no access to the weights yet; this is just a rough guess.
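A quick way to sanity-check this kind of napkin math is to multiply the parameter count by the bytes per weight. The sketch below is only a rough illustration (the 7B count and dtype sizes are assumptions, and it ignores activations and the KV cache):

```python
# Rough VRAM estimate for the model weights alone
# (ignores activations, KV cache, and framework overhead).
def weight_memory_gb(n_params: float, bytes_per_param: int) -> float:
    return n_params * bytes_per_param / 1024**3

n_params = 7e9  # assumed parameter count for the smallest model
for name, size in [("fp32", 4), ("fp16", 2), ("int8", 1)]:
    print(f"{name}: ~{weight_memory_gb(n_params, size):.1f} GB")
# fp32: ~26.1 GB, fp16: ~13.0 GB, int8: ~6.5 GB
```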
Could be possible with https://github.com/FMInference/FlexGen
This project looks amazing 🤩. However, in its example, it seems like a 6.7B OPT model would still need at least 15GB of GPU memory, so the chances are slim 🥲. I would love to run it on my 3080 10GB.
FlexGen only supports OPT models.
With KoboldAI I was able to run GPT-J 6B on my 8GB 3070 Ti by offloading part of the model to my RAM.
7B in float16 would be 14GB, and if quantized to uint8 it could be as low as 7GB. But on graphics cards, from what I've tried with other models, it can take 2x the VRAM. My guess is that 32GB would be the minimum, though some clever person may be able to run it with 16GB of VRAM. The question is how fast it would be: if it's one character per second, it would not be that useful!
Can I use the model on an Intel Iris Xe graphics card? I'd also appreciate a recommendation for which libraries to use, if possible.
How fast was it?
The 7B model generates quickly on a 3090 Ti (~30 seconds for ~500 tokens, ~17 tokens/s), much faster than the ChatGPT interface. It uses ~14GB of VRAM during generation. A screen recording is attached (Recording.2023-03-02.225512.mp4). See my fork for the code for rolling generation and the Gradio interface.
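For anyone wanting to reproduce this kind of throughput number, a minimal timing sketch might look like the following; the generation call is a placeholder for whatever interface your fork exposes, and the token count is assumed rather than read back from the output:

```python
import time

prompt = "Building a website can be done in 10 simple steps:"

start = time.time()
# Placeholder generation call; substitute your own model/generator here.
output = generator.generate([prompt], max_gen_len=500)
elapsed = time.time() - start

n_tokens = 500  # assumed number of generated tokens
print(f"~{n_tokens / elapsed:.1f} tokens/s over {elapsed:.1f}s")
```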
Trying to run the 7B model in Colab with a 15GB GPU fails. Is there a way to configure it to use fp16, or is that already baked into the existing model?
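On the fp16 question: whether inference actually runs in half precision depends on how the model tensors are materialized at load time. A hedged sketch of forcing fp16 weights with plain PyTorch (assuming a standard checkpoint-loading path, not necessarily the repo's exact example script):

```python
import torch

# Make newly created tensors default to fp16 on the GPU before building the
# model, so weights are materialized in half precision rather than fp32.
torch.set_default_tensor_type(torch.cuda.HalfTensor)

# ... build the model and load the checkpoint here, e.g.:
# model = Transformer(model_args)
# checkpoint = torch.load("consolidated.00.pth", map_location="cpu")
# model.load_state_dict(checkpoint, strict=False)

# Restore the default afterwards if the rest of the code expects fp32 on CPU.
torch.set_default_tensor_type(torch.FloatTensor)
```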
I was able to run 7B on two 1080 Tis (inference only). Next, I'll try 13B and 33B. It still needs refining, but it works! I forked LLaMA here: https://github.com/modular-ml/wrapyfi-examples_llama and have a README with instructions on how to do it:

LLaMA with Wrapyfi
Wrapyfi enables distributing LLaMA (inference only) across multiple GPUs/machines, each with less than 16GB of VRAM. It currently distributes over two cards only, using ZeroMQ, but will support flexible distribution soon! This approach has only been tested on the 7B model so far, on Ubuntu 20.04 with two 1080 Tis. Testing the 13B/30B models soon! A generic sketch of the layer-splitting idea appears after this comment. How to?
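To illustrate why two smaller cards can work at all, here is a toy sketch of partitioning a decoder stack across two GPUs so neither card holds all of the weights. This is not the Wrapyfi implementation (which communicates between processes/machines over ZeroMQ); it is only a conceptual illustration, and the layer sizes are made up:

```python
import torch
import torch.nn as nn

class SplitModel(nn.Module):
    """Toy stand-in for a decoder stack split across two GPUs."""

    def __init__(self, layers: nn.ModuleList, split_at: int):
        super().__init__()
        # First half of the layers lives on GPU 0, second half on GPU 1.
        self.first = nn.Sequential(*layers[:split_at]).to("cuda:0")
        self.second = nn.Sequential(*layers[split_at:]).to("cuda:1")

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.first(x.to("cuda:0"))
        x = self.second(x.to("cuda:1"))  # move activations, not weights
        return x

# Hypothetical example: eight 4096-wide layers split evenly across two cards.
layers = nn.ModuleList([nn.Linear(4096, 4096) for _ in range(8)])
model = SplitModel(layers, split_at=4)
out = model(torch.randn(1, 4096))
```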
@fabawi Good work. 👍
@bjoernpl Works great, thanks! Have you tried changing the Gradio interface to use the Gradio chatbot component?
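For reference, a minimal sketch of wiring the Gradio Chatbot component to a generation function; the `generate_reply` helper here is hypothetical, and the fork would plug in its own generation call:

```python
import gradio as gr

def generate_reply(message: str) -> str:
    # Hypothetical placeholder; call the actual LLaMA generator here.
    return "model reply for: " + message

def respond(message, history):
    history = history + [(message, generate_reply(message))]
    return history, ""  # clear the textbox after sending

with gr.Blocks() as demo:
    chatbot = gr.Chatbot()
    msg = gr.Textbox(placeholder="Type a message and press Enter")
    msg.submit(respond, inputs=[msg, chatbot], outputs=[chatbot, msg])

demo.launch()
```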
Thank you! Works great.
I think this doesn't quite fit, since LLaMA is not fine-tuned for chatbot-like capabilities. It would definitely be possible (even if it probably doesn't work too well) to use it as a chatbot with some clever prompting. Might be worth a try; thanks for the idea and the feedback.
Closing this issue - great work @fabawi!!
Is it possible to host this locally on an RTX3XXX or 4XXX with 8GB just to test?