Inference on GPU #4

Closed · sarmientoj24 opened this issue Feb 24, 2023 · 16 comments

@sarmientoj24

Is it possible to host this locally on an RTX 3xxx or 4xxx with 8GB, just to test?

@dizys

dizys commented Feb 24, 2023

According to my napkin math, even the smallest model, with 7B parameters, will probably take close to 30GB of space, so 8GB is unlikely to suffice. But I don't have access to the weights yet; this is just a rough guess.
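For anyone redoing the napkin math: parameter memory alone is roughly parameter count times bytes per parameter. A minimal sketch (ignoring activations, the KV cache, and framework overhead, so real usage will be higher):

```python
# Rough parameter-memory estimate for a 7B-parameter model, by dtype.
# Ignores activations, KV cache, and framework overhead.
params = 7e9

for dtype, bytes_per_param in [("float32", 4), ("float16", 2), ("int8", 1)]:
    gib = params * bytes_per_param / 1024**3
    print(f"{dtype}: ~{gib:.1f} GiB")

# float32: ~26.1 GiB  (roughly the ~30GB ballpark above)
# float16: ~13.0 GiB
# int8:    ~6.5 GiB
```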

@ekiwi111

Could be possible with https://github.com/FMInference/FlexGen

@dizys

dizys commented Feb 24, 2023

> Could be possible with https://github.com/FMInference/FlexGen

This project looks amazing 🤩. However, in its example, it seems like a 6.7B OPT model would still need at least 15GB of GPU memory. So the chances are slim 🥲. I'd really love to run it on my 3080 10GB.

@kir152

kir152 commented Feb 25, 2023

FlexGen only supports OPT models.

@CyberTimon

With KoboldAI I was able to run GPT-J 6B on my 8GB 3070 Ti by offloading the model to my RAM.
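KoboldAI handles this through its own layer-offloading setting, but for reference, the same idea (capping GPU memory and spilling the rest to system RAM) can be sketched with Hugging Face transformers + accelerate. The model id and memory caps below are illustrative, not what KoboldAI does internally:

```python
# Sketch of GPU/CPU layer offloading with transformers + accelerate
# (illustrative; KoboldAI uses its own offloading mechanism).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "EleutherAI/gpt-j-6b"  # a 6B model comparable to the one above

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",                        # let accelerate place layers
    max_memory={0: "7GiB", "cpu": "24GiB"},   # cap the 8GB GPU, spill to RAM
)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to(0)  # first GPU
out = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```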

@pauldog

pauldog commented Feb 25, 2023

7B in float16 will be 14GB, and if quantized to uint8 it could be as low as 7GB. But on graphics cards, from what I've tried with other models, it can take 2x that in VRAM.

My guess is that 32GB would be the minimum, but some clever person may be able to run it with 16GB of VRAM.

But the question is, how fast would it be? If it's one character per second, it won't be that useful!
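On the uint8 point: the reference code doesn't expose quantization, but as a rough sketch of what 8-bit loading usually looks like via transformers + bitsandbytes (the checkpoint path below is a placeholder for HF-format converted weights, not something this repo ships):

```python
# Sketch of int8-quantized loading with transformers + bitsandbytes.
# The path below is a placeholder for weights converted to HF format.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "/path/to/llama-7b-hf"  # placeholder

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    load_in_8bit=True,   # quantize linear layers to int8 at load time
    device_map="auto",
)

# Parameter memory should land near the ~7GB estimate above.
print(f"~{model.get_memory_footprint() / 1024**3:.1f} GiB of weights")
```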

@tmgthb

tmgthb commented Feb 25, 2023

Can I use the model on an Intel Iris Xe graphics card?

If possible, I'd also appreciate a recommendation on which libraries to use.

@pauldog

pauldog commented Feb 27, 2023

> With KoboldAI I was able to run GPT-J 6B on my 8GB 3070 Ti by offloading the model to my RAM.

How fast was it?

@bjoernpl

bjoernpl commented Mar 2, 2023

> 7B in float16 will be 14GB, and if quantized to uint8 it could be as low as 7GB. But on graphics cards, from what I've tried with other models, it can take 2x that in VRAM.
>
> My guess is that 32GB would be the minimum, but some clever person may be able to run it with 16GB of VRAM.
>
> But the question is, how fast would it be? If it's one character per second, it won't be that useful!

The 7B model generates quickly on a 3090 Ti (~30 seconds for ~500 tokens, ~17 tokens/s), much faster than the ChatGPT interface. It uses ~14GB of VRAM during generation. This is also with batch_size=1, meaning theoretical throughput is higher than this.

[video attachment: Recording.2023-03-02.225512.mp4]

See my fork for the code for rolling generation and the Gradio interface.
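This isn't the code from the fork above, but as a rough sketch of how a "rolling generation" Gradio demo is usually wired up: a generator function yields the text so far, and queueing lets the UI stream it. `generate_stream` here is a stub standing in for the actual token-by-token decoding loop.

```python
# Minimal Gradio sketch of rolling generation: the function yields partial
# text and Gradio streams it to the UI. generate_stream is a stub standing
# in for a real decoding loop.
import time
import gradio as gr

def generate_stream(prompt):
    text = prompt
    for token in [" Hello", ",", " world", "!"]:  # stand-in for sampled tokens
        text += token
        time.sleep(0.1)
        yield text

demo = gr.Interface(fn=generate_stream, inputs="text", outputs="text")
demo.queue()   # queueing enables streaming of generator outputs
demo.launch()
```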

@doublebishop

doublebishop commented Mar 2, 2023

Trying to run the 7B model in Colab with a 15GB GPU is failing. Is there a way to configure this to use fp16, or is that already baked into the existing model?
*Update: using batch_size=2 seems to make it work in Colab+ with a GPU.
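If I remember correctly, the reference example already builds the model with half-precision default tensors, so fp16 should be baked in. A quick generic PyTorch sanity check/cast (`ensure_half` is just an illustrative helper, not part of this repo) would look like:

```python
# Generic PyTorch check: report parameter dtypes and cast down to fp16 if
# any fp32 weights are found. Purely illustrative, not from the repo.
import torch

def ensure_half(model: torch.nn.Module) -> torch.nn.Module:
    dtypes = {p.dtype for p in model.parameters()}
    print("parameter dtypes:", dtypes)
    if torch.float32 in dtypes:
        model = model.half()  # halves parameter memory vs fp32
    return model
```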

@fabawi

fabawi commented Mar 3, 2023

I was able to run 7B on two 1080 Tis (inference only). Next, I'll try 13B and 33B. It still needs refining, but it works! I forked LLaMA here:

https://github.com/modular-ml/wrapyfi-examples_llama

and have a readme with the instructions on how to do it:

LLaMA with Wrapyfi

Wrapyfi enables distributing LLaMA (inference only) on multiple GPUs/machines, each with less than 16GB VRAM

It currently distributes across two cards only, using ZeroMQ. Flexible distribution will be supported soon!

This approach has only been tested on the 7B model for now, using Ubuntu 20.04 with two 1080 Tis. Testing 13B/30B models soon!
UPDATE: Tested on two 3080 Tis as well!

How to?

  1. Replace all instances of <YOUR_IP> and <YOUR CHECKPOINT DIRECTORY> before running the scripts.

  2. Download the LLaMA weights using the official form and install this wrapyfi-examples_llama inside a conda or virtual env:

     git clone https://github.com/modular-ml/wrapyfi-examples_llama.git
     cd wrapyfi-examples_llama
     pip install -r requirements.txt
     pip install -e .

  3. Install Wrapyfi within the same environment:

     git clone https://github.com/fabawi/wrapyfi.git
     cd wrapyfi
     pip install .[pyzmq]

  4. Start the Wrapyfi ZeroMQ broker from within the Wrapyfi repo:

     cd wrapyfi/standalone
     python zeromq_proxy_broker.py --comm_type pubsubpoll

  5. Start the first instance of the Wrapyfi-wrapped LLaMA from within this repo and env (order is important, don't start wrapyfi_device_idx=0 before wrapyfi_device_idx=1):

     CUDA_VISIBLE_DEVICES="0" OMP_NUM_THREADS=1 torchrun --nproc_per_node 1 example.py --ckpt_dir <YOUR CHECKPOINT DIRECTORY>/checkpoints/7B --tokenizer_path <YOUR CHECKPOINT DIRECTORY>/checkpoints/tokenizer.model --wrapyfi_device_idx 1

  6. Now start the second instance (within this repo and env):

     CUDA_VISIBLE_DEVICES="1" OMP_NUM_THREADS=1 torchrun --master_port=29503 --nproc_per_node 1 example.py --ckpt_dir <YOUR CHECKPOINT DIRECTORY>/checkpoints/7B --tokenizer_path <YOUR CHECKPOINT DIRECTORY>/checkpoints/tokenizer.model --wrapyfi_device_idx 0

  7. You will now see the output on both terminals.

  8. EXTRA: To run on different machines, the broker must be running on a specific IP in step 4. Start the ZeroMQ broker by setting the IP, and provide the env variables for steps 5 and 6, e.g.:

     ### (replace 10.0.0.101 with <YOUR_IP>) ###

     # step 4 modification
     python zeromq_proxy_broker.py --socket_ip 10.0.0.101 --comm_type pubsubpoll

     # step 5 modification
     CUDA_VISIBLE_DEVICES="0" OMP_NUM_THREADS=1 WRAPYFI_ZEROMQ_SOCKET_IP='10.0.0.101' torchrun --nproc_per_node 1 example.py --ckpt_dir <YOUR CHECKPOINT DIRECTORY>/checkpoints/7B --tokenizer_path <YOUR CHECKPOINT DIRECTORY>/checkpoints/tokenizer.model --wrapyfi_device_idx 1

     # step 6 modification
     CUDA_VISIBLE_DEVICES="1" OMP_NUM_THREADS=1 WRAPYFI_ZEROMQ_SOCKET_IP='10.0.0.101' torchrun --master_port=29503 --nproc_per_node 1 example.py --ckpt_dir <YOUR CHECKPOINT DIRECTORY>/checkpoints/7B --tokenizer_path <YOUR CHECKPOINT DIRECTORY>/checkpoints/tokenizer.model --wrapyfi_device_idx 0

@pauldog

pauldog commented Mar 3, 2023

@fabawi Good work. 👍

@neuhaus

neuhaus commented Mar 4, 2023

> See my fork for the code for rolling generation and the Gradio interface.

@bjoernpl Works great, thanks!

Have you tried changing the gradio interface to use the gradio chatbot component?

@robertavram-md

Thank you! Works great.

@bjoernpl

bjoernpl commented Mar 4, 2023

> Have you tried changing the gradio interface to use the gradio chatbot component?

I think this doesn't quite fit, since LLaMA is not fine-tuned for chatbot-like capabilities. I think it would definitely be possible (even if it probably wouldn't work too well) to use it as a chatbot with some clever prompting. Might be worth a try, thanks for the idea and the feedback.

@jspisak

jspisak commented Aug 24, 2023

Closing this issue - great work @fabawi !!

@jspisak jspisak closed this as completed Aug 24, 2023