
difficulties involved with inference on a consumer GPU #4

Open
152334H opened this issue Apr 17, 2023 · 12 comments

Comments

@152334H
Contributor

152334H commented Apr 17, 2023

Here are the problems I've found today:

  1. Vicuna is loaded as fp16. This is a problem for obvious reasons (13B parameters × 2 bytes ≈ 26 GB, more VRAM than any consumer GPU has).
  2. The beam-search default of 5 beams consumes a lot of VRAM during generation.

To address these problems, I have created a fork where:

  • Vicuna is loaded in 8-bit (roughly as sketched after this list)
  • num_beams is set to 1 by default
  • I also put ViT-L on the CPU (as fp32), because the encoder only needs a single pass
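
Not the actual fork code, just a rough sketch of what 8-bit loading with a single beam looks like with Hugging Face transformers (requires bitsandbytes; the model path is a placeholder for your converted Vicuna weights):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "path/to/vicuna-13b-v0"  # placeholder: your converted HF-format weights

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    load_in_8bit=True,   # quantize linear layers to int8 via bitsandbytes
    device_map="auto",   # place weights on the available GPU(s)
)

inputs = tokenizer("Describe this image:", return_tensors="pt").to(model.device)
output = model.generate(**inputs, num_beams=1, max_new_tokens=64)  # single beam keeps VRAM low
print(tokenizer.decode(output[0], skip_special_tokens=True))
```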
@152334H
Contributor Author

152334H commented Apr 17, 2023

Note: I have discovered that it is still possible to OOM on a 3090 after repeated inference; I may be hitting a memory leak somewhere.
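
Not from the MiniGPT-4 code, just a generic PyTorch pattern for chasing OOMs that only show up after repeated inference; `generate_fn` is a hypothetical stand-in for whatever call runs generation:

```python
import gc
import torch

def run_once(generate_fn, *args, **kwargs):
    with torch.inference_mode():   # no autograd graph is retained across requests
        out = generate_fn(*args, **kwargs)
    gc.collect()                   # drop stray Python references
    torch.cuda.empty_cache()       # return cached blocks so leaks show up in nvidia-smi
    print(f"allocated: {torch.cuda.memory_allocated() / 2**30:.2f} GiB")
    return out
```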

@TsuTikgiau
Collaborator

Thanks for your help on this! We have included your updated code now. For running on a consumer GPU, we plan to train a Vicuna 7B version of our method within two days. I think this should reduce GPU memory pressure dramatically.

@Snowad14

I haven't been able to try it on the demo site or without the optimizations, but I'm getting very... peculiar results.

[Screenshots: Capture d’écran 2023-04-17 180810, Capture d’écran 2023-04-17 181121]

@152334H
Contributor Author

152334H commented Apr 17, 2023

@Snowad14 Complete guess -- you might've loaded the wrong LLaMA weights. Check that you are using Vicuna-13B-v0.

@Snowad14

I was too lazy to download the LLaMA weights and do the conversion myself, so I downloaded what seems to be a directly converted version of the old weights (https://huggingface.co/Ejafa/vicuna_13B_vanilla), but those may be the wrong weights. I'll try converting them myself.

@cibernicola

cibernicola commented Apr 18, 2023

So is there a way to load Vicuna in 8-bit?

I mean, will it work on a 3090 under Windows? :D

@yhyu13

yhyu13 commented Apr 18, 2023

@TsuTikgiau It would be even better to use llama.cpp with the ggml format, which compresses the 7B model down to roughly 4 GB and the 13B model down to roughly 7 GB. Plus, all of them run on the CPU with vectorized instructions, with performance comparable to running on a 3090 (given its nerfed AI inference compute).

https://huggingface.co/eachadea/legacy-ggml-vicuna-7b-4bit

There is already a popular project built on CPU inference that works like a charm: https://github.com/nomic-ai/gpt4all-ui
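
For reference, a minimal sketch (not part of MiniGPT-4) of running a 4-bit ggml Vicuna on the CPU via the llama-cpp-python bindings; the model path is an assumption pointing at the quantized file from the Hugging Face link above:

```python
from llama_cpp import Llama

# CPU-only inference on the quantized ggml weights
llm = Llama(model_path="./ggml-vicuna-7b-4bit.bin", n_threads=8)
result = llm("### Human: Describe a sunset.\n### Assistant:", max_tokens=128)
print(result["choices"][0]["text"])
```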

@TsuTikgiau
Collaborator

@cibernicola the default setting of the demo now loads Vicuna in 8-bit and uses less than 24 GB of GPU memory if you keep the beam width at 1.

@TsuTikgiau
Collaborator

TsuTikgiau commented Apr 18, 2023

@yhyu13 Thank you for referring this to us! We will look into it when we have time in the coming days :D

@zxcvbn114514

How do I put ViT-L back on the GPU? The model is running extremely slowly now, while VRAM usage is only 17 GB of the 3090's 24 GB.
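
For what it's worth, moving a module back to the GPU is usually just a device/dtype cast; the attribute name below is a guess at where the fork keeps the ViT, not the repo's confirmed API:

```python
# Hypothetical: `model.visual_encoder` stands in for wherever the ViT-L module lives.
model.visual_encoder = model.visual_encoder.half().to("cuda")
```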

@152334H
Contributor Author

152334H commented Apr 22, 2023

That cannot be the problem. ViT-L only needs a single pass; 99% of the work should be done in Vicuna alone, in the autoregressive loop.
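
A rough way to check where the time actually goes; `encode_image` and `generate_answer` are hypothetical stand-ins for the ViT-L forward pass and the Vicuna generation call:

```python
import time

def timed(label, fn, *args, **kwargs):
    t0 = time.perf_counter()
    result = fn(*args, **kwargs)
    print(f"{label}: {time.perf_counter() - t0:.2f}s")
    return result

# feats = timed("ViT-L encode (one pass)", encode_image, image_tensor)
# answer = timed("Vicuna autoregressive generation", generate_answer, prompt, feats)
```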

@zxcvbn114514

That cannot be the problem. ViT-L only needs a single pass; 99% of the work should be done in Vicuna alone, in the autoregressive loop.
[screenshot]
So is it normal for the 3090 to run at only about 80 watts while the model is running?

TsuTikgiau added a commit that referenced this issue Oct 16, 2023
Merge pull request #376 from TsuTikgiau/main