
How to run with -ngl parameter? #268

Closed

albertoZurini opened this issue May 23, 2023 · 9 comments
Labels: hardware (Hardware specific issue), performance

Comments

@albertoZurini

Is your feature request related to a problem? Please describe.
I have a low-VRAM GPU and would like to use the Python binding. I can run LLaMA thanks to https://gist.github.com/rain-1/8cc12b4b334052a21af8029aa9c4fafc, but I couldn't figure out whether this is possible with this binding.

Describe the solution you'd like
I want to run the 13B model on my 3060.

Describe alternatives you've considered
https://gist.github.com/rain-1/8cc12b4b334052a21af8029aa9c4fafc

Additional context

@gjmulder added the hardware (Hardware specific issue) label May 23, 2023
@gjmulder
Contributor

gjmulder commented May 23, 2023

  1. Use a q5_1 quantized model. This lets you load the largest model on your GPU with the smallest amount of quality loss.
  2. Set n_ctx as you want. Note that increasing this parameter increases quality at the cost of performance (tokens per second) and VRAM.
  3. Run without the -ngl parameter and see how much free VRAM you have.
  4. Increase -ngl NN until you are using almost all of your VRAM (see the sketch below).
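
For the Python binding, the counterpart of -ngl is the n_gpu_layers argument. A minimal sketch, assuming the n_gpu_layers and n_ctx keyword arguments exposed by llama_cpp.Llama at the time (the model path and layer count are placeholders to tune against your VRAM):

from llama_cpp import Llama

llm = Llama(
    model_path="./models/ggml-model-q5_1.bin",  # placeholder path to your quantized model
    n_ctx=2048,       # context window; larger values cost VRAM and tokens/s
    n_gpu_layers=32,  # rough equivalent of llama.cpp's -ngl; raise until VRAM is nearly full
)

out = llm("Q: Name the planets in the solar system. A:", max_tokens=64)
print(out["choices"][0]["text"])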

With a 7B model and an 8K context I can fit all the layers on the GPU in 6GB of VRAM. Similarly, the 13B model will fit in 11GB of VRAM:

llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 8196
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 9 (mostly Q5_1)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.07 MB
llama_model_load_internal: mem required  = 1979.59 MB (+ 1026.00 MB per state)
llama_model_load_internal: [cublas] offloading 32 layers to GPU
llama_model_load_internal: [cublas] offloading output layer to GPU
llama_model_load_internal: [cublas] total VRAM used: 4633 MB
...................................................................................................
llama_init_from_file: kv self size  = 4098.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 
$ nvidia-smi | grep python3
|    0   N/A  N/A   2222222      C   python3                                    5998MiB |

@albertoZurini
Author

albertoZurini commented May 23, 2023

I get this error when setting that parameter:

llama.cpp: loading model from LLaMA/13B/ggml-model-f16-q4_0.bin
terminate called after throwing an instance of 'std::runtime_error'
  what():  unexpectedly reached end of file
Aborted (core dumped)

In particular, it hangs on line 259 of llama_cpp.py.

@gjmulder
Contributor

It looks like your model file is corrupt. Does it work with llama.cpp/main?

@albertoZurini
Author

albertoZurini commented May 24, 2023

Yes, it does. Could you please give me a hint on how to better debug the Python binding?

@gjmulder
Contributor

That's really strange. The error 'std::runtime_error' looks like a C++ error.

DebuggingWithGdb maybe?
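
A generic way to get a backtrace from the C++ side of the Python process (your_script.py is a placeholder for whatever script triggers the abort):

$ gdb -ex run --args python3 your_script.py
(gdb) bt    # once it aborts, print the C/C++ backtrace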

@albertoZurini
Author

albertoZurini commented May 24, 2023

I think this may be an error due to the encoding of the file. I tried downloading a pre-quantized model from https://huggingface.co/eachadea/ggml-vicuna-13b-1.1/tree/main and running it in Docker, but I'm getting a crash there as well:

llama.cpp: loading model from /models/ggml-old-vic13b-q4_0.bin
Illegal instruction (core dumped)

But this is strange: I followed the steps at the link in my first post and they work just fine with llama.cpp, so I don't understand why they don't work with the Python binding. How did you prepare the data?

@gjmulder
Contributor

Try cloning llama-cpp-python, building the package locally as per the README.md, and then verifying whether the llama.cpp pulled in via llama-cpp-python works:

$ cd llama-cpp-python
$ cd vendor/llama.cpp
$ make -j
$ ./main -m /models/ggml-old-vic13b-q4_0.bin
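
If the vendored main runs the model fine but the installed package still fails, rebuilding the Python package against that same vendored tree may help; a sketch, assuming the CMAKE_ARGS / FORCE_CMAKE build flow with the LLAMA_CUBLAS option described in the README of that period:

$ cd llama-cpp-python
$ CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install --force-reinstall --no-cache-dir .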

@albertoZurini
Author

Thanks a lot, that was the issue: I was quantizing with a different version of llama.cpp.
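
A rough sketch of redoing the quantization with the llama.cpp tree vendored by the binding, so the file format matches (paths and the q5_1 target are illustrative; the exact convert/quantize invocations depend on the llama.cpp revision):

$ cd llama-cpp-python/vendor/llama.cpp
$ python3 convert.py /path/to/LLaMA/13B/
$ ./quantize /path/to/LLaMA/13B/ggml-model-f16.bin /path/to/LLaMA/13B/ggml-model-q5_1.bin q5_1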

@Huge
Contributor

Huge commented May 29, 2023

I struggled with the same trouble last week. It is caused by the breaking change from ggerganov/llama.cpp#1305 that llama.cpp rolled out recently, right?
