I'm trying to let a user select and load a large model on the GPU using cuBLAS. If the model is too big to fit in VRAM, I'd expect an exception to be raised that I could catch and handle accordingly.
However, when running llm = Llama(model_path=[Heavy model path]), the Python script simply crashes with the error CUDA error 2 at C:\Users\<User>\AppData\Local\Temp\pip-install-[...]\llama-cpp-python_[...]\vendor\llama.cpp\ggml-cuda.cu:1752: out of memory
Is this expected behaviour? Is there any way to catch the error before the crash?
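As an interim heuristic I've been considering checking free VRAM before the call, for example via pynvml. This is only a rough sketch: the file-size-based footprint estimate and the 1.2x headroom factor below are guesses and do not account for llama.cpp's scratch/KV buffers, so a catchable exception would still be much preferable.

from llama_cpp import Llama
import os
import pynvml

def fits_in_vram(model_path: str, device_index: int = 0, headroom: float = 1.2) -> bool:
    """Rough pre-flight check: does the model file (plus headroom) fit in free VRAM?"""
    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
        free_bytes = pynvml.nvmlDeviceGetMemoryInfo(handle).free
    finally:
        pynvml.nvmlShutdown()
    return os.path.getsize(model_path) * headroom < free_bytes

model_path = r"C:\[...]\models\7b_ggml\ggml-model-f16.bin"
if fits_in_vram(model_path):
    llm = Llama(model_path=model_path, n_gpu_layers=256)
else:
    print("Model unlikely to fit in VRAM, falling back to CPU")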
Steps to Reproduce
Here's the minimal script I tried to run:
from llama_cpp import Llama

try:
    # Model intentionally too large for VRAM: I would expect this to raise
    llm = Llama(model_path=r"C:\[...]\models\7b_ggml\ggml-model-f16.bin", n_gpu_layers=256)
except:
    print("This error should be raised")
else:
    print("everything worked")
Failure Logs
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3080 Laptop GPU
llama.cpp: loading model from C:\[...]\ggml-model-f16.bin
llama_model_load_internal: format = ggjt v1 (pre #1405)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 1 (mostly F16)
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 0.07 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required = 2292.09 MB (+ 1026.00 MB per state)
llama_model_load_internal: allocating batch_size x 1 MB = 512 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 32 layers to GPU
llama_model_load_internal: offloading output layer to GPU
llama_model_load_internal: total VRAM used: 12865 MB
..........................................................CUDA error 2 at C:\[...]\AppData\Local\Temp\pip-install-ef_ea710\llama-cpp-python_b8fa201991fb409785cc828b5a130017\vendor\llama.cpp\ggml-cuda.cu:1752: out of memory
Process finished with exit code 1
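Since the whole interpreter dies with exit code 1, the only way I can think of to survive the abort is to probe the load in a child process and check its exit code. The sketch below is untested, and the double load it implies is wasteful, so it is no substitute for a catchable exception:

import multiprocessing as mp

def _try_gpu_load(model_path: str) -> None:
    # Runs in a child process; if the CUDA OOM abort happens, only the child dies.
    from llama_cpp import Llama
    Llama(model_path=model_path, n_gpu_layers=256)

if __name__ == "__main__":
    model_path = r"C:\[...]\models\7b_ggml\ggml-model-f16.bin"
    probe = mp.Process(target=_try_gpu_load, args=(model_path,))
    probe.start()
    probe.join()
    if probe.exitcode == 0:
        # Probe succeeded, so load again in the main process (wasteful but survivable)
        from llama_cpp import Llama
        llm = Llama(model_path=model_path, n_gpu_layers=256)
    else:
        print(f"GPU load probe failed (exit code {probe.exitcode}); falling back to CPU")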