
Llama crashes instead of raising an Exception when loading a model too heavy for my GPU (cuBLAS) #374

Open
Julescorbara opened this issue Jun 14, 2023 · 1 comment
Labels
llama.cpp (Problem with llama.cpp shared lib), model (Model specific issue)

Comments

@Julescorbara

I'm trying to let a user select and load a large model on the GPU using cuBLAS. If the model is too big to fit in VRAM, I'd expect an exception to be raised that I could catch and handle accordingly.
However, when running llm=Llama(model_path=[Heavy model path])

the Python script simply crashes with the error
CUDA error 2 at C:\Users\<User>\AppData\Local\Temp\pip-install-[...]\llama-cpp-python_[...]\vendor\llama.cpp\ggml-cuda.cu:1752: out of memory
Is this expected behaviour? Is there any way to catch the error before the crash?
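
One rough pre-flight check I can imagine (this is an assumption on my part, not something llama-cpp-python provides): compare the model file size, plus an arbitrary headroom margin for scratch and KV buffers, against the free VRAM reported by nvidia-smi before constructing Llama. The margin value below is a guess.

import os
import subprocess

def free_vram_mb(device_index: int = 0) -> int:
    # Ask nvidia-smi for the free VRAM (in MB) of one GPU.
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.free",
         "--format=csv,noheader,nounits", f"--id={device_index}"],
        text=True,
    )
    return int(out.strip().splitlines()[0])

model_path = r"C:\[...]\models\7b_ggml\ggml-model-f16.bin"
model_mb = os.path.getsize(model_path) // (1024 * 1024)
margin_mb = 2048  # arbitrary headroom for scratch/KV buffers

if model_mb + margin_mb > free_vram_mb():
    print("Model will probably not fit in VRAM; skipping GPU load")
else:
    from llama_cpp import Llama
    llm = Llama(model_path=model_path, n_gpu_layers=256)

This only approximates the real VRAM requirement (an f16 model is loaded roughly at file size, plus the scratch buffers shown in the log below), so it can be wrong in either direction.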

Steps to Reproduce

Here's the minimal script I tried to run:

from llama_cpp import Llama

try:
    llm = Llama(model_path=r"C:\[...]\models\7b_ggml\ggml-model-f16.bin", n_gpu_layers=256)
except Exception:
    # Never reached: the out-of-memory abort happens inside the native
    # library and terminates the whole process before Python sees an exception.
    print("This error should be raised")
else:
    print("everything worked")

Failure Logs

ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3080 Laptop GPU
llama.cpp: loading model from C:\[...]\ggml-model-f16.bin
llama_model_load_internal: format     = ggjt v1 (pre #1405)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 1 (mostly F16)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.07 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required  = 2292.09 MB (+ 1026.00 MB per state)
llama_model_load_internal: allocating batch_size x 1 MB = 512 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 32 layers to GPU
llama_model_load_internal: offloading output layer to GPU
llama_model_load_internal: total VRAM used: 12865 MB
..........................................................CUDA error 2 at C:\[...]\AppData\Local\Temp\pip-install-ef_ea710\llama-cpp-python_b8fa201991fb409785cc828b5a130017\vendor\llama.cpp\ggml-cuda.cu:1752: out of memory

Process finished with exit code 1
@gjmulder added the llama.cpp (Problem with llama.cpp shared lib) and model (Model specific issue) labels on Jun 14, 2023
@gjmulder
Contributor

gjmulder commented Jun 14, 2023

The issue is that the crash is occurring inside libllama.so, as can be seen from the reference to vendor\llama.cpp\ggml-cuda.cu. The native code aborts the whole process, so there is no Python exception for your try/except to catch.

To have llama.cpp handle this CUDA error more gracefully you will need to log an issue with the upstream llama.cpp project.
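
In the meantime, one possible workaround (just a sketch, not part of this library's API; the fits_on_gpu and _load helpers below are names made up for the example) is to probe the load in a child process, so that the native out-of-memory abort only kills the child and the parent can react:

import multiprocessing as mp

def _load(model_path: str) -> None:
    # Runs in a child process; if cuBLAS runs out of memory, the native
    # abort terminates only this process, not the parent.
    from llama_cpp import Llama
    Llama(model_path=model_path, n_gpu_layers=256)

def fits_on_gpu(model_path: str) -> bool:
    ctx = mp.get_context("spawn")  # spawn so the child gets a clean CUDA state
    proc = ctx.Process(target=_load, args=(model_path,))
    proc.start()
    proc.join()
    return proc.exitcode == 0

if __name__ == "__main__":
    if fits_on_gpu(r"C:\[...]\models\7b_ggml\ggml-model-f16.bin"):
        print("everything worked")
    else:
        print("This error should be raised")  # handled instead of crashing

The cost is that the model is loaded once in the probe and then again in the parent if it fits, so this is only worth it when a hard crash is unacceptable.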

antoine-lizee pushed a commit to antoine-lizee/llama-cpp-python that referenced this issue on Oct 30, 2023: "Delete this for now to avoid confusion since it contains some wrong checksums from the old tokenizer format. Re-add after abetlen#374 is resolved."