I'm trying to let a user select and load a large model on the GPU using cuBLAS. If the model is too big to fit in VRAM, I'd expect an exception to be raised that I could catch and handle accordingly.
However, when running llm = Llama(model_path=[Heavy model path]), the Python script simply crashes with the error CUDA error 2 at C:\Users\<User>\AppData\Local\Temp\pip-install-[...]\llama-cpp-python_[...]\vendor\llama.cpp\ggml-cuda.cu:1752: out of memory
Is this expected behaviour? Is there any way to catch the error before the crash?
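As an interim heuristic I've been considering checking free VRAM before the call, for example via pynvml. This is only a rough sketch: the file-size-based footprint estimate and the 1.2x headroom factor below are guesses and do not account for llama.cpp's scratch/KV buffers, so a catchable exception would still be much preferable.

from llama_cpp import Llama
import os
import pynvml

def fits_in_vram(model_path: str, device_index: int = 0, headroom: float = 1.2) -> bool:
    """Rough pre-flight check: does the model file (plus headroom) fit in free VRAM?"""
    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
        free_bytes = pynvml.nvmlDeviceGetMemoryInfo(handle).free
    finally:
        pynvml.nvmlShutdown()
    return os.path.getsize(model_path) * headroom < free_bytes

model_path = r"C:\[...]\models\7b_ggml\ggml-model-f16.bin"
if fits_in_vram(model_path):
    llm = Llama(model_path=model_path, n_gpu_layers=256)
else:
    print("Model unlikely to fit in VRAM, falling back to CPU")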
Steps to Reproduce
Here's the minimal script I tried to run:
from llama_cpp import Llama

try:
    # Model intentionally too large for VRAM: I would expect this to raise
    llm = Llama(model_path=r"C:\[...]\models\7b_ggml\ggml-model-f16.bin", n_gpu_layers=256)
except:
    print("This error should be raised")
else:
    print("everything worked")
Failure Logs
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3080 Laptop GPU
llama.cpp: loading model from C:\[...]\ggml-model-f16.bin
llama_model_load_internal: format = ggjt v1 (pre #1405)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 1 (mostly F16)
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 0.07 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required = 2292.09 MB (+ 1026.00 MB per state)
llama_model_load_internal: allocating batch_size x 1 MB = 512 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 32 layers to GPU
llama_model_load_internal: offloading output layer to GPU
llama_model_load_internal: total VRAM used: 12865 MB
..........................................................CUDA error 2 at C:\[...]\AppData\Local\Temp\pip-install-ef_ea710\llama-cpp-python_b8fa201991fb409785cc828b5a130017\vendor\llama.cpp\ggml-cuda.cu:1752: out of memory
Process finished with exit code 1
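Since the whole interpreter dies with exit code 1, the only way I can think of to survive the abort is to probe the load in a child process and check its exit code. The sketch below is untested, and the double load it implies is wasteful, so it is no substitute for a catchable exception:

import multiprocessing as mp

def _try_gpu_load(model_path: str) -> None:
    # Runs in a child process; if the CUDA OOM abort happens, only the child dies.
    from llama_cpp import Llama
    Llama(model_path=model_path, n_gpu_layers=256)

if __name__ == "__main__":
    model_path = r"C:\[...]\models\7b_ggml\ggml-model-f16.bin"
    probe = mp.Process(target=_try_gpu_load, args=(model_path,))
    probe.start()
    probe.join()
    if probe.exitcode == 0:
        # Probe succeeded, so load again in the main process (wasteful but survivable)
        from llama_cpp import Llama
        llm = Llama(model_path=model_path, n_gpu_layers=256)
    else:
        print(f"GPU load probe failed (exit code {probe.exitcode}); falling back to CPU")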