CUDA error 209: no kernel image is available for execution on the device #4215

Description

@Philipp-Sc

Hi,

I could not find a solution online for the following issue:

Expected Behavior

./llama.cpp/build/bin/main -m zephyr-7b-beta.Q5_K_S.gguf --log-disable --ctx-size 8192  --threads 4 --n-gpu-layers 4 -p "Test"

The command fails with "no kernel image is available for execution on the device".
While watching nvidia-smi, I can see the VRAM being populated before the error is thrown.

The GPU itself works: I can run inference on it with the Hugging Face transformers library without issues.
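
The figures in the logs further down suggest this is not an out-of-memory problem. The rough check below compares the 2048MiB reported by nvidia-smi with the "total VRAM used: 1124,13 MB" reported by llama.cpp. The helper name `to_dot` is only for illustration, and MB and MiB are treated as comparable for this estimate:

```shell
# Convert a comma-decimal number (as printed under the German locale
# seen in the logs) to dot-decimal so it can be compared numerically.
to_dot() {
    printf '%s\n' "$1" | tr ',' '.'
}

# Values copied from the logs in this report.
vram_total_mib=2048
vram_used_mb=$(to_dot "1124,13")

# LC_ALL=C keeps awk from applying locale-specific number parsing.
vram_check=$(LC_ALL=C awk -v used="$vram_used_mb" -v total="$vram_total_mib" \
    'BEGIN { if (used + 0 < total + 0) print "fits"; else print "too large" }')

echo "requested VRAM: $vram_check"
```

If the request fits, the failure is more likely about which kernels the binary contains than about memory.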

I tested builds with both make and cmake. The cmake build was:
cd llama.cpp
pip install -r requirements.txt
mkdir build
cd build
cmake .. -DLLAMA_CUBLAS=ON
cmake --build . --config Release

Current Behavior

CUDA error 209 at /home/user/Documents/devops/llama.cpp/ggml-cuda.cu:6701: no kernel image is available for execution on the device
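
CUDA error 209 is `cudaErrorNoKernelImageForDevice`, which generally means the binary contains no GPU code that matches the device's compute capability. The sketch below illustrates the compatibility rule under stated assumptions: the function name `can_run` is hypothetical, the build is assumed to embed both SASS and PTX for each listed architecture, and the list `52;61;70` is only an example, not something read from this build.

```shell
# Succeeds if code built for architecture $2 (e.g. "52") can run on a device
# with compute capability $1 written without the dot (e.g. "50").
# PTX for compute_XY can be JIT-compiled on any device with capability >= X.Y,
# so with PTX embedded the condition reduces to: built <= device.
can_run() {
    dev=$1
    built=$2
    [ "$built" -le "$dev" ]
}

device_cc=50              # the GeForce 940MX reports compute capability 5.0
built_archs="52;61;70"    # ASSUMPTION: example list, not taken from this build

result="no kernel image"
for arch in $(printf '%s' "$built_archs" | tr ';' ' '); do
    if can_run "$device_cc" "$arch"; then
        result="ok via $arch"
        break
    fi
done

echo "$result"
```

With a list like this one, every entry is newer than the device, so nothing can be loaded even though memory allocation on the GPU succeeds.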

Environment and Context

                                                                                                                                      
Sat Nov 25 12:35:35 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce 940MX           Off | 00000000:02:00.0 Off |                  N/A |
| N/A   40C    P8              N/A / 200W |      4MiB /  2048MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A       657      G   /usr/lib/Xorg                                 2MiB |
+---------------------------------------------------------------------------------------+
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0
  • Physical (or virtual) hardware you are using, e.g. for Linux:

$ lscpu

Architecture:                      x86_64
  CPU op-mode(s):                  32-bit, 64-bit
  Address sizes:                   39 bits physical, 48 bits virtual
  Byte Order:                      Little Endian
CPU(s):                            8
  On-line CPU(s) list:             0-7
Vendor ID:                         GenuineIntel
  Model name:                      Intel(R) Core(TM) i7-6820HQ CPU @ 2.70GHz
    CPU family:                    6
    Model:                         94
    Thread(s) per core:            2
    Core(s) per socket:            4
    Socket(s):                     1
    Stepping:                      3
    CPU(s) scaling MHz:            25%
    CPU max MHz:                   3600,0000
    CPU min MHz:                   800,0000
    BogoMIPS:                      5401,81
    Flags:                         fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb pti ssbd ibrs ibpb stibp tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp vnmi md_clear flush_l1d arch_capabilities
Virtualization features:
  Virtualization:                  VT-x
Caches (sum of all):
  L1d:                             128 KiB (4 instances)
  L1i:                             128 KiB (4 instances)
  L2:                              1 MiB (4 instances)
  L3:                              8 MiB (1 instance)
NUMA:
  NUMA node(s):                    1
  NUMA node0 CPU(s):               0-7
Vulnerabilities:
  Gather data sampling:            Vulnerable: No microcode
  Itlb multihit:                   KVM: Mitigation: VMX disabled
  L1tf:                            Mitigation; PTE Inversion; VMX conditional cache flushes, SMT vulnerable
  Mds:                             Mitigation; Clear CPU buffers; SMT vulnerable
  Meltdown:                        Mitigation; PTI
  Mmio stale data:                 Mitigation; Clear CPU buffers; SMT vulnerable
  Retbleed:                        Mitigation; IBRS
  Spec rstack overflow:            Not affected
  Spec store bypass:               Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:                      Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:                      Mitigation; IBRS, IBPB conditional, STIBP conditional, RSB filling, PBRSB-eIBRS Not affected
  Srbds:                           Mitigation; Microcode
  Tsx async abort:                 Mitigation; TSX disabled
  • Operating System, e.g. for Linux:

$ uname -a

Linux manjaro 6.6.1-1-MANJARO #1 SMP PREEMPT_DYNAMIC Thu Nov  9 04:27:49 UTC 2023 x86_64 GNU/Linux
  • SDK version, e.g. for Linux:
$ python3 --version
Python 3.11.5
$ make --version
GNU Make 4.4.1
$ g++ --version
g++ (GCC) 13.2.1 20230801
$ cmake --version
cmake version 3.27.8

Failure Logs

CUDA_VISIBLE_DEVICES=0 ./llama.cpp/build/bin/main -m zephyr-7b-beta.Q5_K_S.gguf --log-disable --ctx-size 8192  --threads 4 --n-gpu-layers 4 -p "Test"                          
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce 940MX, compute capability 5.0
llama_model_loader: loaded meta data with 21 key-value pairs and 291 tensors from zephyr-7b-beta.Q5_K_S.gguf (version unknown)
llama_model_loader: - tensor    0:                token_embd.weight q5_K     [  4096, 32000,     1,     1 ]
llama_model_loader: - tensor    1:           blk.0.attn_norm.weight f32      [  4096,     1,     1,     1 ]
...
llama_model_loader: - tensor  290:               output_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - kv   0:                       general.architecture str     
llama_model_loader: - kv   1:                               general.name str     
llama_model_loader: - kv   2:                       llama.context_length u32     
llama_model_loader: - kv   3:                     llama.embedding_length u32     
llama_model_loader: - kv   4:                          llama.block_count u32     
llama_model_loader: - kv   5:                  llama.feed_forward_length u32     
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32     
llama_model_loader: - kv   7:                 llama.attention.head_count u32     
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32     
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32     
llama_model_loader: - kv  10:                       llama.rope.freq_base f32     
llama_model_loader: - kv  11:                          general.file_type u32     
llama_model_loader: - kv  12:                       tokenizer.ggml.model str     
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr     
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr     
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr     
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32     
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32     
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32     
llama_model_loader: - kv  19:            tokenizer.ggml.padding_token_id u32     
llama_model_loader: - kv  20:               general.quantization_version u32     
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q5_K:  225 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = unknown
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: f_norm_eps       = 0,0e+00
llm_load_print_meta: f_norm_rms_eps   = 1,0e-05
llm_load_print_meta: f_clamp_kqv      = 0,0e+00
llm_load_print_meta: f_max_alibi_bias = 0,0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: freq_base_train  = 10000,0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = mostly Q5_K - Small
llm_load_print_meta: model params     = 7,24 B
llm_load_print_meta: model size       = 4,65 GiB (5,52 BPW) 
llm_load_print_meta: general.name   = huggingfaceh4_zephyr-7b-beta
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: PAD token = 2 '</s>'
llm_load_print_meta: LF token  = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0,10 MB
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required  = 4193,46 MB
llm_load_tensors: offloading 4 repeating layers to GPU
llm_load_tensors: offloaded 4/35 layers to GPU
llm_load_tensors: VRAM used: 572,12 MB
...................................................................................................
llama_new_context_with_model: n_ctx      = 8192
llama_new_context_with_model: freq_base  = 10000,0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size  = 1024,00 MB
llama_new_context_with_model: compute buffer total size = 558,13 MB
llama_new_context_with_model: VRAM scratch buffer: 552,00 MB
llama_new_context_with_model: total VRAM used: 1124,13 MB (model: 572,12 MB, context: 552,00 MB)

CUDA error 209 at /home/user/Documents/devops/llama.cpp/ggml-cuda.cu:6701: no kernel image is available for execution on the device
current device: 0
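
The two relevant facts in this log are the device's compute capability and the CUDA error code. The sketch below pulls both out of the log text and prints a configure command that pins the CUDA architecture to the device. This is a possible next step, untested here: it assumes the project's CMake setup honors CMake's standard `CMAKE_CUDA_ARCHITECTURES` variable, and the variable names in the script are only for illustration.

```shell
# Two lines copied from the failure log above.
log='  Device 0: NVIDIA GeForce 940MX, compute capability 5.0
CUDA error 209 at /home/user/Documents/devops/llama.cpp/ggml-cuda.cu:6701: no kernel image is available for execution on the device'

# Extract the compute capability ("5.0") and drop the dot to get "50".
cc=$(printf '%s\n' "$log" \
    | sed -n 's/.*compute capability \([0-9][0-9]*\.[0-9]\).*/\1/p' \
    | head -n 1)
arch=$(printf '%s' "$cc" | tr -d '.')

# Extract the numeric CUDA error code.
err=$(printf '%s\n' "$log" \
    | sed -n 's/^CUDA error \([0-9][0-9]*\) at .*/\1/p' \
    | head -n 1)

echo "device compute capability: $cc (arch $arch), CUDA error: $err"

# ASSUMPTION: the build respects CMAKE_CUDA_ARCHITECTURES when it is set explicitly.
echo "cmake .. -DLLAMA_CUBLAS=ON -DCMAKE_CUDA_ARCHITECTURES=$arch"
```

The printed command would replace the `cmake .. -DLLAMA_CUBLAS=ON` step in a fresh build directory.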

Please let me know if you have any ideas on how to fix this issue.
Thanks in advance,
Philipp-Sc
