Hi,
I could not find any solution online regarding the following issue:
Expected Behavior
The following command should run inference with four layers offloaded to the GPU:
./llama.cpp/build/bin/main -m zephyr-7b-beta.Q5_K_S.gguf --log-disable --ctx-size 8192 --threads 4 --n-gpu-layers 4 -p "Test"
Instead, it fails with "no kernel image is available for execution on the device".
While watching nvidia-smi, I can see the VRAM being populated before the error is thrown.
The GPU itself works: I can run inference on it with the Hugging Face transformers library without issues.
I tested both the make and the cmake build. The cmake build was:
cd llama.cpp;pip install -r requirements.txt;mkdir build;cd build;cmake .. -DLLAMA_CUBLAS=ON;cmake --build . --config Release
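The build above passes no explicit CUDA architecture, so nvcc targets whatever the project defaults to. A minimal sketch, assuming (not verified in this report) that those defaults do not cover compute capability 5.0, is to derive the architecture number from the capability that llama.cpp prints at startup and pass it through CMake's standard CMAKE_CUDA_ARCHITECTURES variable. The helper name cuda_arch_flag is hypothetical.

```shell
# Hypothetical helper: turn a compute capability such as "5.0" into the
# matching CMake option. CMAKE_CUDA_ARCHITECTURES is a standard CMake
# variable (CMake >= 3.18); whether it resolves this error is an assumption.
cuda_arch_flag() {
    # Drop the dot: "5.0" -> "50"
    arch=$(printf '%s' "$1" | tr -d '.')
    printf -- '-DCMAKE_CUDA_ARCHITECTURES=%s\n' "$arch"
}

# Usage sketch (run from a fresh llama.cpp/build directory):
#   cmake .. -DLLAMA_CUBLAS=ON "$(cuda_arch_flag 5.0)"
#   cmake --build . --config Release
```

Starting from an empty build directory avoids reusing a cached architecture setting from the earlier configure step.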
Current Behavior
CUDA error 209 at /home/user/Documents/devops/llama.cpp/ggml-cuda.cu:6701: no kernel image is available for execution on the device
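CUDA error 209 is cudaErrorNoKernelImageForDevice: the binary contains no compiled kernel that the current device can execute. As a rough illustration of when that happens, the sketch below applies the usual binary compatibility rule (a cubin built for sm_XY runs only on devices with the same major version and a minor version of at least Y) to a list of architectures. The function name and the example list "52 61 70" are assumptions for illustration, not values read from this build, and the PTX JIT fallback is not modelled.

```shell
# Hypothetical check: can a device with compute capability $1 (e.g. 50)
# run any cubin from the space-separated sm list in $2 (e.g. "52 61 70")?
# Rule assumed: same major version, cubin minor <= device minor.
# PTX fallback (JIT compilation) is deliberately not modelled here.
has_kernel_image() {
    dev_major=$(( $1 / 10 ))
    dev_minor=$(( $1 % 10 ))
    for arch in $2; do
        if [ $(( arch / 10 )) -eq "$dev_major" ] && [ $(( arch % 10 )) -le "$dev_minor" ]; then
            return 0    # a matching kernel image exists
        fi
    done
    return 1            # nothing the device can execute
}

# Example: a 5.0 device (as reported for the 940MX) against an assumed build list.
if has_kernel_image 50 "52 61 70"; then
    echo "kernel image available"
else
    echo "no kernel image available"
fi
```

The architectures actually embedded in a binary can be listed with cuobjdump from the CUDA toolkit, which would replace the assumed list here.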
Environment and Context
Sat Nov 25 12:35:35 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce 940MX           Off | 00000000:02:00.0 Off |                  N/A |
| N/A   40C    P8              N/A / 200W |       4MiB / 2048MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A       657      G   /usr/lib/Xorg                                 2MiB |
+---------------------------------------------------------------------------------------+
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0
- Physical (or virtual) hardware you are using, e.g. for Linux:
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 39 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Vendor ID: GenuineIntel
Model name: Intel(R) Core(TM) i7-6820HQ CPU @ 2.70GHz
CPU family: 6
Model: 94
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
Stepping: 3
CPU(s) scaling MHz: 25%
CPU max MHz: 3600,0000
CPU min MHz: 800,0000
BogoMIPS: 5401,81
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb pti ssbd ibrs ibpb stibp tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp vnmi md_clear flush_l1d arch_capabilities
Virtualization features:
Virtualization: VT-x
Caches (sum of all):
L1d: 128 KiB (4 instances)
L1i: 128 KiB (4 instances)
L2: 1 MiB (4 instances)
L3: 8 MiB (1 instance)
NUMA:
NUMA node(s): 1
NUMA node0 CPU(s): 0-7
Vulnerabilities:
Gather data sampling: Vulnerable: No microcode
Itlb multihit: KVM: Mitigation: VMX disabled
L1tf: Mitigation; PTE Inversion; VMX conditional cache flushes, SMT vulnerable
Mds: Mitigation; Clear CPU buffers; SMT vulnerable
Meltdown: Mitigation; PTI
Mmio stale data: Mitigation; Clear CPU buffers; SMT vulnerable
Retbleed: Mitigation; IBRS
Spec rstack overflow: Not affected
Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Spectre v2: Mitigation; IBRS, IBPB conditional, STIBP conditional, RSB filling, PBRSB-eIBRS Not affected
Srbds: Mitigation; Microcode
Tsx async abort: Mitigation; TSX disabled
- Operating System, e.g. for Linux:
$ uname -a
Linux manjaro 6.6.1-1-MANJARO #1 SMP PREEMPT_DYNAMIC Thu Nov 9 04:27:49 UTC 2023 x86_64 GNU/Linux
- SDK version, e.g. for Linux:
$ python3 --version
Python 3.11.5
$ make --version
GNU Make 4.4.1
$ g++ --version
g++ (GCC) 13.2.1 20230801
$ cmake --version
cmake version 3.27.8
Failure Logs
CUDA_VISIBLE_DEVICES=0 ./llama.cpp/build/bin/main -m zephyr-7b-beta.Q5_K_S.gguf --log-disable --ctx-size 8192 --threads 4 --n-gpu-layers 4 -p "Test"
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA GeForce 940MX, compute capability 5.0
llama_model_loader: loaded meta data with 21 key-value pairs and 291 tensors from zephyr-7b-beta.Q5_K_S.gguf (version unknown)
llama_model_loader: - tensor 0: token_embd.weight q5_K [ 4096, 32000, 1, 1 ]
llama_model_loader: - tensor 1: blk.0.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
...
llama_model_loader: - tensor 290: output_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - kv 0: general.architecture str
llama_model_loader: - kv 1: general.name str
llama_model_loader: - kv 2: llama.context_length u32
llama_model_loader: - kv 3: llama.embedding_length u32
llama_model_loader: - kv 4: llama.block_count u32
llama_model_loader: - kv 5: llama.feed_forward_length u32
llama_model_loader: - kv 6: llama.rope.dimension_count u32
llama_model_loader: - kv 7: llama.attention.head_count u32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32
llama_model_loader: - kv 10: llama.rope.freq_base f32
llama_model_loader: - kv 11: general.file_type u32
llama_model_loader: - kv 12: tokenizer.ggml.model str
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr
llama_model_loader: - kv 14: tokenizer.ggml.scores arr
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr
llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32
llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32
llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32
llama_model_loader: - kv 19: tokenizer.ggml.padding_token_id u32
llama_model_loader: - kv 20: general.quantization_version u32
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type q5_K: 225 tensors
llama_model_loader: - type q6_K: 1 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format = unknown
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 32768
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: f_norm_eps = 0,0e+00
llm_load_print_meta: f_norm_rms_eps = 1,0e-05
llm_load_print_meta: f_clamp_kqv = 0,0e+00
llm_load_print_meta: f_max_alibi_bias = 0,0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: freq_base_train = 10000,0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = mostly Q5_K - Small
llm_load_print_meta: model params = 7,24 B
llm_load_print_meta: model size = 4,65 GiB (5,52 BPW)
llm_load_print_meta: general.name = huggingfaceh4_zephyr-7b-beta
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: PAD token = 2 '</s>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0,10 MB
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required = 4193,46 MB
llm_load_tensors: offloading 4 repeating layers to GPU
llm_load_tensors: offloaded 4/35 layers to GPU
llm_load_tensors: VRAM used: 572,12 MB
...................................................................................................
llama_new_context_with_model: n_ctx = 8192
llama_new_context_with_model: freq_base = 10000,0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size = 1024,00 MB
llama_new_context_with_model: compute buffer total size = 558,13 MB
llama_new_context_with_model: VRAM scratch buffer: 552,00 MB
llama_new_context_with_model: total VRAM used: 1124,13 MB (model: 572,12 MB, context: 552,00 MB)
CUDA error 209 at /home/user/Documents/devops/llama.cpp/ggml-cuda.cu:6701: no kernel image is available for execution on the device
current device: 0
Please let me know if you have any ideas on how to fix this issue.
Thanks in advance,
Philipp-Sc