Labels
Nvidia GPU, bug, ggml, medium severity
Description
Name and Version
llama-b7172-bin-win-cuda-12.4-x64 public release
Operating systems
Windows
GGML backends
CUDA
Hardware
RTX 4090
Models
https://huggingface.co/unsloth/GLM-4-32B-0414-GGUF/blob/main/GLM-4-32B-0414-Q2_K.gguf
Problem description & steps to reproduce
I am finally able to reproduce this in llama.cpp itself. It can be reproduced on an RTX 4090 and 5090 with the latest CUDA build.
Instructions:
- Download llama-b7172-bin-win-cuda-12.4-x64 from releases
- Download this model: https://huggingface.co/unsloth/GLM-4-32B-0414-GGUF/blob/main/GLM-4-32B-0414-Q2_K.gguf (multiple quants are affected, but it's important that you are able to do a full offload so I picked a small one)
- Run this command:
  llama-server.exe -ngl 99 -m GLM-4-32B-0414-Q2_K.gguf
- Send "tell me a joke" to the AI (see the example request below). The result is garbled nonsense.
- Now exit and relaunch, offloading 50 layers instead:
  llama-server.exe -ngl 50 -m GLM-4-32B-0414-Q2_K.gguf
- Send "tell me a joke" to the AI again. This time the result is coherent.
- The threshold appears to be at 59 layers: with 59 offloaded layers the output is coherent, with 60 or more it is incoherent.
- Flash attention must be enabled (it is on by default now, so I did not pass a flag for it).
- This is an old bug: it has been present for at least a month and probably longer, likely since a flash attention refactor.
- Everything works without issue if flash attention is off or when using the Vulkan backend (see the sketch after the reference link below).
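For reference, this is how I send the prompt. Any OpenAI-compatible client should work; the call below is a minimal sketch using curl against the server's /v1/chat/completions endpoint at the default 127.0.0.1:8080 address (both visible in the log further down):

curl http://127.0.0.1:8080/v1/chat/completions -H "Content-Type: application/json" -d "{\"messages\": [{\"role\": \"user\", \"content\": \"tell me a joke\"}]}"

The garbled vs. coherent output is then visible directly in choices[0].message.content of the JSON response.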
For additional reference (optional supplementary information, distinct from this issue): LostRuins#1837
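To confirm the flash attention dependency, I relaunch with flash attention forced off and send the same prompt. This sketch assumes the current on/off/auto syntax of the -fa / --flash-attn flag (the log below shows flash_attn defaulting to auto); adjust if your build expects a different form:

llama-server.exe -ngl 99 -fa off -m GLM-4-32B-0414-Q2_K.gguf

With flash attention off, the full offload produces coherent output for me, which is why I suspect the flash attention code path rather than the model or quant.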
First Bad Commit
No response
Relevant log output
C:\test\llama-b7172-bin-win-cuda-12.4-x64>llama-server.exe -ngl 60 -m C:\test\GLM-4-32B-0414-Q2_K.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4090 Laptop GPU, compute capability 8.9, VMM: yes
load_backend: loaded CUDA backend from C:\test\llama-b7172-bin-win-cuda-12.4-x64\ggml-cuda.dll
load_backend: loaded RPC backend from C:\test\llama-b7172-bin-win-cuda-12.4-x64\ggml-rpc.dll
load_backend: loaded CPU backend from C:\test\llama-b7172-bin-win-cuda-12.4-x64\ggml-cpu-haswell.dll
main: setting n_parallel = 4 and kv_unified = true (add -kvu to disable this)
build: 7172 (b78db3bd5) with clang version 19.1.5 for x86_64-pc-windows-msvc
system info: n_threads = 24, n_threads_batch = 24, total_threads = 32
system_info: n_threads = 24 (n_threads_batch = 24) / 32 | CUDA : ARCHS = 500,610,700,750,800,860,890 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
init: using 31 threads for HTTP server
start: binding port with default address family
main: loading model
srv load_model: loading model 'C:\test\GLM-4-32B-0414-Q2_K.gguf'
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 4090 Laptop GPU) (0000:01:00.0) - 15046 MiB free
llama_model_loader: loaded meta data with 44 key-value pairs and 613 tensors from C:\test\GLM-4-32B-0414-Q2_K.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = glm4
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Glm-4-32B-0414
llama_model_loader: - kv 3: general.version str = 0414
llama_model_loader: - kv 4: general.basename str = Glm-4-32B-0414
llama_model_loader: - kv 5: general.quantized_by str = Unsloth
llama_model_loader: - kv 6: general.size_label str = 32B
llama_model_loader: - kv 7: general.license str = mit
llama_model_loader: - kv 8: general.repo_url str = https://huggingface.co/unsloth
llama_model_loader: - kv 9: general.base_model.count u32 = 1
llama_model_loader: - kv 10: general.base_model.0.name str = GLM 4 32B 0414
llama_model_loader: - kv 11: general.base_model.0.version str = 0414
llama_model_loader: - kv 12: general.base_model.0.organization str = THUDM
llama_model_loader: - kv 13: general.base_model.0.repo_url str = https://huggingface.co/THUDM/GLM-4-32...
llama_model_loader: - kv 14: general.tags arr[str,2] = ["unsloth", "text-generation"]
llama_model_loader: - kv 15: general.languages arr[str,2] = ["zh", "en"]
llama_model_loader: - kv 16: glm4.block_count u32 = 61
llama_model_loader: - kv 17: glm4.context_length u32 = 32768
llama_model_loader: - kv 18: glm4.embedding_length u32 = 6144
llama_model_loader: - kv 19: glm4.feed_forward_length u32 = 23040
llama_model_loader: - kv 20: glm4.attention.head_count u32 = 48
llama_model_loader: - kv 21: glm4.attention.head_count_kv u32 = 2
llama_model_loader: - kv 22: glm4.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 23: glm4.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 24: glm4.attention.key_length u32 = 128
llama_model_loader: - kv 25: glm4.attention.value_length u32 = 128
llama_model_loader: - kv 26: glm4.rope.dimension_count u32 = 64
llama_model_loader: - kv 27: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 28: tokenizer.ggml.pre str = glm4
llama_model_loader: - kv 29: tokenizer.ggml.tokens arr[str,151552] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 30: tokenizer.ggml.token_type arr[i32,151552] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 31: tokenizer.ggml.merges arr[str,318088] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 32: tokenizer.ggml.eos_token_id u32 = 151336
llama_model_loader: - kv 33: tokenizer.ggml.padding_token_id u32 = 151330
llama_model_loader: - kv 34: tokenizer.ggml.eot_token_id u32 = 151336
llama_model_loader: - kv 35: tokenizer.ggml.unknown_token_id u32 = 151329
llama_model_loader: - kv 36: tokenizer.ggml.bos_token_id u32 = 151329
llama_model_loader: - kv 37: tokenizer.chat_template str = [gMASK]<sop>\n{%- if tools -%}\n<|syste...
llama_model_loader: - kv 38: general.quantization_version u32 = 2
llama_model_loader: - kv 39: general.file_type u32 = 10
llama_model_loader: - kv 40: quantize.imatrix.file str = GLM-4-32B-0414-GGUF/imatrix_unsloth.dat
llama_model_loader: - kv 41: quantize.imatrix.dataset str = unsloth_calibration_GLM-4-32B-0414.txt
llama_model_loader: - kv 42: quantize.imatrix.entries_count u32 = 366
llama_model_loader: - kv 43: quantize.imatrix.chunks_count u32 = 724
llama_model_loader: - type f32: 245 tensors
llama_model_loader: - type q2_K: 184 tensors
llama_model_loader: - type q3_K: 122 tensors
llama_model_loader: - type q4_K: 61 tensors
llama_model_loader: - type q6_K: 1 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q2_K - Medium
print_info: file size = 11.44 GiB (3.02 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load: - 151329 ('<|endoftext|>')
load: - 151336 ('<|user|>')
load: special tokens cache size = 14
load: token to piece cache size = 0.9710 MB
print_info: arch = glm4
print_info: vocab_only = 0
print_info: n_ctx_train = 32768
print_info: n_embd = 6144
print_info: n_embd_inp = 6144
print_info: n_layer = 61
print_info: n_head = 48
print_info: n_head_kv = 2
print_info: n_rot = 64
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 24
print_info: n_embd_k_gqa = 256
print_info: n_embd_v_gqa = 256
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 23040
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: n_expert_groups = 0
print_info: n_group_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = linear
print_info: freq_base_train = 10000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 32768
print_info: rope_finetuned = unknown
print_info: model type = 32B
print_info: model params = 32.57 B
print_info: general.name = Glm-4-32B-0414
print_info: vocab type = BPE
print_info: n_vocab = 151552
print_info: n_merges = 318088
print_info: BOS token = 151329 '<|endoftext|>'
print_info: EOS token = 151336 '<|user|>'
print_info: EOT token = 151336 '<|user|>'
print_info: UNK token = 151329 '<|endoftext|>'
print_info: PAD token = 151330 '[MASK]'
print_info: LF token = 198 'Ċ'
print_info: EOG token = 151329 '<|endoftext|>'
print_info: EOG token = 151336 '<|user|>'
print_info: max token length = 1024
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 60 repeating layers to GPU
load_tensors: offloaded 60/62 layers to GPU
load_tensors: CPU_Mapped model buffer size = 1195.15 MiB
load_tensors: CUDA0 model buffer size = 10518.75 MiB
..............................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 4
llama_context: n_ctx = 4096
llama_context: n_ctx_seq = 4096
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = auto
llama_context: kv_unified = true
llama_context: freq_base = 10000.0
llama_context: freq_scale = 1
llama_context: n_ctx_seq (4096) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
llama_context: CPU output buffer size = 2.31 MiB
llama_kv_cache: CPU KV buffer size = 4.00 MiB
llama_kv_cache: CUDA0 KV buffer size = 240.00 MiB
llama_kv_cache: size = 244.00 MiB ( 4096 cells, 61 layers, 4/1 seqs), K (f16): 122.00 MiB, V (f16): 122.00 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context: CUDA0 compute buffer size = 1036.44 MiB
llama_context: CUDA_Host compute buffer size = 20.01 MiB
llama_context: graph nodes = 2081
llama_context: graph splits = 16 (with bs=512), 3 (with bs=1)
common_init_from_params: added <|endoftext|> logit bias = -inf
common_init_from_params: added <|user|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv init: initializing slots, n_slots = 4
slot init: id 0 | task -1 | new slot, n_ctx = 4096
slot init: id 1 | task -1 | new slot, n_ctx = 4096
slot init: id 2 | task -1 | new slot, n_ctx = 4096
slot init: id 3 | task -1 | new slot, n_ctx = 4096
srv init: prompt cache is enabled, size limit: 8192 MiB
srv init: use `--cache-ram 0` to disable the prompt cache
srv init: for more info see https://github.com/ggml-org/llama.cpp/pull/16391
srv init: thinking = 0
init: chat template, chat_template: [gMASK]<sop>
{%- if tools -%}
<|system|>
# 可用工具
{% for tool in tools %}
{%- set function = tool.function if tool.get("function") else tool %}
## {{ function.name }}
{{ function | tojson(indent=4)|string }}
在调用上述函数时,请使用 Json 格式表示调用的参数。
{%- endfor %}
{%- endif -%}
{%- for msg in messages %}
{%- if msg.role == 'system' %}
<|system|>
{{ msg.content }}
{%- endif %}
{%- endfor %}
{%- for message in messages if message.role != 'system' %}
{%- set role = message['role'] %}
{%- set content = message['content'] %}
{%- set meta = message.get("metadata", "") %}
{%- if role == 'user' %}
<|user|>
{{ content }}
{%- elif role == 'assistant' and not meta %}
<|assistant|>
{{ content }}
{%- elif role == 'assistant' and meta %}
<|assistant|>{{ meta }}
{{ content }}
{%- elif role == 'observation' %}
<|observation|>
{{ content }}
{%- endif %}
{%- endfor %}
{% if add_generation_prompt %}<|assistant|>{% endif %}, example_format: '[gMASK]<sop><|system|>
You are a helpful assistant<|user|>
Hello<|assistant|>
Hi there<|user|>
How are you?<|assistant|>'
main: model loaded
main: server is listening on http://127.0.0.1:8080
main: starting the main loop...
srv update_slots: all slots are idle
srv update_slots: all slots are idle
srv log_server_r: request: GET /slots 127.0.0.1 200
srv params_from_: Chat format: Content-only
slot get_availabl: id 3 | task -1 | selected slot by LRU, t_last = -1
slot launch_slot_: id 3 | task -1 | sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
slot launch_slot_: id 3 | task 1 | processing task
slot update_slots: id 3 | task 1 | new prompt, n_ctx_slot = 4096, n_keep = 0, task.n_tokens = 9
slot update_slots: id 3 | task 1 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id 3 | task 1 | prompt processing progress, n_tokens = 9, batch.n_tokens = 9, progress = 1.000000
slot update_slots: id 3 | task 1 | prompt done, n_tokens = 9, batch.n_tokens = 9
srv log_server_r: request: GET /props 127.0.0.1 200
srv log_server_r: request: GET /slots 127.0.0.1 200
srv log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
srv stop: cancel task, id_task = 1
slot release: id 3 | task 1 | stop processing: n_tokens = 52, truncated = 0
srv update_slots: all slots are idle