What happened?
When trying to use a Mamba model, in this case falcon-mamba-7b-Q4_K_S.gguf, llama-cli crashes with a segmentation fault:
$ ./llama-cli -m models/falcon-mamba-7b-Q4_K_S.gguf -ngl 33 --no-warmup --prompt '"What is LoRA?"' -n 10
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4070, compute capability 8.9, VMM: yes
register_backend: registered backend CUDA (1 devices)
register_device: registered device CUDA0 (NVIDIA GeForce RTX 4070)
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (12th Gen Intel(R) Core(TM) i7-1260P)
build: 3997 (dea5e860) with cc (Ubuntu 13.2.0-23ubuntu4) 13.2.0 for x86_64-linux-gnu (debug)
main: llama backend init
main: load the model and apply lora adapter, if any
llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 4070) - 11743 MiB free
llama_model_loader: loaded meta data with 34 key-value pairs and 643 tensors from models/falcon-mamba-7b-Q4_K_S.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = mamba
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.basename str = falcon-mamba
llama_model_loader: - kv 3: general.size_label str = 7B
llama_model_loader: - kv 4: general.license str = other
llama_model_loader: - kv 5: general.license.name str = falcon-mamba-7b-license
llama_model_loader: - kv 6: general.license.link str = https://falconllm.tii.ae/falcon-mamba...
llama_model_loader: - kv 7: general.languages arr[str,1] = ["en"]
llama_model_loader: - kv 8: general.datasets arr[str,2] = ["tiiuae/falcon-refinedweb", "Hugging...
llama_model_loader: - kv 9: mamba.context_length u32 = 1048576
llama_model_loader: - kv 10: mamba.embedding_length u32 = 4096
llama_model_loader: - kv 11: mamba.feed_forward_length u32 = 0
llama_model_loader: - kv 12: mamba.attention.head_count u32 = 0
llama_model_loader: - kv 13: mamba.block_count u32 = 64
llama_model_loader: - kv 14: mamba.ssm.conv_kernel u32 = 4
llama_model_loader: - kv 15: mamba.ssm.inner_size u32 = 8192
llama_model_loader: - kv 16: mamba.ssm.state_size u32 = 16
llama_model_loader: - kv 17: mamba.ssm.time_step_rank u32 = 256
llama_model_loader: - kv 18: mamba.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 19: mamba.ssm.dt_b_c_rms bool = true
llama_model_loader: - kv 20: general.file_type u32 = 14
llama_model_loader: - kv 21: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 22: tokenizer.ggml.pre str = falcon
llama_model_loader: - kv 23: tokenizer.ggml.tokens arr[str,65024] = [">>TITLE<<", ">>ABSTRACT<<", ">>INTR...
llama_model_loader: - kv 24: tokenizer.ggml.token_type arr[i32,65024] = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv 25: tokenizer.ggml.merges arr[str,64784] = ["Ġ t", "Ġ a", "i n", "h e", "r e",...
llama_model_loader: - kv 26: tokenizer.ggml.eos_token_id u32 = 11
llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
llama_model_loader: - kv 28: tokenizer.ggml.padding_token_id u32 = 11
llama_model_loader: - kv 29: general.quantization_version u32 = 2
llama_model_loader: - kv 30: quantize.imatrix.file str = /models_out/falcon-mamba-7b-GGUF/falc...
llama_model_loader: - kv 31: quantize.imatrix.dataset str = /training_dir/calibration_datav3.txt
llama_model_loader: - kv 32: quantize.imatrix.entries_count i32 = 256
llama_model_loader: - kv 33: quantize.imatrix.chunks_count i32 = 139
llama_model_loader: - type f32: 385 tensors
llama_model_loader: - type q4_K: 257 tensors
llama_model_loader: - type q6_K: 1 tensors
llm_load_vocab: special tokens cache size = 12
llm_load_vocab: token to piece cache size = 0.3884 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = mamba
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 65024
llm_load_print_meta: n_merges = 64784
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 1048576
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_layer = 64
llm_load_print_meta: n_head = 0
llm_load_print_meta: n_head_kv = 0
llm_load_print_meta: n_rot = 0
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 0
llm_load_print_meta: n_embd_head_v = 0
llm_load_print_meta: n_gqa = 0
llm_load_print_meta: n_embd_k_gqa = 0
llm_load_print_meta: n_embd_v_gqa = 0
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 0
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = -1
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 1048576
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 4
llm_load_print_meta: ssm_d_inner = 8192
llm_load_print_meta: ssm_d_state = 16
llm_load_print_meta: ssm_dt_rank = 256
llm_load_print_meta: ssm_dt_b_c_rms = 1
llm_load_print_meta: model type = ?B
llm_load_print_meta: model ftype = Q4_K - Small
llm_load_print_meta: model params = 7.27 B
llm_load_print_meta: model size = 3.91 GiB (4.62 BPW)
llm_load_print_meta: general.name = n/a
llm_load_print_meta: BOS token = 0 '>>TITLE<<'
llm_load_print_meta: EOS token = 11 '<|endoftext|>'
llm_load_print_meta: EOT token = 11 '<|endoftext|>'
llm_load_print_meta: PAD token = 11 '<|endoftext|>'
llm_load_print_meta: LF token = 138 'Ä'
llm_load_print_meta: EOG token = 11 '<|endoftext|>'
llm_load_print_meta: max token length = 130
Segmentation fault (core dumped)
This is where the segmentation fault happened:
(gdb) r
Starting program: /home/danbev/work/ai/llama.cpp/llama-cli -m models/falcon-mamba-7b-Q4_K_S.gguf -ngl 33 --no-warmup --prompt \"What\ is\ LoRA\?\" -n 10
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7fffcee00000 (LWP 3344514)]
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4070, compute capability 8.9, VMM: yes
register_backend: registered backend CUDA (1 devices)
register_device: registered device CUDA0 (NVIDIA GeForce RTX 4070)
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (12th Gen Intel(R) Core(TM) i7-1260P)
[New Thread 0x7fffcd200000 (LWP 3344521)]
build: 3997 (dea5e860) with cc (Ubuntu 13.2.0-23ubuntu4) 13.2.0 for x86_64-linux-gnu (debug)
main: llama backend init
main: load the model and apply lora adapter, if any
[New Thread 0x7fffc1400000 (LWP 3344522)]
[New Thread 0x7fffc0a00000 (LWP 3344523)]
llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 4070) - 11743 MiB free
llama_model_loader: loaded meta data with 34 key-value pairs and 643 tensors from /home/danbev/work/ai/learning-ai/fundamentals/llama.cpp/models/falcon-mamba-7b-Q4_K_S.gguf (version GGUF V3 (latest))
[... model loading output identical to the first run above ...]
Thread 1 "llama-cli" received signal SIGSEGV, Segmentation fault.
0x00005555558a8af7 in ggml_is_3d (tensor=0x0) at ggml/src/ggml.c:3556
3556 return tensor->ne[3] == 1;
(gdb) bt
#0 0x00005555558a8af7 in ggml_is_3d (tensor=0x0) at ggml/src/ggml.c:3556
#1 0x00005555558b2d9f in ggml_ssm_conv (ctx=0x555565272f48 <g_state+200>, sx=0x0, c=0x555566f17b00) at ggml/src/ggml.c:7266
#2 0x0000555555957a9a in weight_buft_supported (hparams=..., w=0x555566f17b00, op=GGML_OP_SSM_CONV,
buft=0x5555651ea020 <ggml_backend_cuda_host_buffer_type::ggml_backend_cuda_buffer_type_host>, dev=0x555565f50490)
at src/llama.cpp:7166
#3 0x0000555555957da4 in select_weight_buft (model=..., tensor=0x555566f17b00, op=GGML_OP_SSM_CONV,
buft_list=std::vector of length 2, capacity 2 = {...}) at src/llama.cpp:7200
#4 0x0000555555958baa in operator() (__closure=0x7fffffffb610, tn=..., ne=std::initializer_list of length 2 = {...},
flags=0) at src/llama.cpp:7485
#5 0x0000555555969277 in llm_load_tensors (ml=..., model=..., n_gpu_layers=33, split_mode=LLAMA_SPLIT_MODE_LAYER,
main_gpu=0, tensor_split=0x7fffffffc7b4, use_mlock=false, progress_callback=0x5555559872a9 <_FUN(float, void*)>,
progress_callback_user_data=0x7fffffffb8e0) at src/llama.cpp:8435
#6 0x00005555559754ad in llama_model_load (
fname="/home/danbev/work/ai/learning-ai/fundamentals/llama.cpp/models/falcon-mamba-7b-Q4_K_S.gguf", model=...,
params=...) at src/llama.cpp:9235
#7 0x000055555598795c in llama_load_model_from_file (
path_model=0x555565f5b3a0 "/home/danbev/work/ai/learning-ai/fundamentals/llama.cpp/models/falcon-mamba-7b-Q4_K_S.gguf",
params=...) at src/llama.cpp:19358
#8 0x0000555555b04fe6 in common_init_from_params (params=...) at common/common.cpp:836
#9 0x0000555555bd1b09 in main (argc=10, argv=0x7fffffffdab8) at examples/main/main.cpp:200
(gdb) up
(gdb) up
(gdb) p w->name
$1 = "blk.0.ssm_conv1d.weight", '\000' <repeats 40 times>This call is coming from weight_buft_supported:
static bool weight_buft_supported(const llama_hparams & hparams, ggml_tensor * w, ggml_op op, ggml_backend_buffer_type_t buft, ggml_backend_dev_t dev) {
    ...
    switch (op) {
        ...
        case GGML_OP_SSM_CONV:
            {
                // TODO: ggml_ssm_conv(ctx, conv_x, model.layers[il].ssm_conv1d);
                op_tensor = ggml_ssm_conv(ctx, nullptr, w);
            } break;
    }
Given the TODO here, this might be expected, but I wanted to raise it just in case.
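For context, the crash itself is a plain null-pointer dereference: ggml_ssm_conv asserts that its input is 3-D via ggml_is_3d, and that check reads tensor->ne[3] without a null guard, which is exactly the frame the backtrace lands on. Below is a minimal sketch of one way the nullptr could be avoided, assuming the dummy-operand pattern with placeholder dimensions I chose myself (they are not from the actual code or any fix); the sizes are picked only to satisfy the op's asserts:

    case GGML_OP_SSM_CONV:
        {
            // Hypothetical sketch: pass a throwaway input instead of nullptr.
            // ggml_ssm_conv asserts ggml_is_3d(sx) and sx->ne[1] == c->ne[1]
            // (d_inner), so a 3-D dummy whose second dimension matches the
            // conv weight satisfies the checks.
            ggml_tensor * conv_x = ggml_new_tensor_3d(ctx, GGML_TYPE_F32,
                w->ne[0],  // any length >= d_conv works here
                w->ne[1],  // d_inner, must match the weight
                1);        // a single sequence
            op_tensor = ggml_ssm_conv(ctx, conv_x, w);
        } break;

Whether a dummy tensor like this or a null check inside ggml_ssm_conv is the right place to handle it is for the maintainers to decide; the sketch only illustrates why the current nullptr cannot work.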
Name and Version
$ ./llama-cli --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4070, compute capability 8.9, VMM: yes
register_backend: registered backend CUDA (1 devices)
register_device: registered device CUDA0 (NVIDIA GeForce RTX 4070)
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (12th Gen Intel(R) Core(TM) i7-1260P)
version: 3997 (dea5e860)
built with cc (Ubuntu 13.2.0-23ubuntu4) 13.2.0 for x86_64-linux-gnu
What operating system are you seeing the problem on?
Linux
Relevant log output
No response