
Eval bug: Segmentation fault (core dumped) for Gemma 2 series #17426

@peterplv

Description

Name and Version

llama-cli --version
version: 7048 (0cfb191)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04.2) 11.4.0 for x86_64-linux-gnu

Operating systems

Linux

GGML backends

CUDA

Hardware

RTX 3060 12GB, CUDA Version: 12.5

Models

gemma-2-2b-it-Q8_0.gguf
gemma-2-9b-it-Q6_K.gguf

Problem description & steps to reproduce

After a recent update, when loading Gemma 2 series models (2b, 9b), I get the error:
Segmentation fault (core dumped).

The same models worked on the previous build (downloaded a few weeks ago, can't remember the exact version).
It's definitely not an out-of-memory issue (at least not for the 2b model); it looks like a bug. Again, the same models worked fine with the same settings before, and there was still plenty of memory left.

I tried setting the context to --ctx-size 4096, with the same result. The only thing that helped was reducing the number of layers offloaded to the GPU by 1. For example:
previously, it ran like this (all layers on the GPU):
llama-cli -ngl 27 --ctx-size 4096 -m gemma2_2b_it_q8.gguf
result: Segmentation fault (core dumped)
now:
llama-cli -ngl 26 --ctx-size 4096 -m gemma2_2b_it_q8.gguf
result: ok.

Or the 9b model:
llama-cli -ngl 43 --ctx-size 4096 -m gemma2_9b_it_q6k.gguf
result: Segmentation fault (core dumped)
now:
llama-cli -ngl 42 --ctx-size 4096 -m gemma2_9b_it_q6k.gguf
result: ok.
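
For anyone reproducing this, here is a minimal shell sketch (assuming llama-cli is on PATH and the model filenames above; the prompt and layer count are placeholders) that steps -ngl down until the run no longer segfaults:

#!/usr/bin/env bash
# Sketch: find the highest -ngl value that does not segfault.
# MODEL and MAX_NGL are placeholders for the setups above.
MODEL=gemma2_2b_it_q8.gguf
MAX_NGL=27   # 26 repeating layers + the output layer for the 2b model

for ngl in $(seq "$MAX_NGL" -1 0); do
    # exit status 0 means the run completed; a segfault returns non-zero
    if llama-cli -ngl "$ngl" --ctx-size 4096 -m "$MODEL" -p "hi" -n 8 </dev/null >/dev/null 2>&1; then
        echo "highest working -ngl: $ngl"
        break
    fi
done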

I haven't had such issues with other models, including newer and larger ones such as gemma3_12b_it_q8.gguf; it's only the Gemma 2 models, for example:
https://huggingface.co/bartowski/gemma-2-2b-it-GGUF/resolve/main/gemma-2-2b-it-Q8_0.gguf
https://huggingface.co/bartowski/gemma-2-9b-it-GGUF/resolve/main/gemma-2-9b-it-Q6_K.gguf

First Bad Commit

No response
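
A rough sketch of how the first bad commit could be tracked down with git bisect from a llama.cpp source checkout, assuming the reproducer above; the known-good ref and build flags are placeholders for this setup:

# Rough sketch, assuming a CUDA build of llama.cpp from source;
# <last-known-good> is a placeholder for a ref from a few weeks ago.
git bisect start
git bisect bad HEAD
git bisect good <last-known-good>
# at each bisect step: rebuild, run the reproducer, then mark the result
cmake -B build -DGGML_CUDA=ON && cmake --build build --config Release -j
./build/bin/llama-cli -ngl 27 --ctx-size 4096 -m gemma2_2b_it_q8.gguf
# git bisect good   (if it runs)
# git bisect bad    (if it segfaults)
git bisect reset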

Relevant log output

llama-cli -ngl 27 --ctx-size 4096 -m gemma2_2b_it_q8.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
build: 7048 (0cfb19166) with cc (Ubuntu 11.4.0-1ubuntu1~22.04.2) 11.4.0 for x86_64-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3060) (0000:04:00.0) - 11808 MiB free
llama_model_loader: loaded meta data with 32 key-value pairs and 288 tensors from gemma2_2b_it_q8.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = gemma2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Models
llama_model_loader: - kv   3:                         general.size_label str              = 2.6B
llama_model_loader: - kv   4:                            general.license str              = gemma
llama_model_loader: - kv   5:                               general.tags arr[str,1]       = ["text-generation"]
llama_model_loader: - kv   6:                      gemma2.context_length u32              = 8192
llama_model_loader: - kv   7:                    gemma2.embedding_length u32              = 2304
llama_model_loader: - kv   8:                         gemma2.block_count u32              = 26
llama_model_loader: - kv   9:                 gemma2.feed_forward_length u32              = 9216
llama_model_loader: - kv  10:                gemma2.attention.head_count u32              = 8
llama_model_loader: - kv  11:             gemma2.attention.head_count_kv u32              = 4
llama_model_loader: - kv  12:    gemma2.attention.layer_norm_rms_epsilon f32              = 0,000001
llama_model_loader: - kv  13:                gemma2.attention.key_length u32              = 256
llama_model_loader: - kv  14:              gemma2.attention.value_length u32              = 256
llama_model_loader: - kv  15:                          general.file_type u32              = 7
llama_model_loader: - kv  16:              gemma2.attn_logit_softcapping f32              = 50,000000
llama_model_loader: - kv  17:             gemma2.final_logit_softcapping f32              = 30,000000
llama_model_loader: - kv  18:            gemma2.attention.sliding_window u32              = 4096
llama_model_loader: - kv  19:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  20:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  21:                      tokenizer.ggml.tokens arr[str,256000]  = ["<pad>", "<eos>", "<bos>", "<unk>", ...
llama_model_loader: - kv  22:                      tokenizer.ggml.scores arr[f32,256000]  = [-1000,000000, -1000,000000, -1000,00...
llama_model_loader: - kv  23:                  tokenizer.ggml.token_type arr[i32,256000]  = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv  24:                tokenizer.ggml.bos_token_id u32              = 2
llama_model_loader: - kv  25:                tokenizer.ggml.eos_token_id u32              = 1
llama_model_loader: - kv  26:            tokenizer.ggml.unknown_token_id u32              = 3
llama_model_loader: - kv  27:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  28:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  29:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  30:            tokenizer.ggml.add_space_prefix bool             = false
llama_model_loader: - kv  31:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  105 tensors
llama_model_loader: - type q8_0:  183 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q8_0
print_info: file size   = 2,59 GiB (8,50 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load:   - 1 ('<eos>')
load:   - 107 ('<end_of_turn>')
load: special tokens cache size = 249
load: token to piece cache size = 1,6014 MB
print_info: arch             = gemma2
print_info: vocab_only       = 0
print_info: n_ctx_train      = 8192
print_info: n_embd           = 2304
print_info: n_embd_inp       = 2304
print_info: n_layer          = 26
print_info: n_head           = 8
print_info: n_head_kv        = 4
print_info: n_rot            = 256
print_info: n_swa            = 4096
print_info: is_swa_any       = 1
print_info: n_embd_head_k    = 256
print_info: n_embd_head_v    = 256
print_info: n_gqa            = 2
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0,0e+00
print_info: f_norm_rms_eps   = 1,0e-06
print_info: f_clamp_kqv      = 0,0e+00
print_info: f_max_alibi_bias = 0,0e+00
print_info: f_logit_scale    = 0,0e+00
print_info: f_attn_scale     = 6,2e-02
print_info: n_ff             = 9216
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: n_expert_groups  = 0
print_info: n_group_used     = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 10000,0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 8192
print_info: rope_finetuned   = unknown
print_info: model type       = 2B
print_info: model params     = 2,61 B
print_info: general.name     = Models
print_info: vocab type       = SPM
print_info: n_vocab          = 256000
print_info: n_merges         = 0
print_info: BOS token        = 2 '<bos>'
print_info: EOS token        = 1 '<eos>'
print_info: EOT token        = 107 '<end_of_turn>'
print_info: UNK token        = 3 '<unk>'
print_info: PAD token        = 0 '<pad>'
print_info: LF token         = 227 '<0x0A>'
print_info: EOG token        = 1 '<eos>'
print_info: EOG token        = 107 '<end_of_turn>'
print_info: max token length = 48
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 26 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 27/27 layers to GPU
load_tensors:   CPU_Mapped model buffer size =   597,66 MiB
load_tensors:        CUDA0 model buffer size =  2649,78 MiB
..................................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 4096
llama_context: n_ctx_seq     = 4096
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = auto
llama_context: kv_unified    = false
llama_context: freq_base     = 10000,0
llama_context: freq_scale    = 1
llama_context: n_ctx_seq (4096) < n_ctx_train (8192) -- the full capacity of the model will not be utilized
llama_context:  CUDA_Host  output buffer size =     0,98 MiB
llama_kv_cache_iswa: creating non-SWA KV cache, size = 4096 cells
llama_kv_cache:      CUDA0 KV buffer size =   208,00 MiB
llama_kv_cache: size =  208,00 MiB (  4096 cells,  13 layers,  1/1 seqs), K (f16):  104,00 MiB, V (f16):  104,00 MiB
llama_kv_cache_iswa: creating     SWA KV cache, size = 4096 cells
llama_kv_cache:      CUDA0 KV buffer size =   208,00 MiB
llama_kv_cache: size =  208,00 MiB (  4096 cells,  13 layers,  1/1 seqs), K (f16):  104,00 MiB, V (f16):  104,00 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context:      CUDA0 compute buffer size =   504,50 MiB
llama_context:  CUDA_Host compute buffer size =    20,52 MiB
llama_context: graph nodes  = 948
llama_context: graph splits = 2
common_init_from_params: added <eos> logit bias = -inf
common_init_from_params: added <end_of_turn> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
Segmentation fault (core dumped)
