Eval bug: Jina embeddings v2 base code crashes with GGML_ASSERT(ggml_can_mul_mat(a, b)) failed #16392

@theo77186

Name and Version

llama.cpp latest master

Operating systems

Linux

GGML backends

CUDA

Hardware

CPU: Ryzen 9 5950x
GPU: RTX 4060 Ti 16GB + 3060 12GB (tested on the latter)

Models

https://huggingface.co/jinaai/jina-embeddings-v2-base-code at f16

Problem description & steps to reproduce

When launching the server with llama-server -m jina-embeddings-v2-base-code-f16.gguf -ts 0,1 -b 8192 -ub 8192 --embeddings, the server crashes on start with the following error:
ggml.c:3021: GGML_ASSERT(ggml_can_mul_mat(a, b)) failed

The -ts argument can be omitted on systems with a single GPU.

First Bad Commit

Introduced by commit 9777032 (#15696).

Relevant log output

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4060 Ti, compute capability 8.9, VMM: yes
  Device 1: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
build: 6334 (9777032dc) with cc (Debian 15.2.0-4) 15.2.0 for x86_64-linux-gnu
system info: n_threads = 16, n_threads_batch = 16, total_threads = 32

system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860,890 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | 

main: binding port with default address family
main: HTTP server is listening, hostname: 127.0.0.1, port: 8081, http threads: 31
main: loading model
srv    load_model: loading model 'jina-embeddings-v2-base-code.f16.gguf'
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 4060 Ti) - 14856 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3060) - 11822 MiB free
llama_model_loader: loaded meta data with 36 key-value pairs and 268 tensors from jina-embeddings-v2-base-code.f16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = jina-bert-v2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Jina Bert v2 Qk Post Norm
llama_model_loader: - kv   3:                       general.organization str              = Jinaai
llama_model_loader: - kv   4:                         general.size_label str              = 160M
llama_model_loader: - kv   5:                            general.license str              = apache-2.0
llama_model_loader: - kv   6:                      general.dataset.count u32              = 1
llama_model_loader: - kv   7:                     general.dataset.0.name str              = C4
llama_model_loader: - kv   8:             general.dataset.0.organization str              = Allenai
llama_model_loader: - kv   9:                 general.dataset.0.repo_url str              = https://huggingface.co/allenai/c4
llama_model_loader: - kv  10:                               general.tags arr[str,6]       = ["sentence-transformers", "feature-ex...
llama_model_loader: - kv  11:                          general.languages arr[str,1]       = ["en"]
llama_model_loader: - kv  12:                   jina-bert-v2.block_count u32              = 12
llama_model_loader: - kv  13:                jina-bert-v2.context_length u32              = 8192
llama_model_loader: - kv  14:              jina-bert-v2.embedding_length u32              = 768
llama_model_loader: - kv  15:           jina-bert-v2.feed_forward_length u32              = 3072
llama_model_loader: - kv  16:          jina-bert-v2.attention.head_count u32              = 12
llama_model_loader: - kv  17:  jina-bert-v2.attention.layer_norm_epsilon f32              = 0.000000
llama_model_loader: - kv  18:                          general.file_type u32              = 1
llama_model_loader: - kv  19:              jina-bert-v2.attention.causal bool             = false
llama_model_loader: - kv  20:                  jina-bert-v2.pooling_type u32              = 1
llama_model_loader: - kv  21:               general.quantization_version u32              = 2
llama_model_loader: - kv  22:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  23:                         tokenizer.ggml.pre str              = jina-v2-code
llama_model_loader: - kv  24:                      tokenizer.ggml.tokens arr[str,61056]   = ["<s>", "<pad>", "</s>", "<unk>", "<m...
llama_model_loader: - kv  25:                  tokenizer.ggml.token_type arr[i32,61056]   = [3, 3, 3, 3, 3, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  26:                      tokenizer.ggml.merges arr[str,60795]   = ["Ġ Ġ", "ĠĠ ĠĠ", "ĠĠĠĠ ĠĠ...
llama_model_loader: - kv  27:                tokenizer.ggml.bos_token_id u32              = 0
llama_model_loader: - kv  28:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  29:            tokenizer.ggml.unknown_token_id u32              = 3
llama_model_loader: - kv  30:          tokenizer.ggml.seperator_token_id u32              = 2
llama_model_loader: - kv  31:            tokenizer.ggml.padding_token_id u32              = 1
llama_model_loader: - kv  32:               tokenizer.ggml.mask_token_id u32              = 4
llama_model_loader: - kv  33:            tokenizer.ggml.token_type_count u32              = 2
llama_model_loader: - kv  34:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  35:               tokenizer.ggml.add_eos_token bool             = true
llama_model_loader: - type  f32:  183 tensors
llama_model_loader: - type  f16:   85 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = F16
print_info: file size   = 305.98 MiB (16.01 BPW) 
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load:   - 2 ('</s>')
load: special tokens cache size = 5
load: token to piece cache size = 0.3623 MB
print_info: arch             = jina-bert-v2
print_info: vocab_only       = 0
print_info: n_ctx_train      = 8192
print_info: n_embd           = 768
print_info: n_layer          = 12
print_info: n_head           = 12
print_info: n_head_kv        = 12
print_info: n_rot            = 64
print_info: n_swa            = 0
print_info: is_swa_any       = 0
print_info: n_embd_head_k    = 64
print_info: n_embd_head_v    = 64
print_info: n_gqa            = 1
print_info: n_embd_k_gqa     = 768
print_info: n_embd_v_gqa     = 768
print_info: f_norm_eps       = 1.0e-12
print_info: f_norm_rms_eps   = 0.0e+00
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 8.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 3072
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 0
print_info: pooling type     = 1
print_info: rope type        = -1
print_info: rope scaling     = linear
print_info: freq_base_train  = 10000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 8192
print_info: rope_finetuned   = unknown
print_info: model type       = 137M
print_info: model params     = 160.28 M
print_info: general.name     = Jina Bert v2 Qk Post Norm
print_info: vocab type       = BPE
print_info: n_vocab          = 61056
print_info: n_merges         = 60795
print_info: BOS token        = 0 '<s>'
print_info: EOS token        = 2 '</s>'
print_info: UNK token        = 3 '<unk>'
print_info: SEP token        = 2 '</s>'
print_info: PAD token        = 1 '<pad>'
print_info: MASK token       = 4 '<mask>'
print_info: LF token         = 203 'Ċ'
print_info: EOG token        = 2 '</s>'
print_info: max token length = 512
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 12 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 13/13 layers to GPU
load_tensors:        CUDA1 model buffer size =   216.53 MiB
load_tensors:   CPU_Mapped model buffer size =    89.45 MiB
........................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 8192
llama_context: n_ctx_per_seq = 8192
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 2048
llama_context: causal_attn   = 0
llama_context: flash_attn    = auto
llama_context: kv_unified    = false
llama_context: freq_base     = 10000.0
llama_context: freq_scale    = 1
llama_context:  CUDA_Host  output buffer size =     0.24 MiB
/home/theo/llama-quant/llama.cpp/ggml/src/ggml.c:3021: GGML_ASSERT(ggml_can_mul_mat(a, b)) failed
[New LWP 3738220]
[New LWP 3738219]
[New LWP 3738218]
[New LWP 3738217]
[New LWP 3738216]
[New LWP 3738215]
[New LWP 3738214]
[New LWP 3738213]
[New LWP 3738212]
[New LWP 3738211]
[New LWP 3738210]
[New LWP 3738209]
[New LWP 3738208]
[New LWP 3738207]
[New LWP 3738206]
[New LWP 3738205]
[New LWP 3738204]
[New LWP 3738203]
[New LWP 3738202]
[New LWP 3738201]
[New LWP 3738200]
[New LWP 3738199]
[New LWP 3738198]
[New LWP 3738197]
[New LWP 3738196]
[New LWP 3738195]
[New LWP 3738194]
[New LWP 3738193]
[New LWP 3738192]
[New LWP 3738191]
[New LWP 3738190]
[New LWP 3738189]
[New LWP 3738188]
[New LWP 3738187]
[New LWP 3738186]
[New LWP 3738185]
[New LWP 3738184]
[New LWP 3738177]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
__syscall_cancel_arch () at ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S:56
warning: 56     ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S: No such file or directory
#0  __syscall_cancel_arch () at ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S:56
56      in ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S
#1  0x00007f794d37c668 in __internal_syscall_cancel (a1=a1@entry=3738221, a2=a2@entry=0, a3=a3@entry=0, a4=a4@entry=0, a5=a5@entry=0, a6=a6@entry=0, nr=61) at ./nptl/cancellation.c:49
warning: 49     ./nptl/cancellation.c: No such file or directory
#2  0x00007f794d37c6ad in __syscall_cancel (a1=a1@entry=3738221, a2=a2@entry=0, a3=a3@entry=0, a4=a4@entry=0, a5=a5@entry=0, a6=a6@entry=0, nr=61) at ./nptl/cancellation.c:75
75      in ./nptl/cancellation.c
#3  0x00007f794d3e7787 in __GI___wait4 (pid=pid@entry=3738221, stat_loc=stat_loc@entry=0x0, options=options@entry=0, usage=usage@entry=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
warning: 30     ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory
#4  0x00007f794d3e77b7 in __GI___waitpid (pid=pid@entry=3738221, stat_loc=stat_loc@entry=0x0, options=options@entry=0) at ./posix/waitpid.c:38
warning: 38     ./posix/waitpid.c: No such file or directory
#5  0x00007f794d8b5db3 in ggml_print_backtrace () at /home/theo/llama-quant/llama.cpp/ggml/src/ggml.c:196
196             waitpid(child_pid, NULL, 0);
#6  0x00007f794d8b5eff in ggml_abort (file=file@entry=0x7f794d8fa4b8 "/home/theo/llama-quant/llama.cpp/ggml/src/ggml.c", line=line@entry=3021, fmt=fmt@entry=0x7f794d8f8093 "GGML_ASSERT(%s) failed") at /home/theo/llama-quant/llama.cpp/ggml/src/ggml.c:230
230             ggml_print_backtrace();
#7  0x00007f794d8b93b3 in ggml_mul_mat (ctx=<optimized out>, a=<optimized out>, b=<optimized out>) at /home/theo/llama-quant/llama.cpp/ggml/src/ggml.c:3021
3021        GGML_ASSERT(ggml_can_mul_mat(a, b));
#8  0x00007f794d9d9830 in llm_graph_context::build_pooling (this=0x55f22f576c90, cls=<optimized out>, cls_b=<optimized out>, cls_out=<optimized out>, cls_out_b=<optimized out>) at /home/theo/llama-quant/llama.cpp/src/llama-graph.cpp:1823
1823                    cur = ggml_mul_mat(ctx0, ggml_cont(ctx0, ggml_transpose(ctx0, inp)), inp_mean);
#9  0x00007f794da0c474 in llama_model::build_graph (this=0x55f22d9f98d0, params=...) at /home/theo/llama-quant/llama.cpp/src/llama-model.cpp:18983
18983       llm->build_pooling(cls, cls_b, cls_out, cls_out_b);
#10 0x00007f794d9af103 in llama_context::graph_reserve (this=this@entry=0x55f22ea0e940, n_tokens=n_tokens@entry=1, n_seqs=n_seqs@entry=1, n_outputs=<optimized out>, n_outputs@entry=0, mctx=mctx@entry=0x0, split_only=split_only@entry=true) at /home/theo/llama-quant/llama.cpp/src/llama-context.cpp:1398
1398        auto * gf = model.build_graph(gparams);
#11 0x00007f794d9b2245 in llama_context::llama_context (this=0x55f22ea0e940, model=..., params=...) at /usr/include/c++/15/bits/unique_ptr.h:471
471           get() const noexcept
#12 0x00007f794d9b27ac in llama_init_from_model (model=0x55f22d9f98d0, params=...) at /home/theo/llama-quant/llama.cpp/src/llama-context.cpp:2328
2328            auto * ctx = new llama_context(*model, params);
#13 0x000055f21ae1c803 in common_init_from_params (params=...) at /home/theo/llama-quant/llama.cpp/common/common.cpp:913
913         llama_context * lctx = llama_init_from_model(model, cparams);
#14 0x000055f21ad0f4e1 in server_context::load_model (this=this@entry=0x7ffd9aa73310, params=...) at /home/theo/llama-quant/llama.cpp/tools/server/server.cpp:2022
2022            llama_init = common_init_from_params(params_base);
#15 0x000055f21acb303f in main (argc=<optimized out>, argv=<optimized out>) at /home/theo/llama-quant/llama.cpp/tools/server/server.cpp:5058
5058        if (!ctx_server.load_model(params)) {
[Inferior 1 (process 3738145) detached]
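
For readers unfamiliar with the assertion: the backtrace stops in ggml_mul_mat, called from build_pooling as ggml_mul_mat(ctx0, ggml_cont(ctx0, ggml_transpose(ctx0, inp)), inp_mean). The check that fails compares only tensor shapes. Below is a minimal standalone sketch of that shape condition, assuming it matches the usual ggml rule (equal first dimension, broadcastable outer dimensions). The names shape4, can_mul_mat and transposed are hypothetical helpers, not ggml API, and the shapes in the usage note are illustrative only; the log does not show the actual tensor shapes at the time of the crash.

```cpp
#include <array>
#include <cstdint>

// Hypothetical stand-in for the shape part of a ggml tensor: element i is the
// number of entries along dimension i (ggml tensors have up to 4 dimensions).
using shape4 = std::array<int64_t, 4>;

// Sketch of the condition asserted in ggml_mul_mat (assumed, not copied from
// the source tree): both operands must share their first dimension, and b's
// outer dimensions must be multiples of a's so that a can be broadcast over b.
// As in a plain modulo check, a[2] and a[3] must be non-zero.
inline bool can_mul_mat(const shape4 & a, const shape4 & b) {
    return a[0] == b[0] && b[2] % a[2] == 0 && b[3] % a[3] == 0;
}

// Shape produced by transposing the first two dimensions, which is what the
// ggml_transpose + ggml_cont pair in build_pooling does to inp.
inline shape4 transposed(const shape4 & t) {
    return { t[1], t[0], t[2], t[3] };
}
```

As a usage example with made-up shapes: if inp were {768, 4, 1, 1} (embedding size by token count) and inp_mean were {4, 2, 1, 1}, then can_mul_mat(transposed(inp), inp_mean) holds. If the second dimension of inp did not match the first dimension of inp_mean, for instance {768, 0, 1, 1} against {1, 1, 1, 1}, the check fails, and the assertion would abort in the way shown in the log. Whether that is what happens here is not established by this report.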
