Closed
Description
Name and Version
llama.cpp latest master
Operating systems
Linux
GGML backends
CUDA
Hardware
CPU: Ryzen 9 5950x
GPU: RTX 4060 Ti 16GB + 3060 12GB (tested on the latter)
Models
https://huggingface.co/jinaai/jina-embeddings-v2-base-code at f16
Problem description & steps to reproduce
When launching the server with llama-server -m jina-embeddings-v2-base-code-f16.gguf -ts 0,1 -b 8192 -ub 8192 --embeddings, the server crashes on start with the following error:
ggml.c:3021: GGML_ASSERT(ggml_can_mul_mat(a, b)) failed
The -ts argument can be omitted on systems with a single GPU.
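For readers unfamiliar with the assertion, here is a minimal Python sketch of the shape precondition that ggml_can_mul_mat enforces, as I understand it from ggml.c: the first dimension (ne[0]) of both operands must match, and the two outer dimensions of b must be multiples of those of a. The helper names, the shape of inp_mean, and the "bad" shapes at the end are assumptions for illustration only; the log does not show the actual tensor shapes at the time of the crash, so this is not a diagnosis.

```python
# Sketch of the ggml_mul_mat(a, b) shape precondition, assuming ggml's
# convention of 4-element shapes where ne[0] is the innermost dimension.

def can_mul_mat(a_ne, b_ne):
    """Return True if shapes a_ne and b_ne would pass the mul_mat check."""
    return (
        a_ne[0] == b_ne[0]           # contracted dimension must match
        and b_ne[2] % a_ne[2] == 0   # a must be broadcastable over b in dim 2
        and b_ne[3] % a_ne[3] == 0   # a must be broadcastable over b in dim 3
    )

def transpose(ne):
    """Shape after swapping the first two dimensions (as ggml_transpose does)."""
    return (ne[1], ne[0], ne[2], ne[3])

n_embd = 768   # from the log: n_embd = 768
n_tokens = 4   # hypothetical token count
n_seqs = 1     # hypothetical sequence count

# Mean pooling as in the backtrace: mul_mat(transpose(inp), inp_mean).
inp = (n_embd, n_tokens, 1, 1)        # assumed shape of the embeddings
inp_mean = (n_tokens, n_seqs, 1, 1)   # assumed shape of the averaging matrix
print(can_mul_mat(transpose(inp), inp_mean))        # True

# Hypothetical mismatch: the two operands disagree on the token dimension.
inp_bad = (n_embd, 0, 1, 1)
inp_mean_one = (1, 1, 1, 1)
print(can_mul_mat(transpose(inp_bad), inp_mean_one))  # False
```

Any disagreement between the token dimension of the pooled input and that of the averaging matrix would fail the check in this way; which of the two is wrong in this crash would need to be confirmed in a debugger.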
First Bad Commit
Introduced by commit 9777032 (#15696).
Relevant log output
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 4060 Ti, compute capability 8.9, VMM: yes
Device 1: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
build: 6334 (9777032dc) with cc (Debian 15.2.0-4) 15.2.0 for x86_64-linux-gnu
system info: n_threads = 16, n_threads_batch = 16, total_threads = 32
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860,890 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
main: binding port with default address family
main: HTTP server is listening, hostname: 127.0.0.1, port: 8081, http threads: 31
main: loading model
srv load_model: loading model 'jina-embeddings-v2-base-code.f16.gguf'
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 4060 Ti) - 14856 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3060) - 11822 MiB free
llama_model_loader: loaded meta data with 36 key-value pairs and 268 tensors from jina-embeddings-v2-base-code.f16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = jina-bert-v2
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Jina Bert v2 Qk Post Norm
llama_model_loader: - kv 3: general.organization str = Jinaai
llama_model_loader: - kv 4: general.size_label str = 160M
llama_model_loader: - kv 5: general.license str = apache-2.0
llama_model_loader: - kv 6: general.dataset.count u32 = 1
llama_model_loader: - kv 7: general.dataset.0.name str = C4
llama_model_loader: - kv 8: general.dataset.0.organization str = Allenai
llama_model_loader: - kv 9: general.dataset.0.repo_url str = https://huggingface.co/allenai/c4
llama_model_loader: - kv 10: general.tags arr[str,6] = ["sentence-transformers", "feature-ex...
llama_model_loader: - kv 11: general.languages arr[str,1] = ["en"]
llama_model_loader: - kv 12: jina-bert-v2.block_count u32 = 12
llama_model_loader: - kv 13: jina-bert-v2.context_length u32 = 8192
llama_model_loader: - kv 14: jina-bert-v2.embedding_length u32 = 768
llama_model_loader: - kv 15: jina-bert-v2.feed_forward_length u32 = 3072
llama_model_loader: - kv 16: jina-bert-v2.attention.head_count u32 = 12
llama_model_loader: - kv 17: jina-bert-v2.attention.layer_norm_epsilon f32 = 0.000000
llama_model_loader: - kv 18: general.file_type u32 = 1
llama_model_loader: - kv 19: jina-bert-v2.attention.causal bool = false
llama_model_loader: - kv 20: jina-bert-v2.pooling_type u32 = 1
llama_model_loader: - kv 21: general.quantization_version u32 = 2
llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 23: tokenizer.ggml.pre str = jina-v2-code
llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,61056] = ["<s>", "<pad>", "</s>", "<unk>", "<m...
llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,61056] = [3, 3, 3, 3, 3, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,60795] = ["Ġ Ġ", "ĠĠ ĠĠ", "ĠĠĠĠ ĠĠ...
llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 29: tokenizer.ggml.unknown_token_id u32 = 3
llama_model_loader: - kv 30: tokenizer.ggml.seperator_token_id u32 = 2
llama_model_loader: - kv 31: tokenizer.ggml.padding_token_id u32 = 1
llama_model_loader: - kv 32: tokenizer.ggml.mask_token_id u32 = 4
llama_model_loader: - kv 33: tokenizer.ggml.token_type_count u32 = 2
llama_model_loader: - kv 34: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 35: tokenizer.ggml.add_eos_token bool = true
llama_model_loader: - type f32: 183 tensors
llama_model_loader: - type f16: 85 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = F16
print_info: file size = 305.98 MiB (16.01 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load: - 2 ('</s>')
load: special tokens cache size = 5
load: token to piece cache size = 0.3623 MB
print_info: arch = jina-bert-v2
print_info: vocab_only = 0
print_info: n_ctx_train = 8192
print_info: n_embd = 768
print_info: n_layer = 12
print_info: n_head = 12
print_info: n_head_kv = 12
print_info: n_rot = 64
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 64
print_info: n_embd_head_v = 64
print_info: n_gqa = 1
print_info: n_embd_k_gqa = 768
print_info: n_embd_v_gqa = 768
print_info: f_norm_eps = 1.0e-12
print_info: f_norm_rms_eps = 0.0e+00
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 8.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 3072
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 0
print_info: pooling type = 1
print_info: rope type = -1
print_info: rope scaling = linear
print_info: freq_base_train = 10000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 8192
print_info: rope_finetuned = unknown
print_info: model type = 137M
print_info: model params = 160.28 M
print_info: general.name = Jina Bert v2 Qk Post Norm
print_info: vocab type = BPE
print_info: n_vocab = 61056
print_info: n_merges = 60795
print_info: BOS token = 0 '<s>'
print_info: EOS token = 2 '</s>'
print_info: UNK token = 3 '<unk>'
print_info: SEP token = 2 '</s>'
print_info: PAD token = 1 '<pad>'
print_info: MASK token = 4 '<mask>'
print_info: LF token = 203 'Ċ'
print_info: EOG token = 2 '</s>'
print_info: max token length = 512
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 12 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 13/13 layers to GPU
load_tensors: CUDA1 model buffer size = 216.53 MiB
load_tensors: CPU_Mapped model buffer size = 89.45 MiB
........................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 8192
llama_context: n_ctx_per_seq = 8192
llama_context: n_batch = 2048
llama_context: n_ubatch = 2048
llama_context: causal_attn = 0
llama_context: flash_attn = auto
llama_context: kv_unified = false
llama_context: freq_base = 10000.0
llama_context: freq_scale = 1
llama_context: CUDA_Host output buffer size = 0.24 MiB
/home/theo/llama-quant/llama.cpp/ggml/src/ggml.c:3021: GGML_ASSERT(ggml_can_mul_mat(a, b)) failed
[New LWP 3738220]
[New LWP 3738219]
[New LWP 3738218]
[New LWP 3738217]
[New LWP 3738216]
[New LWP 3738215]
[New LWP 3738214]
[New LWP 3738213]
[New LWP 3738212]
[New LWP 3738211]
[New LWP 3738210]
[New LWP 3738209]
[New LWP 3738208]
[New LWP 3738207]
[New LWP 3738206]
[New LWP 3738205]
[New LWP 3738204]
[New LWP 3738203]
[New LWP 3738202]
[New LWP 3738201]
[New LWP 3738200]
[New LWP 3738199]
[New LWP 3738198]
[New LWP 3738197]
[New LWP 3738196]
[New LWP 3738195]
[New LWP 3738194]
[New LWP 3738193]
[New LWP 3738192]
[New LWP 3738191]
[New LWP 3738190]
[New LWP 3738189]
[New LWP 3738188]
[New LWP 3738187]
[New LWP 3738186]
[New LWP 3738185]
[New LWP 3738184]
[New LWP 3738177]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
__syscall_cancel_arch () at ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S:56
warning: 56 ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S: No such file or directory
#0 __syscall_cancel_arch () at ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S:56
56 in ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S
#1 0x00007f794d37c668 in __internal_syscall_cancel (a1=a1@entry=3738221, a2=a2@entry=0, a3=a3@entry=0, a4=a4@entry=0, a5=a5@entry=0, a6=a6@entry=0, nr=61) at ./nptl/cancellation.c:49
warning: 49 ./nptl/cancellation.c: No such file or directory
#2 0x00007f794d37c6ad in __syscall_cancel (a1=a1@entry=3738221, a2=a2@entry=0, a3=a3@entry=0, a4=a4@entry=0, a5=a5@entry=0, a6=a6@entry=0, nr=61) at ./nptl/cancellation.c:75
75 in ./nptl/cancellation.c
#3 0x00007f794d3e7787 in __GI___wait4 (pid=pid@entry=3738221, stat_loc=stat_loc@entry=0x0, options=options@entry=0, usage=usage@entry=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
warning: 30 ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory
#4 0x00007f794d3e77b7 in __GI___waitpid (pid=pid@entry=3738221, stat_loc=stat_loc@entry=0x0, options=options@entry=0) at ./posix/waitpid.c:38
warning: 38 ./posix/waitpid.c: No such file or directory
#5 0x00007f794d8b5db3 in ggml_print_backtrace () at /home/theo/llama-quant/llama.cpp/ggml/src/ggml.c:196
196 waitpid(child_pid, NULL, 0);
#6 0x00007f794d8b5eff in ggml_abort (file=file@entry=0x7f794d8fa4b8 "/home/theo/llama-quant/llama.cpp/ggml/src/ggml.c", line=line@entry=3021, fmt=fmt@entry=0x7f794d8f8093 "GGML_ASSERT(%s) failed") at /home/theo/llama-quant/llama.cpp/ggml/src/ggml.c:230
230 ggml_print_backtrace();
#7 0x00007f794d8b93b3 in ggml_mul_mat (ctx=<optimized out>, a=<optimized out>, b=<optimized out>) at /home/theo/llama-quant/llama.cpp/ggml/src/ggml.c:3021
3021 GGML_ASSERT(ggml_can_mul_mat(a, b));
#8 0x00007f794d9d9830 in llm_graph_context::build_pooling (this=0x55f22f576c90, cls=<optimized out>, cls_b=<optimized out>, cls_out=<optimized out>, cls_out_b=<optimized out>) at /home/theo/llama-quant/llama.cpp/src/llama-graph.cpp:1823
1823 cur = ggml_mul_mat(ctx0, ggml_cont(ctx0, ggml_transpose(ctx0, inp)), inp_mean);
#9 0x00007f794da0c474 in llama_model::build_graph (this=0x55f22d9f98d0, params=...) at /home/theo/llama-quant/llama.cpp/src/llama-model.cpp:18983
18983 llm->build_pooling(cls, cls_b, cls_out, cls_out_b);
#10 0x00007f794d9af103 in llama_context::graph_reserve (this=this@entry=0x55f22ea0e940, n_tokens=n_tokens@entry=1, n_seqs=n_seqs@entry=1, n_outputs=<optimized out>, n_outputs@entry=0, mctx=mctx@entry=0x0, split_only=split_only@entry=true) at /home/theo/llama-quant/llama.cpp/src/llama-context.cpp:1398
1398 auto * gf = model.build_graph(gparams);
#11 0x00007f794d9b2245 in llama_context::llama_context (this=0x55f22ea0e940, model=..., params=...) at /usr/include/c++/15/bits/unique_ptr.h:471
471 get() const noexcept
#12 0x00007f794d9b27ac in llama_init_from_model (model=0x55f22d9f98d0, params=...) at /home/theo/llama-quant/llama.cpp/src/llama-context.cpp:2328
2328 auto * ctx = new llama_context(*model, params);
#13 0x000055f21ae1c803 in common_init_from_params (params=...) at /home/theo/llama-quant/llama.cpp/common/common.cpp:913
913 llama_context * lctx = llama_init_from_model(model, cparams);
#14 0x000055f21ad0f4e1 in server_context::load_model (this=this@entry=0x7ffd9aa73310, params=...) at /home/theo/llama-quant/llama.cpp/tools/server/server.cpp:2022
2022 llama_init = common_init_from_params(params_base);
#15 0x000055f21acb303f in main (argc=<optimized out>, argv=<optimized out>) at /home/theo/llama-quant/llama.cpp/tools/server/server.cpp:5058
5058 if (!ctx_server.load_model(params)) {
[Inferior 1 (process 3738145) detached]