Embedding fails to run on vulkan backend #7130

Adriankhl · 2024-05-07T19:24:20Z

System information: Windows 11, cpu amd 7840u with 780m apu

Vulkan build: cmake .. -GNinja -DCMAKE_C_COMPILER=clang-cl -DCMAKE_CXX_COMPILER=clang-cl -DCMAKE_EXPORT_COMPILE_COMMANDS=1 -DLLAMA_VULKAN=1 -DLLAMA_NATIVE=OFF -DCMAKE_BUILD_TYPE=Release
CPU build: cmake .. -GNinja -DCMAKE_C_COMPILER=clang-cl -DCMAKE_CXX_COMPILER=clang-cl -DCMAKE_EXPORT_COMPILE_COMMANDS=1 -DLLAMA_NATIVE=OFF -DCMAKE_BUILD_TYPE=Release

Model: https://huggingface.co/second-state/All-MiniLM-L6-v2-Embedding-GGUF/tree/main

I think something is wrong with the support of embedding models.

Observations:

main runs fine on vulkan backend, with a normal LLM model such as llama 3
embedding works on CPU backend with embedding models such as All-MiniLM
embedding "works" on vulkan backend with a normal LLM model such as llama 3, though the output is not meaningful
embedding fails to run on vulkan backend with embedding models such as All-MiniLM, with the following log

main: build = 2794 (628b2991)
main: built with Clang 18.1.4 for
main: seed  = 1715115389
llama_model_loader: loaded meta data with 24 key-value pairs and 101 tensors from ..\..\..\models\all-MiniLM-L6-v2-Q5_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = bert
llama_model_loader: - kv   1:                               general.name str              = all-MiniLM-L6-v2
llama_model_loader: - kv   2:                           bert.block_count u32              = 6
llama_model_loader: - kv   3:                        bert.context_length u32              = 512
llama_model_loader: - kv   4:                      bert.embedding_length u32              = 384
llama_model_loader: - kv   5:                   bert.feed_forward_length u32              = 1536
llama_model_loader: - kv   6:                  bert.attention.head_count u32              = 12
llama_model_loader: - kv   7:          bert.attention.layer_norm_epsilon f32              = 0.000000
llama_model_loader: - kv   8:                          general.file_type u32              = 17
llama_model_loader: - kv   9:                      bert.attention.causal bool             = false
llama_model_loader: - kv  10:                          bert.pooling_type u32              = 1
llama_model_loader: - kv  11:            tokenizer.ggml.token_type_count u32              = 2
llama_model_loader: - kv  12:                tokenizer.ggml.bos_token_id u32              = 101
llama_model_loader: - kv  13:                tokenizer.ggml.eos_token_id u32              = 102
llama_model_loader: - kv  14:                       tokenizer.ggml.model str              = bert
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,30522]   = ["[PAD]", "[unused0]", "[unused1]", "...
llama_model_loader: - kv  16:                      tokenizer.ggml.scores arr[f32,30522]   = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv  17:                  tokenizer.ggml.token_type arr[i32,30522]   = [3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 100
llama_model_loader: - kv  19:          tokenizer.ggml.seperator_token_id u32              = 102
llama_model_loader: - kv  20:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  21:                tokenizer.ggml.cls_token_id u32              = 101
llama_model_loader: - kv  22:               tokenizer.ggml.mask_token_id u32              = 103
llama_model_loader: - kv  23:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   63 tensors
llama_model_loader: - type  f16:    1 tensors
llama_model_loader: - type q5_1:   28 tensors
llama_model_loader: - type q8_0:    3 tensors
llama_model_loader: - type q5_K:    4 tensors
llama_model_loader: - type q6_K:    2 tensors
llm_load_vocab: mismatch in special tokens definition ( 7104/30522 vs 5/30522 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = bert
llm_load_print_meta: vocab type       = WPM
llm_load_print_meta: n_vocab          = 30522
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 512
llm_load_print_meta: n_embd           = 384
llm_load_print_meta: n_head           = 12
llm_load_print_meta: n_head_kv        = 12
llm_load_print_meta: n_layer          = 6
llm_load_print_meta: n_rot            = 32
llm_load_print_meta: n_embd_head_k    = 32
llm_load_print_meta: n_embd_head_v    = 32
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 384
llm_load_print_meta: n_embd_v_gqa     = 384
llm_load_print_meta: f_norm_eps       = 1.0e-12
llm_load_print_meta: f_norm_rms_eps   = 0.0e+00
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 1536
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 0
llm_load_print_meta: pooling type     = 1
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 512
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 22M
llm_load_print_meta: model ftype      = Q5_K - Medium
llm_load_print_meta: model params     = 22.57 M
llm_load_print_meta: model size       = 19.99 MiB (7.43 BPW)
llm_load_print_meta: general.name     = all-MiniLM-L6-v2
llm_load_print_meta: BOS token        = 101 '[CLS]'
llm_load_print_meta: EOS token        = 102 '[SEP]'
llm_load_print_meta: UNK token        = 100 '[UNK]'
llm_load_print_meta: SEP token        = 102 '[SEP]'
llm_load_print_meta: PAD token        = 0 '[PAD]'
llm_load_print_meta: CLS token        = 101 '[CLS]'
llm_load_print_meta: MASK token       = 103 '[MASK]'
llm_load_print_meta: LF token         = 0 '[PAD]'
ggml_vulkan: Found 1 Vulkan devices:
Vulkan0: AMD Radeon(TM) 780M | uma: 1 | fp16: 1 | warp size: 64
llm_load_tensors: ggml ctx size =    0.05 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/7 layers to GPU
llm_load_tensors:        CPU buffer size =    19.99 MiB
............................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 2048
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =     4.50 MiB
llama_new_context_with_model: KV self size  =    4.50 MiB, K (f16):    2.25 MiB, V (f16):    2.25 MiB
WARNING: failed to allocate 0.00 MB of pinned memory
GGML_ASSERT: C:\Users\adriankhl\git\learn\llama.cpp\ggml-backend.c:100: base != NULL && "backend buffer base cannot be NULL"

Adriankhl · 2024-05-07T20:17:16Z

llama.cpp/llama.cpp

Line 11229 in b6aa670

float * output_base = (float *) ggml_backend_buffer_get_base(lctx.buf_output);

It happens right here

Adriankhl · 2024-05-07T20:33:23Z

Also reproducible using the exe from the release page

teleprint-me · 2024-05-07T20:45:26Z

Does the same issue happen with the server? Or is it just isolated to main?

Adriankhl · 2024-05-07T20:59:54Z

Does the same issue happen with the server? Or is it just isolated to main?

Same error when I run .\bin\server.exe -m ..\..\..\models\all-MiniLM-L6-v2-Q5_K_M.gguf --embedding

Adriankhl · 2024-05-13T05:04:53Z

Let me summarize the investigation so far

malloc 0 size issue:

With my OS and PC setup, the embedding computation always first tries to allocate a buffer of size 0 here:

llama.cpp/llama.cpp

Line 11222 in b6aa670

    
           lctx.buf_output = ggml_backend_buft_alloc_buffer(llama_default_buffer_type_cpu(true), new_size);

Because of size += TENSOR_ALIGNMENT, size is always bigger than 0 for the CPU backend (not sure whether this is the intended behaviour, though), so the CPU backend can always allocate a buffer successfully.

llama.cpp/ggml-backend.c

Lines 625 to 631 in b228aba

    
    GGML_CALL static ggml_backend_buffer_t ggml_backend_cpu_buffer_type_alloc_buffer(ggml_backend_buffer_type_t buft, size_t size) {
        size += TENSOR_ALIGNMENT;   // malloc may return an address that is not aligned
        void * data = malloc(size); // TODO: use GGML_ALIGNED_MALLOC (move to ggml-impl.h)
        if (data == NULL) {
            fprintf(stderr, "%s: failed to allocate buffer of size %zu\n", __func__, size);
            return NULL;
        }
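The effect of the padding step can be shown in isolation. This is only a minimal sketch: kAlignment is a hypothetical stand-in for TENSOR_ALIGNMENT (the value 32 is assumed for illustration), and padded_size and alloc_padded are made-up helper names, not ggml functions.

```cpp
#include <cstddef>
#include <cstdlib>

// Hypothetical stand-in for ggml's TENSOR_ALIGNMENT; the value is assumed.
constexpr std::size_t kAlignment = 32;

// Mirrors the padding step from the excerpt above: the request grows by the
// alignment before malloc is called, so malloc never sees a size of 0.
std::size_t padded_size(std::size_t requested) {
    return requested + kAlignment;
}

// Allocates a padded buffer; returns nullptr only if malloc itself fails.
// The caller releases the buffer with std::free.
void * alloc_padded(std::size_t requested) {
    return std::malloc(padded_size(requested));
}
```

With this padding, alloc_padded(0) asks malloc for 32 bytes rather than 0, which is why a zero-size request does not normally produce a null base pointer on the CPU path.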

For vulkan backend, ptr is still nullptr here after ggml_vk_host_malloc if size is 0.

llama.cpp/ggml-vulkan.cpp

Lines 6031 to 6043 in b228aba

    
    GGML_CALL static ggml_backend_buffer_t ggml_backend_vk_host_buffer_type_alloc_buffer(ggml_backend_buffer_type_t buft, size_t size) {
    #ifdef GGML_VULKAN_DEBUG
        std::cerr << "ggml_backend_vk_host_buffer_type_alloc_buffer(" << size << ")" << std::endl;
    #endif
        void * ptr = nullptr;
        try {
            ptr = ggml_vk_host_malloc(&vk_instance.contexts[0], size);
        } catch (vk::SystemError& e) {
            std::cerr << "ggml_vulkan: Failed to allocate pinned memory." << std::endl;
            std::cerr << "ggml_vulkan: " << e.what() << std::endl;
            // fallback to cpu buffer
            return ggml_backend_buft_alloc_buffer(ggml_backend_cpu_buffer_type(), size);
        }

Because ggml_vk_host_malloc returns without throwing an exception, the CPU fallback in the catch block is never taken, and the null pointer causes the assertion failure later on.

I can "fix" the issue above by throwing an exception so that it falls back to the CPU buffer:

        ptr = ggml_vk_host_malloc(&vk_instance.contexts[0], size);
        if (ptr == nullptr) {
            throw vk::InitializationFailedError("Null Pointer");
        }
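The same fallback idea can be sketched without any Vulkan dependency. Everything here is hypothetical: mock_host_malloc imitates the described behaviour of ggml_vk_host_malloc (no exception and a null pointer for size 0), mock_cpu_alloc imitates the padded CPU path, and std::runtime_error replaces vk::InitializationFailedError so the sketch compiles without the Vulkan headers.

```cpp
#include <cstddef>
#include <cstdlib>
#include <stdexcept>

// Hypothetical stand-in for ggml_vk_host_malloc: it reports no error for a
// zero-size request and simply returns nullptr.
void * mock_host_malloc(std::size_t size) {
    if (size == 0) {
        return nullptr;
    }
    return std::malloc(size);
}

// Hypothetical stand-in for the CPU buffer path, which pads the request so
// that even a zero-size request yields a valid pointer.
void * mock_cpu_alloc(std::size_t size) {
    constexpr std::size_t kAlignment = 32; // assumed value, for illustration
    return std::malloc(size + kAlignment);
}

struct Buffer {
    void * base;    // start of the allocation, released with std::free
    bool   is_host; // true if the mock "pinned" path was used
};

// Treats a null pointer from the host allocator as a failure and falls back
// to the CPU path, the same shape as the workaround above.
Buffer alloc_with_fallback(std::size_t size) {
    try {
        void * ptr = mock_host_malloc(size);
        if (ptr == nullptr) {
            throw std::runtime_error("host allocation returned a null pointer");
        }
        return {ptr, true};
    } catch (const std::runtime_error &) {
        return {mock_cpu_alloc(size), false};
    }
}
```

In this sketch, alloc_with_fallback(0) ends up on the mock CPU path with a non-null base, while a non-zero request stays on the mock host path. Whether falling back is the right fix for the real backend, as opposed to never requesting a zero-size buffer, is a separate question.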

Embedding works for a short prompt

.\bin\embedding.exe -m ..\..\..\models\mxbai-embed-large-v1.Q5_K_M.gguf --log-disable -p "Good weather`nI love cat"

Log 1

main: build = 2864 (cbf7589)
main: built with Clang 18.1.4 for
main: seed = 1715575791
llama_model_loader: loaded meta data with 24 key-value pairs and 101 tensors from ..\..\..\models\all-MiniLM-L6-v2-Q5_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = bert
llama_model_loader: - kv 1: general.name str = all-MiniLM-L6-v2
llama_model_loader: - kv 2: bert.block_count u32 = 6
llama_model_loader: - kv 3: bert.context_length u32 = 512
llama_model_loader: - kv 4: bert.embedding_length u32 = 384
llama_model_loader: - kv 5: bert.feed_forward_length u32 = 1536
llama_model_loader: - kv 6: bert.attention.head_count u32 = 12
llama_model_loader: - kv 7: bert.attention.layer_norm_epsilon f32 = 0.000000
llama_model_loader: - kv 8: general.file_type u32 = 17
llama_model_loader: - kv 9: bert.attention.causal bool = false
llama_model_loader: - kv 10: bert.pooling_type u32 = 1
llama_model_loader: - kv 11: tokenizer.ggml.token_type_count u32 = 2
llama_model_loader: - kv 12: tokenizer.ggml.bos_token_id u32 = 101
llama_model_loader: - kv 13: tokenizer.ggml.eos_token_id u32 = 102
llama_model_loader: - kv 14: tokenizer.ggml.model str = bert
llama_model_loader: - kv 15: tokenizer.ggml.tokens arr[str,30522] = ["[PAD]", "[unused0]", "[unused1]", "...
llama_model_loader: - kv 16: tokenizer.ggml.scores arr[f32,30522] = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv 17: tokenizer.ggml.token_type arr[i32,30522] = [3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 100
llama_model_loader: - kv 19: tokenizer.ggml.seperator_token_id u32 = 102
llama_model_loader: - kv 20: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 21: tokenizer.ggml.cls_token_id u32 = 101
llama_model_loader: - kv 22: tokenizer.ggml.mask_token_id u32 = 103
llama_model_loader: - kv 23: general.quantization_version u32 = 2
llama_model_loader: - type f32: 63 tensors
llama_model_loader: - type f16: 1 tensors
llama_model_loader: - type q5_1: 28 tensors
llama_model_loader: - type q8_0: 3 tensors
llama_model_loader: - type q5_K: 4 tensors
llama_model_loader: - type q6_K: 2 tensors
llm_load_vocab: mismatch in special tokens definition ( 7104/30522 vs 5/30522 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = bert
llm_load_print_meta: vocab type = WPM
llm_load_print_meta: n_vocab = 30522
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 512
llm_load_print_meta: n_embd = 384
llm_load_print_meta: n_head = 12
llm_load_print_meta: n_head_kv = 12
llm_load_print_meta: n_layer = 6
llm_load_print_meta: n_rot = 32
llm_load_print_meta: n_embd_head_k = 32
llm_load_print_meta: n_embd_head_v = 32
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 384
llm_load_print_meta: n_embd_v_gqa = 384
llm_load_print_meta: f_norm_eps = 1.0e-12
llm_load_print_meta: f_norm_rms_eps = 0.0e+00
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 1536
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 0
llm_load_print_meta: pooling type = 1
llm_load_print_meta: rope type = 2
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 512
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 22M
llm_load_print_meta: model ftype = Q5_K - Medium
llm_load_print_meta: model params = 22.57 M
llm_load_print_meta: model size = 19.99 MiB (7.43 BPW)
llm_load_print_meta: general.name = all-MiniLM-L6-v2
llm_load_print_meta: BOS token = 101 '[CLS]'
llm_load_print_meta: EOS token = 102 '[SEP]'
llm_load_print_meta: UNK token = 100 '[UNK]'
llm_load_print_meta: SEP token = 102 '[SEP]'
llm_load_print_meta: PAD token = 0 '[PAD]'
llm_load_print_meta: CLS token = 101 '[CLS]'
llm_load_print_meta: MASK token = 103 '[MASK]'
llm_load_print_meta: LF token = 0 '[PAD]'
ggml_vulkan: Found 1 Vulkan devices:
Vulkan0: AMD Radeon(TM) 780M | uma: 1 | fp16: 1 | warp size: 64
llm_load_tensors: ggml ctx size = 0.05 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/7 layers to GPU
llm_load_tensors: CPU buffer size = 19.99 MiB
............................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 2048
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 4.50 MiB
llama_new_context_with_model: KV self size = 4.50 MiB, K (f16): 2.25 MiB, V (f16): 2.25 MiB
WARNING: failed to allocate 0.00 MB of pinned memory
ggml_vulkan: Failed to allocate pinned memory.
ggml_vulkan: Null Pointer: ErrorInitializationFailed
llama_new_context_with_model: CPU output buffer size = 0.00 MiB
ggml_gallocr_reserve_n: reallocating Vulkan0 buffer from size 0.00 MiB to 16.86 MiB
ggml_gallocr_reserve_n: reallocating Vulkan_Host buffer from size 0.00 MiB to 3.50 MiB
llama_new_context_with_model: Vulkan0 compute buffer size = 16.86 MiB
llama_new_context_with_model: Vulkan_Host compute buffer size = 3.50 MiB
llama_new_context_with_model: graph nodes = 221
llama_new_context_with_model: graph splits = 100
ggml_gallocr_needs_realloc: graph has different number of nodes
ggml_gallocr_alloc_graph: cannot reallocate multi buffer graph automatically, call reserve
ggml_backend_sched_alloc_splits: failed to allocate graph, reserving

embedding 0: -0.078424 0.061774 0.122099 0.071252 -0.013703 -0.013969 0.057376 -0.043510 -0.059822 0.018061 0.005385 -0.043010 0.038214 -0.014732 0.027173 -0.001804
embedding 1: 0.005265 -0.016769 0.052540 -0.024372 -0.062103 -0.001837 0.098836 0.026607 0.044697 0.020890 -0.045096 -0.030395 -0.035944 0.049458 0.016966 -0.003935

cosine similarity matrix:

1.00 0.22
0.22 1.00
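For reference, each entry of a matrix like this is the cosine similarity between two embedding vectors. The helper below is only a minimal sketch with a made-up name, not the code the embedding example uses. The 0.22 above presumably cannot be reproduced from the printed values alone, since only 16 components per embedding are shown while the log reports n_embd = 384.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Cosine similarity between two equal-length vectors: dot(a, b) / (|a| * |b|).
// Returns 0.0 for mismatched, empty, or zero-norm inputs; that convention is
// a choice made for this sketch.
double cosine_similarity(const std::vector<double> & a, const std::vector<double> & b) {
    if (a.size() != b.size() || a.empty()) {
        return 0.0;
    }
    double dot = 0.0;
    double norm_a = 0.0;
    double norm_b = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        dot    += a[i] * b[i];
        norm_a += a[i] * a[i];
        norm_b += b[i] * b[i];
    }
    if (norm_a == 0.0 || norm_b == 0.0) {
        return 0.0;
    }
    return dot / (std::sqrt(norm_a) * std::sqrt(norm_b));
}
```

The diagonal of the matrix is 1.00 because every non-zero vector has cosine similarity 1 with itself, and the matrix is symmetric because the formula does not depend on the order of the two vectors.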

llama_print_timings: load time = 104.76 ms
llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: prompt eval time = 42.91 ms / 9 tokens ( 4.77 ms per token, 209.73 tokens per second)
llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: total time = 817.37 ms / 10 tokens

But it doesn't work for a longer prompt

.\bin\embedding.exe -m ..\..\..\models\all-MiniLM-L6-v2-Q5_K_M.gguf -p "Antibiotics are a type of medication used to treat bacterial infections. They work by either killing the bacteria or preventing them from reproducing, allowing the body's immune system to fight off the infection. Antibiotics are usually taken orally in the form of pills, capsules, or liquid solutions, or sometimes administered intravenously. They are not effective against viral infections, and using them inappropriately can lead to antibiotic resistance.`nI love cat"

In the debug build, an MSVC runtime error shows up: "Expression: can't dereference invalidated vector iterator". This error may not be specific to this case, though: I think I have also seen it when running a debug build of llama.cpp main.

Log 2

main: build = 2864 (cbf7589)
main: built with Clang 18.1.4 for
main: seed = 1715576013
llama_model_loader: loaded meta data with 24 key-value pairs and 101 tensors from ..\..\..\models\all-MiniLM-L6-v2-Q5_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = bert
llama_model_loader: - kv 1: general.name str = all-MiniLM-L6-v2
llama_model_loader: - kv 2: bert.block_count u32 = 6
llama_model_loader: - kv 3: bert.context_length u32 = 512
llama_model_loader: - kv 4: bert.embedding_length u32 = 384
llama_model_loader: - kv 5: bert.feed_forward_length u32 = 1536
llama_model_loader: - kv 6: bert.attention.head_count u32 = 12
llama_model_loader: - kv 7: bert.attention.layer_norm_epsilon f32 = 0.000000
llama_model_loader: - kv 8: general.file_type u32 = 17
llama_model_loader: - kv 9: bert.attention.causal bool = false
llama_model_loader: - kv 10: bert.pooling_type u32 = 1
llama_model_loader: - kv 11: tokenizer.ggml.token_type_count u32 = 2
llama_model_loader: - kv 12: tokenizer.ggml.bos_token_id u32 = 101
llama_model_loader: - kv 13: tokenizer.ggml.eos_token_id u32 = 102
llama_model_loader: - kv 14: tokenizer.ggml.model str = bert
llama_model_loader: - kv 15: tokenizer.ggml.tokens arr[str,30522] = ["[PAD]", "[unused0]", "[unused1]", "...
llama_model_loader: - kv 16: tokenizer.ggml.scores arr[f32,30522] = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv 17: tokenizer.ggml.token_type arr[i32,30522] = [3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 100
llama_model_loader: - kv 19: tokenizer.ggml.seperator_token_id u32 = 102
llama_model_loader: - kv 20: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 21: tokenizer.ggml.cls_token_id u32 = 101
llama_model_loader: - kv 22: tokenizer.ggml.mask_token_id u32 = 103
llama_model_loader: - kv 23: general.quantization_version u32 = 2
llama_model_loader: - type f32: 63 tensors
llama_model_loader: - type f16: 1 tensors
llama_model_loader: - type q5_1: 28 tensors
llama_model_loader: - type q8_0: 3 tensors
llama_model_loader: - type q5_K: 4 tensors
llama_model_loader: - type q6_K: 2 tensors
llm_load_vocab: mismatch in special tokens definition ( 7104/30522 vs 5/30522 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = bert
llm_load_print_meta: vocab type = WPM
llm_load_print_meta: n_vocab = 30522
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 512
llm_load_print_meta: n_embd = 384
llm_load_print_meta: n_head = 12
llm_load_print_meta: n_head_kv = 12
llm_load_print_meta: n_layer = 6
llm_load_print_meta: n_rot = 32
llm_load_print_meta: n_embd_head_k = 32
llm_load_print_meta: n_embd_head_v = 32
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 384
llm_load_print_meta: n_embd_v_gqa = 384
llm_load_print_meta: f_norm_eps = 1.0e-12
llm_load_print_meta: f_norm_rms_eps = 0.0e+00
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 1536
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 0
llm_load_print_meta: pooling type = 1
llm_load_print_meta: rope type = 2
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 512
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 22M
llm_load_print_meta: model ftype = Q5_K - Medium
llm_load_print_meta: model params = 22.57 M
llm_load_print_meta: model size = 19.99 MiB (7.43 BPW)
llm_load_print_meta: general.name = all-MiniLM-L6-v2
llm_load_print_meta: BOS token = 101 '[CLS]'
llm_load_print_meta: EOS token = 102 '[SEP]'
llm_load_print_meta: UNK token = 100 '[UNK]'
llm_load_print_meta: SEP token = 102 '[SEP]'
llm_load_print_meta: PAD token = 0 '[PAD]'
llm_load_print_meta: CLS token = 101 '[CLS]'
llm_load_print_meta: MASK token = 103 '[MASK]'
llm_load_print_meta: LF token = 0 '[PAD]'
ggml_vulkan: Found 1 Vulkan devices:
Vulkan0: AMD Radeon(TM) 780M | uma: 1 | fp16: 1 | warp size: 64
llm_load_tensors: ggml ctx size = 0.05 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/7 layers to GPU
llm_load_tensors: CPU buffer size = 19.99 MiB
............................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 2048
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 4.50 MiB
llama_new_context_with_model: KV self size = 4.50 MiB, K (f16): 2.25 MiB, V (f16): 2.25 MiB
WARNING: failed to allocate 0.00 MB of pinned memory
ggml_vulkan: Failed to allocate pinned memory.
ggml_vulkan: Null Pointer: ErrorInitializationFailed
llama_new_context_with_model: CPU output buffer size = 0.00 MiB
ggml_gallocr_reserve_n: reallocating Vulkan0 buffer from size 0.00 MiB to 16.86 MiB
ggml_gallocr_reserve_n: reallocating Vulkan_Host buffer from size 0.00 MiB to 3.50 MiB
llama_new_context_with_model: Vulkan0 compute buffer size = 16.86 MiB
llama_new_context_with_model: Vulkan_Host compute buffer size = 3.50 MiB
llama_new_context_with_model: graph nodes = 221
llama_new_context_with_model: graph splits = 100
ggml_gallocr_needs_realloc: graph has different number of nodes
ggml_gallocr_alloc_graph: cannot reallocate multi buffer graph automatically, call reserve
ggml_backend_sched_alloc_splits: failed to allocate graph, reserving

For release build, here is the error on the terminal: GGML_ASSERT: C:\Users\adriankhl\git\develop\llama.cpp\ggml-vulkan.cpp:1913: src1_type == GGML_TYPE_F32

Log 3

main: build = 2864 (cbf7589)
main: built with Clang 18.1.4 for
main: seed = 1715576579
llama_model_loader: loaded meta data with 24 key-value pairs and 101 tensors from ..\..\..\models\all-MiniLM-L6-v2-Q5_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = bert
llama_model_loader: - kv 1: general.name str = all-MiniLM-L6-v2
llama_model_loader: - kv 2: bert.block_count u32 = 6
llama_model_loader: - kv 3: bert.context_length u32 = 512
llama_model_loader: - kv 4: bert.embedding_length u32 = 384
llama_model_loader: - kv 5: bert.feed_forward_length u32 = 1536
llama_model_loader: - kv 6: bert.attention.head_count u32 = 12
llama_model_loader: - kv 7: bert.attention.layer_norm_epsilon f32 = 0.000000
llama_model_loader: - kv 8: general.file_type u32 = 17
llama_model_loader: - kv 9: bert.attention.causal bool = false
llama_model_loader: - kv 10: bert.pooling_type u32 = 1
llama_model_loader: - kv 11: tokenizer.ggml.token_type_count u32 = 2
llama_model_loader: - kv 12: tokenizer.ggml.bos_token_id u32 = 101
llama_model_loader: - kv 13: tokenizer.ggml.eos_token_id u32 = 102
llama_model_loader: - kv 14: tokenizer.ggml.model str = bert
llama_model_loader: - kv 15: tokenizer.ggml.tokens arr[str,30522] = ["[PAD]", "[unused0]", "[unused1]", "...
llama_model_loader: - kv 16: tokenizer.ggml.scores arr[f32,30522] = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv 17: tokenizer.ggml.token_type arr[i32,30522] = [3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 100
llama_model_loader: - kv 19: tokenizer.ggml.seperator_token_id u32 = 102
llama_model_loader: - kv 20: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 21: tokenizer.ggml.cls_token_id u32 = 101
llama_model_loader: - kv 22: tokenizer.ggml.mask_token_id u32 = 103
llama_model_loader: - kv 23: general.quantization_version u32 = 2
llama_model_loader: - type f32: 63 tensors
llama_model_loader: - type f16: 1 tensors
llama_model_loader: - type q5_1: 28 tensors
llama_model_loader: - type q8_0: 3 tensors
llama_model_loader: - type q5_K: 4 tensors
llama_model_loader: - type q6_K: 2 tensors
llm_load_vocab: mismatch in special tokens definition ( 7104/30522 vs 5/30522 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = bert
llm_load_print_meta: vocab type = WPM
llm_load_print_meta: n_vocab = 30522
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 512
llm_load_print_meta: n_embd = 384
llm_load_print_meta: n_head = 12
llm_load_print_meta: n_head_kv = 12
llm_load_print_meta: n_layer = 6
llm_load_print_meta: n_rot = 32
llm_load_print_meta: n_embd_head_k = 32
llm_load_print_meta: n_embd_head_v = 32
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 384
llm_load_print_meta: n_embd_v_gqa = 384
llm_load_print_meta: f_norm_eps = 1.0e-12
llm_load_print_meta: f_norm_rms_eps = 0.0e+00
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 1536
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 0
llm_load_print_meta: pooling type = 1
llm_load_print_meta: rope type = 2
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 512
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 22M
llm_load_print_meta: model ftype = Q5_K - Medium
llm_load_print_meta: model params = 22.57 M
llm_load_print_meta: model size = 19.99 MiB (7.43 BPW)
llm_load_print_meta: general.name = all-MiniLM-L6-v2
llm_load_print_meta: BOS token = 101 '[CLS]'
llm_load_print_meta: EOS token = 102 '[SEP]'
llm_load_print_meta: UNK token = 100 '[UNK]'
llm_load_print_meta: SEP token = 102 '[SEP]'
llm_load_print_meta: PAD token = 0 '[PAD]'
llm_load_print_meta: CLS token = 101 '[CLS]'
llm_load_print_meta: MASK token = 103 '[MASK]'
llm_load_print_meta: LF token = 0 '[PAD]'
ggml_vulkan: Found 1 Vulkan devices:
Vulkan0: AMD Radeon(TM) 780M | uma: 1 | fp16: 1 | warp size: 64
llm_load_tensors: ggml ctx size = 0.05 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/7 layers to GPU
llm_load_tensors: CPU buffer size = 19.99 MiB
............................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 2048
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 4.50 MiB
llama_new_context_with_model: KV self size = 4.50 MiB, K (f16): 2.25 MiB, V (f16): 2.25 MiB
WARNING: failed to allocate 0.00 MB of pinned memory
ggml_vulkan: Failed to allocate pinned memory.
ggml_vulkan: Null Pointer: ErrorInitializationFailed
llama_new_context_with_model: CPU output buffer size = 0.00 MiB
llama_new_context_with_model: Vulkan0 compute buffer size = 16.86 MiB
llama_new_context_with_model: Vulkan_Host compute buffer size = 3.50 MiB
llama_new_context_with_model: graph nodes = 221
llama_new_context_with_model: graph splits = 100

0cc4m · 2024-05-18T06:27:30Z

Thank you for the detailed report and the investigation and apologies for not getting back to you sooner. I'll look into it and let you know what I find.

0cc4m · 2024-05-18T08:22:28Z

@Adriankhl Can you check whether #7360 fixes your issues?

Adriankhl · 2024-05-18T13:02:27Z

@0cc4m Hi, if the prompt is long, I still get a similar VC++ error in the debug build. In the release build the run finishes, but it gives a NaN vector:

main: build = 2923 (8dbde1f0)
main: built with Clang 18.1.4 for
main: seed  = 1716037243
llama_model_loader: loaded meta data with 24 key-value pairs and 197 tensors from ..\..\..\models\all-MiniLM-L12-v2.Q5_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = bert
llama_model_loader: - kv   1:                               general.name str              = all-MiniLM-L12-v2
llama_model_loader: - kv   2:                           bert.block_count u32              = 12
llama_model_loader: - kv   3:                        bert.context_length u32              = 512
llama_model_loader: - kv   4:                      bert.embedding_length u32              = 384
llama_model_loader: - kv   5:                   bert.feed_forward_length u32              = 1536
llama_model_loader: - kv   6:                  bert.attention.head_count u32              = 12
llama_model_loader: - kv   7:          bert.attention.layer_norm_epsilon f32              = 0.000000
llama_model_loader: - kv   8:                          general.file_type u32              = 17
llama_model_loader: - kv   9:                      bert.attention.causal bool             = false
llama_model_loader: - kv  10:                          bert.pooling_type u32              = 1
llama_model_loader: - kv  11:            tokenizer.ggml.token_type_count u32              = 2
llama_model_loader: - kv  12:                tokenizer.ggml.bos_token_id u32              = 101
llama_model_loader: - kv  13:                tokenizer.ggml.eos_token_id u32              = 102
llama_model_loader: - kv  14:                       tokenizer.ggml.model str              = bert
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,30522]   = ["[PAD]", "[unused0]", "[unused1]", "...
llama_model_loader: - kv  16:                      tokenizer.ggml.scores arr[f32,30522]   = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv  17:                  tokenizer.ggml.token_type arr[i32,30522]   = [3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 100
llama_model_loader: - kv  19:          tokenizer.ggml.seperator_token_id u32              = 102
llama_model_loader: - kv  20:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  21:                tokenizer.ggml.cls_token_id u32              = 101
llama_model_loader: - kv  22:               tokenizer.ggml.mask_token_id u32              = 103
llama_model_loader: - kv  23:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  123 tensors
llama_model_loader: - type  f16:    1 tensors
llama_model_loader: - type q5_1:   54 tensors
llama_model_loader: - type q8_0:    7 tensors
llama_model_loader: - type q5_K:    6 tensors
llama_model_loader: - type q6_K:    6 tensors
llm_load_vocab: mismatch in special tokens definition ( 7104/30522 vs 5/30522 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = bert
llm_load_print_meta: vocab type       = WPM
llm_load_print_meta: n_vocab          = 30522
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 512
llm_load_print_meta: n_embd           = 384
llm_load_print_meta: n_head           = 12
llm_load_print_meta: n_head_kv        = 12
llm_load_print_meta: n_layer          = 12
llm_load_print_meta: n_rot            = 32
llm_load_print_meta: n_embd_head_k    = 32
llm_load_print_meta: n_embd_head_v    = 32
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 384
llm_load_print_meta: n_embd_v_gqa     = 384
llm_load_print_meta: f_norm_eps       = 1.0e-12
llm_load_print_meta: f_norm_rms_eps   = 0.0e+00
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 1536
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 0
llm_load_print_meta: pooling type     = 1
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 512
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 33M
llm_load_print_meta: model ftype      = Q5_K - Medium
llm_load_print_meta: model params     = 33.21 M
llm_load_print_meta: model size       = 27.96 MiB (7.06 BPW)
llm_load_print_meta: general.name     = all-MiniLM-L12-v2
llm_load_print_meta: BOS token        = 101 '[CLS]'
llm_load_print_meta: EOS token        = 102 '[SEP]'
llm_load_print_meta: UNK token        = 100 '[UNK]'
llm_load_print_meta: SEP token        = 102 '[SEP]'
llm_load_print_meta: PAD token        = 0 '[PAD]'
llm_load_print_meta: CLS token        = 101 '[CLS]'
llm_load_print_meta: MASK token       = 103 '[MASK]'
llm_load_print_meta: LF token         = 0 '[PAD]'
ggml_vulkan: Found 1 Vulkan devices:
Vulkan0: AMD Radeon(TM) 780M | uma: 1 | fp16: 1 | warp size: 64
llm_load_tensors: ggml ctx size =    0.09 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/13 layers to GPU
llm_load_tensors:        CPU buffer size =    27.96 MiB
..............................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 2048
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =     9.00 MiB
llama_new_context_with_model: KV self size  =    9.00 MiB, K (f16):    4.50 MiB, V (f16):    4.50 MiB
llama_new_context_with_model: Vulkan_Host  output buffer size =     0.00 MiB
llama_new_context_with_model:    Vulkan0 compute buffer size =    16.90 MiB
llama_new_context_with_model: Vulkan_Host compute buffer size =     3.50 MiB
llama_new_context_with_model: graph nodes  = 431
llama_new_context_with_model: graph splits = 196

system_info: n_threads = 8 / 16 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
batch_decode: n_tokens = 94, n_seq = 2

embedding 0: -nan(ind) -nan(ind) -nan(ind) -nan(ind) -nan(ind) -nan(ind) -nan(ind) -nan(ind) -nan(ind) -nan(ind) -nan(ind) -nan(ind) -nan(ind) -nan(ind) -nan(ind) -nan(ind)
embedding 1: -nan(ind) -nan(ind) -nan(ind) -nan(ind) -nan(ind) -nan(ind) -nan(ind) -nan(ind) -nan(ind) -nan(ind) -nan(ind) -nan(ind) -nan(ind) -nan(ind) -nan(ind) -nan(ind)

cosine similarity matrix:

-nan(ind) -nan(ind)
-nan(ind) -nan(ind)

llama_print_timings:        load time =     109.19 ms
llama_print_timings:      sample time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings: prompt eval time =     175.69 ms /    94 tokens (    1.87 ms per token,   535.03 tokens per second)
llama_print_timings:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time =     178.55 ms /    95 tokens

Adriankhl · 2024-05-18T13:04:38Z

Another interesting observation: if I set -ngl to a large value, like 30, I get a non-NaN vector, but the values look wrong:

.\bin\embedding.exe -m ..\..\..\models\all-MiniLM-L12-v2.Q5_K_M.gguf -p "Antibiotics are a type of medication used to treat bacterial infections. They work by either killing the bacteria or preventing them from reproducing, allowing the body's immune system to fight off the infection. Antibiotics are usually taken orally in the form of pills, capsules, or liquid solutions, or sometimes administered intravenously. They are not effective against viral infections, and using them inappropriately can lead to antibiotic resistance.`nI love cat" -ngl 15
main: build = 2923 (8dbde1f0)
main: built with Clang 18.1.4 for
main: seed  = 1716037401
llama_model_loader: loaded meta data with 24 key-value pairs and 197 tensors from ..\..\..\models\all-MiniLM-L12-v2.Q5_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = bert
llama_model_loader: - kv   1:                               general.name str              = all-MiniLM-L12-v2
llama_model_loader: - kv   2:                           bert.block_count u32              = 12
llama_model_loader: - kv   3:                        bert.context_length u32              = 512
llama_model_loader: - kv   4:                      bert.embedding_length u32              = 384
llama_model_loader: - kv   5:                   bert.feed_forward_length u32              = 1536
llama_model_loader: - kv   6:                  bert.attention.head_count u32              = 12
llama_model_loader: - kv   7:          bert.attention.layer_norm_epsilon f32              = 0.000000
llama_model_loader: - kv   8:                          general.file_type u32              = 17
llama_model_loader: - kv   9:                      bert.attention.causal bool             = false
llama_model_loader: - kv  10:                          bert.pooling_type u32              = 1
llama_model_loader: - kv  11:            tokenizer.ggml.token_type_count u32              = 2
llama_model_loader: - kv  12:                tokenizer.ggml.bos_token_id u32              = 101
llama_model_loader: - kv  13:                tokenizer.ggml.eos_token_id u32              = 102
llama_model_loader: - kv  14:                       tokenizer.ggml.model str              = bert
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,30522]   = ["[PAD]", "[unused0]", "[unused1]", "...
llama_model_loader: - kv  16:                      tokenizer.ggml.scores arr[f32,30522]   = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv  17:                  tokenizer.ggml.token_type arr[i32,30522]   = [3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 100
llama_model_loader: - kv  19:          tokenizer.ggml.seperator_token_id u32              = 102
llama_model_loader: - kv  20:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  21:                tokenizer.ggml.cls_token_id u32              = 101
llama_model_loader: - kv  22:               tokenizer.ggml.mask_token_id u32              = 103
llama_model_loader: - kv  23:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  123 tensors
llama_model_loader: - type  f16:    1 tensors
llama_model_loader: - type q5_1:   54 tensors
llama_model_loader: - type q8_0:    7 tensors
llama_model_loader: - type q5_K:    6 tensors
llama_model_loader: - type q6_K:    6 tensors
llm_load_vocab: mismatch in special tokens definition ( 7104/30522 vs 5/30522 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = bert
llm_load_print_meta: vocab type       = WPM
llm_load_print_meta: n_vocab          = 30522
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 512
llm_load_print_meta: n_embd           = 384
llm_load_print_meta: n_head           = 12
llm_load_print_meta: n_head_kv        = 12
llm_load_print_meta: n_layer          = 12
llm_load_print_meta: n_rot            = 32
llm_load_print_meta: n_embd_head_k    = 32
llm_load_print_meta: n_embd_head_v    = 32
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 384
llm_load_print_meta: n_embd_v_gqa     = 384
llm_load_print_meta: f_norm_eps       = 1.0e-12
llm_load_print_meta: f_norm_rms_eps   = 0.0e+00
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 1536
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 0
llm_load_print_meta: pooling type     = 1
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 512
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 33M
llm_load_print_meta: model ftype      = Q5_K - Medium
llm_load_print_meta: model params     = 33.21 M
llm_load_print_meta: model size       = 27.96 MiB (7.06 BPW)
llm_load_print_meta: general.name     = all-MiniLM-L12-v2
llm_load_print_meta: BOS token        = 101 '[CLS]'
llm_load_print_meta: EOS token        = 102 '[SEP]'
llm_load_print_meta: UNK token        = 100 '[UNK]'
llm_load_print_meta: SEP token        = 102 '[SEP]'
llm_load_print_meta: PAD token        = 0 '[PAD]'
llm_load_print_meta: CLS token        = 101 '[CLS]'
llm_load_print_meta: MASK token       = 103 '[MASK]'
llm_load_print_meta: LF token         = 0 '[PAD]'
ggml_vulkan: Found 1 Vulkan devices:
Vulkan0: AMD Radeon(TM) 780M | uma: 1 | fp16: 1 | warp size: 64
llm_load_tensors: ggml ctx size =    0.18 MiB
llm_load_tensors: offloading 12 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 13/13 layers to GPU
llm_load_tensors:        CPU buffer size =    12.25 MiB
llm_load_tensors:    Vulkan0 buffer size =    15.71 MiB
..............................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 2048
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:    Vulkan0 KV buffer size =     9.00 MiB
llama_new_context_with_model: KV self size  =    9.00 MiB, K (f16):    4.50 MiB, V (f16):    4.50 MiB
llama_new_context_with_model: Vulkan_Host  output buffer size =     0.00 MiB
llama_new_context_with_model:    Vulkan0 compute buffer size =    17.00 MiB
llama_new_context_with_model: Vulkan_Host compute buffer size =     3.50 MiB
llama_new_context_with_model: graph nodes  = 431
llama_new_context_with_model: graph splits = 2

system_info: n_threads = 8 / 16 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
batch_decode: n_tokens = 94, n_seq = 2

embedding 0: -0.012196 -0.004382 -0.068307 -0.037080 -0.011837 -0.000040  0.017563  0.056701  0.020313  0.024539  0.021325  0.052445 -0.015451  0.103782 -0.079035 -0.015415
embedding 1:  0.007400 -0.090975  0.050916 -0.027982 -0.098207 -0.004653  0.129955  0.098967  0.052596  0.070817 -0.015492 -0.080207  0.057286 -0.007871 -0.026050  0.015976

cosine similarity matrix:

  1.00  -0.09
 -0.09   1.00

llama_print_timings:        load time =     177.59 ms
llama_print_timings:      sample time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings: prompt eval time =      77.88 ms /    94 tokens (    0.83 ms per token,  1207.05 tokens per second)
llama_print_timings:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time =      81.94 ms /    95 tokens

0cc4m · 2024-05-18T20:02:27Z

I can see that NaN error; it only happens when no layers are offloaded. Otherwise it seems to work fine.

The NaNs only happen on certain hardware and are caused by some clean-up issue that shows up in the Vulkan validation layer. I'll try to fix that soon.

0cc4m · 2024-05-19T07:56:47Z

@Adriankhl I fixed the NaN issue on my end. Can you try running #7360 again?

Adriankhl · 2024-05-19T12:16:08Z

@0cc4m It seems to be working fine 🎊 I will do a bit more testing later on.

One additional problem: I have figured out the cause of the debug build error. It happens here:

llama.cpp/ggml-vulkan.cpp

Lines 625 to 646 in e23b974

    
    static void ggml_vk_submit(vk_context * ctx, vk::Fence fence) {
    #ifdef GGML_VULKAN_DEBUG
        std::cerr << "ggml_vk_submit(" << ctx->seqs.size() << ", " << fence << ")" << std::endl;
    #endif
        if (ctx->seqs.empty()) {
            return;
        }
        std::vector<std::vector<uint64_t>> tl_wait_vals;
        std::vector<std::vector<uint64_t>> tl_signal_vals;
        std::vector<std::vector<vk::Semaphore>> tl_wait_semaphores;
        std::vector<std::vector<vk::Semaphore>> tl_signal_semaphores;
        std::vector<vk::TimelineSemaphoreSubmitInfo> tl_submit_infos;
        std::vector<vk::SubmitInfo> submit_infos;
        int idx = -1;
        std::vector<std::vector<vk::PipelineStageFlags>> stage_flags;
        size_t reserve = 0;
        for (const auto& sequence : ctx->seqs) {
            reserve += sequence.size();
        }

Because of an MSVC bug, the vector size is detected incorrectly in a debug build: even when ctx->seqs has size 1, MSVC's iterator debugging feature reads the size as 0 and throws an exception. Can you add add_definitions(-D_ITERATOR_DEBUG_LEVEL=0) for MSVC builds in the CMake file to fix this issue?

0cc4m · 2024-05-19T13:36:03Z

> @0cc4m It seems to be working fine 🎊 I will do a bit more testing later on.

Thank you for checking!

> One additional problem: I have figured out the cause of the debug build error. It happens here:
>
> llama.cpp/ggml-vulkan.cpp
>
> Lines 625 to 646 in e23b974
>
>     static void ggml_vk_submit(vk_context * ctx, vk::Fence fence) {
>     #ifdef GGML_VULKAN_DEBUG
>         std::cerr << "ggml_vk_submit(" << ctx->seqs.size() << ", " << fence << ")" << std::endl;
>     #endif
>         if (ctx->seqs.empty()) {
>             return;
>         }
>         std::vector<std::vector<uint64_t>> tl_wait_vals;
>         std::vector<std::vector<uint64_t>> tl_signal_vals;
>         std::vector<std::vector<vk::Semaphore>> tl_wait_semaphores;
>         std::vector<std::vector<vk::Semaphore>> tl_signal_semaphores;
>         std::vector<vk::TimelineSemaphoreSubmitInfo> tl_submit_infos;
>         std::vector<vk::SubmitInfo> submit_infos;
>         int idx = -1;
>         std::vector<std::vector<vk::PipelineStageFlags>> stage_flags;
>         size_t reserve = 0;
>         for (const auto& sequence : ctx->seqs) {
>             reserve += sequence.size();
>         }
>
> Because of an MSVC bug, the vector size is detected incorrectly in a debug build: even when ctx->seqs has size 1, MSVC's iterator debugging feature reads the size as 0 and throws an exception. Can you add add_definitions(-D_ITERATOR_DEBUG_LEVEL=0) for MSVC builds in the CMake file to fix this issue?

I can't, sorry. I don't use Windows, so I wouldn't be able to verify that, and it's outside the scope of my PR. If you think it's a useful addition you can open a separate PR for it.

Adriankhl · 2024-05-21T00:42:20Z

Thanks for this, and it also fixes the gibberish problem I encountered when the generated text exceeds the context size.

Adriankhl added the bug-unconfirmed label May 7, 2024

Adriankhl mentioned this issue May 12, 2024

Update and fix Vulkan soft_max and argsort implementations #7237

Merged

bitterspeed mentioned this issue May 15, 2024

LlamaCpp crash when embedding (in beta) withcatai/node-llama-cpp#211

Closed

0cc4m mentioned this issue May 18, 2024

Vulkan Embedding Fix #7360

Merged

0cc4m closed this as completed in #7360 May 19, 2024

Adriankhl mentioned this issue May 21, 2024

vulkan: fix clang-cl debug build #7426

Merged
