Closed
Labels: bug-unconfirmed, low severity, stale
Description
What happened?
I encountered an issue while loading a custom model in llama.cpp after converting it from PyTorch to GGUF format. The model ran inference successfully in PyTorch, but attempting to load the converted GGUF file in llama.cpp produced the following error:
llama_model_load: error loading model: check_tensor_dims: tensor 'blk.0.ffn_gate.weight' has wrong shape; expected 768, 2048, got 768, 3072, 1, 1
Here’s a summary of what I did:
- I created a custom model in PyTorch with the following configuration:
```json
{
  "vocab_size": 32000,
  "hidden_size": 768,
  "intermediate_size": 3072,
  "num_hidden_layers": 48,
  "num_attention_heads": 16,
  "hidden_act": "silu",
  "max_position_embeddings": 4096,
  "initializer_range": 0.015606021841974151,
  "rms_norm_eps": 1e-07,
  "use_cache": true,
  "tie_word_embeddings": false,
  "attention_dropout": 0.1
}
```
The weights were randomly initialized.
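For reference, the model can be built roughly as follows. This is a minimal sketch, not the actual model_create.py (which isn't shown); it assumes the Hugging Face transformers LlamaConfig / LlamaForCausalLM classes, and "models/test" is a placeholder path:
```python
# Sketch only: assumes HF transformers; "models/test" is a placeholder path.
from transformers import LlamaConfig, LlamaForCausalLM

config = LlamaConfig(
    vocab_size=32000,
    hidden_size=768,
    intermediate_size=3072,
    num_hidden_layers=48,
    num_attention_heads=16,
    hidden_act="silu",
    max_position_embeddings=4096,
    initializer_range=0.015606021841974151,
    rms_norm_eps=1e-07,
    use_cache=True,
    tie_word_embeddings=False,
    attention_dropout=0.1,
)
model = LlamaForCausalLM(config)      # weights are randomly initialized
model.save_pretrained("models/test")  # writes config.json + weight files
```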
- I successfully ran inference in PyTorch using this custom model. Since the weights were randomly initialized, the output from PyTorch was gibberish, which was expected.
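The PyTorch-side check was along these lines (again a sketch; it assumes a tokenizer was saved alongside the model under the placeholder path models/test):
```python
import torch
from transformers import AutoTokenizer, LlamaForCausalLM

tok = AutoTokenizer.from_pretrained("models/test")
model = LlamaForCausalLM.from_pretrained("models/test")

inputs = tok("Once upon a time, in a land far away,", return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=40)
# With random weights the continuation is gibberish, as seen in the log below.
print(tok.decode(out[0], skip_special_tokens=True))
```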
- I converted the model to GGUF format using llama.cpp's conversion script.
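The exact command isn't shown above; at build 3511 the HF-to-GGUF converter in the llama.cpp tree is convert_hf_to_gguf.py, so the invocation was presumably something like this (paths are placeholders):
```sh
python convert_hf_to_gguf.py models/test \
    --outtype f32 \
    --outfile models/test/Weights.Pth-502M-F32.gguf
```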
- Upon loading the converted model in llama.cpp, the above error was thrown during the tensor shape check.
It seems the GGUF file is internally inconsistent: the metadata records llama.feed_forward_length = 2048 (and likewise head_count = 6, context_length = 2048, and rms_norm_eps = 1e-05, none of which match my PyTorch config), while the FFN tensors were written with the configured intermediate_size of 3072. llama.cpp therefore expects blk.0.ffn_gate.weight to have shape 768 x 2048 but finds 768 x 3072, which trips the shape check.
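To confirm which side is wrong, the metadata and tensor shapes the converter actually wrote can be inspected directly. Below is a minimal sketch using the gguf Python package from llama.cpp's gguf-py directory; the field-access details are my reading of that package and may vary between versions:
```python
from gguf import GGUFReader

reader = GGUFReader("models/test/Weights.Pth-502M-F32.gguf")

# Scalar KV fields keep their value at parts[data[0]].
for key in ("llama.embedding_length", "llama.feed_forward_length",
            "llama.attention.head_count", "llama.context_length"):
    field = reader.fields[key]
    print(key, "=", field.parts[field.data[0]][0])

# The shape the converter stored for the failing tensor.
for t in reader.tensors:
    if t.name == "blk.0.ffn_gate.weight":
        print(t.name, "shape =", list(t.shape))
```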
Name and Version
./llama-cli --version
version: 3511 (0d6fb52)
built with cc (Ubuntu 12.3.0-1ubuntu1~22.04) 12.3.0 for x86_64-linux-gnu
What operating system are you seeing the problem on?
No response
Relevant log output
morgen@morgen-LEGION:/home/data1/llm_agent/artifact/performance_testing$ python model_create.py -m test
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Generated Output:
Once upon a time, in a land far away,Ang подацима подацима подацима подацима подацима подацима подацима подацима подацима подацимаautorité подацима подацимаautoritéautoritéautoritéautoritéautoritéautoritéçoisautoritéautoritéautoritéautoritéautoritéautoritéautoritéautoritéautoritéautoritéautoritéautoritéautoritéautoritéautoritéautoritéçois
(/home/data1/llm_agent/empirical_env) morgen@morgen-LEGION:/home/data1/llm_agent/artifact/performance_testing$ python run.py -m test
Error executing command: warning: not compiled with GPU offload support, --gpu-layers option will be ignored
warning: see main README.md for information on enabling GPU BLAS support
Log start
main: build = 3511 (0d6fb52b)
main: built with cc (Ubuntu 12.3.0-1ubuntu1~22.04) 12.3.0 for x86_64-linux-gnu
main: seed = 1724916737
llama_model_loader: loaded meta data with 23 key-value pairs and 435 tensors from ../performance_testing/models/test/Weights.Pth-502M-F32.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = Weights.Pth
llama_model_loader: - kv 2: general.size_label str = 502M
llama_model_loader: - kv 3: llama.vocab_size u32 = 32000
llama_model_loader: - kv 4: llama.context_length u32 = 2048
llama_model_loader: - kv 5: llama.embedding_length u32 = 768
llama_model_loader: - kv 6: llama.block_count u32 = 48
llama_model_loader: - kv 7: llama.feed_forward_length u32 = 2048
llama_model_loader: - kv 8: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 9: llama.attention.head_count u32 = 6
llama_model_loader: - kv 10: llama.attention.head_count_kv u32 = 6
llama_model_loader: - kv 11: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 12: general.file_type u32 = 0
llama_model_loader: - kv 13: tokenizer.ggml.model str = llama
llama_model_loader: - kv 14: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv 15: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 16: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 17: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 18: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 19: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 20: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 21: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 22: tokenizer.chat_template str = {% if messages[0]['role'] == 'system'...
llama_model_loader: - type f32: 435 tensors
llm_load_vocab: special tokens cache size = 3
llm_load_vocab: token to piece cache size = 0.1684 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 2048
llm_load_print_meta: n_embd = 768
llm_load_print_meta: n_layer = 48
llm_load_print_meta: n_head = 6
llm_load_print_meta: n_head_kv = 6
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 768
llm_load_print_meta: n_embd_v_gqa = 768
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 2048
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 2048
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 34B
llm_load_print_meta: model ftype = all F32
llm_load_print_meta: model params = 502.21 M
llm_load_print_meta: model size = 1.87 GiB (32.00 BPW)
llm_load_print_meta: general.name = Weights.Pth
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_print_meta: max token length = 48
llm_load_tensors: ggml ctx size = 0.20 MiB
llama_model_load: error loading model: check_tensor_dims: tensor 'blk.0.ffn_gate.weight' has wrong shape; expected 768, 2048, got 768, 3072, 1, 1
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model '../performance_testing/models/test/Weights.Pth-502M-F32.gguf'
main: error: unable to load model
Error running model test on backend cpu
Error executing command: Log start
main: build = 3511 (0d6fb52b)
main: built with cc (Ubuntu 12.3.0-1ubuntu1~22.04) 12.3.0 for x86_64-linux-gnu
main: seed = 1724916737
[... llama_model_loader / llm_load_print_meta output identical to the CPU run above, omitted ...]
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size = 0.41 MiB
llama_model_load: error loading model: check_tensor_dims: tensor 'blk.0.ffn_gate.weight' has wrong shape; expected 768, 2048, got 768, 3072, 1, 1
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model '../performance_testing/models/test/Weights.Pth-502M-F32.gguf'
main: error: unable to load model
Error running model test on backend gpu
Error executing command: Log start
main: build = 3511 (0d6fb52b)
main: built with cc (Ubuntu 12.3.0-1ubuntu1~22.04) 12.3.0 for x86_64-linux-gnu
main: seed = 1724916738
[... llama_model_loader / llm_load_print_meta output identical to the CPU run above, omitted ...]
ggml_vulkan: Found 1 Vulkan devices:
Vulkan0: NVIDIA GeForce RTX 3060 (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32
llm_load_tensors: ggml ctx size = 0.41 MiB
llama_model_load: error loading model: check_tensor_dims: tensor 'blk.0.ffn_gate.weight' has wrong shape; expected 768, 2048, got 768, 3072, 1, 1
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model '../performance_testing/models/test/Weights.Pth-502M-F32.gguf'
main: error: unable to load model