Closed
Labels: bug-unconfirmed, low severity (used to report low severity bugs in llama.cpp, e.g. cosmetic issues, non-critical UI glitches), stale
Description
What happened?
I've already quantized a 2B variant of this model, and one of its instruct fine-tunes, on a subset of the same data (the first 1000 samples are the same, in the same order; the earlier runs used 34×1000 samples and these use 34×5000). Now, on the 7B model, I get an error. There are differences: my command now sets --save-frequency, --batch-size, and --ubatch-size to bring the hours down. (I have to admit I misread the -ngl flag as --ngl and left it off; I'm trying to add it now.)
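For reference, the retry I'm about to attempt would look roughly like this (a sketch, not yet run; -ngl 99 is my assumption for "offload all layers", and the other flags mirror the log below):

/Users/macdev/Downloads/build/bin/llama-imatrix \
  -m ./salamandra-7b_bf16.gguf \
  -f ./imatrix/oscar/imatrix-dataset.txt \
  -o ./imatrix/oscar/imatrix.dat \
  -ngl 99 \
  --save-frequency 356 --ubatch-size 4096 --batch-size 9192 --threads 15 \
  --ctx-size 8192 \
  --rope-freq-base 10000.0 \
  --top-p 0.95 \
  --temp 0 \
  --repeat-penalty 1.2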
Name and Version
llama-cli --version
version: 3912 (edc2656)
built with Apple clang version 16.0.0 (clang-1600.0.26.3) for arm64-apple-darwin24.0.0
What operating system are you seeing the problem on?
Mac
Relevant log output
/Users/macdev/Downloads/build/bin/llama-imatrix \
-m ./salamandra-7b_bf16.gguf \
-f ./imatrix/oscar/imatrix-dataset.txt \
-o ./imatrix/oscar/imatrix.dat \
--save-frequency 356 --ubatch-size 4096 --batch-size 9192 --threads 15 \
--ctx-size 8192 \
--rope-freq-base 10000.0 \
--top-p 0.95 \
--temp 0 \
--repeat-penalty 1.2
build: 3906 (7eee341b) with Apple clang version 15.0.0 (clang-1500.3.9.4) for arm64-apple-darwin23.6.0
llama_model_loader: loaded meta data with 29 key-value pairs and 291 tensors from ./salamandra-7b_bf16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.size_label str = 7.8B
llama_model_loader: - kv 3: general.license str = apache-2.0
llama_model_loader: - kv 4: general.tags arr[str,1] = ["text-generation"]
llama_model_loader: - kv 5: general.languages arr[str,36] = ["bg", "ca", "code", "cs", "cy", "da"...
llama_model_loader: - kv 6: llama.block_count u32 = 32
llama_model_loader: - kv 7: llama.context_length u32 = 8192
llama_model_loader: - kv 8: llama.embedding_length u32 = 4096
llama_model_loader: - kv 9: llama.feed_forward_length u32 = 11008
llama_model_loader: - kv 10: llama.attention.head_count u32 = 32
llama_model_loader: - kv 11: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 12: llama.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 13: llama.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 14: general.file_type u32 = 32
llama_model_loader: - kv 15: llama.vocab_size u32 = 256000
llama_model_loader: - kv 16: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 17: tokenizer.ggml.add_space_prefix bool = true
llama_model_loader: - kv 18: tokenizer.ggml.model str = llama
llama_model_loader: - kv 19: tokenizer.ggml.pre str = default
llama_model_loader: - kv 20: tokenizer.ggml.tokens arr[str,256000] = ["<unk>", "<s>", "</s>", "<pad>", "<|...
llama_model_loader: - kv 21: tokenizer.ggml.scores arr[f32,256000] = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv 22: tokenizer.ggml.token_type arr[i32,256000] = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv 23: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 24: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 25: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 26: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 27: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 28: general.quantization_version u32 = 2
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type bf16: 226 tensors
llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
llm_load_vocab: special tokens cache size = 104
llm_load_vocab: token to piece cache size = 1.8842 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 256000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 8192
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 11008
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 8192
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 8B
llm_load_print_meta: model ftype = BF16
llm_load_print_meta: model params = 7.77 B
llm_load_print_meta: model size = 14.47 GiB (16.00 BPW)
llm_load_print_meta: general.name = n/a
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 145 '<0x0A>'
llm_load_print_meta: EOT token = 5 '<|im_end|>'
llm_load_print_meta: EOG token = 2 '</s>'
llm_load_print_meta: EOG token = 5 '<|im_end|>'
llm_load_print_meta: max token length = 72
llm_load_tensors: ggml ctx size = 0.27 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: Metal buffer size = 12817.03 MiB
llm_load_tensors: CPU buffer size = 2000.00 MiB
...........................................................................
llama_new_context_with_model: n_ctx = 8192
llama_new_context_with_model: n_batch = 8192
llama_new_context_with_model: n_ubatch = 4096
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M3 Max
ggml_metal_init: picking default device: Apple M3 Max
ggml_metal_init: using embedded metal library
ggml_metal_init: GPU name: Apple M3 Max
ggml_metal_init: GPU family: MTLGPUFamilyApple9 (1009)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3 (5001)
ggml_metal_init: simdgroup reduction support = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 42949.67 MB
llama_kv_cache_init: Metal KV buffer size = 1024.00 MiB
llama_new_context_with_model: KV self size = 1024.00 MiB, K (f16): 512.00 MiB, V (f16): 512.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.98 MiB
llama_new_context_with_model: Metal compute buffer size = 4384.00 MiB
llama_new_context_with_model: CPU compute buffer size = 4000.00 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 451
system_info: n_threads = 15 (n_threads_batch = 15) / 16 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 1 | SVE = 0 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 1 | LLAMAFILE = 1 |
compute_imatrix: tokenizing the input ..
compute_imatrix: tokenization took 255156 ms
compute_imatrix: computing over 12457 chunks with batch_size 8192
compute_imatrix: 57.44 seconds per pass - ETA 198 hours 45.67 minutes
[1]6.6357,[2]6.2563,[3]7.9948,[4]8.4627,[5]9.6051,[6]9.3009,[7]8.6191,[8]8.4246,[9]8.4066,[10]8.3625,[11]8.6955,[12]8.7766,[13]8.8980,[14]9.1834,[15]8.9132,[16]9.2713,[17]9.5073,[18]9.7867,[19]9.6462,[20]9.0441,[21]8.9722,[22]8.8748,[23]9.0193,[24]9.0332,[25]8.6903,[26]8.7325,[27]8.7363,[28]8.7711,[29]8.9056,[30]9.0395,[31]8.9306,[32]9.0220,[33]8.7711,[34]8.7847,[35]8.7584,[36]8.6798,[37]8.5357,[38]8.5432,[39]8.5379,[40]8.5529,[41]8.4827,[42]8.5009,[43]8.3460,[44]8.2342,[45]8.2379,[46]8.2105,[47]8.0725,[48]8.1300,[49]8.1965,[50]8.2583,[51]8.3348,[52]8.3276,[53]8.3039,[54]8.3552,[55]8.3921,[56]8.4113,[57]8.4797,[58]8.2914,[59]8.1574,[60]7.9299,[61]7.6995,[62]7.6783,[63]7.6994,[64]7.7063,[65]7.7271,[66]7.7604,[67]7.8190,[68]7.8291,[69]7.7810,[70]7.6424,[71]7.7119,[72]7.7153,[73]7.7591,[74]7.8025,[75]7.8430,[76]7.7903,[77]7.8158,[78]7.8813,[79]7.8878,[80]7.9519,[81]8.0622,[82]8.0805,[83]8.0981,[84]7.9537,[85]7.8371,[86]7.7102,[87]7.5913,[88]7.4637,[89]7.3735,[90]7.3090,[91]7.1828,[92]7.1577,[93]7.1689,[94]7.2415,[95]7.1740,[96]7.1992,[97]7.1817,[98]7.2124,[99]7.2384,[100]7.2707,[101]7.2781,[102]7.2921,[103]7.3233,[104]7.3537,[105]7.4109,[106]7.4193,[107]7.4181,[108]7.4080,[109]7.4947,[110]7.4652,[111]7.4980,[112]7.5510,[113]7.6089,[114]7.6270,[115]7.6204,[116]7.6481,[117]7.6620,[118]7.6804,[119]7.7198,[120]7.7411,[121]7.7758,[122]7.7900,[123]7.8268,[124]7.8747,[125]7.9093,nan detected in blk.1.attn_output.weight