What happened?
I have not modified any code; this is the original source, compiled and run as-is.
The finetune example crashes every time, so this is a reproducible crash bug.
I can't solve this problem myself, so I'm asking for help. Thanks.
Name and Version
main --version
version: 0 (unknown)
built with MSVC 19.29.30154.0 for x64
ggml-quants.c 2024-04-30 14:36
What operating system are you seeing the problem on?
Windows
Relevant log output
Without modifying any code, run the finetune example to fine-tune a model:
finetune -ngl 20000 --model-base "I:\JYGAIBIN\MetaLlamaModel\Llama2-13b-chat\ggml-model-f32_q4_1.gguf" --checkpoint-in "I:\JYGAIBIN\MetaLlamaModel\Llama2-13b-chat\ggml-model-f32_q4_1-LATEST.gguf" --checkpoint-out "I:\JYGAIBIN\MetaLlamaModel\Llama2-13b-chat\ggml-model-f32_q4_1-ITERATION.gguf" --lora-out "I:\JYGAIBIN\MetaLlamaModel\Llama2-13b-chat\ggml-model-f32_q4_1-LORA.bin" --train-data "I:\JYGAIBIN\MetaLlamaModel\Llama2-13b-chat\shakespeare.txt" --save-every 10 --threads 6 --adam-iter 30 --batch 4 --ctx 64 --use-checkpointing
========================================
main: seed: 1716876323
main: model base = 'I:\JYGAIBIN\MetaLlamaModel\Llama2-13b-chat\ggml-model-f32_q4_1.gguf'
llama_model_loader: loaded meta data with 17 key-value pairs and 363 tensors from I:\JYGAIBIN\MetaLlamaModel\Llama2-13b-chat\ggml-model-f32_q4_1.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = LLaMA v2
llama_model_loader: - kv 2: llama.vocab_size u32 = 32000
llama_model_loader: - kv 3: llama.context_length u32 = 4096
llama_model_loader: - kv 4: llama.embedding_length u32 = 5120
llama_model_loader: - kv 5: llama.block_count u32 = 40
llama_model_loader: - kv 6: llama.feed_forward_length u32 = 13824
llama_model_loader: - kv 7: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 8: llama.attention.head_count u32 = 40
llama_model_loader: - kv 9: llama.attention.head_count_kv u32 = 40
llama_model_loader: - kv 10: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 11: general.file_type u32 = 3
llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 16: general.quantization_version u32 = 2
llama_model_loader: - type f32: 81 tensors
llama_model_loader: - type q4_1: 281 tensors
llama_model_loader: - type q6_K: 1 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 4096
llm_load_print_meta: n_embd = 5120
llm_load_print_meta: n_head = 40
llm_load_print_meta: n_head_kv = 40
llm_load_print_meta: n_layer = 40
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 5120
llm_load_print_meta: n_embd_v_gqa = 5120
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 13824
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 4096
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 13B
llm_load_print_meta: model ftype = Q4_1
llm_load_print_meta: model params = 13.02 B
llm_load_print_meta: model size = 7.61 GiB (5.02 BPW)
llm_load_print_meta: general.name = LLaMA v2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size = 0.37 MiB
llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 41/41 layers to GPU
llm_load_tensors: CPU buffer size = 97.66 MiB
llm_load_tensors: CUDA0 buffer size = 7692.26 MiB
....................................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 400.00 MiB
llama_new_context_with_model: KV self size = 400.00 MiB, K (f16): 200.00 MiB, V (f16): 200.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.12 MiB
ggml_gallocr_reserve_n: reallocating CUDA0 buffer from size 0.00 MiB to 85.00 MiB
ggml_gallocr_reserve_n: reallocating CUDA_Host buffer from size 0.00 MiB to 11.01 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 85.00 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 11.01 MiB
llama_new_context_with_model: graph nodes = 1286
llama_new_context_with_model: graph splits = 2
main: init model
print_params: n_vocab : 32000
print_params: n_ctx : 128
print_params: n_embd : 5120
print_params: n_ff : 13824
print_params: n_head : 40
print_params: n_head_kv : 40
print_params: n_layer : 40
print_params: norm_rms_eps : 0.000010
print_params: rope_freq_base : 10000.000000
print_params: rope_freq_scale : 1.000000
print_lora_params: n_rank_attention_norm : 1
print_lora_params: n_rank_wq : 4
print_lora_params: n_rank_wk : 4
print_lora_params: n_rank_wv : 4
print_lora_params: n_rank_wo : 4
print_lora_params: n_rank_ffn_norm : 1
print_lora_params: n_rank_ffn_gate : 4
print_lora_params: n_rank_ffn_down : 4
print_lora_params: n_rank_ffn_up : 4
print_lora_params: n_rank_tok_embeddings : 4
print_lora_params: n_rank_norm : 1
print_lora_params: n_rank_output : 4
main: total train_iterations 0
main: seen train_samples 0
main: seen train_tokens 0
main: completed train_epochs 0
main: lora_size = 131453216 bytes (125.4 MB)
main: opt_size = 196303024 bytes (187.2 MB)
main: opt iter 0
main: input_size = 131076128 bytes (125.0 MB)
ggml_gallocr_reserve_n: reallocating CPU buffer from size 0.00 MiB to 33455.02 MiB
ggml_gallocr_reserve_n: reallocating CPU buffer from size 0.00 MiB to 21311.02 MiB
main: compute_size = 22346224224 bytes (21311.0 MB)
main: evaluation order = RIGHT_TO_LEFT
ggml_gallocr_needs_realloc: graph has different number of nodes
ggml_gallocr_alloc_graph: reallocating buffers automatically
ggml_gallocr_reserve_n: reallocating CPU buffer from size 0.00 MiB to 21311.02 MiB
main: tokenize training data from I:\JYGAIBIN\MetaLlamaModel\Llama2-13b-chat\shakespeare.txt
main: sample-start:
main: include-sample-start: false
tokenize_file: total number of samples: 27458
main: number of training tokens: 27586
main: number of unique tokens: 3072
main: train data seems to have changed. restarting shuffled epoch.
main: begin training
main: work_size = 768376 bytes (0.7 MB)
train_opt_callback: iter= 0 sample=1/27458 sched=0.000000 loss=0.000000 |>
========================================
Call stack:
> finetune.exe!dequantize_row_q4_1(const block_q4_1 * x, float * y, __int64 k) Line 930 C
finetune.exe!ggml_compute_forward_add_q_f32(const ggml_compute_params * params, ggml_tensor * dst) Line 8220 C
finetune.exe!ggml_compute_forward_add(const ggml_compute_params * params, ggml_tensor * dst) Line 8280 C
finetune.exe!ggml_compute_forward(ggml_compute_params * params, ggml_tensor * tensor) Line 16492 C
finetune.exe!ggml_graph_compute_thread(void * data) Line 18749 C
========================================
Code segment where the crash occurs:
void dequantize_row_q4_1(const block_q4_1 * restrict x, float * restrict y, int64_t k) {
    static const int qk = QK4_1;
    assert(k % qk == 0);
    const int nb = k / qk;
    for (int i = 0; i < nb; i++) {
        const float d = GGML_FP16_TO_FP32(x[i].d); // <----- HERE: 0xC0000005 (access violation)
        const float m = GGML_FP16_TO_FP32(x[i].m);
        for (int j = 0; j < qk/2; ++j) {
            const int x0 = (x[i].qs[j] & 0x0F);
            const int x1 = (x[i].qs[j] >> 4);
            y[i*qk + j + 0   ] = x0*d + m;
            y[i*qk + j + qk/2] = x1*d + m;
        }
    }
}
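
For reference, here is a minimal standalone sketch of what the function above does when `x` points at a valid block in host memory. It is not the ggml implementation: the `_sketch` names, the struct layout (two fp16 fields followed by 16 packed bytes, 32 values per block) and the portable half-to-float helper standing in for `GGML_FP16_TO_FP32` are assumptions made for illustration. Since the fault happens on the first read of `x[i].d`, the pointer `x` itself appears to be unreadable from the CPU at that point. One unconfirmed guess is that it refers to GPU memory, given that the log shows all 41 layers offloaded to CUDA0. The argument checks in the sketch catch only NULL pointers and bad sizes; they cannot detect a non-NULL pointer that the CPU cannot read.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define QK4_1_SKETCH 32  /* values per block (assumed to match QK4_1) */

/* Assumed layout of one Q4_1 block. */
typedef struct {
    uint16_t d;                     /* scale, stored as IEEE half bits   */
    uint16_t m;                     /* minimum, stored as IEEE half bits */
    uint8_t  qs[QK4_1_SKETCH / 2];  /* two 4-bit quants per byte         */
} block_q4_1_sketch;

/* Portable half -> float conversion (stand-in for GGML_FP16_TO_FP32). */
static float fp16_to_fp32_sketch(uint16_t h) {
    uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
    uint32_t exp  = (h >> 10) & 0x1Fu;
    uint32_t man  = h & 0x3FFu;
    uint32_t bits;
    if (exp == 0) {
        if (man == 0) {
            bits = sign;                              /* signed zero */
        } else {
            int e = -1;                               /* subnormal: renormalize */
            do { man <<= 1; e++; } while ((man & 0x400u) == 0);
            man &= 0x3FFu;
            bits = sign | ((uint32_t)(127 - 15 - e) << 23) | (man << 13);
        }
    } else if (exp == 0x1Fu) {
        bits = sign | 0x7F800000u | (man << 13);      /* inf / NaN */
    } else {
        bits = sign | ((exp + 127 - 15) << 23) | (man << 13);
    }
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Dequantize k values from Q4_1 blocks into y.
 * Returns 0 on success, -1 if a pointer is NULL or k is not a
 * non-negative multiple of the block size. */
static int dequantize_row_q4_1_sketch(const block_q4_1_sketch *x, float *y, int64_t k) {
    if (x == NULL || y == NULL || k < 0 || k % QK4_1_SKETCH != 0) {
        return -1;
    }
    const int64_t nb = k / QK4_1_SKETCH;
    for (int64_t i = 0; i < nb; i++) {
        const float d = fp16_to_fp32_sketch(x[i].d);  /* first read of the block */
        const float m = fp16_to_fp32_sketch(x[i].m);
        for (int j = 0; j < QK4_1_SKETCH / 2; ++j) {
            const int x0 = x[i].qs[j] & 0x0F;         /* low nibble  */
            const int x1 = x[i].qs[j] >> 4;           /* high nibble */
            y[i * QK4_1_SKETCH + j]                    = x0 * d + m;
            y[i * QK4_1_SKETCH + j + QK4_1_SKETCH / 2] = x1 * d + m;
        }
    }
    return 0;
}
```

With a hand-built block where `d` is 0.5 (`0x3800`), `m` is 1.0 (`0x3C00`) and every byte of `qs` is `0x21`, the first half of the output is 1.5 and the second half is 2.0. This runs without error on ordinary host memory, which suggests the arithmetic in the function is not the problem and the pointer passed in is.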
