Bug: finetune always crashes #7583

### What happened?

I compiled and ran the original code without modifying anything.
The finetune example always crashes.
I can't solve this problem myself and would appreciate help. Thanks.



### Name and Version

main --version
version: 0 (unknown)
built with MSVC 19.29.30154.0 for x64
ggml-quants.c 2024-04-30 14:36

### What operating system are you seeing the problem on?

Windows

### Relevant log output

```shell
# Without modifying any code, run the finetune example for fine-tuning:
finetune -ngl 20000 --model-base "I:\JYGAIBIN\MetaLlamaModel\Llama2-13b-chat\ggml-model-f32_q4_1.gguf" --checkpoint-in "I:\JYGAIBIN\MetaLlamaModel\Llama2-13b-chat\ggml-model-f32_q4_1-LATEST.gguf" --checkpoint-out "I:\JYGAIBIN\MetaLlamaModel\Llama2-13b-chat\ggml-model-f32_q4_1-ITERATION.gguf" --lora-out "I:\JYGAIBIN\MetaLlamaModel\Llama2-13b-chat\ggml-model-f32_q4_1-LORA.bin" --train-data "I:\JYGAIBIN\MetaLlamaModel\Llama2-13b-chat\shakespeare.txt" --save-every 10 --threads 6 --adam-iter 30 --batch 4 --ctx 64 --use-checkpointing

========================================
main: seed: 1716876323
main: model base = 'I:\JYGAIBIN\MetaLlamaModel\Llama2-13b-chat\ggml-model-f32_q4_1.gguf'
llama_model_loader: loaded meta data with 17 key-value pairs and 363 tensors from I:\JYGAIBIN\MetaLlamaModel\Llama2-13b-chat\ggml-model-f32_q4_1.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                           llama.vocab_size u32              = 32000
llama_model_loader: - kv   3:                       llama.context_length u32              = 4096
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 5120
llama_model_loader: - kv   5:                          llama.block_count u32              = 40
llama_model_loader: - kv   6:                  llama.feed_forward_length u32              = 13824
llama_model_loader: - kv   7:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   8:                 llama.attention.head_count u32              = 40
llama_model_loader: - kv   9:              llama.attention.head_count_kv u32              = 40
llama_model_loader: - kv  10:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  11:                          general.file_type u32              = 3
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  16:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   81 tensors
llama_model_loader: - type q4_1:  281 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 5120
llm_load_print_meta: n_head           = 40
llm_load_print_meta: n_head_kv        = 40
llm_load_print_meta: n_layer          = 40
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 5120
llm_load_print_meta: n_embd_v_gqa     = 5120
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 13824
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 13B
llm_load_print_meta: model ftype      = Q4_1
llm_load_print_meta: model params     = 13.02 B
llm_load_print_meta: model size       = 7.61 GiB (5.02 BPW)
llm_load_print_meta: general.name     = LLaMA v2
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size =    0.37 MiB
llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 41/41 layers to GPU
llm_load_tensors:        CPU buffer size =    97.66 MiB
llm_load_tensors:      CUDA0 buffer size =  7692.26 MiB
....................................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =   400.00 MiB
llama_new_context_with_model: KV self size  =  400.00 MiB, K (f16):  200.00 MiB, V (f16):  200.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.12 MiB
ggml_gallocr_reserve_n: reallocating CUDA0 buffer from size 0.00 MiB to 85.00 MiB
ggml_gallocr_reserve_n: reallocating CUDA_Host buffer from size 0.00 MiB to 11.01 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =    85.00 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    11.01 MiB
llama_new_context_with_model: graph nodes  = 1286
llama_new_context_with_model: graph splits = 2
main: init model
print_params: n_vocab               : 32000
print_params: n_ctx                 : 128
print_params: n_embd                : 5120
print_params: n_ff                  : 13824
print_params: n_head                : 40
print_params: n_head_kv             : 40
print_params: n_layer               : 40
print_params: norm_rms_eps          : 0.000010
print_params: rope_freq_base        : 10000.000000
print_params: rope_freq_scale       : 1.000000
print_lora_params: n_rank_attention_norm : 1
print_lora_params: n_rank_wq             : 4
print_lora_params: n_rank_wk             : 4
print_lora_params: n_rank_wv             : 4
print_lora_params: n_rank_wo             : 4
print_lora_params: n_rank_ffn_norm       : 1
print_lora_params: n_rank_ffn_gate       : 4
print_lora_params: n_rank_ffn_down       : 4
print_lora_params: n_rank_ffn_up         : 4
print_lora_params: n_rank_tok_embeddings : 4
print_lora_params: n_rank_norm           : 1
print_lora_params: n_rank_output         : 4
main: total train_iterations 0
main: seen train_samples     0
main: seen train_tokens      0
main: completed train_epochs 0
main: lora_size = 131453216 bytes (125.4 MB)
main: opt_size  = 196303024 bytes (187.2 MB)
main: opt iter 0
main: input_size = 131076128 bytes (125.0 MB)
ggml_gallocr_reserve_n: reallocating CPU buffer from size 0.00 MiB to 33455.02 MiB
ggml_gallocr_reserve_n: reallocating CPU buffer from size 0.00 MiB to 21311.02 MiB
main: compute_size = 22346224224 bytes (21311.0 MB)
main: evaluation order = RIGHT_TO_LEFT
ggml_gallocr_needs_realloc: graph has different number of nodes
ggml_gallocr_alloc_graph: reallocating buffers automatically
ggml_gallocr_reserve_n: reallocating CPU buffer from size 0.00 MiB to 21311.02 MiB
main: tokenize training data from I:\JYGAIBIN\MetaLlamaModel\Llama2-13b-chat\shakespeare.txt
main: sample-start:
main: include-sample-start: false
tokenize_file: total number of samples: 27458
main: number of training tokens: 27586
main: number of unique tokens: 3072
main: train data seems to have changed. restarting shuffled epoch.
main: begin training
main: work_size = 768376 bytes (0.7 MB)
train_opt_callback: iter=     0 sample=1/27458 sched=0.000000 loss=0.000000 |>

========================================
Call stack:
>	finetune.exe!dequantize_row_q4_1(const block_q4_1 * x, float * y, __int64 k) Line 930	C
 	finetune.exe!ggml_compute_forward_add_q_f32(const ggml_compute_params * params, ggml_tensor * dst) Line 8220	C
 	finetune.exe!ggml_compute_forward_add(const ggml_compute_params * params, ggml_tensor * dst) Line 8280	C
 	finetune.exe!ggml_compute_forward(ggml_compute_params * params, ggml_tensor * tensor) Line 16492	C
 	finetune.exe!ggml_graph_compute_thread(void * data) Line 18749	C
========================================
Code segment:
void dequantize_row_q4_1(const block_q4_1 * restrict x, float * restrict y, int64_t k) {
    static const int qk = QK4_1;

    assert(k % qk == 0);

    const int nb = k / qk;

    for (int i = 0; i < nb; i++) {
        const float d = GGML_FP16_TO_FP32(x[i].d); // <----- 0xC0000005 (access violation) is raised here
        const float m = GGML_FP16_TO_FP32(x[i].m);

        for (int j = 0; j < qk/2; ++j) {
            const int x0 = (x[i].qs[j] & 0x0F);
            const int x1 = (x[i].qs[j] >>   4);

            y[i*qk + j + 0   ] = x0*d + m;
            y[i*qk + j + qk/2] = x1*d + m;
        }
    }
}
```

![BUG](https://github.com/ggerganov/llama.cpp/assets/6336270/4c7e8013-0c97-443e-8ba7-2bfadc4eca4f)
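To help narrow this down, here is a minimal standalone sketch of the same dequantization loop running on a block that lives in ordinary host memory. It is not the ggml implementation: `block_q4_1_f32`, `dequantize_row_q4_1_f32` and `demo_block` are hypothetical names, and `d` and `m` are stored as plain `float` instead of fp16 so the sketch needs no conversion macro. If the loop behaves correctly here, the access violation more likely comes from the `x` pointer passed in than from the loop itself. One untested guess is that, with all 41 layers offloaded by `-ngl 20000`, the tensor data is not readable from the CPU thread.

```c
#include <assert.h>
#include <stdint.h>

#define QK4_1 32  /* values per q4_1 block, as in ggml */

/* Simplified stand-in for block_q4_1 (assumption: float instead of fp16). */
typedef struct {
    float   d;              /* scale */
    float   m;              /* minimum */
    uint8_t qs[QK4_1 / 2];  /* two 4-bit quants packed per byte */
} block_q4_1_f32;

/* Same loop structure as dequantize_row_q4_1 in the report. */
void dequantize_row_q4_1_f32(const block_q4_1_f32 *x, float *y, int64_t k) {
    assert(k % QK4_1 == 0);
    const int nb = (int)(k / QK4_1);

    for (int i = 0; i < nb; i++) {
        const float d = x[i].d;  /* the crash site in the original code */
        const float m = x[i].m;

        for (int j = 0; j < QK4_1 / 2; ++j) {
            const int x0 = x[i].qs[j] & 0x0F;  /* low nibble  */
            const int x1 = x[i].qs[j] >> 4;    /* high nibble */

            y[i * QK4_1 + j]             = x0 * d + m;
            y[i * QK4_1 + j + QK4_1 / 2] = x1 * d + m;
        }
    }
}

/* Hypothetical helper: builds one block on the host and dequantizes it. */
void demo_block(float out[QK4_1]) {
    block_q4_1_f32 b;
    b.d = 0.5f;
    b.m = 1.0f;
    for (int j = 0; j < QK4_1 / 2; ++j) {
        /* low nibble is always 3, high nibble is j */
        b.qs[j] = (uint8_t)((j << 4) | 0x03);
    }
    dequantize_row_q4_1_f32(&b, out, QK4_1);
}
```

Calling `demo_block` with a 32-element `float` array fills the first 16 entries from the low nibbles and the last 16 from the high nibbles. Re-running the original command with `-ngl 0` would be one way to check whether the crash depends on GPU offloading.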

