Bug: Finetune always crashes #7583

Description

@jygmysoul

What happened?

I have not modified any code; this is the original source, compiled and run as-is.
The finetune example crashes every time.
I really can't solve this problem myself, so I am asking for help. Thanks.

Name and Version

main --version
version: 0 (unknown)
built with MSVC 19.29.30154.0 for x64
ggml-quants.c 2024-04-30 14:36

What operating system are you seeing the problem on?

Windows

Relevant log output

Without modifying any code, run the finetune example to fine-tune the model:
finetune -ngl 20000 --model-base "I:\JYGAIBIN\MetaLlamaModel\Llama2-13b-chat\ggml-model-f32_q4_1.gguf" --checkpoint-in "I:\JYGAIBIN\MetaLlamaModel\Llama2-13b-chat\ggml-model-f32_q4_1-LATEST.gguf" --checkpoint-out "I:\JYGAIBIN\MetaLlamaModel\Llama2-13b-chat\ggml-model-f32_q4_1-ITERATION.gguf" --lora-out "I:\JYGAIBIN\MetaLlamaModel\Llama2-13b-chat\ggml-model-f32_q4_1-LORA.bin" --train-data "I:\JYGAIBIN\MetaLlamaModel\Llama2-13b-chat\shakespeare.txt" --save-every 10 --threads 6 --adam-iter 30 --batch 4 --ctx 64 --use-checkpointing

========================================
main: seed: 1716876323
main: model base = 'I:\JYGAIBIN\MetaLlamaModel\Llama2-13b-chat\ggml-model-f32_q4_1.gguf'
llama_model_loader: loaded meta data with 17 key-value pairs and 363 tensors from I:\JYGAIBIN\MetaLlamaModel\Llama2-13b-chat\ggml-model-f32_q4_1.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                           llama.vocab_size u32              = 32000
llama_model_loader: - kv   3:                       llama.context_length u32              = 4096
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 5120
llama_model_loader: - kv   5:                          llama.block_count u32              = 40
llama_model_loader: - kv   6:                  llama.feed_forward_length u32              = 13824
llama_model_loader: - kv   7:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   8:                 llama.attention.head_count u32              = 40
llama_model_loader: - kv   9:              llama.attention.head_count_kv u32              = 40
llama_model_loader: - kv  10:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  11:                          general.file_type u32              = 3
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  16:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   81 tensors
llama_model_loader: - type q4_1:  281 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 5120
llm_load_print_meta: n_head           = 40
llm_load_print_meta: n_head_kv        = 40
llm_load_print_meta: n_layer          = 40
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 5120
llm_load_print_meta: n_embd_v_gqa     = 5120
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 13824
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 13B
llm_load_print_meta: model ftype      = Q4_1
llm_load_print_meta: model params     = 13.02 B
llm_load_print_meta: model size       = 7.61 GiB (5.02 BPW)
llm_load_print_meta: general.name     = LLaMA v2
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size =    0.37 MiB
llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 41/41 layers to GPU
llm_load_tensors:        CPU buffer size =    97.66 MiB
llm_load_tensors:      CUDA0 buffer size =  7692.26 MiB
....................................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =   400.00 MiB
llama_new_context_with_model: KV self size  =  400.00 MiB, K (f16):  200.00 MiB, V (f16):  200.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.12 MiB
ggml_gallocr_reserve_n: reallocating CUDA0 buffer from size 0.00 MiB to 85.00 MiB
ggml_gallocr_reserve_n: reallocating CUDA_Host buffer from size 0.00 MiB to 11.01 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =    85.00 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    11.01 MiB
llama_new_context_with_model: graph nodes  = 1286
llama_new_context_with_model: graph splits = 2
main: init model
print_params: n_vocab               : 32000
print_params: n_ctx                 : 128
print_params: n_embd                : 5120
print_params: n_ff                  : 13824
print_params: n_head                : 40
print_params: n_head_kv             : 40
print_params: n_layer               : 40
print_params: norm_rms_eps          : 0.000010
print_params: rope_freq_base        : 10000.000000
print_params: rope_freq_scale       : 1.000000
print_lora_params: n_rank_attention_norm : 1
print_lora_params: n_rank_wq             : 4
print_lora_params: n_rank_wk             : 4
print_lora_params: n_rank_wv             : 4
print_lora_params: n_rank_wo             : 4
print_lora_params: n_rank_ffn_norm       : 1
print_lora_params: n_rank_ffn_gate       : 4
print_lora_params: n_rank_ffn_down       : 4
print_lora_params: n_rank_ffn_up         : 4
print_lora_params: n_rank_tok_embeddings : 4
print_lora_params: n_rank_norm           : 1
print_lora_params: n_rank_output         : 4
main: total train_iterations 0
main: seen train_samples     0
main: seen train_tokens      0
main: completed train_epochs 0
main: lora_size = 131453216 bytes (125.4 MB)
main: opt_size  = 196303024 bytes (187.2 MB)
main: opt iter 0
main: input_size = 131076128 bytes (125.0 MB)
ggml_gallocr_reserve_n: reallocating CPU buffer from size 0.00 MiB to 33455.02 MiB
ggml_gallocr_reserve_n: reallocating CPU buffer from size 0.00 MiB to 21311.02 MiB
main: compute_size = 22346224224 bytes (21311.0 MB)
main: evaluation order = RIGHT_TO_LEFT
ggml_gallocr_needs_realloc: graph has different number of nodes
ggml_gallocr_alloc_graph: reallocating buffers automatically
ggml_gallocr_reserve_n: reallocating CPU buffer from size 0.00 MiB to 21311.02 MiB
main: tokenize training data from I:\JYGAIBIN\MetaLlamaModel\Llama2-13b-chat\shakespeare.txt
main: sample-start:
main: include-sample-start: false
tokenize_file: total number of samples: 27458
main: number of training tokens: 27586
main: number of unique tokens: 3072
main: train data seems to have changed. restarting shuffled epoch.
main: begin training
main: work_size = 768376 bytes (0.7 MB)
train_opt_callback: iter=     0 sample=1/27458 sched=0.000000 loss=0.000000 |>

========================================
Call stack:
>	finetune.exe!dequantize_row_q4_1(const block_q4_1 * x, float * y, __int64 k) Line 930	C
 	finetune.exe!ggml_compute_forward_add_q_f32(const ggml_compute_params * params, ggml_tensor * dst) Line 8220	C
 	finetune.exe!ggml_compute_forward_add(const ggml_compute_params * params, ggml_tensor * dst) Line 8280	C
 	finetune.exe!ggml_compute_forward(ggml_compute_params * params, ggml_tensor * tensor) Line 16492	C
 	finetune.exe!ggml_graph_compute_thread(void * data) Line 18749	C
========================================
Code segment:
void dequantize_row_q4_1(const block_q4_1 * restrict x, float * restrict y, int64_t k) {
    static const int qk = QK4_1;

    assert(k % qk == 0);

    const int nb = k / qk;

    for (int i = 0; i < nb; i++) {
        const float d = GGML_FP16_TO_FP32(x[i].d); // <----- 0xC0000005 (access violation) is raised HERE
        const float m = GGML_FP16_TO_FP32(x[i].m);

        for (int j = 0; j < qk/2; ++j) {
            const int x0 = (x[i].qs[j] & 0x0F);
            const int x1 = (x[i].qs[j] >>   4);

            y[i*qk + j + 0   ] = x0*d + m;
            y[i*qk + j + qk/2] = x1*d + m;
        }
    }
}
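
For reference, here is a minimal self-contained sketch of what this routine computes when `x` points to a valid block in host memory. It is a simplified stand-in, not the ggml source: the `demo_` names are hypothetical, `d` and `m` are stored as plain `float` instead of fp16 so that no ggml headers are needed, and `QK4_1` is assumed to be 32. The faulting line is the first read through `x`, so the access violation suggests that `x` itself may not point to readable host memory. This is only an inference from the stack trace, not a confirmed diagnosis.

```c
#include <assert.h>
#include <stdint.h>

#define QK4_1 32  /* values per block; assumed to match ggml's QK4_1 */

/* Hypothetical, simplified stand-in for block_q4_1.
 * The real struct stores d and m as fp16; plain float is used here. */
typedef struct {
    float   d;              /* scale */
    float   m;              /* minimum (offset) */
    uint8_t qs[QK4_1 / 2];  /* two 4-bit quants packed per byte */
} demo_block_q4_1;

/* Same arithmetic as the snippet above: y = q * d + m for every 4-bit q. */
static void demo_dequantize_row_q4_1(const demo_block_q4_1 *x, float *y, int64_t k) {
    assert(k % QK4_1 == 0);

    const int64_t nb = k / QK4_1;

    for (int64_t i = 0; i < nb; i++) {
        /* First dereference of x: this read faults if x is not a
         * valid pointer into host memory. */
        const float d = x[i].d;
        const float m = x[i].m;

        for (int j = 0; j < QK4_1 / 2; ++j) {
            const int x0 = x[i].qs[j] & 0x0F;  /* low nibble  */
            const int x1 = x[i].qs[j] >> 4;    /* high nibble */

            y[i * QK4_1 + j]             = x0 * d + m;
            y[i * QK4_1 + j + QK4_1 / 2] = x1 * d + m;
        }
    }
}
```

As a usage example, for a single block with `d = 0.5`, `m = -1.0` and `qs[0] = 0xA3` (all other bytes zero), the call `demo_dequantize_row_q4_1(&blk, out, QK4_1)` writes `out[0] = 3 * 0.5 - 1.0 = 0.5` and `out[16] = 10 * 0.5 - 1.0 = 4.0`.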
![BUG](https://github.com/ggerganov/llama.cpp/assets/6336270/4c7e8013-0c97-443e-8ba7-2bfadc4eca4f)
