
Misc. bug: inconsistent locale for printing GGUF kv data across examples #10613

@JohannesGaessler

Description

Name and Version

./build/bin/llama-cli --version
version: 4232 (6acce39)
built with cc (GCC) 14.2.1 20240910 for x86_64-pc-linux-gnu

Operating systems

Linux

Which llama.cpp modules do you know to be affected?

llama-cli, Other (Please specify in the next section)

Problem description & steps to reproduce

I am using a Linux PC with the locale set like this:

> locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC=de_DE.UTF-8
LC_TIME=de_DE.UTF-8
LC_COLLATE="en_US.UTF-8"
LC_MONETARY=de_DE.UTF-8
LC_MESSAGES="en_US.UTF-8"
LC_PAPER=de_DE.UTF-8
LC_NAME=de_DE.UTF-8
LC_ADDRESS=de_DE.UTF-8
LC_TELEPHONE=de_DE.UTF-8
LC_MEASUREMENT=de_DE.UTF-8
LC_IDENTIFICATION=de_DE.UTF-8
LC_ALL=
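
For context (my own sketch, not llama.cpp code): the comma vs. point difference shows up whenever a binary adopts the environment locale via setlocale() before printing with %f. A minimal repro under the LC_NUMERIC=de_DE.UTF-8 setting above:

```cpp
// repro.cpp -- minimal sketch of the underlying printf/locale behavior,
// not code from llama.cpp. Build with: g++ repro.cpp -o repro && ./repro
#include <clocale>
#include <cstdio>

int main() {
    // Default "C" locale: %f always uses a decimal point.
    printf("C locale    : %f\n", 1e-5);   // 0.000010

    // Adopt the environment locale (LC_NUMERIC=de_DE.UTF-8 here);
    // %f now uses that locale's decimal separator.
    setlocale(LC_ALL, "");
    printf("user locale : %f\n", 1e-5);   // 0,000010
    return 0;
}
```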

The way floating-point numbers from the model's GGUF kv data are printed is inconsistent depending on which binary I run: llama-cli prints them with a decimal comma, while llama-perplexity prints them with a decimal point (see the log output below).
It may make sense to completely ignore any locale set by the user and always print with a point.
Honestly, this is a very minor issue though.
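
If the decision is to always print with a point, one way to do it (just a sketch of the general technique, not a concrete patch) is to format through a stream imbued with the classic "C" locale, or to reset LC_NUMERIC once at startup:

```cpp
// Two locale-independent formatting options (illustrative sketch only).
#include <clocale>
#include <locale>
#include <sstream>
#include <string>

// Option A: per-call formatting pinned to the classic "C" locale,
// without touching the process-wide locale.
static std::string format_f32(float v) {
    std::ostringstream oss;
    oss.imbue(std::locale::classic()); // decimal point regardless of LC_NUMERIC
    oss << v;
    return oss.str();
}

// Option B: reset the numeric locale process-wide so that every
// subsequent printf("%f", ...) prints a point again.
static void force_c_numeric_locale() {
    setlocale(LC_NUMERIC, "C");
}
```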

First Bad Commit

No response

Relevant log output

/home/johannesg/Projects/llama.cpp [git::master *] [johannesg@johannes-pc] [11:43]
> export model_name=stories-260k && export quantization=f32

/home/johannesg/Projects/llama.cpp [git::master *] [johannesg@johannes-pc] [11:43]
> build/bin/llama-cli --model models/opt/${model_name}-${quantization}.gguf -n 64                   
build: 4232 (6acce397) with cc (GCC) 14.2.1 20240910 for x86_64-pc-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_loader: loaded meta data with 19 key-value pairs and 48 tensors from models/opt/stories-260k-f32.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                      tokenizer.ggml.tokens arr[str,512]     = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv   1:                      tokenizer.ggml.scores arr[f32,512]     = [0,000000, 0,000000, 0,000000, 0,0000...
llama_model_loader: - kv   2:                  tokenizer.ggml.token_type arr[i32,512]     = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv   3:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv   4:                       general.architecture str              = llama
llama_model_loader: - kv   5:                               general.name str              = llama
llama_model_loader: - kv   6:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv   7:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv   8:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv   9:          tokenizer.ggml.seperator_token_id u32              = 4294967295
llama_model_loader: - kv  10:            tokenizer.ggml.padding_token_id u32              = 4294967295
llama_model_loader: - kv  11:                       llama.context_length u32              = 128
llama_model_loader: - kv  12:                     llama.embedding_length u32              = 64
llama_model_loader: - kv  13:                  llama.feed_forward_length u32              = 172
llama_model_loader: - kv  14:                 llama.attention.head_count u32              = 8
llama_model_loader: - kv  15:              llama.attention.head_count_kv u32              = 4
llama_model_loader: - kv  16:                          llama.block_count u32              = 5
llama_model_loader: - kv  17:                 llama.rope.dimension_count u32              = 8
llama_model_loader: - kv  18:     llama.attention.layer_norm_rms_epsilon f32              = 0,000010
llama_model_loader: - type  f32:   48 tensors
llm_load_vocab: bad special token: 'tokenizer.ggml.seperator_token_id' = 4294967295d, using default id -1
llm_load_vocab: bad special token: 'tokenizer.ggml.padding_token_id' = 4294967295d, using default id -1
llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
llm_load_vocab: special tokens cache size = 3
llm_load_vocab: token to piece cache size = 0,0008 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 512
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 128
llm_load_print_meta: n_embd           = 64
llm_load_print_meta: n_layer          = 5
llm_load_print_meta: n_head           = 8
llm_load_print_meta: n_head_kv        = 4
llm_load_print_meta: n_rot            = 8
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 8
llm_load_print_meta: n_embd_head_v    = 8
llm_load_print_meta: n_gqa            = 2
llm_load_print_meta: n_embd_k_gqa     = 32
llm_load_print_meta: n_embd_v_gqa     = 32
llm_load_print_meta: f_norm_eps       = 0,0e+00
llm_load_print_meta: f_norm_rms_eps   = 1,0e-05
llm_load_print_meta: f_clamp_kqv      = 0,0e+00
llm_load_print_meta: f_max_alibi_bias = 0,0e+00
llm_load_print_meta: f_logit_scale    = 0,0e+00
llm_load_print_meta: n_ff             = 172
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000,0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 128
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = all F32 (guessed)
llm_load_print_meta: model params     = 292,80 K
llm_load_print_meta: model size       = 1,12 MiB (32,00 BPW) 
llm_load_print_meta: general.name     = llama
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_print_meta: EOG token        = 2 '</s>'
llm_load_print_meta: max token length = 9
llm_load_tensors:   CPU_Mapped model buffer size =     1,12 MiB
...................................
llama_new_context_with_model: n_seq_max     = 1
llama_new_context_with_model: n_ctx         = 4096
llama_new_context_with_model: n_ctx_per_seq = 4096
llama_new_context_with_model: n_batch       = 2048
llama_new_context_with_model: n_ubatch      = 512
llama_new_context_with_model: flash_attn    = 0
llama_new_context_with_model: freq_base     = 10000,0
llama_new_context_with_model: freq_scale    = 1
llama_new_context_with_model: n_ctx_pre_seq (4096) > n_ctx_train (128) -- possible training context overflow
llama_kv_cache_init:        CPU KV buffer size =     2,50 MiB
llama_new_context_with_model: KV self size  =    2,50 MiB, K (f16):    1,25 MiB, V (f16):    1,25 MiB
llama_new_context_with_model:        CPU  output buffer size =     0,00 MiB
llama_new_context_with_model:        CPU compute buffer size =    72,51 MiB
llama_new_context_with_model: graph nodes  = 166
llama_new_context_with_model: graph splits = 1
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 16
main: model was trained on only 128 context tokens (4096 specified)

system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 | 

sampler seed: 2284000892
sampler params: 
        repeat_last_n = 64, repeat_penalty = 1,000, frequency_penalty = 0,000, presence_penalty = 0,000
        dry_multiplier = 0,000, dry_base = 1,750, dry_allowed_length = 2, dry_penalty_last_n = -1
        top_k = 40, top_p = 0,950, min_p = 0,050, xtc_probability = 0,000, xtc_threshold = 0,100, typical_p = 1,000, temp = 0,800
        mirostat = 0, mirostat_lr = 0,100, mirostat_ent = 5,000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist 
generate: n_ctx = 4096, n_batch = 2048, n_predict = 64, n_keep = 1

 Once upon a time, there was a little girl named Lily. She loved to play with her toys and explore the woods. One day, her mommy told her they were going on a big road with long ha

llama_perf_sampler_print:    sampling time =       0,57 ms /    65 runs   (    0,01 ms per token, 114235,50 tokens per second)
llama_perf_context_print:        load time =       4,01 ms
llama_perf_context_print: prompt eval time =       0,00 ms /     1 tokens (    0,00 ms per token,      inf tokens per second)
llama_perf_context_print:        eval time =      15,61 ms /    64 runs   (    0,24 ms per token,  4099,67 tokens per second)
llama_perf_context_print:       total time =      16,64 ms /    65 tokens

/home/johannesg/Projects/llama.cpp [git::master *] [johannesg@johannes-pc] [11:43]
> build/bin/llama-perplexity --model models/opt/${model_name}-${quantization}.gguf -f wikitext-2-raw/wiki.test.raw -c 128 --chunks 1 
build: 4232 (6acce397) with cc (GCC) 14.2.1 20240910 for x86_64-pc-linux-gnu
llama_model_loader: loaded meta data with 19 key-value pairs and 48 tensors from models/opt/stories-260k-f32.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                      tokenizer.ggml.tokens arr[str,512]     = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv   1:                      tokenizer.ggml.scores arr[f32,512]     = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv   2:                  tokenizer.ggml.token_type arr[i32,512]     = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv   3:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv   4:                       general.architecture str              = llama
llama_model_loader: - kv   5:                               general.name str              = llama
llama_model_loader: - kv   6:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv   7:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv   8:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv   9:          tokenizer.ggml.seperator_token_id u32              = 4294967295
llama_model_loader: - kv  10:            tokenizer.ggml.padding_token_id u32              = 4294967295
llama_model_loader: - kv  11:                       llama.context_length u32              = 128
llama_model_loader: - kv  12:                     llama.embedding_length u32              = 64
llama_model_loader: - kv  13:                  llama.feed_forward_length u32              = 172
llama_model_loader: - kv  14:                 llama.attention.head_count u32              = 8
llama_model_loader: - kv  15:              llama.attention.head_count_kv u32              = 4
llama_model_loader: - kv  16:                          llama.block_count u32              = 5
llama_model_loader: - kv  17:                 llama.rope.dimension_count u32              = 8
llama_model_loader: - kv  18:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - type  f32:   48 tensors
llm_load_vocab: bad special token: 'tokenizer.ggml.seperator_token_id' = 4294967295d, using default id -1
llm_load_vocab: bad special token: 'tokenizer.ggml.padding_token_id' = 4294967295d, using default id -1
llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
llm_load_vocab: special tokens cache size = 3
llm_load_vocab: token to piece cache size = 0.0008 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 512
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 128
llm_load_print_meta: n_embd           = 64
llm_load_print_meta: n_layer          = 5
llm_load_print_meta: n_head           = 8
llm_load_print_meta: n_head_kv        = 4
llm_load_print_meta: n_rot            = 8
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 8
llm_load_print_meta: n_embd_head_v    = 8
llm_load_print_meta: n_gqa            = 2
llm_load_print_meta: n_embd_k_gqa     = 32
llm_load_print_meta: n_embd_v_gqa     = 32
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 172
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 128
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = all F32 (guessed)
llm_load_print_meta: model params     = 292.80 K
llm_load_print_meta: model size       = 1.12 MiB (32.00 BPW) 
llm_load_print_meta: general.name     = llama
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_print_meta: EOG token        = 2 '</s>'
llm_load_print_meta: max token length = 9
llm_load_tensors:   CPU_Mapped model buffer size =     1.12 MiB
...................................
llama_new_context_with_model: n_seq_max     = 16
llama_new_context_with_model: n_ctx         = 2048
llama_new_context_with_model: n_ctx_per_seq = 128
llama_new_context_with_model: n_batch       = 2048
llama_new_context_with_model: n_ubatch      = 512
llama_new_context_with_model: flash_attn    = 0
llama_new_context_with_model: freq_base     = 10000.0
llama_new_context_with_model: freq_scale    = 1
llama_kv_cache_init:        CPU KV buffer size =     1.25 MiB
llama_new_context_with_model: KV self size  =    1.25 MiB, K (f16):    0.62 MiB, V (f16):    0.62 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.03 MiB
llama_new_context_with_model:        CPU compute buffer size =    36.51 MiB
llama_new_context_with_model: graph nodes  = 166
llama_new_context_with_model: graph splits = 1
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: model was trained on only 128 context tokens (2048 specified)

system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 | 
perplexity: tokenizing the input ..
perplexity: tokenization took 190.742 ms
perplexity: calculating perplexity over 1 chunks, n_ctx=128, batch_size=2048, n_seq=16
perplexity: 0.00 seconds per pass - ETA 0.00 minutes
[1]75.2432,
Final estimate: PPL = 75.2432 +/- 40.48787

llama_perf_context_print:        load time =       4.57 ms
llama_perf_context_print: prompt eval time =       2.19 ms /   128 tokens (    0.02 ms per token, 58500.91 tokens per second)
llama_perf_context_print:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:       total time =     195.21 ms /   129 tokens
