Codex Local Token Counter Fix - Instructional Guide for llama.cpp #18847
jbulger82 started this conversation in Show and tell
Codex Local Token Counter Fix - Instructional Guide
Author: Claude (Anthropic) with Jeff Bulger
Date: 2026-01-14
For: OpenAI Codex TUI (v0.80.0) running with local llama.cpp backend
GitHub: Feel free to use and share - MIT License
🎯 What This Fixes
The stock OpenAI Codex TUI doesn't display token usage when using local llama.cpp. After this fix:
Before: Always shows `0 total (0 input + 0 output)`
After: Real-time token tracking + auto-compact when you hit your limit
📋 Prerequisites
🔧 The Fix (4 Files to Modify)
File 1: `codex-api/src/requests/chat.rs`
What: Tell llama.cpp to include usage data in the streaming response.
Find the payload construction (around line 303-310):
Replace with:
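A minimal sketch of what this change amounts to, under stated assumptions. OpenAI-compatible chat completions endpoints report usage on a streamed response when the request carries `stream_options` with `include_usage` set to `true`, so that is presumably the field added here. `chat_payload` and its parameters are hypothetical names, not Codex code, and the JSON is assembled by hand only to keep the sketch dependency-free; the real payload would more naturally be built with `serde_json`.

```rust
/// Hypothetical helper: builds a minimal streaming chat request body.
/// JSON is assembled by hand only to keep this sketch dependency-free.
fn chat_payload(model: &str, include_usage: bool) -> String {
    // Escape the two characters that would break a JSON string literal.
    let model = model.replace('\\', "\\\\").replace('"', "\\\"");
    // Assumption: usage on a streamed response is requested through this field.
    let stream_options = if include_usage {
        r#","stream_options":{"include_usage":true}"#
    } else {
        ""
    };
    format!(r#"{{"model":"{model}","stream":true{stream_options}}}"#)
}

fn main() {
    // Without the flag, the request body has no `stream_options` key at all.
    println!("{}", chat_payload("local-model", false));
    println!("{}", chat_payload("local-model", true));
}
```

With the flag set, a compliant server is expected to emit one extra chunk near the end of the stream whose `usage` object holds the prompt and completion token counts.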
File 2: `codex-api/src/sse/chat.rs`
What: Extract the token usage from llama.cpp's response and pass it to the TUI.
Step 2a: Add Import
At the top of the file, add:
Step 2b: Add Tracking Variable
In the `spawn_stream_handler` function, find these lines:
Add below them:
Step 2c: Modify `flush_and_complete` Function
Find the `flush_and_complete` function and change it to accept and use `token_usage`:
Step 2d: Extract Usage from JSON
After the JSON is parsed (`let value: serde_json::Value = serde_json::from_str(&data)?;`), add:
Step 2e: Update All `flush_and_complete` Calls
Find all calls to `flush_and_complete` and add the `captured_usage` parameter:
Before:
After:
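Taken together, Steps 2b through 2e describe one flow: keep an `Option` that starts empty, fill it when a chunk carries usage, and hand it to `flush_and_complete` instead of a hardcoded `None`. The sketch below shows that flow in simplified, dependency-free form. `TokenUsage`, `Chunk`, `handle_stream`, and the function signatures are stand-ins with assumed field names, not the actual Codex types, and the real handler parses each chunk with `serde_json` instead of receiving a typed struct.

```rust
/// Stand-in for the usage record the TUI consumes (field names are assumptions).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct TokenUsage {
    input_tokens: u64,
    output_tokens: u64,
}

impl TokenUsage {
    fn total(&self) -> u64 {
        self.input_tokens + self.output_tokens
    }
}

/// Stand-in for one parsed SSE chunk. In an OpenAI-style stream, `usage`
/// is absent on most chunks and present once near the end.
struct Chunk {
    text: &'static str,
    usage: Option<TokenUsage>,
}

/// Stand-in for `flush_and_complete`: returns what the completion event would carry.
fn flush_and_complete(
    text: String,
    token_usage: Option<TokenUsage>,
) -> (String, Option<TokenUsage>) {
    (text, token_usage)
}

fn handle_stream(chunks: &[Chunk]) -> (String, Option<TokenUsage>) {
    let mut text = String::new();
    // Step 2b: tracking variable, empty until a chunk reports usage.
    let mut captured_usage: Option<TokenUsage> = None;
    for chunk in chunks {
        text.push_str(chunk.text);
        // Step 2d: remember usage whenever a chunk carries it.
        if let Some(usage) = chunk.usage {
            captured_usage = Some(usage);
        }
    }
    // Step 2e: pass the captured value instead of a hardcoded `None`.
    flush_and_complete(text, captured_usage)
}

fn main() {
    // Token counts taken from the first completed request in the log below.
    let final_usage = TokenUsage { input_tokens: 5605, output_tokens: 673 };
    let chunks = [
        Chunk { text: "Hel", usage: None },
        Chunk { text: "lo", usage: None },
        Chunk { text: "", usage: Some(final_usage) },
    ];
    let (text, usage) = handle_stream(&chunks);
    println!("{text}: {:?} (total {:?})", usage, usage.map(|u| u.total()));
}
```

If the server never sends a usage chunk, `captured_usage` stays `None` and the display falls back to zero, which is the original symptom.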
File 3: `core/src/models_manager/model_info.rs`
What: Set the correct context window for your model so the percentage display is accurate.
Find your model's definition (or add one for local models). Example for Nemotron:
Important: Set `context_window` to match your llama.cpp server's `-c` flag!
File 4: `config.toml`
What: Enable auto-compaction to prevent OOM when you hit your context limit.
Add at the top level of your config:
This triggers automatic context compaction at 250k tokens (adjust based on your context window).
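To see how the two settings relate, the sketch below works through the arithmetic for the server shown in the log (`-c 262144`), where a 250k limit sits at roughly 95% of the window. Both functions are hypothetical illustrations: the exact formula Codex uses for the percentage display and the exact comparison it uses to trigger compaction are assumptions here.

```rust
/// Percentage of the context window consumed so far (assumed formula).
fn context_percent_used(total_tokens: u64, context_window: u64) -> f64 {
    if context_window == 0 {
        return 0.0;
    }
    total_tokens as f64 / context_window as f64 * 100.0
}

/// True once usage reaches the configured limit. A `None` limit disables the check.
/// Assumption: the comparison is "greater than or equal".
fn should_auto_compact(total_tokens: u64, limit: Option<u64>) -> bool {
    match limit {
        Some(limit) => total_tokens >= limit,
        None => false,
    }
}

fn main() {
    let context_window = 262_144; // matches `-c 262144` on the llama.cpp server
    let limit = Some(250_000); // the 250k threshold from config.toml
    for total in [6_278, 250_000] {
        println!(
            "{total} tokens: {:.1}% used, compact = {}",
            context_percent_used(total, context_window),
            should_auto_compact(total, limit)
        );
    }
}
```

When `context_window` is larger than the server's real `-c` value, the displayed percentage understates usage, so a limit derived from it can sit above what the server can actually hold.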
🔨 Build & Test
cd codex-rs
cargo build
Then launch your TUI and run `/status` - you should see real token counts!
🔍 How It Works
Still showing 0 tokens?
- Rebuild with `cargo build`
- Check that the Chat wire API is in use (`wire_api` in config)
- Check that the llama.cpp log shows `POST /v1/chat/completions`
Context percentage wrong?
- Set `context_window` in model_info.rs to match your llama.cpp `-c` value
Auto-compact not triggering?
- Check that `model_auto_compact_token_limit` is set in config.toml
📊 Token Count Breakdown (For Reference)
Typical baseline overhead:
Optimal baseline: ~5,600 tokens with lightweight prompt
Heavy baseline: ~9,800 tokens with full prompt + features
🙏 Credits
Fix developed by Claude (Anthropic) working with Jeff Bulger on 2026-01-14.
This was a multi-hour debugging session tracing token flow from llama.cpp → SSE parser → TUI. The key insight: the Chat wire API (for local models) never implemented usage extraction, even though Responses API (cloud) had it.
The Problem: Hardcoded `token_usage: None` in the completion handler
The Fix: Extract `usage` from llama.cpp's JSON response and pass it through
📜 License
MIT - Use freely, attribution appreciated.
"When Nemo runs well, he runs GREAT. This config makes it happen." - Jeff
In the Codex benchmark, on 12 GB of VRAM!
Codex Benchmark (12GB VRAM)
llama.cpp log from a small run, for those interested. I can also run this model at a million-token context on 12 GB of VRAM, thanks to llama.cpp.
jeff@jeff-STGAUBRON 10:20 ~/Desktop/Test_llama_build/test_llamaRoc.cpp
$ cd /home/jeff/Desktop/Test_llama_build/test_llamaRoc.cpp && export HSA_OVERRIDE_GFX_VERSION=10.3.0 && export HIP_VISIBLE_DEVICES=0 && export HSA_ENABLE_SDMA=0 && export HSA_DISABLE_FRAGMENT_ALLOCATOR=1 && export LD_LIBRARY_PATH="/home/jeff/Desktop/Test_llama_build/rocm-local/opt/rocm-7.1.1/lib:/home/jeff/Desktop/Test_llama_build/rocm-local/opt/rocm-7.1.1/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}" && MODEL="/home/jeff/Desktop/models/Nemotron-3-Nano-30B-A3B-IQ4_NL.gguf" && ./build-rocm-b7622-hipcc/bin/llama-server -m "$MODEL" -c 262144 -b 1625 -ub 20000 --cache-type-k q4_0 --cache-type-v q4_0 -fa on
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon RX 6700 XT, gfx1030 (0x1030), VMM: no, Wave Size: 32
main: n_parallel is set to auto, using n_parallel = 4 and kv_unified = true
build: 7622 (c69c7eb) with Clang 20.0.0 for Linux x86_64
system info: n_threads = 8, n_threads_batch = 8, total_threads = 16
system_info: n_threads = 8 (n_threads_batch = 8) / 16 | ROCm : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
init: using 15 threads for HTTP server
start: binding port with default address family
main: loading model
srv load_model: loading model '/home/jeff/Desktop/models/Nemotron-3-Nano-30B-A3B-IQ4_NL.gguf'
common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on
llama_params_fit_impl: projected to use 20221 MiB of device memory vs. 11844 MiB of free device memory
llama_params_fit_impl: cannot fulfill margin of 1024 MiB, need to reduce device memory by 9401 MiB
llama_params_fit_impl: context size set by user to 262144 -> no change
llama_params_fit_impl: with only dense weights in device memory there is a total surplus of 5985 MiB
llama_params_fit_impl: filling dense-only layers back-to-front:
llama_params_fit_impl: - ROCm0 (AMD Radeon RX 6700 XT): 53 layers, 4866 MiB used, 6977 MiB free
llama_params_fit_impl: converting dense-only layers to full layers and filling them front-to-back with overflow to next device/system memory:
llama_params_fit_impl: - ROCm0 (AMD Radeon RX 6700 XT): 53 layers (31 overflowing), 10611 MiB used, 1232 MiB free
llama_params_fit: successfully fit params to free device memory
llama_params_fit: fitting params to free memory took 2.46 seconds
llama_model_load_from_file_impl: using device ROCm0 (AMD Radeon RX 6700 XT) (0000:03:00.0) - 12068 MiB free
llama_model_loader: loaded meta data with 53 key-value pairs and 401 tensors from /home/jeff/Desktop/models/Nemotron-3-Nano-30B-A3B-IQ4_NL.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = nemotron_h_moe
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.sampling.top_p f32 = 1.000000
llama_model_loader: - kv 3: general.sampling.temp f32 = 1.000000
llama_model_loader: - kv 4: general.name str = Nemotron-3-Nano-30B-A3B
llama_model_loader: - kv 5: general.basename str = Nemotron-3-Nano-30B-A3B
llama_model_loader: - kv 6: general.quantized_by str = Unsloth
llama_model_loader: - kv 7: general.size_label str = 30B-A3B
llama_model_loader: - kv 8: general.repo_url str = https://huggingface.co/unsloth
llama_model_loader: - kv 9: nemotron_h_moe.block_count u32 = 52
llama_model_loader: - kv 10: nemotron_h_moe.context_length u32 = 1048576
llama_model_loader: - kv 11: nemotron_h_moe.embedding_length u32 = 2688
llama_model_loader: - kv 12: nemotron_h_moe.feed_forward_length arr[i32,52] = [0, 1856, 0, 1856, 0, 0, 1856, 0, 185...
llama_model_loader: - kv 13: nemotron_h_moe.attention.head_count u32 = 32
llama_model_loader: - kv 14: nemotron_h_moe.attention.head_count_kv arr[i32,52] = [0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, ...
llama_model_loader: - kv 15: nemotron_h_moe.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 16: nemotron_h_moe.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 17: nemotron_h_moe.attention.layer_norm_epsilon f32 = 0.000010
llama_model_loader: - kv 18: nemotron_h_moe.expert_used_count u32 = 6
llama_model_loader: - kv 19: nemotron_h_moe.expert_group_count u32 = 1
llama_model_loader: - kv 20: nemotron_h_moe.expert_group_used_count u32 = 1
llama_model_loader: - kv 21: nemotron_h_moe.vocab_size u32 = 131072
llama_model_loader: - kv 22: nemotron_h_moe.rope.dimension_count u32 = 84
llama_model_loader: - kv 23: nemotron_h_moe.ssm.conv_kernel u32 = 4
llama_model_loader: - kv 24: nemotron_h_moe.ssm.state_size u32 = 128
llama_model_loader: - kv 25: nemotron_h_moe.ssm.group_count u32 = 8
llama_model_loader: - kv 26: nemotron_h_moe.ssm.inner_size u32 = 4096
llama_model_loader: - kv 27: nemotron_h_moe.ssm.time_step_rank u32 = 64
llama_model_loader: - kv 28: nemotron_h_moe.rope.scaling.finetuned bool = false
llama_model_loader: - kv 29: nemotron_h_moe.attention.key_length u32 = 128
llama_model_loader: - kv 30: nemotron_h_moe.attention.value_length u32 = 128
llama_model_loader: - kv 31: nemotron_h_moe.expert_feed_forward_length u32 = 1856
llama_model_loader: - kv 32: nemotron_h_moe.expert_shared_feed_forward_length u32 = 3712
llama_model_loader: - kv 33: nemotron_h_moe.expert_count u32 = 128
llama_model_loader: - kv 34: nemotron_h_moe.expert_shared_count u32 = 1
llama_model_loader: - kv 35: nemotron_h_moe.expert_weights_norm bool = true
llama_model_loader: - kv 36: nemotron_h_moe.expert_weights_scale f32 = 2.500000
llama_model_loader: - kv 37: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 38: tokenizer.ggml.pre str = pixtral
llama_model_loader: - kv 39: tokenizer.ggml.tokens arr[str,131072] = ["", "", "", "[INST]", "[...
llama_model_loader: - kv 40: tokenizer.ggml.token_type arr[i32,131072] = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv 41: tokenizer.ggml.merges arr[str,269443] = ["Ġ Ġ", "Ġ t", "e r", "i n", "Ġ �...
llama_model_loader: - kv 42: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 43: tokenizer.ggml.eos_token_id u32 = 11
llama_model_loader: - kv 44: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 45: tokenizer.ggml.padding_token_id u32 = 999
llama_model_loader: - kv 46: tokenizer.chat_template str = {# Unsloth template fixes #}\n{% macro...
llama_model_loader: - kv 47: general.quantization_version u32 = 2
llama_model_loader: - kv 48: general.file_type u32 = 25
llama_model_loader: - kv 49: quantize.imatrix.file str = Nemotron-3-Nano-30B-A3B-GGUF/imatrix_...
llama_model_loader: - kv 50: quantize.imatrix.dataset str = unsloth_calibration_Nemotron-3-Nano-3...
llama_model_loader: - kv 51: quantize.imatrix.entries_count u32 = 185
llama_model_loader: - kv 52: quantize.imatrix.chunks_count u32 = 80
llama_model_loader: - type f32: 237 tensors
llama_model_loader: - type q5_0: 1 tensors
llama_model_loader: - type q5_1: 23 tensors
llama_model_loader: - type q8_0: 24 tensors
llama_model_loader: - type iq4_nl: 116 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = IQ4_NL - 4.5 bpw
print_info: file size = 16.92 GiB (4.60 BPW)
load: 0 unused tokens
load: printing all EOG tokens:
load: - 11 ('<|im_end|>')
load: special tokens cache size = 1000
load: token to piece cache size = 0.8499 MB
print_info: arch = nemotron_h_moe
print_info: vocab_only = 0
print_info: no_alloc = 0
print_info: n_ctx_train = 1048576
print_info: n_embd = 2688
print_info: n_embd_inp = 2688
print_info: n_layer = 52
print_info: n_head = 32
print_info: n_head_kv = [0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0]
print_info: n_rot = 84
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = [0, 0, 0, 0, 0, 16, 0, 0, 0, 0, 0, 0, 16, 0, 0, 0, 0, 0, 0, 16, 0, 0, 0, 0, 0, 0, 16, 0, 0, 0, 0, 0, 0, 16, 0, 0, 0, 0, 0, 0, 0, 0, 16, 0, 0, 0, 0, 0, 0, 0, 0, 0]
print_info: n_embd_k_gqa = [0, 0, 0, 0, 0, 256, 0, 0, 0, 0, 0, 0, 256, 0, 0, 0, 0, 0, 0, 256, 0, 0, 0, 0, 0, 0, 256, 0, 0, 0, 0, 0, 0, 256, 0, 0, 0, 0, 0, 0, 0, 0, 256, 0, 0, 0, 0, 0, 0, 0, 0, 0]
print_info: n_embd_v_gqa = [0, 0, 0, 0, 0, 256, 0, 0, 0, 0, 0, 0, 256, 0, 0, 0, 0, 0, 0, 256, 0, 0, 0, 0, 0, 0, 256, 0, 0, 0, 0, 0, 0, 256, 0, 0, 0, 0, 0, 0, 0, 0, 256, 0, 0, 0, 0, 0, 0, 0, 0, 0]
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = [0, 1856, 0, 1856, 0, 0, 1856, 0, 1856, 0, 1856, 0, 0, 1856, 0, 1856, 0, 1856, 0, 0, 1856, 0, 1856, 0, 1856, 0, 0, 1856, 0, 1856, 0, 1856, 0, 0, 1856, 0, 1856, 0, 1856, 0, 1856, 0, 0, 1856, 0, 1856, 0, 1856, 0, 1856, 0, 1856]
print_info: n_expert = 128
print_info: n_expert_used = 6
print_info: n_expert_groups = 1
print_info: n_group_used = 1
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = -1
print_info: rope scaling = linear
print_info: freq_base_train = 10000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 1048576
print_info: rope_yarn_log_mul= 0.0000
print_info: rope_finetuned = unknown
print_info: ssm_d_conv = 4
print_info: ssm_d_inner = 4096
print_info: ssm_d_state = 128
print_info: ssm_dt_rank = 64
print_info: ssm_n_group = 8
print_info: ssm_dt_b_c_rms = 0
print_info: model type = 31B.A3.5B
print_info: model params = 31.58 B
print_info: general.name = Nemotron-3-Nano-30B-A3B
print_info: f_embedding_scale = 0.000000
print_info: f_residual_scale = 0.000000
print_info: f_attention_scale = 0.000000
print_info: n_ff_shexp = 3712
print_info: vocab type = BPE
print_info: n_vocab = 131072
print_info: n_merges = 269443
print_info: BOS token = 1 ''
print_info: EOS token = 11 '<|im_end|>'
print_info: EOT token = 11 '<|im_end|>'
print_info: UNK token = 0 ''
print_info: PAD token = 999 '<SPECIAL_999>'
print_info: LF token = 1010 'Ċ'
print_info: EOG token = 11 '<|im_end|>'
print_info: max token length = 150
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading output layer to GPU
load_tensors: offloading 51 repeating layers to GPU
load_tensors: offloaded 53/53 layers to GPU
load_tensors: CPU_Mapped model buffer size = 16965.10 MiB
load_tensors: ROCm0 model buffer size = 7487.97 MiB
.......................................................
common_init_result: added <|im_end|> logit bias = -inf
llama_context: constructing llama_context
llama_context: n_seq_max = 4
llama_context: n_ctx = 262144
llama_context: n_ctx_seq = 262144
llama_context: n_batch = 1625
llama_context: n_ubatch = 1625
llama_context: causal_attn = 1
llama_context: flash_attn = enabled
llama_context: kv_unified = true
llama_context: freq_base = 10000.0
llama_context: freq_scale = 1
llama_context: n_ctx_seq (262144) < n_ctx_train (1048576) -- the full capacity of the model will not be utilized
llama_context: ROCm_Host output buffer size = 2.00 MiB
llama_kv_cache: ROCm0 KV buffer size = 432.00 MiB
llama_kv_cache: size = 432.00 MiB (262144 cells, 6 layers, 4/1 seqs), K (q4_0): 216.00 MiB, V (q4_0): 216.00 MiB
llama_memory_recurrent: ROCm0 RS buffer size = 190.47 MiB
llama_memory_recurrent: size = 190.47 MiB ( 4 cells, 52 layers, 4 seqs), R (f32): 6.47 MiB, S (f32): 184.00 MiB
llama_context: ROCm0 compute buffer size = 2501.13 MiB
llama_context: ROCm_Host compute buffer size = 1644.73 MiB
llama_context: graph nodes = 2188
llama_context: graph splits = 33 (with bs=1625), 34 (with bs=1)
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv load_model: initializing slots, n_slots = 4
slot load_model: id 0 | task -1 | new slot, n_ctx = 262144
slot load_model: id 1 | task -1 | new slot, n_ctx = 262144
slot load_model: id 2 | task -1 | new slot, n_ctx = 262144
slot load_model: id 3 | task -1 | new slot, n_ctx = 262144
srv load_model: prompt cache is enabled, size limit: 8192 MiB
srv load_model: use `--cache-ram 0` to disable the prompt cache
srv load_model: for more info see https://github.com//pull/16391
srv load_model: thinking = 1
load_model: chat template, chat_template: {# Unsloth template fixes #}
{% macro render_extra_keys(json_dict, handled_keys) %}
{%- if json_dict is mapping %}
{%- for json_key in json_dict if json_key not in handled_keys %}
{%- if json_dict[json_key] is mapping or (json_dict[json_key] is sequence and json_dict[json_key] is not string) %}
{{- '\n<' ~ json_key ~ '>' ~ (json_dict[json_key] | tojson | safe) ~ '</' ~ json_key ~ '>' }}
{%- else %}
{{-'\n<' ~ json_key ~ '>' ~ (json_dict[json_key] | string) ~ '</' ~ json_key ~ '>' }}
{%- endif %}
{%- endfor %}
{%- endif %}
{% endmacro %}
{%- set enable_thinking = enable_thinking if enable_thinking is defined else True %}
{%- set truncate_history_thinking = truncate_history_thinking if truncate_history_thinking is defined else True %}
{%- set ns = namespace(last_user_idx = -1) %}
{%- set loop_messages = messages %}
{%- for m in loop_messages %}
{%- if m["role"] == "user" %}
{%- set ns.last_user_idx = loop.index0 %}
{%- endif %}
{%- endfor %}
{%- if messages[0]["role"] == "system" %}
{%- set system_message = messages[0]["content"] %}
{%- set loop_messages = messages[1:] %}
{%- else %}
{%- set system_message = "" %}
{%- set loop_messages = messages %}
{%- endif %}
{%- if not tools is defined %}
{%- set tools = [] %}
{%- endif %}
{# Recompute last_user_idx relative to loop_messages after handling system #}
{%- set ns = namespace(last_user_idx = -1) %}
{%- for m in loop_messages %}
{%- if m["role"] == "user" %}
{%- set ns.last_user_idx = loop.index0 %}
{%- endif %}
{%- endfor %}
{%- if system_message is defined %}
{{- "<|im_start|>system\n" + system_message }}
{%- else %}
{%- if tools is iterable and tools | length > 0 %}
{{- "<|im_start|>system\n" }}
{%- endif %}
{%- endif %}
{%- if tools is iterable and tools | length > 0 %}
{%- if system_message is defined and system_message | length > 0 %}
{{- "\n\n" }}
{%- endif %}
{{- "# Tools\n\nYou have access to the following functions:\n\n" }}
{{- "" }}
{%- for tool in tools %}
{%- if tool.function is defined %}
{%- set tool = tool.function %}
{%- endif %}
{{- "\n\n" ~ tool.name ~ "" }}
{%- if tool.description is defined %}
{{- '\n' ~ (tool.description | trim) ~ '' }}
{%- endif %}
{{- '\n' }}
{%- if tool.parameters is defined and tool.parameters is mapping and tool.parameters.properties is defined and tool.parameters.properties is mapping %}
{%- for param_name, param_fields in tool.parameters.properties|items %}
{{- '\n' }}
{{- '\n' ~ param_name ~ '' }}
{%- if param_fields.type is defined %}
{{- '\n' ~ (param_fields.type | string) ~ '' }}
{%- endif %}
{%- if param_fields.description is defined %}
{{- '\n' ~ (param_fields.description | trim) ~ '' }}
{%- endif %}
{%- if param_fields.enum is defined %}
{{- '\n' ~ (param_fields.enum | tojson | safe) ~ '' }}
{%- endif %}
{%- set handled_keys = ['name', 'type', 'description', 'enum'] %}
{{- render_extra_keys(param_fields, handled_keys) }}
{{- '\n' }}
{%- endfor %}
{%- endif %}
{% set handled_keys = ['type', 'properties', 'required'] %}
{{- render_extra_keys(tool.parameters, handled_keys) }}
{%- if tool.parameters is defined and tool.parameters.required is defined %}
{{- '\n' ~ (tool.parameters.required | tojson | safe) ~ '' }}
{%- endif %}
{{- '\n' }}
{%- set handled_keys = ['type', 'name', 'description', 'parameters'] %}
{{- render_extra_keys(tool, handled_keys) }}
{{- '\n' }}
{%- endfor %}
{{- "\n" }}
{%- endif %}
{%- if system_message is defined %}
{{- '<|im_end|>\n' }}
{%- else %}
{%- if tools is iterable and tools | length > 0 %}
{{- '<|im_end|>\n' }}
{%- endif %}
{%- endif %}
{%- for message in loop_messages %}
{%- if message.role == "assistant" %}
{# Add reasoning content in to content field for unified processing below. #}
{%- if message.reasoning_content is defined and message.reasoning_content is string and message.reasoning_content | trim | length > 0 %}
{%- set content = "\n" ~ message.reasoning_content ~ "\n\n" ~ (message.content | default('', true)) %}
{%- else %}
{%- set content = message.content | default('', true) %}
{%- if content is string -%}
{# Allow downstream logic to to take care of broken thought, only handle coherent reasoning here. #}
{%- if '' not in content and '' not in content -%}
{%- set content = "" ~ content -%}
{%- endif -%}
{%- else -%}
{%- set content = content -%}
{%- endif -%}
{%- endif %}
{%- if message.tool_calls is defined and message.tool_calls is iterable and message.tool_calls | length > 0 %}
{# Assistant message has tool calls. #}
{{- '<|im_start|>assistant\n' }}
{%- set include_content = not (truncate_history_thinking and loop.index0 < ns.last_user_idx) %}
{%- if content is string and content | trim | length > 0 %}
{%- if include_content %}
{{- (content | trim) ~ '\n' -}}
{%- else %}
{%- set c = (content | string) %}
{%- if '' in c %}
{# Keep only content after the last closing think. Also generation prompt causes this. #}
{%- set c = (c.split('')|last) %}
{%- elif '' in c %}
{# If was opened but never closed, drop the trailing think segment #}
{%- set c = (c.split('')|first) %}
{%- endif %}
{%- set c = "" ~ c | trim %}
{%- if c | length > 0 %}
{{- c ~ '\n' -}}
{%- endif %}
{%- endif %}
{%- else %}
{{- "" -}}
{%- endif %}
{%- for tool_call in message.tool_calls %}
{%- if tool_call.function is defined %}
{%- set tool_call = tool_call.function %}
{%- endif %}
{{- '<tool_call>\n<function=' ~ tool_call.name ~ '>\n' -}}
{%- if tool_call.arguments is defined %}{%- if tool_call.arguments is mapping %}
{%- for args_name, args_value in tool_call.arguments|items %}
{{- '<parameter=' ~ args_name ~ '>\n' -}}
{%- set args_value = args_value | tojson | safe if args_value is mapping or (args_value is sequence and args_value is not string) else args_value | string %}
{{- args_value ~ '\n\n' -}}
{%- endfor %}{%- endif %}
{%- endif %}
{{- '\n</tool_call>\n' -}}
{%- endfor %}
{{- '<|im_end|>\n' }}
{%- else %}
{# Assistant message doesn't have tool calls. #}
{%- if not (truncate_history_thinking and loop.index0 < ns.last_user_idx) %}
{{- '<|im_start|>assistant\n' ~ (content | default('', true) | string | trim) ~ '<|im_end|>\n' }}
{%- else %}
{%- set c = (content | default('', true) | string) %}
{%- if '' in c and '' in c %}
{%- set c = "" ~ (c.split('')|last) %}
{%- endif %}
{%- set c = c | trim %}
{%- if c | length > 0 %}
{{- '<|im_start|>assistant\n' ~ c ~ '<|im_end|>\n' }}
{%- else %}
{{- '<|im_start|>assistant\n<|im_end|>\n' }}
{%- endif %}
{%- endif %}
{%- endif %}
{%- elif message.role == "user" or message.role == "system" %}
{{- '<|im_start|>' + message.role + '\n' }}
{%- set content = message.content | string %}
{{- content }}
{{- '<|im_end|>\n' }}
{%- elif message.role == "tool" %}
{%- if loop.previtem and loop.previtem.role != "tool" %}
{{- '<|im_start|>user\n' }}
{%- endif %}
{{- '<tool_response>\n' }}
{{- message.content }}
{{- '\n</tool_response>\n' }}
{%- if not loop.last and loop.nextitem.role != "tool" %}
{{- '<|im_end|>\n' }}
{%- elif loop.last %}
{{- '<|im_end|>\n' }}
{%- endif %}
{%- else %}
{{- '<|im_start|>' + message.role + '\n' + message.content + '<|im_end|>\n' }}
{%- endif %}
{%- endfor %}
{%- if add_generation_prompt %}
{%- if enable_thinking %}
{{- '<|im_start|>assistant\n\n' }}
{%- else %}
{{- '<|im_start|>assistant\n' }}
{%- endif %}
{%- endif %}
{# Copyright 2025-present Unsloth. Apache 2.0 License. #}, example_format: '<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
'
main: model loaded
main: server is listening on http://127.0.0.1:8080
main: starting the main loop...
srv update_slots: all slots are idle
srv log_server_r: request: GET /api/tags 127.0.0.1 200
srv update_slots: all slots are idle
srv log_server_r: request: GET /slots 127.0.0.1 200
srv params_from_: Chat format: peg-constructed
slot get_availabl: id 3 | task -1 | selected slot by LRU, t_last = -1
slot launch_slot_: id 3 | task -1 | sampler chain: logits -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
slot launch_slot_: id 3 | task 1 | processing task
slot update_slots: id 3 | task 1 | new prompt, n_ctx_slot = 262144, n_keep = 0, task.n_tokens = 9972
slot update_slots: id 3 | task 1 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id 3 | task 1 | prompt processing progress, n_tokens = 1625, batch.n_tokens = 1625, progress = 0.162956
slot update_slots: id 3 | task 1 | n_tokens = 1625, memory_seq_rm [1625, end)
slot update_slots: id 3 | task 1 | prompt processing progress, n_tokens = 3250, batch.n_tokens = 1625, progress = 0.325913
slot update_slots: id 3 | task 1 | n_tokens = 3250, memory_seq_rm [3250, end)
slot update_slots: id 3 | task 1 | prompt processing progress, n_tokens = 4875, batch.n_tokens = 1625, progress = 0.488869
srv stop: cancel task, id_task = 1
srv log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
slot release: id 3 | task 1 | stop processing: n_tokens = 4875, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: request: GET /api/tags 127.0.0.1 200
srv update_slots: all slots are idle
srv log_server_r: request: GET /slots 127.0.0.1 200
srv params_from_: Chat format: peg-constructed
slot get_availabl: id 2 | task -1 | selected slot by LRU, t_last = -1
slot launch_slot_: id 2 | task -1 | sampler chain: logits -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
slot launch_slot_: id 2 | task 7 | processing task
slot update_slots: id 2 | task 7 | new prompt, n_ctx_slot = 262144, n_keep = 0, task.n_tokens = 5605
slot update_slots: id 2 | task 7 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id 2 | task 7 | prompt processing progress, n_tokens = 1625, batch.n_tokens = 1625, progress = 0.289920
slot update_slots: id 2 | task 7 | n_tokens = 1625, memory_seq_rm [1625, end)
slot update_slots: id 2 | task 7 | prompt processing progress, n_tokens = 3250, batch.n_tokens = 1625, progress = 0.579839
slot update_slots: id 2 | task 7 | n_tokens = 3250, memory_seq_rm [3250, end)
slot update_slots: id 2 | task 7 | prompt processing progress, n_tokens = 4875, batch.n_tokens = 1625, progress = 0.869759
slot update_slots: id 2 | task 7 | n_tokens = 4875, memory_seq_rm [4875, end)
slot update_slots: id 2 | task 7 | prompt processing progress, n_tokens = 5541, batch.n_tokens = 666, progress = 0.988582
slot update_slots: id 2 | task 7 | n_tokens = 5541, memory_seq_rm [5541, end)
slot update_slots: id 2 | task 7 | prompt processing progress, n_tokens = 5605, batch.n_tokens = 64, progress = 1.000000
slot update_slots: id 2 | task 7 | prompt done, n_tokens = 5605, batch.n_tokens = 64
slot update_slots: id 2 | task 7 | created context checkpoint 1 of 8 (pos_min = 5540, pos_max = 5540, size = 47.618 MiB)
slot print_timing: id 2 | task 7 |
prompt eval time = 8854.74 ms / 5605 tokens ( 1.58 ms per token, 632.99 tokens per second)
eval time = 19107.85 ms / 673 tokens ( 28.39 ms per token, 35.22 tokens per second)
total time = 27962.59 ms / 6278 tokens
slot release: id 2 | task 7 | stop processing: n_tokens = 6277, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
srv params_from_: Chat format: peg-constructed
slot get_availabl: id 2 | task -1 | selected slot by LCP similarity, sim_best = 0.982 (> 0.100 thold), f_keep = 0.893
slot launch_slot_: id 2 | task -1 | sampler chain: logits -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
slot launch_slot_: id 2 | task 685 | processing task
slot update_slots: id 2 | task 685 | new prompt, n_ctx_slot = 262144, n_keep = 0, task.n_tokens = 5706
slot update_slots: id 2 | task 685 | n_past = 5604, slot.prompt.tokens.size() = 6277, seq_id = 2, pos_min = 6276, n_swa = 1
slot update_slots: id 2 | task 685 | restored context checkpoint (pos_min = 5540, pos_max = 5540, size = 47.618 MiB)
slot update_slots: id 2 | task 685 | n_tokens = 5541, memory_seq_rm [5541, end)
slot update_slots: id 2 | task 685 | prompt processing progress, n_tokens = 5642, batch.n_tokens = 101, progress = 0.988784
slot update_slots: id 2 | task 685 | n_tokens = 5642, memory_seq_rm [5642, end)
slot update_slots: id 2 | task 685 | prompt processing progress, n_tokens = 5706, batch.n_tokens = 64, progress = 1.000000
slot update_slots: id 2 | task 685 | prompt done, n_tokens = 5706, batch.n_tokens = 64
slot update_slots: id 2 | task 685 | created context checkpoint 2 of 8 (pos_min = 5641, pos_max = 5641, size = 47.618 MiB)
slot print_timing: id 2 | task 685 |
prompt eval time = 1701.31 ms / 165 tokens ( 10.31 ms per token, 96.98 tokens per second)
eval time = 5458.47 ms / 193 tokens ( 28.28 ms per token, 35.36 tokens per second)
total time = 7159.78 ms / 358 tokens
slot release: id 2 | task 685 | stop processing: n_tokens = 5898, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
srv params_from_: Chat format: peg-constructed
slot get_availabl: id 2 | task -1 | selected slot by LCP similarity, sim_best = 0.349 (> 0.100 thold), f_keep = 0.967
slot launch_slot_: id 2 | task -1 | sampler chain: logits -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
slot launch_slot_: id 2 | task 880 | processing task
slot update_slots: id 2 | task 880 | new prompt, n_ctx_slot = 262144, n_keep = 0, task.n_tokens = 16333
slot update_slots: id 2 | task 880 | n_past = 5705, slot.prompt.tokens.size() = 5898, seq_id = 2, pos_min = 5897, n_swa = 1
slot update_slots: id 2 | task 880 | restored context checkpoint (pos_min = 5641, pos_max = 5641, size = 47.618 MiB)
slot update_slots: id 2 | task 880 | n_tokens = 5642, memory_seq_rm [5642, end)
slot update_slots: id 2 | task 880 | prompt processing progress, n_tokens = 7267, batch.n_tokens = 1625, progress = 0.444927
slot update_slots: id 2 | task 880 | n_tokens = 7267, memory_seq_rm [7267, end)
slot update_slots: id 2 | task 880 | prompt processing progress, n_tokens = 8892, batch.n_tokens = 1625, progress = 0.544419
slot update_slots: id 2 | task 880 | n_tokens = 8892, memory_seq_rm [8892, end)
slot update_slots: id 2 | task 880 | prompt processing progress, n_tokens = 10517, batch.n_tokens = 1625, progress = 0.643911
slot update_slots: id 2 | task 880 | n_tokens = 10517, memory_seq_rm [10517, end)
slot update_slots: id 2 | task 880 | prompt processing progress, n_tokens = 12142, batch.n_tokens = 1625, progress = 0.743403
slot update_slots: id 2 | task 880 | n_tokens = 12142, memory_seq_rm [12142, end)
slot update_slots: id 2 | task 880 | prompt processing progress, n_tokens = 13767, batch.n_tokens = 1625, progress = 0.842895
slot update_slots: id 2 | task 880 | n_tokens = 13767, memory_seq_rm [13767, end)
slot update_slots: id 2 | task 880 | prompt processing progress, n_tokens = 15392, batch.n_tokens = 1625, progress = 0.942387
slot update_slots: id 2 | task 880 | n_tokens = 15392, memory_seq_rm [15392, end)
slot update_slots: id 2 | task 880 | prompt processing progress, n_tokens = 16269, batch.n_tokens = 877, progress = 0.996082
slot update_slots: id 2 | task 880 | n_tokens = 16269, memory_seq_rm [16269, end)
slot update_slots: id 2 | task 880 | prompt processing progress, n_tokens = 16333, batch.n_tokens = 64, progress = 1.000000
slot update_slots: id 2 | task 880 | prompt done, n_tokens = 16333, batch.n_tokens = 64
slot update_slots: id 2 | task 880 | created context checkpoint 3 of 8 (pos_min = 16268, pos_max = 16268, size = 47.618 MiB)
slot print_timing: id 2 | task 880 |
prompt eval time = 16780.02 ms / 10691 tokens ( 1.57 ms per token, 637.13 tokens per second)
eval time = 2609.34 ms / 89 tokens ( 29.32 ms per token, 34.11 tokens per second)
total time = 19389.35 ms / 10780 tokens
slot release: id 2 | task 880 | stop processing: n_tokens = 16421, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
srv params_from_: Chat format: peg-constructed
slot get_availabl: id 2 | task -1 | selected slot by LCP similarity, sim_best = 0.607 (> 0.100 thold), f_keep = 0.995
slot launch_slot_: id 2 | task -1 | sampler chain: logits -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
slot launch_slot_: id 2 | task 977 | processing task
slot update_slots: id 2 | task 977 | new prompt, n_ctx_slot = 262144, n_keep = 0, task.n_tokens = 26911
slot update_slots: id 2 | task 977 | n_past = 16332, slot.prompt.tokens.size() = 16421, seq_id = 2, pos_min = 16420, n_swa = 1
slot update_slots: id 2 | task 977 | restored context checkpoint (pos_min = 16268, pos_max = 16268, size = 47.618 MiB)
slot update_slots: id 2 | task 977 | n_tokens = 16269, memory_seq_rm [16269, end)
slot update_slots: id 2 | task 977 | prompt processing progress, n_tokens = 17894, batch.n_tokens = 1625, progress = 0.664933
slot update_slots: id 2 | task 977 | n_tokens = 17894, memory_seq_rm [17894, end)
slot update_slots: id 2 | task 977 | prompt processing progress, n_tokens = 19519, batch.n_tokens = 1625, progress = 0.725317
slot update_slots: id 2 | task 977 | n_tokens = 19519, memory_seq_rm [19519, end)
slot update_slots: id 2 | task 977 | prompt processing progress, n_tokens = 21144, batch.n_tokens = 1625, progress = 0.785701
slot update_slots: id 2 | task 977 | n_tokens = 21144, memory_seq_rm [21144, end)
slot update_slots: id 2 | task 977 | prompt processing progress, n_tokens = 22769, batch.n_tokens = 1625, progress = 0.846085
slot update_slots: id 2 | task 977 | n_tokens = 22769, memory_seq_rm [22769, end)
slot update_slots: id 2 | task 977 | prompt processing progress, n_tokens = 24394, batch.n_tokens = 1625, progress = 0.906469
slot update_slots: id 2 | task 977 | n_tokens = 24394, memory_seq_rm [24394, end)
slot update_slots: id 2 | task 977 | prompt processing progress, n_tokens = 26019, batch.n_tokens = 1625, progress = 0.966854
slot update_slots: id 2 | task 977 | n_tokens = 26019, memory_seq_rm [26019, end)
slot update_slots: id 2 | task 977 | prompt processing progress, n_tokens = 26847, batch.n_tokens = 828, progress = 0.997622
slot update_slots: id 2 | task 977 | n_tokens = 26847, memory_seq_rm [26847, end)
slot update_slots: id 2 | task 977 | prompt processing progress, n_tokens = 26911, batch.n_tokens = 64, progress = 1.000000
slot update_slots: id 2 | task 977 | prompt done, n_tokens = 26911, batch.n_tokens = 64
slot update_slots: id 2 | task 977 | created context checkpoint 4 of 8 (pos_min = 26846, pos_max = 26846, size = 47.618 MiB)
slot print_timing: id 2 | task 977 |
prompt eval time = 18209.44 ms / 10642 tokens ( 1.71 ms per token, 584.42 tokens per second)
eval time = 12526.74 ms / 406 tokens ( 30.85 ms per token, 32.41 tokens per second)
total time = 30736.18 ms / 11048 tokens
slot release: id 2 | task 977 | stop processing: n_tokens = 27316, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
srv params_from_: Chat format: peg-constructed
slot get_availabl: id 2 | task -1 | selected slot by LCP similarity, sim_best = 1.000 (> 0.100 thold), f_keep = 0.985
slot launch_slot_: id 2 | task -1 | sampler chain: logits -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
slot launch_slot_: id 2 | task 1391 | processing task
slot update_slots: id 2 | task 1391 | new prompt, n_ctx_slot = 262144, n_keep = 0, task.n_tokens = 26919
slot update_slots: id 2 | task 1391 | n_past = 26906, slot.prompt.tokens.size() = 27316, seq_id = 2, pos_min = 27315, n_swa = 1
slot update_slots: id 2 | task 1391 | restored context checkpoint (pos_min = 26846, pos_max = 26846, size = 47.618 MiB)
slot update_slots: id 2 | task 1391 | n_tokens = 26847, memory_seq_rm [26847, end)
slot update_slots: id 2 | task 1391 | prompt processing progress, n_tokens = 26855, batch.n_tokens = 8, progress = 0.997622
slot update_slots: id 2 | task 1391 | n_tokens = 26855, memory_seq_rm [26855, end)
slot update_slots: id 2 | task 1391 | prompt processing progress, n_tokens = 26919, batch.n_tokens = 64, progress = 1.000000
slot update_slots: id 2 | task 1391 | prompt done, n_tokens = 26919, batch.n_tokens = 64
slot print_timing: id 2 | task 1391 |
prompt eval time = 995.73 ms / 72 tokens ( 13.83 ms per token, 72.31 tokens per second)
eval time = 30936.83 ms / 1004 tokens ( 30.81 ms per token, 32.45 tokens per second)
total time = 31932.55 ms / 1076 tokens
slot release: id 2 | task 1391 | stop processing: n_tokens = 27922, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
srv params_from_: Chat format: peg-constructed
slot get_availabl: id 2 | task -1 | selected slot by LCP similarity, sim_best = 0.982 (> 0.100 thold), f_keep = 0.964
slot launch_slot_: id 2 | task -1 | sampler chain: logits -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
slot launch_slot_: id 2 | task 2397 | processing task
slot update_slots: id 2 | task 2397 | new prompt, n_ctx_slot = 262144, n_keep = 0, task.n_tokens = 27412
slot update_slots: id 2 | task 2397 | n_past = 26918, slot.prompt.tokens.size() = 27922, seq_id = 2, pos_min = 27921, n_swa = 1
slot update_slots: id 2 | task 2397 | restored context checkpoint (pos_min = 26846, pos_max = 26846, size = 47.618 MiB)
slot update_slots: id 2 | task 2397 | n_tokens = 26847, memory_seq_rm [26847, end)
slot update_slots: id 2 | task 2397 | prompt processing progress, n_tokens = 27348, batch.n_tokens = 501, progress = 0.997665
slot update_slots: id 2 | task 2397 | n_tokens = 27348, memory_seq_rm [27348, end)
slot update_slots: id 2 | task 2397 | prompt processing progress, n_tokens = 27412, batch.n_tokens = 64, progress = 1.000000
slot update_slots: id 2 | task 2397 | prompt done, n_tokens = 27412, batch.n_tokens = 64
slot update_slots: id 2 | task 2397 | created context checkpoint 5 of 8 (pos_min = 27347, pos_max = 27347, size = 47.618 MiB)
slot print_timing: id 2 | task 2397 |
prompt eval time = 2606.55 ms / 565 tokens ( 4.61 ms per token, 216.76 tokens per second)
eval time = 23907.75 ms / 773 tokens ( 30.93 ms per token, 32.33 tokens per second)
total time = 26514.30 ms / 1338 tokens
slot release: id 2 | task 2397 | stop processing: n_tokens = 28184, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
srv params_from_: Chat format: peg-constructed
slot get_availabl: id 2 | task -1 | selected slot by LCP similarity, sim_best = 0.997 (> 0.100 thold), f_keep = 0.973
slot launch_slot_: id 2 | task -1 | sampler chain: logits -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
slot launch_slot_: id 2 | task 3172 | processing task
slot update_slots: id 2 | task 3172 | new prompt, n_ctx_slot = 262144, n_keep = 0, task.n_tokens = 27480
slot update_slots: id 2 | task 3172 | n_past = 27411, slot.prompt.tokens.size() = 28184, seq_id = 2, pos_min = 28183, n_swa = 1
slot update_slots: id 2 | task 3172 | restored context checkpoint (pos_min = 27347, pos_max = 27347, size = 47.618 MiB)
slot update_slots: id 2 | task 3172 | n_tokens = 27348, memory_seq_rm [27348, end)
slot update_slots: id 2 | task 3172 | prompt processing progress, n_tokens = 27416, batch.n_tokens = 68, progress = 0.997671
slot update_slots: id 2 | task 3172 | n_tokens = 27416, memory_seq_rm [27416, end)
slot update_slots: id 2 | task 3172 | prompt processing progress, n_tokens = 27480, batch.n_tokens = 64, progress = 1.000000
slot update_slots: id 2 | task 3172 | prompt done, n_tokens = 27480, batch.n_tokens = 64
slot update_slots: id 2 | task 3172 | created context checkpoint 6 of 8 (pos_min = 27415, pos_max = 27415, size = 47.618 MiB)
slot print_timing: id 2 | task 3172 |
prompt eval time = 1770.92 ms / 132 tokens ( 13.42 ms per token, 74.54 tokens per second)
eval time = 101896.73 ms / 3191 tokens ( 31.93 ms per token, 31.32 tokens per second)
total time = 103667.65 ms / 3323 tokens
slot release: id 2 | task 3172 | stop processing: n_tokens = 30670, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
srv params_from_: Chat format: peg-constructed
slot get_availabl: id 2 | task -1 | selected slot by LCP similarity, sim_best = 0.993 (> 0.100 thold), f_keep = 0.896
slot launch_slot_: id 2 | task -1 | sampler chain: logits -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
slot launch_slot_: id 2 | task 6365 | processing task
slot update_slots: id 2 | task 6365 | new prompt, n_ctx_slot = 262144, n_keep = 0, task.n_tokens = 27680
slot update_slots: id 2 | task 6365 | n_past = 27479, slot.prompt.tokens.size() = 30670, seq_id = 2, pos_min = 30669, n_swa = 1
slot update_slots: id 2 | task 6365 | restored context checkpoint (pos_min = 27415, pos_max = 27415, size = 47.618 MiB)
slot update_slots: id 2 | task 6365 | n_tokens = 27416, memory_seq_rm [27416, end)
slot update_slots: id 2 | task 6365 | prompt processing progress, n_tokens = 27616, batch.n_tokens = 200, progress = 0.997688
slot update_slots: id 2 | task 6365 | n_tokens = 27616, memory_seq_rm [27616, end)
slot update_slots: id 2 | task 6365 | prompt processing progress, n_tokens = 27680, batch.n_tokens = 64, progress = 1.000000
slot update_slots: id 2 | task 6365 | prompt done, n_tokens = 27680, batch.n_tokens = 64
slot update_slots: id 2 | task 6365 | created context checkpoint 7 of 8 (pos_min = 27615, pos_max = 27615, size = 47.618 MiB)
slot print_timing: id 2 | task 6365 |
prompt eval time = 2179.08 ms / 264 tokens ( 8.25 ms per token, 121.15 tokens per second)
eval time = 50388.43 ms / 1625 tokens ( 31.01 ms per token, 32.25 tokens per second)
total time = 52567.51 ms / 1889 tokens
slot release: id 2 | task 6365 | stop processing: n_tokens = 29304, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
srv params_from_: Chat format: peg-constructed
slot get_availabl: id 2 | task -1 | selected slot by LCP similarity, sim_best = 0.986 (> 0.100 thold), f_keep = 0.945
slot launch_slot_: id 2 | task -1 | sampler chain: logits -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
slot launch_slot_: id 2 | task 7992 | processing task
slot update_slots: id 2 | task 7992 | new prompt, n_ctx_slot = 262144, n_keep = 0, task.n_tokens = 28062
slot update_slots: id 2 | task 7992 | n_past = 27679, slot.prompt.tokens.size() = 29304, seq_id = 2, pos_min = 29303, n_swa = 1
slot update_slots: id 2 | task 7992 | restored context checkpoint (pos_min = 27615, pos_max = 27615, size = 47.618 MiB)
slot update_slots: id 2 | task 7992 | n_tokens = 27616, memory_seq_rm [27616, end)
slot update_slots: id 2 | task 7992 | prompt processing progress, n_tokens = 27998, batch.n_tokens = 382, progress = 0.997719
slot update_slots: id 2 | task 7992 | n_tokens = 27998, memory_seq_rm [27998, end)
slot update_slots: id 2 | task 7992 | prompt processing progress, n_tokens = 28062, batch.n_tokens = 64, progress = 1.000000
slot update_slots: id 2 | task 7992 | prompt done, n_tokens = 28062, batch.n_tokens = 64
slot update_slots: id 2 | task 7992 | created context checkpoint 8 of 8 (pos_min = 27997, pos_max = 27997, size = 47.618 MiB)
slot print_timing: id 2 | task 7992 |
prompt eval time = 2489.96 ms / 446 tokens ( 5.58 ms per token, 179.12 tokens per second)
eval time = 20526.31 ms / 662 tokens ( 31.01 ms per token, 32.25 tokens per second)
total time = 23016.27 ms / 1108 tokens
slot release: id 2 | task 7992 | stop processing: n_tokens = 28723, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
srv params_from_: Chat format: peg-constructed
slot get_availabl: id 2 | task -1 | selected slot by LCP similarity, sim_best = 0.981 (> 0.100 thold), f_keep = 0.977
slot launch_slot_: id 2 | task -1 | sampler chain: logits -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
slot launch_slot_: id 2 | task 8656 | processing task
slot update_slots: id 2 | task 8656 | new prompt, n_ctx_slot = 262144, n_keep = 0, task.n_tokens = 28599
slot update_slots: id 2 | task 8656 | n_past = 28061, slot.prompt.tokens.size() = 28723, seq_id = 2, pos_min = 28722, n_swa = 1
slot update_slots: id 2 | task 8656 | restored context checkpoint (pos_min = 27997, pos_max = 27997, size = 47.618 MiB)
slot update_slots: id 2 | task 8656 | n_tokens = 27998, memory_seq_rm [27998, end)
slot update_slots: id 2 | task 8656 | prompt processing progress, n_tokens = 28535, batch.n_tokens = 537, progress = 0.997762
slot update_slots: id 2 | task 8656 | n_tokens = 28535, memory_seq_rm [28535, end)
slot update_slots: id 2 | task 8656 | prompt processing progress, n_tokens = 28599, batch.n_tokens = 64, progress = 1.000000
slot update_slots: id 2 | task 8656 | prompt done, n_tokens = 28599, batch.n_tokens = 64
slot update_slots: id 2 | task 8656 | erasing old context checkpoint (pos_min = 5540, pos_max = 5540, size = 47.618 MiB)
slot update_slots: id 2 | task 8656 | created context checkpoint 8 of 8 (pos_min = 28534, pos_max = 28534, size = 47.618 MiB)
slot print_timing: id 2 | task 8656 |
prompt eval time = 2603.42 ms / 601 tokens ( 4.33 ms per token, 230.85 tokens per second)
eval time = 21206.75 ms / 684 tokens ( 31.00 ms per token, 32.25 tokens per second)
total time = 23810.17 ms / 1285 tokens
slot release: id 2 | task 8656 | stop processing: n_tokens = 29282, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
srv log_server_r: request: GET /api/tags 127.0.0.1 200
srv update_slots: all slots are idle
srv log_server_r: request: GET /slots 127.0.0.1 200
srv params_from_: Chat format: peg-constructed
slot get_availabl: id 2 | task -1 | selected slot by LCP similarity, sim_best = 0.988 (> 0.100 thold), f_keep = 0.191
srv get_availabl: updating prompt cache
srv prompt_save: - saving prompt with length 29282, total state size = 96.208 MiB
srv load: - looking for better prompt, base f_keep = 0.191, sim = 0.988
srv update: - cache state: 1 prompts, 477.150 MiB (limits: 8192.000 MiB, 262144 tokens, 502731 est)
srv update: - prompt 0x2ab563c0: 29282 tokens, checkpoints: 8, 477.150 MiB
srv get_availabl: prompt cache update took 170.31 ms
slot launch_slot_: id 2 | task -1 | sampler chain: logits -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
slot launch_slot_: id 2 | task 9343 | processing task
slot update_slots: id 2 | task 9343 | new prompt, n_ctx_slot = 262144, n_keep = 0, task.n_tokens = 5655
slot update_slots: id 2 | task 9343 | n_past = 5589, slot.prompt.tokens.size() = 29282, seq_id = 2, pos_min = 29281, n_swa = 1
slot update_slots: id 2 | task 9343 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
slot update_slots: id 2 | task 9343 | erased invalidated context checkpoint (pos_min = 5641, pos_max = 5641, n_swa = 1, size = 47.618 MiB)
slot update_slots: id 2 | task 9343 | erased invalidated context checkpoint (pos_min = 16268, pos_max = 16268, n_swa = 1, size = 47.618 MiB)
slot update_slots: id 2 | task 9343 | erased invalidated context checkpoint (pos_min = 26846, pos_max = 26846, n_swa = 1, size = 47.618 MiB)
slot update_slots: id 2 | task 9343 | erased invalidated context checkpoint (pos_min = 27347, pos_max = 27347, n_swa = 1, size = 47.618 MiB)
slot update_slots: id 2 | task 9343 | erased invalidated context checkpoint (pos_min = 27415, pos_max = 27415, n_swa = 1, size = 47.618 MiB)
slot update_slots: id 2 | task 9343 | erased invalidated context checkpoint (pos_min = 27615, pos_max = 27615, n_swa = 1, size = 47.618 MiB)
slot update_slots: id 2 | task 9343 | erased invalidated context checkpoint (pos_min = 27997, pos_max = 27997, n_swa = 1, size = 47.618 MiB)
slot update_slots: id 2 | task 9343 | erased invalidated context checkpoint (pos_min = 28534, pos_max = 28534, n_swa = 1, size = 47.618 MiB)
slot update_slots: id 2 | task 9343 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id 2 | task 9343 | prompt processing progress, n_tokens = 1625, batch.n_tokens = 1625, progress = 0.287356
slot update_slots: id 2 | task 9343 | n_tokens = 1625, memory_seq_rm [1625, end)
slot update_slots: id 2 | task 9343 | prompt processing progress, n_tokens = 3250, batch.n_tokens = 1625, progress = 0.574713
slot update_slots: id 2 | task 9343 | n_tokens = 3250, memory_seq_rm [3250, end)
slot update_slots: id 2 | task 9343 | prompt processing progress, n_tokens = 4875, batch.n_tokens = 1625, progress = 0.862069
slot update_slots: id 2 | task 9343 | n_tokens = 4875, memory_seq_rm [4875, end)
slot update_slots: id 2 | task 9343 | prompt processing progress, n_tokens = 5591, batch.n_tokens = 716, progress = 0.988683
slot update_slots: id 2 | task 9343 | n_tokens = 5591, memory_seq_rm [5591, end)
slot update_slots: id 2 | task 9343 | prompt processing progress, n_tokens = 5655, batch.n_tokens = 64, progress = 1.000000
slot update_slots: id 2 | task 9343 | prompt done, n_tokens = 5655, batch.n_tokens = 64
slot update_slots: id 2 | task 9343 | created context checkpoint 1 of 8 (pos_min = 5590, pos_max = 5590, size = 47.618 MiB)
slot print_timing: id 2 | task 9343 |
prompt eval time = 8991.38 ms / 5655 tokens ( 1.59 ms per token, 628.94 tokens per second)
eval time = 9852.93 ms / 347 tokens ( 28.39 ms per token, 35.22 tokens per second)
total time = 18844.31 ms / 6002 tokens
slot release: id 2 | task 9343 | stop processing: n_tokens = 6001, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
srv params_from_: Chat format: peg-constructed
slot get_availabl: id 2 | task -1 | selected slot by LCP similarity, sim_best = 0.663 (> 0.100 thold), f_keep = 0.942
slot launch_slot_: id 2 | task -1 | sampler chain: logits -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
slot launch_slot_: id 2 | task 9695 | processing task
slot update_slots: id 2 | task 9695 | new prompt, n_ctx_slot = 262144, n_keep = 0, task.n_tokens = 8530
slot update_slots: id 2 | task 9695 | n_past = 5654, slot.prompt.tokens.size() = 6001, seq_id = 2, pos_min = 6000, n_swa = 1
slot update_slots: id 2 | task 9695 | restored context checkpoint (pos_min = 5590, pos_max = 5590, size = 47.618 MiB)
slot update_slots: id 2 | task 9695 | n_tokens = 5591, memory_seq_rm [5591, end)
slot update_slots: id 2 | task 9695 | prompt processing progress, n_tokens = 7216, batch.n_tokens = 1625, progress = 0.845955
slot update_slots: id 2 | task 9695 | n_tokens = 7216, memory_seq_rm [7216, end)
slot update_slots: id 2 | task 9695 | prompt processing progress, n_tokens = 8466, batch.n_tokens = 1250, progress = 0.992497
slot update_slots: id 2 | task 9695 | n_tokens = 8466, memory_seq_rm [8466, end)
slot update_slots: id 2 | task 9695 | prompt processing progress, n_tokens = 8530, batch.n_tokens = 64, progress = 1.000000
slot update_slots: id 2 | task 9695 | prompt done, n_tokens = 8530, batch.n_tokens = 64
slot update_slots: id 2 | task 9695 | created context checkpoint 2 of 8 (pos_min = 8465, pos_max = 8465, size = 47.618 MiB)
slot print_timing: id 2 | task 9695 |
prompt eval time = 5219.43 ms / 2939 tokens ( 1.78 ms per token, 563.09 tokens per second)
eval time = 70685.57 ms / 2444 tokens ( 28.92 ms per token, 34.58 tokens per second)
total time = 75904.99 ms / 5383 tokens
slot release: id 2 | task 9695 | stop processing: n_tokens = 10973, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
srv log_server_r: request: GET /api/tags 127.0.0.1 200
srv update_slots: all slots are idle
srv log_server_r: request: GET /slots 127.0.0.1 200
srv log_server_r: request: GET /api/tags 127.0.0.1 200
srv update_slots: all slots are idle
srv log_server_r: request: GET /slots 127.0.0.1 200
srv params_from_: Chat format: peg-constructed
slot get_availabl: id 2 | task -1 | selected slot by LCP similarity, sim_best = 0.792 (> 0.100 thold), f_keep = 0.777
slot launch_slot_: id 2 | task -1 | sampler chain: logits -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
slot launch_slot_: id 2 | task 12144 | processing task
slot update_slots: id 2 | task 12144 | new prompt, n_ctx_slot = 262144, n_keep = 0, task.n_tokens = 10769
slot update_slots: id 2 | task 12144 | n_past = 8529, slot.prompt.tokens.size() = 10973, seq_id = 2, pos_min = 10972, n_swa = 1
slot update_slots: id 2 | task 12144 | restored context checkpoint (pos_min = 8465, pos_max = 8465, size = 47.618 MiB)
slot update_slots: id 2 | task 12144 | n_tokens = 8466, memory_seq_rm [8466, end)
slot update_slots: id 2 | task 12144 | prompt processing progress, n_tokens = 10091, batch.n_tokens = 1625, progress = 0.937042
slot update_slots: id 2 | task 12144 | n_tokens = 10091, memory_seq_rm [10091, end)
slot update_slots: id 2 | task 12144 | prompt processing progress, n_tokens = 10705, batch.n_tokens = 614, progress = 0.994057
slot update_slots: id 2 | task 12144 | n_tokens = 10705, memory_seq_rm [10705, end)
slot update_slots: id 2 | task 12144 | prompt processing progress, n_tokens = 10769, batch.n_tokens = 64, progress = 1.000000
slot update_slots: id 2 | task 12144 | prompt done, n_tokens = 10769, batch.n_tokens = 64
slot update_slots: id 2 | task 12144 | created context checkpoint 3 of 8 (pos_min = 10704, pos_max = 10704, size = 47.618 MiB)
slot print_timing: id 2 | task 12144 |
prompt eval time = 4906.92 ms / 2303 tokens ( 2.13 ms per token, 469.34 tokens per second)
eval time = 28626.52 ms / 985 tokens ( 29.06 ms per token, 34.41 tokens per second)
total time = 33533.44 ms / 3288 tokens
slot release: id 2 | task 12144 | stop processing: n_tokens = 11753, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
srv params_from_: Chat format: peg-constructed
slot get_availabl: id 2 | task -1 | selected slot by LCP similarity, sim_best = 0.791 (> 0.100 thold), f_keep = 0.916
slot launch_slot_: id 2 | task -1 | sampler chain: logits -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
slot launch_slot_: id 2 | task 13132 | processing task
slot update_slots: id 2 | task 13132 | new prompt, n_ctx_slot = 262144, n_keep = 0, task.n_tokens = 13614
slot update_slots: id 2 | task 13132 | n_past = 10768, slot.prompt.tokens.size() = 11753, seq_id = 2, pos_min = 11752, n_swa = 1
slot update_slots: id 2 | task 13132 | restored context checkpoint (pos_min = 10704, pos_max = 10704, size = 47.618 MiB)
slot update_slots: id 2 | task 13132 | n_tokens = 10705, memory_seq_rm [10705, end)
slot update_slots: id 2 | task 13132 | prompt processing progress, n_tokens = 12330, batch.n_tokens = 1625, progress = 0.905685
slot update_slots: id 2 | task 13132 | n_tokens = 12330, memory_seq_rm [12330, end)
slot update_slots: id 2 | task 13132 | prompt processing progress, n_tokens = 13550, batch.n_tokens = 1220, progress = 0.995299
slot update_slots: id 2 | task 13132 | n_tokens = 13550, memory_seq_rm [13550, end)
slot update_slots: id 2 | task 13132 | prompt processing progress, n_tokens = 13614, batch.n_tokens = 64, progress = 1.000000
slot update_slots: id 2 | task 13132 | prompt done, n_tokens = 13614, batch.n_tokens = 64
slot update_slots: id 2 | task 13132 | created context checkpoint 4 of 8 (pos_min = 13549, pos_max = 13549, size = 47.618 MiB)
slot print_timing: id 2 | task 13132 |
prompt eval time = 5195.99 ms / 2909 tokens ( 1.79 ms per token, 559.85 tokens per second)
eval time = 8079.72 ms / 276 tokens ( 29.27 ms per token, 34.16 tokens per second)
total time = 13275.70 ms / 3185 tokens
slot release: id 2 | task 13132 | stop processing: n_tokens = 13889, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
srv params_from_: Chat format: peg-constructed
slot get_availabl: id 2 | task -1 | selected slot by LCP similarity, sim_best = 0.825 (> 0.100 thold), f_keep = 0.980
slot launch_slot_: id 2 | task -1 | sampler chain: logits -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
slot launch_slot_: id 2 | task 13411 | processing task
slot update_slots: id 2 | task 13411 | new prompt, n_ctx_slot = 262144, n_keep = 0, task.n_tokens = 16495
slot update_slots: id 2 | task 13411 | n_past = 13613, slot.prompt.tokens.size() = 13889, seq_id = 2, pos_min = 13888, n_swa = 1
slot update_slots: id 2 | task 13411 | restored context checkpoint (pos_min = 13549, pos_max = 13549, size = 47.618 MiB)
slot update_slots: id 2 | task 13411 | n_tokens = 13550, memory_seq_rm [13550, end)
slot update_slots: id 2 | task 13411 | prompt processing progress, n_tokens = 15175, batch.n_tokens = 1625, progress = 0.919976
slot update_slots: id 2 | task 13411 | n_tokens = 15175, memory_seq_rm [15175, end)
slot update_slots: id 2 | task 13411 | prompt processing progress, n_tokens = 16431, batch.n_tokens = 1256, progress = 0.996120
slot update_slots: id 2 | task 13411 | n_tokens = 16431, memory_seq_rm [16431, end)
slot update_slots: id 2 | task 13411 | prompt processing progress, n_tokens = 16495, batch.n_tokens = 64, progress = 1.000000
slot update_slots: id 2 | task 13411 | prompt done, n_tokens = 16495, batch.n_tokens = 64
slot update_slots: id 2 | task 13411 | created context checkpoint 5 of 8 (pos_min = 16430, pos_max = 16430, size = 47.618 MiB)
slot print_timing: id 2 | task 13411 |
prompt eval time = 5301.19 ms / 2945 tokens ( 1.80 ms per token, 555.54 tokens per second)
eval time = 6638.55 ms / 222 tokens ( 29.90 ms per token, 33.44 tokens per second)
total time = 11939.73 ms / 3167 tokens
slot release: id 2 | task 13411 | stop processing: n_tokens = 16716, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
srv params_from_: Chat format: peg-constructed
slot get_availabl: id 2 | task -1 | selected slot by LCP similarity, sim_best = 0.880 (> 0.100 thold), f_keep = 0.987
slot launch_slot_: id 2 | task -1 | sampler chain: logits -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
slot launch_slot_: id 2 | task 13636 | processing task
slot update_slots: id 2 | task 13636 | new prompt, n_ctx_slot = 262144, n_keep = 0, task.n_tokens = 18753
slot update_slots: id 2 | task 13636 | n_past = 16494, slot.prompt.tokens.size() = 16716, seq_id = 2, pos_min = 16715, n_swa = 1
slot update_slots: id 2 | task 13636 | restored context checkpoint (pos_min = 16430, pos_max = 16430, size = 47.618 MiB)
slot update_slots: id 2 | task 13636 | n_tokens = 16431, memory_seq_rm [16431, end)
slot update_slots: id 2 | task 13636 | prompt processing progress, n_tokens = 18056, batch.n_tokens = 1625, progress = 0.962833
slot update_slots: id 2 | task 13636 | n_tokens = 18056, memory_seq_rm [18056, end)
slot update_slots: id 2 | task 13636 | prompt processing progress, n_tokens = 18689, batch.n_tokens = 633, progress = 0.996587
slot update_slots: id 2 | task 13636 | n_tokens = 18689, memory_seq_rm [18689, end)
slot update_slots: id 2 | task 13636 | prompt processing progress, n_tokens = 18753, batch.n_tokens = 64, progress = 1.000000
slot update_slots: id 2 | task 13636 | prompt done, n_tokens = 18753, batch.n_tokens = 64
slot update_slots: id 2 | task 13636 | created context checkpoint 6 of 8 (pos_min = 18688, pos_max = 18688, size = 47.618 MiB)
slot print_timing: id 2 | task 13636 |
prompt eval time = 5169.56 ms / 2322 tokens ( 2.23 ms per token, 449.17 tokens per second)
eval time = 9686.29 ms / 322 tokens ( 30.08 ms per token, 33.24 tokens per second)
total time = 14855.85 ms / 2644 tokens
slot release: id 2 | task 13636 | stop processing: n_tokens = 19074, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
srv params_from_: Chat format: peg-constructed
slot get_availabl: id 2 | task -1 | selected slot by LCP similarity, sim_best = 0.865 (> 0.100 thold), f_keep = 0.983
slot launch_slot_: id 2 | task -1 | sampler chain: logits -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
slot launch_slot_: id 2 | task 13961 | processing task
slot update_slots: id 2 | task 13961 | new prompt, n_ctx_slot = 262144, n_keep = 0, task.n_tokens = 21668
slot update_slots: id 2 | task 13961 | n_past = 18752, slot.prompt.tokens.size() = 19074, seq_id = 2, pos_min = 19073, n_swa = 1
slot update_slots: id 2 | task 13961 | restored context checkpoint (pos_min = 18688, pos_max = 18688, size = 47.618 MiB)
slot update_slots: id 2 | task 13961 | n_tokens = 18689, memory_seq_rm [18689, end)
slot update_slots: id 2 | task 13961 | prompt processing progress, n_tokens = 20314, batch.n_tokens = 1625, progress = 0.937512
slot update_slots: id 2 | task 13961 | n_tokens = 20314, memory_seq_rm [20314, end)
slot update_slots: id 2 | task 13961 | prompt processing progress, n_tokens = 21604, batch.n_tokens = 1290, progress = 0.997046
slot update_slots: id 2 | task 13961 | n_tokens = 21604, memory_seq_rm [21604, end)
slot update_slots: id 2 | task 13961 | prompt processing progress, n_tokens = 21668, batch.n_tokens = 64, progress = 1.000000
slot update_slots: id 2 | task 13961 | prompt done, n_tokens = 21668, batch.n_tokens = 64
slot update_slots: id 2 | task 13961 | created context checkpoint 7 of 8 (pos_min = 21603, pos_max = 21603, size = 47.618 MiB)
slot print_timing: id 2 | task 13961 |
prompt eval time = 5810.13 ms / 2979 tokens ( 1.95 ms per token, 512.73 tokens per second)
eval time = 66240.13 ms / 2182 tokens ( 30.36 ms per token, 32.94 tokens per second)
total time = 72050.26 ms / 5161 tokens
slot release: id 2 | task 13961 | stop processing: n_tokens = 23849, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
srv params_from_: Chat format: peg-constructed
slot get_availabl: id 2 | task -1 | selected slot by LCP similarity, sim_best = 0.924 (> 0.100 thold), f_keep = 0.909
slot launch_slot_: id 2 | task -1 | sampler chain: logits -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
slot launch_slot_: id 2 | task 16146 | processing task
slot update_slots: id 2 | task 16146 | new prompt, n_ctx_slot = 262144, n_keep = 0, task.n_tokens = 23447
slot update_slots: id 2 | task 16146 | n_past = 21667, slot.prompt.tokens.size() = 23849, seq_id = 2, pos_min = 23848, n_swa = 1
slot update_slots: id 2 | task 16146 | restored context checkpoint (pos_min = 21603, pos_max = 21603, size = 47.618 MiB)
slot update_slots: id 2 | task 16146 | n_tokens = 21604, memory_seq_rm [21604, end)
slot update_slots: id 2 | task 16146 | prompt processing progress, n_tokens = 23229, batch.n_tokens = 1625, progress = 0.990702
slot update_slots: id 2 | task 16146 | n_tokens = 23229, memory_seq_rm [23229, end)
slot update_slots: id 2 | task 16146 | prompt processing progress, n_tokens = 23383, batch.n_tokens = 154, progress = 0.997270
slot update_slots: id 2 | task 16146 | n_tokens = 23383, memory_seq_rm [23383, end)
slot update_slots: id 2 | task 16146 | prompt processing progress, n_tokens = 23447, batch.n_tokens = 64, progress = 1.000000
slot update_slots: id 2 | task 16146 | prompt done, n_tokens = 23447, batch.n_tokens = 64
slot update_slots: id 2 | task 16146 | created context checkpoint 8 of 8 (pos_min = 23382, pos_max = 23382, size = 47.618 MiB)
slot print_timing: id 2 | task 16146 |
prompt eval time = 4699.71 ms / 1843 tokens ( 2.55 ms per token, 392.15 tokens per second)
eval time = 72890.12 ms / 2384 tokens ( 30.57 ms per token, 32.71 tokens per second)
total time = 77589.83 ms / 4227 tokens
slot release: id 2 | task 16146 | stop processing: n_tokens = 25830, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
srv params_from_: Chat format: peg-constructed
slot get_availabl: id 2 | task -1 | selected slot by LCP similarity, sim_best = 0.910 (> 0.100 thold), f_keep = 0.908
slot launch_slot_: id 2 | task -1 | sampler chain: logits -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
slot launch_slot_: id 2 | task 18533 | processing task
slot update_slots: id 2 | task 18533 | new prompt, n_ctx_slot = 262144, n_keep = 0, task.n_tokens = 25755
slot update_slots: id 2 | task 18533 | n_past = 23446, slot.prompt.tokens.size() = 25830, seq_id = 2, pos_min = 25829, n_swa = 1
slot update_slots: id 2 | task 18533 | restored context checkpoint (pos_min = 23382, pos_max = 23382, size = 47.618 MiB)
slot update_slots: id 2 | task 18533 | n_tokens = 23383, memory_seq_rm [23383, end)
slot update_slots: id 2 | task 18533 | prompt processing progress, n_tokens = 25008, batch.n_tokens = 1625, progress = 0.970996
slot update_slots: id 2 | task 18533 | n_tokens = 25008, memory_seq_rm [25008, end)
slot update_slots: id 2 | task 18533 | prompt processing progress, n_tokens = 25691, batch.n_tokens = 683, progress = 0.997515
slot update_slots: id 2 | task 18533 | n_tokens = 25691, memory_seq_rm [25691, end)
slot update_slots: id 2 | task 18533 | prompt processing progress, n_tokens = 25755, batch.n_tokens = 64, progress = 1.000000
slot update_slots: id 2 | task 18533 | prompt done, n_tokens = 25755, batch.n_tokens = 64
slot update_slots: id 2 | task 18533 | erasing old context checkpoint (pos_min = 5590, pos_max = 5590, size = 47.618 MiB)
slot update_slots: id 2 | task 18533 | created context checkpoint 8 of 8 (pos_min = 25690, pos_max = 25690, size = 47.618 MiB)
slot print_timing: id 2 | task 18533 |
prompt eval time = 5397.80 ms / 2372 tokens ( 2.28 ms per token, 439.44 tokens per second)
eval time = 97035.56 ms / 3141 tokens ( 30.89 ms per token, 32.37 tokens per second)
total time = 102433.36 ms / 5513 tokens
slot release: id 2 | task 18533 | stop processing: n_tokens = 28895, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
srv params_from_: Chat format: peg-constructed
slot get_availabl: id 2 | task -1 | selected slot by LCP similarity, sim_best = 0.902 (> 0.100 thold), f_keep = 0.891
slot launch_slot_: id 2 | task -1 | sampler chain: logits -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
slot launch_slot_: id 2 | task 21677 | processing task
slot update_slots: id 2 | task 21677 | new prompt, n_ctx_slot = 262144, n_keep = 0, task.n_tokens = 28550
slot update_slots: id 2 | task 21677 | n_past = 25754, slot.prompt.tokens.size() = 28895, seq_id = 2, pos_min = 28894, n_swa = 1
slot update_slots: id 2 | task 21677 | restored context checkpoint (pos_min = 25690, pos_max = 25690, size = 47.618 MiB)
slot update_slots: id 2 | task 21677 | n_tokens = 25691, memory_seq_rm [25691, end)
slot update_slots: id 2 | task 21677 | prompt processing progress, n_tokens = 27316, batch.n_tokens = 1625, progress = 0.956778
slot update_slots: id 2 | task 21677 | n_tokens = 27316, memory_seq_rm [27316, end)
slot update_slots: id 2 | task 21677 | prompt processing progress, n_tokens = 28486, batch.n_tokens = 1170, progress = 0.997758
slot update_slots: id 2 | task 21677 | n_tokens = 28486, memory_seq_rm [28486, end)
slot update_slots: id 2 | task 21677 | prompt processing progress, n_tokens = 28550, batch.n_tokens = 64, progress = 1.000000
slot update_slots: id 2 | task 21677 | prompt done, n_tokens = 28550, batch.n_tokens = 64
slot update_slots: id 2 | task 21677 | erasing old context checkpoint (pos_min = 8465, pos_max = 8465, size = 47.618 MiB)
slot update_slots: id 2 | task 21677 | created context checkpoint 8 of 8 (pos_min = 28485, pos_max = 28485, size = 47.618 MiB)
slot print_timing: id 2 | task 21677 |
prompt eval time = 6025.06 ms / 2859 tokens ( 2.11 ms per token, 474.52 tokens per second)
eval time = 108551.40 ms / 3474 tokens ( 31.25 ms per token, 32.00 tokens per second)
total time = 114576.45 ms / 6333 tokens
slot release: id 2 | task 21677 | stop processing: n_tokens = 32023, truncated = 0
srv update_slots: all slots are idle
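As a sanity check on the log above: for each of the three complete requests shown, the `prompt eval` token count matches `task.n_tokens` minus the `n_tokens` value reported right after the context checkpoint is restored. The Rust sketch below reproduces that arithmetic with numbers copied from the log. The helper name `tokens_to_evaluate` is hypothetical, and the relationship is only an observation about these log lines, not a statement about llama.cpp internals.

```rust
/// Hypothetical helper: tokens the server still has to evaluate when a
/// cached prefix of `cached` tokens is reused for a prompt of
/// `prompt_tokens` tokens. Saturates at zero instead of underflowing.
fn tokens_to_evaluate(prompt_tokens: u32, cached: u32) -> u32 {
    prompt_tokens.saturating_sub(cached)
}

fn main() {
    // (task.n_tokens, n_tokens after checkpoint restore, logged prompt eval tokens)
    // Values copied from tasks 16146, 18533 and 21677 in the log above.
    let requests = [
        (23447, 21604, 1843),
        (25755, 23383, 2372),
        (28550, 25691, 2859),
    ];

    for (prompt, cached, logged) in requests {
        let computed = tokens_to_evaluate(prompt, cached);
        println!("prompt={prompt} cached={cached} evaluated={computed} logged={logged}");
        // The computed value should agree with the "prompt eval time" line.
        assert_eq!(computed, logged);
    }
}
```

If the token counter in the TUI shows input totals close to `task.n_tokens` while the server reports a much smaller `prompt eval` count, that difference is likely just the reused prefix rather than a counting bug.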