Description
Name and Version
build: 5478 (f5cd27b) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
Operating systems
Linux, Other? (Please let us know in description)
GGML backends
CUDA
Hardware
Device 0: NVIDIA DRIVE-PG199-PROD, compute capability 8.0, VMM: yes
NVIDIA A100 Drive (32GB) SXM2 installed in Gigabyte T181-G20
Models
https://huggingface.co/Qwen/Qwen3-32B-GGUF/blob/main/Qwen3-32B-Q4_K_M.gguf
Problem description & steps to reproduce
This was also mentioned in the thread over here.
I was having issues with tooling (n8n, Continue for VS Code, and others) starting to "default" to streaming calls even when tools are supplied. I came across the thread above and have been using an updated version of the proxy workaround noted by crashr in n8n-io/n8n#13112 (comment).
Another user noted that the recent work included in https://github.com/ggml-org/llama.cpp/releases/tag/b5478 should resolve these issues.
I cloned the repo and used the docker build command referenced at line 68 in f5cd27b to build a local image from the latest commit:

docker build -t local/llama.cpp:server-cuda --target server -f .devops/cuda.Dockerfile .

The build appeared successful and I was able to load the model using the container configuration below; a quick sanity-check request is sketched after the config.
```yaml
  qwen3_32B:
    image: local/llama.cpp:server-cuda-b5478
    restart: unless-stopped
    ports:
      - 8080:8080
    volumes:
      - {{path}}/models:/models
      - {{path}}/hf_home:/root/.cache/huggingface
    runtime: nvidia
    # command: -m models/7B/gguf --ctx-size 4096 --host 0.0.0.0 --port 8080 --n-gpu-layers 99
    environment:
      # alternatively, you can use "LLAMA_ARG_MODEL_URL" to download the model
      LLAMA_ARG_MODEL: /models/qwen3/Qwen3-32B-Q4_K_M.gguf
      LLAMA_ARG_CTX_SIZE: ${LLAMA_ARG_CTX_SIZE:-0}
      LLAMA_ARG_N_GPU_LAYERS: ${LLAMA_ARG_N_GPU_LAYERS:-99}
      LLAMA_ARG_ALIAS: Qwen3-32B
      LLAMA_ARG_FLASH_ATTN: ${LLAMA_ARG_FLASH_ATTN:-true}
      LLAMA_ARG_JINJA: true
      LLAMA_ARG_THINK: deepseek
      HF_TOKEN: ${HUGGING_FACE_HUB_TOKEN}
      TZ: ${TZ}
    env_file:
      - .env
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              capabilities: [gpu]
              device_ids: ["1"]
```
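For reference, a quick way to confirm the server is up and that the plain (non-streaming, tool-less) path works is to hit the health and chat endpoints directly. A minimal sketch, assuming the container is reachable at http://localhost:8080 and the Python requests package is available:

```python
import requests

BASE = "http://localhost:8080"  # host/port mapping from the compose file above

# llama-server exposes a /health endpoint; expect 200 once the model is loaded
print(requests.get(f"{BASE}/health").status_code)

# plain non-streaming chat completion with no tools attached
resp = requests.post(
    f"{BASE}/v1/chat/completions",
    json={
        "model": "Qwen3-32B",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant"},
            {"role": "user", "content": "Hello"},
        ],
        "stream": False,
    },
    timeout=120,
)
print(resp.status_code, resp.json()["choices"][0]["message"]["content"])
```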
Tests and Results:
Test: n8n calling llama-server in the container directly; sending "Hello" returns:
n8n version
1.93.0 (Self Hosted)
Time
5/24/2025, 8:38:54 PM
Error cause
{ "status": 500, "headers": { "access-control-allow-origin": "", "content-length": "85", "content-type": "application/json; charset=utf-8", "keep-alive": "timeout=5, max=100", "server": "llama.cpp" }, "error": { "code": 500, "message": "Cannot use tools with stream", "type": "server_error" }, "code": 500, "type": "server_error", "attemptNumber": 3, "retriesLeft": 0 }
Test: n8n calling the same container, but fronted by Open WebUI:
input:
{
"messages": [
"System: You are a helpful assistant\nHuman: Hello"
],
"estimatedTokens": 11,
"options": {
"openai_api_key": {
"lc": 1,
"type": "secret",
"id": [
"OPENAI_API_KEY"
]
},
"model": "Qwen3-32B",
"timeout": 60000,
"max_retries": 2,
"configuration": {
"baseURL": "{{{openwebuiserver}}}"
},
"model_kwargs": {}
}
}
output:
{
"response": {
"generations": [
[
{
"text": "<think>Okay, the user said \"Hello\". I need to respond appropriately. Since there's no specific query here that requires using any of the provided tools, I should just reply with a friendly greeting. Let me check the available functions again to make sure none are needed. The Home_Assistant, Date_Time, Google_Calendar, and wikipedia-api functions are available, but the user isn't asking for anything that needs those tools right now. So I'll just respond with a hello and offer assistance.</think>Hello! How can I assist you today?",
"generationInfo": {
"prompt": 0,
"completion": 0,
"finish_reason": "stop",
"system_fingerprint": "b5478-f5cd27b7",
"model_name": "Qwen3-32B"
}
}
]
]
},
"tokenUsage": {
"completionTokens": 113,
"promptTokens": 432,
"totalTokens": 545
}
}
Follow-up question:
{
"messages": [
"System: You are a helpful assistant\nHuman: Hello\nAI: <think>Okay, the user said \"Hello\". I need to respond appropriately. Since there's no specific query here that requires using any of the provided tools, I should just reply with a friendly greeting. Let me check the available functions again to make sure none are needed. The Home_Assistant, Date_Time, Google_Calendar, and wikipedia-api functions are available, but the user isn't asking for anything that needs those tools right now. So I'll just respond with a hello and offer assistance.</think>Hello! How can I assist you today?\nHuman: What tools do you have available?"
],
"estimatedTokens": 132,
"options": {
"openai_api_key": {
"lc": 1,
"type": "secret",
"id": [
"OPENAI_API_KEY"
]
},
"model": "Qwen3-32B",
"timeout": 60000,
"max_retries": 2,
"configuration": {
"baseURL": "{{{openwebuiserver}}}"
},
"model_kwargs": {}
}
}
Error returned in n8n:
{
"errorMessage": "Premature close",
"errorDetails": {},
"n8nDetails": {
"time": "5/24/2025, 8:45:19 PM",
"n8nVersion": "1.93.0 (Self Hosted)",
"binaryDataMode": "default",
"cause": {
"code": "ERR_STREAM_PREMATURE_CLOSE"
}
}
}
The container also crashes and restarts.
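To take n8n and Open WebUI out of the loop, the follow-up turn can be replayed directly as a streaming request and the SSE chunks read until the connection drops. This is only a sketch; the history below is illustrative (the previous assistant turn keeps its <think> block, as in the output above) and is not a byte-for-byte copy of what Open WebUI forwards:

```python
import requests

BASE = "http://localhost:8080"

history = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "Hello"},
    # previous assistant reply, including the <think>...</think> block it returned
    {"role": "assistant", "content": "<think>...</think>Hello! How can I assist you today?"},
    {"role": "user", "content": "What tools do you have available?"},
]

with requests.post(
    f"{BASE}/v1/chat/completions",
    json={"model": "Qwen3-32B", "messages": history, "stream": True},
    stream=True,
    timeout=120,
) as resp:
    try:
        for line in resp.iter_lines():
            if line:
                print(line.decode())
    except requests.exceptions.RequestException as exc:
        # if the container crashes mid-stream the connection drops here,
        # which n8n surfaces as ERR_STREAM_PREMATURE_CLOSE
        print("stream closed early:", exc)
```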
Test: Using stream proxy workaround from crashr in n8n-io/n8n#13112 (comment)
n8n is able to make both chat and tool calls successfully.
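I have not reproduced crashr's proxy here, but as I understand it the idea is roughly: intercept chat completion requests and force them to be non-streaming before forwarding them to llama-server. A minimal sketch of that idea (assuming Flask and requests; this is not the actual workaround code, and a fuller proxy would pass tool-less streaming requests through untouched):

```python
from flask import Flask, jsonify, request
import requests

UPSTREAM = "http://localhost:8080"  # llama-server from the compose file above
app = Flask(__name__)

@app.route("/v1/chat/completions", methods=["POST"])
def chat_completions():
    body = request.get_json(force=True)
    body["stream"] = False  # avoid the tools+stream path that fails above
    upstream = requests.post(f"{UPSTREAM}/v1/chat/completions", json=body, timeout=600)
    return jsonify(upstream.json()), upstream.status_code

if __name__ == "__main__":
    # point n8n at this proxy (port 8081) instead of llama-server directly
    app.run(host="0.0.0.0", port=8081)
```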
First Bad Commit
No response
Relevant log output
qwen3_32B-1 | ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
qwen3_32B-1 | ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
qwen3_32B-1 | ggml_cuda_init: found 1 CUDA devices:
qwen3_32B-1 | Device 0: NVIDIA DRIVE-PG199-PROD, compute capability 8.0, VMM: yes
qwen3_32B-1 | load_backend: loaded CUDA backend from /app/libggml-cuda.so
qwen3_32B-1 | load_backend: loaded CPU backend from /app/libggml-cpu-skylakex.so
qwen3_32B-1 | build: 5478 (f5cd27b7) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
qwen3_32B-1 | system info: n_threads = 10, n_threads_batch = 10, total_threads = 20
qwen3_32B-1 |
qwen3_32B-1 | system_info: n_threads = 10 (n_threads_batch = 10) / 20 | CUDA : ARCHS = 500,610,700,750,800,860,890 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |
qwen3_32B-1 |
qwen3_32B-1 | main: binding port with default address family
qwen3_32B-1 | main: HTTP server is listening, hostname: 0.0.0.0, port: 8080, http threads: 19
qwen3_32B-1 | main: loading model
qwen3_32B-1 | srv load_model: loading model '/models/qwen3/Qwen3-32B-Q4_K_M.gguf'
qwen3_32B-1 | llama_model_load_from_file_impl: using device CUDA0 (NVIDIA DRIVE-PG199-PROD) - 31922 MiB free
qwen3_32B-1 | llama_model_loader: loaded meta data with 28 key-value pairs and 707 tensors from /models/qwen3/Qwen3-32B-Q4_K_M.gguf (version GGUF V3 (latest))
qwen3_32B-1 | llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
qwen3_32B-1 | llama_model_loader: - kv 0: general.architecture str = qwen3
qwen3_32B-1 | llama_model_loader: - kv 1: general.type str = model
qwen3_32B-1 | llama_model_loader: - kv 2: general.name str = Qwen3 32B Awq Compatible Instruct
qwen3_32B-1 | llama_model_loader: - kv 3: general.finetune str = awq-compatible-Instruct
qwen3_32B-1 | llama_model_loader: - kv 4: general.basename str = Qwen3
qwen3_32B-1 | llama_model_loader: - kv 5: general.size_label str = 32B
qwen3_32B-1 | llama_model_loader: - kv 6: qwen3.block_count u32 = 64
qwen3_32B-1 | llama_model_loader: - kv 7: qwen3.context_length u32 = 40960
qwen3_32B-1 | llama_model_loader: - kv 8: qwen3.embedding_length u32 = 5120
qwen3_32B-1 | llama_model_loader: - kv 9: qwen3.feed_forward_length u32 = 25600
qwen3_32B-1 | llama_model_loader: - kv 10: qwen3.attention.head_count u32 = 64
qwen3_32B-1 | llama_model_loader: - kv 11: qwen3.attention.head_count_kv u32 = 8
qwen3_32B-1 | llama_model_loader: - kv 12: qwen3.rope.freq_base f32 = 1000000.000000
qwen3_32B-1 | llama_model_loader: - kv 13: qwen3.attention.layer_norm_rms_epsilon f32 = 0.000001
qwen3_32B-1 | llama_model_loader: - kv 14: qwen3.attention.key_length u32 = 128
qwen3_32B-1 | llama_model_loader: - kv 15: qwen3.attention.value_length u32 = 128
qwen3_32B-1 | llama_model_loader: - kv 16: tokenizer.ggml.model str = gpt2
qwen3_32B-1 | llama_model_loader: - kv 17: tokenizer.ggml.pre str = qwen2
qwen3_32B-1 | llama_model_loader: - kv 18: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
qwen3_32B-1 | llama_model_loader: - kv 19: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
qwen3_32B-1 | llama_model_loader: - kv 20: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
qwen3_32B-1 | llama_model_loader: - kv 21: tokenizer.ggml.eos_token_id u32 = 151645
qwen3_32B-1 | llama_model_loader: - kv 22: tokenizer.ggml.padding_token_id u32 = 151643
qwen3_32B-1 | llama_model_loader: - kv 23: tokenizer.ggml.bos_token_id u32 = 151643
qwen3_32B-1 | llama_model_loader: - kv 24: tokenizer.ggml.add_bos_token bool = false
qwen3_32B-1 | llama_model_loader: - kv 25: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
qwen3_32B-1 | llama_model_loader: - kv 26: general.quantization_version u32 = 2
qwen3_32B-1 | llama_model_loader: - kv 27: general.file_type u32 = 15
qwen3_32B-1 | llama_model_loader: - type f32: 257 tensors
qwen3_32B-1 | llama_model_loader: - type q4_K: 385 tensors
qwen3_32B-1 | llama_model_loader: - type q6_K: 65 tensors
qwen3_32B-1 | print_info: file format = GGUF V3 (latest)
qwen3_32B-1 | print_info: file type = Q4_K - Medium
qwen3_32B-1 | print_info: file size = 18.40 GiB (4.82 BPW)
qwen3_32B-1 | load: special tokens cache size = 26
qwen3_32B-1 | load: token to piece cache size = 0.9311 MB
qwen3_32B-1 | print_info: arch = qwen3
qwen3_32B-1 | print_info: vocab_only = 0
qwen3_32B-1 | print_info: n_ctx_train = 40960
qwen3_32B-1 | print_info: n_embd = 5120
qwen3_32B-1 | print_info: n_layer = 64
qwen3_32B-1 | print_info: n_head = 64
qwen3_32B-1 | print_info: n_head_kv = 8
qwen3_32B-1 | print_info: n_rot = 128
qwen3_32B-1 | print_info: n_swa = 0
qwen3_32B-1 | print_info: is_swa_any = 0
qwen3_32B-1 | print_info: n_embd_head_k = 128
qwen3_32B-1 | print_info: n_embd_head_v = 128
qwen3_32B-1 | print_info: n_gqa = 8
qwen3_32B-1 | print_info: n_embd_k_gqa = 1024
qwen3_32B-1 | print_info: n_embd_v_gqa = 1024
qwen3_32B-1 | print_info: f_norm_eps = 0.0e+00
qwen3_32B-1 | print_info: f_norm_rms_eps = 1.0e-06
qwen3_32B-1 | print_info: f_clamp_kqv = 0.0e+00
qwen3_32B-1 | print_info: f_max_alibi_bias = 0.0e+00
qwen3_32B-1 | print_info: f_logit_scale = 0.0e+00
qwen3_32B-1 | print_info: f_attn_scale = 0.0e+00
qwen3_32B-1 | print_info: n_ff = 25600
qwen3_32B-1 | print_info: n_expert = 0
qwen3_32B-1 | print_info: n_expert_used = 0
qwen3_32B-1 | print_info: causal attn = 1
qwen3_32B-1 | print_info: pooling type = 0
qwen3_32B-1 | print_info: rope type = 2
qwen3_32B-1 | print_info: rope scaling = linear
qwen3_32B-1 | print_info: freq_base_train = 1000000.0
qwen3_32B-1 | print_info: freq_scale_train = 1
qwen3_32B-1 | print_info: n_ctx_orig_yarn = 40960
qwen3_32B-1 | print_info: rope_finetuned = unknown
qwen3_32B-1 | print_info: ssm_d_conv = 0
qwen3_32B-1 | print_info: ssm_d_inner = 0
qwen3_32B-1 | print_info: ssm_d_state = 0
qwen3_32B-1 | print_info: ssm_dt_rank = 0
qwen3_32B-1 | print_info: ssm_dt_b_c_rms = 0
qwen3_32B-1 | print_info: model type = 32B
qwen3_32B-1 | print_info: model params = 32.76 B
qwen3_32B-1 | print_info: general.name = Qwen3 32B Awq Compatible Instruct
qwen3_32B-1 | print_info: vocab type = BPE
qwen3_32B-1 | print_info: n_vocab = 151936
qwen3_32B-1 | print_info: n_merges = 151387
qwen3_32B-1 | print_info: BOS token = 151643 '<|endoftext|>'
qwen3_32B-1 | print_info: EOS token = 151645 '<|im_end|>'
qwen3_32B-1 | print_info: EOT token = 151645 '<|im_end|>'
qwen3_32B-1 | print_info: PAD token = 151643 '<|endoftext|>'
qwen3_32B-1 | print_info: LF token = 198 'Ċ'
qwen3_32B-1 | print_info: FIM PRE token = 151659 '<|fim_prefix|>'
qwen3_32B-1 | print_info: FIM SUF token = 151661 '<|fim_suffix|>'
qwen3_32B-1 | print_info: FIM MID token = 151660 '<|fim_middle|>'
qwen3_32B-1 | print_info: FIM PAD token = 151662 '<|fim_pad|>'
qwen3_32B-1 | print_info: FIM REP token = 151663 '<|repo_name|>'
qwen3_32B-1 | print_info: FIM SEP token = 151664 '<|file_sep|>'
qwen3_32B-1 | print_info: EOG token = 151643 '<|endoftext|>'
qwen3_32B-1 | print_info: EOG token = 151645 '<|im_end|>'
qwen3_32B-1 | print_info: EOG token = 151662 '<|fim_pad|>'
qwen3_32B-1 | print_info: EOG token = 151663 '<|repo_name|>'
qwen3_32B-1 | print_info: EOG token = 151664 '<|file_sep|>'
qwen3_32B-1 | print_info: max token length = 256
qwen3_32B-1 | load_tensors: loading model tensors, this can take a while... (mmap = true)
qwen3_32B-1 | load_tensors: offloading 64 repeating layers to GPU
qwen3_32B-1 | load_tensors: offloading output layer to GPU
qwen3_32B-1 | load_tensors: offloaded 65/65 layers to GPU
qwen3_32B-1 | load_tensors: CUDA0 model buffer size = 18423.65 MiB
qwen3_32B-1 | load_tensors: CPU_Mapped model buffer size = 417.30 MiB
qwen3_32B-1 | ................................................................................................
qwen3_32B-1 | llama_context: constructing llama_context
qwen3_32B-1 | llama_context: n_seq_max = 1
qwen3_32B-1 | llama_context: n_ctx = 40960
qwen3_32B-1 | llama_context: n_ctx_per_seq = 40960
qwen3_32B-1 | llama_context: n_batch = 2048
qwen3_32B-1 | llama_context: n_ubatch = 512
qwen3_32B-1 | llama_context: causal_attn = 1
qwen3_32B-1 | llama_context: flash_attn = 1
qwen3_32B-1 | llama_context: freq_base = 1000000.0
qwen3_32B-1 | llama_context: freq_scale = 1
qwen3_32B-1 | llama_context: CUDA_Host output buffer size = 0.58 MiB
qwen3_32B-1 | llama_kv_cache_unified: CUDA0 KV buffer size = 10240.00 MiB
qwen3_32B-1 | llama_kv_cache_unified: size = 10240.00 MiB ( 40960 cells, 64 layers, 1 seqs), K (f16): 5120.00 MiB, V (f16): 5120.00 MiB
qwen3_32B-1 | llama_context: CUDA0 compute buffer size = 306.75 MiB
qwen3_32B-1 | llama_context: CUDA_Host compute buffer size = 90.01 MiB
qwen3_32B-1 | llama_context: graph nodes = 2311
qwen3_32B-1 | llama_context: graph splits = 2
qwen3_32B-1 | common_init_from_params: setting dry_penalty_last_n to ctx_size = 40960
qwen3_32B-1 | common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
qwen3_32B-1 | srv init: initializing slots, n_slots = 1
qwen3_32B-1 | slot init: id 0 | task -1 | new slot n_ctx_slot = 40960
qwen3_32B-1 | main: model loaded
qwen3_32B-1 | main: chat template, chat_template: {%- if tools %}
qwen3_32B-1 | {{- '<|im_start|>system\n' }}
qwen3_32B-1 | {%- if messages[0].role == 'system' %}
qwen3_32B-1 | {{- messages[0].content + '\n\n' }}
qwen3_32B-1 | {%- endif %}
qwen3_32B-1 | {{- "# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }}
qwen3_32B-1 | {%- for tool in tools %}
qwen3_32B-1 | {{- "\n" }}
qwen3_32B-1 | {{- tool | tojson }}
qwen3_32B-1 | {%- endfor %}
qwen3_32B-1 | {{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }}
qwen3_32B-1 | {%- else %}
qwen3_32B-1 | {%- if messages[0].role == 'system' %}
qwen3_32B-1 | {{- '<|im_start|>system\n' + messages[0].content + '<|im_end|>\n' }}
qwen3_32B-1 | {%- endif %}
qwen3_32B-1 | {%- endif %}
qwen3_32B-1 | {%- set ns = namespace(multi_step_tool=true, last_query_index=messages|length - 1) %}
qwen3_32B-1 | {%- for index in range(ns.last_query_index, -1, -1) %}
qwen3_32B-1 | {%- set message = messages[index] %}
qwen3_32B-1 | {%- if ns.multi_step_tool and message.role == "user" and not('<tool_response>' in message.content and '</tool_response>' in message.content) %}
qwen3_32B-1 | {%- set ns.multi_step_tool = false %}
qwen3_32B-1 | {%- set ns.last_query_index = index %}
qwen3_32B-1 | {%- endif %}
qwen3_32B-1 | {%- endfor %}
qwen3_32B-1 | {%- for message in messages %}
qwen3_32B-1 | {%- if (message.role == "user") or (message.role == "system" and not loop.first) %}
qwen3_32B-1 | {{- '<|im_start|>' + message.role + '\n' + message.content + '<|im_end|>' + '\n' }}
qwen3_32B-1 | {%- elif message.role == "assistant" %}
qwen3_32B-1 | {%- set content = message.content %}
qwen3_32B-1 | {%- set reasoning_content = '' %}
qwen3_32B-1 | {%- if message.reasoning_content is defined and message.reasoning_content is not none %}
qwen3_32B-1 | {%- set reasoning_content = message.reasoning_content %}
qwen3_32B-1 | {%- else %}
qwen3_32B-1 | {%- if '</think>' in message.content %}
qwen3_32B-1 | {%- set content = message.content.split('</think>')[-1].lstrip('\n') %}
qwen3_32B-1 | {%- set reasoning_content = message.content.split('</think>')[0].rstrip('\n').split('<think>')[-1].lstrip('\n') %}
qwen3_32B-1 | {%- endif %}
qwen3_32B-1 | {%- endif %}
qwen3_32B-1 | {%- if loop.index0 > ns.last_query_index %}
qwen3_32B-1 | {%- if loop.last or (not loop.last and reasoning_content) %}
qwen3_32B-1 | {{- '<|im_start|>' + message.role + '\n<think>\n' + reasoning_content.strip('\n') + '\n</think>\n\n' + content.lstrip('\n') }}
qwen3_32B-1 | {%- else %}
qwen3_32B-1 | {{- '<|im_start|>' + message.role + '\n' + content }}
qwen3_32B-1 | {%- endif %}
qwen3_32B-1 | {%- else %}
qwen3_32B-1 | {{- '<|im_start|>' + message.role + '\n' + content }}
qwen3_32B-1 | {%- endif %}
qwen3_32B-1 | {%- if message.tool_calls %}
qwen3_32B-1 | {%- for tool_call in message.tool_calls %}
qwen3_32B-1 | {%- if (loop.first and content) or (not loop.first) %}
qwen3_32B-1 | {{- '\n' }}
qwen3_32B-1 | {%- endif %}
qwen3_32B-1 | {%- if tool_call.function %}
qwen3_32B-1 | {%- set tool_call = tool_call.function %}
qwen3_32B-1 | {%- endif %}
qwen3_32B-1 | {{- '<tool_call>\n{"name": "' }}
qwen3_32B-1 | {{- tool_call.name }}
qwen3_32B-1 | {{- '", "arguments": ' }}
qwen3_32B-1 | {%- if tool_call.arguments is string %}
qwen3_32B-1 | {{- tool_call.arguments }}
qwen3_32B-1 | {%- else %}
qwen3_32B-1 | {{- tool_call.arguments | tojson }}
qwen3_32B-1 | {%- endif %}
qwen3_32B-1 | {{- '}\n</tool_call>' }}
qwen3_32B-1 | {%- endfor %}
qwen3_32B-1 | {%- endif %}
qwen3_32B-1 | {{- '<|im_end|>\n' }}
qwen3_32B-1 | {%- elif message.role == "tool" %}
qwen3_32B-1 | {%- if loop.first or (messages[loop.index0 - 1].role != "tool") %}
qwen3_32B-1 | {{- '<|im_start|>user' }}
qwen3_32B-1 | {%- endif %}
qwen3_32B-1 | {{- '\n<tool_response>\n' }}
qwen3_32B-1 | {{- message.content }}
qwen3_32B-1 | {{- '\n</tool_response>' }}
qwen3_32B-1 | {%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}
qwen3_32B-1 | {{- '<|im_end|>\n' }}
qwen3_32B-1 | {%- endif %}
qwen3_32B-1 | {%- endif %}
qwen3_32B-1 | {%- endfor %}
qwen3_32B-1 | {%- if add_generation_prompt %}
qwen3_32B-1 | {{- '<|im_start|>assistant\n' }}
qwen3_32B-1 | {%- if enable_thinking is defined and enable_thinking is false %}
qwen3_32B-1 | {{- '<think>\n\n</think>\n\n' }}
qwen3_32B-1 | {%- endif %}
qwen3_32B-1 | {%- endif %}, example_format: '<|im_start|>system
qwen3_32B-1 | You are a helpful assistant<|im_end|>
qwen3_32B-1 | <|im_start|>user
qwen3_32B-1 | Hello<|im_end|>
qwen3_32B-1 | <|im_start|>assistant
qwen3_32B-1 | Hi there<|im_end|>
qwen3_32B-1 | <|im_start|>user
qwen3_32B-1 | How are you?<|im_end|>
qwen3_32B-1 | <|im_start|>assistant
qwen3_32B-1 | '
qwen3_32B-1 | main: server is listening on http://0.0.0.0:8080 - starting the main loop
qwen3_32B-1 | srv update_slots: all slots are idle
qwen3_32B-1 | srv log_server_r: request: GET /health 127.0.0.1 200
qwen3_32B-1 | srv log_server_r: request: GET /health 127.0.0.1 200
qwen3_32B-1 | srv log_server_r: request: GET /health 127.0.0.1 200
qwen3_32B-1 | srv log_server_r: request: GET /health 127.0.0.1 200
qwen3_32B-1 | srv log_server_r: request: GET /health 127.0.0.1 200
qwen3_32B-1 | srv log_server_r: request: GET /health 127.0.0.1 200
qwen3_32B-1 | srv log_server_r: request: GET /health 127.0.0.1 200
qwen3_32B-1 | srv params_from_: Chat format: Hermes 2 Pro
qwen3_32B-1 | slot launch_slot_: id 0 | task 0 | processing task
qwen3_32B-1 | slot update_slots: id 0 | task 0 | new prompt, n_ctx_slot = 40960, n_keep = 0, n_prompt_tokens = 559
qwen3_32B-1 | slot update_slots: id 0 | task 0 | kv cache rm [0, end)
qwen3_32B-1 | slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 559, n_tokens = 559, progress = 1.000000
qwen3_32B-1 | slot update_slots: id 0 | task 0 | prompt done, n_past = 559, n_tokens = 559
qwen3_32B-1 | slot release: id 0 | task 0 | stop processing: n_past = 951, truncated = 0
qwen3_32B-1 | slot print_timing: id 0 | task 0 |
qwen3_32B-1 | prompt eval time = 714.34 ms / 559 tokens ( 1.28 ms per token, 782.54 tokens per second)
qwen3_32B-1 | eval time = 12010.56 ms / 393 tokens ( 30.56 ms per token, 32.72 tokens per second)
qwen3_32B-1 | total time = 12724.90 ms / 952 tokens
qwen3_32B-1 | srv update_slots: all slots are idle
qwen3_32B-1 | srv log_server_r: request: POST /v1/chat/completions ipaddress 200
qwen3_32B-1 | srv log_server_r: request: GET /health 127.0.0.1 200
qwen3_32B-1 | srv params_from_: Chat format: Hermes 2 Pro
qwen3_32B-1 | slot launch_slot_: id 0 | task 394 | processing task
qwen3_32B-1 | slot update_slots: id 0 | task 394 | new prompt, n_ctx_slot = 40960, n_keep = 0, n_prompt_tokens = 758
qwen3_32B-1 | slot update_slots: id 0 | task 394 | kv cache rm [559, end)
qwen3_32B-1 | slot update_slots: id 0 | task 394 | prompt processing progress, n_past = 758, n_tokens = 199, progress = 0.262533
qwen3_32B-1 | slot update_slots: id 0 | task 394 | prompt done, n_past = 758, n_tokens = 199
qwen3_32B-1 | slot release: id 0 | task 394 | stop processing: n_past = 874, truncated = 0
qwen3_32B-1 | slot print_timing: id 0 | task 394 |
qwen3_32B-1 | prompt eval time = 278.34 ms / 199 tokens ( 1.40 ms per token, 714.96 tokens per second)
qwen3_32B-1 | eval time = 3556.36 ms / 117 tokens ( 30.40 ms per token, 32.90 tokens per second)
qwen3_32B-1 | total time = 3834.70 ms / 316 tokens
qwen3_32B-1 | srv update_slots: all slots are idle
qwen3_32B-1 | srv log_server_r: request: POST /v1/chat/completions ipaddress 200
qwen3_32B-1 | srv params_from_: Chat format: Hermes 2 Pro
qwen3_32B-1 | slot launch_slot_: id 0 | task 512 | processing task
qwen3_32B-1 | slot update_slots: id 0 | task 512 | new prompt, n_ctx_slot = 40960, n_keep = 0, n_prompt_tokens = 846
qwen3_32B-1 | slot update_slots: id 0 | task 512 | kv cache rm [756, end)
qwen3_32B-1 | slot update_slots: id 0 | task 512 | prompt processing progress, n_past = 846, n_tokens = 90, progress = 0.106383
qwen3_32B-1 | slot update_slots: id 0 | task 512 | prompt done, n_past = 846, n_tokens = 90
qwen3_32B-1 | slot release: id 0 | task 512 | stop processing: n_past = 1104, truncated = 0
qwen3_32B-1 | slot print_timing: id 0 | task 512 |
qwen3_32B-1 | prompt eval time = 142.43 ms / 90 tokens ( 1.58 ms per token, 631.89 tokens per second)
qwen3_32B-1 | eval time = 7905.08 ms / 259 tokens ( 30.52 ms per token, 32.76 tokens per second)
qwen3_32B-1 | total time = 8047.51 ms / 349 tokens
qwen3_32B-1 | srv update_slots: all slots are idle
qwen3_32B-1 | srv log_server_r: request: POST /v1/chat/completions ipaddress 200
qwen3_32B-1 | srv log_server_r: request: GET /health 127.0.0.1 200