Misc. bug: [SYCL][Intel GPU] GPT-OSS 20B not loading fully to VRAM with -ngl 99 (A770) #17643

@savvadesogle

Name and Version

llama-cli -v
build: 7195 (47a268e) with Intel(R) oneAPI DPC++/C++ Compiler 2025.2.1 (2025.2.0.20250806) for x86_64-unknown-linux-gnu

Operating systems

Linux

Which llama.cpp modules do you know to be affected?

llama-server, llama-bench, llama-cli

Command line

./llama-server -m /home/arc/llm/models/gpt-oss-20b-MXFP4.gguf -fa 0 -ngl 999
./llama-bench -m /home/arc/llm/models/gpt-oss-20b-MXFP4.gguf -fa 0,1 -ngl 999

Problem description & steps to reproduce

Hello.

  1. I built the binaries following the SYCL instructions:
    https://github.com/ggml-org/llama.cpp/blob/master/docs/backend/SYCL.md
  2. I run the benchmark or start the server.
  3. Only part of the VRAM is used, and the CPU is heavily loaded because inference runs on it.
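
Step 3 can be checked from the two `load_tensors` buffer lines in the log output further down. The following is a minimal sketch, not part of llama.cpp: the helper name `buffer_sizes` and the regular expression are illustrative assumptions, and the two log lines are copied from this report.

```python
import re

# Two lines copied verbatim from the log in this report.
LOG = """\
load_tensors:   CPU_Mapped model buffer size = 10949.33 MiB
load_tensors:        SYCL0 model buffer size =  1242.03 MiB
"""

# Hypothetical pattern for the "model buffer size" lines; not a llama.cpp API.
PATTERN = re.compile(r"load_tensors:\s+(\S+) model buffer size =\s+([\d.]+) MiB")


def buffer_sizes(log_text):
    """Return a mapping of buffer name to size in MiB."""
    return {name: float(size) for name, size in PATTERN.findall(log_text)}


sizes = buffer_sizes(LOG)
total = sum(sizes.values())
gpu_share = sizes["SYCL0"] / total

for name, size in sizes.items():
    print(f"{name:>12}: {size:9.2f} MiB")
print(f"share of model buffers on SYCL0: {gpu_share:.1%}")  # about 10.2%
```

Even though the log reports `offloaded 25/25 layers to GPU`, only about a tenth of the reported model buffer size is placed in the SYCL0 buffer.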

Model: https://huggingface.co/lmstudio-community/gpt-oss-20b-GGUF or https://huggingface.co/ggml-org/gpt-oss-20b-GGUF

OS: Ubuntu 25.10
Motherboard: ASUSTeK COMPUTER INC. Z10PE-D8 WS
CPU: 2x Intel® Xeon® E5-2699 v3
GPU: Intel Arc A770
uname -a:
Linux xpu 6.17.9-061709-generic #202511241048 SMP PREEMPT_DYNAMIC Mon Nov 24 12:51:41 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux

arc@xpu:~$ inxi -G
Graphics:
  Device-1: Intel DG2 [Arc A770] driver: i915 v: kernel
  Device-2: Intel DG2 [Arc A770] driver: i915 v: kernel
  Display: wayland server: X.Org v: 24.1.6 with: Xwayland v: 24.1.6
    compositor: gnome-shell v: 49.0 driver: X: loaded: N/A
    unloaded: fbdev,modesetting,vesa dri: iris gpu: i915
    resolution: 3440x1440~144Hz
  API: EGL v: 1.5 drivers: iris,swrast
    platforms: gbm,wayland,x11,surfaceless,device
  API: OpenGL v: 4.6 compat-v: 4.5 vendor: intel mesa v: 25.2.3-1ubuntu1
    renderer: Mesa Intel Arc A770 Graphics (DG2)
  API: Vulkan v: 1.4.321 drivers: intel,llvmpipe surfaces: N/A
  Info: Tools: api: clinfo, eglinfo, glxinfo, vulkaninfo gpu: gputop,
    intel_gpu_top, lsgpu, lact x11: xdriinfo, xdpyinfo, xprop, xrandr

First Bad Commit

No response

Relevant log output

arc@xpu:~/llama-cpp-sycl/llama.cpp/build/bin$ ./llama-server -m /home/arc/llm/models/gpt-oss-20b-MXFP4.gguf -fa 0 -ngl 999
main: setting n_parallel = 4 and kv_unified = true (add -kvu to disable this)
build: 7195 (47a268ea5) with Intel(R) oneAPI DPC++/C++ Compiler 2025.2.1 (2025.2.0.20250806) for x86_64-unknown-linux-gnu
system info: n_threads = 36, n_threads_batch = 36, total_threads = 72

system_info: n_threads = 36 (n_threads_batch = 36) / 72 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | 

init: using 71 threads for HTTP server
start: binding port with default address family
main: loading model
srv    load_model: loading model '/home/arc/llm/models/gpt-oss-20b-MXFP4.gguf'
llama_model_load_from_file_impl: using device SYCL0 (Intel(R) Arc(TM) A770 Graphics) (unknown id) - 15473 MiB free
llama_model_loader: loaded meta data with 33 key-value pairs and 459 tensors from /home/arc/llm/models/gpt-oss-20b-MXFP4.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = gpt-oss
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Openai_Gpt Oss 20b
llama_model_loader: - kv   3:                           general.basename str              = openai_gpt-oss
llama_model_loader: - kv   4:                         general.size_label str              = 20B
llama_model_loader: - kv   5:                        gpt-oss.block_count u32              = 24
llama_model_loader: - kv   6:                     gpt-oss.context_length u32              = 131072
llama_model_loader: - kv   7:                   gpt-oss.embedding_length u32              = 2880
llama_model_loader: - kv   8:                gpt-oss.feed_forward_length u32              = 2880
llama_model_loader: - kv   9:               gpt-oss.attention.head_count u32              = 64
llama_model_loader: - kv  10:            gpt-oss.attention.head_count_kv u32              = 8
llama_model_loader: - kv  11:                     gpt-oss.rope.freq_base f32              = 150000.000000
llama_model_loader: - kv  12:   gpt-oss.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  13:                       gpt-oss.expert_count u32              = 32
llama_model_loader: - kv  14:                  gpt-oss.expert_used_count u32              = 4
llama_model_loader: - kv  15:               gpt-oss.attention.key_length u32              = 64
llama_model_loader: - kv  16:             gpt-oss.attention.value_length u32              = 64
llama_model_loader: - kv  17:           gpt-oss.attention.sliding_window u32              = 128
llama_model_loader: - kv  18:         gpt-oss.expert_feed_forward_length u32              = 2880
llama_model_loader: - kv  19:                  gpt-oss.rope.scaling.type str              = yarn
llama_model_loader: - kv  20:                gpt-oss.rope.scaling.factor f32              = 32.000000
llama_model_loader: - kv  21: gpt-oss.rope.scaling.original_context_length u32              = 4096
llama_model_loader: - kv  22:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  23:                         tokenizer.ggml.pre str              = gpt-4o
llama_model_loader: - kv  24:                      tokenizer.ggml.tokens arr[str,201088]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  25:                  tokenizer.ggml.token_type arr[i32,201088]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  26:                      tokenizer.ggml.merges arr[str,446189]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  27:                tokenizer.ggml.bos_token_id u32              = 199998
llama_model_loader: - kv  28:                tokenizer.ggml.eos_token_id u32              = 200002
llama_model_loader: - kv  29:            tokenizer.ggml.padding_token_id u32              = 199999
llama_model_loader: - kv  30:                    tokenizer.chat_template str              = {#-\n  In addition to the normal input...
llama_model_loader: - kv  31:               general.quantization_version u32              = 2
llama_model_loader: - kv  32:                          general.file_type u32              = 38
llama_model_loader: - type  f32:  289 tensors
llama_model_loader: - type q8_0:   98 tensors
llama_model_loader: - type mxfp4:   72 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = MXFP4 MoE
print_info: file size   = 11.27 GiB (4.63 BPW) 
load: printing all EOG tokens:
load:   - 199999 ('<|endoftext|>')
load:   - 200002 ('<|return|>')
load:   - 200007 ('<|end|>')
load:   - 200012 ('<|call|>')
load: special_eog_ids contains both '<|return|>' and '<|call|>' tokens, removing '<|end|>' token from EOG list
load: special tokens cache size = 21
load: token to piece cache size = 1.3332 MB
print_info: arch             = gpt-oss
print_info: vocab_only       = 0
print_info: n_ctx_train      = 131072
print_info: n_embd           = 2880
print_info: n_embd_inp       = 2880
print_info: n_layer          = 24
print_info: n_head           = 64
print_info: n_head_kv        = 8
print_info: n_rot            = 64
print_info: n_swa            = 128
print_info: is_swa_any       = 1
print_info: n_embd_head_k    = 64
print_info: n_embd_head_v    = 64
print_info: n_gqa            = 8
print_info: n_embd_k_gqa     = 512
print_info: n_embd_v_gqa     = 512
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 2880
print_info: n_expert         = 32
print_info: n_expert_used    = 4
print_info: n_expert_groups  = 0
print_info: n_group_used     = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = yarn
print_info: freq_base_train  = 150000.0
print_info: freq_scale_train = 0.03125
print_info: n_ctx_orig_yarn  = 4096
print_info: rope_finetuned   = unknown
print_info: model type       = 20B
print_info: model params     = 20.91 B
print_info: general.name     = Openai_Gpt Oss 20b
print_info: n_ff_exp         = 2880
print_info: vocab type       = BPE
print_info: n_vocab          = 201088
print_info: n_merges         = 446189
print_info: BOS token        = 199998 '<|startoftext|>'
print_info: EOS token        = 200002 '<|return|>'
print_info: EOT token        = 199999 '<|endoftext|>'
print_info: PAD token        = 199999 '<|endoftext|>'
print_info: LF token         = 198 'Ċ'
print_info: EOG token        = 199999 '<|endoftext|>'
print_info: EOG token        = 200002 '<|return|>'
print_info: EOG token        = 200012 '<|call|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 24 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 25/25 layers to GPU
load_tensors:   CPU_Mapped model buffer size = 10949.33 MiB
load_tensors:        SYCL0 model buffer size =  1242.03 MiB
................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 4
llama_context: n_ctx         = 4096
llama_context: n_ctx_seq     = 4096
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = disabled
llama_context: kv_unified    = true
llama_context: freq_base     = 150000.0
llama_context: freq_scale    = 0.03125
llama_context: n_ctx_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
Running with Environment Variables:
  GGML_SYCL_DEBUG: 0
  GGML_SYCL_DISABLE_OPT: 0
  GGML_SYCL_DISABLE_GRAPH: 1
  GGML_SYCL_DISABLE_DNN: 0
  GGML_SYCL_PRIORITIZE_DMMV: 0
Build with Macros:
  GGML_SYCL_FORCE_MMQ: no
  GGML_SYCL_F16: no
Found 1 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|                Intel Arc A770 Graphics|  12.55|    512|    1024|   32| 16225M|          1.6.35096+9|
SYCL Optimization Feature:
|ID|        Device Type|Reorder|
|--|-------------------|-------|
| 0| [level_zero:gpu:0]|      Y|
llama_context:  SYCL_Host  output buffer size =     3.07 MiB
llama_kv_cache_iswa: creating non-SWA KV cache, size = 4096 cells
llama_kv_cache:      SYCL0 KV buffer size =    96.00 MiB
llama_kv_cache: size =   96.00 MiB (  4096 cells,  12 layers,  4/1 seqs), K (f16):   48.00 MiB, V (f16):   48.00 MiB
llama_kv_cache_iswa: creating     SWA KV cache, size = 1024 cells
llama_kv_cache:      SYCL0 KV buffer size =    24.00 MiB
llama_kv_cache: size =   24.00 MiB (  1024 cells,  12 layers,  4/1 seqs), K (f16):   12.00 MiB, V (f16):   12.00 MiB
llama_context:      SYCL0 compute buffer size =   563.39 MiB
llama_context:  SYCL_Host compute buffer size =    87.14 MiB
llama_context: graph nodes  = 1470
llama_context: graph splits = 50
common_init_from_params: added <|endoftext|> logit bias = -inf
common_init_from_params: added <|return|> logit bias = -inf
common_init_from_params: added <|call|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv          init: initializing slots, n_slots = 4
slot         init: id  0 | task -1 | new slot, n_ctx = 4096
slot         init: id  1 | task -1 | new slot, n_ctx = 4096
slot         init: id  2 | task -1 | new slot, n_ctx = 4096
slot         init: id  3 | task -1 | new slot, n_ctx = 4096
srv          init: prompt cache is enabled, size limit: 8192 MiB
srv          init: use `--cache-ram 0` to disable the prompt cache
srv          init: for more info see https://github.com/ggml-org/llama.cpp/pull/16391
srv          init: thinking = 0
init: chat template, chat_template: {#-
  In addition to the normal inputs of `messages` and `tools`, this template also accepts the
  following kwargs:
  - "builtin_tools": A list, can contain "browser" and/or "python".
  - "model_identity": A string that optionally describes the model identity.
  - "reasoning_effort": A string that describes the reasoning effort, defaults to "medium".
 #}

{#- Tool Definition Rendering ============================================== #}
{%- macro render_typescript_type(param_spec, required_params, is_nullable=false) -%}
    {%- if param_spec.type == "array" -%}
        {%- if param_spec['items'] -%}
            {%- if param_spec['items']['type'] == "string" -%}
                {{- "string[]" }}
            {%- elif param_spec['items']['type'] == "number" -%}
                {{- "number[]" }}
            {%- elif param_spec['items']['type'] == "integer" -%}
                {{- "number[]" }}
            {%- elif param_spec['items']['type'] == "boolean" -%}
                {{- "boolean[]" }}
            {%- else -%}
                {%- set inner_type = render_typescript_type(param_spec['items'], required_params) -%}
                {%- if inner_type == "object | object" or inner_type|length > 50 -%}
                    {{- "any[]" }}
                {%- else -%}
                    {{- inner_type + "[]" }}
                {%- endif -%}
            {%- endif -%}
            {%- if param_spec.nullable -%}
                {{- " | null" }}
            {%- endif -%}
        {%- else -%}
            {{- "any[]" }}
            {%- if param_spec.nullable -%}
                {{- " | null" }}
            {%- endif -%}
        {%- endif -%}
    {%- elif param_spec.type is defined and param_spec.type is iterable and param_spec.type is not string and param_spec.type is not mapping and param_spec.type[0] is defined -%}
        {#- Handle array of types like ["object", "object"] from Union[dict, list] #}
        {%- if param_spec.type | length > 1 -%}
            {{- param_spec.type | join(" | ") }}
        {%- else -%}
            {{- param_spec.type[0] }}
        {%- endif -%}
    {%- elif param_spec.oneOf -%}
        {#- Handle oneOf schemas - check for complex unions and fallback to any #}
        {%- set has_object_variants = false -%}
        {%- for variant in param_spec.oneOf -%}
            {%- if variant.type == "object" -%}
                {%- set has_object_variants = true -%}
            {%- endif -%}
        {%- endfor -%}
        {%- if has_object_variants and param_spec.oneOf|length > 1 -%}
            {{- "any" }}
        {%- else -%}
            {%- for variant in param_spec.oneOf -%}
                {{- render_typescript_type(variant, required_params) -}}
                {%- if variant.description %}
                    {{- "// " + variant.description }}
                {%- endif -%}
                {%- if variant.default is defined %}
                    {{ "// default: " + variant.default|tojson }}
                {%- endif -%}
                {%- if not loop.last %}
                    {{- " | " }}
                {% endif -%}
            {%- endfor -%}
        {%- endif -%}
    {%- elif param_spec.type == "string" -%}
        {%- if param_spec.enum -%}
            {{- '"' + param_spec.enum|join('" | "') + '"' -}}
        {%- else -%}
            {{- "string" }}
            {%- if param_spec.nullable %}
                {{- " | null" }}
            {%- endif -%}
        {%- endif -%}
    {%- elif param_spec.type == "number" -%}
        {{- "number" }}
    {%- elif param_spec.type == "integer" -%}
        {{- "number" }}
    {%- elif param_spec.type == "boolean" -%}
        {{- "boolean" }}

    {%- elif param_spec.type == "object" -%}
        {%- if param_spec.properties -%}
            {{- "{
" }}
            {%- for prop_name, prop_spec in param_spec.properties.items() -%}
                {{- prop_name -}}
                {%- if prop_name not in (param_spec.required or []) -%}
                    {{- "?" }}
                {%- endif -%}
                {{- ": " }}
                {{ render_typescript_type(prop_spec, param_spec.required or []) }}
                {%- if not loop.last -%}
                    {{-", " }}
                {%- endif -%}
            {%- endfor -%}
            {{- "}" }}
        {%- else -%}
            {{- "object" }}
        {%- endif -%}
    {%- else -%}
        {{- "any" }}
    {%- endif -%}
{%- endmacro -%}

{%- macro render_tool_namespace(namespace_name, tools) -%}
    {{- "## " + namespace_name + "

" }}
    {{- "namespace " + namespace_name + " {

" }}
    {%- for tool in tools %}
        {%- set tool = tool.function %}
        {{- "// " + tool.description + "
" }}
        {{- "type "+ tool.name + " = " }}
        {%- if tool.parameters and tool.parameters.properties %}
            {{- "(_: {
" }}
            {%- for param_name, param_spec in tool.parameters.properties.items() %}
                {%- if param_spec.description %}
                    {{- "// " + param_spec.description + "
" }}
                {%- endif %}
                {{- param_name }}
                {%- if param_name not in (tool.parameters.required or []) -%}
                    {{- "?" }}
                {%- endif -%}
                {{- ": " }}
                {{- render_typescript_type(param_spec, tool.parameters.required or []) }}
                {%- if param_spec.default is defined -%}
                    {%- if param_spec.enum %}
                        {{- ", // default: " + param_spec.default }}
                    {%- elif param_spec.oneOf %}
                        {{- "// default: " + param_spec.default }}
                    {%- else %}
                        {{- ", // default: " + param_spec.default|tojson }}
                    {%- endif -%}
                {%- endif -%}
                {%- if not loop.last %}
                    {{- ",
" }}
                {%- else %}
                    {{- "
" }}
                {%- endif -%}
            {%- endfor %}
            {{- "}) => any;

" }}
        {%- else -%}
            {{- "() => any;

" }}
        {%- endif -%}
    {%- endfor %}
    {{- "} // namespace " + namespace_name }}
{%- endmacro -%}

{%- macro render_builtin_tools(browser_tool, python_tool) -%}
    {%- if browser_tool %}
        {{- "## browser

" }}
        {{- "// Tool for browsing.
" }}
        {{- "// The `cursor` appears in brackets before each browsing display: `[{cursor}]`.
" }}
        {{- "// Cite information from the tool using the following format:
" }}
        {{- "// `【{cursor}†L{line_start}(-L{line_end})?`, for example: `【6†L9-L11】` or `【8†L3】`.
" }}
        {{- "// Do not quote more than 10 words directly from the tool output.
" }}
        {{- "// sources=web (default: web)
" }}
        {{- "namespace browser {

" }}
        {{- "// Searches for information related to `query` and displays `topn` results.
" }}
        {{- "type search = (_: {
" }}
        {{- "query: string,
" }}
        {{- "topn?: number, // default: 10
" }}
        {{- "source?: string,
" }}
        {{- "}) => any;

" }}
        {{- "// Opens the link `id` from the page indicated by `cursor` starting at line number `loc`, showing `num_lines` lines.
" }}
        {{- "// Valid link ids are displayed with the formatting: `【{id}†.*`.
" }}
        {{- "// If `cursor` is not provided, the most recent page is implied.
" }}
        {{- "// If `id` is a string, it is treated as a fully qualified URL associated with `source`.
" }}
        {{- "// If `loc` is not provided, the viewport will be positioned at the beginning of the document or centered on the most relevant passage, if available.
" }}
        {{- "// Use this function without `id` to scroll to a new location of an opened page.
" }}
        {{- "type open = (_: {
" }}
        {{- "id?: number | string, // default: -1
" }}
        {{- "cursor?: number, // default: -1
" }}
        {{- "loc?: number, // default: -1
" }}
        {{- "num_lines?: number, // default: -1
" }}
        {{- "view_source?: boolean, // default: false
" }}
        {{- "source?: string,
" }}
        {{- "}) => any;

" }}
        {{- "// Finds exact matches of `pattern` in the current page, or the page given by `cursor`.
" }}
        {{- "type find = (_: {
" }}
        {{- "pattern: string,
" }}
        {{- "cursor?: number, // default: -1
" }}
        {{- "}) => any;

" }}
        {{- "} // namespace browser

" }}
    {%- endif -%}

    {%- if python_tool %}
        {{- "## python

" }}
        {{- "Use this tool to execute Python code in your chain of thought. The code will not be shown to the user. This tool should be used for internal reasoning, but not for code that is intended to be visible to the user (e.g. when creating plots, tables, or files).

" }}
        {{- "When you send a message containing Python code to python, it will be executed in a stateful Jupyter notebook environment. python will respond with the output of the execution or time out after 120.0 seconds. The drive at '/mnt/data' can be used to save and persist user files. Internet access for this session is UNKNOWN. Depends on the cluster.

" }}
    {%- endif -%}
{%- endmacro -%}

{#- System Message Construction ============================================ #}
{%- macro build_system_message() -%}
    {%- if model_identity is not defined %}
        {%- set model_identity = "You are ChatGPT, a large language model trained by OpenAI." %}
    {%- endif %}
    {{- model_identity + "
" }}
    {{- "Knowledge cutoff: 2024-06
" }}
    {{- "Current date: " + strftime_now("%Y-%m-%d") + "

" }}
    {%- if reasoning_effort is not defined %}
        {%- set reasoning_effort = "medium" %}
    {%- endif %}
    {{- "Reasoning: " + reasoning_effort + "

" }}
    {%- if builtin_tools %}
        {{- "# Tools

" }}
        {%- set available_builtin_tools = namespace(browser=false, python=false) %}
        {%- for tool in builtin_tools %}
            {%- if tool == "browser" %}
                {%- set available_builtin_tools.browser = true %}
            {%- elif tool == "python" %}
                {%- set available_builtin_tools.python = true %}
            {%- endif %}
        {%- endfor %}
        {{- render_builtin_tools(available_builtin_tools.browser, available_builtin_tools.python) }}
    {%- endif -%}
    {{- "# Valid channels: analysis, commentary, final. Channel must be included for every message." }}
    {%- if tools -%}
        {{- "
Calls to these tools must go to the commentary channel: 'functions'." }}
    {%- endif -%}
{%- endmacro -%}

{#- Main Template Logic ================================================= #}
{#- Set defaults #}

{#- Render system message #}
{{- "<|start|>system<|message|>" }}
{{- build_system_message() }}
{{- "<|end|>" }}

{#- Extract developer message #}
{%- if messages[0].role == "developer" or messages[0].role == "system" %}
    {%- set developer_message = messages[0].content %}
    {%- set loop_messages = messages[1:] %}
{%- else %}
    {%- set developer_message = "" %}
    {%- set loop_messages = messages %}
{%- endif %}

{#- Render developer message #}
{%- if developer_message or tools %}
    {{- "<|start|>developer<|message|>" }}
    {%- if developer_message %}
        {{- "# Instructions

" }}
        {{- developer_message }}
    {%- endif %}
    {%- if tools -%}
        {{- "

" }}
        {{- "# Tools

" }}
        {{- render_tool_namespace("functions", tools) }}
    {%- endif -%}
    {{- "<|end|>" }}
{%- endif %}

{#- Render messages #}
{%- set last_tool_call = namespace(name=none) %}
{%- for message in loop_messages -%}
    {#- At this point only assistant/user/tool messages should remain #}
    {%- if message.role == 'assistant' -%}
        {#- Checks to ensure the messages are being passed in the format we expect #}
        {%- if "content" in message %}
            {%- if false %}
                {{- raise_exception("You have passed a message containing <|channel|> tags in the content field. Instead of doing this, you should pass analysis messages (the string between '<|message|>' and '<|end|>') in the 'thinking' field, and final messages (the string between '<|message|>' and '<|end|>') in the 'content' field.") }}
            {%- endif %}
        {%- endif %}
        {%- if "thinking" in message %}
            {%- if "<|channel|>analysis<|message|>" in message.thinking or "<|channel|>final<|message|>" in message.thinking %}
                {{- raise_exception("You have passed a message containing <|channel|> tags in the thinking field. Instead of doing this, you should pass analysis messages (the string between '<|message|>' and '<|end|>') in the 'thinking' field, and final messages (the string between '<|message|>' and '<|end|>') in the 'content' field.") }}
            {%- endif %}
        {%- endif %}
        {%- if "tool_calls" in message %}
            {#- We assume max 1 tool call per message, and so we infer the tool call name #}
            {#- in "tool" messages from the most recent assistant tool call name #}
            {%- set tool_call = message.tool_calls[0] %}
            {%- if tool_call.function %}
                {%- set tool_call = tool_call.function %}
            {%- endif %}
            {%- if message.content and message.thinking %}
                {{- raise_exception("Cannot pass both content and thinking in an assistant message with tool calls! Put the analysis message in one or the other, but not both.") }}
            {%- elif message.content %}
                {{- "<|start|>assistant<|channel|>analysis<|message|>" + message.content + "<|end|>" }}
            {%- elif message.thinking %}
                {{- "<|start|>assistant<|channel|>analysis<|message|>" + message.thinking + "<|end|>" }}
            {%- endif %}
            {{- "<|start|>assistant to=" }}
            {{- "functions." + tool_call.name + "<|channel|>commentary " }}
            {{- (tool_call.content_type if tool_call.content_type is defined else "json") + "<|message|>" }}
            {{- tool_call.arguments|tojson }}
            {{- "<|call|>" }}
            {%- set last_tool_call.name = tool_call.name %}
        {%- elif loop.last and not add_generation_prompt %}
            {#- Only render the CoT if the final turn is an assistant turn and add_generation_prompt is false #}
            {#- This is a situation that should only occur in training, never in inference. #}
            {%- if "thinking" in message %}
                {{- "<|start|>assistant<|channel|>analysis<|message|>" + message.thinking + "<|end|>" }}
            {%- endif %}
            {#- <|return|> indicates the end of generation, but <|end|> does not #}
            {#- <|return|> should never be an input to the model, but we include it as the final token #}
            {#- when training, so the model learns to emit it. #}
            {{- "<|start|>assistant<|channel|>final<|message|>" + message.content + "<|return|>" }}
        {%- else %}
            {#- CoT is dropped during all previous turns, so we never render it for inference #}
            {{- "<|start|>assistant<|channel|>final<|message|>" + message.content + "<|end|>" }}
            {%- set last_tool_call.name = none %}
        {%- endif %}
    {%- elif message.role == 'tool' -%}
        {%- if last_tool_call.name is none %}
            {{- raise_exception("Message has tool role, but there was no previous assistant message with a tool call!") }}
        {%- endif %}
        {{- "<|start|>functions." + last_tool_call.name }}
        {{- " to=assistant<|channel|>commentary<|message|>" + message.content|tojson + "<|end|>" }}
    {%- elif message.role == 'user' -%}
        {{- "<|start|>user<|message|>" + message.content + "<|end|>" }}
    {%- endif -%}
{%- endfor -%}

{#- Generation prompt #}
{%- if add_generation_prompt -%}
<|start|>assistant
{%- endif -%}, example_format: '<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-12-01

Reasoning: medium

# Valid channels: analysis, commentary, final. Channel must be included for every message.<|end|><|start|>developer<|message|># Instructions

You are a helpful assistant<|end|><|start|>user<|message|>Hello<|end|><|start|>assistant<|channel|>final<|message|>Hi there<|end|><|start|>user<|message|>How are you?<|end|><|start|>assistant'
main: model loaded
main: server is listening on http://127.0.0.1:8080
main: starting the main loop...
srv  update_slots: all slots are idle
^Csrv    operator(): operator(): cleaning up before exit...
llama_memory_breakdown_print: | memory breakdown [MiB]                     | total    free     self   model   context   compute       unaccounted |
llama_memory_breakdown_print: |   - SYCL0 (Intel(R) Arc(TM) A770 Graphics) | 15473 = 15473 + ( 1925 =  1242 +     120 +     563) + 17592186042490 |
llama_memory_breakdown_print: |   - Host                                   |                  11036 = 10949 +       0 +      87                   |
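
The `unaccounted` figure of 17592186042490 MiB in the SYCL0 row looks like a negative value printed as unsigned. The sketch below decodes it under an assumption that has not been checked against the llama.cpp source: that the value is an unsigned 64-bit byte count divided by 1 MiB, so it wraps at 2^64 / 2^20 = 2^44. The helper name `decode_wrapped_mib` is hypothetical.

```python
MIB = 1024 * 1024
U64 = 2 ** 64


def decode_wrapped_mib(printed_mib):
    """Interpret a MiB figure as a wrapped unsigned 64-bit byte count.

    Assumption: the printed number is (bytes mod 2**64) // MiB, so values in
    the upper half of the range stand for negative amounts.
    """
    wrap_point = U64 // MIB  # 2**44
    if printed_mib >= wrap_point // 2:
        return printed_mib - wrap_point
    return printed_mib


printed = 17592186042490  # "unaccounted" column from the SYCL0 row above
print(decode_wrapped_mib(printed))  # -1926
```

Under that assumption the value is about -1926 MiB, which is close to the 1925 MiB in the `self` column. This is consistent with the reported `free` value (15473 MiB, equal to `total`) not decreasing after the buffers were allocated.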

Labels

SYCL (https://en.wikipedia.org/wiki/SYCL - GPU programming language), performance (Speed related topics)
