
Eval bug: '????????' output with Vulkan backend #17055

@louiswpf

Description

Name and Version

$ ./llama-cli --version
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 6800 XT (RADV NAVI21) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 65536 | int dot: 1 | matrix cores: none
version: 6969 (aa37417)
built with cc (GCC) 15.2.1 20250813 for x86_64-pc-linux-gnu

$ vulkaninfo --summary | rg 'apiVersion|driverInfo|driverName'
apiVersion = 1.4.318
driverName = radv
driverInfo = Mesa 25.2.6-arch1.1

Operating systems

Linux

GGML backends

Vulkan

Hardware

AMD Radeon RX 6800 XT

Models

ggml-org/gpt-oss-20b-GGUF

Problem description & steps to reproduce

The Vulkan backend of llama.cpp produces '????????' output after a few consecutive turns of interactive conversation with a model: the first replies are fine, but a later reply consists entirely of '?' characters (see the log below, where the third user prompt gets an all-'?' response).
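
A minimal way to reproduce (this just condenses the session captured in the log below; the model path is a local copy of ggml-org/gpt-oss-20b-GGUF in MXFP4):

$ ./build/bin/llama-cli -m gpt-oss-20b-mxfp4.gguf --ctx-size 0 --jinja -p "why is the sky blue"
# first reply is normal; stay in the interactive session
> what's topology          # second reply is still normal
> what's philosophy        # third reply is nothing but '?' characters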

First Bad Commit

I've bisected the issue to:
commit d2a2673 ("vulkan: fix shmem overrun in mmq id shader (#16873)")
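
For anyone who wants to re-verify the bisect, the usual git bisect loop works (a sketch only: the 'good' endpoint below is a placeholder for any earlier commit that does not show the bug, and -DGGML_VULKAN=ON is the standard Vulkan build option):

$ git bisect start
$ git bisect bad aa374175c              # current build that shows the '?' output
$ git bisect good <last-known-good>     # placeholder: any earlier commit without the bug
$ cmake -B build -DGGML_VULKAN=ON && cmake --build build -j
$ # re-run the reproduction above, then mark the result
$ git bisect good                       # or 'git bisect bad'; repeat until d2a2673 is reported as first bad
$ git bisect reset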

Relevant log output

$ ./build/bin/llama-cli -m ~/data/llama.cpp/ggml-org_gpt-oss-20b-GGUF_gpt-oss-20b-mxfp4.gguf --ctx-size 0 --jinja -p "why is the sky blue"
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 6800 XT (RADV NAVI21) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 65536 | int dot: 1 | matrix cores: none
build: 6969 (aa374175c) with cc (GCC) 15.2.1 20250813 for x86_64-pc-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device Vulkan0 (AMD Radeon RX 6800 XT (RADV NAVI21)) (0000:0d:00.0) - 14982 MiB free
llama_model_loader: loaded meta data with 35 key-value pairs and 459 tensors from /home/user/data/llama.cpp/ggml-org_gpt-oss-20b-GGUF_gpt-oss-20b-mxfp4.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = gpt-oss
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Gpt Oss 20b
llama_model_loader: - kv   3:                           general.basename str              = gpt-oss
llama_model_loader: - kv   4:                         general.size_label str              = 20B
llama_model_loader: - kv   5:                            general.license str              = apache-2.0
llama_model_loader: - kv   6:                               general.tags arr[str,2]       = ["vllm", "text-generation"]
llama_model_loader: - kv   7:                        gpt-oss.block_count u32              = 24
llama_model_loader: - kv   8:                     gpt-oss.context_length u32              = 131072
llama_model_loader: - kv   9:                   gpt-oss.embedding_length u32              = 2880
llama_model_loader: - kv  10:                gpt-oss.feed_forward_length u32              = 2880
llama_model_loader: - kv  11:               gpt-oss.attention.head_count u32              = 64
llama_model_loader: - kv  12:            gpt-oss.attention.head_count_kv u32              = 8
llama_model_loader: - kv  13:                     gpt-oss.rope.freq_base f32              = 150000.000000
llama_model_loader: - kv  14:   gpt-oss.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  15:                       gpt-oss.expert_count u32              = 32
llama_model_loader: - kv  16:                  gpt-oss.expert_used_count u32              = 4
llama_model_loader: - kv  17:               gpt-oss.attention.key_length u32              = 64
llama_model_loader: - kv  18:             gpt-oss.attention.value_length u32              = 64
llama_model_loader: - kv  19:           gpt-oss.attention.sliding_window u32              = 128
llama_model_loader: - kv  20:         gpt-oss.expert_feed_forward_length u32              = 2880
llama_model_loader: - kv  21:                  gpt-oss.rope.scaling.type str              = yarn
llama_model_loader: - kv  22:                gpt-oss.rope.scaling.factor f32              = 32.000000
llama_model_loader: - kv  23: gpt-oss.rope.scaling.original_context_length u32              = 4096
llama_model_loader: - kv  24:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  25:                         tokenizer.ggml.pre str              = gpt-4o
llama_model_loader: - kv  26:                      tokenizer.ggml.tokens arr[str,201088]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  27:                  tokenizer.ggml.token_type arr[i32,201088]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  28:                      tokenizer.ggml.merges arr[str,446189]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  29:                tokenizer.ggml.bos_token_id u32              = 199998
llama_model_loader: - kv  30:                tokenizer.ggml.eos_token_id u32              = 200002
llama_model_loader: - kv  31:            tokenizer.ggml.padding_token_id u32              = 199999
llama_model_loader: - kv  32:                    tokenizer.chat_template str              = {#-\n  In addition to the normal input...
llama_model_loader: - kv  33:               general.quantization_version u32              = 2
llama_model_loader: - kv  34:                          general.file_type u32              = 38
llama_model_loader: - type  f32:  289 tensors
llama_model_loader: - type q8_0:   98 tensors
llama_model_loader: - type mxfp4:   72 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = MXFP4 MoE
print_info: file size   = 11.27 GiB (4.63 BPW)
load: printing all EOG tokens:
load:   - 199999 ('<|endoftext|>')
load:   - 200002 ('<|return|>')
load:   - 200007 ('<|end|>')
load:   - 200012 ('<|call|>')
load: special_eog_ids contains both '<|return|>' and '<|call|>' tokens, removing '<|end|>' token from EOG list
load: special tokens cache size = 21
load: token to piece cache size = 1.3332 MB
print_info: arch             = gpt-oss
print_info: vocab_only       = 0
print_info: n_ctx_train      = 131072
print_info: n_embd           = 2880
print_info: n_layer          = 24
print_info: n_head           = 64
print_info: n_head_kv        = 8
print_info: n_rot            = 64
print_info: n_swa            = 128
print_info: is_swa_any       = 1
print_info: n_embd_head_k    = 64
print_info: n_embd_head_v    = 64
print_info: n_gqa            = 8
print_info: n_embd_k_gqa     = 512
print_info: n_embd_v_gqa     = 512
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 2880
print_info: n_expert         = 32
print_info: n_expert_used    = 4
print_info: n_expert_groups  = 0
print_info: n_group_used     = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = yarn
print_info: freq_base_train  = 150000.0
print_info: freq_scale_train = 0.03125
print_info: n_ctx_orig_yarn  = 4096
print_info: rope_finetuned   = unknown
print_info: model type       = 20B
print_info: model params     = 20.91 B
print_info: general.name     = Gpt Oss 20b
print_info: n_ff_exp         = 2880
print_info: vocab type       = BPE
print_info: n_vocab          = 201088
print_info: n_merges         = 446189
print_info: BOS token        = 199998 '<|startoftext|>'
print_info: EOS token        = 200002 '<|return|>'
print_info: EOT token        = 199999 '<|endoftext|>'
print_info: PAD token        = 199999 '<|endoftext|>'
print_info: LF token         = 198 'Ċ'
print_info: EOG token        = 199999 '<|endoftext|>'
print_info: EOG token        = 200002 '<|return|>'
print_info: EOG token        = 200012 '<|call|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 24 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 25/25 layers to GPU
load_tensors:   CPU_Mapped model buffer size =   586.82 MiB
load_tensors:      Vulkan0 model buffer size = 10949.35 MiB
.................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 131072
llama_context: n_ctx_seq     = 131072
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = auto
llama_context: kv_unified    = false
llama_context: freq_base     = 150000.0
llama_context: freq_scale    = 0.03125
llama_context: Vulkan_Host  output buffer size =     0.77 MiB
llama_kv_cache_iswa: creating non-SWA KV cache, size = 131072 cells
llama_kv_cache:    Vulkan0 KV buffer size =  3072.00 MiB
llama_kv_cache: size = 3072.00 MiB (131072 cells,  12 layers,  1/1 seqs), K (f16): 1536.00 MiB, V (f16): 1536.00 MiB
llama_kv_cache_iswa: creating     SWA KV cache, size = 640 cells
llama_kv_cache:    Vulkan0 KV buffer size =    15.00 MiB
llama_kv_cache: size =   15.00 MiB (   640 cells,  12 layers,  1/1 seqs), K (f16):    7.50 MiB, V (f16):    7.50 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context:    Vulkan0 compute buffer size =   413.14 MiB
llama_context: Vulkan_Host compute buffer size =   262.90 MiB
llama_context: graph nodes  = 1352
llama_context: graph splits = 2
common_init_from_params: added <|endoftext|> logit bias = -inf
common_init_from_params: added <|return|> logit bias = -inf
common_init_from_params: added <|call|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 131072
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 12
main: chat template is available, enabling conversation mode (disable it with -no-cnv)
*** User-specified prompt will pre-start conversation, did you mean to set --system-prompt (-sys) instead?
main: chat template example:
<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-11-06

Reasoning: medium

# Valid channels: analysis, commentary, final. Channel must be included for every message.<|end|><|start|>developer<|message|># Instructions

You are a helpful assistant

<|end|><|start|>user<|message|>Hello<|end|><|start|>assistant<|channel|>final<|message|>Hi there<|end|><|start|>user<|message|>How are you?<|end|><|start|>assistant

system_info: n_threads = 12 (n_threads_batch = 12) / 24 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |

main: interactive mode on.
sampler seed: 1436927887
sampler params:
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 131072
        top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 131072, n_batch = 2048, n_predict = -1, n_keep = 0

== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to the AI.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.
 - Not using system message. To change it, set a different value via -sys PROMPT

<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-11-06

Reasoning: medium

# Valid channels: analysis, commentary, final. Channel must be included for every message.<|end|><|start|>user<|message|>why is the sky blue<|end|><|start|>assistant<|channel|>analysis<|message|>The user asks: "why is the sky blue". They likely want an explanation of Rayleigh scattering. Provide a concise but thorough answer: scattering by small molecules, wavelength dependence, blue visible. Could mention human eye sensitivity, atmospheric composition, scattering cross-section ~1/λ^4, sky appears blue from scattering of sunlight by nitrogen and oxygen molecules. Also mention other factors: twilight colors, scattering by aerosols, etc. Provide some details. Should be friendly. No need to ask clarifying questions. Provide answer.<|end|><|start|>assistant<|channel|>final<|message|>**Why the sky looks blue**
The blue colour of the daytime sky is a result of *Rayleigh scattering*—the preferential scattering of short‑wavelength light by the molecules that make up the atmosphere.

| Step | What happens | Why it matters |
|------|--------------|----------------|
| 1. Sunlight enters Earth’s atmosphere | The Sun emits a nearly white (full‑spectrum) beam of photons. | The spectrum contains all visible wavelengths (≈400–700 nm). |
| 2. Photons collide with molecules | Air is mostly N₂ and O₂, each about 1 cm³. | These molecules are far smaller than the wavelengths of visible light (≈0.1 µm). |
| 3. Scattering occurs | The electric field of each photon polarises the molecule, which then re‑radiates light in all directions. | The scattering cross‑section ∝ 1/λ⁴ (Rayleigh’s law). |
| 4. Blue photons are scattered more | For λ ≈ 400 nm (blue) the scattering is ~60× larger than for λ ≈ 700 nm (red). | Most of the blue light is redirected into our line of sight from every direction in the sky. |
| 5. Our eyes perceive the result | The sky is illuminated by a diffuse field rich in scattered blue photons; the remaining unscattered light reaches the ground as the familiar daylight. | The human eye is relatively sensitive to blue, so the sky’s colour is dominated by the blue component. |

### Quick “physics” recap

- **Rayleigh scattering** applies to particles much smaller than the wavelength of light.
- Scattering cross‑section σ ∝ **(size⁶ / λ⁴)**.
- For air molecules (~0.1 nm diameter) and visible light, σ is huge for blue (λ ≈ 440 nm) and tiny for red (λ ≈ 700 nm).
- Therefore, blue photons are *much* more likely to be redirected in all directions.

### Why the sky isn’t white

If all wavelengths scattered equally, the sky would appear white.
Because scattering falls off steeply with wavelength, the sky becomes dominated by the short‑wavelength (blue) photons. The remaining longer wavelengths (green, yellow, orange) still reach the ground, giving the sky its characteristic “white‑ish” appearance when you look at a small patch, but the overall diffuse field is blue.

### Extra notes

- **Twilight colors**: At sunrise/sunset the Sun’s light passes through a longer path, so even more blue light is scattered out of the beam, leaving the longer wavelengths (red, orange) to dominate the sky’s hue.
- **Aerosols and pollution**: Larger particles (Mie scattering) scatter all wavelengths almost equally, which can make the sky look whiter or hazier.
- **Polarization**: Scattered blue light is partially polarized; if you look at the sky with a polarizing filter, the brightness changes.

**Bottom line:** The sky is blue because sunlight is scattered by air molecules in a way that preferentially redirects the short‑wavelength blue light into every direction we see.

> what's topology
<|channel|>analysis<|message|>User asks "what's topology". Need to explain concept of topology, a branch of mathematics focusing on properties preserved under continuous deformations. Provide overview, key terms, examples, motivation, applications. Should be accessible. Provide definitions, examples like doughnut vs sphere, continuous maps, open sets, topological spaces, homeomorphism. Maybe mention topology as 'rubber-sheet geometry'. Provide historical context, key people. Provide mention of subfields: algebraic topology, differential topology, geometric topology, etc. Also mention everyday applications. Provide answer.<|end|><|start|>assistant<|channel|>final<|message|>### In a nutshell
**Topology** is the branch of mathematics that studies the *shape* of objects in a way that ignores distances, sizes, or angles, and focuses only on how they can be continuously stretched, compressed, or twisted without tearing or gluing.

Think of a coffee mug and a donut (torus). If you could melt the mug into a doughy material and then reshape it without cutting it or attaching new pieces, you could turn the mug’s handle into the donut’s hole. Topology says those two objects are *the same* (“homeomorphic”), because they share the same fundamental structure.

---

## 1. The basic language

| Concept | What it is | Why it matters |
|---------|------------|----------------|
| **Topological space** | A set \(X\) equipped with a collection \(\tau\) of “open sets” satisfying certain axioms (any union of open sets is open; finite intersections are open; the whole set and the empty set are open). | It generalises familiar notions of open intervals in \(\mathbb{R}\) to arbitrary spaces. |
| **Open set** | Intuitively a set that does not include its boundary. | Open sets are the building blocks of a topology. |
| **Continuous map** | A function \(f:X\to Y\) such that the pre‑image of every open set in \(Y\) is open in \(X\). | The central operation: it preserves the “shape” structure. |
| **Homeomorphism** | A continuous bijection with a continuous inverse. | Declares two spaces equivalent in topology. |
| **Topological invariant** | A property that stays the same under homeomorphisms (e.g., number of holes, connectedness). | Gives tools to distinguish spaces. |

---

## 2. A quick “rubber‑sheet geometry” illustration

* **Stretching**: You can stretch a rubber band without cutting it. Its length changes, but its single‑loop structure remains.
* **Twisting**: Twist a rubber hose; it still has one hole and one loop.
* **Bending**: Bend a pipe; still one loop, same topology.
* **Crushing**: Flatten the pipe into a flat shape; still one loop.
* **Tearing**: Cut the pipe; now two separate pieces → topology changes.

Thus, topology cares about *how many pieces* you have, *how many holes* you can’t fill, etc., not about exact measurements.

---

## 3. Key examples

| Space | Topological invariant | What it tells you |
|-------|-----------------------|-------------------|
| \(\mathbb{R}\) (line) | 1‑dimensional, connected | No holes |
| \(\mathbb{R}^2\) (plane) | 2‑dimensional, connected | No holes |
| Circle \(S^1\) | 1‑dimensional, compact, connected, “hole” (cannot shrink to point) | Cannot be continuously deformed to a point |
| Torus \(T^2\) | 2‑dimensional, has two independent holes | Distinct from sphere |
| Sphere \(S^2\) | 2‑dimensional, compact, simply‑connected (no holes) | Distinct from torus |

*The number of “holes” is captured by tools like **Euler characteristic** (\(\chi\)) or **Betti numbers** in algebraic topology.*

---

## 4. Main subfields

| Field | Focus | Typical tools |
|-------|-------|---------------|
| **Algebraic topology** | Uses algebra (groups, homology, cohomology) to study topological invariants. | Homology, homotopy, fundamental group, cohomology rings. |
| **Differential topology** | Studies smooth manifolds and smooth maps. | Tangent spaces, differential forms, Morse theory. |
| **Geometric topology** | Concerns low‑dimensional manifolds (3‑ and 4‑dimensional). | Knot theory, 3‑manifold classification. |
| **Point‑set topology** | The foundational theory of topological spaces, continuity, compactness, connectedness. | Separation axioms, nets/filters, topology of function spaces. |

---

## 5. Why it matters

1. **Pure math** – Provides a unifying language across geometry, analysis, algebra, and logic.
2. **Physics** – Topology underpins *topological phases* of matter, quantum field theory, string theory.
3. **Computer science** – Topological data analysis (persistent homology) extracts shape from high‑dimensional data.
4. **Engineering** – Robotics (configuration spaces), computer graphics (mesh topology), network design.
5. **Biology** – DNA knotting, protein folding, brain connectivity networks.

---

## 6. Quick “how‑to” check if two spaces are homeomorphic

| Step | What to look for |
|------|------------------|
| 1. Count connected components | Homeomorphisms preserve connectivity. |
| 2. Count holes (Betti numbers) | e.g., a torus has 2 holes, a sphere 0. |
| 3. Check compactness, connectedness, and boundary properties | Homeomorphisms preserve compactness. |
| 4. Find an explicit continuous bijection with a continuous inverse | If you can write it down, you’re done. |

If any invariant differs, the spaces are *not* homeomorphic.

---

## 7. A bit of history

| Year | Event | Who |
|------|-------|-----|
| 1872 | “Topological spaces” coined | Henri Poincaré |
| 1904 | First rigorous definition of topological space | Felix Hausdorff |
| 1941 | Homology theory developed | P.S. Alexandrov, H. Borsuk |
| 1945 | Fundamental group formalised | J. H. C. Whitehead |
| 1950s‑60s | Algebraic topology blossoms | Eilenberg, Steenrod |

---

## 8. Bottom line

*Topology asks: “Which properties of a space survive any continuous deformation?”*
It strips away the metric details (length, angle) and keeps only the essential “shape” features—number of components, holes, and how they’re arranged. This abstraction has turned out to be a powerful lens for understanding geometry, physics, data, and many other disciplines.

> what's philosophy
????????????????????????????????????????????????????????????????????????????????????????????????
