Eval bug: Qwen3.5-35B-a3b full prompt re-processing with Claude Code #20003

@anubhavgupta

Description

Name and Version

version: 8179 (ecbcb7e)

Operating systems

Windows

GGML backends

CUDA

Hardware

RTX 5090 Mobile (24 GB VRAM), CUDA 13.1.

Models

Unsloth's Qwen3.5-35B-A3B-UD-Q4_K_M.gguf

Problem description & steps to reproduce

Arguments used to launch llama.cpp:
-m C:\Users\anubh.lmstudio\models\lmstudio-community\Qwen3.5-35b-a3b-GGUF\Qwen3.5-35B-A3B-UD-Q4_K_M.gguf -ngl 99 -t 23 --temp 0.6 --top-k 20 --top-p 0.95 --parallel 1 --mlock --swa-full -c 200000 -ctk q8_0 -ctv q8_0 -fa on --jinja --reasoning-budget 0 --host 0.0.0.0 -fit off

When using the model with Claude Code, full prompt re-processing happens on every request (as can be seen in the logs below). This behaviour is not observed with OpenCode.
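The `sim_best = 0.810` in the slot-selection log line matches `n_past / task.n_tokens` = 15126 / 18673 from the same logs, i.e. 81% of the new prompt is a prefix already held in the slot's cache. A minimal sketch of how such a longest-common-prefix (LCP) similarity can be computed (this is an illustration using the log's numbers, not llama.cpp's actual implementation):

```python
def lcp_similarity(cached_tokens, new_tokens):
    """Fraction of the new prompt that matches the cached token prefix."""
    n = 0
    for a, b in zip(cached_tokens, new_tokens):
        if a != b:
            break
        n += 1
    return n / max(len(new_tokens), 1)

# Synthetic token streams shaped like the log:
# 15126 shared prefix tokens, then the prompts diverge.
shared = list(range(15126))
cached = shared + [10_000_000 + i for i in range(18662 - 15126)]
new    = shared + [20_000_000 + i for i in range(18673 - 15126)]

sim = lcp_similarity(cached, new)
print(round(sim, 3))  # 0.810, matching sim_best in the log
```

Despite this high similarity passing the 0.100 threshold, the cached prefix is then discarded, which is the bug being reported.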

Repro steps:
1. Load the model.
2. Configure Claude Code to use the local model.
3. Go into any codebase and delete any existing CLAUDE.md file.
4. Run /init inside Claude Code.

Expected:

  • Full prompt re-processing should not happen, i.e. the "forcing full prompt re-processing due to lack of cache data" log line should not appear.

Actual:

  • Full prompt re-processing happens on every request.

First Bad Commit

No response

Relevant log output

Logs
STDERR: srv  params_from_: Chat format: peg-constructed

STDERR: slot get_availabl: id  0 | task -1 | selected slot by LCP similarity, sim_best = 0.810 (> 0.100 thold), f_keep = 0.811

STDERR: slot launch_slot_: id  0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> temp-ext -> dist
slot launch_slot_: id  0 | task 745 | processing task, is_child = 0
slot update_slots: id  0 | task 745 | new prompt, n_ctx_slot = 200192, n_keep = 0, task.n_tokens = 18673
slot update_slots: id  0 | task 745 | n_past = 15126, slot.prompt.tokens.size() = 18662, seq_id = 0, pos_min = 18661, n_swa = 1
slot update_slots: id  0 | task 745 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
slot update_slots: id  0 | task 745 | erased invalidated context checkpoint (pos_min = 18134, pos_max = 18134, n_tokens = 18135, n_swa = 1, size = 62.813 MiB)

STDERR: slot update_slots: id  0 | task 745 | n_tokens = 0, memory_seq_rm [0, end)
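A back-of-the-envelope sketch of the cost this bug implies, using only the numbers from the log above (this is an assumed interpretation of `n_past` and `task.n_tokens`, not code from llama.cpp):

```python
# Numbers taken directly from the "new prompt" log line for task 745.
new_prompt_tokens = 18673  # task.n_tokens
reusable_prefix   = 15126  # n_past: tokens that match the cached prefix

# If the cache were reused, only the non-matching suffix would be processed.
with_cache_reuse = new_prompt_tokens - reusable_prefix

# With forced full re-processing (as in this bug), everything is reprocessed.
without_cache_reuse = new_prompt_tokens

print(with_cache_reuse)     # 3547 tokens
print(without_cache_reuse)  # 18673 tokens, ~5x more prompt processing
```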
