A research tool for experimenting with token-level LLM steering, KV cache compression, and multi-sequence batching. The goal is to make these experiments easy to run, measure, and compare — not to be a production inference server. No auth, no rate limiting, no reliability guarantees. If you want to run models, use Ollama. If you want to control them, this is the workbench.
Wraps llama.cpp, communicates over NDJSON on stdin/stdout — one client per process, trivially testable, SSH-transparent. A Unix socket transport (same protocol, multiple clients) is the natural next step for multi-caller use cases.
The name: In The Last of Us, a Clicker is the stage of Cordyceps fungal infection where the host is fully under control — no vision, no will, navigating purely by the fungus's signals. Alpaca is Stanford's instruction-tuned derivative of LLaMA, the origin of the open-source LLM lineage this runs on. Put them together: clickpaca is what happens when the Cordyceps gets into the alpaca family. Token-level grammar constraints, logit bias, and multi-sequence steering take over the model's output one token at a time. The LLM is the host. You are the fungus.
A neural network and its weights are not a knowledge base. They are a fixed computational circuit — the weights encode how to compute, not a table of facts to look up. Think of it like a very sophisticated FPGA: the bitstream (weights) defines the circuit, and running an input through it produces a deterministic computation. The "intelligence" is in the topology of the computation, not in stored records.
The KV cache is the activation state of that circuit after processing a particular input — the values of every internal register at every layer, for every token seen so far. It's both things simultaneously: cached compute (you don't re-run the input) and cached representation (the circuit's internal encoding of what it processed). Reloading a KV state is like restoring a checkpoint mid-computation — the circuit resumes exactly where it left off.
This reframes what KV compression means: you're compressing register values, not compressing knowledge. The question is whether the downstream computation — the next forward pass — produces the same result from compressed register state as from full-precision state. TurboQuant's answer: yes, if you compress carefully. turbo4 (MSE-only) keeps register values close to ground truth. turbo3 (MSE + QJL residual) adds random noise that the softmax gate amplifies — biasing the computation even though the inner products are technically unbiased.
See docs/LLM-AS-PROCESSOR.md for a full treatment of this model and what it implies for system design. For the quantization math, see docs/KV-RESEARCH.md.
Most people use LLMs as a black box: send a prompt, receive text. We're interested in what happens when you treat the model as a programmable substrate — controlling its output at the token level, not just through prompting.
The specific research questions driving this:
Can you pre-compute and reuse model reasoning? The KV (key-value) cache stores the model's internal state after processing a prompt. If you can save and reload KV states, you can process a document once and reuse that computation across thousands of queries — rather than reprocessing it every time. With 5× KV compression (TurboQuant, ICLR 2026), you can fit far more of this cached state in memory.
Can you run many steered inferences in parallel? If you need structured output (valid JSON, a specific schema, constrained vocabulary), today's servers apply grammar constraints globally and queue requests. We want per-sequence grammar constraints running in a single batched forward pass — so 8 concurrent requests with 8 independent grammars use the same GPU call as 1 request.
Does KV compression hurt quality? By how much? The conventional wisdom is that KV compression degrades output. Our measurements on Gemma4 show the opposite for turbo4: better perplexity than f16 at 3.8× compression. The reason is subtle — the TurboQuant variant that adds a 1-bit QJL residual (turbo3) actually hurts quality because the random projection adds variance that softmax amplifies. turbo4 (MSE-only, no QJL) avoids this. See KV-RESEARCH.md for the full analysis.
This is a research workbench, not a production server. The design choices — NDJSON subprocess instead of HTTP, C++ core with TypeScript harness, explicit KV cache types — are chosen to make these experiments measurable and reproducible.
Ollama and LM Studio are great for running models. They're not designed for controlling them.
We needed to explore token-level steering — grammar constraints, logit bias, multi-sequence batching, KV cache compression — in combination. None of the existing servers expose these as composable primitives. Ollama queues requests and doesn't support per-request grammars. LM Studio has no batching API. llama.cpp's HTTP server has grammar support but no logit bias composition and no true concurrent batching.
The alternative was FFI — calling into libllama.dylib directly from Bun/TypeScript via dlopen and function pointers. We tried this (see docs/HOW-WE-GOT-HERE.md). It's fragile: the function pointer table breaks on every llama.cpp API change, the type marshaling is error-prone, and debugging across the FFI boundary is painful.
The insight: a standalone subprocess that speaks NDJSON on stdin/stdout is simpler than FFI and more powerful than HTTP for a single-caller driver. Bun spawns the process, writes JSON lines in, reads JSON lines out. No HTTP server, no port management, no REST client. The language boundary is crossed exactly once — at process spawn. Everything inside the subprocess is native C++, everything outside is idiomatic TypeScript.
This also makes the subprocess trivially replaceable, testable (printf '...' | ./inference-server), and network-transparent (pipe over SSH with no protocol changes).
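The driver side of this protocol reduces to line framing. A minimal sketch of the NDJSON splitter a TypeScript harness might use — generic framing logic, not this repo's actual code:

```typescript
// Incremental NDJSON parser: buffers partial chunks and returns each
// complete JSON message as it arrives.
type Msg = Record<string, unknown>;

function makeLineParser(): (chunk: string) => Msg[] {
  let buf = "";
  return (chunk: string): Msg[] => {
    buf += chunk;
    const out: Msg[] = [];
    let nl: number;
    while ((nl = buf.indexOf("\n")) >= 0) {
      const line = buf.slice(0, nl).trim();
      buf = buf.slice(nl + 1);
      if (line.length > 0) out.push(JSON.parse(line) as Msg);
    }
    return out;
  };
}
```

Feed it each stdout chunk from the spawned subprocess; a message split across two reads is reassembled before parsing.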
The constraint: one client per process. If multiple independent callers need to share a loaded model, a Unix socket transport would serve the same NDJSON protocol to multiple clients simultaneously — no protocol change required, just a different I/O layer. HTTP would work too but adds per-request context ownership and loses the batching model. The current stdio design is the right shape for a single orchestrating process; the socket variant is the right shape for a shared local service.
Most LLM servers (LM Studio, Ollama, OpenAI, llama.cpp's own HTTP server) handle requests one at a time or queue them. Steering parameters — grammar, sampling, logit bias — are global per request and don't compose.
This server is built around three capabilities that don't exist together anywhere else:
1. Composable token steering in a single sampler chain
Grammar constraints, logit bias, repetition penalty, top-k/p/min-p, and temperature all run as a single ordered chain before each token is selected — not as separate post-processing steps. This means:
- Grammar zeros out structurally invalid tokens first
- Logit bias then nudges the remaining distribution (force a token, ban a token, boost a domain term)
- Repetition penalty runs on what's left
- Sampling selects from the result
You can force a specific output format (JSON Schema via LLGuidance) while simultaneously banning specific tokens and boosting domain vocabulary. None of the major servers support this combination.
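As a concrete sketch, a hypothetical `infer` payload combining all three. The grammar id, token IDs, and bias values are placeholders, not real model tokens:

```typescript
// Hypothetical request: schema constraint + token ban + domain-term boost.
// Find real token IDs for your model with the `tok` helper.
const req = {
  type: "infer",
  seq_id: 1,
  prompt: "Extract the order as JSON:\n",
  grammar_id: "order-schema",     // structural mask applied first
  logit_bias: [
    { token: 107, bias: -1e9 },   // ban this token outright
    { token: 4443, bias: 5.0 },   // boost a domain term
  ],
  temperature: 0.2,
};
const line = JSON.stringify(req) + "\n"; // one NDJSON line on stdin
```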
2. Multi-sequence batching with independent steering per sequence
Up to 8 concurrent sequences share a single llama_decode call. Each sequence has its own:
- Grammar / schema constraint (independent parse state, cloned at infer time)
- Logit bias array
- Sampling parameters
A single process can run 8 concurrent requests where seq 1 uses a JSON schema, seq 2 uses a GBNF grammar with logit bias, seq 3 runs free generation — all batched through one forward pass. HTTP servers can't do this; each request owns the model context.
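Serialized as NDJSON, that three-sequence scenario looks like this (prompts and grammar ids are illustrative placeholders):

```typescript
// Three concurrent infer requests with independent steering, one JSON
// object per line. seq_ids, prompts, and grammar ids are placeholders.
const requests = [
  { type: "infer", seq_id: 1, prompt: "p1", grammar_id: "json-schema" },
  { type: "infer", seq_id: 2, prompt: "p2", grammar_id: "gbnf-grammar",
    logit_bias: [{ token: 107, bias: -1e9 }] },
  { type: "infer", seq_id: 3, prompt: "p3" }, // free generation
];
const ndjson = requests.map((r) => JSON.stringify(r)).join("\n") + "\n";
```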
3. TurboQuant KV cache compression
TurboQuant (Zandieh et al., ICLR 2026) compresses the KV cache ~5× using a randomized Hadamard transform before quantization. This is not available in any production LLM server.
The compression compounds with batching: with f16 KV, 8 concurrent sequences consume ~800 MiB of KV cache. With turbo3, that drops to ~160 MiB — freeing 640 MiB for more sequences or longer context. On memory-constrained systems (or when running large models), this is the difference between 2 concurrent sequences and 10.
The combination
Token-native grammar constraints (LLGuidance operates in token space, not character space) + composable logit steering + multi-sequence batching + 5× KV compression, all in a single process, single forward pass. No existing server provides this stack.
- mainline — standard llama.cpp (b8763), f16/q8_0/q4_0 KV cache
- turboquant — TheTom/llama-cpp-turboquant fork, turbo2/3/4 KV cache (TurboQuant paper)
- llg — mainline + LLGuidance, adds `%llguidance` and `%json_schema` grammar formats
```sh
make             # build mainline + turboquant + Metal shader cache + tok
make model       # download test model (~5GB, requires HF login — see below)
make test        # run steering test suite (11 tests)
make test-tq     # run test suite against turboquant binary
make mainline    # mainline only (f16/q8/q4 KV cache)
make turboquant  # TurboQuant only (turbo2/3/4 KV cache)
make llg         # LLGuidance binary (needs separate llg build, see below)
make tok         # token vocabulary helper
make metal       # recompile Metal shaders (after TurboQuant source changes)
make test        # steering test suite against mainline
make test-tq     # steering test suite against turboquant
make clean       # remove built binaries
```

```sh
git clone https://github.com/brian-ln/clickpaca
cd clickpaca
make setup PRESET=macos-metal   # macOS M-series
make setup PRESET=linux-rocm    # Linux AMD ROCm (Strix Halo / RX 7900)
make setup PRESET=linux-cuda    # Linux NVIDIA CUDA
make setup PRESET=linux-cpu     # Linux CPU only
make                            # build clickpaca binaries
```

`make setup` clones llama.cpp and the TurboQuant fork as siblings, builds them using the selected CMake preset, and sets up the `builds/` symlink. CMake presets are defined in `CMakePresets.json`.
Apple Silicon note: the macos-metal preset produces a single binary that runs on both M4 and M5. The TurboQuant Metal kernel dispatches at runtime — M5 (Apple10 GPU family) uses a fused kernel that makes turbo types approach f16 speed; M4 (Apple9) pays a ~35% generation speed penalty for the compression. Same binary, different hardware path.
ROCm target GPU: the linux-rocm preset defaults to gfx1151 (Strix Halo / Ryzen AI MAX+). For other AMD GPUs use linux-rocm-gfx1100 (RDNA3 / RX 7900) or override:
```sh
make setup PRESET=linux-rocm CMAKE_EXTRA="-DAMDGPU_TARGETS=gfx1030"
```

Custom paths — if dependencies are already built elsewhere:

```sh
make MAINLINE_BUILD=/path/to/llama.cpp/build/bin mainline
make TQ_ROOT=/path/to/llama-cpp-turboquant turboquant
```

Requirements: CMake ≥ 3.23, C++17 compiler. macOS: Xcode CLI tools. ROCm: `/opt/rocm`. Bun for tests.
Tests require a Gemma4 E4B Q4_K_M model. Point at any existing copy:
```sh
# Option 1: point at a model you already have
export CLICKPACA_MODEL=/path/to/gemma4-e4b-Q4_K_M.gguf
make test

# Option 2: already using llama.cpp with -hf downloads?
# Models land in ~/.cache/llama.cpp/ (or $LLAMA_CACHE) — clickpaca checks there by default.

# Option 3: download via llama.cpp -hf (caches to ~/.cache/llama.cpp/ automatically)
export HF_TOKEN=hf_your_token_here   # from https://huggingface.co/settings/tokens
make model                           # uses llama-cli -hf internally
make test
```

Note: Ollama blob files do NOT work — they are multimodal shards, not complete standalone GGUFs.
Model search order: CLICKPACA_MODEL → LLAMA_CACHE → ~/.cache/llama.cpp/.
NDJSON on stdin, NDJSON on stdout. One JSON object per line.
Load a model. Must be sent first.
```json
{
  "type": "init",
  "modelPath": "/path/to/model.gguf",
  "nGpuLayers": 99,
  "nCtx": 8192,
  "nBatch": 512,
  "nSeqMax": 64,
  "cacheTypeK": "f16",
  "cacheTypeV": "f16"
}
```

`cacheTypeK` / `cacheTypeV` options: `f16` (default), `q8_0`, `q4_0`, `turbo2`, `turbo3`, `turbo4`.
turbo types require the turboquant binary.
nSeqMax controls the maximum concurrent sequences and KV cache pre-allocation. Default 64.
Set nSeqMax: 2 when using the score message — high nSeqMax exhausts Metal compute buffers during perplexity computation.
Register a grammar constraint by id. The compiled sampler is cached and cloned per sequence.
```json
{
  "type": "grammar",
  "id": "my-schema",
  "grammar": "root ::= (\"yes\" | \"no\")"
}
```

Supports GBNF syntax. See llama.cpp grammars for examples.
LLGuidance (`%llguidance { ... }`) is supported when compiled with `LLAMA_LLGUIDANCE=ON`.
Enqueue a generation request.
```json
{
  "type": "infer",
  "seq_id": 1,
  "prompt": "<bos><|turn>user\nHello\n<turn|>\n<|turn>model\n",
  "maxTokens": 256,
  "temperature": 0.7,
  "topK": 40,
  "topP": 0.95,
  "minP": 0.0,
  "penaltyRepeat": 1.0,
  "penaltyFreq": 0.0,
  "penaltyPresent": 0.0,
  "penaltyLastN": 64,
  "grammar_id": "my-schema",
  "logit_bias": [
    { "token": 107, "bias": -1e9 },
    { "token": 4443, "bias": 5.0 }
  ]
}
```

Up to 64 sequences (`MAX_SEQ`) can be active concurrently. Active sequences are batched together per generation tick.
Compute perplexity (NLL) of a text without generating tokens. Used for KV cache quality evaluation.
```json
{ "type": "score", "seq_id": 1, "text": "The quick brown fox..." }
```

Note: use `nSeqMax: 2` in `init` when scoring — high `nSeqMax` exhausts Metal compute buffers.

Response: `{"type":"scored","seq_id":1,"tokens":17,"nll":2.10,"ppl":8.19}`
Lower PPL = better. Use to compare cache types on the same corpus.
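The two reported numbers are related by `ppl = exp(nll)`, where `nll` is the mean negative log-likelihood per token; small discrepancies between them come from rounding in the printed fields:

```typescript
// Perplexity from mean per-token negative log-likelihood.
const pplFromNll = (meanNll: number): number => Math.exp(meanNll);

// e.g. a corpus the model assigns 1/8 probability per token on average
// has nll = ln(8) ≈ 2.079 and perplexity 8.
```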
Kill an active sequence and free its KV cache slot.
```json
{ "type": "cancel", "seq_id": 1 }
```

Graceful shutdown. Active sequences are abandoned.

```json
{ "type": "quit" }
```

```json
{ "type": "ready", "nVocab": 262144, "build": "mainline" }
{ "type": "grammar_ready", "id": "my-schema" }
{ "type": "token", "seq_id": 1, "piece": " Hello" }
{ "type": "done", "seq_id": 1, "tokens_prompt": 12, "tokens_predicted": 48,
  "prompt_ms": 147.5, "eval_ms": 2800.0,
  "prompt_tps": 81.4, "eval_tps": 17.1 }
{ "type": "scored", "seq_id": 1, "tokens": 17, "nll": 2.10, "ppl": 8.19 }
{ "type": "error", "seq_id": -1, "message": "..." }
```

All steering is applied in the sampler chain before each token is selected. The chain runs in this order:
```
logits (full vocabulary)
  → grammar            (zero out tokens invalid at current parse state)
  → logit_bias         (force / ban / nudge specific tokens)
  → repetition penalty (penalize recently seen tokens)
  → top-k              (keep k most probable)
  → min-p              (adaptive probability floor)
  → top-p              (nucleus: keep top p% of mass)
  → temperature        (scale distribution)
  → dist               (sample)
```
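Conceptually, the chain is ordered function composition over the logit array. A toy sketch of that idea — the real chain is built from llama.cpp's C sampler objects, not these functions:

```typescript
// Sampler chain as function composition over logits. Toy stages only;
// llama.cpp implements these as C samplers inside a llama_sampler_chain.
type Stage = (logits: number[]) => number[];

const compose = (...stages: Stage[]): Stage =>
  (logits) => stages.reduce((l, s) => s(l), logits);

// Keep the k highest logits; mask the rest with -Infinity.
const topK = (k: number): Stage => (l) => {
  const thresh = [...l].sort((a, b) => b - a)[k - 1];
  return l.map((x) => (x >= thresh ? x : -Infinity));
};

// Scale logits by 1/t (t < 1 sharpens the distribution, t > 1 flattens it).
const temperature = (t: number): Stage => (l) => l.map((x) => x / t);
```

`compose(topK(40), temperature(0.7))` mirrors two adjacent steps of the order above; because every stage sees the output of the previous one, a token banned early can never be resurrected by a later stage.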
Forces output to match a formal grammar. The grammar is compiled once on grammar message, then cloned per sequence — each sequence has independent parse state.
How it works: The grammar is represented as a pushdown automaton (llama_grammar_rules + llama_grammar_stacks). Before each token, llama_grammar_apply_impl walks all active parse stacks and sets logits to -inf for any token that isn't a valid continuation. After a token is accepted, llama_grammar_accept advances the stacks.
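The masking step in miniature — a toy model of what `llama_grammar_apply_impl` does, not its actual code: tokens the current parse state cannot accept get their logits set to `-Infinity` before sampling:

```typescript
// Toy grammar mask: `allowed` stands in for the set of token IDs that are
// valid continuations at the current parse state (computed by the pushdown
// automaton in the real implementation).
function applyGrammarMask(logits: number[], allowed: Set<number>): number[] {
  return logits.map((logit, tokenId) =>
    allowed.has(tokenId) ? logit : -Infinity);
}
```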
Reference:
Direct surgery on the probability distribution at the logit level.
```json
"logit_bias": [
  { "token": 4443, "bias": 1e9 },   // force this token
  { "token": 107,  "bias": -1e9 },  // ban this token
  { "token": 1234, "bias": 2.5 }    // nudge probability up
]
```

Applied before temperature/sampling so it interacts correctly with all other constraints. `-1e9` (effectively −inf) bans a token for the rest of the sequence; `+1e9` forces selection regardless of model preference.
Reference: llama_sampler_init_logit_bias in llama-sampling.cpp
Keeps only the k highest-probability tokens after temperature scaling; zeroes the rest.
```json
"topK": 40
```

`topK: 1` = greedy decoding (deterministic, always picks the most likely token).
`topK: 0` = disabled (use with `minP` or `topP` alone).
Reference: Top-k sampling
Keeps the smallest set of tokens whose cumulative probability exceeds p. Adapts to distribution shape — narrows when the model is confident, widens when uncertain.
```json
"topP": 0.95
```

Reference: Holtzman et al. 2020 — The Curious Case of Neural Text Degeneration
Keeps tokens with probability ≥ p × max_probability. More adaptive than top-p — self-calibrates relative to the peak of the distribution.
```json
"minP": 0.05
```

`0.05` means keep any token with at least 5% of the top token's probability. Tends to produce better quality than top-p alone for open-ended generation.
Reference: Min-p sampling PR
Penalizes tokens that have recently appeared in the output, reducing loops and repetitive text.
```json
"penaltyRepeat": 1.3,
"penaltyFreq": 0.0,
"penaltyPresent": 0.0,
"penaltyLastN": 64
```

- `penaltyRepeat` — multiplicative penalty on logits of seen tokens. `1.0` = disabled, `1.3` = moderate; values ≥ `1.0` only.
- `penaltyFreq` — additional penalty proportional to how often a token has appeared.
- `penaltyPresent` — flat penalty for any token that has appeared at all.
- `penaltyLastN` — sliding window of recent tokens to consider; `-1` = full context.
Reference: Keskar et al. 2019 — CTRL
The KV cache is the model's internal register state — the activations at every layer for every token processed. Compressing it means compressing those register values, not compressing text. The question is whether downstream computation (the next token prediction) produces the same result from compressed state as from full-precision state.
The turboquant binary supports compressed KV cache types that reduce memory by ~3.8–6× at the cost of some quality. q8_0 and q4_0 are available on all binaries; turbo* types require the turboquant binary.
```json
"cacheTypeK": "turbo3",
"cacheTypeV": "turbo3"
```

| Type | Bits/value | Block size | Compression vs f16 | Binary | Notes |
|---|---|---|---|---|---|
| `f16` | 16 | — | 1× | all | Default, lossless |
| `q8_0` | 8 | — | 2× | all | Simple uniform quantization |
| `q4_0` | 4 | — | 4× | all | Aggressive, noticeable quality loss |
| `turbo2` | ~2.625 | 10 bytes/128 | ~6.1× | turboquant | WHT + 2-bit PolarQuant |
| `turbo3` | ~3.25 | 14 bytes/128 | ~4.9× | turboquant | WHT + 3-bit PolarQuant |
| `turbo4` | ~4.25 | 68 bytes/128 | ~3.8× | turboquant | WHT + 4-bit PolarQuant |
How TurboQuant works:
Standard KV quantization degrades quality because key/value vectors contain outlier dimensions — a small number of channels with very large magnitudes. Uniform quantization wastes its precision budget covering the outlier range.
TurboQuant applies a randomized Hadamard transform to key vectors before quantization. The Walsh-Hadamard transform (WHT) is a linear rotation that distributes outlier energy uniformly across all dimensions. After rotation, the vector is amenable to uniform quantization with minimal error.
The rotation is randomized (Hadamard matrix × random diagonal sign matrix) so no adversarial input can consistently produce outliers post-rotation.
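To see why the rotation helps, here is a plain fast Walsh-Hadamard transform sketch with orthonormal scaling. It illustrates the rotation idea only — the real kernel also applies the random ±1 diagonal first (omitted here) and runs fused in Metal, not in TypeScript:

```typescript
// Fast Walsh–Hadamard transform, orthonormal scaling (preserves the norm).
// Input length must be a power of two.
function fwht(v: number[]): number[] {
  const n = v.length;
  const out = v.slice();
  for (let h = 1; h < n; h *= 2) {
    for (let i = 0; i < n; i += 2 * h) {
      for (let j = i; j < i + h; j++) {
        const a = out[j], b = out[j + h];
        out[j] = a + b;       // butterfly: sum
        out[j + h] = a - b;   // butterfly: difference
      }
    }
  }
  const scale = 1 / Math.sqrt(n);
  return out.map((x) => x * scale);
}

// An all-outlier vector [8, 0, 0, 0] rotates to [4, 4, 4, 4]: same norm,
// uniform magnitudes, friendly to uniform quantization.
```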
On Apple Silicon, a fused Metal Flash Attention kernel performs the WHT dequantization inline during attention computation — the full dequantized KV cache is never materialized in memory.
On pre-M5 hardware (M1–M4): Uses a 4-magnitude LUT path. The theoretical memory bandwidth savings exist but are partially offset by dequant overhead — generation is slower than f16 at short context, roughly equivalent at long context.
Metal shader caching: The TurboQuant fork embeds Metal shader source and compiles it at runtime (GGML_METAL_EMBED_LIBRARY). make metal pre-compiles the shaders to default.metallib and places it next to the binary, reducing startup from ~30s to ~0s.
Reference:
- TurboQuant paper — Zandieh et al., ICLR 2026
- TheTom/llama-cpp-turboquant
- Walsh-Hadamard Transform
- Lloyd-Max quantization
Up to 64 sequences (MAX_SEQ) can run concurrently. The scheduler prefills one pending sequence per tick (variable-length prompts don't batch efficiently), then decodes all prefill-complete sequences in a single batched llama_decode call.
Each sequence has independent KV cache slots, sampler chains, and grammar state. Cancelling one sequence frees its slot immediately.
The practical limit is KV cache memory: with f16 KV, 64 sequences × 8K context ≈ 6.4 GiB. With turbo4, the same footprint supports ~240 sequences. MAX_SEQ is a soft ceiling — set it to the number of sequences you actually need to reduce KV pre-allocation overhead.
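A back-of-envelope helper for this sizing, using a cache type's bits/value. The model dimensions are parameters you supply, and real llama.cpp allocations add per-block scale/zero-point overhead this ignores:

```typescript
// Approximate KV cache size per sequence, in MiB. All dimensions are
// inputs for your model; real allocations add per-block metadata.
function kvMiB(
  nCtx: number, nLayers: number, headDim: number, nKvHeads: number,
  bitsPerValue: number,
): number {
  const valuesPerToken = 2 * nLayers * nKvHeads * headDim; // K and V planes
  return (nCtx * valuesPerToken * bitsPerValue) / 8 / (1024 * 1024);
}
```

Whatever the model shape, the f16-to-turbo4 ratio is 16/4.25 ≈ 3.8×, matching the compression column above.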
| Cache | PPL | vs f16 | KV mem |
|---|---|---|---|
| f16 | 52.995 ±2.19 | baseline | ~100 MiB/seq |
| turbo4 | 50.483 ±2.06 | -4.74% | ~26 MiB/seq |
| turbo3 | 51.582 ±2.09 | -2.67% | ~20 MiB/seq |
| turbo2 | 54.506 ±2.15 | +2.84% | ~13 MiB/seq |
`turbo4` is the recommended default: better quality than f16 at 3.8× KV compression. `turbo3` is also good (−2.67% vs f16). `turbo2` degrades slightly. Results are deterministic (zero variance across 3 controlled runs). Reproduce: `bash results/ppl-wikitext2-reproduce.sh`
| Cache | pp512 (tok/s) | tg128 (tok/s) |
|---|---|---|
| f16 | 1423 ±32 | 48.8 ±2.1 |
| turbo4 | 1196 ±33 | 31.4 ±2.2 |
| turbo3 | 570 ±305 | 35.0 ±1.7 |
f16 is faster on M4 Max — dequant overhead exceeds bandwidth savings at these context lengths. turbo4 costs ~35% generation speed for 3.8× KV compression. On M5 Max (tensor API, fused kernel), turbo4 is expected to approach f16 speed. [NEEDS M5 MEASUREMENT]
See docs/KV-RESEARCH.md for full quantization theory, the QJL analysis, and community findings. See docs/CAPABILITIES.md for experiment designs and the KV persistence roadmap.
```sh
bin/mainline/tok model.gguf "yes" "no"        # find token IDs
bin/mainline/tok model.gguf --id 107 108      # decode IDs to strings
bin/mainline/tok model.gguf --search "hello"  # all tokens containing substring
```

Use with `logit_bias` to find the correct token IDs for your model.