65 changes: 63 additions & 2 deletions docs/how-to/auto-tune-routing.md

To make tool-heavy requests more likely to score as complex, raise the
`tools` weight. To ignore keyword matching entirely, set its weight to
`0.0`.

## Tune classifier weights via grob_configure

The `[classifier]` section is exposed as a writable section of
`grob_configure`, so you can adjust weights and thresholds at runtime
without restarting the proxy. Hot-reload rebuilds the scorer
atomically — in-flight requests continue on the old snapshot.
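The weighted scoring this section describes can be sketched as follows. This is a minimal illustration of the weights-times-signals model, not Grob's implementation; the binary signal indicators and function names are assumptions.

```python
# Sketch of the classifier described above (illustrative, not Grob's code):
# each observed signal contributes weight * indicator, and the summed score
# is compared against the two tier thresholds.

def classify(signals: dict[str, float], weights: dict[str, float],
             medium_threshold: float = 2.0,
             complex_threshold: float = 5.0) -> str:
    score = sum(weights.get(name, 0.0) * value
                for name, value in signals.items())
    if score >= complex_threshold:
        return "complex"
    if score >= medium_threshold:
        return "medium"
    return "trivial"

# After raising weights.tools to 5.0 and zeroing weights.keywords,
# a tool-bearing request clears the complex threshold on its own.
weights = {"max_tokens": 1.0, "tools": 5.0, "context_size": 1.0,
           "keywords": 0.0, "system_prompt": 1.0}
print(classify({"tools": 1.0, "keywords": 1.0}, weights))  # -> complex
```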

### Read the current values

```
grob_configure action=read section=classifier
```

Returns:

```json
{
  "weights": {
    "max_tokens": 1.0, "tools": 1.0, "context_size": 1.0,
    "keywords": 1.0, "system_prompt": 1.0
  },
  "thresholds": {
    "medium_threshold": 2.0, "complex_threshold": 5.0
  }
}
```

### Update one key at a time

```json
{
  "method": "grob_configure",
  "params": {
    "action": "update",
    "section": "classifier",
    "key": "weights.tools",
    "value": 5.0
  }
}
```

Whitelisted keys:

- `weights.max_tokens`, `weights.tools`, `weights.context_size`,
`weights.keywords`, `weights.system_prompt`
- `thresholds.medium_threshold`, `thresholds.complex_threshold`

Other keys are rejected with `unknown classifier key`. Credentials and
DLP settings remain blocked by the central deny-list.

### Bypass the scorer per request

When a client already knows the right tier, override scoring entirely
with [`grob_hint`](use-grob-hint.md) (header, body field, or MCP
tool). The hint is consumed for one request only.

## Iterate

1. Collect traces for a representative period (a few hours to a day).
2. Run the `jq` analysis to find misrouted or slow requests.

## Further reading

- [Use grob_hint to override request complexity](use-grob-hint.md) — bypass
the scorer entirely when the client knows the tier.
- [Configure the SimHash fuzzy response cache](configure-simhash-cache.md) — pair
classifier tuning with cache tuning for end-to-end speedup.
- [Configuration reference](../reference/configuration.md) — full list of
  config keys.
- [Observability reference](../reference/observability.md) — Prometheus
  metrics and the SSE stream.
107 changes: 107 additions & 0 deletions docs/how-to/configure-simhash-cache.md
# Configure the SimHash fuzzy response cache

Grob caches deterministic LLM responses (`temperature = 0`) so that
identical or *near-identical* prompts can reuse a previous answer
without hitting the upstream provider. The cache has two layers:

- **Exact** — SHA-256 of the canonicalised request. Sub-microsecond
lookup, zero false positives.
- **Fuzzy (SimHash)** — 64-bit perceptual fingerprint of the prompt
text plus Hamming-distance lookup. Catches paraphrases, whitespace
changes, and minor edits that the exact cache would miss.

The fuzzy layer uses *no* embeddings — the fingerprint is computed
from token shingles and `DefaultHasher`. It has no model dependency
and adds only a few microseconds per request.

## How SimHash works (one paragraph)

The prompt is normalised (lowercased, punctuation stripped,
whitespace collapsed) and split into tokens. Each token is hashed
together with its position; per-bit weights are accumulated across
all tokens. The final 64-bit fingerprint has bit *i* set iff the
cumulative weight at position *i* is positive. Two similar prompts
share most bits; the **Hamming distance** (number of differing bits)
measures dissimilarity. Identical prompts → distance 0; complete
paraphrases → typically 1–4; unrelated prompts → typically > 20.
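The scheme above can be sketched in a few lines. This is an illustration of the algorithm, not Grob's code: Grob hashes with Rust's `DefaultHasher`, so `blake2b` stands in here and the fingerprints will differ from Grob's.

```python
import hashlib

def simhash64(prompt: str) -> int:
    """64-bit SimHash sketch: position-tagged tokens, per-bit weights."""
    # Normalise: lowercase, strip punctuation, collapse whitespace.
    cleaned = "".join(c if c.isalnum() or c.isspace() else " "
                      for c in prompt.lower())
    weights = [0] * 64
    for pos, tok in enumerate(cleaned.split()):
        # Hash each token together with its position (blake2b stands in
        # for Rust's DefaultHasher).
        h = int.from_bytes(
            hashlib.blake2b(f"{pos}:{tok}".encode(), digest_size=8).digest(),
            "big")
        for i in range(64):
            weights[i] += 1 if (h >> i) & 1 else -1
    # Bit i is set iff the cumulative weight at position i is positive.
    return sum(1 << i for i in range(64) if weights[i] > 0)

def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

# Prompts identical after normalisation collapse to the same fingerprint.
print(hamming(simhash64("Hello,   WORLD!"), simhash64("hello world")))  # -> 0
```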

## Configure the cache

Add or update the `[cache]` section of `~/.grob/config.toml`:

```toml
[cache]
enabled = true
max_capacity = 2000 # entries (~4 MiB at 2 KiB avg)
ttl_secs = 3600 # 1 hour
max_entry_bytes = 2097152 # 2 MiB per entry
simhash_threshold = 3 # max Hamming distance for a fuzzy hit
```

Then reload:

```sh
curl -X POST http://localhost:13456/api/config/reload
```

## Tuning the threshold

`simhash_threshold` is the maximum Hamming distance for a cache hit.
Lower values are stricter; higher values are more permissive.

| Threshold | Behaviour | Use when |
|-----------|-----------|----------|
| `0` | Exact-match only (fingerprint must match perfectly) | You want the fuzzy layer disabled in practice |
| `1`–`2` | Catches whitespace and trivial edits | Conservative — minimise false positives |
| `3` (default) | Catches paraphrases of short prompts and small edits to long ones | Balanced |
| `4`–`6` | Catches synonym swaps, reordered tokens | Aggressive — boilerplate-heavy workloads |
| `≥ 10` | High false-positive risk — unrelated prompts may match | Not recommended |

A threshold of 3 over a 64-bit fingerprint is roughly a 5% Hamming
radius. Empirically this catches paraphrases without colliding
unrelated short prompts in production logs.

## Per-tenant isolation

The exact cache key includes the tenant ID, so cached responses are
never shared across virtual API keys. The SimHash layer keys on the
fingerprint only; if you need strict tenant isolation on the fuzzy
layer too, set `simhash_threshold = 0` for that deployment.

## Observability

Grob exports two Prometheus counters specific to the SimHash layer:

| Metric | Meaning |
|--------|---------|
| `grob_simhash_cache_hits_total` | Fuzzy lookups that returned a hit |
| `grob_simhash_cache_misses_total` | Fuzzy lookups with no entry within threshold |

Generic cache metrics (`grob_cache_hits_total`,
`grob_cache_misses_total`) cover both layers. To compute the fuzzy
**uplift** — share of hits the exact cache would have missed — divide
SimHash hits by total cache lookups.
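As a rough PromQL sketch — assuming only the counter names listed above — the fuzzy uplift over a 5-minute window would look like:

```promql
# Share of all cache lookups served by the fuzzy layer
sum(rate(grob_simhash_cache_hits_total[5m]))
  /
(sum(rate(grob_cache_hits_total[5m]))
  + sum(rate(grob_cache_misses_total[5m])))
```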

## Disable the cache entirely

Set `enabled = false`. The SimHash layer is skipped along with the
exact layer; every request hits the upstream provider.

## Trade-offs

- **Latency**: SimHash adds ~1–5 µs per lookup (token hashing +
fingerprint scan). Negligible compared to network round-trip.
- **Memory**: each fuzzy entry stores its 64-bit fingerprint plus a
reference to the cached response. Bounded by `max_capacity`.
- **Determinism**: fuzzy hits return a response that was generated for
a *similar but not identical* prompt. Always safe for explanatory or
template-style prompts; can drift on prompts whose semantics depend
on specific token order or wording. Test with representative
workloads before raising the threshold above 3.

## Further reading

- [Configuration reference](../reference/configuration.md) — full list
of `[cache]` options.
- [Auto-tune routing with trace analysis](auto-tune-routing.md) — pair
the cache with classifier tuning for end-to-end speedup.
115 changes: 115 additions & 0 deletions docs/how-to/use-grob-hint.md
# Use grob_hint to override request complexity

Skip the heuristic classifier when the client already knows how complex
a request is. `grob_hint` declares a complexity tier for a single
request, bypassing scoring and feeding directly into provider selection.

Three surfaces are equivalent and supported:

| Surface | Best for |
|---------|----------|
| `X-Grob-Hint` HTTP header | curl, scripts, any HTTP client |
| `metadata.grob_hint` request body field | SDK clients (Anthropic, OpenAI) |
| `grob_hint` MCP tool | MCP-native agents that cannot set headers |

## Valid hint values

- `trivial` — fast lookup, short answer, no reasoning
- `medium` — standard reasoning, moderate context
- `complex` — deep reasoning, multi-step, tool use, large context

Anything else is rejected with `400 Bad Request`.

## Priority order

When several hint surfaces are set on the same request, the first match
wins:

1. `X-Grob-Hint` header
2. `metadata.grob_hint` body field
3. MCP one-shot slot (set via `grob_hint` tool, consumed on next dispatch)

If none are set, the heuristic classifier scores the request from
observable signals (max_tokens, tools, context size, keywords, system
prompt length).

## Use the X-Grob-Hint header

Cleanest path for shell scripts and `curl`. Add a single header:

```sh
curl http://localhost:13456/v1/messages \
  -H 'Content-Type: application/json' \
  -H 'X-Grob-Hint: complex' \
  -d '{
    "model": "claude-sonnet-4",
    "max_tokens": 4096,
    "messages": [{"role": "user", "content": "Refactor this monorepo build pipeline."}]
  }'
```

The hint is consumed for this request only. The next request without
the header falls back to scoring.

## Use metadata.grob_hint in the body

Some SDKs forbid custom headers but allow arbitrary `metadata` fields.
Drop the hint there — Grob reads it before forwarding the request:

```json
{
  "model": "claude-sonnet-4",
  "max_tokens": 1024,
  "metadata": {
    "grob_hint": "trivial"
  },
  "messages": [{"role": "user", "content": "What time zone is UTC+1?"}]
}
```

Grob strips `metadata.grob_hint` before forwarding to the upstream
provider, so no provider-specific metadata schema is contaminated.

## Use the grob_hint MCP tool

For MCP clients (Claude Code, Cursor, custom agents) that cannot set
custom HTTP headers and don't shape the request body directly. Call
the tool **before** the request you want to influence:

```json
{
  "method": "tools/call",
  "params": {
    "name": "grob_hint",
    "arguments": {"complexity": "complex"}
  }
}
```

The hint is stored in a one-shot slot on the server and consumed by
the next dispatch from the same MCP session. After consumption the slot
is cleared automatically — you must call `grob_hint` again to influence
a subsequent request.

## When to use each surface

- **Header** — quick experiments, batch scripts, profiling.
- **Metadata** — production SDK clients where you control the request
body but not the transport.
- **MCP tool** — agentic clients that operate through MCP and don't
craft HTTP requests directly. Useful when an agent's planner has
already classified the task and wants to avoid re-running the
scorer on the proxy.

## Troubleshooting

| Symptom | Likely cause |
|---------|--------------|
| Hint ignored | Misspelled or invalid value; only `trivial`/`medium`/`complex` are accepted |
| Hint applied to the wrong request | MCP one-shot slot was consumed earlier; call `grob_hint` again |
| Header passed through to provider | Should not happen — file an issue with the request trace |

## Further reading

- [Auto-tune routing with trace analysis](auto-tune-routing.md) — when to
rely on the scorer instead of pinning hints.