65 changes: 63 additions & 2 deletions docs/how-to/auto-tune-routing.md

To make tool-heavy requests more likely to score as complex, raise the
`tools` weight. To ignore keyword matching entirely, set its weight to
`0.0`.

## Tune classifier weights via grob_configure

The `[classifier]` section is exposed as a writable section of
`grob_configure`, so you can adjust weights and thresholds at runtime
without restarting the proxy. Hot-reload rebuilds the scorer
atomically — in-flight requests continue on the old snapshot.
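The weighted scoring this section describes can be sketched as follows. This is a minimal illustration of the weights-times-signals model, not Grob's implementation; the binary signal indicators and function names are assumptions.

```python
# Sketch of the classifier described above (illustrative, not Grob's code):
# each observed signal contributes weight * indicator, and the summed score
# is compared against the two tier thresholds.

def classify(signals: dict[str, float], weights: dict[str, float],
             medium_threshold: float = 2.0,
             complex_threshold: float = 5.0) -> str:
    score = sum(weights.get(name, 0.0) * value
                for name, value in signals.items())
    if score >= complex_threshold:
        return "complex"
    if score >= medium_threshold:
        return "medium"
    return "trivial"

# After raising weights.tools to 5.0 and zeroing weights.keywords,
# a tool-bearing request clears the complex threshold on its own.
weights = {"max_tokens": 1.0, "tools": 5.0, "context_size": 1.0,
           "keywords": 0.0, "system_prompt": 1.0}
print(classify({"tools": 1.0, "keywords": 1.0}, weights))  # -> complex
```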

### Read the current values

```
grob_configure action=read section=classifier
```

Returns:

```json
{
  "weights": {
    "max_tokens": 1.0, "tools": 1.0, "context_size": 1.0,
    "keywords": 1.0, "system_prompt": 1.0
  },
  "thresholds": {
    "medium_threshold": 2.0, "complex_threshold": 5.0
  }
}
```

### Update one key at a time

```json
{
  "method": "grob_configure",
  "params": {
    "action": "update",
    "section": "classifier",
    "key": "weights.tools",
    "value": 5.0
  }
}
```

Whitelisted keys:

- `weights.max_tokens`, `weights.tools`, `weights.context_size`,
`weights.keywords`, `weights.system_prompt`
- `thresholds.medium_threshold`, `thresholds.complex_threshold`

Other keys are rejected with `unknown classifier key`. Credentials and
DLP settings remain blocked by the central deny-list.

### Bypass the scorer per request

When a client already knows the right tier, override scoring entirely
with [`grob_hint`](use-grob-hint.md) (header, body field, or MCP
tool). The hint is consumed for one request only.

## Iterate

1. Collect traces for a representative period (a few hours to a day).
2. Run the `jq` analysis to find misrouted or slow requests.

## Further reading

- [Use grob_hint to override request complexity](use-grob-hint.md) — bypass
the scorer entirely when the client knows the tier.
- [Configure the SimHash fuzzy response cache](configure-simhash-cache.md) — pair
classifier tuning with cache tuning for end-to-end speedup.
- [Configuration reference](../reference/configuration.md) — full list of
  config keys.
- [Observability reference](../reference/observability.md) — Prometheus
  metrics and the SSE stream.
107 changes: 107 additions & 0 deletions docs/how-to/configure-simhash-cache.md
# Configure the SimHash fuzzy response cache

Grob caches deterministic LLM responses (`temperature = 0`) so that
identical or *near-identical* prompts can reuse a previous answer
without hitting the upstream provider. The cache has two layers:

- **Exact** — SHA-256 of the canonicalised request. Sub-microsecond
lookup, zero false positives.
- **Fuzzy (SimHash)** — 64-bit perceptual fingerprint of the prompt
text plus Hamming-distance lookup. Catches paraphrases, whitespace
changes, and minor edits that the exact cache would miss.

The fuzzy layer uses *no* embeddings — the fingerprint is computed
from token shingles and `DefaultHasher`. It has no model dependency
and adds only a few microseconds per request.

## How SimHash works (one paragraph)

The prompt is normalised (lowercased, punctuation stripped,
whitespace collapsed) and split into tokens. Each token is hashed
together with its position; per-bit weights are accumulated across
all tokens. The final 64-bit fingerprint has bit *i* set iff the
cumulative weight at position *i* is positive. Two similar prompts
share most bits; the **Hamming distance** (number of differing bits)
measures dissimilarity. Identical prompts → distance 0; complete
paraphrases → typically 1–4; unrelated prompts → typically > 20.
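The scheme above can be sketched in a few lines. This is an illustration of the algorithm, not Grob's code: Grob hashes with Rust's `DefaultHasher`, so `blake2b` stands in here and the fingerprints will differ from Grob's.

```python
import hashlib

def simhash64(prompt: str) -> int:
    """64-bit SimHash sketch: position-tagged tokens, per-bit weights."""
    # Normalise: lowercase, strip punctuation, collapse whitespace.
    cleaned = "".join(c if c.isalnum() or c.isspace() else " "
                      for c in prompt.lower())
    weights = [0] * 64
    for pos, tok in enumerate(cleaned.split()):
        # Hash each token together with its position (blake2b stands in
        # for Rust's DefaultHasher).
        h = int.from_bytes(
            hashlib.blake2b(f"{pos}:{tok}".encode(), digest_size=8).digest(),
            "big")
        for i in range(64):
            weights[i] += 1 if (h >> i) & 1 else -1
    # Bit i is set iff the cumulative weight at position i is positive.
    return sum(1 << i for i in range(64) if weights[i] > 0)

def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

# Prompts identical after normalisation collapse to the same fingerprint.
print(hamming(simhash64("Hello,   WORLD!"), simhash64("hello world")))  # -> 0
```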

## Configure the cache

Add or update the `[cache]` section of `~/.grob/config.toml`:

```toml
[cache]
enabled = true
max_capacity = 2000 # entries (~4 MiB at 2 KiB avg)
ttl_secs = 3600 # 1 hour
max_entry_bytes = 2097152 # 2 MiB per entry
simhash_threshold = 3 # max Hamming distance for a fuzzy hit
```

Then reload:

```sh
curl -X POST http://localhost:13456/api/config/reload
```

## Tuning the threshold

`simhash_threshold` is the maximum Hamming distance for a cache hit.
Lower values are stricter; higher values are more permissive.

| Threshold | Behaviour | Use when |
|-----------|-----------|----------|
| `0` | Exact-match only (fingerprint must match perfectly) | You want the fuzzy layer disabled in practice |
| `1`–`2` | Catches whitespace and trivial edits | Conservative — minimise false positives |
| `3` (default) | Catches paraphrases of short prompts and small edits to long ones | Balanced |
| `4`–`6` | Catches synonym swaps, reordered tokens | Aggressive — boilerplate-heavy workloads |
| `≥ 10` | High false-positive risk — unrelated prompts may match | Not recommended |

A threshold of 3 over a 64-bit fingerprint is roughly a 5% Hamming
radius. Empirically this catches paraphrases without colliding
unrelated short prompts in production logs.

## Per-tenant isolation

The exact cache key includes the tenant ID, so cached responses are
never shared across virtual API keys. The SimHash layer keys on the
fingerprint only; if you need strict tenant isolation on the fuzzy
layer too, set `simhash_threshold = 0` for that deployment.

## Observability

Grob exports two Prometheus counters specific to the SimHash layer:

| Metric | Meaning |
|--------|---------|
| `grob_simhash_cache_hits_total` | Fuzzy lookups that returned a hit |
| `grob_simhash_cache_misses_total` | Fuzzy lookups with no entry within threshold |

Generic cache metrics (`grob_cache_hits_total`,
`grob_cache_misses_total`) cover both layers. To compute the fuzzy
**uplift** — share of hits the exact cache would have missed — divide
SimHash hits by total cache lookups.
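As a rough PromQL sketch — assuming only the counter names listed above — the fuzzy uplift over a 5-minute window would look like:

```promql
# Share of all cache lookups served by the fuzzy layer
sum(rate(grob_simhash_cache_hits_total[5m]))
  /
(sum(rate(grob_cache_hits_total[5m]))
  + sum(rate(grob_cache_misses_total[5m])))
```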

## Disable the cache entirely

Set `enabled = false`. The SimHash layer is skipped along with the
exact layer; every request hits the upstream provider.

## Trade-offs

- **Latency**: SimHash adds ~1–5 µs per lookup (token hashing +
fingerprint scan). Negligible compared to network round-trip.
- **Memory**: each fuzzy entry stores its 64-bit fingerprint plus a
reference to the cached response. Bounded by `max_capacity`.
- **Determinism**: fuzzy hits return a response that was generated for
a *similar but not identical* prompt. Always safe for explanatory or
template-style prompts; can drift on prompts whose semantics depend
on specific token order or wording. Test with representative
workloads before raising the threshold above 3.

## Further reading

- [Configuration reference](../reference/configuration.md) — full list
of `[cache]` options.
- [Auto-tune routing with trace analysis](auto-tune-routing.md) — pair
the cache with classifier tuning for end-to-end speedup.
115 changes: 115 additions & 0 deletions docs/how-to/use-grob-hint.md
# Use grob_hint to override request complexity

Skip the heuristic classifier when the client already knows how complex
a request is. `grob_hint` declares a complexity tier for a single
request, bypassing scoring and feeding directly into provider selection.

Three surfaces are equivalent and supported:

| Surface | Best for |
|---------|----------|
| `X-Grob-Hint` HTTP header | curl, scripts, any HTTP client |
| `metadata.grob_hint` request body field | SDK clients (Anthropic, OpenAI) |
| `grob_hint` MCP tool | MCP-native agents that cannot set headers |

## Valid hint values

- `trivial` — fast lookup, short answer, no reasoning
- `medium` — standard reasoning, moderate context
- `complex` — deep reasoning, multi-step, tool use, large context

Anything else is rejected with `400 Bad Request`.

## Priority order

When several hint surfaces are set on the same request, the first match
wins:

1. `X-Grob-Hint` header
2. `metadata.grob_hint` body field
3. MCP one-shot slot (set via `grob_hint` tool, consumed on next dispatch)

If none are set, the heuristic classifier scores the request from
observable signals (max_tokens, tools, context size, keywords, system
prompt length).

## Use the X-Grob-Hint header

Cleanest path for shell scripts and `curl`. Add a single header:

```sh
curl http://localhost:13456/v1/messages \
  -H 'Content-Type: application/json' \
  -H 'X-Grob-Hint: complex' \
  -d '{
    "model": "claude-sonnet-4",
    "max_tokens": 4096,
    "messages": [{"role": "user", "content": "Refactor this monorepo build pipeline."}]
  }'
```

The hint is consumed for this request only. The next request without
the header falls back to scoring.

## Use metadata.grob_hint in the body

Some SDKs forbid custom headers but allow arbitrary `metadata` fields.
Drop the hint there — Grob reads it before forwarding the request:

```json
{
  "model": "claude-sonnet-4",
  "max_tokens": 1024,
  "metadata": {
    "grob_hint": "trivial"
  },
  "messages": [{"role": "user", "content": "What time zone is UTC+1?"}]
}
```

Grob strips `metadata.grob_hint` before forwarding to the upstream
provider, so no provider-specific metadata schema is contaminated.

## Use the grob_hint MCP tool

For MCP clients (Claude Code, Cursor, custom agents) that cannot set
custom HTTP headers and don't shape the request body directly. Call
the tool **before** the request you want to influence:

```json
{
  "method": "tools/call",
  "params": {
    "name": "grob_hint",
    "arguments": {"complexity": "complex"}
  }
}
```

The hint is stored in a one-shot slot on the server and consumed by
the next dispatch from the same MCP session. After consumption the slot
is cleared automatically — you must call `grob_hint` again to influence
a subsequent request.

## When to use each surface

- **Header** — quick experiments, batch scripts, profiling.
- **Metadata** — production SDK clients where you control the request
body but not the transport.
- **MCP tool** — agentic clients that operate through MCP and don't
craft HTTP requests directly. Useful when an agent's planner has
already classified the task and wants to avoid re-running the
scorer on the proxy.

## Troubleshooting

| Symptom | Likely cause |
|---------|--------------|
| Hint ignored | Misspelled or invalid value; only `trivial`/`medium`/`complex` are accepted |
| Hint applied to the wrong request | MCP one-shot slot was consumed earlier; call `grob_hint` again |
| Header passed through to provider | Should not happen — file an issue with the request trace |

## Further reading

- [Auto-tune routing with trace analysis](auto-tune-routing.md) — when to
rely on the scorer instead of pinning hints.