feat(server): optionally prepend tool schemas to the system prompt #68
I found this one while testing #66 inside Hermes. I was confused why Hermes new-session prompts weren't being cached. Yes, the date-time was one cause, but even when we trimmed off enough to go past the date-time suffix, I still wasn't getting cache hits. I dumped the raw query and saw that ds4 was appending the tool summary to the system prompt, which put the date-time near the middle of the system prompt, so trimming xxx bytes off the prompt wasn't enough to get a cache hit on the new-session prompt. By prepending the tool block to the system prompt, we get a larger fully static section, and the chop can be smaller and still cache hit. There is a pretty big performance win on Hermes Agent.
@antirez prepending the tools block that ds4 generates lets ds4 cache-hit cold queries from agents like Hermes. They send up tools (which currently get appended), and somewhere in the middle is the current date/time. The date/time triggers a cache miss, but when the tools block is prepended and we trim a few tokens off the tail of the request, we wind up with a clean cache hit on all cold requests from agents. I've seen tens of seconds saved at the start of a new conversation. If we combine this feature with #66, caching only the system prompt on cold, we can use a smaller trim. I made this and #66 both opt-in with a new flag, but if you'd prefer either of these to be default behavior, let me know and I'll rework the PRs.
b71a1e7 to cdedc8c
|
@unsaltedbutter-ai the reason to put tools at the end was the hope that we would sometimes be inside the uncompressed KV cache, but I just checked the server logs for a few actual instances: if I tokenize the trace, we are basically never inside the last 128 tokens when the tool call is emitted, so we are using the compressed KV cache anyway. At this point, does it make sense to have this as an option? What about switching to prefixed tools instead?
ds4-server auto-injects "## Tools" and the per-request tool-schema listing into the system content of every chat-style request that declares tools. Place that block at the head of the system region, before any client-supplied system messages, so the dynamic tail of the client's system prompt (commonly a per-request timestamp or agent-id emitted by an SDK) sits at the very end of the system region, immediately before <|User|>.
That position is what --kv-cache-boundary-trim-tokens subtracts from when computing the cached cut. With tool schemas at the head, trim can chop only the variable bytes and keep the much larger tool-schema region inside the cached prefix, so cross-session lookups hit instead of always missing on the dynamic tail. The model still sees the same content in the same role; only the order inside the system block changes.
Includes a unit test asserting that the tool block is rendered before the client's system content and that the client content stays immediately before the user marker.
cdedc8c to 7b68234
I offered it as an option because I wanted to make it easier to adopt in case there was a tradeoff to be considered. If you're thinking of making it the default mechanism, I think that would be great. This tool-first ordering plus cold-caching only the system prompt (PR #66) are pretty big wins when using ds4-server with agents like Hermes. I've rewritten the commit to make tools-first the standard.
Summary
Adds --tools-prepend-system, an opt-in flag that renders ds4's auto-injected tool boilerplate at the start of the system prompt instead of after the client's system content. The model sees the same content in the same role; only the order changes. With the flag enabled, the client's system content (including any dynamic tail it contains) sits immediately before <|User|>, so the existing --kv-cache-boundary-trim-tokens knob can chop just those bytes and keep the tool-schema region in the cached prefix.
Motivation
render_chat_prompt_text injects "## Tools" and "### Available Tool Schemas" instructions into every prompt whose request carries a tools field. Today that block is appended after the client's own system content, so the rendered system region is the client's system content followed by the tool block.
For tool-using agents whose system message has a small dynamic field, the variable bytes sit between two stable regions. There is no length-based cache cut that excludes the dynamic content while including any of the tool schemas. Every cross-session lookup misses for the same structural reason regardless of how --kv-cache-boundary-trim-tokens is tuned.
For example, a Hermes-style agent typically emits a few lines at the end of its system prompt summarizing the session context (a Conversation started: … block).
The first line varies between sessions (timestamp) while the rest is stable. Under the current layout that block lives in the middle of the rendered system region, with the much larger tool-schema region after it. There is no horizontal cut that captures any tool schemas without also capturing the variable timestamp, so the KV cache cannot be reused across sessions.
With --tools-prepend-system, the tool block comes first and the client's system content follows it. The dynamic content is now at the tail of the system block. The length-based heuristic in kv_cache_store_len produces a cut below the variable bytes, and cross-session cache hits start working without any other knob changes.
What this changes
- --tools-prepend-system. Off by default. Adds a bool tools_prepend_system field on server_config and struct server.
- render_chat_prompt_text gains a bool tools_prepend_system parameter that selects whether the tool block is appended (default) or prepended.
- TOOLS_AFTER_SYSTEM and TOOLS_BEFORE_SYSTEM are used at call sites so the position argument reads naturally next to the existing DS4_THINK_* constants instead of as a bare boolean.
- server_tools_prepend_system(const server *s) lets parsers read the flag before the full struct server definition is in scope, matching the existing pattern used by tool_memory_attach_to_messages and friends.
- Startup logs "tool schemas rendered before client system content" when the flag is active, so operators can confirm the flag took effect without running a request first.
Behavior
- No effect on /v1/completions requests (no chat-template tool injection runs).
Benchmark
Workload: an agent with a 16K-token system prompt that ends with a small dynamic tail (a Conversation started: … block of roughly 50 tokens immediately before the user message). Two back-to-back sessions with clean context.
Server launched with --kv-cache-boundary-trim-tokens 1000 --kv-cache-boundary-align-tokens 2048 --tools-prepend-system and the disk cache directory unmodified between the two sessions.
Without --tools-prepend-system, the same two-session test produces zero cross-session cache hits. The cold cut lands above the dynamic tail (because the appended tool block sits between the tail and the user marker), the SHA always differs across sessions, and "prompt done" reports the full ~45 s on every request.
Tests
- test_render_chat_prompt_text_tools_prepend_system (new) asserts both renderings contain the same set of marker strings and that the tool block precedes the client system content when the flag is true and follows it when the flag is false.
- Existing tests pass TOOLS_AFTER_SYSTEM explicitly, so behavior is unchanged when the flag is off.
Run with:
Usage
Startup will print:
Recommended companion flags for tool-using agents with a small dynamic tail:
The trim of 1000 with align of 2048 lands the cold cut at a prefill-batch-aligned position safely below typical dynamic tails. Adjust trim downward if the dynamic tail is smaller, or upward if it is larger.