feat(server): optionally prepend tool schemas to the system prompt by unsaltedbutter-ai · Pull Request #68 · antirez/ds4

unsaltedbutter-ai · 2026-05-11T05:02:45Z

Summary

Adds --tools-prepend-system, an opt-in flag that renders ds4's auto-injected tool boilerplate at the start of the system prompt instead of after the client's system content. The model sees the same content in the same role; only the order changes. With the flag enabled, the client's system content (including any dynamic tail it contains) sits immediately before <｜User｜>, so the existing --kv-cache-boundary-trim-tokens knob can chop just those bytes and keep the tool-schema region in the cached prefix.

Motivation

render_chat_prompt_text injects ## Tools and ### Available Tool Schemas instructions into every prompt whose request carries a tools field. Today that block is appended after the client's own system content, producing this layout:

<｜begin▁of▁sentence｜>
[client's system content, possibly with a small dynamic tail]
[ds4-injected tool schemas + boilerplate]
<｜User｜>
[user message]

For tool-using agents whose system message has a small dynamic field, the variable bytes sit between two stable regions. There is no length-based cache cut that excludes the dynamic content while including any of the tool schemas. Every cross-session lookup misses for the same structural reason regardless of how --kv-cache-boundary-trim-tokens is tuned.

For example, a Hermes-style agent typically emits a few lines at the end of its system prompt summarizing the session context:

Conversation started: Sunday, Oct 31, 2008 03:00 PM
Model: deepseek-v4-flash
Provider: custom

Host: macOS (26.4.1)
User home directory: /Users/snaka
Current working directory: /Users/snaka/Documents

The first line varies between sessions (timestamp) while the rest is stable. Under the current layout that block lives in the middle of the rendered system region, with the much larger tool-schema region after it. No prefix cut can capture any of the tool schemas without also capturing the variable timestamp, so the KV cache cannot be reused across sessions.

With --tools-prepend-system, the layout becomes:

<｜begin▁of▁sentence｜>
[ds4-injected tool schemas + boilerplate]   <-- byte-stable across requests
[client's system content, dynamic tail intact]
<｜User｜>
[user message]

The dynamic content is now at the tail of the system block. The length-based heuristic in kv_cache_store_len produces a cut that falls before the variable bytes, and cross-session cache hits start working without any other knob changes.
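The effect of the two layouts on a shared prefix can be sketched in a few lines of C. This is a toy illustration, not ds4 code: it works on bytes rather than tokens, and render_system, shared_across_sessions, TOOL_BLOCK, CLIENT_HEAD and TAIL_LABEL are hypothetical names with placeholder content.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins: the real tool block and client prompt are far larger. */
#define TOOL_BLOCK  "## Tools\n### Available Tool Schemas\n{\"name\":\"read_file\"}\n"
#define CLIENT_HEAD "You are a coding agent.\n"
#define TAIL_LABEL  "Conversation started: "

/* Render only the system region, in one of the two orders shown above. */
static void render_system(char *out, size_t cap, const char *timestamp,
                          bool tools_first)
{
    char client[256];
    int n = snprintf(client, sizeof client,
                     CLIENT_HEAD TAIL_LABEL "%s\nModel: deepseek-v4-flash\n",
                     timestamp);
    assert(n > 0 && (size_t)n < sizeof client);
    n = tools_first ? snprintf(out, cap, "%s%s", TOOL_BLOCK, client)
                    : snprintf(out, cap, "%s%s", client, TOOL_BLOCK);
    assert(n > 0 && (size_t)n < cap);
}

/* Length of the longest common prefix of two NUL-terminated strings. */
static size_t common_prefix_len(const char *a, const char *b)
{
    size_t i = 0;
    while (a[i] != '\0' && a[i] == b[i]) i++;
    return i;
}

/* Bytes that two sessions share when only the timestamp differs. */
static size_t shared_across_sessions(bool tools_first)
{
    char s1[512], s2[512];
    render_system(s1, sizeof s1, "Sunday, Oct 31, 2008 03:00 PM", tools_first);
    render_system(s2, sizeof s2, "Monday, Nov 03, 2008 09:15 AM", tools_first);
    return common_prefix_len(s1, s2);
}
```

With the tool block appended, the shared prefix stops at the timestamp and contains none of the tool block. With the tool block prepended, the shared prefix grows by exactly strlen(TOOL_BLOCK), which is the region a prefix cache can then reuse.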

What this changes

New CLI flag --tools-prepend-system. Off by default. Adds a bool tools_prepend_system field on server_config and struct server.
render_chat_prompt_text gains a bool tools_prepend_system parameter that selects whether the tool block is appended (default) or prepended.
Named macros TOOLS_AFTER_SYSTEM and TOOLS_BEFORE_SYSTEM are used at call sites so the position argument reads naturally next to the existing DS4_THINK_* constants instead of as a bare boolean.
A forward-declared accessor server_tools_prepend_system(const server *s) lets parsers read the flag before the full struct server definition is in scope, matching the existing pattern used by tool_memory_attach_to_messages and friends.
Startup logs "tool schemas rendered before client system content" when the flag is active, so operators can confirm the flag took effect without running a request first.
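A minimal sketch of how the named position macros and the forward-declared accessor could fit together. Only the names TOOLS_AFTER_SYSTEM, TOOLS_BEFORE_SYSTEM, tools_prepend_system and server_tools_prepend_system come from the description above. The macro values, the one-field struct server, and the helpers render_system_region and render_for_server are assumptions made for illustration, and the real render_chat_prompt_text does far more.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Named positions from the PR description; the values are an assumption. */
#define TOOLS_AFTER_SYSTEM  false
#define TOOLS_BEFORE_SYSTEM true

/* Forward declarations: code placed above the struct definition can query
 * the flag through the accessor without knowing the struct layout. */
typedef struct server server;
static bool server_tools_prepend_system(const server *s);

/* Hypothetical, much-simplified stand-in for the real renderer. */
static int render_system_region(char *out, size_t cap,
                                const char *client_system,
                                const char *tool_block,
                                bool tools_prepend_system)
{
    const char *first  = tools_prepend_system ? tool_block : client_system;
    const char *second = tools_prepend_system ? client_system : tool_block;
    return snprintf(out, cap, "%s%s", first, second);
}

/* Hypothetical parser-side helper that only needs the accessor. */
static int render_for_server(const server *s, char *out, size_t cap,
                             const char *client_system, const char *tool_block)
{
    return render_system_region(out, cap, client_system, tool_block,
                                server_tools_prepend_system(s));
}

/* The full definition appears later in the file (reduced to one field here). */
struct server {
    bool tools_prepend_system;
};

static bool server_tools_prepend_system(const server *s)
{
    assert(s != NULL);
    return s->tools_prepend_system;
}
```

A call such as render_system_region(buf, sizeof buf, sys, tools, TOOLS_AFTER_SYSTEM) states the intended position at the call site, which is the readability argument made for the macros.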

Behavior

Off by default. Existing behavior is unchanged when the flag is not passed.
When enabled, the auto-injected tool block is placed at the start of the system content. All other rendering (BOS, role markers, thinking tags, tool-result rendering) is unchanged.
Has no effect on requests without tools (no tool block to position).
Has no effect on raw /v1/completions requests (no chat-template tool injection runs).

Benchmark

Workload: an agent with a 16K-token system prompt that ends with a small dynamic tail (a Conversation started: … block of roughly 50 tokens immediately before the user message). Two back-to-back sessions with clean context.

Server launched with --kv-cache-boundary-trim-tokens 1000 --kv-cache-boundary-align-tokens 2048 --tools-prepend-system and the disk cache directory unmodified between the two sessions.

Metric	First request (cold)	Second request (cache hit)	Saved
Prefill wall	45.7 s	6.9 s	38.8 s
Prefill tokens	16081	1703	14378 tokens served from cache
Total request	~47 s	~26 s	~21 s
Cache file	written	reused	one file serves every subsequent session

Without --tools-prepend-system, the same two-session test produces zero cross-session cache hits. The cold cut lands after the dynamic tail (because the appended tool block sits between the tail and the user marker), so the cached prefix includes the timestamp, the SHA always differs across sessions, and "prompt done" reports the full ~45 s on every request.

Tests

test_render_chat_prompt_text_tools_prepend_system (new) asserts both renderings contain the same set of marker strings and that the tool block precedes the client system content when the flag is true and follows it when the flag is false.
All existing tests pass; call sites updated to pass TOOLS_AFTER_SYSTEM explicitly so behavior is unchanged when the flag is off.

Run with:

make test
# or, to run just the model-free tests:
./ds4_test --server
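The ordering assertion described for the new test can be sketched with strstr. This is not the test from the PR: marker_precedes and check_tool_block_order are hypothetical helpers, and "CLIENT-SYSTEM" is a placeholder for whatever marker the real test plants in the client's system content.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* True when both markers occur in `text` and the first occurrence of
 * `first` starts before the first occurrence of `second`. */
static bool marker_precedes(const char *text, const char *first,
                            const char *second)
{
    const char *a = strstr(text, first);
    const char *b = strstr(text, second);
    return a != NULL && b != NULL && a < b;
}

/* Hypothetical check in the spirit of the new test: both renderings
 * contain the same markers, in opposite order. */
static void check_tool_block_order(const char *tools_first_render,
                                   const char *tools_last_render)
{
    const char *tools  = "## Tools";       /* heading injected by ds4 */
    const char *client = "CLIENT-SYSTEM";  /* placeholder client marker */
    assert(marker_precedes(tools_first_render, tools, client));
    assert(marker_precedes(tools_last_render, client, tools));
}
```

Because marker_precedes requires both markers to be present, a rendering that silently drops the tool block fails the check instead of passing by accident.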

Usage

./ds4-server [other flags] --tools-prepend-system

Startup will print:

ds4-server: tool schemas rendered before client system content

Recommended companion flags for tool-using agents with a small dynamic tail:

./ds4-server [other flags] \
  --kv-disk-dir /path/to/cache \
  --kv-disk-space-mb 8192 \
  --kv-cache-boundary-align-tokens 2048 \
  --kv-cache-boundary-trim-tokens 1000 \
  --tools-prepend-system

The trim of 1000 with align of 2048 lands the cold cut at a prefill-batch-aligned position safely before typical dynamic tails. Adjust trim downward if the dynamic tail is smaller, or upward if it is larger.
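To reason about the trim and align values, the arithmetic can be sketched as follows. The formula is an assumption: this PR does not show how kv_cache_store_len combines the two knobs, so the sketch simply subtracts the trim and rounds down to a multiple of the alignment. Both function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* ASSUMPTION: trim first, then round down to the alignment.
 * The real kv_cache_store_len may compute the cut differently. */
static size_t assumed_cold_cut(size_t prompt_tokens, size_t trim, size_t align)
{
    assert(align > 0);
    if (prompt_tokens <= trim) return 0;
    return ((prompt_tokens - trim) / align) * align;
}

/* `suffix_tokens` counts tokens from the start of the dynamic tail to the
 * end of the prompt. The cached prefix is session-independent only if the
 * cut does not reach into that suffix. */
static bool cut_excludes_suffix(size_t prompt_tokens, size_t suffix_tokens,
                                size_t trim, size_t align)
{
    assert(suffix_tokens <= prompt_tokens);
    return assumed_cold_cut(prompt_tokens, trim, align)
           <= prompt_tokens - suffix_tokens;
}
```

Under this assumed formula, a 16081-token prompt with trim 1000 and align 2048 gives a cut at 14336 tokens, which leaves a dynamic tail of about 50 tokens outside the cached prefix. The number illustrates the assumed formula only and is not expected to reproduce the measured benchmark figures.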

unsaltedbutter-ai · 2026-05-11T05:06:10Z

I found this one while testing #66 inside Hermes. I was confused why Hermes new-session prompts weren't being cached. Yes, the date-time was one cause, but even when we trimmed off enough to go past the date-time suffix, I was still not getting a cache hit. I dumped the raw query and saw that ds4 was appending the tool summary to the system prompt, which put the date-time near the middle of the system prompt, so trimming xxx bytes off the end of the prompt wasn't enough to get a cache hit on the new-session prompt. By prepending the tool block to the system prompt, we have a larger fully static section, and the chop can be smaller and still cache hit. There is a pretty big performance win on Hermes Agent.

unsaltedbutter-ai · 2026-05-12T15:28:03Z

@antirez prepending the tools block that ds4 generates lets ds4 cache-hit cold queries from agents like Hermes. They send up tools (which currently get appended) and somewhere in the middle is the current date/time. The date/time triggers a cache miss, but when the tools block is prepended and we trim a few tokens off the tail of the request, we wind up with a clean cache hit on all cold requests from agents. I've seen tens of seconds saved at the start of a new conversation.

If we combine this feature with #66, caching only the system prompt on cold, we can use a smaller --kv-cache-boundary-trim-tokens to cache more of the cold, saving a bit more time.

I made this and #66 both opt-in with a new flag, but if you'd prefer either of these to be default behavior, let me know and I'll rework the PRs.

antirez · 2026-05-15T06:48:07Z

@unsaltedbutter-ai the reason to put tools at the end was that hopefully we sometimes were inside the uncompressed KV cache, but I just checked the server logs for a few actual instances: if I tokenize the trace we are basically never inside the last 128 tokens when the tool call is emitted, so anyway we are using the compressed KV cache. At this point, does it make sense to have this as an option? What about switching to prefixed tools instead?

ds4-server auto-injects "## Tools" and the per-request tool-schema listing into the system content of every chat-style request that declares tools. Place that block at the head of the system region, before any client-supplied system messages, so the dynamic tail of the client's system prompt (commonly a per-request timestamp or agent-id emitted by an SDK) sits at the very end of the system region, immediately before <｜User｜>. That position is what --kv-cache-boundary-trim-tokens subtracts from when computing the cached cut. With tool schemas at the head, trim can chop only the variable bytes and keep the much larger tool-schema region inside the cached prefix, so cross-session lookups hit instead of always missing on the dynamic tail. The model still sees the same content in the same role; only the order inside the system block changes. Includes a unit test asserting that the tool block is rendered before the client's system content and that the client content stays immediately before the user marker.

unsaltedbutter-ai · 2026-05-15T20:20:48Z

> @unsaltedbutter-ai the reason to put tools at the end was that hopefully we sometimes were inside the uncompressed KV cache, but I just checked the server logs for a few actual instances: if I tokenize the trace we are basically never inside the last 128 tokens when the tool call is emitted, so anyway we are using the compressed KV cache. At this point, does it make sense to have this as an option? What about switching to prefixed tools instead?

I offered it as an option because I wanted to make it easier to adopt in case there was a tradeoff to be considered.

If you're thinking of making it the default mechanism, I think that would be great. This tool-first ordering plus cold-caching only the system prompt (PR #66) are pretty big wins when using ds4-server with agents like Hermes.

I've re-written the commit to make tools-first the standard.

unsaltedbutter-ai force-pushed the feat/tools-prepend-system branch 3 times, most recently from b71a1e7 to cdedc8c · May 15, 2026 05:10

antirez added kv-cache tools-calling no-brainer 💃 labels May 15, 2026

unsaltedbutter-ai force-pushed the feat/tools-prepend-system branch from cdedc8c to 7b68234 · May 15, 2026 20:20

antirez merged commit 03a43b8 into antirez:main May 15, 2026

unsaltedbutter-ai deleted the feat/tools-prepend-system branch May 16, 2026 19:19

antirez merged 1 commit into antirez:main from unsaltedbutter-ai:feat/tools-prepend-system
