DeepSeek-Reasonix-AHE: an observable harness-evolution experiment built on Reasonix #3457

GTC2080 · 2026-06-07T13:52:57Z

Hello, I'm a Reasonix user and contributor (GTC2080). I recently built an experimental derivative project on top of DeepSeek-Reasonix, tentatively named DeepSeek-Reasonix-AHE:

https://github.com/GTC2080/DeepSeek-Reasonix-AHE

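It is not an official upstream release, nor a large feature PR that I expect to be merged right away. It is an experimental framework built around the paper “Agentic Harness Engineering: Observability-Driven Automatic Evolution of Coding-Agent Harnesses”. My understanding is that Reasonix itself makes an excellent base for a coding agent, and AHE can serve as an upgrade layer on top of it that makes the agent harness observable, evaluable, rollback-capable, and able to evolve iteratively.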

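So far the experiment has a complete local foundation in place, consisting mainly of:

Trace observability system
Adds a typed JSONL trace that records events such as session, turn, model request/response, tool call/result, usage, and cache stats. It supports three modes (metadata / preview / full) and applies safety redaction by default.
Prompt Cache Contract
Establishes a stable hash contract over the system prompt, tool schema, and stable prefix, used to detect whether inputs that affect the DeepSeek prompt cache drift within a single session. The current policy is warning-only and does not interrupt the run.
Harness Snapshot
Adds a local layout under .reasonix-harness/ (source, snapshots, active, pinned, and so on) so that prompts, tool descriptions, skills, middleware, and routing can be frozen into versioned snapshots.
Local Eval Runner
Adds reasonix lab eval, which runs local AHE benchmark tasks and emits artifacts such as trace, diff, verify log, cache report, and result.
Cache Report / Evidence Distill
Aggregates prompt cache hit/miss, contract violation, harness snapshot, and similar information from traces, and can also generate a deterministic evidence report from eval artifacts.
Proposal Workflow
Adds a proposal manifest and proposal create/check/status/apply/accept/reject commands. It supports staged apply: the diff is applied to a temporary copy of the source and a target snapshot is generated, without modifying the live source directly.
GC / Quota Dry-run
Adds a dry-run GC planner for local artifacts that explains which traces/evals/proposals/snapshots would be kept or become cleanup candidates, and why.
Active Harness Injection + Promote/Rollback
In the latest stage, .reasonix-harness/active now actually affects reasonix run/chat: at session start, the active snapshot's prompt/middleware/routing is injected, and tool_descriptions/.md is used to override the provider-facing tool schema description. Snapshot promote/rollback has also been added so that a candidate snapshot can be safely promoted or rolled back.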
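
To make the Prompt Cache Contract idea concrete, here is a minimal Python sketch of a warning-only drift check. It is not taken from the AHE repository: the names `CacheContract`, `build_contract`, `check_drift`, and `stable_hash`, the choice of SHA-256, and the canonical-JSON encoding are all assumptions made for illustration.

```python
import hashlib
import json
import warnings
from dataclasses import dataclass


def stable_hash(value) -> str:
    """Hash a JSON-serializable value using a canonical (key-sorted) encoding."""
    encoded = json.dumps(value, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(encoded.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class CacheContract:
    """Hypothetical contract: one hash per cache-relevant input."""
    system_prompt: str
    tool_schema: str
    stable_prefix: str


def build_contract(system_prompt, tool_schemas, prefix_messages) -> CacheContract:
    return CacheContract(
        system_prompt=stable_hash(system_prompt),
        tool_schema=stable_hash(tool_schemas),
        stable_prefix=stable_hash(prefix_messages),
    )


def check_drift(baseline: CacheContract, current: CacheContract) -> list[str]:
    """Return the names of drifted fields. Warns but never raises (warning-only)."""
    drifted = [
        name
        for name in ("system_prompt", "tool_schema", "stable_prefix")
        if getattr(baseline, name) != getattr(current, name)
    ]
    for name in drifted:
        warnings.warn(f"prompt cache contract drift: {name}")
    return drifted


# Example: a tool description changes mid-session, so only tool_schema drifts.
baseline = build_contract(
    "You are a coding agent.",
    [{"name": "read_file", "description": "Read a file"}],
    [],
)
later = build_contract(
    "You are a coding agent.",
    [{"name": "read_file", "description": "Read one file"}],
    [],
)
print(check_drift(baseline, later))  # ['tool_schema']
```

Sorting keys before hashing means two semantically identical tool schemas that differ only in key order produce the same hash, so the check flags real content changes rather than serialization noise. Whether key order itself affects the provider-side cache is a separate question that this sketch does not address.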

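I personally think this route may be a useful reference for Reasonix: rather than simply adding features, it turns the Reasonix harness itself into something that can be observed, evaluated, verified, rolled back, and evolved step by step. In particular, since Reasonix already places heavy emphasis on the DeepSeek prompt cache, the AHE layer can further help analyze cache hit rate, locate prefix drift, and lay the groundwork for automated harness evolution later.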

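If you have a spare moment, I'd appreciate a quick look at this experimental direction. My main questions are:

1. Does this direction fit Reasonix's long-term design philosophy?
2. Is there anything in the current architecture that is clearly inappropriate or over-engineered?
3. Are capabilities such as active harness injection, cache contract, and eval artifacts genuinely valuable to Reasonix?
4. If I want to contribute part of this to upstream Reasonix in the future, which modules are better suited to being split out, and which should remain experimental?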

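Of course, I fully understand that maintaining the project keeps you busy, and no formal review is needed. Even pointing out a few directional issues would be a great help to this experiment.

Thank you again for maintaining Reasonix. The AHE project is built on Reasonix's existing architecture, and I hope it can offer some useful reference for Reasonix's future development.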

esengine (Maintainer) · 2026-06-08T03:05:30Z

Thanks for putting this together so thoroughly — and for your ongoing contributions 🙏 I really like the AHE angle: it doesn't just add a feature, it treats the harness itself as something observable and verifiable, which is very much in line with where Reasonix is headed. Here's my take on your four questions.

1. Does the direction fit?
Parts of it fit very well. Reasonix's core identity is "cache-first and cheap," so anything that makes prompt-cache hit-rate observable and makes drift locatable is going with the grain for us. Your trace + cache contract land right on that line. The "automatically evolving harness" layer, by contrast, is more forward-looking for the core repo — worth exploring, but I'd keep it strictly separate from the observability piece.

2. Any over-engineering?
Honestly, the snapshot / proposal workflow / active injection / GC stack as a whole is heavy for the core repo. Reasonix is deliberately cheap and lean — another layer of versioning + staged apply + injection override carries real maintenance and cognitive cost. It's perfectly reasonable as a standalone experiment, but to land in core I'd want each piece to first prove its value in the thinnest form that still produces signal.

3. Which capabilities are genuinely valuable?

Prompt Cache Contract / prefix-drift detection — highest value. We care a lot about cache already, but we're missing a hard signal for "did the stable prefix drift within a session." I'd genuinely want this.
Eval artifacts / cache report — also useful, and we already have /e2e running accuracy/cache/token/cost benchmarks; your deterministic evidence report could plug straight into that.
Active harness injection — interesting, but it's heavy runtime-behavior surgery; I'd keep it in the experimental layer to mature.

4. What to upstream later?
I'd split it as thin slices:

Start with the cache contract (prefix-drift detector) — warning-only, as a standalone, optional module with zero runtime side effects. It's the easiest to review and the one that most reinforces our positioning.
Next, the cache hit/miss report aggregated from traces, which can feed the existing e2e suite directly.
Keep snapshot / proposal / active-injection / GC experimental for now; once the shape settles and there's data showing the payoff, we can talk about them individually.
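
As a rough illustration of the second slice (a cache hit/miss report aggregated from traces), the sketch below sums per-session token counts from JSONL trace lines. The event shape is an assumption: the `type` and `session` keys are hypothetical, and the `prompt_cache_hit_tokens` / `prompt_cache_miss_tokens` field names are borrowed from the DeepSeek API usage object, which the actual AHE trace may or may not mirror.

```python
import json
from collections import defaultdict


def cache_report(jsonl_lines):
    """Aggregate per-session prompt cache hit/miss token counts from trace lines."""
    totals = defaultdict(lambda: {"hit": 0, "miss": 0})
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the trace file
        event = json.loads(line)
        if event.get("type") != "usage":
            continue  # skip turn, tool call, and other event types
        bucket = totals[event["session"]]
        bucket["hit"] += event.get("prompt_cache_hit_tokens", 0)
        bucket["miss"] += event.get("prompt_cache_miss_tokens", 0)

    report = {}
    for session, t in totals.items():
        seen = t["hit"] + t["miss"]
        report[session] = {**t, "hit_rate": t["hit"] / seen if seen else 0.0}
    return report


# Hypothetical trace: two usage events and one unrelated tool-call event.
trace = [
    '{"type": "usage", "session": "s1", "prompt_cache_hit_tokens": 900, "prompt_cache_miss_tokens": 100}',
    '{"type": "tool_call", "session": "s1", "name": "read_file"}',
    '{"type": "usage", "session": "s1", "prompt_cache_hit_tokens": 600, "prompt_cache_miss_tokens": 400}',
]
report = cache_report(trace)
print(report["s1"]["hit_rate"])  # 0.75
```

Because the function only reads trace lines and returns a plain dictionary, a report like this could in principle be produced offline after a benchmark run, with no effect on runtime behavior.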

If you'd like to start with a standalone "cache contract" PR, I'd be happy to review it carefully. Thanks again — this direction is genuinely useful input for Reasonix.
