fix(adk): harden checkpoint resume compatibility #896

Merged
shentongmartin merged 14 commits into main from fix/checkpoint-compat-resume on Mar 18, 2026

shentongmartin commented Mar 17, 2026

Fix ADK checkpoint resume across v0.7 and v0.8.0–v0.8.3

Problem

Resuming from old ADK checkpoints can fail even when the logical adk.State is compatible.

This is caused by Go gob decoding rules:

  • Some checkpoint fields are stored behind any/interface{} so the stream carries a concrete type name.
  • Gob compatibility depends on both the on-wire type name and the wire kind. A struct wire and a GobEncoder payload are incompatible even if the type name is identical.
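The wire-kind rule can be reproduced in isolation. The following is a minimal sketch, not the ADK code: `structState` and `opaqueState` are hypothetical stand-ins for the two historical shapes. It encodes a plain struct and then tries to decode those bytes into a type that implements `GobDecoder`, which gob rejects before any field is read.

```go
package main

import (
	"bytes"
	"encoding/gob"
	"fmt"
)

// structState is a hypothetical type that gob encodes field by field (struct wire).
type structState struct {
	Step int
}

// opaqueState is a hypothetical type that takes over its own encoding
// (GobEncoder payload, opaque bytes on the wire).
type opaqueState struct {
	Step int
}

func (s opaqueState) GobEncode() ([]byte, error) {
	return []byte{byte(s.Step)}, nil
}

func (s *opaqueState) GobDecode(b []byte) error {
	if len(b) != 1 {
		return fmt.Errorf("opaqueState: want 1 byte, got %d", len(b))
	}
	s.Step = int(b[0])
	return nil
}

// encodeStructWire produces a stream whose value uses the struct wire kind.
func encodeStructWire() ([]byte, error) {
	var buf bytes.Buffer
	if err := gob.NewEncoder(&buf).Encode(structState{Step: 3}); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// decodeAsOpaque tries to read the stream into the GobDecoder type and
// returns whatever error gob reports.
func decodeAsOpaque(data []byte) error {
	var out opaqueState
	return gob.NewDecoder(bytes.NewReader(data)).Decode(&out)
}

func main() {
	data, err := encodeStructWire()
	if err != nil {
		panic(err)
	}
	// gob refuses the pairing: the stream holds a struct, the target expects
	// an encoder payload.
	fmt.Println("decode error:", decodeAsOpaque(data))
}
```

The same check applies when the value sits behind an interface: the registered name only selects the local type, and the wire kind must still match it.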

Historically, _eino_adk_react_state was reused with two different wire kinds:

  • v0.7.*: struct wire
  • v0.8.0–v0.8.3: GobEncode payload (opaque bytes)

Solution

  • Make State a plain gob struct (remove GobEncode/GobDecode) and register it under _eino_adk_react_state.
    • v0.7 checkpoints decode directly into *State (gob matches struct fields by name, so fields absent from the old stream keep their zero values).
    • Only v0.8.0–v0.8.3 needs special handling.
  • For v0.8.0–v0.8.3 checkpoints only, preprocess ADK checkpoint bytes before gob decoding and rewrite the on-wire name to a same-length alias (stateGobNameV080) so gob routes to a GobDecode-compatible type (stateV080).
  • At resume time, migrate stateV080 to the current *State.
  • Propagate migration failures explicitly so users see "migration failed" instead of an opaque gob decode error.
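The routing described above can be sketched end to end. This is an illustration under stated assumptions rather than the PR's implementation: `sharedName`, `legacyAlias`, `rewriteWireName`, `simulateLegacyBytes`, and `resume` are hypothetical names, the payload format is invented, and the sketch assumes the caller already knows the bytes come from the legacy format (how the PR detects that is not shown here). Because both types cannot be registered under one name in a single process, the legacy bytes are faked by encoding under the alias and stamping the shared name back on.

```go
package main

import (
	"bytes"
	"encoding/gob"
	"errors"
	"fmt"
	"strings"
)

// Hypothetical names for illustration; both are 16 bytes long.
const (
	sharedName  = "demo_react_state" // used by the current and the legacy format
	legacyAlias = "demo_react_sv080" // same-length alias for the legacy format
)

// State is the current format: a plain gob struct.
type State struct{ Messages []string }

// stateV080 mimics the legacy format: an opaque GobEncode payload.
type stateV080 struct{ Messages []string }

func (s stateV080) GobEncode() ([]byte, error) {
	return []byte(strings.Join(s.Messages, "\n")), nil
}

func (s *stateV080) GobDecode(b []byte) error {
	if len(b) == 0 {
		return errors.New("stateV080: empty payload")
	}
	s.Messages = strings.Split(string(b), "\n")
	return nil
}

func init() {
	gob.RegisterName(sharedName, &State{})
	gob.RegisterName(legacyAlias, &stateV080{})
}

// rewriteWireName swaps the on-wire type name before gob sees the bytes.
// gob length-prefixes strings and messages, so the lengths must match.
func rewriteWireName(data []byte, from, to string) ([]byte, error) {
	if len(from) != len(to) {
		return nil, fmt.Errorf("alias %q must be as long as %q", to, from)
	}
	return bytes.ReplaceAll(data, []byte(from), []byte(to)), nil
}

// simulateLegacyBytes fakes a legacy checkpoint: an opaque payload stored
// behind an interface and labeled with the shared name.
func simulateLegacyBytes(msgs ...string) ([]byte, error) {
	var buf bytes.Buffer
	var v any = &stateV080{Messages: msgs}
	if err := gob.NewEncoder(&buf).Encode(&v); err != nil {
		return nil, err
	}
	return rewriteWireName(buf.Bytes(), legacyAlias, sharedName)
}

// decodeAny decodes one interface value from data.
func decodeAny(data []byte) (any, error) {
	var v any
	err := gob.NewDecoder(bytes.NewReader(data)).Decode(&v)
	return v, err
}

// resume routes legacy bytes to stateV080, then migrates to *State.
// Every failure is reported as a migration failure.
func resume(data []byte) (*State, error) {
	aliased, err := rewriteWireName(data, sharedName, legacyAlias)
	if err != nil {
		return nil, fmt.Errorf("checkpoint migration failed: %w", err)
	}
	v, err := decodeAny(aliased)
	if err != nil {
		return nil, fmt.Errorf("checkpoint migration failed: %w", err)
	}
	old, ok := v.(*stateV080)
	if !ok {
		return nil, fmt.Errorf("checkpoint migration failed: unexpected type %T", v)
	}
	return &State{Messages: old.Messages}, nil
}

func main() {
	data, err := simulateLegacyBytes("hello", "world")
	if err != nil {
		panic(err)
	}
	_, directErr := decodeAny(data) // routed to State, so the wire kind mismatches
	fmt.Println("direct decode error:", directErr)
	st, err := resume(data)
	fmt.Println(st, err)
}
```

A blind byte replacement is only safe when the name cannot occur elsewhere in the checkpoint; a real migration would want a more targeted rewrite.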

Key Insight

If a type name is reused across versions but its wire kind changes, old bytes must be routed to a different local decoder. The safest options are:

  • use a new on-wire name for the new format, and/or
  • rewrite the legacy on-wire name to a same-length compat alias.

Summary

| Problem | Solution |
| --- | --- |
| Old checkpoints fail to resume due to gob wire-kind mismatch under the same name | Route legacy bytes to compat decoders via same-length name aliasing, then migrate to current *State |

Change-Id: Ia874fd20a27d19f9b0f6cbb078bcffcc7bfb1a33
- Clarify and harden CMA checkpoint byte migration

- Stabilize State gob names and remove internals map

- Add v0.8.3 checkpoint fixture and resume test

Change-Id: If335d36232a8bf8dc0011a4549c574032b13b4df
Change-Id: Id5ee4f19fb6801f2ef64ec6b6774b02be50ffe82
Change-Id: Ieb91a1f15f9093f85038bc5c4ff6f8932b19f0f2

codecov bot commented Mar 17, 2026

Codecov Report

❌ Patch coverage is 92.72727% with 8 lines in your changes missing coverage. Please review.
✅ Project coverage is 81.92%. Comparing base (2033c27) to head (3ba4622).
⚠️ Report is 2 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| adk/chatmodel.go | 57.89% | 6 Missing and 2 partials ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #896      +/-   ##
==========================================
+ Coverage   81.86%   81.92%   +0.06%     
==========================================
  Files         146      146              
  Lines       16035    16050      +15     
==========================================
+ Hits        13127    13149      +22     
  Misses       1963     1963              
+ Partials      945      938       -7     

Change-Id: Ib28262a51aa10c15871979067496c92684fafb97
Change-Id: Iadc660e90e8b28bebb9354d000ce26e77270f83e
- Generate checkpoint_data_v0.8.4.bin using current interrupt format

- Add resume test for v0.8.4 fixture

Change-Id: Iea6dd9377f936506cb24044212c23d3ebb821fee
Change-Id: I700b6e82bbdecd914662585f876ed8628e96048b
shentongmartin force-pushed the fix/checkpoint-compat-resume branch 3 times, most recently from 06c69a1 to 5293ed2 (March 18, 2026 07:50)
- Fix stale comments and correct version ranges

- Propagate checkpoint migration errors

- Make deep compat tests table-driven

Change-Id: I8c47abc8ce486ffc163ee78df37131af9790709d
shentongmartin force-pushed the fix/checkpoint-compat-resume branch from 5293ed2 to 72c3828 (March 18, 2026 07:54)
Change-Id: I1e38adc115e00b27161eedf4bf2bdd188e5dc047
Change-Id: Idff42e9910397495ca65f79daf6068fb8cb605b5
Change-Id: I4408fddb062df6ff5f87293af0b49b70d7fdf787
shentongmartin force-pushed the fix/checkpoint-compat-resume branch from b22119e to 4b176f8 (March 18, 2026 09:47)
- Make State a plain gob struct (remove GobEncode/GobDecode)

- Drop stateV07 and keep stateV080 for v0.8.0-v0.8.3

- Update v0.8.4 fixture

Change-Id: I34ff4f4e40b55a8afd4dfab75b9a11f6f0da8207
shentongmartin force-pushed the fix/checkpoint-compat-resume branch from 4b176f8 to 887a4a2 (March 18, 2026 09:51)
Change-Id: I0fb39f23be00167f09303f807ed58ca6722ff00d
shentongmartin merged commit 6f28c57 into main on Mar 18, 2026
17 checks passed
shentongmartin deleted the fix/checkpoint-compat-resume branch on March 18, 2026 10:02