fix(core): sanitize orphaned tool_use/tool_result on history restore (#1360)#1362
Merged
fix(core): sanitize orphaned tool_use/tool_result on history restore (#1360)#1362
Conversation
…1360) Cross-session history restore could produce invalid tool_use/tool_result sequences at history boundaries, causing Claude API 400 errors. Two root causes: - RC-1: load_history_filtered() LIMIT clause can split a tool_use/tool_result pair at the boundary, leaving an orphaned tool_use as the last restored message. - RC-2: Session interruption between persisting the assistant tool_use message and the user tool_result message leaves an orphaned tool_use in SQLite. Add sanitize_tool_pairs() called in load_history() after the loading loop. The function loops until stable, removing: 1. Trailing assistant messages that have ToolUse parts with no following user message containing ToolResult parts. 2. Leading user messages that have ToolResult parts with no preceding assistant message containing ToolUse parts. Each removal is logged via tracing::warn with the affected tool IDs. Add 6 unit tests covering all cases: trailing orphan, leading orphan, complete pair preserved, multiple consecutive orphans, plain messages unchanged, and combined leading+trailing orphans in one history.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
tool_use/tool_resultsequences at history boundaries, causing Claude API 400 errors on session resumesanitize_tool_pairs()post-load sanitization inload_history()that removes orphaned tool messages at both ends of restored historyRoot causes
RC-1:
load_history_filtered()LIMIT clause can split atool_use/tool_resultpair at the boundary, leaving an orphanedtool_useas the last restored message.RC-2: Session interruption (Ctrl+C, timeout, crash) between persisting the assistant
tool_usemessage and the usertool_resultmessage leaves an orphanedtool_usein SQLite. On next session restore this triggers a Claude API 400.Changes
crates/zeph-core/src/agent/persistence.rs: Add privatesanitize_tool_pairs(messages: &mut Vec<Message>) -> usize. Called inload_history()after the loading loop on the just-loaded slice (split off fromself.messagesto exclude the system prompt). The function loops until stable, removing:ToolUseparts with no following userToolResultToolResultparts with no preceding assistantToolUseEach removal logs
tracing::warn!with affected tool IDs.Tests
6 new unit tests in
agent::persistence::tests:load_history_removes_trailing_orphan_tool_useload_history_removes_leading_orphan_tool_resultload_history_preserves_complete_tool_pairsload_history_handles_multiple_trailing_orphansload_history_no_tool_messages_unchangedload_history_removes_both_leading_and_trailing_orphansTest plan
cargo +nightly fmt --checkpassescargo clippy --workspace --features full -- -D warningspassescargo nextest run --workspace --features full --lib --binspasses (4693 tests)load_historytests passconfig_default_snapshotfailure is pre-existing on main (confirmed), unrelated to this PRNotes
O(n)remove(0)for leading orphan removal is acceptable givenhistory_limitbounds; can be optimized withVecDequein a future pass (R-1)Closes #1360