fix(skills): record tool call outcomes in native tool path #1462

Merged
bug-ops merged 3 commits into main from
wilson-score-active-skill-names
Mar 9, 2026
Conversation


@bug-ops bug-ops commented Mar 9, 2026

Problem

handle_native_tool_calls never called record_skill_outcomes(), so Wilson score re-ranking received zero signal from actual tool execution when using providers with native tool_use support (Claude, OpenAI, OpenAI-compatible).

Only the legacy text-extraction path called record_skill_outcomes (via handle_tool_result), meaning skill rankings were based solely on explicit user feedback corrections — never on whether tool calls actually succeeded or failed.
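For context, Wilson score re-ranking orders skills by the lower bound of a confidence interval on their observed success rate, which is why dropped outcomes matter: with no recorded signal, every skill sits at the zero-trials baseline. A minimal sketch of the standard Wilson lower bound (the exact formula, z-value, and integration with the ranker in zeph-core are assumptions, not taken from this PR):

```rust
/// Lower bound of the Wilson score interval for a success proportion.
/// Skills with few trials get a conservative (low) bound, so ranking by
/// this value avoids over-promoting a skill after one lucky success.
fn wilson_lower_bound(successes: u32, trials: u32, z: f64) -> f64 {
    if trials == 0 {
        return 0.0; // no signal recorded yet
    }
    let n = trials as f64;
    let p = successes as f64 / n;
    let z2 = z * z;
    let center = p + z2 / (2.0 * n);
    let margin = z * ((p * (1.0 - p) + z2 / (4.0 * n)) / n).sqrt();
    (center - margin) / (1.0 + z2 / n)
}
```

Note that the same success rate with more trials yields a higher lower bound, so skills only climb the ranking as evidence accumulates.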

Root Cause

handle_native_tool_calls (tool_execution.rs) processes tool results and sends outputs back to the LLM, but contained no calls to record_skill_outcomes. The legacy handle_tool_result path had these calls at lines ~638, 644, 659, 736.

Fix

Added record_skill_outcomes calls in the results loop of handle_native_tool_calls, after the match tool_result block, mirroring legacy behavior:

| Result | Outcome |
| --- | --- |
| `Ok(Some(out))` — success output | `"success"` |
| `Ok(Some(out))` — `[error]`/`[exit code` | `"tool_failure"` + `attempt_self_reflection` |
| `Ok(Some(out))` — empty output | `"success"` |
| `Ok(None)` | `"success"` (via `"(no output)"` → else branch) |
| `Err(e)` | `"tool_failure"` |

attempt_self_reflection is included on the native failure path (matching legacy behavior). At the insertion point, persist_message/push_message have not yet been called, so the early return Ok(()) after self-reflection is safe.
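The branch table above can be sketched as a small classifier. The `[error]`/`[exit code` markers and the outcome labels come from the PR description; the result type, the prefix-matching details, and the helper name are simplifying assumptions rather than the actual code in tool_execution.rs:

```rust
/// Maps a tool result to the outcome label recorded for Wilson scoring,
/// mirroring the branch table above. Hypothetical stand-in for the logic
/// added after the `match tool_result` block in handle_native_tool_calls.
fn outcome_label(result: Result<Option<&str>, &str>) -> &'static str {
    match result {
        // failure markers in the output trigger attempt_self_reflection
        Ok(Some(out)) if out.starts_with("[error]") || out.starts_with("[exit code") => {
            "tool_failure"
        }
        // success output, empty output, and Ok(None) all count as success
        Ok(_) => "success",
        // executor-level error
        Err(_) => "tool_failure",
    }
}
```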

Tests

5 regression tests added covering all branches:

  • native_tool_success_outcome_does_not_panic
  • native_tool_error_output_does_not_panic
  • native_tool_exit_code_output_does_not_panic
  • native_tool_executor_error_does_not_panic
  • native_tool_no_active_skills_does_not_panic
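These are smoke tests: each one drives a single branch and asserts the call completes without panicking. A hedged sketch of that shape, using `std::panic::catch_unwind`; `record_outcome` here is a stand-in, as the real tests exercise `handle_native_tool_calls` with a stubbed executor:

```rust
use std::panic;

// Stand-in for the code under test (hypothetical; the real tests call
// into handle_native_tool_calls).
fn record_outcome(output: Option<&str>) -> &'static str {
    match output {
        Some(out) if out.starts_with("[error]") => "tool_failure",
        _ => "success",
    }
}

// A "does not panic" regression test reduces to: run the branch and
// assert the closure returned Ok rather than unwinding.
fn runs_without_panic(output: Option<&'static str>) -> bool {
    panic::catch_unwind(|| record_outcome(output)).is_ok()
}
```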

Fixes #1436

@github-actions github-actions bot added bug Something isn't working size/L Large PR (201-500 lines) rust Rust code changes core zeph-core crate and removed size/L Large PR (201-500 lines) labels Mar 9, 2026
@bug-ops bug-ops enabled auto-merge (squash) March 9, 2026 17:09
bug-ops added 3 commits March 9, 2026 18:22
`handle_native_tool_calls` never called `record_skill_outcomes()`, so
Wilson score received no signal from actual tool execution when using
providers with native tool_use support (Claude, OpenAI).

Add `record_skill_outcomes` calls after each tool result in the
native path, mirroring the legacy `handle_tool_result` behavior:
- success output → "success" outcome
- `[error]`/`[exit code` output → "tool_failure" + attempt_self_reflection
- executor error (`Err(e)`) → "tool_failure"

Also add 5 regression tests covering all branches.

Fixes #1436

Apply ContentSource sanitization to tool output before passing it to
attempt_self_reflection(), matching the legacy error path behavior.

Also document the known limitation that a self-reflection early return
from handle_native_tool_calls drops remaining batch results.

Update config snapshot after LSP defaults removal in main.
The lsp section was incorrectly removed from the snapshot during branch
sync. Config::default() with --features full still includes it.
@bug-ops bug-ops force-pushed the wilson-score-active-skill-names branch from fd6de07 to 76024e2 Compare March 9, 2026 17:22
@bug-ops bug-ops merged commit 3bdd8f2 into main Mar 9, 2026
18 checks passed
@bug-ops bug-ops deleted the wilson-score-active-skill-names branch March 9, 2026 17:34


Development

Successfully merging this pull request may close these issues.

Wilson score: tool call outcomes silently dropped when active_skill_names is empty