Skip to content

Self-heal binding rejections + streaming-failed reply-token loop#561

Merged
eanzhao merged 2 commits intofeature/lark-botfrom
fix/2026-05-03_binding-and-reply-self-heal
May 4, 2026
Merged

Self-heal binding rejections + streaming-failed reply-token loop#561
eanzhao merged 2 commits intofeature/lark-botfrom
fix/2026-05-03_binding-and-reply-self-heal

Conversation

@eanzhao
Copy link
Copy Markdown
Contributor

@eanzhao eanzhao commented May 3, 2026

Production failure modes (observed 2026-05-03)

Two independent symptoms left a Lark user visibly stuck even though the cluster was otherwise healthy. Both healed inside aevatar — no NyxID changes.

#1/llm says "binding 已失效, 请 /init" while /init refuses with "已绑定"

After PR #558 re-DCR'd the cluster's OAuth client to add the proxy scope, the user's existing NyxID binding (issued for the previous client_id) was rejected at broker token-exchange. ModelChannelSlashCommandHandler caught the rejection and pointed the user at /init — which then refused because the local readmodel still holds the (now-dead) binding_id. The user loops forever.

Pod log evidence:

oauth_broker_service.rs: "Scope 'proxy' is not allowed for this binding"
ModelChannelSlashCommandHandler -> "当前 NyxID 绑定已失效, 请先发送 /init 重新绑定"
InitChannelSlashCommandHandler  -> "已绑定 NyxID 账号, 需要切换账号请先发送 /unbind 再发送 /init"

#2 — Bot replies ... forever after LLM failure

The streaming sink fires the first chunk via channel-relay/reply, consuming the single-use reply token and placing a ... placeholder. If the LLM then fails (e.g. upstream 429 usage_limit_reached), pre-fix the runtime fell through to RunLlmReplyAsync which issued a fresh /reply against the dead token:

NyxID API request failed: POST .../api/v1/channel-relay/reply -> 401
body={"error":"unauthorized","message":"Unauthorized: Reply token already used"}

User stares at ... with no error message ever appearing.

Fixes (self-heal, no external repo changes)

#1ModelChannelSlashCommandHandler self-heals stale local binding

When the broker throws BindingRevokedException / BindingNotFoundException / BindingScopeMismatchException, the handler now dispatches RevokeBindingCommand to the local ExternalIdentityBindingGAgent (reason auto_self_heal_*) so the readmodel flips to revoked, and tells the user the binding was cleared and /init will work next.

Mirrors the dispatch shape UnbindChannelSlashCommandHandler uses for explicit /unbind, but skips the NyxID-side revoke (NyxID is the one that just told us the binding is gone). When IActorRuntime is not registered (some demo / CLI hosts), degrades to "tell the user, hope they /unbind" rather than crashing the slash.

#2ConversationGAgent.TryCompleteStreamedReplyAsync handles the Failed branch

When streaming has already committed the placeholder (state.PlatformMessageId set, not Disabled), the Failed branch now edits the placeholder in place via RunStreamChunkAsync (which uses channel-relay/reply/update — no reply token needed) with the classified failure text, then persists ConversationTurnCompletedEvent so the runtime envelope retry loop does not refire and consume the dead token again.

If the edit also fails (rare — Lark can refuse stale-message edits), persists the last flushed partial as terminal: same defence-in-depth pattern PR #374 already established for the Completed path.

Tests

  • 4 new ModelSlashCommandHandlerTests pinning the binding self-heal for each rejection shape (scope_mismatch, remote_revoked, remote_not_found) plus the degraded-path when IActorRuntime is not registered. Asserts a RevokeBindingCommand envelope is dispatched to the local actor with the correct subject + reason code.
  • 2 new ConversationGAgentDedupTests:
    • WhenStreamingStartedThenLlmFailed_EditsPlaceholderInsteadOfReusingToken — pins the placeholder edit fires + RunLlmReplyAsync is NOT invoked + persisted outbound reflects the failure text.
    • WhenStreamingStartedAndFailedEditAlsoFails_PersistsLastFlushedAsTerminalWithoutReusingToken — pins the defence-in-depth fallback when the edit attempt is also rejected.

Verification

  • dotnet test test/Aevatar.GAgents.ChannelRuntime.Tests/Aevatar.GAgents.ChannelRuntime.Tests.csproj (851/851)
  • dotnet test test/Aevatar.GAgents.Channel.Protocol.Tests/Aevatar.GAgents.Channel.Protocol.Tests.csproj --filter ~ConversationGAgentDedupTests (36/36)
  • dotnet test test/Aevatar.Foundation.Core.Tests/Aevatar.Foundation.Core.Tests.csproj (230/230)
  • bash tools/ci/test_stability_guards.sh
  • bash tools/ci/query_projection_priming_guard.sh

Refs

Two production failure modes observed 2026-05-03 leave a Lark user
visibly stuck even though the cluster is healthy. Both healed locally,
no NyxID changes:

(1) /llm /route /model says "binding 已失效, 请发送 /init" while /init
says "已绑定, 请先 /unbind". After PR #558 re-DCR'd the cluster's
OAuth client to add the `proxy` scope, the user's existing NyxID
binding (issued for the previous client_id) is rejected at broker
token-exchange. ModelChannelSlashCommandHandler caught the rejection
and pointed the user at /init — which then refused because the local
readmodel still holds the (now-dead) binding_id, looping the user
forever.

Self-heal: when the broker throws BindingRevokedException /
BindingNotFoundException / BindingScopeMismatchException, dispatch
RevokeBindingCommand to the local ExternalIdentityBindingGAgent
(reason="auto_self_heal_*") so the readmodel flips to revoked, and
tell the user the binding was cleared and /init will work now.
Mirrors the dispatch shape UnbindChannelSlashCommandHandler uses
for explicit /unbind, but skips the NyxID-side revoke (NyxID is the
one that just told us the binding is gone).

(2) Bot replies as "..." forever when the LLM call fails. The
streaming sink fires the first chunk via channel-relay/reply,
consuming the single-use reply token and creating a placeholder
message. If the LLM then fails (e.g. upstream 429), pre-fix the
runtime fell through to RunLlmReplyAsync which issued a fresh /reply
against the dead token and got `401 Reply token already used` —
leaving the user staring at "..." with no error explanation.

Self-heal: ConversationGAgent.TryCompleteStreamedReplyAsync now
takes the Failed branch when streaming has already committed the
placeholder. Edits the existing message in place via
RunStreamChunkAsync (channel-relay/reply/update — no reply token
needed) with the classified failure text, then persists
ConversationTurnCompletedEvent so the runtime envelope retry loop
does not refire and consume the dead token again. If the edit also
fails (rare: Lark may refuse stale-message edits), persist the
last flushed partial as terminal — same defence-in-depth pattern
the existing PR #374 fix uses for the Completed path.

Tests:
  - 4 new ModelSlashCommandHandlerTests pinning the binding self-heal
    for each rejection shape + degraded-path when IActorRuntime is
    missing.
  - 2 new ConversationGAgentDedupTests pinning the streaming-Failed
    branch edits the placeholder + falls through to "persist last
    flushed partial" when the edit also fails.

Verification:
  dotnet test test/Aevatar.GAgents.ChannelRuntime.Tests --no-build (851/851)
  dotnet test test/Aevatar.GAgents.Channel.Protocol.Tests --no-build (36/36 in dedup suite)
  dotnet test test/Aevatar.Foundation.Core.Tests             (230/230)
  bash tools/ci/test_stability_guards.sh                     (passed)
  bash tools/ci/query_projection_priming_guard.sh            (passed)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 13d0d26ab7

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

"/model encountered NyxID-side binding rejection ({Reason}) but IActorRuntime is not registered; cannot self-heal local actor. subject={Platform}:{Tenant}:{User}",
reason,
context.Subject.Platform, context.Subject.Tenant, context.Subject.ExternalUserId);
return new MessageContent { Text = userMessage };
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Return honest fallback when binding self-heal cannot run

When IActorRuntime is unavailable, this branch returns the same userMessage used for successful self-heal, which currently claims the local binding was auto-cleared. In that environment no RevokeBindingCommand is dispatched, so stale local binding state can remain and /init may still reject as already bound, recreating the same user loop this change is trying to break. Please return a degraded message that explicitly says auto-clean failed/unavailable and guides the user to /unbind.

Useful? React with 👍 / 👎.

@codecov
Copy link
Copy Markdown

codecov Bot commented May 3, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 72.02%. Comparing base (c8db46d) to head (7f0f592).
⚠️ Report is 3 commits behind head on feature/lark-bot.

@@                Coverage Diff                @@
##           feature/lark-bot     #561   +/-   ##
=================================================
  Coverage             72.02%   72.02%           
=================================================
  Files                  1255     1255           
  Lines                 90723    90723           
  Branches              11877    11877           
=================================================
+ Hits                  65343    65345    +2     
+ Misses                20698    20697    -1     
+ Partials               4682     4681    -1     
Flag Coverage Δ
ci 72.02% <ø> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.
see 1 file with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

reason);
}

return new MessageContent { Text = userMessage };
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里不应该无条件返回带“本地已自动清理”的 userMessage。如果 _actorRuntime 缺失,或者 CreateAsync / HandleEventAsync 在上面的 catch 里失败,本地 binding actor/readmodel 实际上没有被 revoke,用户按提示去 /init 仍然会看到“已绑定”,回到这个 PR 要修的死循环。建议只在 dispatch 成功后返回“已自动清理”;失败路径返回明确的 fallback(例如提示重试 /model 或先 /unbind),并考虑像 /unbind 一样 retry 一次本地 revoke。测试也应覆盖 runtime 缺失/dispatch 失败时不声称已清理。

… path

Both reviewers caught the same bug: pre-fix the self-heal returned the
"本地已自动清理" message even when IActorRuntime was unregistered or
when CreateAsync / HandleEventAsync threw — the local readmodel was
NOT actually cleaned in those paths, so the user follows the message
to /init, /init still sees the stale binding and refuses, recreating
the exact loop this PR exists to break.

Split the self-heal API into cleanedMessage + degradedMessage. The
cleaned message is returned ONLY when the local revoke envelope was
actually dispatched to the binding actor; otherwise the degraded
message points the user at /unbind explicitly.

Also add a single retry on the dispatch path (mirror
UnbindChannelSlashCommandHandler's PR #521 v4-pro review fix) so a
one-off Orleans hiccup doesn't downgrade an otherwise self-healable
binding to manual /unbind guidance.

Tests:
  - List_DegradesToUnbindGuidance_WhenSelfHealActorRuntimeMissing now
    asserts the reply contains "/unbind" and DOES NOT contain
    "已自动清理" — pinning the no-runtime degraded path.
  - List_DegradesToUnbindGuidance_WhenSelfHealDispatchKeepsThrowing
    is a new test exercising ThrowingActorRuntime (every CreateAsync
    throws) and asserts AttemptCount == 2 (retry-once contract) plus
    the degraded reply shape.

Verification:
  dotnet test test/Aevatar.GAgents.ChannelRuntime.Tests --no-build (852/852)
  bash tools/ci/test_stability_guards.sh                     (passed)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@eanzhao eanzhao changed the base branch from dev to feature/lark-bot May 4, 2026 03:30
@eanzhao eanzhao merged commit 00fa7c7 into feature/lark-bot May 4, 2026
13 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant