Skip to content

fix(channel-runtime): resolve Lark DM receive_id_type and quiet best-effort reaction noise#403

Merged
eanzhao merged 5 commits intodevfrom
fix/2026-04-25_skill-runner-dm-and-reaction-noise
Apr 25, 2026
Merged

fix(channel-runtime): resolve Lark DM receive_id_type and quiet best-effort reaction noise#403
eanzhao merged 5 commits intodevfrom
fix/2026-04-25_skill-runner-dm-and-reaction-noise

Conversation

@eanzhao
Copy link
Copy Markdown
Contributor

@eanzhao eanzhao commented Apr 25, 2026

Problem & Approach

Two independent issues surfaced in the same /daily failure trace (see #402 (comment) for the original log triage). Both are scoped to Lark channel runtime and unrelated to the GPT-5 temperature fix in #402.

1. SkillRunner / human-interaction outbound DM rejected

SkillRunnerGAgent.SendOutputAsync and FeishuCardHumanInteractionPort.SendMessageAsync both hard-coded receive_id_type=chat_id when posting to open-apis/im/v1/messages. In DM flows, the OutboundConfig.ConversationId is the user open_id (ou_*), not a chat_id (oc_*), so every outbound message returned HTTP 400 from Lark.

Fix: Introduce LarkConversationTargets.ResolveReceiveIdType to derive receive_id_type from the conversation_id prefix:

  • ou_*open_id
  • on_*union_id
  • oc_* and anything else → chat_id (default)

Reuse from both call sites.

2. Reaction acknowledgment spams Warnings

ChannelConversationTurnRunner.TrySendImmediateLarkReactionAsync is a best-effort UX sugar that adds an "OK" emoji to the user's message as a delivery acknowledgment. When the Lark app lacks reaction permission (most common cause: Lark error code 231002 The operator has no permission to react on the specific message), it fails on every inbound message until ops fixes the app scope. The current LogWarning floods the run log with the same line.

Fix: Demote the proxy-error path to LogDebug so the signal stays discoverable when the channel is opted into verbose logging without spamming Warnings on every turn. The exception path stays at LogWarning because thrown exceptions indicate genuine network or code-level issues, not a recurring config gap.

Affected Paths

  • agents/Aevatar.GAgents.ChannelRuntime/LarkConversationTargets.cs (new helper)
  • agents/Aevatar.GAgents.ChannelRuntime/SkillRunnerGAgent.cs (SendOutputAsync)
  • agents/Aevatar.GAgents.ChannelRuntime/FeishuCardHumanInteractionPort.cs (SendMessageAsync)
  • agents/Aevatar.GAgents.ChannelRuntime/ChannelConversationTurnRunner.cs (reaction failure log level)
  • test/Aevatar.GAgents.ChannelRuntime.Tests/LarkConversationTargetsTests.cs (new helper unit tests)

Verification

dotnet build agents/Aevatar.GAgents.ChannelRuntime/Aevatar.GAgents.ChannelRuntime.csproj --nologo
dotnet test test/Aevatar.GAgents.ChannelRuntime.Tests/Aevatar.GAgents.ChannelRuntime.Tests.csproj --nologo
bash tools/ci/test_stability_guards.sh

Results:

  • Build: 0 errors (existing CA warnings only).
  • Helper tests: 13 passed (LarkConversationTargetsTests).
  • Full ChannelRuntime suite: 357 passed, 0 failed.
  • Stability guard: passed.

Out of Scope

🤖 Generated with Claude Code

SkillRunnerGAgent.SendOutputAsync and FeishuCardHumanInteractionPort
both hard-coded receive_id_type=chat_id when posting to Lark IM. In DM
flows the OutboundConfig.ConversationId is a user open_id (ou_*), not a
chat_id, which made the Lark API reject every outbound message with
HTTP 400. Introduce LarkConversationTargets.ResolveReceiveIdType to
derive the correct receive_id_type from the conversation_id prefix
(ou_ -> open_id, on_ -> union_id, otherwise chat_id) and reuse it in
both call sites.

ChannelConversationTurnRunner's immediate Lark reaction is best-effort
acknowledgment; failure cases (most commonly Lark code 231002 missing
reaction permission) recur on every inbound message until ops adjusts
the app scope. Demote the proxy-error path from Warning to Debug to
keep the signal discoverable without spamming the run log on every
turn. The exception path stays at Warning because thrown exceptions
indicate genuine network or code-level issues.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 25, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 70.30%. Comparing base (25104c1) to head (1d92c53).
⚠️ Report is 6 commits behind head on dev.

@@           Coverage Diff           @@
##              dev     #403   +/-   ##
=======================================
  Coverage   70.30%   70.30%           
=======================================
  Files        1174     1174           
  Lines       84061    84061           
  Branches    11049    11049           
=======================================
  Hits        59095    59095           
  Misses      20698    20698           
  Partials     4268     4268           
Flag Coverage Δ
ci 70.30% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copy link
Copy Markdown
Contributor Author

@eanzhao eanzhao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two concerns and a minor nit on the test data. Fix is correct for the immediate /daily outage, posting these as inline so the file context is preserved.

// tenant-level config issue that recurs on every inbound message
// until ops fixes the app scope. Log at Debug so it stays
// discoverable when the channel is opted into verbose logging
// without spamming Warnings on every turn.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Demotion is broader than the PR describes. The PR rationale targets the recurring 231002 The operator has no permission to react noise, but TryGetProxyError (line 987-1034) returns true for any non-zero code or any error field. Demoting the whole proxy-error branch to LogDebug means a future Lark API regression with a new error code (rate limit, bot kicked, message archived, NyxID proxy 5xx packaged into the same envelope, etc.) also drops to Debug and we lose visibility.

Suggest gating the demotion on the actual error code:

if (TryGetProxyError(response, out var code, out var detail))
{
    if (code == 231002)
        _logger.LogDebug(/* known recurring config gap */ ...);
    else
        _logger.LogWarning(/* anything else is a real signal */ ...);
}

Or keep the current shape but extract the code first and only demote the known-noisy allowlist. The exception path is already correctly kept at Warning.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 5da48f1. Extracted LarkProxyResponse.TryGetError so it returns the Lark business code, then gated the Debug demotion on code == LarkBotErrorCodes.NoPermissionToReact (231002) only. Anything else — Nyx envelope errors, unrecognized Lark codes — stays at LogWarning and now also logs the larkCode. The exception path remains at LogWarning.

if (trimmed.StartsWith("on_", StringComparison.Ordinal))
return "union_id";

return DefaultReceiveIdType;
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Silent fallback to chat_id for unknown prefixes re-creates the same failure mode this PR is fixing. The /daily 400 outage was "receive_id_type=chat_id was assumed for an ou_*". After this fix we still default to chat_id for any non-ou_/non-on_ input. If NyxID relay later emits Lark email, user_id, or any new shape, outbound 400s come back silently — same symptom, same lack of log signal.

Two options, in order of preference:

  1. Quick: emit a LogDebug (or warn-once-per-process) when the fallback branch is taken, so we get a breadcrumb the next time format drift hits prod. Requires threading ILogger through the helper or returning a (type, fellThrough) tuple.

  2. Root fix (follow-up issue): lift receive_id_type to a typed field on the inbound ConversationReference / OutboundConfig, populated by NyxIdRelayTransport.ParseInboundAsync from whatever the relay actually claims, instead of inferring from a string prefix at the outbound site. CLAUDE.md "核心语义强类型...禁止塞入通用 bag / 禁止从前缀字符串猜语义" applies — conversation_id here is silently a (id, type) pair encoded into a single string.

Not blocking — the prefix mapping covers today's known relay shapes. Worth a TODO/issue.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Took the root-fix path you suggested. Added typed lark_receive_id + lark_receive_id_type fields to SkillRunnerOutboundConfig, WorkflowAgentState/Init/Initialized, UserAgentCatalogEntry, and UserAgentCatalogUpsertCommand. AgentBuilderTool now populates them at delivery-target creation time via LarkConversationTargets.BuildFromInbound(chatType, conversationId, senderId), so the p2p flow pins the user ou_* open_id straight from SenderId instead of guessing from the conversation_id prefix. Outbound senders prefer the typed pair when both fields are present.

Kept the prefix heuristic (ResolveReceiveIdType) as a fallback only for legacy state persisted before these typed fields existed, and added option-1's breadcrumb on top of it: LarkConversationTargets.Resolve(...) returns a FellBackToPrefixInference flag and the call sites (SkillRunnerGAgent.SendOutputAsync, FeishuCardHumanInteractionPort.SendMessageAsync) emit LogDebug when that path runs. Format drift will surface there before it silently rejects in Lark.

[InlineData("on_union_user_token", "union_id")]
[InlineData("oc_group_chat_1", "chat_id")]
[InlineData("oc_dm_thread", "chat_id")]
public void ResolveReceiveIdType_ShouldMapKnownPrefixes(string conversationId, string expected)
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: [InlineData("oc_dm_thread", "chat_id")] is technically correct (Lark DMs do have an underlying oc_* chat_id) but doesn't reflect this codebase's relay shape — NyxIdRelayTransport for Lark DMs puts the sender ou_* into Conversation.Id, never the DM thread's oc_*. Worth either renaming the data label (oc_group_chat_2 works) or adding a one-line comment so future readers don't infer that DMs flow as oc_* here.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed oc_dm_threadoc_group_chat_2 in LarkConversationTargetsTests so the data label matches what this codebase's relay actually does (DMs flow as ou_* here, not oc_*). The new BuildFromInbound tests cover the real p2p case explicitly: chat_type=p2p with senderId="ou_user_1" produces (ReceiveId="ou_user_1", ReceiveIdType="open_id").

Copy link
Copy Markdown
Contributor Author

@eanzhao eanzhao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这版方向是对的,但我看到两个会影响实际修复效果的问题,先留 comment。

if (string.IsNullOrEmpty(trimmed))
return DefaultReceiveIdType;

if (trimmed.StartsWith("ou_", StringComparison.Ordinal))
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里还是在发送时从 conversation_id 前缀推断 Lark 的 receive_id_type,但创建链路并没有保证这个字段一定是 Lark 的 receive id。/daily/social-media 仍把 ChannelInboundEvent.ConversationId 写进 OutboundConfig/UserAgentCatalog;NyxID relay payload 里真实用户 open_id 是单独的 SenderId,conversation id 可能是 route id 或 oc_* chat id,而且现有 p2p 测试也还断言 oc_* 会被保存。这样只要 DM agent 没预先存成 ou_*,这里仍会回落到 chat_id,原来的 400 还会继续出现。建议在创建 delivery target 时写入明确的 receive_id + receive_id_type(p2p 用 SenderId/open_id),不要在发送时从 conversation_id 猜。

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

是的——这版改成在创建 delivery target 时直接写 lark_receive_id + lark_receive_id_typeAgentBuilderTool 现在用 LarkConversationTargets.BuildFromInbound(chatType, conversationId, senderId):p2p 取 SenderId/open_id,群/频道继续用 oc_*/chat_id

字段沿着 SkillRunnerOutboundConfig / WorkflowAgentState / WorkflowAgentInitializedEvent / UserAgentCatalogUpsertCommand / UserAgentCatalogEntry 一路落到 UserAgentCatalogGAgent.HandleUpsertAsync(用 MergeNonEmpty),然后 SkillRunnerGAgent.SendOutputAsyncFeishuCardHumanInteractionPort.SendMessageAsync 优先读这对 typed 字段。原 OutboundConfig.ConversationId 仍然保留作为 LLM metadata 输入路径,但发送侧不再依赖它推 receive_id_type

旧的前缀推断路径只为兼容已经持久化的旧 state 保留,并且 fallback 时会 LogDebug 一次,方便观察 format drift。p2p 现存测试断言 OutboundConfig.ConversationId == "oc_chat_1" 没有改动;新测试断言同时多一对 LarkReceiveId == "ou_user_1" / LarkReceiveIdType == "open_id",这正是 production p2p DM 的真实形态。

@@ -278,7 +279,7 @@ private async Task SendOutputAsync(string output, CancellationToken ct)
await client.ProxyRequestAsync(
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ProxyRequestAsync 不会把下游 Lark 错误自动变成异常:HTTP 非 2xx 会返回 {"error": true, ...} 字符串,Lark 常见应用错误也可能是 {"code": ..., "msg": ...}。这里忽略返回值会让实际发送失败的 runner 继续落 SkillRunnerExecutionCompletedEventTrySendFailureAsync 也会认为失败通知已经发出。这个 PR 正在修 rejected outbound DM,建议把结果解析成“Nyx error envelope 或 Lark code != 0 都失败”,并给 ou_* 成功路径和错误响应各补一个 call-site 测试。

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

说得对,这个 PR 确实就在修 outbound DM 被拒,结果还把返回值丢了。

提取了 LarkProxyResponse.TryGetError 共享 helper,同时识别 Nyx envelope ({"error":...}) 和 Lark business code ({"code":<non-zero>,"msg":...}),两类失败都让 SkillRunnerGAgent.SendOutputAsyncInvalidOperationException。这样 HandleTriggerAsync 不会在 silent rejection 上落 SkillRunnerExecutionCompletedEvent,retry / TrySendFailureAsync 都会按真实失败走。FeishuCardHumanInteractionPort.SendMessageAsync 同步切到这个 helper(之前的 result.Contains("\"error\"") 也漏 code != 0 的情况)。

新增四个 call-site 测试,全在 SkillRunnerGAgentTests

  • SendOutputAsync_ShouldUseTypedReceiveTarget_WhenLarkReceiveIdIsPopulated — typed ou_user_1 / open_id 成功路径,断言 URL 与 body。
  • SendOutputAsync_ShouldFallBackToConversationIdPrefixInference_ForLegacyState — 旧 state 兜底路径仍然能跑通。
  • SendOutputAsync_ShouldThrow_WhenLarkBusinessCodeIsNonZero — Lark code=230002 抛异常,消息里带 code=230002invalid receive_id
  • SendOutputAsync_ShouldThrow_WhenNyxProxyEnvelopeReportsError — Nyx 包裹的 {"error":true,"message":"upstream timeout"} 抛异常。

eanzhao and others added 2 commits April 25, 2026 12:11
Five inline review comments from #403 turned into one consistent change:

* Lift the Lark `receive_id` / `receive_id_type` pair to typed proto fields
  on `SkillRunnerOutboundConfig`, `WorkflowAgentState/Init/Initialized`,
  `UserAgentCatalogEntry`, and `UserAgentCatalogUpsertCommand`. The fields
  are populated at delivery-target creation time in `AgentBuilderTool`
  (`/daily_report` and `/social-media` flows) using the inbound
  `chat_type` + `sender_id`, so p2p DMs now pin the user `ou_*` open_id
  rather than relying on whatever the relay happens to put into
  `conversation_id`. `SkillRunnerGAgent.SendOutputAsync` and
  `FeishuCardHumanInteractionPort.SendMessageAsync` use the typed pair
  verbatim when both fields are present, falling back to the prefix
  heuristic only for legacy state and emitting a `LogDebug` breadcrumb
  on the fallback path.

* Make `LarkConversationTargets.ResolveReceiveIdType` legacy-only and add
  `Resolve(typedId, typedType, legacyConversationId)` plus
  `BuildFromInbound(chatType, conversationId, senderId)`.

* Extract a shared `LarkProxyResponse.TryGetError` helper that recognises
  both Nyx error envelopes (`{"error":...}`) and Lark business errors
  (`{"code":<non-zero>,"msg":...}`), returning the Lark code so callers
  can branch on it. `ChannelConversationTurnRunner.TrySendImmediateLark
  ReactionAsync` now demotes to `LogDebug` only for the documented
  recurring-config-gap code 231002 (named in `LarkBotErrorCodes`) and
  keeps `LogWarning` for any other proxy / Lark error so future
  regressions stay visible.

* `SkillRunnerGAgent.SendOutputAsync` and
  `FeishuCardHumanInteractionPort.SendMessageAsync` now parse the proxy
  response and throw on Nyx envelope errors or Lark `code != 0`. The
  earlier code dropped both classes of failure on the floor, letting
  `HandleTriggerAsync` persist `SkillRunnerExecutionCompletedEvent` on
  silently-rejected outbound DMs.

* Test data nit: rename the misleading `oc_dm_thread` fixture to
  `oc_group_chat_2` (this codebase's relay pins DMs as `ou_*`, never
  `oc_*`).

* New / updated tests:
  - `LarkConversationTargetsTests`: cover `Resolve` typed-vs-legacy paths
    and `BuildFromInbound` p2p / group / missing-sender cases.
  - `SkillRunnerGAgentTests`: typed-`open_id` happy path, legacy
    fallback, Lark `code != 0` rejection, Nyx envelope rejection.
  - `AgentBuilderToolTests`: assert the typed `LarkReceiveId` /
    `LarkReceiveIdType` are populated for the p2p `/daily_report` flow.

Verification: ChannelRuntime suite 382 passed, Channel.Protocol suite 119
passed, Foundation.Runtime.Hosting 151 passed (15 infra-skipped),
`tools/ci/test_stability_guards.sh` and
`tools/ci/architecture_guards.sh` pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor Author

@eanzhao eanzhao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

复看 fix 后还有一个阻断点:typed Lark target 在 catalog 投影链路里被丢掉了。

// both fields are present, outbound senders use them verbatim instead of
// inferring receive_id_type from the conversation_id prefix. For p2p the
// creator stores the user open_id (`ou_*`) here, not the DM thread chat_id.
string lark_receive_id = 22;
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里新增了 UserAgentCatalogEntry.lark_receive_id / lark_receive_id_type,但同一个 PR 没有把这两个字段加到 UserAgentCatalogDocumentUserAgentCatalogProjector.Materialize 也没有写,UserAgentCatalogQueryPort.ToEntry 也没有读回。FeishuCardHumanInteractionPort 的 target 是通过 UserAgentCatalogRuntimeQueryPort 从 projection document 重建出来的,所以 workflow/social-media 的 human-interaction 发送路径会把这两个 typed 字段丢掉,然后在 LarkConversationTargets.Resolve(...) 里继续回退到 ConversationId 前缀推断。也就是说 SkillRunner 自己发送修好了,但 catalog-backed human interaction 仍然可能用旧的 conversation_id/chat_id 路径。建议给 UserAgentCatalogDocument 增加同样的字段,并补齐 projector + query port 映射和投影测试。

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

你说得对——这正是我漏掉的地方。在 1d92c53 把 typed 字段补到 catalog projection 链路:

  • UserAgentCatalogDocument 新增 lark_receive_id (= 24) 与 lark_receive_id_type (= 25)。
  • UserAgentCatalogProjector.Materialize 把 entry 上的两个字段写进 document。
  • UserAgentCatalogQueryPort.ToEntry 把它们读回 entry,所以 FeishuCardHumanInteractionPort 通过 UserAgentCatalogRuntimeQueryPort 拿到的 entry 就带着 typed pair,不再回退到 conversation_id 推断。

补了两个测试:UserAgentCatalogProjectorTests.ProjectAsync_WithValidCommittedEvent_UpsertsDocument 现在断言 typed 字段被投影到 document;新增 ToEntry_ShouldRoundTripTypedLarkReceiveTarget_FromDocumentToEntry 锁住反向映射。

Copy link
Copy Markdown
Contributor Author

@eanzhao eanzhao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-reviewed 5da48f17. Prior three comments are addressed cleanly — typed lark_receive_id/lark_receive_id_type lifted into the protos and threaded through AgentBuilder → Skill/Workflow → Catalog → Feishu sender, prefix inference is now legacy-only with a FellBackToPrefixInference breadcrumb, reaction log gated on LarkBotErrorCodes.NoPermissionToReact, proxy response parsed via shared LarkProxyResponse.TryGetError so silent rejections turn into InvalidOperationException and trigger the retry/failure path. Three new findings + one nit; none block the merge.

return new LarkReceiveTarget(trimmedConversation, DefaultReceiveIdType, FellBackToPrefixInference: false);
}

private static bool IsDirectMessage(string? chatType)
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

p2p + missing senderId silently regresses to the original bug. BuildFromInbound("p2p", "ou_user_1", senderId: "") falls through IsDirectMessage(...) && !IsNullOrEmpty(trimmedSender) to the else-branch and returns ("ou_user_1", "chat_id", FellBackToPrefixInference: false) — exactly the original /daily outage shape, plus the breadcrumb is suppressed (FellBack=false).

In practice the relay always emits Sender.PlatformId so this is unlikely, but it's the same silent-default anti-pattern the rest of the PR is aimed at. Two cheap options:

if (IsDirectMessage(chatType))
{
    if (!string.IsNullOrEmpty(trimmedSender))
        return new LarkReceiveTarget(trimmedSender, OpenIdReceiveIdType, FellBackToPrefixInference: false);

    // Confused inbound: chat_type=p2p but no sender. Don't write a typed target — let the
    // outbound resolver fall back to legacy + breadcrumb instead of pinning a wrong type.
    return new LarkReceiveTarget(string.Empty, string.Empty, FellBackToPrefixInference: false);
}

…then Resolve(typedId="", typedType="", legacyConversationId="ou_user_1") flips FellBack=true, callers LogDebug, and the prefix heuristic recovers the right open_id. Or set FellBack=true directly here. Either way the pathological case stays observable.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed — this was the silent-default anti-pattern I was supposed to be designing away from. Fixed in 1d92c53:

if (IsDirectMessage(chatType))
{
    return string.IsNullOrEmpty(trimmedSender)
        ? new LarkReceiveTarget(string.Empty, string.Empty, FellBackToPrefixInference: true)
        : new LarkReceiveTarget(trimmedSender, OpenIdReceiveIdType, FellBackToPrefixInference: false);
}

BuildFromInbound("p2p", "ou_user_1", senderId: "") now returns ("", "", FellBack=true). Resolve(typedId="", typedType="", legacyConversationId="ou_user_1") flips to the prefix path, recovers open_id for ou_*, and call sites emit the Debug breadcrumb — pathological case stays observable.

Two paired tests:

  • BuildFromInbound_ShouldReturnEmptyTypedPairWithFellBack_WhenP2pAndSenderIsMissing — pins the new return shape.
  • Resolve_ShouldRecoverOpenIdForP2pConfusedInbound_ViaPrefixInference — pins the recovery path so a future refactor that re-types the empty pair as chat_id immediately fails.

// Authoritative Lark outbound delivery target captured at create time. When
// both fields are present, SkillRunnerGAgent.SendOutputAsync uses them
// verbatim; conversation_id stays for LLM metadata propagation only.
string lark_receive_id = 7;
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lark-named fields living on platform-agnostic state. UserAgentCatalogEntry, WorkflowAgentState, and SkillRunnerOutboundConfig all carry a platform field and are now permanently coupled to a Lark-shaped (lark_receive_id, lark_receive_id_type) pair. This works for Day One — only api-lark-bot reaches these protos — but the moment Telegram/Discord land we either bolt on telegram_chat_id, discord_channel_id, … or refactor.

For the next iteration, consider lifting to a typed OutboundTarget sub-message keyed by platform, e.g.:

message OutboundTarget {
  oneof target {
    LarkReceiveTarget lark = 1;
    TelegramChatTarget telegram = 2;
    // ...
  }
}

Matches CLAUDE.md "核心语义强类型 / 命名语义优先" better than a per-platform field-explosion. Not blocking — file naming honestly says these are Lark-only — but worth a follow-up issue so the next platform doesn't accrete fields.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed — this is the right shape but explicitly non-blocking, and putting it in this PR would conflate the bugfix with a multi-platform refactor. Opened as aevatarAI/aevatar#408 so the next platform actually triggers the cleanup; the issue captures the proposed OutboundTarget oneof shape, the migration considerations (round-tripping through projection, deprecating the flat fields, reconciling with the existing platform discriminator), and "do this when a second platform's actual delivery code lands, not speculatively."

@@ -196,7 +197,12 @@ await skillRunnerActor.Received(1).HandleEventAsync(
e.Payload.Unpack<InitializeSkillRunnerCommand>().OutboundConfig.NyxProviderSlug == "api-lark-bot" &&
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test asymmetry: /daily_report p2p typed-field assertion but no parallel for /social-media. CreateSocialMediaAgentAsync (AgentBuilderTool.cs:387) calls the same BuildFromInbound(...) and writes the same typed fields onto InitializeWorkflowAgentCommand. If someone refactors that path later, only the daily-report test pins the contract. Worth a near-clone of this test for the social_media flow asserting LarkReceiveId == "ou_user_1" / LarkReceiveIdType == "open_id" lands on InitializeWorkflowAgentCommand. Cheap; same fixture infrastructure.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 1d92c53. Added [ChannelMetadataKeys.SenderId] = "ou_user_1" to the social_media test fixture and extended the InitializeWorkflowAgentCommand assertion with:

e.Payload.Unpack<InitializeWorkflowAgentCommand>().LarkReceiveId == "ou_user_1" &&
e.Payload.Unpack<InitializeWorkflowAgentCommand>().LarkReceiveIdType == "open_id"

Now a refactor of CreateSocialMediaAgentAsync that drops the BuildFromInbound(...) call (or accidentally points it at ConversationId instead of SenderId) fails this test alongside the daily_report mirror.

string.Equals(normalized, "dm", StringComparison.OrdinalIgnoreCase);
}
}

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: IsDirectMessage accepts p2p, direct_message, directmessage, dm. The only emitter in this repo is ChannelConversationTurnRunner.ResolveConversationChatType, which outputs exactly "p2p". The other three are unreachable defensive aliases — either drop them, or add a one-line comment naming where they're expected to land (Telegram bridge?) so future readers know they're not dead. (Pure cleanup, not blocking.)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dropped in 1d92c53. IsDirectMessage is now a one-liner ordinal "p2p" check, with a comment naming ChannelConversationTurnRunner.ResolveConversationChatType as the single emitter so future readers don't reinstate the aliases as defensive padding:

// Only "p2p" is emitted by ChannelConversationTurnRunner.ResolveConversationChatType today,
// which is the single source for ChannelMetadataKeys.ChatType in this repo. Keep the check
// narrow until a second emitter (e.g. a Telegram bridge) actually lands.
private static bool IsDirectMessage(string? chatType) =>
    string.Equals(chatType?.Trim(), "p2p", StringComparison.Ordinal);

The previous [Theory] data rows for direct_message / directmessage / dm (treated as DM) moved into BuildFromInbound_ShouldUseConversationChatId_ForNonP2pChatTypes (treated as non-DM), so the new contract is locked in by tests on both sides.

Five new inline review comments after 5da48f1. Four addressed here, one
deferred to a follow-up issue.

* Catalog projection gap (r3141450544): the prior pass added
  `lark_receive_id` / `lark_receive_id_type` to `UserAgentCatalogEntry`
  but not to `UserAgentCatalogDocument`, so the projection round-trip
  through `UserAgentCatalogProjector.Materialize` and
  `UserAgentCatalogQueryPort.ToEntry` dropped them. Workflow / social-
  media human-interaction sends (which read through the projection)
  silently fell back to `conversation_id` prefix inference even after
  the SkillRunner direct-send path was fixed. Added the typed fields to
  the document, materializer, and `ToEntry`, with both projector and
  round-trip tests asserting they survive end-to-end.

* p2p + missing senderId silent regression (r3141451870):
  `BuildFromInbound("p2p", "ou_user_1", senderId: "")` previously
  returned `("ou_user_1", "chat_id", FellBack=false)` — exactly the
  original `/daily` outage shape, with the breadcrumb suppressed. Now
  returns an empty typed pair with `FellBack=true` so `Resolve(...)`
  flips to the legacy prefix path, recovers `open_id`, and emits the
  Debug breadcrumb. New paired tests cover the empty-typed-pair return
  and the `Resolve` recovery.

* `IsDirectMessage` dead aliases (r3141451876): the only emitter in
  this repo is `ChannelConversationTurnRunner.ResolveConversationChat
  Type`, which produces exactly `"p2p"`. Dropped the unreachable
  `direct_message` / `directmessage` / `dm` aliases (and the case-
  insensitive match) to a strict ordinal `"p2p"` check, with a comment
  naming the single emitter so future readers don't reinstate them as
  defensive padding.

* /social-media test asymmetry (r3141451874): added typed-field
  assertions to the `/social-media` `AgentBuilderTool` test mirroring
  the existing daily_report pin, so a refactor of
  `CreateSocialMediaAgentAsync` cannot drop the typed pair on
  `InitializeWorkflowAgentCommand` without a test catching it.

Out of scope: the OutboundTarget oneof refactor proposed in
r3141451872 is explicitly non-blocking per the reviewer; opened as
issue #408 for when a second channel platform lands.

Verification: ChannelRuntime suite 383 passed (was 382),
`tools/ci/test_stability_guards.sh` and
`tools/ci/architecture_guards.sh` pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor Author

@eanzhao eanzhao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-reviewed 1d92c53d. All four addressable prior comments landed cleanly; #408 captures the deferred OutboundTarget oneof with the right "do this when a second platform's actual delivery code lands" trigger. Verified end-to-end:

  • Catalog projection gap (r3141450544) — typed fields now mirror through UserAgentCatalogDocument (proto fields 24/25), UserAgentCatalogProjector.Materialize, and UserAgentCatalogQueryPort.ToEntry; both directions pinned by UserAgentCatalogProjectorTests (Materialize round-trip + standalone ToEntry test). FeishuCardHumanInteractionPort reads through the projection so workflow/social-media DM delivery now uses the typed pair, matching the SkillRunner direct-send path. Good catch — I missed this in my second pass; only checked the actor-state write, didn't follow into the read model.

  • p2p + missing senderId (r3141451870)BuildFromInbound now returns (empty, empty, FellBack=true) when chat_type is p2p but senderId is missing, exactly the suggested shape. Resolve(...) recovers open_id via the legacy prefix path and SkillRunnerGAgent/FeishuCardHumanInteractionPort fire the Debug breadcrumb. Paired test (Resolve_ShouldRecoverOpenIdForP2pConfusedInbound_ViaPrefixInference) pins the recovery contract end-to-end.

  • IsDirectMessage aliases (r3141451876) — narrowed to strict-ordinal "p2p" with a comment naming ChannelConversationTurnRunner.ResolveConversationChatType as the single emitter. Test data updated to assert P2P/direct_message/dm are now non-DM (route via oc_*/chat_id), so any future emitter that introduces a non-"p2p" casing will trip the test rather than silently land in the DM branch.

  • /social-media test asymmetry (r3141451874) — typed-field assertion mirrored onto the social_media AgentBuilderTool test against InitializeWorkflowAgentCommand. Both AgentBuilder entry points are now pinned.

  • OutboundTarget oneof (r3141451872) — deferred to #408, which captures the proto sketch, migration considerations (proto field deprecation, projection round-trip, reconciling with existing platform discriminator), and the right trigger (next platform's delivery code, not speculative).

CI: fast-gates, projection-provider-e2e, changes green; coverage-quality/console-web/kafka-transport-integration/event-sourcing-regression still in progress. Nothing in the diff blocks merge from my side.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant