Implement issue #328 unified channel inbound backbone#346
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 906698b160
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Codecov Report❌ Patch coverage is
@@ Coverage Diff @@
## dev #346 +/- ##
==========================================
- Coverage 69.73% 69.72% -0.02%
==========================================
Files 1139 1139
Lines 81202 81215 +13
Branches 10629 10634 +5
==========================================
- Hits 56630 56628 -2
- Misses 20470 20482 +12
- Partials 4102 4105 +3
Flags with carried forward coverage won't be shown. Click here to find out more.
... and 5 files with indirect coverage changes 🚀 New features to boost your workflow:
|
906698b to
614647a
Compare
eanzhao
left a comment
There was a problem hiding this comment.
I don't think this fully closes #328 on the latest head.
Two issue-level commitments are still missing:
- relay reply credentials are not actually threaded through the typed backbone; they now live in a process-local
messageId -> tokencache, which violates the repo's ban on middle-layer in-memory fact maps and makes deferred replies non-durable across restart/scale-out; - the relay auth boundary still does not implement the promised replay guard, and its HMAC verification does not match NyxID's current callback contract.
There are also still unresolved runtime issues in the existing Codex comments (DI option registration and prompt injection), so I don't think Closes #328 is accurate yet.
eanzhao
left a comment
There was a problem hiding this comment.
评审:#346 vs #328 实施完成度 + 架构合规
读完 diff、protos 和关键路径后,主线(统一 inbound、ConversationGAgent 单一事实源、ChatActivity 唯一契约、Lark legacy 退出)已经达成。下面是承诺核对 + 几个真实的架构问题。
✅ #328 已落地
- Phase 0:
Channel.Lark → Platform.Lark、agents/platforms/、aevatar.channels.slnf/aevatar.platforms.slnf分流 - Phase 1 契约:
ChatActivity.TransportExtras(nyx_message_id / nyx_agent_api_key_id / nyx_platform / nyx_conversation_id)OutboundDeliveryContext(reply_message_id / reply_access_token)NeedsLlmReplyEvent/LlmReplyReadyEventChannelBotRegistration加nyx_channel_bot_id/nyx_agent_api_key_id/nyx_conversation_route_id+IChannelBotRegistrationQueryByNyxIdentityPort
- Phase 2:
Aevatar.GAgents.Channel.NyxIdRelayproject(Transport / OutboundPort / CallbackPayload / ConversationTypeMap / AuthValidator with JWT+HMAC+message-id+OIDC/JWKS cache & key-miss refresh) - Phase 3:runner 改名
ChannelConversationTurnRunner;ResolveConversationChatTypeexhaustive 覆盖 4 个ConversationScope - Phase 4:
HandleRelayWebhookAsync改造成 auth → parse → dispatch → 202 - Phase 5:
LarkChannelAdapter.cs(-804) /LarkStreamingHandle.cs(-117) /LarkChannelServiceCollectionExtensions.cs(-32) /LarkWebhookRequest/Response.cs删除 - CI 门禁:
channel_relay_nyx_chat_direct_create_guard.sh已挂入architecture_guards.sh - ADR-0013 + canon 更新
⚠️ 架构 / 合规问题
1. 【严重】中间层进程内字典作为事实源 — 直接违反 CLAUDE.md "中间层状态约束(强制)"
agents/Aevatar.GAgents.ChannelRuntime/ChannelCallbackCredentialCache.cs:6 用 IMemoryCache 维护 correlationId → (UserAccessToken, ReplyToken) 映射,被 runner(ChannelConversationTurnRunner.cs:233 ResolveReplyToken)和 LLM inbox(ChannelLlmReplyInboxRuntime.cs:154 Peek)当 outbound 凭证事实源消费。一旦 cache miss outbound 直接 hard-fail(NyxIdRelayOutboundPort.cs:35-40 missing_reply_access_token)。
CLAUDE.md 明文:
禁止中间层维护 entity/actor/workflow-run/session 等 ID → 上下文/事实状态的进程内映射(Dictionary<>/ConcurrentDictionary<>/HashSet<>/Queue<>)。
InMemory 实现仅限开发/测试,不外溢到中间层业务语义。
NyxID #469 已经协商好"零持久化 + per-process cache"的安全 trade-off,但本仓 CLAUDE.md / ADR-0013 没有显式 carve-out。建议二选一:
- ADR-0013 里把 "短寿命 reply credential 走 process-local cache、跨节点 best-effort、进程重启=丢失 in-flight reply" 写明(含 multi-replica 时 callback 与 LLM 异步发生在不同节点的失败模型);
- 或在 CLAUDE.md "中间层状态约束" 节加显式例外("短寿命凭证、不进 event log / state / DB 时允许 process-local cache,TTL ≤ token exp")。
否则 arch-audit / refactor-team 后面会一直 flag 这块。
2. 【高】Cache TTL (30 min) 与 reply-token JWT exp (15 min) 不对齐 + 失败路径不消费
ChannelCallbackCredentialCache.cs:8 DefaultTtl = TimeSpan.FromMinutes(30),NyxID #469 reply-token exp = 15 min。中间 15 min 缓存里的 token 已过期但仍会被 outbound 取出投递,制造 401 噪音。
且 ConsumeCallbackCredentials 只在 outbound 成功路径被调(ChannelConversationTurnRunner.cs:245, 591),任意失败 entry 卡到自然过期。NyxID #469 明文 "Single-use consume(dispatcher 用完 immediately remove)"。
建议:TTL 默认对齐 token exp 且可配置;Take 时机改为 "发送动作完成后" 而非仅成功路径。
3. 【中】Phase 3 实现偏差:承诺 LlmReplyGeneratorActor,落地的是 IHostedService + 静态 stream id 订阅
ChannelLlmReplyInboxRuntime.cs:13 是单例 IHostedService,订阅一个静态字符串 "channel-runtime:llm-reply:inbox",不是 actor。issue 原话:
新
LlmReplyGeneratorActor订阅NeedsLlmReplyEvent,内部调NyxIdChatGAgent,产出LlmReplyReadyEvent
本仓事实模式(LarkConversationInboxRuntime 同款),所以不是本 PR 独创,但要 doc 一下:actor 边界 / 单线程不变量 / event-sourced 重放语义在这条路径上都不成立。IStreamProvider 默认实现是 InMemoryStreamProvider(src/Aevatar.Foundation.Runtime/Streaming/InMemoryStreamProvider.cs:15)— 多节点 / 主备切换 / host 重启都会丢 in-flight reply。
4. 【中】ConversationGAgent.HandleInboundActivityAsync 双写非原子
agents/Aevatar.GAgents.Channel.Runtime/Conversation/ConversationGAgent.cs:75-77:
await PersistDomainEventAsync(result.LlmReplyRequest); // → event store
await EnqueueDeferredLlmReplyAsync(result.LlmReplyRequest); // → inbox stream两步不在一个事务里。actor crash 在中间或 enqueue 抛异常被 EnqueueDeferredLlmReplyAsync 静默 catch + log + return(ConversationGAgent.cs:287-294),committed 事件不会重放进 inbox 触发 reply。CLAUDE.md 强制:
committed domain event 必须可观察:write-side 完成 committed event 后必须送入 projection 主链
正确形状应是:persist NeedsLlmReplyEvent → projection / event-stream subscriber 消费 committed event → 派发 LLM 调用 → 产出 LlmReplyReadyEvent。否则进程崩溃 / 重启 / 网络抖动都会卡死一次 LLM reply。
最低限度:(a) enqueue 失败时 emit ConversationContinueFailedEvent + retry policy;(b) 启动时扫描未完成 NeedsLlmReplyEvent → 重新 enqueue。
5. 【低】agents/Aevatar.GAgents.NyxidChat/NyxIdRelayOptions.cs:1-7 是空 back-compat 壳 — 违反 "删除优先"
public sealed class NyxIdRelayOptions : Aevatar.GAgents.Channel.NyxIdRelay.NyxIdRelayOptions;注释直接写 "Back-compat alias"。CLAUDE.md "空转发、重复抽象、无业务价值代码直接删除,不保留兼容空壳"。consumer 改 using 即可。
6. 【低】issue §B "relay 相关代码完全搬离 NyxidChat project" 未完成
NyxidChat 仍持有:NyxIdChatEndpoints.Relay.cs (218 lines), NyxIdRelayOptions.cs, NyxIdRelayModels.cs(含 RelayMessage 旧契约), NyxIdRelayPromptConfiguration.cs, NyxIdRelayReplies.cs, NyxIdRelayWorkflowCards.cs, Relay/INyxRelayBridgeIdempotencyGuard.cs。
endpoint 留 NyxidChat 合理(route 注册位置是 host 化选择),但 NyxIdChatEndpoints.Relay.cs:50-51 同时反序列化 NyxIdRelayCallbackPayload(新)和 RelayMessage(旧)只为 workflow card action 短路用 + BuildRelayMessage (line 179-215) 反向转换;应在 Channel.NyxIdRelay 里给 workflow card action 一条独立 typed contract,把 RelayMessage 类族删掉。
7. 【低】HTTP 层直接 IActorRuntime.CreateAsync<ConversationGAgent> + actor.HandleEventAsync,未走 IActorDispatchPort
agents/Aevatar.GAgents.NyxidChat/NyxIdChatEndpoints.Relay.cs:134, 146。CLAUDE.md "传输载体可替换:上层依赖投递契约(IActorDispatchPort),不依赖具体载体"。Studio 端已经走 dispatch port,channel relay 这条入口反而绕过;建议对齐。
8. 【低】NyxIdRelayAuthValidator 自称 "JWT + HMAC + replay 三重校验",实际 replay 只查 message-id header 存在 + 与 body 匹配,真正 replay-dedup 在 ConversationGAgent.ProcessedMessageIds
架构上单一 dedup 点是对的,但命名 / 注释要诚实:boundary 不做 replay 拦截,靠 actor 串行 + processed-id 去重。doc 写清楚就行,免得后人误读。
总结
ConversationGAgent 一统主干、ChatActivity 唯一契约、Lark legacy 退出 — 这三条核心目标是稳的。但 #1(cache 作事实源)+ #4(双写非原子)是真实的多节点 / crash-recovery 隐患,#3 偏差 + #2 TTL 漂移会放大 #1 的影响面。建议:
- 至少把 #1 / #3 写进 ADR-0013 "已知偏差 / 后续工作",明确 multi-replica 部署的失败模型;
- #4 在合并前修(很小,让 enqueue 失败 → emit
ConversationContinueFailedEvent走 actor 的失败路径); - #2 / #5 / #6 / #7 / #8 起 follow-up issue 跟踪。
#1 不修也至少要让 arch-audit 显式 ack,避免下次审计 noise。
614647a to
7749126
Compare
eanzhao
left a comment
There was a problem hiding this comment.
Most of the earlier blocking issues look fixed on the latest head. I found one new credential-boundary regression that I still consider blocking, so I’m leaving it inline.
eanzhao
left a comment
There was a problem hiding this comment.
Follow-up 评审:fix 复盘
读了 force-push 之后的两个 commit(1954c6fc + 77491265),分门别类讲一下。
✅ 已修:高/严重项
-
#1 ChannelCallbackCredentialCache 删了 —
IChannelCallbackCredentialCache.cs/ChannelCallbackCredentialCache.cs删除,凭证流改走TransportExtras.NyxUserAccessToken+OutboundDeliveryContext.ReplyAccessToken(chat_activity.proto:278, 264)。中间层进程内字典作为事实源这个 CLAUDE.md 直接违规消除。 -
#4 双写非原子修得很漂亮 — 真正按 actor 模型重写:
- 新 state 字段
pending_llm_reply_requests(conversation_state.proto:16)持有未完成 LLM 请求; - 新事件
DeferredLlmReplyDispatchRequestedEvent(conversation_events.proto:51-54)作为 self-durable timeout payload; HandleInboundActivityAsync:85-89persist +ScheduleSelfDurableTimeoutAsync入 inbox;HandleDeferredLlmReplyDispatchRequestedAsync:134-176消费 timeout,inbox 不可用 / enqueue 失败时自动 reschedule(DeferredLlmDispatchRetryDelay = 5s);OnActivateAsync:40-44+SchedulePendingLlmReplyDispatchesAsync:360-369在 actor 重激活时扫 pending 重排;HandleLlmReplyReadyAsync失败且可重试时也走ScheduleDeferredLlmReplyDispatchAsync重排(ConversationGAgent.cs:232-242);- State 在
ApplyTurnCompleted/ApplyContinueFailed终态时RemovePendingLlmReplyRequest清理(lines 402, 463)。
这是 CLAUDE.md "self continuation 事件化" + "single-thread fact source" 的标准形状。多节点 / 进程崩溃 / 重启都能 cover。
- 新 state 字段
-
#8 replay guard 真做了 — 新
INyxIdRelayReplayGuard+NyxIdRelayReplayGuard(agents/channels/Aevatar.GAgents.Channel.NyxIdRelay/NyxIdRelayReplayGuard.cs:10-51),ConcurrentDictionary + sweep + window;validatorValidateAsync做了 JWT → registration credential resolve → HMAC(per-registration secret,不再用全局_relayOptions.HmacSecret兜底优先级正确)→ timestamp window → replay claim 五重校验,命名和实现一致。HMAC secret 走INyxIdRelayRegistrationCredentialResolver从 projection 拿NyxAgentApiKeyHash,把"全局共享 secret"那个隐患也顺手收掉。 -
#2 TTL 漂移:cache 没了,议题自然 moot。
⚠️ 新增问题:凭证持久化与 NyxID #469 Aevatar 侧承诺冲突
Cache 改成 TransportExtras 字段流转之后,凭证现在落进了 event store 和 actor state:
TransportExtras.NyxUserAccessToken(用户 JWT)→ChatActivity→NeedsLlmReplyEvent.activity(持久化领域事件)→ConversationGAgentState.PendingLlmReplyRequests(持久化 actor state)。即使 turn 完成后 state 里清掉了,event store 里的NeedsLlmReplyEvent永久保留 JWT。OutboundDeliveryContext.ReplyAccessToken(reply_token)→ 同链路;额外经ChannelConversationTurnRunner.cs:243-247进入ConversationTurnCompletedEvent.outbound_delivery(conversation_events.proto:21)→ projection → 可能进 Elasticsearch。
NyxID #469 提案里 Aevatar 侧的明示承诺:
零持久化:token 不进任何 grain state / event log / DB / secret store / 文件 / config
Orleans event envelope 只含correlation_id,不带 token,token 放 process-local dict,用 correlation_id 配对
当前实现直接违反这两条。前一版 IMemoryCache 方案是为这两条承诺存在的,现在为了换"多节点可工作 + survive restart"把承诺扔了。
reply_token 短寿(15 min)+ 单用,影响窗口可控;但 user JWT 是用户身份 token,不是 per-callback bound token,进入事件存储就是潜在 audit / 数据泄露面。
建议(按代价排序):
- 最低成本:
ConversationTurnCompletedEvent.outbound_delivery.reply_access_token不应该进完成事件 — completed 事件只需要reply_message_id(审计用),token 是"为了发出 reply 的临时凭证",发完就没用,不该进 audit log / projection。在 ConversationGAgent 里持久化 completed 事件前 strip 一下 ReplyAccessToken(或者给 completed 事件用一个不含 token 的 OutboundReceipt 类型)。这条几行改动就能少一份 token 流到 projection。 - 中等成本:
NeedsLlmReplyEvent是为了 actor 重激活后能重排 LLM 调用。但 LLM 调用真正需要的是 NyxId access token;reply 是后续步骤,token 放在 NeedsLlmReplyEvent 里给 LLM 用,事实上是在 actor 状态里持有用户 JWT 直到 LLM 调用完成。可以考虑把 NyxIdAccessToken 从 NeedsLlmReplyEvent.activity.transport_extras 里拆出来,单独放NeedsLlmReplyEvent.short_lived_credentials(不进 ApplyLlmReplyRequested 的 state copy,仅用于一次 inbox dispatch)。 - 最大成本:完全回到 NyxID #469 的"零持久化"形态:actor 只存 correlation_id + 业务事实,凭证另存 process-local(接受 multi-replica / restart 失 reply 的代价)。
至少要在 ADR-0013 里把这条 trade-off 显式写明("Aevatar 选择把短寿 callback 凭证持久化到 event store / actor state,换取 multi-replica 可达性,代价是和 NyxID #469 的零持久化承诺出入")。
❌ 未修:低优项(你可能就当 follow-up 了)
下面这几条在我上一轮 review 里都列了,本次 fix 没动:
- #5
agents/Aevatar.GAgents.NyxidChat/NyxIdRelayOptions.cs仍是空 back-compat 壳,注释 "Back-compat alias" 一字未改。CLAUDE.md "删除优先" 仍违反。 - #6 NyxidChat 里
NyxIdChatEndpoints.Relay.cs、NyxIdRelayModels.cs(含RelayMessage)、NyxIdRelayPromptConfiguration.cs、NyxIdRelayReplies.cs、NyxIdRelayWorkflowCards.cs、Relay/还在;endpointNyxIdChatEndpoints.Relay.cs:49-50仍同时反序列化NyxIdRelayCallbackPayload+RelayMessage,BuildRelayMessage反向转换还在。 - #7 endpoint
NyxIdChatEndpoints.Relay.cs:129, 141仍actorRuntime.CreateAsync<ConversationGAgent>+actor.HandleEventAsync直调,未走IActorDispatchPort。 - #3 LLM reply 仍是
IHostedService(ChannelLlmReplyInboxRuntime),不是LlmReplyGeneratorActor。但:现在 actor 侧 pending state + durable timeout 已经接住"事实源唯一" + "重激活恢复",hosted service 退化为无状态 worker,工程上可以接受。可以在 ADR-0013 里说明这个 reduce-to-stateless-worker 的设计意图,免得后人疑惑。
总结
合并门槛已经清掉:#1 / #4 / #8 真修了,#2 议题消除。
唯一一条建议合并前再处理:ConversationTurnCompletedEvent.outbound_delivery 不要带 reply_access_token 进 projection — 这是个几行改的 net-positive,能立即缩小 token blast radius。其他凭证持久化细节起一个 follow-up issue 跟踪 + ADR-0013 加一段就够。
|
Addressed the latest follow-up review in
Validation run:
|
Review: PR 落地度 + 架构违规整体 Phase 0–5 承诺 基本都落地了(契约 / 三层 auth / 事件驱动 LLM reply / endpoint shim / CI 门禁 / ADR + supersede 都在),但有 3 条违反 Issue #328 Phase 0 premise 和 CLAUDE.md 架构规则 的问题必须处理,其中第 1 条是硬违规。 🔴 1.
|
Summary
Closes #328.
This PR completes the Aevatar-side implementation for the unified channel inbound backbone and removes the old Nyx relay side path that was bypassing
ConversationGAgent.What changed
Aevatar.GAgents.Channel.LarktoAevatar.GAgents.Platform.Larkand split solution filters intoaevatar.channels.slnfandaevatar.platforms.slnfAevatar.GAgents.Channel.NyxIdRelayas the channel-neutral Nyx relay transport and outbound portChatActivitywith typedTransportExtrasand thread outbound reply data throughOutboundDeliveryContextChannelBotRegistrationquery paths so relay traffic resolves registrations without depending onactivity.Bot.ValueLarkConversationTurnRunnerintoChannelConversationTurnRunnerChatActivity+202, with business orchestration handled byConversationGAgentNeedsLlmReplyEvent/LlmReplyReadyEvent+ inbox runtime)devNyxIdChatGAgentImpact
ConversationGAgentis again the single inbound fact source for relay trafficPlatform.Lark; relay transport concerns live underChannel.NyxIdRelayValidation
Passed locally:
Notes
dev.ChronoAIProject/NyxID#469.