Refactor workflow execution lifecycle and align MessageId semantics#5
Closed
Refactor workflow execution lifecycle and align MessageId semantics#5
Conversation
Enforce per-run execution invariants (owner/YAML fail-fast, RunId == execution actor ID, and metadata run_id consistency) and propagate MessageId on workflow completion to keep streaming/projection correlation stable. Also isolate per-run stats from long-lived workflow actors, add compatibility warnings for direct WorkflowGAgent entry, sync identifier/lifecycle docs, and add a destroy-race memo for follow-up soft-destroy/grace-period design.
This was referenced Apr 13, 2026
This was referenced Apr 21, 2026
eanzhao
added a commit
that referenced
this pull request
Apr 27, 2026
P1 — workflow termination on publish failure (reviewer's critical):
`PublishFailureAsync` emits `StepCompletedEvent { Success = false }`,
which the kernel routes through `TryRetryAsync` → `TryOnErrorAsync` →
fail. With no retry/on_error policy on `publish_to_twitter`, a Twitter
401/403/429/5xx terminated the entire workflow run as failed. Add
`on_error: { strategy: skip, default_output: "twitter_publish_failed" }`
to the YAML so the run advances to `done` cleanly; the module already
surfaces categorized errors to Lark independently.
#2 — Twitter v2 native error shape: `ClassifyTwitterResponse` now
recognizes the third response shape NyxID can forward verbatim:
`{ "title": "...", "detail": "...", "errors": [...] }` (Twitter's
native problem-details for content-policy / duplicate-tweet rejections).
Falls through to `twitter_publish_rejected` with the Twitter `message`
text in the Lark surfacing so users read the actual rejection reason.
#1 — Duplicate tweet risk: documented in code comment that
`POST /2/tweets` has no server-side dedup; the social_media template
intentionally has no `retry` policy on this step, and `on_error: skip`
advances rather than retrying. Authors customizing the YAML must keep
this invariant.
#3 — Removed redundant `nyxClient!` null-forgiving (no-op cleanup).
#4 — Renamed `ChannelMetadataKeys.LarkProxySlug` →
`LarkOutboundProxySlug` (`channel.lark.outbound_proxy_slug`) to
disambiguate "Lark API surface" from "NyxID provider routing".
#5 — Added xml-doc on `TrySendLarkAsync` documenting the dual-scope
api-key dependency (key must carry both api-twitter AND api-lark-bot
entitlements) so future callers don't silently break Lark surfacing
when narrowing the key's scope.
#6 — Added `RequiredServiceSlugs` field to `SocialMediaTemplateSpec`
for parity with `DailyReportTemplateSpec`; `CreateSocialMediaAgentAsync`
now reads slugs from the spec instead of inlining the list.
Tests:
- 3 new `ClassifyTwitterResponse` tests for the Twitter native error
shapes (errors-array, RFC-7807 title/detail-only, empty-object
unexpected-shape).
- Existing social_media test now also asserts `strategy: skip` lands in
the upserted YAML.
- 482 channel-runtime + 236 workflow.core tests pass; full solution
builds with 0 errors.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
eanzhao
added a commit
that referenced
this pull request
Apr 27, 2026
…elper Resolves three followup architecture-review points on PR #451: ### Channel.Runtime drops AI / Workflow direct deps (review #2) `Channel.Runtime`'s csproj used to pull in `Aevatar.AI.Abstractions` and `Aevatar.Workflow.Application.Abstractions` because of two files that straddled the channel/AI and channel/workflow boundary: - `ChannelContextMiddleware` (an `ILLMCallMiddleware` impl) — moved to `Aevatar.GAgents.NyxidChat`, which is the only package that needs it and already references `AI.Abstractions`. NyxidChat SCE registers it for the LLM call pipeline; Channel.Runtime SCE no longer touches `ILLMCallMiddleware`. - `ChannelCardActionRouting` (builds `WorkflowResumeCommand`) — moved to `Aevatar.GAgents.NyxidChat` for the same reason. Its sole consumer (`ChannelConversationTurnRunner`) lives there too. `Channel.Runtime.csproj` now references only `Channel.Abstractions`, `Foundation.Abstractions`/`Core`, and the `CQRS.Projection.*` slice — matching the "channel-agnostic flow + projection infrastructure" charter from the RFC. Tests (`ChannelCardActionRoutingTests`) get the extra `using Aevatar.GAgents.NyxidChat;`. ### Extract Elasticsearch projection-store toggle helper (review #4) The `ResolveElasticsearchEnabled` + `BuildElasticsearchOptions` helper pair was duplicated three times (Channel.Runtime / Device / Scheduled SCEs) with slightly different log strings and `Console.Error.WriteLine` output. Centralized into `Aevatar.CQRS.Projection.Providers.Elasticsearch.DependencyInjection.ElasticsearchProjectionConfiguration` with two static helpers: - `IsEnabled(IConfiguration?, ILogger?, string? storeName)` — explicit flag → endpoints presence → false; logs a structured warning via `ILogger` (when supplied) instead of `Console.Error.WriteLine`. - `BindOptions(IConfiguration)` — typed binder for `ElasticsearchProjectionDocumentStoreOptions`. All three SCEs now call into this helper; per-package warning text is parameterized via `storeName`. Section path (`Projection:Document:Providers:Elasticsearch`) is exposed as a const so future call sites stay in sync. ### Followup points acknowledged but deferred - **Cross-package dep chain `NyxidChat → Authoring.Lark → Scheduled → Platform.Lark`** (review #1) — pre-existing arch debt that the split surfaced rather than introduced. Cleaner would be to invert via `IInboundFlowResolver` plug-ins so `ChannelConversationTurnRunner` doesn't reach into `AgentBuilderCardFlow` directly. Out of scope for the package split; tracking as a separate follow-up. - **Tombstone compactor "central coordinator"** (review #3) — `Channel.Runtime` defines `ITombstoneCompactionTarget` but does not reference `Device` / `Scheduled` at the csproj level; per-package targets register themselves through DI. The plug-in pattern is intentional and keeps the DAG one-way. - **`Scheduled` package name vs UserAgentCatalog content** (review #5) — `UserAgentCatalog` is the delivery-target registry that Scheduled agents read at execution time to route output, so co-locating it with `SkillRunnerGAgent` / `WorkflowAgentGAgent` is intentional. Renaming to `AgentCatalog` would split actors from their primary consumer; deferring. 473/473 ChannelRuntime.Tests pass; full slnx still only fails the same two pre-existing Mainnet hosting `BindAsync on IStudioMemberService` tests that reproduce on origin/dev. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This was referenced Apr 27, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Workflow Run Lifecycle(重构汇报稿)
执行摘要
这次重构的目标,是把“长期编排职责”和“单次执行职责”拆开,解决当前一个
WorkflowGAgent同时承担模板管理、运行执行、并发隔离带来的复杂度问题。目标方案是:
WorkflowGAgent作为长期编排实例,负责 workflow 定义、策略和入口。WorkflowExecutionGAgent作为一次 run 的执行实例,AgentId与RunId同值。MessageId保留为 AI 消息链路标识,不与RunId合并。这套模型的核心收益是:执行隔离更清晰、查询主键统一、并发场景更稳,且与现有 CQRS 投影链路兼容。
1. 背景与重构目标
1.1 当前痛点
RunId、ActorId、MessageId在语义上容易被混用,讨论成本高。1.2 重构目标
RunId成为唯一执行主键,查询和追踪统一。MessageId的消息链路能力,避免流式与投影能力回退。2. 目标架构
2.1 角色分工
WorkflowGAgent(long-lived)WorkflowExecutionGAgent(per-run)flowchart TB workflowAgent["WorkflowGAgent (long-lived)"] executionA["WorkflowExecutionGAgent (runA)"] executionB["WorkflowExecutionGAgent (runB)"] runId["RunId = ExecutionActorId"] sessionKey["MessageId (message chain key)"] workflowAgent --> executionA workflowAgent --> executionB executionA --> runId executionA --> sessionKey2.2 标识符职责边界
WorkflowGAgent.AgentIdRunIdRunIdExecutionActorIdWorkflowExecutionGAgent.AgentIdRunId同值MessageId3. 关键设计:Execution 如何定位长期 Workflow
核心原则:执行实例不做运行时搜索,编排层在创建时显式绑定 owner。
标准流程:
WorkflowGAgent(agentId命中则复用,否则创建并配置)。RunId。WorkflowExecutionGAgent,并强制executionActorId = runId。BindWorkflowAgentId(workflowAgentId)显式绑定 owner。ConfigureWorkflow(workflowYaml, workflowName)完成 execution 初始化。LinkAsync(workflowAgentId, executionActorId)。SendToAsync(workflowAgentId, evt)。设计约束:
workflowName扫描 owner。workflowAgentId在 run 生命周期内不可变。4. 一次请求的调用链与参数赋值
sequenceDiagram participant client as Client participant api as HostApi participant app as WorkflowApplication participant resolver as RunActorResolver participant execution as WorkflowExecutionGAgent participant role as RoleGAgent participant projection as ProjectionPipeline client->>api: "/api/chat(prompt, workflow, agentId?)" api->>app: "ExecuteAsync(request)" app->>resolver: "ResolveOrCreate(workflowAgent)" app->>app: "Generate runId" app->>execution: "Create(actorId = runId)" app->>execution: "BindWorkflowAgentId(workflowAgentId)" app->>execution: "ConfigureWorkflow(workflowYaml, workflowName)" app->>resolver: "LinkAsync(workflowAgentId, executionActorId)" app->>execution: "Dispatch run events" execution->>role: "ChatRequestEvent(messageId = runId:stepId)" role-->>execution: "TextMessageStart/Content/End(messageId)" execution-->>projection: "Events(runId, messageId)" projection-->>api: "Run report + stream frames"参数赋值表:
promptChatInput.Prompt原样传递ChatRequestEvent.PromptworkflowNameagentIdWorkflowGAgentrunIdexecutionActorIdexecutionActorId = runIdworkflowAgentIdthreadIdexecutionActorIdmessageIdchat-{guid};步骤runId:stepId(:attempt预留)5. 为什么
MessageId不能删RunId和MessageId不是同一层标识。RunId管执行实例,MessageId管消息会话链路。删除
MessageId后会直接影响:LLMCallModule依赖messageId关联_pending请求。AIChatMessageRunIdResolver通过messageId反解runId。msg:{messageId}无法稳定归并 start/content/end。message_id维度丢失。结论:
MessageId必须保留,但它不是执行主键。6. 落地步骤(建议汇报用)
阶段 1:模型落地
WorkflowExecutionGAgent,定义 owner 绑定字段(workflowAgentId)。executionActorId = runId。阶段 2:链路切换
WorkflowExecutionGAgent产生。threadId统一切到 execution actor id。阶段 3:验证与收口
WorkflowGAgent多 run)。MessageId维度)。/api/runs/{runId})。7. 风险与控制
workflowAgentIdthreadId = executionActorIdchat-{guid}与runId:stepId(:attempt预留)8. 验收标准
WorkflowExecutionGAgent,且AgentId == RunId。WorkflowGAgent并发多 run 时无串线。RUN_STARTED/RUN_FINISHED标识稳定。runId可稳定获取完整报告。MessageId的回归测试明确失败(作为保护性用例)。9. 代码锚点(便于评审)
src/Aevatar.Host.Api/Endpoints/ChatEndpoints.cssrc/Aevatar.Host.Api/Endpoints/ChatWebSocketRunCoordinator.cssrc/workflow/Aevatar.Workflow.Application/Runs/WorkflowChatRunApplicationService.cssrc/workflow/Aevatar.Workflow.Application/Runs/WorkflowRunActorResolver.cssrc/workflow/Aevatar.Workflow.Application/Orchestration/WorkflowExecutionRunOrchestrator.cssrc/workflow/Aevatar.Workflow.Core/WorkflowExecutionGAgent.cssrc/workflow/Aevatar.Workflow.Core/WorkflowDefinitionSnapshot.csRunId生成:src/workflow/Aevatar.Workflow.Application/Runs/WorkflowChatRunApplicationService.csRunId注入:src/workflow/Aevatar.Workflow.Application/Runs/WorkflowChatRequestEnvelopeFactory.csRunId提取:src/workflow/Aevatar.Workflow.Core/WorkflowGAgent.csMessageId规范:src/Aevatar.AI.Abstractions/ChatMessageKeys.cssrc/workflow/Aevatar.Workflow.Core/Modules/LLMCallModule.cs相关文档:
docs/IDENTIFIER_RELATIONSHIPS.md