Implement Channel.Runtime package (grains + middleware + inbox + observability) #279

Merged
eanzhao merged 8 commits into dev from
feat/2026-04-21_issue-258-channel-runtime
Apr 21, 2026

@eanzhao eanzhao commented Apr 21, 2026

Closes #258 once merged through the channel-abstractions → dev chain.

Summary

Implements the Aevatar.GAgents.Channel.Runtime package on top of the abstractions
merged in #272 (issue #257).

  • Grains
    • ConversationGAgent (GAgentBase<ConversationGAgentState>) — per-conversation
      single-activation actor keyed by ConversationReference.CanonicalKey.
      Authoritative dedup on ProcessedMessageIds / ProcessedCommandIds with a
      sliding-window cap (10 000). Turn runner seam (IConversationTurnRunner)
      keeps bot invocation out of the grain.
    • ChannelUserBindingGAgent — user credential + preferences split out of the
      legacy ChannelUserGAgent per RFC §5.2b.
    • IShardLeaderGrain — abstraction only; Discord gateway implementation
      deferred per issue scope.
  • Middleware pipeline: ChannelPipeline + MiddlewarePipelineBuilder
    compose IChannelMiddleware instances (ASP.NET-Core-style). Three
    standard middlewares shipped:
    • ConversationResolverMiddleware
    • LoggingMiddleware
    • TracingMiddleware
  • Durable inbox: DurableInboxSubscriber bridges the ChatActivity stream
    into the pipeline with RFC §5.8 / §9.5.2 semantics: return→commit,
    throw→redeliver, bounded Channel<T> (1000 / Wait / 500 ms).
  • Observability: ChannelDiagnostics centralises the Aevatar.Channel
    ActivitySource, RFC §6.1 span names, and mandatory tag keys.
  • DI: AddChannelRuntime() extension.
  • Protos: conversation_state.proto, channel_user_binding.proto define
    state + command/event contracts per CLAUDE.md proto-first rule.
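For readers who want a concrete picture of the sliding-window dedup cap mentioned for ConversationGAgent, here is a minimal sketch. It is not the PR's code: `SlidingDedupWindow` and its members are hypothetical names, and the real grain keeps its ids in proto-typed state rather than in in-memory collections.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not from the PR): remembers the most recent
// `capacity` ids and evicts the oldest one once the cap is exceeded.
public sealed class SlidingDedupWindow
{
    private readonly int _capacity;
    private readonly HashSet<string> _seen = new();
    private readonly Queue<string> _order = new();

    public SlidingDedupWindow(int capacity)
    {
        if (capacity <= 0) throw new ArgumentOutOfRangeException(nameof(capacity));
        _capacity = capacity;
    }

    public int Count => _seen.Count;

    // Returns true when the id is new (and records it); false for a duplicate.
    public bool TryMarkProcessed(string id)
    {
        if (!_seen.Add(id)) return false;

        _order.Enqueue(id);
        if (_order.Count > _capacity)
        {
            // Oldest id falls out of the window and could be accepted again.
            _seen.Remove(_order.Dequeue());
        }
        return true;
    }
}
```

With the cap of 10 000 from the summary, this would be `new SlidingDedupWindow(10_000)`. An id older than the window is no longer recognised as a duplicate, which is the trade-off a bounded window accepts.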

Issue #258 acceptance checklist

  • Aevatar.GAgents.Channel.Runtime.csproj builds standalone
  • Orleans Streams subscription OnNextAsync semantics unit tests pass (exercised via
    a transport-agnostic IStream analog in DurableInboxSubscriberTests:
    success-commit, throw-redeliver, buffer-full-timeout)
  • ConversationGAgent dedup + atomic commit tests pass (TOCTOU scenario
    covered in ConversationGAgentDedupTests)
  • Observability spans pass the tracing smoke test (ActivityListener-based in
    ChannelTracingSmokeTests; avoids pulling OpenTelemetry.Exporter.InMemory
    into the test project)
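An ActivityListener-based smoke test of the kind described in the last item can be sketched as below. This is an illustration only, not the contents of ChannelTracingSmokeTests: `TracingSmokeSketch.Capture` is a hypothetical helper, while `ActivityListener`, `ActivitySource.AddActivityListener`, and `ActivitySamplingResult` are the standard System.Diagnostics APIs.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

// Hypothetical helper (not from the PR).
public static class TracingSmokeSketch
{
    // Runs `emit` with a listener attached to `sourceName` and returns every
    // activity that stopped while the listener was registered.
    public static List<Activity> Capture(string sourceName, Action emit)
    {
        var stopped = new List<Activity>();
        using var listener = new ActivityListener
        {
            ShouldListenTo = source => source.Name == sourceName,
            // Record everything so tags and status are populated.
            Sample = (ref ActivityCreationOptions<ActivityContext> options) =>
                ActivitySamplingResult.AllDataAndRecorded,
            ActivityStopped = activity => stopped.Add(activity),
        };
        ActivitySource.AddActivityListener(listener);

        emit();
        return stopped;
    }
}
```

A test would then start a `channel.pipeline.invoke` span on the `Aevatar.Channel` source inside `emit` and assert on the captured span name and tags, without needing an OpenTelemetry exporter.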

Test plan

  • dotnet build aevatar.slnx --nologo — 0 errors.
  • dotnet test test/Aevatar.GAgents.Channel.Protocol.Tests/Aevatar.GAgents.Channel.Protocol.Tests.csproj — 28 passed, 0 failed.
  • bash tools/ci/channel_mega_interface_guard.sh — ok.
  • bash tools/ci/architecture_guards.sh — passed.
  • bash tools/ci/test_stability_guards.sh — passed (no new polling waits).

Deviations / scope notes

  • Stream subscription is validated against the channel-runtime-facing contract
    (DurableInboxSubscriber.OnNextAsync / OnNextInlineAsync) rather than by
    spinning up Orleans streams — Orleans-specific wiring belongs in host-layer
    packages, not Channel.Runtime (matches "no Orleans package in Runtime"
    constraint). The semantics tested are the ones the observer contract must
    honor regardless of provider.
  • IShardLeaderGrain is intentionally abstraction-only.
  • Projection integration is achieved through the standard
    PersistDomainEventAsync → CommittedStateEventPublisher flow inherited from
    GAgentBase. No new projector is introduced; read models come in follow-up
    PRs for ChatHistory / UserMemory consumers.

Dependencies

Depends on #272 (issue #257 abstractions). PR is stacked on that base branch.

🤖 Generated with Claude Code

@eanzhao eanzhao changed the base branch from feat/2026-04-21_issue-257-channel-abstractions to dev April 21, 2026 07:16
eanzhao and others added 4 commits April 21, 2026 15:17
Scaffold proto schemas for ConversationGAgent state and ChannelUserBindingGAgent
state + commands/events, per issue #258 Runtime package deliverables.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- ConversationGAgent owns per-conversation state keyed by canonical_key and
  enforces authoritative dedup + atomic turn commit per RFC §5.2b.
- ChannelUserBindingGAgent splits out user credential/preference binding from
  legacy ChannelUserGAgent; conversation state stays conversation-scoped.
- IConversationTurnRunner seam lets the grain defer bot turn execution to
  runtime without leaking pipeline or adapter details into the actor.
- IShardLeaderGrain defines the lease/fencing contract used by gateway
  adapters (Discord, WeChat gateway). Implementation deferred per issue.

Adds Foundation.Abstractions / Foundation.Core project refs and
M.E.DependencyInjection.Abstractions / Logging.Abstractions packages.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…sion

- ChannelPipeline composes IChannelMiddleware via ASP.NET-Core-style builder.
- ConversationResolverMiddleware / LoggingMiddleware / TracingMiddleware cover
  RFC §5.7 standard middleware set; Tracing attaches mandatory dimensions per
  §6.1 around channel.pipeline.invoke span.
- DurableInboxSubscriber provides OnNextAsync semantics per RFC §5.8 + §9.5.2
  (return→commit, throw→redeliver, bounded Channel<T> 1000/Wait/500ms timeout).
- ChannelDiagnostics centralises the Aevatar.Channel ActivitySource + span
  names + tag keys so middleware, grains, and adapters stay on one schema.
- AddChannelRuntime() DI extension registers middlewares + default runner.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cover the issue #258 acceptance criteria:
- ConversationGAgent dedup + atomic commit, including TOCTOU-style redeliver
  collapse, sliding-window cap, continue-command duplicate rejection, and
  failed-turn event emission.
- MiddlewarePipeline ordering, short-circuit, and exception propagation.
- DurableInboxSubscriber OnNext semantics: success-commit, throw-redeliver,
  buffer-full timeout triggers redelivery.
- TracingMiddleware emits channel.pipeline.invoke span with mandatory §6.1
  dimensions (activity_id, canonical_key, bot_instance_id, channel_id,
  retry_count, auth_principal) + sets Error status when downstream throws.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
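The ordering and short-circuit behaviour that the pipeline tests cover can be illustrated with a small ASP.NET-Core-style builder. This is a sketch under assumptions: the PR's actual IChannelMiddleware signature is not shown in this thread, so `ISketchMiddleware`, `TurnDelegate`, `SketchPipelineBuilder`, and `DelegateMiddleware` are hypothetical stand-ins.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical shapes (not from the PR).
public delegate Task TurnDelegate(IDictionary<string, object> context);

public interface ISketchMiddleware
{
    Task InvokeAsync(IDictionary<string, object> context, TurnDelegate next);
}

// Adapter so a middleware can be written inline as a lambda.
public sealed class DelegateMiddleware : ISketchMiddleware
{
    private readonly Func<IDictionary<string, object>, TurnDelegate, Task> _body;

    public DelegateMiddleware(Func<IDictionary<string, object>, TurnDelegate, Task> body) => _body = body;

    public Task InvokeAsync(IDictionary<string, object> context, TurnDelegate next) => _body(context, next);
}

public sealed class SketchPipelineBuilder
{
    private readonly List<ISketchMiddleware> _middlewares = new();

    public SketchPipelineBuilder Use(ISketchMiddleware middleware)
    {
        _middlewares.Add(middleware);
        return this;
    }

    // Wraps from last to first so the first registered middleware runs outermost.
    public TurnDelegate Build(TurnDelegate terminal)
    {
        var next = terminal;
        for (var i = _middlewares.Count - 1; i >= 0; i--)
        {
            var middleware = _middlewares[i];
            var downstream = next;
            next = context => middleware.InvokeAsync(context, downstream);
        }
        return next;
    }
}
```

A middleware short-circuits simply by not calling `next`, and an exception thrown downstream propagates back through every outer middleware that awaited `next`.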
@eanzhao eanzhao force-pushed the feat/2026-04-21_issue-258-channel-runtime branch from 5ee4196 to 8f1f804 Compare April 21, 2026 07:18

@eanzhao eanzhao left a comment


Left three blocking comments on durable inbox delivery semantics, transient continue retry semantics, and the new JSON-backed user-binding state.

Comment thread agents/Aevatar.GAgents.Channel.Runtime/Inbox/DurableInboxSubscriber.cs Outdated
Comment thread agents/Aevatar.GAgents.Channel.Runtime/protos/channel_user_binding.proto Outdated
eanzhao and others added 2 commits April 21, 2026 15:33
Reviewer flagged three correctness issues; fix each at the source.

1. DurableInboxSubscriber: give every enqueued activity its own TCS so
   OnNextAsync only returns after the pipeline actually finishes. The
   observer's return → commit / throw → redeliver contract now holds even
   with the bounded working buffer (previously, OnNextAsync returned after
   enqueue, so Orleans could commit the offset before the pipeline ran or
   surface failures through the worker task instead of the observer task).

2. ConversationGAgent: only append to processed_command_ids on terminal
   (NotRetryable) continue failures. Retriable failures
   (retry_after_ms oneof) stay retriable so callers can re-dispatch the
   same logical command id after the back-off — previously the retry came
   back as DuplicateCommand.

3. ChannelUserPreferences: introduce a typed proto message and replace the
   opaque preferences_json string across state/command/event contracts.
   Internal actor state stays proto-typed per CLAUDE.md; downstream
   projections consume typed fields instead of parsing blobs.

Tests updated:
- Rewrite OnNextAsync buffer-full test to fire-and-forget the back-pressured
  sends (they now block on pipeline completion), add coverage for throw-path
  surfacing through the completion handshake and for dispose-time redelivery
  signalling.
- Add continue-command retriable-vs-permanent failure semantics tests that
  pin down the new processed_command_ids rules.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
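The per-activity completion handshake from fix 1 can be sketched roughly as follows. This is not DurableInboxSubscriber itself: `HandshakeInbox<T>` and its members are hypothetical, and only the shape (bounded `Channel<T>` in Wait mode, one `TaskCompletionSource` per enqueued item, enqueue timeout) follows the description above.

```csharp
using System;
using System.Threading;
using System.Threading.Channels;
using System.Threading.Tasks;

// Hypothetical sketch (not from the PR).
public sealed class HandshakeInbox<T>
{
    private sealed record Pending(T Item, TaskCompletionSource Completion);

    private readonly Channel<Pending> _buffer;
    private readonly Func<T, Task> _pipeline;
    private readonly TimeSpan _enqueueTimeout;
    private readonly Task _worker;

    public HandshakeInbox(Func<T, Task> pipeline, int capacity, TimeSpan enqueueTimeout)
    {
        _pipeline = pipeline;
        _enqueueTimeout = enqueueTimeout;
        _buffer = Channel.CreateBounded<Pending>(
            new BoundedChannelOptions(capacity) { FullMode = BoundedChannelFullMode.Wait });
        _worker = Task.Run(DrainAsync);
    }

    // Completes only after the pipeline has handled the item. A pipeline
    // exception, or an enqueue timeout on a full buffer, surfaces here so
    // the stream provider can redeliver instead of committing.
    public async Task OnNextAsync(T item)
    {
        var pending = new Pending(item,
            new TaskCompletionSource(TaskCreationOptions.RunContinuationsAsynchronously));
        using var timeout = new CancellationTokenSource(_enqueueTimeout);
        await _buffer.Writer.WriteAsync(pending, timeout.Token).ConfigureAwait(false);
        await pending.Completion.Task.ConfigureAwait(false);
    }

    // Stops accepting new items and waits for the worker to drain the buffer.
    public Task StopAsync()
    {
        _buffer.Writer.TryComplete();
        return _worker;
    }

    private async Task DrainAsync()
    {
        await foreach (var pending in _buffer.Reader.ReadAllAsync())
        {
            try
            {
                await _pipeline(pending.Item).ConfigureAwait(false);
                pending.Completion.TrySetResult();
            }
            catch (Exception ex)
            {
                pending.Completion.TrySetException(ex);
            }
        }
    }
}
```

Because the caller awaits the item's own completion source, returning from `OnNextAsync` means the pipeline really ran, which is what keeps return → commit and throw → redeliver honest even with a buffer in between.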
@eanzhao eanzhao marked this pull request as ready for review April 21, 2026 07:49
@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a27381cbb0


Comment on lines +170 to +173
if (result.RetryAfter is { } retry)
    failed.RetryAfterMs = (long)retry.TotalMilliseconds;
else
    failed.NotRetryable = new Google.Protobuf.WellKnownTypes.Empty();

P1: Preserve transient retries when no retry_after is provided

This branch marks every failure without RetryAfter as not_retryable, regardless of FailureKind. That turns ConversationTurnResult.TransientFailure(...) (whose default retryAfter is null) into a terminal failure, and ApplyContinueFailed then consumes the command_id, so subsequent redispatches are rejected as duplicates instead of retrying. This breaks transient recovery paths whenever a runner reports a transient error but omits an explicit backoff value.


Fixed in 3056a7e. Retry policy now derives from FailureKind, not from whether RetryAfter was supplied: only PermanentAdapterError writes NotRetryable; every other kind stays retriable with retry_after_ms = supplied value (or 0 when omitted). Added HandleContinueCommandAsync_TransientFailureWithoutRetryAfter_StaysRetriable to pin down the default-backoff case.
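The rule described in this reply (policy derives from the failure kind, not from whether a backoff was supplied) can be sketched as below. The types are stand-ins, not the PR's: `SketchFailureKind` only models one transient member next to `PermanentAdapterError`, and the `RetryPolicy` record replaces the proto oneof with two plain fields.

```csharp
using System;

// Hypothetical enum: the real FailureKind members other than
// PermanentAdapterError are not listed in this thread.
public enum SketchFailureKind
{
    TransientAdapterError,
    PermanentAdapterError,
}

// Plain stand-in for the proto oneof (not_retryable | retry_after_ms).
public sealed record RetryPolicy(bool NotRetryable, long RetryAfterMs);

public static class RetryPolicySketch
{
    // Only a permanent failure is terminal. Every other kind stays retriable,
    // with a backoff of 0 ms when the runner did not supply one.
    public static RetryPolicy Assign(SketchFailureKind kind, TimeSpan? retryAfter) =>
        kind == SketchFailureKind.PermanentAdapterError
            ? new RetryPolicy(NotRetryable: true, RetryAfterMs: 0)
            : new RetryPolicy(NotRetryable: false, RetryAfterMs: (long)(retryAfter?.TotalMilliseconds ?? 0));
}
```

Under this rule a transient failure with a null backoff yields a retriable policy with `RetryAfterMs == 0`, so the command id is not consumed and a redispatch is not rejected as a duplicate.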

Comment on lines +111 to +112
// Block until the worker has actually finished (successful → commit; throw → redeliver).
await pending.Completion.Task.ConfigureAwait(false);

P2: Fail fast when OnNextAsync runs before Start

OnNextAsync always waits on pending.Completion, but that completion is only signaled by the background worker started via Start(). If a caller wires OnNextAsync as the stream handler (as documented) and forgets to call Start, the first delivery hangs indefinitely instead of throwing, which stalls durable-inbox consumption and checkpoint progress. The method should either auto-start the worker or throw immediately when _worker is not running.


Fixed in 3056a7e. Extracted EnsureWorkerStarted() (thread-safe CompareExchange singleton) and called it from both Start() and OnNextAsync(). First delivery auto-starts the worker so wiring OnNextAsync directly as a stream handler no longer hangs. Added OnNextAsync_AutoStartsWorker_WhenStartWasNotCalled to cover this path.
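A start-once guard built on `Interlocked.CompareExchange`, in the spirit of the EnsureWorkerStarted() described here, might look like this. It is a sketch only: `LazyWorker` is a hypothetical class, and the real method may track the worker differently (for example by exchanging a task reference rather than an integer flag).

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical sketch (not from the PR).
public sealed class LazyWorker
{
    private readonly Func<Task> _loop;
    private int _started; // 0 = not started, 1 = started

    public LazyWorker(Func<Task> loop) => _loop = loop;

    // Safe to call from Start() and from every delivery: only the caller
    // that flips the flag from 0 to 1 launches the loop.
    // Returns true for that caller, false for everyone else.
    public bool EnsureWorkerStarted()
    {
        if (Interlocked.CompareExchange(ref _started, 1, 0) != 0)
        {
            return false;
        }

        _ = Task.Run(_loop);
        return true;
    }
}
```

Calling the guard at the top of the delivery path means a subscriber wired directly as a stream handler starts its worker on the first message instead of waiting forever on a completion that nothing will signal.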


codecov Bot commented Apr 21, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 68.58%. Comparing base (45a0de8) to head (33c57e4).
⚠️ Report is 9 commits behind head on dev.

@@            Coverage Diff             @@
##              dev     #279      +/-   ##
==========================================
- Coverage   68.59%   68.58%   -0.01%     
==========================================
  Files        1110     1110              
  Lines       78074    78074              
  Branches    10221    10221              
==========================================
- Hits        53554    53548       -6     
- Misses      20610    20614       +4     
- Partials     3910     3912       +2     
Flag Coverage Δ
ci 68.58% <ø> (-0.01%) ⬇️

see 1 file with indirect coverage changes


eanzhao and others added 2 commits April 21, 2026 16:00
Two additional issues surfaced after the earlier round:

- P1: ConversationGAgent derived retry policy from whether result.RetryAfter
  was supplied, not from result.FailureKind. A TransientFailure without an
  explicit backoff was therefore written as NotRetryable and consumed the
  command id. AssignRetryPolicy now only sets NotRetryable for
  PermanentAdapterError; every other kind stays retriable (retry_after_ms = 0
  when the runner omits a backoff).

- P2: DurableInboxSubscriber.OnNextAsync blocked on pending.Completion which
  only signals once the worker loop has started. Wiring OnNextAsync directly
  as a stream handler without calling Start() hung forever. Start is now a
  thin wrapper over EnsureWorkerStarted (thread-safe via CompareExchange) and
  OnNextAsync auto-starts the worker on first delivery.

Tests added:
- HandleContinueCommandAsync_TransientFailureWithoutRetryAfter_StaysRetriable.
- OnNextAsync_AutoStartsWorker_WhenStartWasNotCalled.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@eanzhao eanzhao merged commit 5b850e1 into dev Apr 21, 2026
11 checks passed

Development

Successfully merging this pull request may close these issues.

[Channel RFC] Runtime package (grains / middleware / durable inbox integration)
