harden(webhook): nyxid-relay handler must accept-fast and persist to inbox before processing #449

@eanzhao

Architectural follow-up surfaced in docs/audit-scorecard/2026-04-27-daily-pipeline-architecture-review.md §A2. This is aevatar-side hardening against the fact that NyxID's callback delivery is fire-and-forget: there is no outbox, retry, or DLQ on the NyxID side (see ~/Code/NyxID/backend/src/services/channel_relay_service.rs:284-397).

Symptom

NyxIdChatEndpoints.HandleRelayWebhookAsync (NyxIdChatEndpoints.Relay.cs:28) does the following inline before returning:

  1. Read full request body
  2. Parse via NyxIdRelayTransport.Parse
  3. Validate JWT via NyxIdRelayAuthValidator.ValidateAsync
  4. Resolve canonical scope id (ResolveRelayScopeIdAsync)
  5. Normalize activity (Clone() etc.)
  6. Publish to ConversationGAgent inbox

Any exception in steps 2-5 returns 4xx to NyxID. NyxID records channel_messages.callback_status='failed' and never retries. The inbound message is permanently lost.

Issue #398 is a direct symptom ("Lark relay callbacks never reach aevatar — no POST /api/webhooks/nyxid-relay on inbound messages"), though that one is mostly a NyxID-side configuration problem. The aevatar-side issue is that even when NyxID does deliver, any blip in steps 2-5 of our handler is terminal.

Architectural violations

  • CLAUDE.md "事实源唯一" (single source of truth) / "committed event 必须可观察" (committed events must be observable): the only "persistence" of an inbound message in aevatar today happens after we publish to the ConversationGAgent inbox. Anything that fails before that point cannot be replayed.

Proposed direction

Two-phase webhook:

Phase 1 — accept: persist raw bytes + minimal metadata (message_id, headers) to a RelayInboundInboxGAgent (or an append-only document store) in O(1) write, then return 202. No parsing, no JWT validation, no normalization. Idempotent on message_id (NyxID supplies it in X-NyxID-Message-Id).
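A minimal, language-neutral sketch of the accept phase, written in Python for brevity (the real handler is C#). `InboxRow`, `InMemoryInbox`, and `accept` are hypothetical names, and the content-hash fallback for a missing `X-NyxID-Message-Id` header is an assumption, not something NyxID or this issue specifies.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, Mapping


@dataclass
class InboxRow:
    message_id: str
    raw_body: bytes            # stored verbatim; nothing is parsed in phase 1
    headers: Dict[str, str]
    status: str = "pending"


class InMemoryInbox:
    """Hypothetical stand-in for RelayInboundInboxGAgent or a document store."""

    def __init__(self) -> None:
        self.rows: Dict[str, InboxRow] = {}

    def put_if_absent(self, row: InboxRow) -> bool:
        """Store the row unless its message_id is already present."""
        if row.message_id in self.rows:
            return False
        self.rows[row.message_id] = row
        return True


def accept(inbox: InMemoryInbox, headers: Mapping[str, str], body: bytes) -> int:
    """Phase 1: persist and acknowledge. No parsing, JWT check, or normalization."""
    lowered = {name.lower(): value for name, value in headers.items()}
    # Assumption: fall back to a content hash if the header is missing, so a
    # malformed delivery is still captured rather than rejected.
    message_id = lowered.get("x-nyxid-message-id") or (
        "sha256:" + hashlib.sha256(body).hexdigest()
    )
    # A duplicate delivery is a no-op, which makes the endpoint idempotent.
    inbox.put_if_absent(InboxRow(message_id, body, lowered))
    return 202
```

Because the handler never inspects the body, a payload that would later fail to parse is still accepted and stored, which is the point of the split.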

Phase 2 — process: async worker (Orleans grain timer / dedicated consumer actor) picks up rows from the inbox and runs the existing parse → JWT validate → scope resolve → normalize → publish-to-ConversationGAgent pipeline. Failures stay in the inbox (with attempt count + last error), can be replayed manually or dead-lettered.
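The process phase could look roughly like the following Python sketch (again a model, not the C# worker). The `stages` list stands in for the existing parse, JWT validate, scope resolve, and normalize steps; `MAX_ATTEMPTS`, the `<stage>_failed` status naming, and the `dead_letter` status are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple

MAX_ATTEMPTS = 5  # assumed dead-letter threshold

Stage = Tuple[str, Callable[[Any], Any]]


@dataclass
class InboxRow:
    message_id: str
    raw_body: bytes
    status: str = "pending"    # pending | done | <stage>_failed | dead_letter
    attempts: int = 0
    last_error: str = ""


def process_row(
    row: InboxRow,
    stages: List[Stage],
    publish: Callable[[str, Any], None],
) -> None:
    """Run one attempt of the pipeline; on failure the row stays in the inbox."""
    if row.status in ("done", "dead_letter"):
        return  # replaying a finished row has no downstream effect
    row.attempts += 1
    stage_name = "start"
    try:
        value: Any = row.raw_body
        for stage_name, stage in stages:
            value = stage(value)
        stage_name = "publish"
        publish(row.message_id, value)
    except Exception as exc:
        # Record what failed so the row can be inspected and replayed later.
        row.last_error = f"{type(exc).__name__}: {exc}"
        exhausted = row.attempts >= MAX_ATTEMPTS
        row.status = "dead_letter" if exhausted else f"{stage_name}_failed"
        return
    row.status = "done"
    row.last_error = ""
```

A row that fails in the `parse` stage ends up with `status="parse_failed"` and can be retried once the cause is fixed. Note that skipping `done` rows only prevents duplicate publishes from this worker; downstream consumers would still need to deduplicate on `message_id` if the worker crashes between publishing and marking the row done.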

Knock-on benefits:

What this does NOT solve:

Acceptance

  • Webhook returns 202 in under 25 ms at p99 (just persist + ack).
  • No exception in parse / JWT / scope-resolve causes message loss.
  • Replaying a single inbox row produces identical downstream effects (idempotent by message_id).
  • Test: inject a parse failure mid-handler; verify the message is in the inbox with status=parse_failed, retryable.
  • Inbox storage retention policy defined (e.g. 7 days successful, 30 days failed).
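As one way to pin down the retention bullet, the rule could be expressed as a small predicate. This Python sketch uses the example figures above (7 days for successful rows, 30 days for failed ones); the status names and the choice never to expire `pending` rows are assumptions.

```python
from datetime import datetime, timedelta, timezone

RETENTION_DONE = timedelta(days=7)     # example value from the acceptance list
RETENTION_FAILED = timedelta(days=30)  # example value from the acceptance list


def is_expired(status: str, last_updated: datetime, now: datetime) -> bool:
    """Return True if an inbox row may be purged under the retention policy."""
    if status == "pending":
        # Assumption: unprocessed rows are never purged, whatever their age.
        return False
    limit = RETENTION_DONE if status == "done" else RETENTION_FAILED
    return now - last_updated > limit


if __name__ == "__main__":
    now = datetime(2026, 5, 1, tzinfo=timezone.utc)
    print(is_expired("done", now - timedelta(days=8), now))
    print(is_expired("parse_failed", now - timedelta(days=8), now))
```

Keeping failed rows longer than successful ones leaves time for manual replay after a fix ships.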

Affected files

  • `agents/Aevatar.GAgents.NyxidChat/NyxIdChatEndpoints.Relay.cs` — phase-1 only
  • new: `agents/Aevatar.GAgents.NyxidChat/RelayInboundInboxGAgent.cs` (or non-actor inbox store)
  • new: phase-2 consumer (worker grain or dedicated dispatcher)
  • `channel_runtime_messages.proto` — inbox row contract

Related
