feat(fanout): migrate SlackNotifyFn to FanOutConsumer subscriber (#64) by isadeks · Pull Request #79 · aws-samples/sample-autonomous-cloud-coding-agents

isadeks · 2026-05-12T23:43:50Z

Summary

Closes #64. Moves Slack outbound delivery off its own TaskEventsTable DynamoDB Streams consumer onto FanOutConsumer as a per-channel dispatcher. After this PR, TaskEventsTable has exactly one stream reader (FanOutConsumer), restoring headroom for future channels (Email, Teams, etc.) without exceeding DynamoDB's documented 2-reader-per-shard practical limit.

The single commit message walks through the migration (a), four review-fix layers (b–e), and a dev-stack regression fix (f); please read it before the diff for context. Quick map:

(a) Migration — slack-notify rewritten as dispatchSlackEvent(), FanOut wires it in, IAM grant migrated, SlackNotifyFn + its DynamoEventSource deleted from SlackIntegration.
(b) BLOCKER — restored partial-batch retry semantics for infra errors (originally regressed by the migration).
(c) Split SlackApiError into terminal (swallow) vs retryable (escalate to retry).
(d) Fixed NOTIFIABLE_EVENTS / CHANNEL_DEFAULTS drift; added render cases for task_stranded, agent_error.
(e) Conditional UpdateItem on task_created / session_started to prevent duplicate root messages on retry.
(f) Dropped pr_created from Slack defaults (verified live; was visually duplicating with task_completed's View PR button).

Test plan

mise //cdk:compile clean
mise //cdk:test — 1183 / 1183 pass (8 net-new tests added for the review fixes)
mise //cdk:eslint clean
mise //cdk:synth shows exactly one AWS::Lambda::EventSourceMapping on TaskEventsTable, pointing at FanOutFn
Dev-stack: Slack @mention happy path (👀 → ⏳ → ✅ + View PR + intermediate cleanup)
Dev-stack: Cancel button (🚫 + task_cancelled in thread)
Dev-stack: CLI submit (channel_source=api) — zero Slack dispatches, GitHub edit-in-place still fires
Dev-stack: agent_error / pr_created paths exercised end-to-end via real PR creation

Notes

After deploying:

aws lambda list-event-source-mappings \
  --query "EventSourceMappings[?contains(EventSourceArn, 'TaskEventsTable')].FunctionArn"

Should return exactly one ARN: the FanOutFn.

…-samples#64)

Move the Slack outbound delivery off its own DynamoDB Streams consumer onto FanOutConsumer as a per-channel dispatcher. Drops TaskEventsTable from 2 concurrent stream readers to 1, restoring headroom for future channels (Email, Teams, etc.) without exceeding the documented DynamoDB Streams 2-reader-per-shard practical limit.

The PR also addresses an adversarial code review on the original migration; the body below walks through each piece in the order it landed.

## (a) Migration

- `cdk/src/handlers/slack-notify.ts`: rewritten as exported `dispatchSlackEvent(event, ddb)` plus a tagged `SlackApiError` class. The standalone `handler(event)` stream entrypoint is gone; the FanOutConsumer is now the only TaskEventsTable stream reader. Behaviour preserved bit-for-bit: channel_source==='slack' gate, terminal-event dedup via conditional UpdateItem on `channel_metadata.slack_notified_terminal`, threaded replies under the @mention or task_created message, emoji transitions (eyes -> hourglass -> ✅/❌/🚫/⏲), DM channel_id -> user_id rewrite, intermediate session+created message cleanup on terminal events.
- `cdk/src/handlers/fanout-task-events.ts`: replaces the log-only `dispatchToSlack` stub with a wrapper that calls dispatchSlackEvent and routes errors via the new typed contract (see (b) below). Slack defaults gain task_created, session_started, task_timed_out so the router fans out the lifecycle events the old SlackNotifyFn handled; the dispatcher's channel_source gate keeps non-Slack tasks unaffected.
- `cdk/src/constructs/fanout-consumer.ts`: adds a scoped `secretsmanager:GetSecretValue` grant on `bgagent/slack/*` so the fanout Lambda can fetch per-workspace bot tokens. Same scope the old SlackNotifyFn role held.
- `cdk/src/constructs/slack-integration.ts`: deletes SlackNotifyFn, its DynamoEventSource, its IAM policy, and its NagSuppressions entry. Drops the now-unused StartingPosition / FilterCriteria / FilterRule / lambdaEventSources imports.
After this lands, `aws lambda list-event-source-mappings` shows exactly one consumer of the TaskEventsTable stream (FanOutFn); verified on the dev stack with end-to-end @mention + cancel + CLI isolation scenarios.

## (b) Review fix #1: partial-batch retry semantics (BLOCKER)

The first review pass found that the post-migration handler silently dropped Slack-side infra errors (DDB throttle on the GetItem, Secrets Manager 5xx, transient Slack timeout). Pre-migration the SlackNotifyFn handler rethrew non-SlackApiError so Lambda retried the batch; post-migration `Promise.allSettled` swallowed the rejection and routeEvent returned an empty list with no escalation path to `batchItemFailures`.

routeEvent's return type changed from `NotificationChannel[]` to `{ dispatched, infraRejections }`. The handler now pushes the record into `batchItemFailures` whenever `infraRejections.length>0`, so Lambda replays the record under the partial-batch contract. The warn line on rejection is tagged `retryable: true` so operators can alert distinctly from the channel-terminal swallow path.

GitHub got the symmetric treatment: 4xx (excluding the existing 401 and 404 handling) is now treated as a channel-terminal swallow via `fanout.github.api_error` instead of escalating to retry.

## (c) Review fix #2: split SlackApiError into terminal + retryable

Originally any `!result.ok` Slack response was wrapped in SlackApiError and swallowed. That collapsed retryable codes (`ratelimited`, `service_unavailable`, `internal_error`, `fatal_error`, `request_timeout`) into the same swallow as `channel_not_found`, so a tier-1 Slack outage would silently drop every message.

Introduced `TERMINAL_SLACK_API_ERRORS` set + `classifySlackError` helper. Terminal codes still throw SlackApiError (router swallows). Retryable codes throw a plain Error so the router classifies them as infra rejections and Lambda replays.
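The terminal/retryable split described in (c) can be sketched roughly as follows. This is an illustration, not the PR's code: the set shown contains only the one terminal code named above, the `classifySlackError` signature and return type are assumptions, and `toSlackError` is a hypothetical helper added for the example.

```typescript
// Tagged error for channel-terminal Slack failures; the router swallows these.
class SlackApiError extends Error {
  readonly code: string;
  constructor(code: string) {
    super(`Slack API error: ${code}`);
    this.name = 'SlackApiError';
    this.code = code;
  }
}

// Illustrative subset: the real TERMINAL_SLACK_API_ERRORS set holds more codes.
const TERMINAL_SLACK_API_ERRORS: ReadonlySet<string> = new Set(['channel_not_found']);

type SlackErrorClass = 'terminal' | 'retryable';

// Codes outside the set default to retryable (assumed signature).
function classifySlackError(code: string): SlackErrorClass {
  return TERMINAL_SLACK_API_ERRORS.has(code) ? 'terminal' : 'retryable';
}

// Hypothetical helper: pick the error type the router will see.
// Terminal -> SlackApiError (swallowed); retryable -> plain Error (replayed).
function toSlackError(code: string): Error {
  return classifySlackError(code) === 'terminal'
    ? new SlackApiError(code)
    : new Error(`Retryable Slack API error: ${code}`);
}
```

Defaulting unknown codes to retryable trades some wasted retries for never silently dropping a message on an unrecognised error.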
## (d) Review fix #3: NOTIFIABLE_EVENTS / CHANNEL_DEFAULTS drift

The original migration added task_created/session_started/task_timed_out to CHANNEL_DEFAULTS.slack but the dispatcher's NOTIFIABLE_EVENTS gate already excluded several events the router was subscribing Slack to (agent_error, pr_created, task_stranded). Result: Slack was reported as `dispatched` for events it silently dropped. Telemetry lied, agent_error never reached operators on Slack-origin tasks, and task_stranded rendered the generic "Event: task_stranded for owner/repo" fallback (UX regression).

Added render cases for task_stranded and agent_error in slack-blocks.ts and added them to NOTIFIABLE_EVENTS. Forward-compat approval_required and status_response stay out of NOTIFIABLE_EVENTS until their emitters ship; a new cross-file consistency test in fanout-task-events.test.ts fails if anyone re-introduces the drift.

The Slack dispatcher wrapper now passes `effectiveEventType` so an agent_milestone(pr_created) wrapper is unwrapped before NOTIFIABLE_EVENTS matching. Without the rewrite, the dispatcher would short-circuit on the wrapper string `agent_milestone`.

## (e) Review fix #4: conditional UpdateItem on lifecycle persists

Once the BLOCKER fix made batches retry, the original task_created and session_started UpdateItem calls became hazardous: a Slack POST that succeeded but whose follow-up UpdateItem failed transiently would, on retry, post a second root and overwrite slack_thread_ts, orphaning every reply already threaded under the first ts.

Both UpdateItems now carry an `attribute_not_exists` ConditionExpression on the relevant `channel_metadata.slack_*_msg_ts`. On ConditionalCheckFailedException the handler logs at info, deletes the duplicate Slack message via `chat.delete`, and returns. Sibling retry wins the race; the duplicate is cleaned up.
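The post-then-conditionally-persist flow in (e) can be modelled without AWS dependencies. In the sketch below an in-memory object stands in for the task row and `persistIfAbsent` imitates an UpdateItem guarded by `attribute_not_exists`; `SlackStub`, `postRootOnce`, and the return values are hypothetical names for illustration only.

```typescript
// In-memory stand-in for the task row's channel_metadata (not the DynamoDB client).
type ChannelMetadata = Record<string, string>;

class ConditionalCheckFailed extends Error {
  constructor() {
    super('The conditional request failed');
    this.name = 'ConditionalCheckFailedException';
  }
}

// Imitates UpdateItem with ConditionExpression attribute_not_exists(<attr>).
function persistIfAbsent(meta: ChannelMetadata, attr: string, ts: string): void {
  if (attr in meta) throw new ConditionalCheckFailed();
  meta[attr] = ts;
}

// Hypothetical minimal Slack surface used by this sketch.
interface SlackStub {
  postMessage(text: string): string; // returns the message ts
  deleteMessage(ts: string): void;
}

// Post the root message, then persist its ts only if no ts was stored before.
// If a sibling retry already stored one, delete the message just posted.
function postRootOnce(
  meta: ChannelMetadata,
  slack: SlackStub,
  attr: string,
  text: string,
): 'persisted' | 'duplicate_deleted' {
  const ts = slack.postMessage(text);
  try {
    persistIfAbsent(meta, attr, ts);
    return 'persisted';
  } catch (err) {
    if (err instanceof Error && err.name === 'ConditionalCheckFailedException') {
      slack.deleteMessage(ts);
      return 'duplicate_deleted';
    }
    throw err; // anything else is an infra error and should trigger a retry
  }
}
```

The first stored ts is never overwritten, so replies already threaded under it stay attached.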
## (f) Dev-stack regression: drop pr_created from Slack defaults

Live verification surfaced a UX duplication: pr_created (subscribed in CHANNEL_DEFAULTS.slack as the original §6.2 design called for) and task_completed both rendered messages with View PR buttons, posted seconds apart. The original SlackNotifyFn had silently dropped pr_created (NOTIFIABLE_EVENTS gate), so users hadn't relied on it.

Removed pr_created from CHANNEL_DEFAULTS.slack and from NOTIFIABLE_EVENTS, and removed the prCreatedMessage renderer. GitHub keeps pr_created (its edit-in-place comment surface genuinely benefits from the early checkpoint).

## Verification

- `mise //cdk:compile`: clean
- `mise //cdk:test`: 1183 / 1183 pass (8 net-new tests added for the review fixes: NOTIFIABLE_EVENTS drift guard, retryable Slack codes, GitHub 4xx swallow, infra rejection escalation, SlackApiError swallow, task_stranded render)
- `mise //cdk:eslint`: clean
- `mise //cdk:synth`: confirms exactly one Lambda::EventSourceMapping on TaskEventsTable, pointing at FanOutFn
- Dev-stack scenarios: @mention happy path, Cancel button, CLI submit (channel_source=api -> zero Slack dispatches, GitHub edit-in-place still fires)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

krokoko · 2026-05-12T23:59:15Z

Automated review:

This is well-considered work. The partial-batch retry semantics, the terminal/retryable Slack error classification, and the conditional UpdateItem idempotency guards are all the kind of defensive engineering I'd expect at this maturity level. The commit message structure (a–f) is exemplary: it tells the story of how adversarial review improved the code, which is itself a teaching artifact.

That said, the migration introduces coupling trade-offs and error-path gaps that warrant attention.

CRITICAL (must fix before merge)

GitHub 403/429 rate-limit responses are swallowed as terminal

File: cdk/src/handlers/fanout-task-events.ts (4xx swallow branch)

The new else if (httpStatus >= 400 && httpStatus < 500) swallow path treats ALL GitHub 4xx as channel-terminal. But HTTP 403 from GitHub frequently means rate limit exceeded (a transient condition), and HTTP 429 (Too Many Requests) is also 4xx. Under a reconciliation wave touching many tasks, an entire window of GitHub comment updates would be permanently lost with only a warn log.

Fix: Carve out 403 and 429:
&& httpStatus !== 403 // Rate limit — retryable
&& httpStatus !== 429 // Too Many Requests — retryable

Unconditional Secrets Manager grant violates the construct's guard pattern

File: cdk/src/constructs/fanout-consumer.ts

Every other external-service grant in FanOutConsumer (taskTable, repoTable, githubTokenSecret) is guarded by if (props.X). The new bgagent/slack/* Secrets Manager grant is unconditional — meaning dev deployments without Slack onboarding get a dangling IAM permission. This breaks the construct's documented contract and will trigger a cdk-nag AwsSolutions-IAM5 finding with a misleading suppression reason.

Fix: Add slackSecretPrefix?: string prop, guard the policy statement, and add a dedicated NagSuppression with the correct reason.

HIGH (strongly recommended)

Missing TASK_TABLE_NAME silently returns as "successful dispatch"

File: cdk/src/handlers/slack-notify.ts — early guard in dispatchSlackEvent

When the env var is missing, the function returns (no throw), so the router counts Slack as dispatched. A broken deployment silently drops ALL Slack notifications indefinitely. This should throw an Error (infra-class) so the rejected-rate alarm fires and operators notice.

Slack ratelimited retry can starve sibling channels

When Slack returns ratelimited, the record lands in batchItemFailures. On retry, all three dispatchers re-execute — including GitHub and Email that already succeeded. For non-terminal events without conditional-persist guards, this means duplicate GitHub PATCHes and double log entries. Under sustained Slack rate-limiting, every record retries 3× before DLQ, each time re-invoking GitHub unnecessarily.

Mitigation options: Per-channel "already-dispatched" bitmap, or document as accepted trade-off at current scale and surface Slack's Retry-After header in the warn log.

Reaction/delete helpers eat ALL errors including network failures

File: cdk/src/handlers/slack-notify.ts — addReaction, removeReaction, deleteMessage

These functions catch everything (DNS failures, SyntaxError, timeouts) and log at warn. Under the new architecture where infra errors should propagate for retry, these swallows are inconsistent. The user-visible symptom: stale emoji reactions (hourglass remains after completion), orphaned intermediate messages.

Recommendation: At minimum, log network/infra failures at error with a dedicated event key so operators can alarm.

task_created / session_started duplicate-delete has no observability

When ConditionalCheckFailedException fires (sibling retry won the race), the code calls deleteMessage, which is best-effort. If the delete fails, a permanent duplicate Slack message remains in the thread with no dedicated metric/event for operators to detect accumulating ghosts.

Recommendation: Add a fanout.slack.dup_delete_failed event on the delete failure path.

MEDIUM (address in follow-up)

instanceof SlackApiError is brittle across module boundaries

The fanout handler catches err instanceof SlackApiError. If the bundler ever duplicates the module (rare but possible), instanceof fails silently — terminal errors would trigger infinite retries. Add a property-based fallback:
if (err instanceof SlackApiError || (err instanceof Error && err.name === 'SlackApiError')) {

classifySlackError defaults unknowns to "retryable" — missing terminal codes

The set omits ekm_access_denied, team_access_not_granted, missing_scope, invalid_arguments, posting_to_general_channel_denied — all permanent failures that will burn retries until DLQ.

Type design: RouteOutcome arrays should be ReadonlyArray

readonly dispatched: NotificationChannel[] prevents reassignment but allows outcome.dispatched.push(...). Inconsistent with ReadonlySet used for CHANNEL_DEFAULTS in the same file.

SlackDispatchEvent is structurally identical to FanOutEvent

The decoupling is purely nominal: both have identical fields. A type alias (type SlackDispatchEvent = FanOutEvent) or Pick<FanOutEvent, ...> would prevent silent drift and make the relationship explicit.

Test Coverage Gaps

┌────────────────────────────────────────────────────────────────────────┬─────────────┐
│ Gap │ Criticality │
├────────────────────────────────────────────────────────────────────────┼─────────────┤
│ Conditional UpdateItem race (task_created retry → deleteMessage) │ 9/10 │
├────────────────────────────────────────────────────────────────────────┼─────────────┤
│ taskStrandedMessage and agentErrorMessage renderers in slack-blocks.ts │ 7/10 │
├────────────────────────────────────────────────────────────────────────┼─────────────┤
│ effectiveEventType rewrite reaching the Slack dispatcher mock │ 7/10 │
├────────────────────────────────────────────────────────────────────────┼─────────────┤
│ FanOutConsumer IAM policy (no construct-level test exists) │ 6/10 │
├────────────────────────────────────────────────────────────────────────┼─────────────┤
│ task_stranded through the terminal dedup path │ 8/10 │
└────────────────────────────────────────────────────────────────────────┴─────────────┘

Architecture Trade-offs (Accepted)

These are design decisions I'd want documented but not necessarily changed:

Shared timeout/concurrency — FanOutConsumer processes batches of 100 with a 60s timeout. The old SlackNotifyFn had batch=10, timeout=30s. Acceptable because channel_source gate short-circuits 90%+ of records, but worth a comment explaining why 100 is safe.
Deployment ordering — During CDK deploy, there's a brief window where both consumers exist (old ESM not yet deleted). The conditional-persist guards make this safe (worst case: a duplicate that gets chat.deleted).
chat.delete as best-effort cleanup — Correct design. Slack has no transactional guarantee. Just needs the observability improvement in point 6 (the task_created / session_started duplicate-delete item under HIGH above).

What's Done Well

The partial-batch retry restoration (fix b) is the single most important change — without it, the migration would have silently regressed retry semantics
The TERMINAL_SLACK_API_ERRORS set with explicit documentation per code is excellent engineering
The cross-file consistency test (NOTIFIABLE_EVENTS ⊇ CHANNEL_DEFAULTS.slack) using jest.requireActual is an elegant drift guard
The conditional UpdateItem pattern for idempotent lifecycle persists is the correct defense against the retry hazard
The commit structure (a–f) makes the code reviewable in layers rather than as a monolithic diff

…ws-samples#79 review #1) PR aws-samples#79 review found that the new 4xx terminal-swallow path treats HTTP 403 and 429 as channel-terminal — but on GitHub these are transient rate-limit responses (403 with "API rate limit exceeded", 429 "Too Many Requests"). Under a reconciliation wave that touches many tasks, an entire window of GitHub comment updates would be permanently dropped with only a warn log. Carve out 403 and 429 from the swallow guard so they propagate as infra rejections through ``Promise.allSettled``. The record lands in ``batchItemFailures`` and Lambda replays until the rate-limit window clears (or DLQs after ``retryAttempts``). Test coverage: parametrized over 403 + 429 with a GitHubCommentError mock at the helper boundary, asserting the record's eventID surfaces in ``batchItemFailures`` rather than being absorbed.
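A minimal sketch of how the carve-out and the escalation path fit together, under stated assumptions: `GitHubCommentError` is given a hypothetical `httpStatus` field, `githubDispatcher` and `toBatchItemFailures` are illustrative helpers, and the existing 401/404 handling is omitted. Only `routeEvent`, `RouteOutcome`'s two fields, and `batchItemFailures` are names taken from the PR.

```typescript
type NotificationChannel = 'slack' | 'github' | 'email';
type Dispatcher = () => Promise<void>;

interface RouteOutcome {
  readonly dispatched: ReadonlyArray<NotificationChannel>;
  readonly infraRejections: ReadonlyArray<NotificationChannel>;
}

// Assumed shape: the real error class may carry more fields.
class GitHubCommentError extends Error {
  readonly httpStatus: number;
  constructor(httpStatus: number) {
    super(`GitHub responded with ${httpStatus}`);
    this.name = 'GitHubCommentError';
    this.httpStatus = httpStatus;
  }
}

// 4xx is channel-terminal, except 403/429 which GitHub uses for rate limiting.
function isTerminalGitHubStatus(httpStatus: number): boolean {
  return httpStatus >= 400 && httpStatus < 500
    && httpStatus !== 403
    && httpStatus !== 429;
}

// Swallow terminal statuses; rethrow everything else so the router sees a rejection.
function githubDispatcher(update: () => Promise<void>): Dispatcher {
  return async () => {
    try {
      await update();
    } catch (err) {
      const status = (err as { httpStatus?: unknown } | null)?.httpStatus;
      if (typeof status === 'number' && isTerminalGitHubStatus(status)) return;
      throw err;
    }
  };
}

// Run every channel; a rejected dispatcher counts as an infra rejection.
async function routeEvent(
  dispatchers: ReadonlyArray<[NotificationChannel, Dispatcher]>,
): Promise<RouteOutcome> {
  const results = await Promise.allSettled(dispatchers.map(([, run]) => run()));
  const dispatched: NotificationChannel[] = [];
  const infraRejections: NotificationChannel[] = [];
  results.forEach((result, i) => {
    const channel = dispatchers[i][0];
    (result.status === 'fulfilled' ? dispatched : infraRejections).push(channel);
  });
  return { dispatched, infraRejections };
}

// Handler side: report the record for replay if any channel hit an infra error.
function toBatchItemFailures(
  eventID: string,
  outcome: RouteOutcome,
): { itemIdentifier: string }[] {
  return outcome.infraRejections.length > 0 ? [{ itemIdentifier: eventID }] : [];
}
```

With this shape a 429 from GitHub surfaces the record's eventID for replay, while a 422 is absorbed and the batch proceeds.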

review aws-samples#2) Every other external-service grant in FanOutConsumer (taskTable, repoTable, githubTokenSecret) is gated by ``if (props.X)``, so a deployment that hasn't onboarded the corresponding service stays free of dangling IAM permissions. The original migration broke the pattern with an unconditional ``bgagent/slack/*`` Secrets Manager grant — dev stacks without Slack onboarding ended up holding read permission on a resource pattern they never use, with a misleading ``cdk-nag AwsSolutions-IAM5`` suppression reason. Adds an optional ``slackSecretArnPattern`` prop on ``FanOutConsumerProps``; the policy statement is only attached when the prop is set. ``cdk/src/stacks/agent.ts`` now computes the ``bgagent/slack/*`` ARN inline and passes it through, mirroring the other guarded props. ``ArnFormat`` and ``Stack`` imports moved out of fanout-consumer.ts since the construct no longer needs them. No changes to live behaviour — agent.ts always passes the prop, so the IAM policy still attaches in production. The dispatcher will log-and-fail-retry on a missing pattern (covered by review aws-samples#3 fix). Test gap covering the construct itself ships in a follow-up commit (test gap aws-samples#34).
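The guard pattern can be illustrated without pulling in aws-cdk-lib. The sketch below uses plain objects in place of `iam.PolicyStatement`; the `*Sketch` types, `buildSecretGrants`, and the `githubTokenSecretArn` prop are hypothetical stand-ins, and only `slackSecretArnPattern` is a name from the commit.

```typescript
// Plain-object stand-in for an IAM policy statement (not the CDK class).
interface PolicyStatementSketch {
  readonly actions: ReadonlyArray<string>;
  readonly resources: ReadonlyArray<string>;
}

// Hypothetical props: every external-service grant is optional.
interface FanOutConsumerPropsSketch {
  readonly githubTokenSecretArn?: string;
  readonly slackSecretArnPattern?: string;
}

// Each grant is attached only when its prop is set, so a stack that has not
// onboarded a service carries no dangling permission for it.
function buildSecretGrants(props: FanOutConsumerPropsSketch): PolicyStatementSketch[] {
  const statements: PolicyStatementSketch[] = [];
  if (props.githubTokenSecretArn) {
    statements.push({
      actions: ['secretsmanager:GetSecretValue'],
      resources: [props.githubTokenSecretArn],
    });
  }
  if (props.slackSecretArnPattern) {
    statements.push({
      actions: ['secretsmanager:GetSecretValue'],
      resources: [props.slackSecretArnPattern],
    });
  }
  return statements;
}
```

In the real construct the same `if (props.X)` check would wrap the policy attachment and its cdk-nag suppression.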

…review aws-samples#3) Pre-fix: when ``TASK_TABLE_NAME`` was unset on a Slack-subscribed event, ``dispatchSlackEvent`` returned silently after a warn line. The router counted Slack as ``dispatched`` and a broken stack quietly dropped every Slack notification — operators only saw it in the warn-rate metric, with no rejected-channel signal. Post-fix: throw a plain Error so the rejection propagates as an infra rejection through ``Promise.allSettled``. The router pushes the record into ``batchItemFailures``, Lambda retries the batch, the ``fanout.dispatcher.rejected`` warn fires per record, and operators get a distinct alarm. Also bumps the existing log line from ``warn`` to ``error`` and attaches an ``error_id: FANOUT_SLACK_MISSING_TASK_TABLE`` so the deployment-bug case can be distinguished from per-record failures. Test: ``throws when TASK_TABLE_NAME env var is missing`` deletes the env var, asserts the throw, asserts no DDB call was attempted (env-var guard fires first).

…amples#79 review aws-samples#7)

If a bundler ever duplicates the slack-notify module (rare with NodejsFunction tree-shaking but possible if dual-bundled), two distinct SlackApiError classes coexist and ``instanceof`` against one fails for instances of the other. The dispatcher would see a foreign-class SlackApiError, fall through to the rethrow branch, and the router would treat it as an infra rejection, flipping a channel-terminal swallow into repeated Lambda retries until ``retryAttempts`` is exhausted.

Add an ``err.name === 'SlackApiError'`` fallback so the swallow branch fires either way. Mirrors the duck-typed ``GitHubCommentError`` check used elsewhere in the same handler.

Test: synthesise a plain Error with name === 'SlackApiError' (NOT an instance of the mock's SlackApiError class) and assert batchItemFailures stays empty, proving the swallow path catches both shapes.
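The fallback amounts to a small type guard. A minimal sketch, assuming the class sets its own `name` in the constructor; `isSlackApiError` is a hypothetical helper name, since the commit describes the check inline.

```typescript
class SlackApiError extends Error {
  constructor(message: string) {
    super(message);
    this.name = 'SlackApiError';
  }
}

// True for instances of this module's class AND for look-alikes created by a
// duplicated copy of the module, which fail instanceof but keep the same name.
function isSlackApiError(err: unknown): boolean {
  return err instanceof SlackApiError
    || (err instanceof Error && err.name === 'SlackApiError');
}

// Simulate a foreign-class instance: a plain Error carrying the same name.
const foreign = new Error('channel_not_found');
foreign.name = 'SlackApiError';
```

Here `isSlackApiError(foreign)` is true even though `foreign instanceof SlackApiError` is false, which is the case the commit's test pins.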

…ws-samples#79 review aws-samples#8)

Original set omitted documented Slack permission/scope failures. Codes outside the set fall to the retryable branch, so a misconfiguration like ``ekm_access_denied`` or ``missing_scope`` would burn 3 Lambda retries before DLQ on every event, even though the failure is fundamentally a configuration bug that no retry can clear.

Adds:

- Permission/scope: missing_scope, ekm_access_denied, team_access_not_granted, posting_to_general_channel_denied
- Payload shape: invalid_arguments

Reorganized the set into commented blocks (channel-shape, auth, permission/scope, payload-shape) so future additions go in the right bucket and the rationale stays visible.

Test coverage: parametrized over the full TERMINAL_SLACK_API_ERRORS set (21 codes); every one must throw SlackApiError so the router swallows it. The existing retryable test.each remains intact and covers the negative-class case (codes outside the set throw a plain Error and escalate to retry).

…gs (aws-samples#79 review aws-samples#5)

The reaction / delete helpers (``addReaction``, ``removeReaction``, ``deleteMessage``) used to log every catch at warn with a single generic event key, lumping API-level rejections (e.g. ``no_reaction``) together with infrastructure failures (DNS lookup, TLS handshake, fetch timeout, JSON parse error from a hostile gateway). Operators who alarmed on the warn rate saw a flat signal that masked genuine infra problems.

Split the boundary:

- API-level (``!result.ok`` after a successful HTTP call) stays at warn with channel-specific event keys (``fanout.slack.reaction_add_api_error``, ``fanout.slack.reaction_remove_api_error``, ``fanout.slack.message_delete_api_error``). These are per-message UX problems; operators don't page.
- Network errors (the outer ``catch (err)`` after ``fetch``) promote to ``logger.error`` with dedicated event keys (``fanout.slack.reaction_add_network_error``, ``fanout.slack.reaction_remove_network_error``, ``fanout.slack.message_delete_network_error``) and ``error_id``s (``FANOUT_SLACK_REACTION_NETWORK``, ``FANOUT_SLACK_DELETE_NETWORK``) so each has its own alarmable signal. User-visible symptoms when these fire silently: stale emoji reactions (hourglass never swaps to ✅) and orphaned intermediate messages.

Behaviour unchanged: errors are still swallowed (per-message reactions and intermediate cleanup are best-effort by design; they must not fail the batch), but operators now get distinct metrics for each failure class.

…umulation (aws-samples#79 review aws-samples#6) The conditional UpdateItem dup-delete path (``task_created`` / ``session_started`` lifecycle persists) calls ``deleteMessage`` to clean up the duplicate Slack message that landed when a sibling retry won the race. The delete is inherently best-effort — but if it fails, the duplicate becomes a permanent ghost in the thread and operators had no way to alarm on the rate. Refactor ``deleteMessage`` to return a boolean (``true`` on success or ``message_not_found``-as-already-gone, ``false`` otherwise) and emit a dedicated ``fanout.slack.dup_delete_failed`` event with an ``error_id: FANOUT_SLACK_DUP_DELETE_FAILED`` from the dup-delete callsites when the cleanup couldn't complete. The terminal-event cleanup paths (``slack_session_msg_ts``, ``slack_created_msg_ts``) intentionally don't fire this event — those paths target genuinely-stale UX cleanup, not retry-driven duplicates, so an alarm there would be noise. No new tests beyond the existing dup-delete coverage; the ``deleteMessage`` return value isn't yet asserted at the unit level, but the behavior is fully exercised by the existing ``dup-delete`` integration paths (test gap aws-samples#31 will add an explicit failure-path assertion when it lands).
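The boolean contract and the alarm event can be sketched as below. The Slack call is injected as a function so the example runs without network access; `ChatDelete`, `LogFn`, `cleanUpDuplicate`, and the parameter order are assumptions, while the event key and `error_id` are the ones named in the commit.

```typescript
interface SlackResult {
  ok: boolean;
  error?: string;
}

// Injected stand-ins for the real chat.delete call and the logger.
type ChatDelete = (channel: string, ts: string) => Promise<SlackResult>;
type LogFn = (entry: Record<string, unknown>) => void;

// true  -> message is gone (deleted now, or already deleted earlier)
// false -> cleanup could not complete (API rejection or network failure)
async function deleteMessage(
  chatDelete: ChatDelete,
  channel: string,
  ts: string,
): Promise<boolean> {
  try {
    const result = await chatDelete(channel, ts);
    return result.ok || result.error === 'message_not_found';
  } catch {
    return false; // best-effort: never propagate, never fail the batch
  }
}

// Dup-delete callsite: emit a dedicated event only when the ghost survives.
async function cleanUpDuplicate(
  chatDelete: ChatDelete,
  log: LogFn,
  channel: string,
  ts: string,
): Promise<void> {
  const deleted = await deleteMessage(chatDelete, channel, ts);
  if (!deleted) {
    log({
      event: 'fanout.slack.dup_delete_failed',
      error_id: 'FANOUT_SLACK_DUP_DELETE_FAILED',
      channel,
      ts,
    });
  }
}
```

Treating `message_not_found` as success keeps the alarm quiet when an earlier retry already removed the duplicate.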

…les#79 review aws-samples#9) ``RouteOutcome.dispatched`` and ``infraRejections`` were typed as plain ``NotificationChannel[]`` — which made ``readonly`` on the property prevent reassignment but still allow callers to mutate the underlying array via ``.push``, ``.splice``, or ``.sort``. Inconsistent with the ``ReadonlySet<string>`` used for ``CHANNEL_DEFAULTS`` in the same file. Tightening to ``ReadonlyArray<NotificationChannel>`` makes the contract honest: the router owns the arrays, callers read them. Test suite updated to use ``[...outcome.dispatched].sort()`` where it previously called ``.sort()`` directly — the explicit copy makes the intent clear and would have surfaced any silent test-side mutation.
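A short sketch of the difference, assuming a three-member `NotificationChannel` union for illustration. The commented-out lines are the calls the tightened type rejects at compile time.

```typescript
type NotificationChannel = 'slack' | 'github' | 'email';

interface RouteOutcome {
  readonly dispatched: ReadonlyArray<NotificationChannel>;
  readonly infraRejections: ReadonlyArray<NotificationChannel>;
}

const outcome: RouteOutcome = {
  dispatched: ['slack', 'github'],
  infraRejections: [],
};

// Both of these fail to compile against ReadonlyArray:
// outcome.dispatched.push('email');
// outcome.dispatched.sort();

// Callers that need a sorted view copy first, leaving the router's array intact.
const sortedDispatched = [...outcome.dispatched].sort();
```

After this runs, `sortedDispatched` is `['github', 'slack']` while `outcome.dispatched` keeps its original order.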

…aws-samples#79 review aws-samples#10) The two interfaces were structurally identical: same five fields, same readonly modifiers, same metadata shape. The decoupling was purely nominal and a silent-drift footgun — adding a field to ``FanOutEvent`` (e.g. when the router starts plumbing an ``approval_required`` ID through) would not flow into ``SlackDispatchEvent``, leaving the dispatcher unaware until a downstream test happened to fail. Replace with a one-line type alias: export type SlackDispatchEvent = FanOutEvent; The slack-notify module now type-imports ``FanOutEvent`` from fanout-task-events. ``import type`` is erased at compile time, so the runtime bundle still has the one-way dep (fanout-task-events → slack-notify) — no module-cycle hazard. Reviewer-suggested ``Pick<FanOutEvent, 'task_id' | …>`` was considered and rejected: the dispatcher uses every field of ``FanOutEvent``, so the Pick would just enumerate the same five fields with extra noise. A direct alias keeps the intent obvious and prevents drift identically.

…After (aws-samples#79 review aws-samples#4)

PR aws-samples#79 review #4 surfaced a sibling-channel-failure hazard: when GitHub or Email rate-limits, the record lands in ``batchItemFailures``. On the Lambda retry, every Slack-subscribed event for that record runs again. Terminal events were already guarded by ``slack_notified_terminal``; ``agent_error`` was not, so operators would page twice on a single agent failure if a sibling channel happened to fail.

Generalize the dedup mechanism. ``TERMINAL_EVENTS`` is replaced by a ``SLACK_DEDUP_ATTRIBUTE`` map that marks each event type with the ``channel_metadata`` attribute that should guard the post:

- 5 terminals share ``slack_notified_terminal`` (any first-arriving terminal claims the right; subsequent terminals dedup against it)
- ``agent_error`` gets its own ``slack_dispatched_agent_error`` so a duplicate agent_error doesn't reuse the terminal slot
- ``task_created`` / ``session_started`` map to ``null`` because they already use the per-event ``slack_*_msg_ts`` conditional persists from review #1; the conditional already provides full idempotency (a separate marker would be redundant)

Also surfaces Slack's ``Retry-After`` header on rate-limited responses through a dedicated ``fanout.slack.retryable_api_error`` warn so operators reading CloudWatch can see the recovery window instead of guessing from sustained warn rate.

Tests:

- logs Retry-After header on rate-limited Slack responses (new): asserts ``retry_after_seconds`` propagates from Slack's response header into the warn metadata
- existing terminal-codes parametrized test untouched (terminal branch doesn't read headers)
- existing retryable test gains a ``headers: { get: () => null }`` stub on the fetch mock so the headers.get call doesn't crash

Reviewer suggested a per-channel dispatch bitmap as the alternative. Rejected as premature: the duplicate-GitHub-PATCH is harmless (idempotent), Email is still a stub, and the dedup map covers the specific agent_error pain identified above. A bitmap would add a new table + IAM grants + per-dispatch DDB cost for a hypothetical problem (Slack rate-limiting AND a sibling channel failure).
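The dedup map might look roughly like this. The attribute names come from the commit; the list of five terminal event types is inferred from events mentioned elsewhere in this PR, and `claimDedupSlot` with its in-memory metadata object is a hypothetical stand-in for the conditional UpdateItem.

```typescript
// Event type -> channel_metadata attribute guarding the post.
// null means the event has its own idempotency (the slack_*_msg_ts persists).
const SLACK_DEDUP_ATTRIBUTE: Readonly<Record<string, string | null>> = {
  // Terminal events (inferred list) share one slot: first arrival wins.
  task_completed: 'slack_notified_terminal',
  task_failed: 'slack_notified_terminal',
  task_cancelled: 'slack_notified_terminal',
  task_timed_out: 'slack_notified_terminal',
  task_stranded: 'slack_notified_terminal',
  // agent_error has a separate slot so it never shadows a later terminal.
  agent_error: 'slack_dispatched_agent_error',
  task_created: null,
  session_started: null,
};

// In-memory imitation of the conditional write. Returns true when this call
// claimed the slot (caller should post) and false when it was already taken.
function claimDedupSlot(meta: Record<string, unknown>, eventType: string): boolean {
  const attr = SLACK_DEDUP_ATTRIBUTE[eventType];
  if (attr == null) return true; // null or unmapped: no marker-based dedup here
  if (meta[attr] === true) return false;
  meta[attr] = true;
  return true;
}
```

A retried `agent_error` is suppressed by its own slot, and a `task_completed` arriving afterwards still posts because it claims a different attribute.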

…samples#79 test gap)

Adds 4 tests covering the lifecycle-persist conditional path that review fix #1 introduced and review fix #6 hardened. Pre-PR-aws-samples#79 the only ConditionalCheckFailed coverage was the terminal-dedup path; the new lifecycle-persist + dup-delete code lacked direct assertions and was flagged 9/10 criticality by the reviewer.

- task_created persist ConditionalCheckFailed → posts duplicate then deletes it: pins the cleanup behaviour that prevents ghost task_created posts in the channel
- session_started persist ConditionalCheckFailed → posts duplicate then deletes it: parallel coverage for the other lifecycle attribute (slack_session_msg_ts)
- dup-delete failure emits fanout.slack.dup_delete_failed with error_id: pins the operator-alarm signal added in review fix #6; asserts both the event key and the FANOUT_SLACK_DUP_DELETE_FAILED error_id propagate
- chat.delete returning message_not_found is treated as success (no dup_delete_failed): negative-class assertion. Prevents false-positive alarms when the race resolves cleanly (the duplicate was already deleted by a prior retry).

The ghost / message_not_found tests use ``fetchMock.mockImplementation`` URL-routing rather than ``.mockResolvedValueOnce`` chains because ``updateReaction`` issues 2-3 reaction-API fetches between chat.postMessage and chat.delete; routing by URL keeps the test focused on the load-bearing chat.delete behaviour without coupling to reaction call order.

…s#79 test gap aws-samples#32)

Pre-PR-aws-samples#79 the new ``taskStrandedMessage`` and ``agentErrorMessage`` helpers in slack-blocks.ts had no direct unit tests. Reviewer flagged this as a 7/10 gap because the renderers carry the prior_status / error_type / message_preview metadata threaded through from the event source; silent drift in the metadata field names would produce ugly fallback messages in production.

Adds 5 tests:

- task_stranded WITH metadata renders the prior_status parenthetical (``Task stranded for org/repo (last status: RUNNING)``) so operators can tell at a glance whether the task hung in HYDRATING vs RUNNING; without the parenthetical the reviewer's "generic Event: ..." UX regression would resurface.
- task_stranded WITHOUT metadata still renders cleanly (legacy events written before the reconciler started stamping metadata must not crash or leak ``undefined``).
- agent_error with full metadata (error_type + message_preview) renders the rotating_light, type, and preview.
- agent_error WITHOUT metadata stays sensible: no leaked ``undefined`` strings or empty ``_Type:_`` line.
- agent_error truncates a 500-char message_preview to keep Slack channel UX readable.

…ples#79 test gap aws-samples#33)

Pre-PR-aws-samples#79 review-fix #4 there was no direct test for the ``slack_dispatched_agent_error`` dedup attribute or its interaction with the existing ``slack_notified_terminal`` slot. A future refactor that collapsed the two slots, or renamed one of them, would silently break the sibling-channel-failure-retry guarantee that fix #4 added.

Adds 4 tests:

- ``agent_error claims its own dedup attribute``: pins the UpdateExpression and ConditionExpression strings so a refactor that renames the attribute breaks loudly.
- ``agent_error retry hits the dedup guard``: end-to-end scenario matching review #4. The task already has ``slack_dispatched_agent_error: true``, so the retry must short-circuit before chat.postMessage. Without the guard, a second rotating_light fires.
- ``terminal dedup attribute is per-class``: a flaky task_completed-then-task_failed sequence dedups against the same ``slack_notified_terminal`` slot. Catches the regression where the orchestrator emits both terminal types and we'd otherwise post both ✅ and ❌.
- ``agent_error and terminals use distinct dedup slots``: the important negative. Having ``slack_dispatched_agent_error`` set must NOT shadow a subsequent ``task_completed``. Pins the slot separation so a future merge into a single slot can't silently drop terminals after an agent_error.

…es#79 test gap aws-samples#34)

The construct shipped on issue aws-samples#64 with no unit-level coverage of its IAM contract. The only synth-level signal lived inside ``slack-integration.test.ts`` ("0 EventSourceMapping"), which proved the migration didn't regress the OTHER construct. Reviewer flagged this 6/10, and the gap is what allowed review aws-samples#2 (unconditional Slack secret grant) to slip through in the first place.

Adds 6 tests:

- ``attaches a single DynamoEventSource on the TaskEventsTable stream``: pins the architectural invariant. Issue aws-samples#64 was fundamentally about reaching exactly-one stream reader. Adding a second consumer must fail this test loudly.
- ``creates a DLQ for the fanout Lambda``: pins retention period + presence; a DLQ-less deployment would silently drop poison-pill records past retryAttempts.
- ``omits the bgagent/slack/* grant when slackSecretArnPattern is not provided``: the review aws-samples#2 invariant. Iterates every IAM::Policy and asserts NONE of them grant secretsmanager:* on a bgagent/slack/* ARN. A regression that re-introduces the unconditional grant breaks this test.
- ``attaches the bgagent/slack/* grant only when slackSecretArnPattern is provided``: the positive case. Pins the grant shape (action, effect, resource pattern).
- ``passes TASK_TABLE_NAME env var when taskTable is provided``: review aws-samples#3 dependency, since the dispatcher throws on missing env.
- ``omits TASK_TABLE_NAME env var when taskTable is not provided``: graceful degrade for dev stacks that haven't onboarded the TaskTable yet (matches the construct's documented contract).
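
The "iterate every IAM::Policy" check could look roughly like the helper below. It is a sketch over a plain synthesized-template object, not the assertions in fanout-consumer.test.ts; the `Template` types and the `grantsSlackSecrets` name are assumptions, and matching on the serialized resource is a simplification chosen so that intrinsic-function resources (for example Fn::Join objects) are still caught.

```typescript
// Minimal, assumed view of a synthesized CloudFormation template.
interface PolicyStatement {
  Effect: string;
  Action: string | string[];
  Resource?: unknown;
}

interface TemplateResource {
  Type: string;
  Properties?: { PolicyDocument?: { Statement?: PolicyStatement[] } };
}

interface Template {
  Resources: Record<string, TemplateResource>;
}

// Returns true if any IAM policy allows a secretsmanager action on a
// bgagent/slack/ resource.
function grantsSlackSecrets(template: Template): boolean {
  for (const id of Object.keys(template.Resources)) {
    const resource = template.Resources[id];
    if (resource.Type !== "AWS::IAM::Policy") {
      continue;
    }
    const statements = resource.Properties?.PolicyDocument?.Statement ?? [];
    for (const statement of statements) {
      const actions = Array.isArray(statement.Action) ? statement.Action : [statement.Action];
      const touchesSecrets = actions.some((action) => action.startsWith("secretsmanager:"));
      // Serialize so string ARNs and intrinsic-function objects are both searched.
      const touchesSlack = JSON.stringify(statement.Resource ?? "").includes("bgagent/slack/");
      if (statement.Effect === "Allow" && touchesSecrets && touchesSlack) {
        return true;
      }
    }
  }
  return false;
}
```

A negative test would assert that this returns false for a stack built without ``slackSecretArnPattern``, and a positive test would assert true when the prop is supplied.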

test gap aws-samples#35)

The reconciler at handlers/reconcile-stranded-tasks.ts:170 emits BOTH ``task_stranded`` and ``task_failed`` for a heartbeat-expired task: one for the operator signal, one to drive the FAILED status transition. Pre-PR-aws-samples#79 this pair had no test coverage; reviewer flagged this 8/10 because the visible failure mode (a paired "Task stranded" + "Task failed" double-page in Slack) would surface in production but be silent in CI.

Adds 2 tests:

- ``task_stranded posts and writes the terminal dedup marker on first arrival``: pins that task_stranded participates in the shared terminal slot and renders the warning message with metadata. Catches a regression that omits task_stranded from the dedup map entirely.
- ``task_stranded after a sibling task_failed dedups``: the operational scenario. task_failed already claimed ``slack_notified_terminal``; the subsequent task_stranded must short-circuit before chat.postMessage. Without this guard, operators get the double-page the reviewer warned about.

…an-message race

Live observation during PR aws-samples#79 review verification: the same Slack @mention happy path sometimes leaves the 🚀 task_created message in the thread (orphaned beside the ✅ task_completed) and sometimes deletes it cleanly.

The race window:

1. ``task_created`` stream batch posts the rocket message and persists ``slack_created_msg_ts`` via the conditional UpdateItem introduced in PR aws-samples#79 review fix #1.
2. ``task_completed`` stream batch fires ~30s later. Its initial GetItem races the prior UpdateItem and sees a stale ``channel_metadata`` WITHOUT ``slack_created_msg_ts``.
3. The terminal cleanup branch checks ``channelMeta.slack_created_msg_ts``, finds it undefined, and silently skips the chat.delete. The rocket message stays in the thread.

Add a fresh GetItem inside the TERMINAL_EVENTS cleanup branch, after the dedup UpdateItem has linearized our view of the table. Any prior ``slack_*_msg_ts`` writes are visible by then, so the cleanup fires correctly. On a re-read failure (DDB throttle / transient blip) we fall back to the dispatch-entry snapshot and emit ``fanout.slack.cleanup_reread_failed`` so operators can alarm on the rate.

Pre-existing race (the unconditional UpdateItem in pre-PR-aws-samples#79 was the same shape: it wrote, and the GetItem on the next batch could miss it). PR aws-samples#79 doesn't introduce it but doesn't fix it either; this commit does, since the live screenshot evidence appeared during review verification.

Tests:

- ``terminal cleanup re-reads TaskRecord``: scripts a stale dispatch-entry GetItem followed by a fresh re-read GetItem with ``slack_created_msg_ts`` present; asserts chat.delete fires against the freshly-read ts.
- ``terminal cleanup falls back to dispatch-entry snapshot when re-read fails``: defense-in-depth. A DDB throttle on the re-read must not break terminal delivery; cleanup uses the entry snapshot and emits the fallback warn.
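
The re-read-with-fallback step can be sketched as below. This is an illustration under stated assumptions rather than the handler's actual code: `CleanupDeps`, `cleanupCreatedMessage`, and the injected callbacks are hypothetical names, with `rereadChannelMeta` standing in for the fresh GetItem and `deleteMessage` for chat.delete. The sketch assumes it is called after the dedup UpdateItem has completed.

```typescript
interface ChannelMeta {
  slack_created_msg_ts?: string;
}

// Injected dependencies so the logic can run without AWS or Slack clients.
interface CleanupDeps {
  rereadChannelMeta: () => Promise<ChannelMeta>; // stand-in for the fresh GetItem
  deleteMessage: (ts: string) => Promise<void>; // stand-in for chat.delete
  warn: (eventKey: string) => void;
}

// Returns the timestamp that was deleted, or undefined if nothing was cleaned up.
async function cleanupCreatedMessage(
  entrySnapshot: ChannelMeta,
  deps: CleanupDeps,
): Promise<string | undefined> {
  let meta = entrySnapshot;
  try {
    // Prefer the fresh view; the entry snapshot may predate the task_created write.
    meta = await deps.rereadChannelMeta();
  } catch {
    // A throttle or transient failure must not break terminal delivery.
    deps.warn("fanout.slack.cleanup_reread_failed");
  }
  const ts = meta.slack_created_msg_ts;
  if (ts !== undefined) {
    await deps.deleteMessage(ts);
  }
  return ts;
}
```

Keeping the snapshot as the default and overwriting it only on a successful re-read makes the fallback path the same code path as the pre-fix behavior, so a failed re-read degrades to the old semantics instead of failing the batch.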

isadeks · 2026-05-13T21:29:16Z

Addressed all 10 findings and 5 test gaps, each in its own commit citing the review item. That makes 16 commits in total on top of the original migration, counting the extra race fix described below.

Critical / High fixes (numbers refer to the review-comment ordering, not GitHub issues):

Finding 1: GitHub 403/429 carved out from terminal-swallow; retry path restored. The blanket 4xx swallow would have permanently dropped entire reconciliation waves under sustained rate-limiting.
Finding 2: Slack secret grant now guarded by a slackSecretArnPattern prop (matches the construct's other guarded grants: taskTable, repoTable, githubTokenSecret).
Finding 3: Missing TASK_TABLE_NAME now throws (was silently returning and counting as dispatched, so a broken stack would drop every Slack notification).
Finding 4: Generalized the dedup mechanism into a SLACK_DEDUP_ATTRIBUTE map covering agent_error (it was unguarded, so sibling-channel-failure retries would double-page operators). Also surfaces Slack's Retry-After header in the warn log.
Finding 5: Reaction/delete network errors promoted to error level with dedicated event keys + error_ids. API-level rejections stay at warn.
Finding 6: New fanout.slack.dup_delete_failed event (with FANOUT_SLACK_DUP_DELETE_FAILED error_id) when the conditional-persist dup-delete path fails, so operators can alarm on accumulating ghost messages.
Finding 7: SlackApiError now matched by class OR err.name (defense against bundler module duplication).
Finding 8: Extended TERMINAL_SLACK_API_ERRORS with ekm_access_denied, missing_scope, team_access_not_granted, posting_to_general_channel_denied, invalid_arguments. Reorganized into commented blocks.
Finding 9: RouteOutcome arrays tightened to ReadonlyArray<NotificationChannel>.
Finding 10: SlackDispatchEvent is now a type alias of FanOutEvent so the contract can't drift silently.
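
To illustrate Findings 7 and 8 together, here is a minimal sketch of matching SlackApiError by class or by name and then splitting terminal from retryable errors. The class shape (a `code` field), the `classify` helper, and the `Disposition` type are assumptions for the example; the set below lists only the codes named in Finding 8, not the full set in the PR.

```typescript
// Assumed shape; the real class may carry more fields.
class SlackApiError extends Error {
  readonly code: string;

  constructor(code: string) {
    super(`Slack API error: ${code}`);
    this.name = "SlackApiError";
    this.code = code;
  }
}

// Only the codes Finding 8 mentions adding.
const TERMINAL_SLACK_API_ERRORS: ReadonlySet<string> = new Set([
  "ekm_access_denied",
  "missing_scope",
  "team_access_not_granted",
  "posting_to_general_channel_denied",
  "invalid_arguments",
]);

// instanceof can fail if a bundler duplicates the module, so also check name.
function isSlackApiError(err: unknown): err is SlackApiError {
  if (err instanceof SlackApiError) {
    return true;
  }
  return (
    typeof err === "object" &&
    err !== null &&
    (err as { name?: unknown }).name === "SlackApiError"
  );
}

type Disposition = "swallow" | "retry";

// Terminal API rejections are swallowed; everything else escalates to retry.
function classify(err: unknown): Disposition {
  if (isSlackApiError(err) && TERMINAL_SLACK_API_ERRORS.has(err.code)) {
    return "swallow";
  }
  return "retry";
}
```

Defaulting to "retry" keeps unknown or infrastructure errors on the partial-batch retry path, so only errors explicitly listed as terminal are dropped.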

Test gaps:

Gap 1: Conditional UpdateItem race + dup-delete: 4 tests covering the ghost-message paths and the dup_delete_failed alarm signal.
Gap 2: Renderers for task_stranded and agent_error: 5 tests pinning metadata propagation.
Gap 3: agent_error dedup + slot isolation: 4 tests pinning the dedup map's separation between terminal and agent_error slots.
Gap 4: New cdk/test/constructs/fanout-consumer.test.ts: pins the single-ESM invariant, the conditional Slack secret grant, and the env-var wiring.
Gap 5: task_stranded through dedup: 2 tests for the reconciler-twin double-page scenario.

Plus one extra: commit 9ff9b45 fixes an orphan-message race I caught during live re-verification — the 🚀 Task submitted message sometimes lingered after fast tasks because the terminal-cleanup branch read slack_created_msg_ts from a stale dispatch-entry snapshot. Re-reading the TaskRecord before cleanup (after the dedup write linearizes the view) closes the window. Two tests for happy + fallback paths.

Verification:

1240/1240 unit tests pass
ESLint clean, synth shows exactly one AWS::Lambda::EventSourceMapping on TaskEventsTable
Dev-stack re-deployed and all 5 review-test scenarios re-verified live, including the orphan-cleanup fix

Ready for re-review.

isadeks requested a review from a team as a code owner May 12, 2026 23:43

Merge branch 'main' into feat/migrate-slack-to-fanout

7a93a4e

krokoko and others added 17 commits May 12, 2026 20:21

Merge branch 'main' into feat/migrate-slack-to-fanout

8615b56

Merge branch 'main' into feat/migrate-slack-to-fanout

29dbaa3

krokoko approved these changes May 13, 2026


krokoko added this pull request to the merge queue May 13, 2026

Merged via the queue into aws-samples:main with commit 9592796 May 13, 2026
2 checks passed

isadeks mentioned this pull request May 14, 2026

feat(linear): v1.1 polish — pre-container feedback, state-on-start, sweep, nits #87

Open

krokoko merged 20 commits into aws-samples:main from isadeks:feat/migrate-slack-to-fanout
